Monitoring and alerting

Posted in: Development

Almost all of the services the University provides run on campus-based servers that are managed by the Computing Services department.

Although many of these servers are now virtual, we're not yet at the point where the failure of one server prompts another to be automatically created to replace it, or where it is easy to do that replacement by hand at short notice.

This means that we sometimes have unplanned downtime.

Five months ago we expanded our usage of the off-site monitoring system Pingdom to check not just the availability of our homepage but also nine of our most popular pages and services.

It checks the web addresses of those pages and services every five minutes (or every minute in the case of www.bath.ac.uk). If it detects a problem, it emails our support desk and sends me a text message.
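For anyone curious what a check like this boils down to, here is a minimal Python sketch of the idea. It is not how Pingdom works internally, and the names `fetch_status`, `is_available` and `run_checks` are made up for illustration. It assumes that "available" means an HTTP response with a 2xx or 3xx status, and that an alert is just a message handed to whatever callable you supply.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 503.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar.
        return None


def is_available(status):
    """Assumption: any 2xx or 3xx response counts as 'up'."""
    return status is not None and 200 <= status < 400


def run_checks(urls, fetch=fetch_status, alert=print):
    """Check each URL once and call alert() for every one that looks down."""
    failures = []
    for url in urls:
        status = fetch(url)
        if not is_available(status):
            failures.append((url, status))
            alert(f"DOWN: {url} (status: {status})")
    return failures
```

A scheduler such as cron would call `run_checks` at the chosen interval, and the `alert` argument could be swapped for something that sends an email or a text message instead of printing.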

When we've found out what the problem is, we put a brief explanation up on our web status Tumblr, and we post again when the problem is resolved. This supplements the information published by Computing Services on their Twitter feed and allows us to provide more specific information when we need to.

These are early days in exposing our service availability. We'd like to get to a point where we can summarise recent data in a way similar to GitHub's status page, but we probably need a bit more research into what our users would like to see first. There's also a lot more fundamental work we can do to ensure that visitors to our services don't simply get a 404 or a blank screen when something isn't available!

So, what information do you think we should be making available, and how should we be doing it?

Update: I forgot to mention that Pingdom lets us make our service availability public, and you can browse that data in detail.
