Monitoring and auto-recovery of services
If you run your own web, email, or other services, you need to be notified when they go down. There are plenty of good monitoring tools available; which one is right for you depends on your needs (how many operating systems you support, how much configuration complexity you can tolerate, how large the system is, etc.). Two that I’ve used and can recommend are Nagios and Monit.
I’ve used Nagios in mixed Unix/Linux/Windows/Mac OS X environments, and although it’s fairly time-consuming to configure, it’s very powerful. The workhorses of a Nagios system are its plugins, which are simple Unix commands: each one returns an exit code, and optionally an informational message, to tell Nagios whether the service is OK, in a warning state, or critical.
There are tons of Nagios plugins already written, which can check disk space and load average, monitor a specific TCP port, etc. Custom plugins can be written in any programming language that you like.
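To make the plugin contract concrete, here is a minimal sketch of a disk space check in Python. The exit codes (0 for OK, 1 for warning, 2 for critical, 3 for unknown) are the standard Nagios plugin convention; the function names, the message format, and the 80%/90% thresholds are my own choices for illustration, not anything Nagios ships.

```python
#!/usr/bin/env python3
"""Sketch of a Nagios-style disk space check (illustrative thresholds)."""
import shutil
import sys

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(used_percent, warn=80.0, crit=90.0):
    """Map a disk usage percentage to (exit_code, one-line message)."""
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.1f}% used"
    return OK, f"DISK OK - {used_percent:.1f}% used"


def check_disk(path="/", warn=80.0, crit=90.0):
    """Measure usage of the filesystem holding `path` and evaluate it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # We could not measure anything, so report UNKNOWN rather than guess.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero total size"
    return evaluate(100.0 * usage.used / usage.total, warn, crit)


def main(argv):
    """Print the status line and return the exit code for the caller."""
    path = argv[1] if len(argv) > 1 else "/"
    code, message = check_disk(path)
    print(message)
    return code
```

To run it as a real plugin, end the file with `sys.exit(main(sys.argv))` so that the return value becomes the process exit code, which is the part Nagios actually looks at.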
Nagios has a bunch of features like escalation (e.g. paging an on-call person if the service is down after an initial email), attempting to restart services, a web interface to schedule planned downtime and acknowledge outages, etc.
Nagios is great, but as I said it takes a little while to come up to speed on configuration, and if you only have one host it might be a bit more than you need. A much simpler system that I’ve been using on standalone hosts lately is Monit, which primarily exists to attempt auto-recovery and alert when service outages happen.
For example, if you want to try restarting your MySQL server before being paged, that’s really simple to specify in monitrc (the Monit config file):
check process mysql with pidfile /var/run/mysqld/mysqld.pid
    start program = "/etc/init.d/mysql start"
    stop program = "/etc/init.d/mysql stop"
    if failed port 3306 then restart
    if 2 restarts within 3 cycles then timeout
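If the "restart, but give up after 2 restarts within 3 cycles" rule is not obvious, the following Python sketch shows the idea as I understand it. This is not Monit's implementation: `port_is_open` and `RestartPolicy` are hypothetical names, and the exact point at which Monit stops monitoring is my reading of the rule rather than something taken from its source.

```python
import socket
from collections import deque


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class RestartPolicy:
    """Track restarts over a sliding window of poll cycles."""

    def __init__(self, max_restarts=2, window=3):
        self.max_restarts = max_restarts
        # One entry per cycle; True means a restart was issued in that cycle.
        self.history = deque(maxlen=window)
        self.timed_out = False

    def on_cycle(self, healthy):
        """Return the action for this cycle: 'ok', 'restart', or 'timeout'."""
        if self.timed_out:
            # Too many restarts recently: stop trying and leave it to a human.
            return "timeout"
        if healthy:
            self.history.append(False)
            return "ok"
        self.history.append(True)
        if sum(self.history) >= self.max_restarts:
            # This restart still happens, but it is the last automatic one.
            self.timed_out = True
        return "restart"
```

A poll loop would call `policy.on_cycle(port_is_open("localhost", 3306))` once per cycle and run the start program whenever the answer is `"restart"`. The sliding window is what keeps one crash a week from ever tripping the timeout, while a service that dies again right after being restarted gets handed to a human quickly.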
You can also use Monit to restart or stop an application if it uses too much CPU, spawns too many children (as Apache can when it forks a process per incoming connection), or starts taking too much memory, which can help to mitigate bugs and deal with denial-of-service attacks.
Monit isn’t a client/server system like Nagios, but this does not necessarily preclude you from configuring it centrally if you use a good deployment system (that’s a subject for a future blog post, though).