Archive for the ‘monitoring’ Category

monitoring ubuntu web servers with nagios3

Saturday, October 17th, 2009

I have chosen Nagios to keep track of the anyhosting.com network. There are many alternatives (some I have explored, some not yet), but here is what I like about Nagios:

  • I’ve been using it for a long time; familiarity
  • very simple/powerful plugin system
  • tons of users, so lots of examples and plugins already available

Nagios version 3 is provided in the Ubuntu repositories, and is quite simple to install:

root@admin:~# apt-get install nagios3

The default config comes set up to monitor a set of services on localhost. I don’t really like the default Ubuntu/Debian setup of having one config file per host/service/etc., so on the master I’ve replaced the config file structure with one file per object type:

root@admin:/etc/nagios3/conf.d# cd /etc/nagios3/conf.d/
root@admin:/etc/nagios3/conf.d# ls
contacts.cfg  extinfo.cfg  groups.cfg  hosts.cfg  services.cfg  timeperiods.cfg

groups.cfg contains the set of server types that I care about:

# A list of your web servers
define hostgroup {
hostgroup_name  http-servers
alias           HTTP servers
members         localhost
}

# A list of your mysql servers
define hostgroup {
hostgroup_name  mysql-servers
alias           MySQL servers
}

# A list of your VHosts
define hostgroup {
hostgroup_name  http-vhosts
alias           Virtual Host HTTP servers
}

Note that a hostgroup such as “http-servers” can list its “members” directly (localhost in this case); in general, though, I do not add members in this file but instead assign them in hosts.cfg:

define host {
host_name   anyhosting1
address     1.2.3.4
use         generic-host
hostgroups  http-servers
}

define host {
host_name   example.com
address     1.2.3.4
use         generic-host
hostgroups  http-vhosts
}

Note the “hostgroups” line; anyhosting1 is the physical server (this monitor is really checking on the reverse proxy), and example.com is a vhost (which is really proxying to a user running Apache for the “example.com” domain). These two checks make sure that the whole system is working and proxying correctly.

Finally, services.cfg brings it all together by defining which groups should run which services:

# check that web services are running
define service {
hostgroup_name                  http-servers
service_description             HTTP
check_command                   check_http
use                             generic-service
notification_interval           0 ; set > 0 if you want to be renotified
}
define service {
hostgroup_name                  http-vhosts
service_description             Virtual Host HTTP
check_command                   check_httpname
use                             generic-service
notification_interval           0 ; set > 0 if you want to be renotified
}
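To make the difference between the two checks concrete: the vhost check connects to the server's address but asks for the vhost by name in the Host header, so it only passes if the proxy routes that name correctly. Here is a rough Python sketch of that idea. It is not the actual check_http plugin; the function names are made up for illustration, and the 4xx-warns/5xx-is-critical mapping is my assumption about sensible defaults.

```python
import http.client

# Nagios plugin exit codes
OK, WARNING, CRITICAL = 0, 1, 2


def classify(status):
    """Map an HTTP status code to a Nagios state (4xx warns, 5xx is critical)."""
    if status >= 500:
        return CRITICAL, f"HTTP CRITICAL - status {status}"
    if status >= 400:
        return WARNING, f"HTTP WARNING - status {status}"
    return OK, f"HTTP OK - status {status}"


def check_vhost(address, vhost, port=80, timeout=10):
    """Connect to `address`, but ask for `vhost` in the Host header."""
    conn = http.client.HTTPConnection(address, port, timeout=timeout)
    try:
        # Passing our own Host header stops http.client from adding one
        # based on the address we connected to.
        conn.request("GET", "/", headers={"Host": vhost})
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException) as exc:
        return CRITICAL, f"HTTP CRITICAL - {exc}"
    finally:
        conn.close()
    return classify(status)
```

With the values from hosts.cfg above, the vhost check would correspond to something like check_vhost("1.2.3.4", "example.com"), while the plain HTTP check just uses the address for both.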

The Ubuntu nagios-plugins package (which by default is installed along with the nagios3 package) contains plugins that can intelligently check MySQL databases, disk space, load average, etc. By default these only check the local machine, but they can be made to run on remote machines by installing the nagios-nrpe-server package. I will cover this further in a future blog post.

centralized logging with syslog-ng

Tuesday, October 6th, 2009

Just wanted to point out another excellent post from the Blog O’ Matty on centralized logging with syslog-ng.

I actually helped to set up real-time web analysis with syslog-ng (using TCP) and a slightly hacked webalizer (it was ignoring multiple hits happening in the same second) on a FreeBSD/Apache web farm ~10 years ago, and have been looking into it again for my current logging needs.

His blog has consistently awesome posts (if you’re interested in systems administration), and as your doctor I highly suggest that you subscribe.

Monitoring and auto-recovery of services

Wednesday, December 26th, 2007

If you run your own web, email, or other services, you need to be notified if these services are not up and running. There are tons of great choices available; which is right for you depends on your needs (how many operating systems do you support, what is your tolerance for complex configuration, how large of a system are we talking, etc). Two that I’ve used and can recommend are Nagios and Monit.

I’ve used Nagios in mixed Unix/Linux/Windows/MacOSX environments, and although it’s fairly time consuming to configure, it’s definitely very powerful. The workhorses of a Nagios system are its plugins, which are simple Unix commands: each one just returns an exit code (and optionally an informational message) to tell Nagios whether the service is OK, in a warning state, or critical.

There are tons of Nagios plugins already written, which can check disk space, load average, monitor a specific TCP port, etc. Custom plugins can be written in any programming language that you like.
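To give a feel for how little a plugin needs to do, here is a minimal sketch of a disk space check in Python. The exit codes (0 = OK, 1 = warning, 2 = critical, 3 = unknown) are the standard plugin convention; the function names and the 20%/10% thresholds are just assumptions I picked for the example.

```python
import shutil
import sys

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(free_pct, warn_pct=20.0, crit_pct=10.0):
    """Map a free-space percentage to (exit_code, status_line)."""
    if free_pct < crit_pct:
        return CRITICAL, f"DISK CRITICAL - {free_pct:.1f}% free"
    if free_pct < warn_pct:
        return WARNING, f"DISK WARNING - {free_pct:.1f}% free"
    return OK, f"DISK OK - {free_pct:.1f}% free"


def check_disk(path="/"):
    """Measure free space on `path` and evaluate it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    return evaluate(100.0 * usage.free / usage.total)


def main(argv):
    """Print the status line and return the exit code for Nagios."""
    code, message = check_disk(argv[1] if len(argv) > 1 else "/")
    print(message)  # the informational message
    return code     # the service state


# To use this as a real plugin, end the script with:
#     sys.exit(main(sys.argv))
```

The split between evaluate() and check_disk() is only there so the threshold logic can be tried out without touching a real filesystem.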

Nagios has a bunch of features like escalation (e.g. paging an on-call person if the service is down after an initial email), attempting to restart services, a web interface to schedule planned downtime and acknowledge outages, etc.

Nagios is great, but as I said it takes a little while to come up to speed on configuration, and if you only have one host it might be a bit more than you need. A much simpler system that I’ve been using on standalone hosts lately is Monit, which primarily exists to attempt auto-recovery and alert when service outages happen.

For example, if you want to try restarting your MySQL server before being paged, that’s really simple to specify in monitrc (the Monit config file):

check process mysql with pidfile /var/run/mysqld/mysqld.pid
  start program = "/etc/init.d/mysql start"
  stop program  = "/etc/init.d/mysql stop"
  if failed port 3306 then restart
  if 2 restarts within 3 cycles then timeout
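The last two rules interact: a failed port check triggers a restart, and too many restarts in a short window makes Monit give up and stop monitoring so a human can step in. The toy Python model below shows how I read that behaviour. It is a simplified sketch, not how Monit is actually implemented, and the function name and return format are invented for the example.

```python
from collections import deque


def simulate(port_ok_per_cycle, max_restarts=2, window=3):
    """Return a list of (cycle_number, action) for a run of polling cycles.

    port_ok_per_cycle: iterable of booleans, True if the port check passed.
    """
    recent = deque(maxlen=window)  # 1 for each recent cycle with a restart
    actions = []
    for cycle, port_ok in enumerate(port_ok_per_cycle, start=1):
        if port_ok:
            recent.append(0)
            actions.append((cycle, "ok"))
            continue
        # "if failed port 3306 then restart"
        recent.append(1)
        actions.append((cycle, "restart"))
        # "if 2 restarts within 3 cycles then timeout"
        if sum(recent) >= max_restarts:
            actions.append((cycle, "timeout"))
            break  # monitoring stops until someone intervenes
    return actions
```

For instance, simulate([True, False, True, False]) ends in a timeout on the fourth cycle, because both restarts fall inside the same three-cycle window. In simulate([False, True, True, False, True]) the restarts are far enough apart that monitoring carries on.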

You can also use Monit to restart or stop an application if it uses too much CPU, spawns too many children (as Apache does for each incoming connection), starts taking too much memory, etc., which can help to mitigate bugs and deal with denial of service attacks.

Monit isn’t a client/server system like Nagios, but this does not necessarily preclude you from configuring it centrally if you use a good deployment system (that’s a subject for a future blog post, though).