====== Netsaint_/_Nagios ======

====== Nagios is a Network Monitoring Service :: Setup and Install ======
It can monitor several services on several hosts and notify a chosen group of contacts (by email etc.) depending on the state that was measured.
To keep it simple:

 <del>apt-get install nagios-text</del> <-> sarge and etch config
 apt-get install nagios3
FYI: In Debian squeeze, nagios requires php to be installed for the front end :-/ 

During the install process, a password for the default admin user is requested.

 nagiosadmin
 password_chosen_at_install (This is for the Web Interface)

Additional users go in /etc/nagios3/htpasswd.users (add them with Apache's htpasswd tool).
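A minimal sketch of adding a second web user, assuming the htpasswd tool is installed (on Debian it normally comes with the Apache utilities). The function name and the user name "alice" are made up for illustration:

```shell
# Hypothetical helper: add (or update) a user in the Nagios web password file.
# htpasswd prompts for the new password interactively.
add_nagios_web_user() {
    user="$1"
    file="${2:-/etc/nagios3/htpasswd.users}"   # default path from this page
    htpasswd "$file" "$user"
}

# Usage (as root): add_nagios_web_user alice
```

Note that a new login normally only sees the hosts and services of a contact with the same name, unless it is granted more rights in /etc/nagios3/cgi.cfg.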

There is a ton of configuring to be done.
First off: enable the site in apache2's sites-enabled.

 ln -s /etc/nagios3/apache.conf /etc/apache2/sites-enabled/nagios
 (restart apache)

This gets the basics working at http://localhost/nagios, and you will be able to log in. The Default Gateway should be added by default and be monitored ok.
To monitor another host, copy the existing settings in /etc/nagios3 and adapt them.

Great explanation at:
http://www.debian-administration.org/articles/299

====== Configuration of Nagios ======
There is quite a bit of configuration required for Nagios. If the following steps are carried out in order, things should be a lot easier. Although the "Default Gateway" (gw) is added by default with its own group etc., here it was put into a new hostgroup with updated contact details.

===== Overview of Nagios Config Files and Plugins =====
The main nagios config files are kept in: <del>/etc/nagios/</del> /etc/nagios3/
The plugin config files are kept in: /etc/nagios-plugins/config/
The executable plugins are kept in: /usr/lib/nagios/plugins/

===== 0. Additional Info Available =====
Please read http://nagios.sourceforge.net/docs/2_0/xodtemplate.html#host for all details relating to the options/files below and their template. E.g. the following host config options are explained there: d,u,r. d=down. u=unreachable. r=recovered (note: there are more options available). Extended example configs are located at: /usr/share/doc/nagios-text/examples/template-object/

** All configs for Nagios3 go into /etc/nagios3/conf.d/. I moved the existing files out of /etc/nagios3/conf.d/ and added in the ones below. You can choose to edit and merge the configs below into the existing files if you wish. **

===== 1. Config all unique hosts =====
Note: Only specify distinct physical servers (IPs). Multiple http websites can be monitored on 1 host.

 vi <del>/etc/nagios/hosts.cfg</del> /etc/nagios3/conf.d/hosts.cfg
 
 define host{
        name                            generic-host    ; The name of this host template....
        notifications_enabled           1       ; Host notifications are enabled
        event_handler_enabled           1       ; Host event handler is enabled
        flap_detection_enabled          0       ; Flap detection is disabled. Flap detection suppresses repeated alerts when a host changes state too often (e.g. intermittent network anomalies)
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts. Turn this off (0) when testing and doing lots of restarts, otherwise some settings will be cached!
        retain_nonstatus_information    1       ; Retain non-status information across program restarts. This can be turned off also while testing and setting up.
        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }
 
 # Default gateway host definition
 define host{
        use                     generic-host            ; Name of host template to use
        host_name               gateway
        alias                   Default Gateway
        address                 ip.address.or.domain.com.name
        check_command           check-host-alive
        max_check_attempts      20
        notification_interval   60
        notification_period     24x7
        notification_options    d,u,r
 }
 
 define host{
        use                     generic-host            ; Name of host template to use
        host_name               domain1.com
        alias                   Domain 1
        address                 ip.or.host.name
        check_command           check-host-alive
        max_check_attempts      20
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
 }
 
 define host{
        use                     generic-host            ; Name of host template to use
        host_name               domain2.com
        alias                   Domain 2
        address                 ip.address.or.host.name
        check_command           check-host-alive
        max_check_attempts      20
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
 }
 
 define host{
        use                     generic-host            ; Name of host template to use
        host_name               www.google.com
        alias                   Google Webserver
        address                 www.google.com
        check_command           check-host-alive
        max_check_attempts      20
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
 }
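The host definitions above differ only in name, alias and address, so further hosts can be stamped out from a small script. This is only a sketch: print_host is a hypothetical helper, and domain3.com with the address 192.0.2.10 is a made-up example host.

```shell
# Hypothetical helper: print one host definition in the style used above.
# Arguments: host_name, alias, address.
print_host() {
    cat <<EOF
define host{
        use                     generic-host
        host_name               $1
        alias                   $2
        address                 $3
        check_command           check-host-alive
        max_check_attempts      20
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
}
EOF
}

# Example (as root):
#   print_host domain3.com "Domain 3" 192.0.2.10 >> /etc/nagios3/conf.d/hosts.cfg
```

The values that never change could also be moved into the generic-host template, which would shorten every host definition to the three lines that actually differ.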
==== Disable Checking of a Host ====
I have been having problems with 1 host in particular, where nagios gets tied up on "Time to live exceeded" replies and does not wait between the retries. The errors were:
 [[06-24-2007|11:10:13]] HOST ALERT: host.com;DOWN;SOFT;19;CRITICAL - Time to live exceeded (82.195.144.16)
 [[06-24-2007|11:10:13]] HOST ALERT: host.com;DOWN;SOFT;18;CRITICAL - Time to live exceeded (82.195.144.16)
 [[06-24-2007|11:10:13]] HOST ALERT: host.com;DOWN;SOFT;17;CRITICAL - Time to live exceeded (82.195.144.16)
 [[06-24-2007|11:10:13]] HOST ALERT: host.com;DOWN;SOFT;16;CRITICAL - Time to live exceeded (82.195.144.16)
 #and so on for 20 checks with no wait
The same error has been discussed and described further here: http://readlist.com/lists/lists.sourceforge.net/nagios-users/0/2181.html
Instead of putting in some code to get nagios waiting between retries, I simply chose to disable host checking and to check just the service on that server instead. To disable checking of a host, add the checks_enabled line to the define host{ } code (as above), like this:
 define host{
        use                     generic-host            ; Name of host template to use
        host_name               www.google.com
        alias                   Google Webserver
        address                 www.google.com
        check_command           check-host-alive
        max_check_attempts      20
        checks_enabled          0
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
 }
===== 2. Config Nagios hostgroups =====
Hostgroups quite simply group together the hosts in hosts.cfg. They are mainly used to order and group services and hosts together. I created separate hostgroups for various server clusters, i.e. 1 hostgroup for my own server cluster, a second for my computer society servers, and a third for commercial hosting webservers.
 vi /etc/nagios3/conf.d/hostgroups.cfg

 define hostgroup{
        hostgroup_name  my_cluster
        alias           My Server Cluster
        contact_groups  root-my_cluster
        members         gateway, domain1.com, domain2.com
 }
 
 define hostgroup{
        hostgroup_name  other-webservers
        alias           Other Commercial Web Servers
        contact_groups  select-users-my_cluster
        members         www.google.com
 }

===== 3. Config Nagios Contacts =====
Note: As with hosts, the contacts config takes in specific names of people and their contact information. The various contacts are then grouped together in step 4. For this config, I am going to have 2 main contacts: 1 is the root administrator and the second is a general user (for receiving information on the non-essential other-webservers). Again, look at http://nagios.sourceforge.net/docs/2_0/xodtemplate.html#contact for specifics on notification options.
 vi /etc/nagios3/conf.d/contacts.cfg

 define contact{
        contact_name                    root
        alias                           Root Administrator
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,u,r
        service_notification_commands   notify-by-email
        host_notification_commands      host-notify-by-email
        email                           root@domain.com
 }
 
 define contact{
        contact_name                    sburke
        alias                           A Standard/Typical User
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,u,r
        service_notification_commands   notify-by-email
        host_notification_commands      host-notify-by-email
        email                           username@domain.com
 }

===== 4. Config Nagios Contactgroups =====
Again, all the various contacts outlined in step 3 need to be grouped together. The hostgroups.cfg and services.cfg send alert notifications to "contactgroups" and not to individual contacts. Although all these separate configs seem very awkward, they ensure that users, hosts and services can be added easily.
 vi /etc/nagios3/conf.d/contactgroups.cfg

 define contactgroup{
        contactgroup_name       root-my_cluster
        alias                   Root Admins on My Cluster
        members                 root
 }
 
 define contactgroup{
        contactgroup_name       select-users-my_cluster
        alias                   Users on Burkesys
        members                 sburke
 }
Note: "root-my_cluster", "root", "select-users-my_cluster" and "sburke" were selected from Steps 2 and 3.
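Since the group names have to match across several files, a quick pre-check can catch typos before nagios does. The following is a rough sketch only: the three function names are made up, and it just compares contact_groups lines against contactgroup_name lines without understanding anything else about the config.

```shell
# List every contact group referenced by a contact_groups line in the given files.
referenced_groups() {
    awk '{ sub(/;.*/, ""); gsub(/,/, " ") }
         $1 == "contact_groups" { for (i = 2; i <= NF; i++) print $i }' "$@" | sort -u
}

# List every contact group defined by a contactgroup_name line.
defined_groups() {
    awk '{ sub(/;.*/, "") } $1 == "contactgroup_name" { print $2 }' "$@" | sort -u
}

# Print the referenced groups that are not defined anywhere.
# $1 = file holding the contactgroup definitions, remaining args = files to scan.
missing_groups() {
    defs="$1"; shift
    referenced_groups "$@" | while read -r group; do
        defined_groups "$defs" | grep -qxF -- "$group" || echo "$group"
    done
}

# Example:
#   cd /etc/nagios3/conf.d && missing_groups contactgroups.cfg hostgroups.cfg services.cfg
```

No output means every referenced contact group has a definition.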

===== 5. Config Nagios Services =====
This is (typically) the main and final configuration file. All the names from the previous 4 steps must match up correctly with the configs in this step, otherwise nagios will complain and print a helpful debug message.
 vi /etc/nagios3/conf.d/services.cfg

 # Generic service definition template
 define service{
        ; The 'name' of this service template, referenced in other service definitions
        name                            generic-service
        active_checks_enabled           1       ; Active service checks are enabled
        passive_checks_enabled          1       ; Passive service checks are enabled/disabled
        parallelize_check               1       ; Active service checks should be parallelized
                                                ; (disabling this can lead to major performance problems)
        obsess_over_service             1       ; We should obsess over this service (if necessary)
        check_freshness                 0       ; Default is to NOT check service 'freshness'
        notifications_enabled           1       ; Service notifications are enabled
        event_handler_enabled           1       ; Service event handler is enabled
        flap_detection_enabled          0       ; Flap detection is disabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts. Turn this off (0) when testing and doing lots of restarts, otherwise some settings will be cached!
        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
 }
 
 # Service definition
 define service{
        use                             generic-service         ; Name of service template to use
        host_name                       domain1.com, domain2.com
        service_description             PING
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  root-my_cluster
        notifications_enabled           1
        notification_interval           120
        notification_period             24x7
        notification_options            w,u,c,r
        check_command                   check_ping!100.0,20%!500.0,60%
        ;check_ping syntax: !warning if exceeds 100ms,warning if exceeds 20% packet loss!critical if exceeds 500ms,critical if exceeds 60% packet loss
 }
 
 define service{
        use                             generic-service
        host_name                       domain1.com
        service_description             HTTP
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  root-my_cluster
        notification_interval           120
        notification_period             24x7
        notification_options            c,r
        check_command                   check_http
 }
 
 define service{
        use                             generic-service
        host_name                       domain1.com
        service_description             HTTP-vhost_name
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  root-my_cluster
        notification_interval           120
        notification_period             24x7
        notification_options            c,r
        check_command                   check_http_url!http://vhost.domain1.com/path/to/application/page.php   ;please read Step 6 below for extra config required.
 }
 
 define service{
        use                             generic-service
        host_name                       domain2.com
        service_description             DNS
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  root-my_cluster
        notification_interval           120
        notification_period             24x7
        notification_options            c,r
        check_command                   check_dns
 }
 
 define service{
        use                             generic-service
        host_name                       domain2.com
        service_description             MySQL
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  root-my_cluster
        notification_interval           120
        notification_period             24x7
        notification_options            c,r
        check_command                   check_mysql_cmdlinecred!mysqluser!mysqlpassword
 }
 
 define service{
        use                             generic-service
        host_name                       domain2.com
        service_description             SMTP
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  root-my_cluster
        notification_interval           120
        notification_period             24x7
        notification_options            c,r
        check_command                   check_smtp
 }
 ################################################################
 define service{
        use                             generic-service
        host_name                       www.google.com
        service_description             PING
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  select-users-my_cluster
        notification_interval           120
        notification_period             24x7
        notification_options            c,r
        check_command                   check_ping!100.0,20%!500.0,60%
 }
 
 define service{
        use                             generic-service
        host_name                       www.google.com
        service_description             HTTP
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  select-users-my_cluster
        notification_interval           120
        notification_period             24x7
        notification_options            c,r
        check_command                   check_http
 }

The services.cfg can get quite long indeed! Services can be grouped together in servicegroups.cfg, however I didn't bother with this step. Servicegroups provide a better overview in the Web Front end when there are a large number of services.
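Before restarting, nagios can be asked to verify the whole configuration with its -v switch. A small sketch that only restarts when the check passes, assuming the Debian nagios3 binary and init script names (both can be overridden through the NAGIOS_BIN and NAGIOS_INIT variables, which are my own invention):

```shell
# Verify the configuration and restart Nagios only if the check passes.
# NAGIOS_BIN and NAGIOS_INIT are hypothetical overrides; the defaults assume
# the Debian nagios3 package.
reload_nagios_if_valid() {
    nagios_bin="${NAGIOS_BIN:-nagios3}"
    nagios_init="${NAGIOS_INIT:-/etc/init.d/nagios3}"
    if "$nagios_bin" -v /etc/nagios3/nagios.cfg; then
        "$nagios_init" restart
    else
        echo "Config check failed; Nagios was not restarted." >&2
        return 1
    fi
}

# Usage (as root): reload_nagios_if_valid
```

The -v run prints each config error along with the file it was found in, which is usually enough to find a mismatched host or contactgroup name.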

===== 6. Extra Custom Plugin Configs =====
In the services.cfg, there is a "check_http_url" check_command. At this point nagios would give an error, because "check_http_url" is not a stock command: it is a custom command that monitors vhost.domain1.com and saves us from having to define a separate host for each virtual website. It has to be defined first:
 vi /etc/nagios-plugins/config/http.cfg

 # 'check_http_url' command definition
 define command{
        command_name    check_http_url
        command_line    /usr/lib/nagios/plugins/check_http -I $HOSTADDRESS$ -u $ARG1$
 }
To see which options and command line switches are available, do the following:
 /usr/lib/nagios/plugins/check_http --help
There are several options for all of the plugins within /usr/lib/nagios/plugins/ to monitor various specific levels of performance.
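The link between a check_command line in services.cfg and a command definition is the "!" separator: the first field is the command_name and the following fields become $ARG1$, $ARG2$ and so on. A simplified illustration of that split (explain_check_command is a made-up helper and it ignores escaped "!" characters):

```shell
# Split a check_command value on "!": the first field is the command name,
# the remaining fields are what the command definition sees as $ARG1$, $ARG2$, ...
# The body runs in a subshell so the IFS and globbing changes stay local.
explain_check_command() (
    set -f          # disable globbing so "?" or "*" in a URL stay literal
    IFS='!'
    set -- $1       # unquoted on purpose: split the value on "!"
    echo "command_name=$1"
    if [ "$#" -gt 0 ]; then shift; fi
    i=1
    for arg in "$@"; do
        printf '$ARG%s$=%s\n' "$i" "$arg"
        i=$((i + 1))
    done
)

# Example: explain_check_command 'check_ping!100.0,20%!500.0,60%'
```

For the PING service from step 5 this prints check_ping as the command name, 100.0,20% as $ARG1$ (the warning thresholds) and 500.0,60% as $ARG2$ (the critical thresholds).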


**Another config is /etc/nagios/escalations.cfg**. However, at the moment I feel it works ok without this step. I will revisit it at a later stage.

====== Send Nagios Notifications via SMS Text Messages ======
Although a simple config could be made for nagios to send SMS messages via vodasms (o2sms), I chose to do the SMS handling at email delivery time using procmail. Read more here: [[Vodasms#Forward_Emails_via_SMS_Text_Message]]

====== References & Additional Info ======
Vhost & Website Monitoring: http://theories.darwinsys.com/2007/04/05/1175779980000.html <br>
Monitoring tomcat website: http://nagios.org/faqs/viewfaq.php?faq_id=310 <br>
http://www.kernel-panic.it/openbsd/nagios/nagios3.html <br>
Main Nagios Templates and Docs: http://nagios.sourceforge.net/docs/2_0/xodtemplate.html <br>
General: http://www.onlamp.com/pub/a/onlamp/2002/09/26/nagios.html?page=1 <br>
General and Good: http://www.debian-administration.org/articles/299 <br>
General with some mistakes: http://servers.linux.com/servers/04/09/14/2317206.shtml <br>
MySQL info and Nagios: http://www.gatorlug.org/files/GatorLUG.ppt

====== Monitor HTTP via a Proxy ======
If nagios is running on a server whose firewall blocks outgoing http(s) requests, then you will have to use a proxy (if available) to check http on a remote host/server. Here are the configs and tweaks required:
 vi /etc/nagios-plugins/config/http.cfg
 # 'check_http_via_proxy' command definition
 define command{
        command_name    check_http_via_proxy
        command_line    /usr/lib/nagios/plugins/check_http -H $ARG1$ -p $ARG2$ -u $ARG3$ -e 'HTTP/1.0 200 OK'
 }

 vi /etc/nagios3/conf.d/services.cfg
 # edit the check_command for the particular service you require to:
 host_name                       externalserver.com
 check_command                   check_http_via_proxy!proxy.internalserver.com!3128!http://externalserver.com
 # note - sometimes the squid proxy would only serve a cached page. To get around this, the check_command was further tweaked to call a particular webpage, i.e.:
 check_command                   check_http_via_proxy!proxy.internalserver.com!3128!http://externalserver.com/~userwebsite/

Hopefully that should work ok.
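To find out before nagios does, the same plugin call can be tried by hand. A small sketch that mirrors the check_http_via_proxy command line above; the function and the CHECK_HTTP override variable are my own additions, the plugin path is the Debian one used on this page:

```shell
# Run check_http through a proxy by hand, with the same switches as the
# check_http_via_proxy command definition.
# Arguments: proxy host, proxy port, full URL to fetch.
check_http_via_proxy() {
    plugin="${CHECK_HTTP:-/usr/lib/nagios/plugins/check_http}"
    proxy_host="$1"
    proxy_port="$2"
    url="$3"
    "$plugin" -H "$proxy_host" -p "$proxy_port" -u "$url" -e 'HTTP/1.0 200 OK'
}

# Example:
#   check_http_via_proxy proxy.internalserver.com 3128 http://externalserver.com
#   echo "exit code: $?"
```

Nagios plugins report their result through the exit code: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN.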