Nagios Core Administration Cookbook (Second Edition)

Specifying how frequently to check a host or service

In this recipe, we'll adjust the definition of a very important host so that Nagios Core checks whether it is up every three minutes. If a check fails and the host appears to be down, Nagios Core will check again after one minute before sending a notification about the state to the host's defined contact. We'll do this by customizing the definition of an existing host.

Getting ready

You should have a Nagios Core 4.0 or newer server with at least one host configured already. We'll use the example of sparta.example.net, a host defined in its own file.

You should also understand the basics of commands and plugins, in particular the meaning of the check_command directive. These are covered in the recipes in Chapter 2, Working with Commands and Plugins.

How to do it...

We can customize the check frequency for a host as follows:

  1. Change to the objects configuration directory for Nagios Core. The default location is /usr/local/nagios/etc/objects. If you've put the definition of your host in a different directory, change to that directory instead:
    # cd /usr/local/nagios/etc/objects
    
  2. Edit the file containing your host definition and find the definition within the file:
    # vi sparta.example.net.cfg
    

    The host definition may look something like this:

    define host {
        use                 linux-server
        host_name           sparta.example.net
        alias               sparta
        address             192.0.2.21
    }
  3. Add the check_interval directive with a value of 3, or edit its existing value:
    define host {
        use                 linux-server
        host_name           sparta.example.net
        alias               sparta
        address             192.0.2.21
        check_interval      3
    }
  4. Add the retry_interval directive with a value of 1, or edit its existing value:
    define host {
        use                 linux-server
        host_name           sparta.example.net
        alias               sparta
        address             192.0.2.21
        check_interval      3
        retry_interval      1
    }
  5. Add the max_check_attempts directive with a value of 2, or edit its existing value:
    define host {
        use                 linux-server
        host_name           sparta.example.net
        alias               sparta
        address             192.0.2.21
        check_interval      3
        retry_interval      1
        max_check_attempts  2
    }
  6. Validate the configuration and reload the Nagios Core server:
    # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    # /etc/init.d/nagios reload
    

    With this done, Nagios Core will run the relevant check_command (probably something like check-host-alive) against this host every three minutes. If the check fails, it will flag the host as down, check again after one minute, and send a notification to the host's defined contact only if the second check fails as well.

How it works...

The preceding configuration changed three properties of the host object type to effect the changes we needed:

  • check_interval: This defines how long to wait between successive checks of the host under normal conditions. We set this to 3, or three minutes.
  • retry_interval: This defines how long to wait between follow-up checks of the host after first finding problems with it. We set this to 1, or one minute.
  • max_check_attempts: This defines how many checks in total should be run before a notification is sent. We set this to 2, for two checks. This means that after the first check fails, Nagios Core will run another check a minute later and will send a notification only if that check fails as well. Once two checks have been run and the host is still in a problem state, it goes from a SOFT state to a HARD state.
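
The interaction of these three directives can be sketched as a small simulation. This is a simplified model for illustration, not Nagios Core's actual scheduler: the function name simulate_host_checks and the tuple layout are assumptions made for this example, every passing check is recorded as a HARD UP state, and repeat notifications are not modelled.

```python
def simulate_host_checks(results, check_interval=3, retry_interval=1,
                         max_check_attempts=2):
    """Model the check schedule for a host, given a sequence of check results.

    `results` holds one boolean per check (True means the check passed).
    Returns a list of (minute, status, state_type, notify) tuples.
    Hypothetical helper for illustration; not part of Nagios Core.
    """
    timeline = []
    minute = 0
    failures = 0  # consecutive failed checks so far
    for passed in results:
        if passed:
            failures = 0
            # Simplification: every passing check is recorded as HARD UP.
            timeline.append((minute, "UP", "HARD", False))
            minute += check_interval
            continue
        failures += 1
        if failures < max_check_attempts:
            # Still retrying: SOFT state, no notification, shorter interval.
            timeline.append((minute, "DOWN", "SOFT", False))
            minute += retry_interval
        else:
            # Attempts exhausted: HARD state. Notify only on the check that
            # first reaches the limit; repeat notifications are not modelled.
            notify = failures == max_check_attempts
            timeline.append((minute, "DOWN", "HARD", notify))
            minute += check_interval
    return timeline


# One passing check followed by three failures, using the recipe's values.
for entry in simulate_host_checks([True, False, False, False]):
    print(entry)
```

With the recipe's values, the first failure at minute 3 leaves the host in a SOFT state, and the retry at minute 4 is the one that moves it to a HARD state and triggers the notification.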

Note that setting these directives in a host that derives from a template, as is the case with our example, will override any of the same directives in the template.
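
As a rough illustration of that override behavior, the following sketch treats the template and the host as plain dictionaries and lets the host's values win. The template values shown are hypothetical placeholders, not necessarily what your linux-server template contains, and the function effective_directives is an assumption made for this example rather than anything Nagios Core provides.

```python
# Hypothetical template values, for illustration only; check your own
# template definition for the real ones.
template = {
    "check_command": "check-host-alive",
    "check_interval": "5",
    "retry_interval": "1",
    "max_check_attempts": "10",
}

# The host definition from this recipe.
host = {
    "use": "linux-server",
    "host_name": "sparta.example.net",
    "alias": "sparta",
    "address": "192.0.2.21",
    "check_interval": "3",
    "retry_interval": "1",
    "max_check_attempts": "2",
}


def effective_directives(template, host):
    """Return the directives that apply to the host after inheritance.

    Directives set on the host replace those from the template; anything
    the host leaves unset is taken from the template.
    """
    merged = dict(template)
    merged.update(host)
    return merged


effective = effective_directives(template, host)
for name in ("check_command", "check_interval", "retry_interval",
             "max_check_attempts"):
    print(name, effective[name])
```

Here the host keeps the template's check_command because it does not define one itself, while its own three timing directives replace the template's.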

There's more...

It's important to note that we can also define the units used by the check_interval and retry_interval directives. They use minutes only by default; the unit is determined by the interval_length setting, which is given in seconds and is normally defined in the root configuration file for Nagios Core, by default /usr/local/nagios/etc/nagios.cfg:

interval_length=60

If we wanted to specify these periods in seconds, we could set this value to 1 instead of 60:

interval_length=1

This would allow us, for example, to set check_interval to 15, to check a host every 15 seconds. Note that if we have a lot of hosts with such a tight checking schedule, it might overburden the Nagios Core process, particularly if the checks take a long time to complete.
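
The arithmetic behind this can be sketched in a few lines. The function names interval_to_seconds and checks_per_minute are hypothetical helpers written for this example, and the load figure is only a back-of-the-envelope estimate that assumes checks are spread evenly over time.

```python
def interval_to_seconds(interval, interval_length=60):
    """Convert a check_interval or retry_interval value to seconds."""
    return interval * interval_length


def checks_per_minute(host_count, check_interval, interval_length=60):
    """Estimate how many host checks run per minute on average.

    Assumes every host uses the same check_interval and that checks are
    evenly spread; real scheduling will vary.
    """
    seconds = interval_to_seconds(check_interval, interval_length)
    return host_count * 60 / seconds


# Default units: check_interval 3 means 180 seconds.
print(interval_to_seconds(3))
# With interval_length=1, check_interval 15 means 15 seconds.
print(interval_to_seconds(15, interval_length=1))
# Compare the average check rate for 300 hosts under each schedule.
print(checks_per_minute(300, 3))
print(checks_per_minute(300, 15, interval_length=1))
```

Comparing the last two figures gives a feel for how much extra work a 15-second schedule creates compared with a three-minute one for the same number of hosts.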

Don't forget that changing these properties for a large number of hosts can be tedious, so if it's necessary to set these directives to some common value for more than a few hosts, it may be appropriate to set the values in a host template and then have these hosts inherit from it. Refer to the Using inheritance to simplify configuration recipe in Chapter 9, Managing Configuration, for more details.

Note that the same three directives also work for service declarations and have the same meaning. We could define the same notification behavior for a service on sparta.example.net with a declaration like this:

define service {
    use                  generic-service
    host_name            sparta.example.net
    service_description  HTTP
    check_command        check_http
    check_interval       3
    retry_interval       1
    max_check_attempts   2
}

See also

  • The Scheduling downtime for a host recipe in this chapter
  • The Using inheritance to simplify configuration recipe in Chapter 9, Managing Configuration