Specifying how frequently to check a host or service
In this recipe, we'll adjust the definition of a very important host so that Nagios Core checks whether it is up every three minutes. If a check fails and the host appears to be down, Nagios Core will check again after one minute before sending a notification about the state to the host's defined contact. We'll do this by customizing the definition for an existing host.
Getting ready
You should have a Nagios Core 4.0 or newer server with at least one host configured already. We'll use the example of sparta.example.net, a host defined in its own file.
You should also understand the basics of commands and plugins, in particular the meaning of the check_command directive. These are covered in the recipes in Chapter 2, Working with Commands and Plugins.
How to do it...
We can customize the check frequency for a host as follows:
- Change to the objects configuration directory for Nagios Core. The default location is /usr/local/nagios/etc/objects. If you've put the definition of your host in a different directory, change to that directory instead:
# cd /usr/local/nagios/etc/objects
- Edit the file containing your host definition and find the definition within the file:
# vi sparta.example.net.cfg
The host definition may look something like this:
define host {
    use        linux-server
    host_name  sparta.example.net
    alias      sparta
    address    192.0.2.21
}
- Add or edit the value of the check_interval directive to 3:
define host {
    use             linux-server
    host_name       sparta.example.net
    alias           sparta
    address         192.0.2.21
    check_interval  3
}
- Add or edit the value of the retry_interval directive to 1:
define host {
    use             linux-server
    host_name       sparta.example.net
    alias           sparta
    address         192.0.2.21
    check_interval  3
    retry_interval  1
}
- Add or edit the value of max_check_attempts to 2:
define host {
    use                 linux-server
    host_name           sparta.example.net
    alias               sparta
    address             192.0.2.21
    check_interval      3
    retry_interval      1
    max_check_attempts  2
}
- Validate the configuration and restart the Nagios Core server:
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
# /etc/init.d/nagios reload
With this done, Nagios Core will run the relevant check_command (probably something like check-host-alive) against this host every three minutes. If a check fails, it will flag the host as down, run the check again after one minute, and send a notification to the defined contact only if the second check fails as well.
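To make the timing concrete, the following Python sketch works out when each check attempt would run, counted from the first failed check. It is only an illustration: the function name check_timeline is hypothetical, and the model assumes checks run exactly on schedule, ignoring plugin execution time and scheduler latency.

```python
def check_timeline(retry_interval, max_check_attempts, interval_length=60):
    """Return the offset in seconds, from the first failed check,
    of every check attempt up to max_check_attempts."""
    # One retry step in seconds: interval units times seconds per unit.
    step_seconds = retry_interval * interval_length
    # Attempt 0 is the first failed check; later attempts are the retries.
    return [attempt * step_seconds for attempt in range(max_check_attempts)]


# Values from this recipe: retry after one minute, two attempts in total.
offsets = check_timeline(retry_interval=1, max_check_attempts=2)
print(offsets)      # [0, 60]
print(offsets[-1])  # 60
```

Under these assumptions, the last offset is the earliest point at which a notification could be sent: about 60 seconds after the first failed check for the values used here.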
How it works...
The preceding configuration changed three properties of the host object type to effect the changes we needed:
- check_interval: This defines how long to wait between successive checks of the host under normal conditions. We set this to 3, or three minutes.
- retry_interval: This defines how long to wait between follow-up checks of the host after first finding problems with it. We set this to 1, or one minute.
- max_check_attempts: This defines how many checks in total should be run before a notification is sent. We set this to 2, for two checks. This means that after the first failed check is run, Nagios Core will run another check a minute later and will only send a notification if this check fails as well. After two checks have been run and the host is still in a problem state, it will go from a SOFT state to a HARD state.
Note that setting these directives in a host that derives from a template, as is the case with our example, will override any of the same directives in the template.
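The SOFT to HARD transition described above can be sketched as a small Python model. This is a deliberately simplified illustration, not how Nagios Core is implemented: the function name classify_states is hypothetical, each check result is reduced to a boolean, and state types for recoveries are not modeled.

```python
def classify_states(results, max_check_attempts):
    """Label a sequence of check results (True = check passed).

    A failure stays SOFT until max_check_attempts consecutive
    failures have been seen, at which point it becomes HARD.
    """
    labelled = []
    consecutive_failures = 0
    for passed in results:
        if passed:
            # Any passing check resets the failure counter.
            consecutive_failures = 0
            labelled.append("UP")
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_check_attempts:
                labelled.append("DOWN (HARD)")
            else:
                labelled.append("DOWN (SOFT)")
    return labelled


print(classify_states([True, False, False, True], max_check_attempts=2))
# ['UP', 'DOWN (SOFT)', 'DOWN (HARD)', 'UP']
```

In this model, a notification would correspond to the first HARD entry, which with max_check_attempts set to 2 is the second consecutive failure.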
There's more...
It's important to note that we can also define the units used by the check_interval and retry_interval directives. They use minutes only by default, based on the interval_length setting that's normally defined in the root configuration file for Nagios Core, by default /usr/local/nagios/etc/nagios.cfg:
interval_length=60
If we wanted to specify these periods in seconds instead, we could set this value to 1 instead of 60:
interval_length=1
This would allow us, for example, to set check_interval to 15, to check a host every 15 seconds. Note that if we have a lot of hosts with such a tight checking schedule, it might overburden the Nagios Core process, particularly if the checks take a long time to complete.
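As a rough way to gauge that load, the following Python sketch estimates how many regular host checks per minute a given schedule implies. The function name checks_per_minute and the host count of 300 are assumptions made for illustration, and the estimate ignores retries, service checks, and how long each check takes to run.

```python
def checks_per_minute(host_count, check_interval, interval_length=60):
    """Estimate regular checks per minute across host_count hosts."""
    # Seconds between successive checks of a single host.
    period_seconds = check_interval * interval_length
    # Each host is checked 60 / period_seconds times per minute.
    return host_count * 60 / period_seconds


# 300 hosts checked every 3 minutes (interval_length=60).
print(checks_per_minute(300, check_interval=3))                       # 100.0
# The same hosts checked every 15 seconds (interval_length=1).
print(checks_per_minute(300, check_interval=15, interval_length=1))   # 1200.0
```

Under these assumptions, moving from a three-minute interval to a 15-second one multiplies the check rate by twelve.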
Changing these properties for a large number of hosts can be tedious, so if it's necessary to set these directives to some common value for more than a few hosts, it may be appropriate to set the values in a host template and then have these hosts inherit from it. Refer to the Using inheritance to simplify configuration recipe in Chapter 9, Managing Configuration, for more details.
Note that the same three directives also work for service declarations and have the same meaning. We could define the same notification behavior for a service on sparta.example.net with a declaration like this:
define service {
    use                  generic-service
    host_name            sparta.example.net
    service_description  HTTP
    check_command        check_http
    check_interval       3
    retry_interval       1
    max_check_attempts   2
}
See also
- The Scheduling downtime for a host recipe in this chapter
- The Using inheritance to simplify configuration recipe in Chapter 9, Managing Configuration