Posted by: chris
Today I had to debug a strange Nagios NRPE issue where the check_procs plug-in didn't properly report the status of a running process. The configured command was:
/usr/lib/nagios/plugins/check_procs -u myuser -C myprogram -a "keyword" -w 1:1 -c 1:1
After rebooting the machine, the status of this check turned red even though the process in question was running properly. After restarting nrpe on that machine, the check turned green. Running the check manually also always succeeded.
The cause was the value of the COLUMNS environment variable. At boot the value was fairly small (COLUMNS=96), and nrpe inherited it when it was started:
# cat /proc/$(pidof nrpe)/environ | tr "[:cntrl:]" "\n" | grep COLUMN
COLUMNS=96
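The one-liner above can be wrapped in a small helper. This is only a sketch: the function name environ_value is made up for illustration, and it assumes a Linux-style environ file in which the entries are separated by NUL bytes.

```shell
# environ_value FILE NAME
# Print the value of variable NAME from a NUL-separated environ file.
# (Hypothetical helper, not part of Nagios or NRPE.)
environ_value() {
    # Turn the NUL separators into newlines, then print only the value
    # of the entry that starts with "NAME=".
    tr '\000' '\n' < "$1" | sed -n "s/^$2=//p"
}

# Example (needs a running nrpe and usually root privileges):
#   environ_value "/proc/$(pidof nrpe)/environ" COLUMNS
```

Matching on the full "NAME=" prefix avoids picking up other variables that merely contain the same string.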
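The plug-in uses "/bin/ps axwo 'stat uid pid ppid vsz rss pcpu etime comm args'" to get the list of running processes. ps evaluates COLUMNS and cuts each line of its output to 133 characters. Since the columns in front of args already use up part of that width, only the first 64 characters of a process command line survive, and a keyword beyond that point is never seen by the check.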
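To see why the keyword disappears, the truncation can be simulated without ps. The sketch below is not what check_procs does internally: keyword_survives is a hypothetical helper that cuts a line to a given width with cut and then searches what is left, and the command line is made up.

```shell
# keyword_survives WIDTH LINE KEYWORD
# Succeed if KEYWORD is still found after LINE is cut to WIDTH characters.
keyword_survives() {
    printf '%s\n' "$2" | cut -c "1-$1" | grep -qF -- "$3"
}

# A made-up command line: 60 filler characters, a space, then the keyword.
cmdline="$(printf '%060d' 0) keyword"

# Cut at 64 characters, as described for the small COLUMNS value.
if keyword_survives 64 "$cmdline" keyword; then echo "64: found"; else echo "64: lost"; fi

# Cut at 256 characters, as with the larger COLUMNS value.
if keyword_survives 256 "$cmdline" keyword; then echo "256: found"; else echo "256: lost"; fi
```

With this input the keyword starts at character 62, so the 64-character cut leaves only "key" and the first check reports it as lost, while the second one finds it.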
There are several possible solutions:
- We could move the keyword into the first 64 characters of the process command line. However, this might require changes to proprietary software.
- This is actually an issue in check_procs itself. The maintainers could fix it, e.g. by modifying the ps command or its environment [1].
- We could extend /etc/nagios/nrpe.cfg to "command[check_myprogram_proc]=COLUMNS=256 /usr/lib/nagios/plugins/check_procs ..." - this actually seems to work.
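The last option relies on the usual shell rule that a NAME=value prefix is passed only to the command that follows it. The sketch below shows that rule in isolation. It assumes that NRPE hands the configured command line to a shell, which would fit the observation that the prefix works, and the sh -c child is just a stand-in for the plug-in.

```shell
# Pretend this is the small value that nrpe inherited at boot.
COLUMNS=96
export COLUMNS

# Without a prefix the child process sees the inherited value.
sh -c 'echo "child sees COLUMNS=$COLUMNS"'

# With a prefix this one child sees the larger value instead.
COLUMNS=256 sh -c 'echo "child sees COLUMNS=$COLUMNS"'
```

The advantage of the prefix is that it only affects this one check and leaves the environment of nrpe itself alone.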
Another thing I found is that it matters how nrpe is restarted. Using "service nrpe restart" starts it with an empty environment and a small COLUMNS value. Using "/etc/init.d/nrpe restart" keeps the settings of the current shell. This is particularly funny because the result of the nrpe check then depends on the current size of the putty/xterm window.
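The difference between the two restart methods can be reproduced with env -i, which starts a command with an empty environment. This is a sketch of the effect only. Treating the service wrapper as an env -i style start is an assumption based on the behaviour described above, and the sh -c child stands in for the daemon.

```shell
# Simulate a login shell in a wide terminal window.
COLUMNS=200
export COLUMNS

# Like "/etc/init.d/nrpe restart": the child inherits the shell's value.
sh -c 'echo "inherited: COLUMNS=${COLUMNS-unset}"'

# Like a start with a scrubbed environment: only PATH is passed on,
# so the terminal width of the calling shell never reaches the child.
env -i PATH="$PATH" sh -c 'echo "scrubbed:  COLUMNS=${COLUMNS-unset}"'
```

Whatever value the daemon ends up with is then handed down to every ps call the plug-in makes, which is why the check result can change with the way nrpe was started.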
[1] http://tracker.nagios.org/view.php?id=231
Christophs Weblog
http://weblog.christoph-probst.com/article.php/20110718143604605