System Monitoring Overview

If it isn't measured and monitored, you aren't managing it.

Historical monitoring captures events that occur over a period of time, in order to determine trends. This is used for baselining.
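
As a rough illustration of baselining, the sketch below derives a baseline from historical samples and flags a new reading that strays far from it. The sample values, the function names, and the three-standard-deviation threshold are all assumptions made for this example, not part of any particular tool.

```python
from statistics import mean, stdev

def baseline(samples):
    """Return (mean, standard deviation) of the historical samples."""
    return mean(samples), stdev(samples)

def is_unusual(value, samples, k=3.0):
    """True if value is more than k standard deviations from the baseline mean."""
    avg, sd = baseline(samples)
    return abs(value - avg) > k * sd

# Hypothetical load averages collected over time (illustrative values only)
history = [0.8, 1.0, 1.2, 0.9, 1.1]

if __name__ == "__main__":
    for reading in (1.05, 4.0):
        print(reading, "unusual" if is_unusual(reading, history) else "normal")
```

In practice the history would come from a tool such as sar, and the threshold would be tuned to how noisy the resource is.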

Service availability monitoring shows events of interest as they occur, and is used to spot trouble. This needs an alerting system to be useful.

Monitor everything deemed important. A resource is important if you'll get into trouble with your PHB (pointy-haired boss) if that resource runs out. In a regulated industry, some things must be monitored, while others may not be monitored at all. Some things should be monitored historically for audit or security purposes.

You need a policy that:

Protects the data collected
Determines the data retention period
Says how to handle old data
Defines what personal identifying data is collected in accordance with the organization's data privacy policy
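
One way the retention part of such a policy might be enforced is sketched below: it lists the files in a directory that are older than the retention period. The function name and the 90-day figure in the usage note are hypothetical. The sketch only identifies old files; whether to archive or delete them is a policy decision.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400

def expired_files(directory, retention_days, now=None):
    """Return the files in directory last modified before the retention cutoff."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * SECONDS_PER_DAY
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )
```

For example, expired_files("/var/log/monitoring", 90) would list files untouched for more than 90 days, assuming that (hypothetical) directory holds the collected data.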

Some commonly monitored resources on production servers:

Your services (Web, SSH, DNS, print, file sharing, ...)
Free disk space
The state of your mirrors
Response time
Failures of any sort (reliability)
Memory (used, free, swap space used)
Login events
Mail events
Unusual penetration events
Network events
Storage subsystem events
Current, peak, and average load (and general trends) for anything of interest
Hardware data (disk speed, failure rate, and age; system temperature)
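
A few of these resources can be sampled with nothing but the Python standard library. The sketch below is a minimal example, assuming a Unix-like system; the memory figures rely on /proc/meminfo, which is Linux-specific, and the function and key names are made up for illustration.

```python
import os
import shutil

def disk_free_percent(path="/"):
    """Percentage of free space on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict (values mostly in kB)."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info

def snapshot(path="/"):
    """Collect one sample of disk, load, and (on Linux) memory data."""
    snap = {"disk_free_pct": round(disk_free_percent(path), 1)}
    snap["load_1m"], snap["load_5m"], snap["load_15m"] = os.getloadavg()
    try:
        with open("/proc/meminfo") as f:
            mem = parse_meminfo(f.read())
        snap["mem_free_kb"] = mem.get("MemFree")
        snap["swap_used_kb"] = mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)
    except OSError:
        pass  # no /proc/meminfo here (not Linux); skip the memory figures
    return snap

if __name__ == "__main__":
    print(snapshot())
```

Run from cron, a script like this could append each snapshot to a file for later trend analysis.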

Basic tools (along with cron and some scripting):
- top
- df
- free
- sar
- ac
- w
- hdparm
- smartctl
- i2cdump
- sensors

To monitor your network interfaces:
- ip -s link show
- ifconfig
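
On Linux, per-interface counters similar to those printed by the commands above are also exposed in /proc/net/dev. The sketch below parses text in that format; it assumes the usual layout of two header lines followed by one line per interface with sixteen counters. The function and key names are illustrative.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev-style text into {interface: selected counters}."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) < 16:
            continue  # skip anything that is not a full interface line
        values = [int(v) for v in fields[:16]]
        stats[name.strip()] = {
            "rx_bytes": values[0], "rx_packets": values[1],
            "rx_errs": values[2], "rx_drop": values[3],
            "tx_bytes": values[8], "tx_packets": values[9],
            "tx_errs": values[10], "tx_drop": values[11],
        }
    return stats

if __name__ == "__main__":
    with open("/proc/net/dev") as f:  # Linux only
        for iface, counters in parse_net_dev(f.read()).items():
            print(iface, counters)
```

Because the counters are cumulative, a monitoring script would normally sample them twice and report the difference over the interval.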

Other tools monitor log files, filter and condense the data, produce reports (some with a GUI), and include alerting features. Some of these log file monitoring and filtering tools are:
- logwatch
- logcheck
- swatch
- logsurfer
- SEC (one of the most comprehensive free, open source tools available)
- webalizer
- HP OpenView
- IBM Tivoli
- ftp.opensysmon.com
  (A collection of open source monitoring tools integrated into a web page)
- A Google search for "(Unix OR Linux) (System OR Network) Monitoring tools" will find tons of hits.
- You can find a reviewed collection of useful open source tools at www.ossim.net.
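
The core idea behind the log filtering tools listed above can be sketched in a few lines: match each log line against a set of patterns and condense the result into counts. The patterns and labels below are hypothetical examples; real tools ship with much larger, carefully maintained rule sets.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only
PATTERNS = {
    "failed_login": re.compile(r"Failed password for (invalid user )?\S+"),
    "disk_error": re.compile(r"I/O error"),
    "oom_kill": re.compile(r"Out of memory"),
}

def summarize(lines, patterns=PATTERNS):
    """Count how many log lines match each labeled pattern."""
    counts = Counter()
    for line in lines:
        for label, pattern in patterns.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

A cron job could feed the day's log lines to summarize() and mail the counts, which is roughly the condensed report these tools produce.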

An important feature of monitoring and alerting tools is automatic escalation.
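
Automatic escalation means that an alert nobody acknowledges is passed on to more people as time goes by. A minimal sketch of that logic follows; the tiers, time limits, and names are invented for the example.

```python
# Hypothetical escalation tiers: (minutes unacknowledged, who to notify)
TIERS = [
    (0, "on-call admin"),
    (15, "backup admin"),
    (60, "team lead"),
]

def contacts_to_notify(minutes_unacknowledged, tiers=TIERS):
    """Return everyone who should have been notified by now."""
    return [who for threshold, who in tiers if minutes_unacknowledged >= threshold]

if __name__ == "__main__":
    for minutes in (0, 20, 90):
        print(minutes, contacts_to_notify(minutes))
```

A real alerting tool would also record acknowledgments and stop escalating once someone responds.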

Be careful your network bandwidth isn't used up by your monitoring data. Aim for no more than 1% used, especially over slower links.
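
The 1% guideline is easy to check with some arithmetic. The sketch below compares an estimated monitoring load with the budget for a link; the host count, bytes per poll, and polling interval are assumed figures that you would replace with your own measurements.

```python
def monitoring_budget_bps(link_bps, fraction=0.01):
    """Bandwidth (bits per second) available for monitoring traffic."""
    return link_bps * fraction

def monitoring_load_bps(hosts, bytes_per_poll, poll_interval_s):
    """Estimated monitoring traffic in bits per second."""
    return hosts * bytes_per_poll * 8 / poll_interval_s

def within_budget(link_bps, hosts, bytes_per_poll, poll_interval_s, fraction=0.01):
    """True if the estimated load fits within the given fraction of the link."""
    load = monitoring_load_bps(hosts, bytes_per_poll, poll_interval_s)
    return load <= monitoring_budget_bps(link_bps, fraction)
```

For instance, on a 1.544 Mb/s link the budget is about 15.4 kb/s. Polling 50 hosts at roughly 500 bytes each every 60 seconds generates about 3.3 kb/s, which fits; polling the same hosts every 10 seconds generates 20 kb/s, which does not.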

Use SNMP to monitor remote servers. Summarize remote network monitoring with RMON. (SNMP components: the GET and SET commands, a management console, and a MIB for each type of component to monitor.) See the Cisco SNMP tutorial for more information. (Command line tools: snmp*, arpsnmp, net-snmp-config.)

Related to log files, process accounting was used to track system use in order to allocate system expenses (say, by department). Full accounting can take a lot of CPU time, RAM, and disk space. See the accton and sa commands (and sar for general system activity data) for details.
