BCX Network Monitoring - Service level priorities and actions

Service level priorities and actions

Level	Display	Alert	Notes	Example
Major	Customer and Burconix Support Dashboard	Alert Customer and Burconix Support	Current major interruption to service	Core storage offline
High	Customer and Burconix Support Dashboard	Alert Customer and Burconix Support	Potential for future interruption to service	Single disk/PSU failure in redundant configuration
Concern	Customer Dashboard	Alert Customer	Potential interruption to service	HTTPS service offline
Warning	Customer Dashboard	No Alert	Warning of potential issue	Free disk space less than 10GB
Information	Customer Dashboard	No Alert	Possible future action may be required	Toner less than 10%
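
As a rough illustration only, the table above can be read as a lookup from level to who sees the event and who is alerted. The sketch below is not how our monitoring platform is configured; the names `LevelPolicy`, `POLICIES` and `alert_recipients` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class LevelPolicy:
    """Who sees an event on a dashboard and who is actively alerted."""
    dashboards: Tuple[str, ...]
    alerts: Tuple[str, ...]


BOTH = ("Customer", "Burconix Support")
CUSTOMER_ONLY = ("Customer",)

# Mirrors the service level table; the data structure itself is an assumption.
POLICIES: Dict[str, LevelPolicy] = {
    "Major": LevelPolicy(dashboards=BOTH, alerts=BOTH),
    "High": LevelPolicy(dashboards=BOTH, alerts=BOTH),
    "Concern": LevelPolicy(dashboards=CUSTOMER_ONLY, alerts=CUSTOMER_ONLY),
    "Warning": LevelPolicy(dashboards=CUSTOMER_ONLY, alerts=()),
    "Information": LevelPolicy(dashboards=CUSTOMER_ONLY, alerts=()),
}


def alert_recipients(level: str) -> Tuple[str, ...]:
    """Return the parties alerted for a level (empty tuple means no alert)."""
    return POLICIES[level].alerts
```

Read this way, Warning and Information events appear on the Customer Dashboard but never generate an alert.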

The tables below summarize the core events we monitor for each device type, together with the notification level for each event.

Device Types

UPS and Power Protection

We monitor the UPS runtime status including load, battery and self-diagnostic status.

Level	Trigger
High	Battery needs replacing
High	UPS on battery
High	Runtime less than 10 mins
High	Load is critical (90%)
Concern	Battery temperature is too high
Concern	No SNMP data received for 3 mins
Concern	Load is too high (80%)
Concern	UPS has been restarted
Warning	Battery power currently too low to support load
Warning	Last diagnostic test failed
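
The load and runtime triggers above could be expressed as simple threshold checks. This is a minimal sketch under stated assumptions: the table does not say whether the 80% and 90% thresholds are inclusive, so the code assumes "at or above", and the function names are hypothetical.

```python
from typing import Optional


def ups_load_level(load_pct: float) -> Optional[str]:
    """Map UPS load (percent) to a level; most severe check first.

    Assumption: thresholds are inclusive (>=), which the table does not state.
    """
    if load_pct >= 90:
        return "High"
    if load_pct >= 80:
        return "Concern"
    return None  # load is within normal range


def ups_runtime_level(runtime_mins: float) -> Optional[str]:
    """Runtime of less than 10 minutes is a High event."""
    return "High" if runtime_mins < 10 else None
```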

SAN and Storage

We monitor all aspects of the storage hardware and volume availability.

Level	Trigger
High	Storage array is offline
High	No SNMP/API data received for 3 mins
High	Physical disk failed
High	Controller health degraded
High	Virtual disk health degraded
High	Enclosure health degraded
Concern	Virtual disk is not fault tolerant
Concern	SAN has been restarted
Warning	Controller redundancy lost
Warning	Controller not responding to ping

Physical Server Hardware

We monitor all aspects of the server's physical hardware sensors.

Level	Trigger
High	System status is in warning or critical state
High	Power supply is in warning or critical state
High	Disk array controller is in warning or critical state
High	Disk array cache controller battery is in warning or critical state
High	Disk array cache controller is in warning or critical state
High	Physical disk failed
High	Virtual disk offline
High	Fan is in critical state
High	Ambient temperature is above critical threshold
High	No SNMP data received for 3 mins
Concern	Ambient temperature is above warning threshold
Concern	Ambient temperature is too low
Warning	Disk array cache controller non-optimal
Warning	Physical disk is in warning state
Warning	Virtual disk is in warning state
Warning	Fan is in warning state
Warning	System has been restarted

Windows/Linux Server Agent

We monitor availability and metrics including processor, memory, network and disk utilization, as well as the running status of core Windows services.

Level	Trigger
High	Free disk space is less than 500MB on volume
Concern	Free disk space is less than 5% and under 5GB on volume
Concern	Agent is unreachable for 10 mins
Concern	Monitored Windows service is not running
Warning	Free disk space is less than 10% and under 10GB on volume
Warning	Agent is unreachable for 3 mins
Warning	Server has been restarted
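
The disk space and agent availability triggers above combine a percentage with an absolute size, and escalate over time. The sketch below shows one way to evaluate them, checking the most severe condition first. It is illustrative only: it assumes binary units (1 GB = 1024^3 bytes) and inclusive time thresholds, neither of which the table specifies, and the function names are hypothetical.

```python
from typing import Optional

MB = 1024 ** 2  # assumption: binary units
GB = 1024 ** 3


def disk_space_level(free_bytes: int, total_bytes: int) -> Optional[str]:
    """Classify free space on one volume against the table's thresholds."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_pct = 100.0 * free_bytes / total_bytes
    if free_bytes < 500 * MB:
        return "High"
    # Both the percentage and the absolute condition must hold.
    if free_pct < 5 and free_bytes < 5 * GB:
        return "Concern"
    if free_pct < 10 and free_bytes < 10 * GB:
        return "Warning"
    return None


def agent_unreachable_level(unreachable_mins: float) -> Optional[str]:
    """Escalate from Warning (3 mins) to Concern (10 mins)."""
    if unreachable_mins >= 10:
        return "Concern"
    if unreachable_mins >= 3:
        return "Warning"
    return None
```

Requiring both conditions means a very large volume with 50GB free is not flagged merely because that is under 10% of its capacity.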

Network Switch

We monitor core switch traffic utilization on each interface and trigger a warning if a core link goes down.
On edge switch devices we monitor the hardware status and availability.

Level	Trigger
High	Temperature is above critical threshold
High	Power supply status is in warning or critical state
Concern	Temperature is above warning threshold
Concern	Temperature is too low
Concern	Fan is in critical state
Concern	High memory utilization
Concern	No SNMP data received for 3 mins
Concern	Core switch has been restarted
Warning	Core switch link down
Warning	Fan is in warning state
Warning	Edge switch has been restarted

Firewalls and Routers

We monitor traffic utilization, as well as service availability and interface link status.

Level	Trigger
High	Interface down
Concern	No SNMP data received for 3 mins
Warning	Device has been restarted

Network Attached Device (Ping)

We monitor service availability and verify ping response times.

Level	Trigger
Concern	Unavailable by ICMP ping for 3 mins
Warning	High ICMP ping loss
Warning	High ICMP ping response time

Web Services

We monitor SSL certificates and the availability of web services including HTTP, HTTPS, SMTP and TCP port 443.
These can be monitored from within your network, or from a remote location based in Nottingham.

Level	Trigger
High	SSL certificate has expired
High	SSL certificate expires in less than 7 days
Concern	Web service has been down for 3 mins
Concern	SSL certificate expires in less than 14 days
Warning	SSL certificate expires in less than 30 days
Information	SSL certificate expires in less than 60 days
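
The certificate expiry levels above form a ladder based on the time remaining. As a minimal sketch (the function name `certificate_level` is hypothetical, and this is not the monitoring platform's actual logic), the ladder could be evaluated like this:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def certificate_level(not_after: datetime, now: datetime) -> Optional[str]:
    """Map a certificate's expiry time to a level, most severe first."""
    days_left = (not_after - now).total_seconds() / 86400
    if days_left <= 0:
        return "High"  # certificate has expired
    if days_left < 7:
        return "High"
    if days_left < 14:
        return "Concern"
    if days_left < 30:
        return "Warning"
    if days_left < 60:
        return "Information"
    return None  # 60 days or more remaining: no event


# Example: a certificate with 10 days remaining.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(certificate_level(now + timedelta(days=10), now))  # Concern
```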

Printers

We monitor printers for status and toner levels.

Level	Trigger
Warning	No SNMP data received for 3 mins
Warning	Printer is in error state
Warning	Consumable on printer is empty
Information	Consumable on printer is under 10%