Hardware Monitoring: Understanding Missing Devices and Connector Failures

This article explains how missing-device and connector-failure alerts arise in hardware monitoring and how to tell monitoring failures from genuine hardware faults.

Objective

To provide insight into the Missing Device Detection and Connector Failure mechanisms, with a view to reducing unwanted "missing device" and "connector failed" alerts.

Solution

In general, there are three sources of large numbers of missing-device and connector-failure alerts:

1. On Linux/Windows systems with Hardware Manufacturer's Agent (Agent Failure):

  • The Hardware Manufacturer's Agent starts to fail: it still answers, but no longer provides status for all components.
  • "No Collect Value" error messages start to appear in the SOW, but do not trigger any alerts.
  • When the KM performs its discovery (every hour) to find new or missing components, several components are marked as missing. These trigger alarms.
  • A few minutes later, the whole Hardware Manufacturer's Agent fails. Monitoring of all components (including the missing ones) goes offline. The connectors fail and a "connector failure alarm" is generated.
  • The Hardware Manufacturer's Agent Service is usually restarted automatically by the OS KM at this point. The connector then re-activates and monitoring returns to normal.

For the above sequence of events to happen, the discovery (which only runs every hour) needs to run a few minutes before the Hardware Manufacturer's Agent fails completely.

If the complete failure occurs between discoveries, the only alert generated is a "connector failure alarm".

2. On UNIX systems (Service/Login Failures):

  • The prtpicl program is known to stop responding to requests for several minutes at a time.
  • MP/GSP cards are known to lock up on a regular basis and not accept any further logins.

3. On all systems (Timeouts):

Timeouts can occur for many reasons (an overloaded server, slow management cards). Most timeout issues have been resolved either by extending timeouts through connector patches or by other workarounds.
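
The timing dependency described under source 1 (Agent Failure) can be sketched as follows. This is only an illustration of the reasoning, not code from the Hardware KM: the function name, the minute-based timeline and the assumption that discoveries run at fixed intervals from a known start are all hypothetical.

```python
import math


def alerts_for_agent_failure(partial_failure_min, full_failure_min,
                             discovery_period_min=60, first_discovery_min=0):
    """Return the alert kinds expected for one agent failure (illustrative only).

    partial_failure_min: minute at which the agent stops reporting some components
    full_failure_min:    minute at which the agent fails completely
    Discoveries are assumed to run at first_discovery_min, then every
    discovery_period_min minutes (hypothetical fixed schedule).
    """
    if full_failure_min < partial_failure_min:
        raise ValueError("full failure cannot precede partial failure")

    # First discovery that runs at or after the start of the partial failure.
    periods = math.ceil((partial_failure_min - first_discovery_min)
                        / discovery_period_min)
    next_discovery = first_discovery_min + max(periods, 0) * discovery_period_min

    alerts = []
    # A discovery inside the degraded window marks components as missing.
    if next_discovery < full_failure_min:
        alerts.append("missing device")
    # The complete agent failure always takes the connector down.
    alerts.append("connector failure")
    return alerts


# Agent degrades at minute 55, discovery runs at minute 60, agent dies at 65.
print(alerts_for_agent_failure(55, 65))
# Agent degrades at minute 10 and dies at minute 20, well before the next discovery.
print(alerts_for_agent_failure(10, 20))
```

In the first call a discovery falls inside the degraded window, so both alert types are expected; in the second, only the connector failure is.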

Understanding Missing/Connector Failures

Any group of missing-device or connector-failure events from the Hardware KM should generally be regarded as a failure of the monitoring mechanism and not as genuine hardware faults. These events should be escalated to the Patrol Monitoring Team.

A single missing-device event, especially when no other component of the same type in that server is being reported as missing, should be regarded as a "real" missing component.

In the rare case that a connector monitors only a single component (e.g. a disk monitoring connector like WMI-Disks that monitors only one disk), the connector might report a connector failure instead of a missing component.

Other types of events (not missing-device or connector-failure events, but fan failure, logical disk degraded, etc.) should be regarded as real events even if they occur in groups.
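
The triage rules above can be expressed as a small sketch. It is an assumption-laden illustration, not part of the Hardware KM: the event representation as (host, kind) tuples, the kind strings, the function name and the choice of "more than one event per host" as the definition of a group are all hypothetical.

```python
from collections import Counter

# Event kinds that usually point to the monitoring mechanism (hypothetical labels).
MONITORING_KINDS = {"missing", "connector failure"}


def triage(events):
    """Classify (host, kind) events following the rules in this article.

    Returns a list of (host, kind, verdict) tuples in the input order.
    A "group" is assumed to mean more than one missing/connector-failure
    event on the same host.
    """
    per_host = Counter(host for host, kind in events if kind in MONITORING_KINDS)

    results = []
    for host, kind in events:
        if kind not in MONITORING_KINDS:
            # Fan failure, logical disk degraded, etc.: real even in groups.
            verdict = "real event"
        elif per_host[host] > 1:
            verdict = "probable monitoring failure, escalate"
        elif kind == "missing":
            verdict = "probable real missing component"
        else:
            # Lone connector failure: may be a single-component connector.
            verdict = "check whether the connector monitors a single component"
        results.append((host, kind, verdict))
    return results


sample = [
    ("srv1", "missing"),
    ("srv1", "missing"),
    ("srv1", "connector failure"),
    ("srv2", "missing"),
    ("srv3", "fan failure"),
]
for row in triage(sample):
    print(row)
```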

Recommendations for customers

1. Ensure that all appropriate patches and KBs have been applied/followed.

2. Increase the n-times value for connectors: right-click "Hardware" or "Hardware on Localhost", then select KM Commands, This System's Settings, Alert after N-Times.

Change the Connector Parameters value to an integer greater than one.

    To find an appropriate value for this integer, look through the parameter history in Patrol for the connectors on this system and work out how many collects (by default, one every two minutes) the average connector failure lasts, then add a couple of collects as an error margin. Setting this value too high can result in real faults being missed, because no alerts are generated for any component monitored by this connector during this period.

    This setting needs to be applied for remote monitoring as well.

    The Patrol Agent Configuration variable for this is (in the following example, we have set the connectors n-times to 5):

  • /SENTRY/HARDWARE/localhost/parametersMaxOCC = 5;1;1
  • /SENTRY/HARDWARE/‹remotehostname›/parametersMaxOCC = 5;1;1
  • parametersMaxOCC: n-times values for connectors, numeric parameters and discrete parameters, in that order. The list is separated by semicolons.
    Default: 1;1;1 (Trigger an alert on all parameters as soon as the thresholds are reached)
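
As a rough aid for choosing the connector n-times value, the calculation described above (average failure length in collects plus a small margin) can be sketched as below. The function names, the default margin of two collects and the idea of feeding in failure lengths read manually from the parameter history are assumptions made for illustration; the script only prints a line in the form shown in this article and does not talk to the Patrol Agent.

```python
import math


def estimate_n_times(failure_lengths_in_collects, margin=2):
    """Suggest a connector n-times value (illustrative heuristic).

    failure_lengths_in_collects: how many consecutive collects each past
    connector failure lasted, as read from the parameter history.
    margin: extra collects added as an error margin (assumed default: 2).
    """
    if not failure_lengths_in_collects:
        return 1  # no failure history: keep the default
    average = sum(failure_lengths_in_collects) / len(failure_lengths_in_collects)
    return math.ceil(average) + margin


def max_occ_line(host, connectors_n, numeric_n=1, discrete_n=1):
    """Render the variable in the form shown in this article."""
    return (f"/SENTRY/HARDWARE/{host}/parametersMaxOCC = "
            f"{connectors_n};{numeric_n};{discrete_n}")


# Past connector failures lasted 2, 3 and 4 collects (hypothetical history).
n = estimate_n_times([2, 3, 4])
print(max_occ_line("localhost", n))
```

With the hypothetical history above, the average is 3 collects and the suggested value is 5, which reproduces the localhost line in the example. A larger margin reduces false alerts further but delays genuine connector alerts by the same number of collects.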