Hardware Monitoring: Errors Triggering Alerts on the ErrorCount Parameter for Physical Disks on Sun Solaris Systems

On Sun Solaris systems, the ErrorCount parameter shows the number of errors that occurred on each physical disk since the last counter reset (either a reboot or a manual acknowledge and reset of the parameter). This article details which conditions are taken into account by the ErrorCount parameter and trigger an alarm.

Related Topics

Objective

On Sun Solaris systems, the ErrorCount parameter shows the number of errors that occurred on each physical disk since the last counter reset (either a reboot or a manual acknowledge and reset of the parameter). This article details which conditions are taken into account by the ErrorCount parameter and trigger an alarm.

Procedure

On Sun Solaris systems, the health of the physical disks is assessed by using two commands:

  1. “dd” to verify that the disk is accessible for read operations (this will be represented with the Status parameter)
  2. “iostat -E” to report the number of errors that occurred on the disk (this will be represented with the ErrorCount parameter)

The ErrorCount parameter represents the total number of errors since the last counter reset (i.e. a reboot, a PATROL Agent restart or RSM restart, a manual acknowledge and reset of the parameter through the KM Command). As new errors occur, the counter is increased one by one over time. The default alert threshold triggers an alarm as soon as an error is reported by “iostat”. The ErrorCount parameter then stays in the ALARM state until it is cleared by the operator. This also means that no new ALARM event is generated for additional errors that occur on the disk until the ErrorCount parameter is reset to zero.

inline
Hardware Sentry KM monitoring the physical disks of a Sun Blade T6300 machine under Solaris 10.

The “iostat” command actually returns a series of error counters for various types of events. The table below summarizes which types of errors are taken into account by Hardware Sentry KM and BMC Performance Manager Express for Hardware in the ErrorCount parameter.

“iostat” Error Description Triggers an ALARM on ErrorCount Comments
Device Not Ready The drive returned the sense key 0x2 (Not ready). Yes  
Media Error The drive returned the sense key 0x3 (Medium Error). Yes  
No Device The drive returned the sense key 0x6 (Unit Attention) or in the case of a removable device it must have happened multiple times. Yes CD and DVD drives are excluded from the monitoring
Hard Errors All the above conditions are counted as Hard Errors with the addition of the SCSI sense key 0x4 (Hardware Error). Yes  
Illegal Request The drive returned the sense key 0x5 (Illegal Request). This is treated as a Soft Error and kstat is also incremented. No  
Recoverable The drive returned the sense key 0x1 (Recovered Error) to indicate that the last command completed successfully but some recovery action had to be taken by the drive. This is treated as a Soft Error and kstat is also incremented. Yes While recovered, an actual error occurred and needs therefore to be reported
Predictive Failure Analysis The drive returned sense key 0x6 (Unit Attention) with and ASC (Additional Sense Code) of 0x5D indicating that the drive has exceeded its predictive failure threshold. This is treated as a Soft Error. Yes Because of the nature of the counter, it has been preferred to report the problem with the ErrorCount parameter rather than PredictedFailure (which triggers a WARNING) to allow the operator to acknowledge and reset the counter.
Transport Error This error occurs for a number of reasons all related to being unable to transport the command. The command could have been timed out or reset or the host bus adapter unable to put the command onto the SCSI bus. This is neither a Soft nor a Hard Error. Yes  
Soft Errors Illegal Requests, Recoverable errors and Predictive Failure Analysis are all summed up in the “Soft Errors” group, including various driver-related problems. No