RJ's SQL Server and MySQL Notes

Notes on SQL Server and MySQL

  • Advertisements

The difference between monitoring and diagnosis

Posted by rjssqlservernotes on November 21, 2010

Recently, I was running a PerfMon Data Collector on a Production instance that included the “typical” set of coarse-grain counters.  One of the counters Average Physical Disk Queue Length showed that there was a potential problem with IO to the disk subsystem.  The lead DBA responded that Average Physical Disk Queue Length was a “meaningless” statistic when measured in a SAN environment.  I attempted to reason with the lead who sent me an article downloaded from a WWW site as well as a technobabble email from a system engineer who also argued that a SAN-based SQL Server environment renders the Average Physical Disk Queue Length “meaningless”.  Both argued, unknowingly, that the metric did not provide “specific” information about disk latency.  I say unknowingly because neither bothered to read their own references which clearly stated that the average Physical Disk Queue Length is a generic measure of IO latency at the OS level that can not provide details about the specific disk or disks causing the issue; however, the metric does provide an early warning that suggests further investigation.

In the end after several email exchanges  the system engineer clarified that the Average Physical Disk Queue Length measures that I had collected were possibly effected by a disk failure in the SAN that he was already aware of.

The point here is that generic monitoring of a production system is best performed with high-level counters so as not to degrade performance; however, when a potential issue is identified then the more detailed counters should be used for further diagnosis.


Sorry, the comment form is closed at this time.

%d bloggers like this: