Blog

Check here for frequent updates with news and tips from the SQL Server Performance Pros at SQL Tuners.


All posts

Measuring Disk IO performance for SQL Servers

Let me preface this blog by saying that this is a huge topic. I remind clients that SQL Server tuning is more of an art than a science. While technically that is somewhat inaccurate, it does convey a valid point that it really takes a good inner-sense to quickly get to the heart of performance issues. In fact, I would even say that this skill (or artistic inclination) is valuable in so many areas of IT and not just SQL Server support.


And this is especially true when working with the underlying disk system. Our clients have a wide range of disk configurations we have to deal with. Each company has their own standards they follow and each has their own varying philosophy on serving storage to the production servers. We work with direct attached, PCIe, iSCSI and Fiber channel to a wide range of SAN equipment. Add on top of that the various disk types (SCSI, SSD, SATA) as well as RAID applications and you have quite a complex set of information to work with in rooting out performance issues. So this topic can be very large when encompassing all of those varying configurations.


BUT - there is good news. The bottom line is that none of that matters if you think about it from the HOST perspective. SQL Server doesn't care too much about the physical design of the storage - it pretty much operates the same regardless of the nuances underneath the covers. And from that perspective, it is quite easy to determine whether your SQL Server is experiencing disk IO bottlenecks. It is at that high level to which this blog is dedicated.


There are a two main ways to obtain information regarding the disk IO performance; 1) SQL Server internal statistics and 2) Windows Server perfmon counters. We'll talk about both of those in this blog.


First,
SQL Server information:


SQL Server stores great information for determining the core bottlenecks it is experiencing on a regular basis.The "bottlenecks" are typically referred to as waittypes and that refers directly to the resource as well as the amount of time spent waiting for that resource. For example, it records information regarding disk resource, CPU, memory and even things such as lock resources (i.e. blocking). This detailed information includes the amount of time it is waiting on said resource as well as the number of times it has done so. By querying directly those internal statistics, we can a get a great immediate understanding of the type of resources that are causing the largest impacts to performance.


The following is a standard query that will tell us high level information:


SELECT
    *
   ,wait_time_ms/waiting_tasks_count AS 'Avg Wait in ms'
FROM
   sys.dm_os_wait_stats 
WHERE
   waiting_tasks_count > 0
ORDER BY
   wait_time_ms DESC


For purposes of this blog, we will only refer to any resource type which contains *IO* as part of the name in the wait_type column. This would be items such as PAGEIOLATCH_SH. (For a list of the common types seen, refer to http://msdn.microsoft.com/en-us/library/ms179984.aspx). These waittypes refer to physical disk IO and is why we want to look at those specifically. (Note: The above query can produce a lot of information which can be very helpful in troubleshooting system wide performance issues. Not all of them are disk IO related. In fact, the majority of the items listed are not related to disk performance. For this bog, however, we are only focusing on the disk IO waittypes.)


So what are we looking for in the output of that query? As mentioned above, for disk IO specific investigations, we want to find waittypes that include *IO* within the name. Once we find those in the output, we really care about the [Avg Wait in ms] value. As a rule of thumb, this value should not be higher than 20 ms. Values higher than this are typically indicative of disk IO bottlenecks.
An example output from the above query:


From the above output, we can see that the listed *IO* waittypes from this server have average wait times of 8ms. and 5ms. respectively. Again, it is that 20ms. threshold that we look for as indicators of disk bottlenecks. This server appears to have great disk IO latency. Any performance issues on this server, would not generally be disk IO related.
But to verify our initial assesment, there are some counters outside SQL Server that we also want to explore.


Windows Server Perfmon counters:


(For this blog, we will be talking about Windows Server 2008 R2 specifically - although these counters exist in Windows Server 2003 as well and are just as valid there. The perfmon interface is slightly different between the two OS'.)


From Start->Run, type in "perfmon". Once perfmon is running, you will want to click on the "Performance Monitor" on the left hand side menu under "Monitoring Tools". Once there, you should see a graph beginning to form on the right pane window.
Using the big green + sign at the top, click "Add".






You will need to scroll to the "Physical Disk" set of counters and under there add the following counters for the drive you wish to monitor. (This should be the drive which contains the SQL Server *.mdf and *.ndf files for your databases. If they are on a shared drive with other applications, then the results of this monitoring will not reflect just the SQL Server IO. This is a good argument for creating disk partitions that only house SQL Server database files. I'll post our general partition standards in a later post.)






Then hit "OK" at the bottom.
Let's talk about these four counters in detail.


Avg. Disk sec/Transfer
This counter measures the average latency of each IO request to that disk. Again, this should be less than 20 ms. Note that the value returned by this perfmon counter is in seconds. So, you will want to see the value reported in perfmon as less than .020. This is the counter that measures the disk IO latency. Latency being the time from which the IO is requested to the time the requested data is returned to the requesting thread.


Avg. Disk Bytes/Transfer
This counter tells us the average size of the IO requests. This actually is an important counter because it tells us not only a great deal on how the drives are performing but it actually tells us a great deal on the index health and overall indexing strategy effectiveness within SQL Server. In general, SQL Server seeks are 8k in size (the size of a page). SQL Server scans, however, can be much larger as the Read Ahead manager within SQL Server can issue "pre-grabs" of data at 64k, 128k or even 256k in size. SQL Server does a great deal of optimization of disk IO requests in order to reduce disk head seek latencies, which in turn reduce overall IO latencies.


Current Disk Queue Length
This counter will tell us how far behind the disk currently is running. i.e. how many outstanding IO requests are queued up and awaiting service from the disk. A good rule of thumb here is that you don't want to see any more than 2 outstanding requests per spindle. (Note that disk partitions can be physically made up of more than one spindle. So we aren't talking about 2 requests per drive but rather spindles. If there are are 4 underlying spindles which comprise the logical drive G$, then we should see no more than 8 outstanding requests.)


With that being stated, we don't like to see any more than 1 outstanding IO request ever. But we are perfectionists.



Disk Transfers/sec
This counter tells us how many IO requests are being sent to that disk per second. This gives us an idea of the overall performance capacity of the drive. For example, in general, SCSI drives can deliver ~180 IOPS (specifically, reads of 4k or less and semi-random) at under 20 ms. This is per disk. So if we have 10 of those disks which comprise the logical drive, I would expect that logical disk to deliver something on the order of 1800 IOPS (or Disk Transfers/sec) without a significant Disk Queue Length and without those IOs taking longer than 20 ms. on average.


Keep in mind that each RAID configuration will affect this expectation; as well as whether those IOs are writes or reads. In addition, the size of each IO will also affect this expectation. i.e. the larger the Avg. Disk Bytes/Transfer, the lower this expectation becomes. (The disk counters available in perfmon also include Disk Write Bytes/sec and Disk Read Bytes/sec. This is critical for in-depth root cause analysis to understand the read:write ratio as writes are slower than reads in general and can help qualify these expectations.)
So, again - "What are we looking for here?" Well, number one, we are looking for those IO latencies (Avg. Disk sec/Transfer) to be less than .020 and the Current Disk Queue Length to be less than <spindles>*2. If either of those counters are reading high, then there is a significant chance that your SQL Server is suffering from disk subsystem bottlenecks.


Another valuable counter not listed in our example above is Disk Bytes/sec. This is a bit more complex because it is affected greatly by so many factors such as RAID configuration, Read vs. Write, size of IO and whether it is a sequential or random IO pattern. (See? It really is more of an art!)


The good news is that there is a counter for each one of those data points. The bad news is that it takes all of them together to truly evaluate how well the disk is performing. And even more importantly, the more detailed question of what exactly is not performing and why.


But - the four we mentioned above as well as the SQL Server query can at least answer the high-level question, "Does my SQL Server experience disk IO bottlenecks?" If the answer is "Yes.", then you should reach out and chat with us asap. :-)

 
 

Contact Us Today!

Speak with a representative now regarding your unique SQL Server needs:

1-800-2-MY-DBAS

email: support@sqltuners.net

web: contact form

"When you need someone who understands SQL Server performance from the database to the disks, this is the company you call. They'll keep your database running and SAN vendor honest."

- Bill Minton, Sr Director of Corporate Information Systems, ANPI