Blog

Check here for frequent updates with news and tips from the SQL Server Performance Pros at SQL Tuners.



SET-based vs. iterative processing

Iterative processing is surprisingly common in SQL Server, even though it is not what one would expect from a SET-based storage engine architecture. Yet I find myself looking at iterative design structures quite often when asked why a particular stored procedure is not performing as expected. So I thought I would delve into this a bit at a high level.

 

First - what is the difference between SET-based and iterative processing? Fundamentally, iterative refers to a processing methodology that focuses on one record at a time, whereas SET-based refers to working with an entire record set at once. That's the generally accepted distinction. Now one could argue, and even I sometimes do, that ALL processing is, in the very granular sense, iterative in nature, as CPUs perform operations one at a time (quantum computing isn't quite here yet). But let's stick with the high-level understanding for this discussion.
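To make the distinction concrete, here is a quick sketch - the Orders table, its columns, and the status values are all hypothetical, but the shape of the two approaches is the point:

```sql
-- Iterative: process one row at a time with a WHILE loop
-- (Orders is a hypothetical table)
DECLARE @id INT;
WHILE EXISTS (SELECT 1 FROM Orders WHERE status = 'packed')
BEGIN
    SELECT TOP (1) @id = order_id FROM Orders WHERE status = 'packed';
    UPDATE Orders SET status = 'shipped' WHERE order_id = @id;
END

-- SET-based: one statement operates on the entire record set
UPDATE Orders SET status = 'shipped' WHERE status = 'packed';
```

Both produce the same end state, but the first issues one statement (and one plan execution) per row, while the second lets the optimizer work on the whole set at once.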

 

Examples of iterative processing are CURSORs and WHILE loops, whereas straightforward JOINs are an example of SET-based processing. Iterative processing is quite common outside of a relational database system; think arrays and loops. Non-relational software systems typically have no SET-based functionality - hence the widespread integration of SQL Server, Oracle, DB2, etc. And I think that this is where the problem comes in.
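For those who haven't seen one, here is what the CURSOR flavor of iteration typically looks like, next to its SET-based equivalent - a sketch only, with hypothetical Customers/Orders tables and columns:

```sql
-- Iterative: a CURSOR walks the result set one row at a time
-- (Customers and Orders are hypothetical tables)
DECLARE @cust_id INT;
DECLARE cust_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT customer_id FROM Customers WHERE region = 'West';
OPEN cust_cur;
FETCH NEXT FROM cust_cur INTO @cust_id;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE Orders SET priority = 'high' WHERE customer_id = @cust_id;
    FETCH NEXT FROM cust_cur INTO @cust_id;
END
CLOSE cust_cur;
DEALLOCATE cust_cur;

-- SET-based: the same work as a single JOINed statement
UPDATE o SET priority = 'high'
FROM Orders o
JOIN Customers c ON c.customer_id = o.customer_id
WHERE c.region = 'West';
```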

 

Back in college, I studied computer programming in languages such as PASCAL and BASIC (yes, that gives away my age somewhat). While at Microsoft, I took advantage of the "better yourself" environment that was once prevalent there and took courses on Visual Basic, C++, ASP, scripting, etc. The common thread in all of these languages is an approach to data processing from an iterative perspective: grab input (whether user-, flat-file-, or array-supplied), perform calculations upon that input, produce output, grab the next input. Works great. And from a programming and system design perspective, I see a lot of developers well versed in .NET, C++, ASP.NET, etc. coming from that mindset. But these days, most developers find themselves, at some point in their careers, having to interact with an RDBMS such as SQL Server.

 

Tasked with engineering a system, developers rely on their known tool belt, which includes, not surprisingly, iterative logic. I find that they often write their stored procedures in an iterative manner, believing this will work equally well inside SQL Server as it does outside.



Unfortunately it doesn't. And it isn't supposed to either.

 

Dealing with large amounts of data requires a higher level of performance, manageability, and accessibility than the once-accepted mainframe style of pushing input through a small pipe one record at a time. This is what Dr. Codd was aiming to create at IBM: a SET-based way of processing data while also maintaining data integrity and reducing redundancy. SET-based processing is far superior to the iterative, non-relational processing that had come before. And these days, SQL Server does an amazing job of processing vast amounts of data in a relational manner. So why push it to use iterative processing instead?

 

A recent example of this came in the form of a client application trying desperately to delete records out of a multi-terabyte table. I was asked to intervene and find out why such a simple task was requiring hours of processing. Here is what I found:

 

The developer had created an application which built the DELETE statement dynamically. An array was used to collect input identifying the group IDs of records that needed to be deleted from the table. The statement that was built and issued to SQL Server went something like this:

 

DELETE FROM <target_table>
WHERE <colA> IN ('123123', '1236456', '9877238', '0987724389', '8792347', ... etc.)



Very straightforward. Except that there must have been a hundred or so literals within the IN() clause. So how did SQL Server deal with this? On the surface, this looks like a simple DML statement. But looking closer under the covers revealed something else:



[Screenshot: execution plan for the original IN() DELETE]
Notice that SQL Server took the list of literals and created an internal "table" of those values; from there, it appears to be treated much like any other table, in a plan involving a NESTED LOOPS operator. But there is something missing here.

 

To illustrate one of the more common optimizations available when working from a SET-based perspective, I took the list of literals contained within the original IN() clause and instead placed them in a newly created table with a simple clustered index on it. I then rewrote the DELETE statement to JOIN to that table instead.
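The staging table setup was trivial. A sketch of it, using the inv_number name from the query below - the column type and index name here are assumptions:

```sql
-- Staging table holding the literals from the original IN() list
CREATE TABLE inv_number (
    invoice_number VARCHAR(20) NOT NULL
);
CREATE CLUSTERED INDEX cix_inv_number ON inv_number (invoice_number);

INSERT INTO inv_number (invoice_number)
VALUES ('123123'), ('1236456'), ('9877238'), ('0987724389'), ('8792347');
```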



DELETE
   h
FROM
   inv_number a
JOIN <target_table> h
   ON (a.invoice_number = h.invoice_number)





Here is how the query plan changed:



[Screenshot: execution plan for the rewritten JOIN DELETE]
Notice there are a couple of changes to the plan. First, there is an actual table involved as the outer record set. Second, there is an added SORT operator before the delete against the clustered index commences. The SORT operator optimizes the deletes by sorting the input record set into the logical order of the clustered index on the target table. In this case, this was incredibly important, as it meant the physical IO to disk followed the logical order of the clustered index. As long as the target table was defragmented, the physical heads of the drives wrote to disk in a forward fashion, as opposed to the scattered fashion produced by the original query's plan. This optimized disk head movement and reduced the cumulative effect of the seek times inherent in the spindle disks upon which the database sat.

 

Now, this particular case is a simple one and, to be honest, only speaks to the physical performance improvements the optimizer can make when it is dealing with a SET-based query. Unfortunately, we don't have access to the internal algorithms that differed between the two queries, and I am sure the graphical interface does not show us the finer differences at that level. However, the general iterative vs. SET-based contrast can be seen here.

 

In this case, the developer had at least not chosen the even more iterative approach I often see: placing the literals in an array outside SQL Server and, one by one, calling a DELETE statement with a single-argument WHERE clause. And yes, I see this a lot as well.
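For clarity, that anti-pattern reduces to this shape (shown here as the stream of statements SQL Server receives; in practice the loop usually lives in the application layer):

```sql
-- Fully iterative anti-pattern: one DELETE per key value
DELETE FROM <target_table> WHERE <colA> = '123123';
DELETE FROM <target_table> WHERE <colA> = '1236456';
DELETE FROM <target_table> WHERE <colA> = '9877238';
-- ... one statement, one plan execution, and one round trip per value
```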

 

Unfortunately, other aspects of the application beg to be optimized as well: its architecture uses an iterative approach, reading input from flat files one record at a time and processing each record outside SQL Server, as opposed to loading them all into SQL Server and performing SET-based processing on the entire record set at once.
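The load-it-all-first alternative can be as simple as a single bulk load into a staging table - a sketch, where the file path, staging table, and format options are all assumptions for illustration:

```sql
-- SET-based alternative: load the whole flat file in one pass,
-- then process the full record set inside SQL Server
-- (file path, table name, and format options are hypothetical)
BULK INSERT staging_input
FROM 'C:\feeds\daily_input.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

-- ...followed by a single SET-based statement in place of the per-row loop
```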

 

Other examples of iterative processing include stored procedures which use CURSORs to process data in this fashion. In most performance cases, changing this to SET-based processing produces much higher performance. However, this is not always the case. There are times when CURSOR-based processing is beneficial - namely, when it is needed to increase concurrency on a high-volume OLTP system. While the overall processing time for the entire set will be much longer with this method, the primary concern is to reduce contention by processing smaller sets of records with each batch, thus keeping the tables available for other real-time processing.
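That batching idea can be sketched without a CURSOR at all, by deleting in capped chunks - the batch size here is a tuning assumption, and the table names reuse the placeholders from the example above:

```sql
-- Batched deletes: trades total runtime for concurrency by
-- keeping each transaction (and its locks) small
-- (5000 is an assumed batch size; tune for your workload)
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (5000) h
    FROM <target_table> h
    JOIN inv_number a
       ON (a.invoice_number = h.invoice_number);
    SET @rows = @@ROWCOUNT;
END
```

Each pass commits its own small transaction, so concurrent readers and writers get a chance to work between batches.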

 

In general, it can be difficult for non-SQL developers to get used to thinking in terms of SETs of data rather than a single data point. And luckily, SQL Server does a decent job with iterative processing requests - so in that sense, it is quite forgiving. But to reach the top tiers of processing performance, SET-based is the way to go.



 
 
