During an interview, the following question usually comes up, “How would you troubleshoot a long running job?” and most of the time the answers goes into details such as:
1. open up perfmon and identify the bottlenecks
2. use sql profiler or server side trace to identify the offending queries
3. obtain an execution plan and determine if any index are missing or one need to be created
4. create the needed index and test to see if it helps.
Those are some of the acceptable answers; however, there are times when none of those are needed, although it would help greatly. Last week, I was tasked at trying to reduce a log running job from 5hrs to 2.5hrs or less. As any DBA not knowing the system would do, I was able to identify the long running query and got an estimated execution plan. It wasn’t a long runner query per-say but a long running INSERT using a SELECT * FROM vw_someView statement. INSERTS as noted in a post of mine states that it does not make use of parallelism and is single threaded, unless you are using INSERT INTO … SELECT * FROM TABLE/VIEW.
In this case, the SELECT statement didn’t show any parallelism in the execution plan, so I know I don’t have to bother with any maxdop setting in sp_configure. I did find out that they are using a mixture of uniqueidentifier and uniqueidentifier + nvarchar(4000) as their primary key/clustered index. We know this is stupid because the index is not as narrow as it should be. On top of that, there would be a lot of page splitting and fragmentation on the uniqueidentifier column. In addition, cause a lot of random I/O and stresses the disk more than it needs to. The uniqueidentifier wouldn’t be so bad had they partition that column out using Table Partitioning.
My goal was not to suggest a change in the database structure or add any new technology into the existing system but to advise on how to make the job run faster. Therefore, I found and advised a list of things they can do to make the job run faster. I was not given access to a physical server with similar spec (the have such an env) but was given a VM with 4GB of memory. This VM has the offending job and the databases. Prior run took 11hrs.
The existing jobs basically called a store procedure that makes a bunch of inserts using the INSERT INTO Table SELECT * FROM vw_someView. So I asked the question, are there any dependency on any of these inserts? The answer was no. So my #1 suggestion was to break down the job and run it parallel. After identifying which tables were taking the longest time, I created 6 new jobs.
1. Load Data
2. Inserts Table1
3. Inserts Table2
4. Inserts Table3
5. Inserts Table4
6. Call End Process
Load Data will call job 2-5 via exec msdb..sp_start_job @job_name = ‘Inserts Table1′ and so on. So now all 4 jobs are running in parallel. After each of the Inserts job are done, it will check to see if any of the other Inserts job are still running. If yes, then it will not start the job ‘Call End Process.’ To determine if the jobs are still running or now, I used a bit of Euclid reverse method.
After my job ran in this VM environment, I was able to reduce the job by a good 4 hrs, from 11hrs to 7hrs. It is definitely not a 50% reduction but I can only do so much on a VM with very little CPU and Memory.
So the next time you are at a job interview, instead of giving answer 1-4, maybe you can use this as a 5th and land that job
.