How to Diagnose Blue Screens (BSOD)
Crash dump files
A crash dump is essentially a snapshot of what was held in memory at the point in time when Windows crashed, so it contains a lot of useful information that can be used to identify what caused the crash.
So the first thing you need is the crash dump file for the BSOD you are troubleshooting. If you have mini dumps enabled, it will be here: C:\Windows\Minidump\<date of crash>.dmp
If you have full dumps enabled then the latest crash dump file will be here: C:\Windows\MEMORY.DMP
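If you would rather not hunt through those folders by hand, the lookup can be sketched in a few lines of Python. This is only an illustration: the function name find_latest_dump is made up for this example, and it assumes the default locations mentioned above (a Minidump folder and a MEMORY.DMP file under the Windows directory).

```python
from pathlib import Path
from typing import Optional


def find_latest_dump(windows_dir: Path) -> Optional[Path]:
    """Return the most recently modified crash dump under windows_dir, or None.

    Assumes the default layout: <windows_dir>/Minidump/*.dmp for mini dumps
    and <windows_dir>/MEMORY.DMP for a full dump.
    """
    candidates = []

    minidump_dir = windows_dir / "Minidump"
    if minidump_dir.is_dir():
        # Mini dumps: one .dmp file per crash.
        candidates.extend(
            p for p in minidump_dir.iterdir() if p.suffix.lower() == ".dmp"
        )

    full_dump = windows_dir / "MEMORY.DMP"
    if full_dump.is_file():
        # Full dump: a single file that is overwritten on each crash.
        candidates.append(full_dump)

    if not candidates:
        return None
    # Pick whichever dump was written most recently.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

On the affected machine you would call it as find_latest_dump(Path(r"C:\Windows")). Reading those folders may require an elevated prompt.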
The right tool for the job
So now that we’ve got our crash dump file, we need a tool that can read it. One such program is WinDbg, a free tool from Microsoft. Annoyingly, you have to download the entire Windows SDK package and then, during install, select a custom install and tick just the Windows Debugging Tools and nothing else if that’s all you want (which it probably will be if you’re not a programmer).
The first time you use WinDbg, you’ll need to go to File > Symbol File Path, and enter this in the box that appears:
SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols
You can replace C:\Symbols with whatever folder you want it to download symbols to. Note that the * characters are required and that there shouldn’t be any spaces between them and the other parts of the text. Symbols are basically files that let you see the correct names of functions within DLLs. When you try to debug a crash dump, WinDbg will download the symbols for any DLLs that were loaded at the time of the BSOD. But as this is a Microsoft symbols server, it generally only works for DLLs that are part of Windows.
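Because a stray space or a missing * is easy to introduce when typing that string, here is a small sketch that assembles it from its parts. The helper name build_symbol_path is hypothetical. The only things taken from the text above are the SRV*folder*server layout and the Microsoft symbol server URL.

```python
def build_symbol_path(
    cache_dir: str,
    server: str = "http://msdl.microsoft.com/download/symbols",
) -> str:
    """Build a symbol path of the form SRV*<cache folder>*<symbol server>."""
    # Remove accidental spaces next to the * separators.
    cache_dir = cache_dir.strip()
    server = server.strip()

    if not cache_dir or not server:
        raise ValueError("both a cache folder and a server URL are required")
    if "*" in cache_dir or "*" in server:
        # A * inside either part would be read as an extra separator.
        raise ValueError("neither part may contain a * character")

    return f"SRV*{cache_dir}*{server}"
```

For example, build_symbol_path(r"D:\DebugSymbols") gives a string you can paste into the Symbol File Path box, with D:\DebugSymbols standing in for whatever folder you prefer.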
Finding out what caused a BSOD
So now we have the tools and information ready, all we need to do is open the crash dump file in WinDbg and get it to analyse it. We do this by going to File > Open Crash Dump, and browsing to the .dmp file we got from the machine that blue screened.
Now you’ll see some details about the machine that crashed appear on screen, and then not a lot will happen. Basically, you just need to wait until you’re able to type into the text box at the very bottom of the white window with all the details in it. Until then, the box will say “Debuggee not connected.” Once you’re able to type into it, enter this:
!analyze -v
This command tells WinDbg to perform an analysis of the crash dump and provide verbose details of the results. Luckily for us, those details often tell us exactly what caused the BSOD. There are four main things I look at:
- The actual blue screen error code and description. You may have to scroll up a bit after running !analyze -v to see this, but it will be the first thing after the big *** BUGCHECK ANALYSIS *** title. In the case of the BSOD I was debugging, it was: “VIDEO_TDR_FAILURE (116) Attempt to reset the display driver and recover from timeout failed.” In this case, it is already looking likely that it’s a graphics driver issue, but most error codes and descriptions aren’t quite as helpful.
- The process name, listed in the WinDbg output as PROCESS_NAME. Don’t always jump on this as the cause though, as this can often just be the program that directly or indirectly makes use of some other lower-level component that then caused the crash. In my case, the process name showed the name of the program I developed (ADPhotoEdit.exe) but I knew AD Photo Edit didn’t directly do anything that could cause a BSOD, so it was likely something low level that is indirectly used by AD Photo Edit that was the cause (e.g., a Windows kernel component, a graphics driver, or the hard disk).
- The file name, shown as MODULE_NAME and IMAGE_NAME. This is usually more useful than the process name, as it shows the exact file that code was executing in at the time of the crash. In my case this was “igdkmd32.sys.” A quick search online reveals this is an Intel graphics driver and adding the words “blue screen” to the search found a lot of people having the same problem.
- The stack trace text (shown as STACK_TEXT). From the first and third points above it seems extremely likely that an Intel graphics driver caused the BSOD under investigation. Hopefully there’s an updated version of the driver that fixes the issue, but for those scenarios where you’re still not sure of the culprit or want further confirmation, you can see if the stack trace tells you anything more. This is essentially a list of the functions that were called leading up to the crash, most recent first, so the top entry will usually be “nt!KeBugCheckEx,” as that’s the function that actually triggers the BSOD. The part before the exclamation mark is the module name, which is the file name without its extension (nt is the Windows kernel image, usually ntoskrnl.exe, rather than a file literally called nt), and the part after it is the name of the function within that module that was being called. In my example, I could see lots of functions from dxgkrnl being called. A quick search for that file name shows it’s the DirectX Graphics Kernel library, which again supports the theory that the crash was caused by something to do with the graphics driver.
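If you save the !analyze -v output to a text file, the four items above can be pulled out with a short script. The following is a rough sketch rather than a robust parser. It assumes the output looks the way it is described above: an error line such as VIDEO_TDR_FAILURE (116), "NAME: value" lines for PROCESS_NAME, MODULE_NAME and IMAGE_NAME, and a STACK_TEXT block of module!function frames that ends at a blank line. The function name summarize_analysis is made up for this example.

```python
import re

# e.g. "VIDEO_TDR_FAILURE (116)"
BUGCHECK_RE = re.compile(r"^([A-Z][A-Z0-9_]+) \(([0-9a-fA-F]+)\)\s*$")
# e.g. "IMAGE_NAME:  igdkmd32.sys"
FIELD_RE = re.compile(r"^(PROCESS_NAME|MODULE_NAME|IMAGE_NAME):\s*(\S.*?)\s*$")
# e.g. "... nt!KeBugCheckEx+0x1e" (captures the module name before the "!")
FRAME_RE = re.compile(r"(\w+)!\S+")


def summarize_analysis(text):
    """Extract the bugcheck, key fields and stack modules from analysis text."""
    summary = {"bugcheck": None, "fields": {}, "stack_modules": []}
    in_stack = False

    for line in text.splitlines():
        if in_stack:
            if not line.strip():
                in_stack = False  # a blank line ends the STACK_TEXT block
                continue
            frame = FRAME_RE.search(line)
            # Record each module once, in the order it first appears.
            if frame and frame.group(1) not in summary["stack_modules"]:
                summary["stack_modules"].append(frame.group(1))
            continue

        if line.startswith("STACK_TEXT:"):
            in_stack = True
            continue

        if summary["bugcheck"] is None:
            bugcheck = BUGCHECK_RE.match(line)
            if bugcheck:
                # Keep only the first match: (name, code as hex text).
                summary["bugcheck"] = (bugcheck.group(1), bugcheck.group(2))
                continue

        field = FIELD_RE.match(line)
        if field:
            summary["fields"][field.group(1)] = field.group(2)

    return summary
```

The stack_modules list is the useful part for the last bullet: a list that starts with nt and is followed mostly by one driver's module is a strong hint about where to look, though it is still only a hint.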
One thing to note is that not ALL crash dumps are this straightforward to diagnose. Sometimes a BSOD is caused by corruption that happened long before Windows actually discovered it and halted the system, so the crash dump doesn’t really help in that scenario (but thankfully those cases are fairly rare in my experience). Also, remember that it is only drivers and low-level Windows kernel components that can directly cause a BSOD.
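When there are several dumps to get through, the same !analyze -v step can be run without the GUI using cdb.exe, the console debugger that ships with the same Debugging Tools package. The sketch below only builds the command line. Where cdb.exe is installed varies by SDK version, so the path is passed in instead of guessed, and build_cdb_command is a hypothetical helper name.

```python
from typing import List


def build_cdb_command(cdb_exe: str, dump_file: str, symbol_path: str) -> List[str]:
    """Build a cdb command line that analyses one crash dump and then exits."""
    return [
        cdb_exe,
        "-z", dump_file,         # open a crash dump instead of a live target
        "-y", symbol_path,       # symbol path, same format as in WinDbg
        "-c", "!analyze -v; q",  # run the analysis, then quit the debugger
    ]
```

On a Windows machine the list can be handed to subprocess.run(cmd, capture_output=True, text=True) and the analysis read from the result's stdout attribute.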
So there we go. With just a few seconds of work I’ve found the real culprit, cleared my name, and hopefully helped out the person who reported the problem (as now they know to just look for updated graphics drivers).
During my time as an IT admin I certainly found it useful knowing how to do this. For example, when all of our terminal servers started throwing BSODs whenever someone logged off, a quick bit of WinDbg inspection led me to a specific Windows system file and I was able to download a hotfix for it that instantly resolved the problem. Hopefully this guide will help you diagnose problematic blue screens on the desktops and servers in your networks as well. Also, I’ve put up a video walkthrough of this process here.
(this article was originally posted on SpiceWorks and reposted here so I can find it when I need it)
Don’t you think you should at least give credit to the website you copied and pasted this from? Looks awfully similar to my article here: http://community.spiceworks.com/topic/356288-blue-screens-of-death-finding-the-culprit
Nevermind, I just saw the note at the very bottom of the article that mentions Spiceworks. My bad 🙂