Anatomy of a Lock-Up

Note: I'd rather have linked to the original, but I copied this from a newsgroup posting a couple of years ago, and failed to get the author's handle.

1)  Processor goes into a loop, and never gets out.  For example, the code may test an increasing register for equality rather than equal-or-over, and the increment steps over the test value (another variety of the "fencepost error").

2)  Processor waits for external hardware event that never happens, and the programmer hasn't anticipated this by including a time-out counter in the polling loop.

3)  Processor disables interrupts, gets lost in a loop, and never enables them again.

4)  One process waits for another process to complete, but that process is waiting for the first process to complete ("deadly embrace", can apply to record locking in multi-user/multitasking).

5)  Hardware other than the processor storms the bus and locks up the system e.g. DMA errors.

6)  Something else stops the clock, e.g. an inappropriate suspend mode during the shutdown process, so that processing can never complete.

7)  Processor enters an "undefined state", and never comes out.

8)  Interrupt flooding, where hardware generates interrupts faster than the system can process them, and the foreground task is never returned to.  Looks like a hard lock (Ctrl-Alt-Del is ignored not because the CPU is too lost to respond to the keyboard interrupt, but because it is too busy).  Shows up as slowdown on benchmarks, maybe missed characters when typing at the DOS prompt, and sometimes a "Stack overflow" halt if you wait long enough.

9)  Program is waiting for user input, but user is unaware of this because the dialog is not "on top", or the DOS text screen output is redirected or masked (that's why the Win95 boot logo goes away while Config.sys and AutoExec.bat are processed).

10)  Input is ignored because you are typing on the wrong keyboard, or you are watching the wrong monitor (happens in a busy shop... horrible feeling when you realize you have reset the file server while it was defragging, and not the games machine you were playing on), the keyboard isn't plugged in, the AT/XT switch is wrong, the keyboard is knackered etc.
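
Items 1 and 2 above can be sketched in a few lines.  This is only an illustration, not anything from the original posting: the function names, the step and limit values, and the time-out figures are all made up for the example.  The first function shows how an equality test gets stepped over while an equal-or-over test does not; the second is a polling loop with the time-out that item 2 says is so often missing.

```python
import time


def steps_to_reach(limit, step, exact_match, max_steps=1000):
    """Count up from 0 by `step` until the exit test passes.

    Returns the number of increments taken, or None if the test never
    passed within max_steps (a real loop without the guard would hang).
    """
    counter = 0
    for steps in range(max_steps):
        # exact_match=True is the fragile "== limit" test;
        # exact_match=False is the robust "equal-or-over" test.
        done = (counter == limit) if exact_match else (counter >= limit)
        if done:
            return steps
        counter += step
    return None


def wait_for_ready(is_ready, timeout_s=0.05, poll_interval_s=0.001):
    """Poll is_ready() until it returns True or timeout_s has elapsed.

    Returns True if the event happened, False if we gave up waiting.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_interval_s)
    return False
```

Counting 0, 3, 6, 9, 12 never lands on 10, so `steps_to_reach(10, 3, exact_match=True)` runs out its guard and returns None, while the equal-or-over version stops after four increments.  `wait_for_ready(lambda: False)` gives up and returns False instead of waiting forever.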

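The "deadly embrace" in item 4 is usually avoided by making every process take its locks in the same order.  Here is a rough sketch of that idea using threads and two record locks; the account names, the `transfer` function and the round count are hypothetical, chosen only to make the example run.

```python
import threading


def transfer(balances, locks, src, dst, amount):
    """Move amount from src to dst, taking both record locks in a fixed order."""
    if src == dst:
        return  # nothing to move, and taking the same lock twice would hang
    # Sorting the names means both directions lock "a" before "b",
    # so neither thread can hold one lock while waiting for the other.
    first, second = sorted((src, dst))
    with locks[first], locks[second]:
        balances[src] -= amount
        balances[dst] += amount


def run_opposing_transfers(rounds=1000, join_timeout_s=5.0):
    """Run a->b and b->a transfers at the same time; report if both finished."""
    balances = {"a": 100, "b": 100}
    locks = {name: threading.Lock() for name in balances}

    def worker(src, dst):
        for _ in range(rounds):
            transfer(balances, locks, src, dst, 1)

    threads = [
        threading.Thread(target=worker, args=("a", "b"), daemon=True),
        threading.Thread(target=worker, args=("b", "a"), daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(join_timeout_s)  # time-out so the demo itself cannot lock up
    finished = not any(t.is_alive() for t in threads)
    return finished, balances
```

If `transfer` locked `src` first and `dst` second instead, the two workers could each grab one lock and wait forever for the other, which is exactly the lock-up described in item 4.
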
All of the above can cause (or, in the last three cases, appear to cause) lock-ups, as opposed to resets, blue screens or other error conditions.  Things that cause those include:

1)  Unbalancing of the stack, so that a data value is popped as a return address.

2)  Underrunning of the stack; popping where there is nothing to pop.

3)  Calling a process recursively, but never returning from that process, so that the stack overflows.

4)  Out-of-bounds indexing error, so that data is added outside of the structure allocated for it; sanity-checking of externally-acquired index values is good programming practice as a result.

5)  Use of a variable that has not been initialized; common as the "null pointer" problem in C/C++.

6)  Use of data that is inappropriate for the data type, e.g. assuming a byte containing 255 is +255 rather than -1 (its value as a signed two's-complement byte), or reading a 16-bit value from a location that actually holds an 8-bit value.

7)  Data insanity; missing the "stop here" limit when filling a memory structure, reading a data structure that should start with a "length of data" value but doesn't, reading data until a delimiting value that never occurs, using zero as a data value where this is insane etc.

8)  Memory corruption; where your code or data is splatted by another process, or another part of your own process.

9)  Wild jump tables; reading a "jump address" off the end of a lookup table (another kind of bounds error) and jumping into data; often gives a "Divide by zero" error.

10)  Generally, making an assumption that is unfounded.  Examples include: FAT is always 16-bit; Windows is always installed in C:\Windows; AutoExec.bat is always present; the boot drive can always be written to; there is always more virtual memory available; if a query for free disk space gets the answer "yes", the file write will succeed (not always true with disk compression); 100 times instruction A will take less time to execute than 25 times instruction B (breaks on different CPU designs e.g. 686); a chip that ID's itself as a Pentium will support.../behave... (why Cyrix discourages using the "enable CPU-ID" workaround so that software sees the 686 as P5).
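
To make item 6 concrete, here is a small sketch that decodes the same two bytes three different ways.  The byte values and the `interpret` helper are invented for the illustration; the little-endian byte order is an assumption that matches x86.  On a two's-complement machine a byte holding 0xFF reads as 255 unsigned but -1 signed, and a 16-bit read drags in whatever happens to sit in the next byte.

```python
import struct


def interpret(raw):
    """Decode the first bytes of raw three ways (little-endian, as on x86)."""
    return {
        "unsigned_byte": struct.unpack_from("<B", raw, 0)[0],
        "signed_byte": struct.unpack_from("<b", raw, 0)[0],
        # Reads two bytes: the intended one plus its unrelated neighbour.
        "unsigned_word": struct.unpack_from("<H", raw, 0)[0],
    }


views = interpret(bytes([0xFF, 0x12]))
# unsigned_byte is 255, signed_byte is -1, unsigned_word is 0x12FF (4863)
```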

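The sanity-checking that items 4, 7 and 9 call for might look something like the sketch below.  Both functions are hypothetical helpers, and the one-byte "length of data" prefix is an assumed record layout, not something the posting specifies.  The point is only that an index or a length that came from outside gets checked against the real size of the structure before it is used.

```python
def read_length_prefixed(buffer, offset=0):
    """Parse one record: a length byte followed by that many payload bytes.

    Returns (payload, offset_of_next_record).  Raises ValueError rather
    than reading past the end when the declared length is a lie.
    """
    if not 0 <= offset < len(buffer):
        raise ValueError("offset is outside the buffer")
    length = buffer[offset]
    end = offset + 1 + length
    if end > len(buffer):
        raise ValueError("declared length runs past the end of the buffer")
    return bytes(buffer[offset + 1:end]), end


def lookup_handler(table, index):
    """Fetch an entry from a jump table, refusing any out-of-range index.

    The explicit check matters in Python too: a negative index would
    otherwise quietly wrap around to the end of the table.
    """
    if not isinstance(index, int) or not 0 <= index < len(table):
        raise IndexError("index %r is outside the table" % (index,))
    return table[index]
```

For example, `read_length_prefixed(b"\x03abc\x02hi")` returns the payload `b"abc"` and the offset 4 of the next record, while `read_length_prefixed(b"\x05ab")` raises ValueError instead of running off the end.
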
You can get lock-ups of several depths:

1)  In some cases, the software is running fine but the screen isn't displayed for some reason (loose plug, VGA card crash, black-ink-on-black-screen, wrong area of display memory in use, hardware screen saver or suspend mode active).

2)  Some will respond to Ctrl-C, Esc, or some other magic key (or Alt-Tab, Ctrl-Esc in the GUI). 

3)  Some will respond to Ctrl-Alt-Del, i.e. the processor is capable of detecting and responding to that interrupt, and the code that processes it is OK.

4)  Some will cause keystrokes to beep after about 16 keystrokes.  In this case, the hardware keyboard interrupt is working: each keystroke causes the keyboard buffer to be checked for overflow, and the keystroke is added if there is room.  But the keystrokes are not being "removed" by the foreground application, so eventually the buffer fills up and the keyboard hardware interrupt routine beeps to tell you it's full.

5)  Some crashes are deep and silent, i.e. pressing over 16 keys does not cause a beep because the processor isn't moving, isn't responding to interrupts, or the interrupt service routines are snowed.

6)  Most crashes will reset on pressing the reset button, which asserts the Reset control line that is monitored by the CPU and is passed to all expansion slots.  Cards that respond to this line will reset their hardware, which is why the reset button "unblocks" lost UARTs and paralytic parallel ports, whereas Ctrl-Alt-Del doesn't.

7)  Locks that don't respond to the reset button may indicate a disconnected reset button, or a card/peripheral that does not monitor the reset line, or bad hardware.
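
The beeping in item 4 is easy to model.  The toy class below is a simulation only, not real BIOS code: the class and method names are invented, and the capacity of 16 is taken from the "about 16 keystrokes" figure above rather than from any hardware reference.  The interrupt side keeps adding keys until the buffer is full, and since the hung foreground program never reads any, every further keystroke just beeps.

```python
from collections import deque


class KeyboardBuffer:
    """Toy model of a type-ahead buffer filled by the keyboard interrupt."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.keys = deque()
        self.beeps = 0

    def on_key_interrupt(self, key):
        """Interrupt side: store the key, or beep if the buffer is full."""
        if len(self.keys) >= self.capacity:
            self.beeps += 1
            return False
        self.keys.append(key)
        return True

    def read_key(self):
        """Foreground side: remove and return the oldest key, or None."""
        return self.keys.popleft() if self.keys else None


buf = KeyboardBuffer()
for key in range(20):          # foreground never calls read_key()
    buf.on_key_interrupt(key)
# buf.beeps is now 4: keys 17 to 20 found the buffer full
```

In the deeper lock-up of item 5, `on_key_interrupt` would never be called at all, which is why there is no beep to hear.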

Before you conclude that the Reset button is the way to go (as it works more often than Ctrl-Alt-Del), bear in mind that (like a bullet through the brain) it cuts through everything, so that the reset "request" cannot be intercepted as required to "clean up" the system by flushing pending disk cache writes to disk etc.

Mind you, sometimes that's exactly what you want...
