What is System Hang and How to Handle it?


BigTree Group Declaration

BigTree was a research group founded and led by Yue Li, who at that time was a student pursuing his Master's degree in Computer Science at NPU, China. The group was disbanded after Yue Li graduated. The members of BigTree were Tian Tan, Jialong Shi, Yang Shen and Yue Li.

We developed Shfh, a self-healing tool for automatically detecting, diagnosing and fixing system hang failures. For technical details, please see our ISSRE'12 paper titled "What is System Hang and How to Handle it".

Note that, in our experience, Shfh is able to detect and heal real system hang failures:

We once encountered a case in a version of Ubuntu (we cannot remember which one) where selecting all the files on the desktop and then right-clicking put the system into a hang state: the machine froze and did not respond to any keyboard or mouse input for a very long time. We tried this at least twice and got the same result. Unfortunately, we did not report this as a bug to the Ubuntu development team.

However, when we loaded Shfh first, opened the system monitor to watch the metric curves, and then repeated the same hang-inducing operation, everything displayed on the screen froze for a second and then started moving again (the curves in the monitor showed the change clearly). So Shfh successfully detected and fixed this real hang failure.

Shfh has false positives (it may kill an innocent thread). Although these are rare, they may introduce unexpected side effects. In one special scenario, Shfh detected and healed a hang caused by our fault injection experiment, but afterwards the files on the computer no longer responded to double-clicks.

For this reason, please use Shfh as a research tool rather than in production if you cannot tolerate unexpected results :)

When we packed the artifact (including the demo, injected faults, experimental data and the source code of our tool) in April 2012, we did not prepare a guide for it (sorry for that). We are now simply uploading it as an open-source tool (see Downloads). If you are familiar with Linux kernel modules, we believe you can figure out how to run it with the help of our ISSRE'12 paper.

Good Luck.

Yue and Tian

in Sydney, Aug 2014.