All Bugs Are Shallow… Except This One

In his essay “The Cathedral and the Bazaar”, Eric S. Raymond coins the phrase “given enough eyeballs, all bugs are shallow”, meaning that with enough testers and enough programmers, it is possible to diagnose and fix any software bug.

So why can’t my computer suspend and resume properly?

The concept of ‘suspend’ (or ‘sleep’, or ‘standby’) mode, whereby the computer dumps its internal state to RAM then enters a low-power state with its processor and other hardware turned off, is not new. The ACPI standard has been kicking around for 14 years now, a very long time compared to the life cycle of an operating system. These days, with laptop use on the rise, it’s a very common thing for users to want to do. And yet resuming from suspend is still hit-and-miss.

Why do I find it more reliable in Ubuntu than openSUSE for the same base kernel? Why does GNOME fare better than KDE? Why does my WiFi sometimes not come back? Why, with Microsoft’s million-dollar, if not billion-dollar, operating system budgets, and with Intel, AMD and nVidia’s decades of driver experience, are suspend and resume still frequently an issue even on Windows?

Only Apple, with its closed hardware / software ecosystem, seems to have cracked it.

I’d hate to think of that as the only way to a bug-free existence; I’m very fond of the idea of an open ecosystem where I can run whatever software I want on whatever hardware I want. But I’m worried. Is the range of (IBM-compatible, ACPI-supporting) hardware out there just too diverse, and too inconsistent in its support for suspend-and-resume? Is it simply infeasible for software to implement it perfectly on all devices?

Has hardware created the one software bug that, for any reasonable number of eyeballs, isn’t shallow?

Never Work With Hardware

Beware, techie ranting ahead.

I have learned one important lesson over the last three weeks: Never, ever work with hardware.

We have this board. I shan’t say exactly what it does for obvious reasons, but suffice it to say that it generates signals and sends them, and sometimes receives signals and processes them. And, three weeks ago, it started crashing. Randomly. Sometimes after a few seconds, sometimes after a few hours, but it would always crash. Yeah, *that* kind of bug. Also, the development environment features what I will call “quantum breakpoints”. These are just like normal breakpoints, where you can stop your code at some point and examine the state of variables, only they change things in tiny imperceptible ways, and usually when you start the program again from a certain breakpoint it just won’t work.

So, I wrote most of the code for the main signal-processing chip on this board, so I got to try and sort this bug out. Thanks to the lovely combo above, it took me about a week to narrow down the bug to a certain bit of processing code. To make matters even harder, it turns out that this is not a bit of code that I wrote. So I call in the guy who wrote that code, and we spend nearly another week tracing through his code. By this point we’ve also discovered that there are several different ways in which the crash happens, in some of which the chip’s Program Counter is pointing at a memory location that doesn’t even exist, let alone contain actual code.

We start wondering if it’s a memory addressing issue, so we write all kinds of test programs, all of which work flawlessly. So we call in the FPGA guy, whose chip also accesses the same memory, and he has a go at the problem for several days, also getting nowhere. With all of us plus electronics guys and other people who have used the same chips before, we’ve now got half the project team sitting in a lab flailing wildly at what must be the most obscure software bug in the world.

Until someone checks the power line to the chip. And it’s about 10 millivolts too low. Ten millivolts! Turns out the signal processing function that made it crash didn’t have any software bugs at all. It just happened to tax the processor quite a bit, so it drew more current, so the voltage dropped a tiny bit: not enough to stop the processor, but just enough to make it corrupt its own internal memory and crash horribly.
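
For the curious, here is a back-of-the-envelope sketch of how a busier processor can drag its own supply down. It is just Ohm's law, and every number in it is made up for illustration: the source voltage, the series resistance, the load currents and the minimum-spec threshold are all assumptions, not measurements from our board.

```python
# Back-of-the-envelope model of a supply rail sagging under load.
# Every number here is hypothetical, chosen only to show the effect.

def rail_millivolts(source_mv, series_milliohms, load_ma):
    """Voltage seen at the chip: source minus the I*R drop on the way there."""
    # milliohms * milliamps gives microvolts, so divide by 1000 for millivolts
    drop_mv = series_milliohms * load_ma // 1000
    return source_mv - drop_mv

SOURCE_MV = 1200        # assumed regulator output
SERIES_MILLIOHMS = 50   # assumed resistance between regulator and chip
MIN_SPEC_MV = 1185      # assumed minimum the chip tolerates

for label, load_ma in [("idle", 200), ("heavy signal processing", 400)]:
    v = rail_millivolts(SOURCE_MV, SERIES_MILLIOHMS, load_ma)
    verdict = "fine" if v >= MIN_SPEC_MV else "below spec"
    print(f"{label}: {v} mV ({verdict})")
```

With these invented figures, doubling the load current takes the rail from 1190 mV to 1180 mV, a 10 mV sag that is enough to cross the assumed threshold even though no code changed.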

THREE BLOODY WEEKS chasing a software bug and… the voltage supply to the chip is slightly too low.

So, er… ARGHARGH PROPRIETARY HARDWARE. From now on I’m only dealing with things with x86 chips in them, and operating systems. In fact, even OS-native code can go hang. Virtual machines. No, wait, fuck it, interpreted code. Is there a language where I can run interpreted code in a virtual machine? I have the sudden urge to stay as far away from hardware as possible.