Testing times: Between some IoT code and a hard place
Embedded bugs are hard to catch
Every company has its ups and downs. Those downs could be exploding phones or a sudden unmovable overstock of Clinton merchandise (or conversely an uptick in Trump-the-statesman t-shirt demand). Bigger organisations can better absorb the illness of a member of staff or a surge in demand beyond all expectations, although no one is totally immune.
In smaller companies and startups the pain can be intense. I remember a job before going to university where I sweated for a week over hardware that stopped working just before a critical demo in the US. It was for a robot toy for a couple of the world's largest purveyors, and no one would have died… except for my employer. We slept under the desks (my landlady thought I must be out doing drugs and summoned scary relatives to hound me) and we tore out our hair, some of which may even have turned grey.
It all turned out to be down to one tiny misplaced wire on our development electronics board, discovered at the 11th hour when rechecking “everything”.
We did the demo, the project continued, and my employer lived.
Later, for one of my huge clients we had stuff fall over that stopped production, with costs of probably millions per day, and though there was a lot of shouting I don't know of anyone being fired or even reprimanded.
At the moment we have some horrible problems with our main working prototypes: extra unattributable current draw reducing battery life hugely. Part of that was a PCB schematic error, but a big chunk of it is yet to be hunted down and squished.
We've stopped a rollout until these can be resolved, and have holding plans in place for kit already out in the field.
One key difference between banking and radiator valves turns out to be test harnesses and the supporting dev infrastructure. In banking I wrap huge and detailed test harnesses around production code and everything is usually on tap to do so easily.
Testing has virtues beyond avoiding fear and surprise. First, it is much rarer to ship code with a bug, especially a subtle one. And if a bug crops up that wasn't covered by a test case, add a test case or three for next time.
Next, when the code is tested to death it is much easier to fix or improve other things (refactor) without being terrified of breaking, or actually breaking, apparently unrelated things.
Lots of people grumble about writing and maintaining tests, but it's a lot cheaper and less unpleasant to catch something horrible before release than to do a product recall if that bug is embedded in IoT hardware, for example.
Yes, test harnesses for Java (and C++) in banking apps are annoying and require some work.
Test harnesses for embedded code are much, much harder, because often the code is doing bare-metal stuff that can only run on the hardware, and not on Jenkins or Travis hosts or whatever, and no one has written an emulator for your particular hardware and peripherals.
And even if one lovingly crafts a test harness that can be run on the target hardware (am I the only saddo to have built and run a unit-test framework for the Arduino Uno?), it's clear that unless tests are very easy and quick to run, they simply won't be.
So, grasshopper, thus was our mound of software technical debt built, that has surely come back to bite us in the valves.
Gtest, Eclipse, Nirvana
As the code base has grown, so has the opportunity for bugs – outright brain farts – and unexpected interactions between separate complex parts.
And without test harnesses some of those faults manifest themselves only in users' bedrooms in the dark, grinding motors and preventing sleep. Yes, that has really happened. Also a good reason for the CEO to go on installation visits, to hear unfiltered user tales of woe first-hand.
So we have been busy digging out our core logic, wrapping it in a Googletest (gtest) harness and wondering how some of our stuff ran at all, while being forced to clarify our thinking and specs. And we are feeling at liberty finally to refactor some of the really crufty bits.
These tests still run only in the IDE, and we need to get them running on servers in the cloud as well, ASAP, to catch unexpected clashes between separate check-ins. We're getting there.
Then we hit continuous integration (CI) Nirvana.
Some of the bugs are currently still under the radar in the glue logic that binds the tested bits together. We can't test the bare-metal stuff easily either, though we can get close with some cleverness. And no amount of gtest will find the excessive current draw if it's a PCB problem.
But we have been talking for ages about a set of hardware CI servers that run PCBs in Faraday cages to test radios, battery drain, sensors and so on that would have stopped one of our current nightmares early.
I will talk more about this at Building IoT London in March so you can come and get chapter and verse there. Some of the rest of you can just look into our open source git repositories and JIRA and weep… um, cheer.
And I haven't even talked about investment and regulatory setbacks.
Code monsters in your bedroom, ahoy!
This is not meant to be a running product-defence column, but to address a point in last week's comments: for the vast majority of homes with newish mechanical TRVs already in place, no drain-down or special tools are needed to fit Radbot, or to put the old TRV head back in place when a tenant leaves, for example. It's vital to us that almost anyone can use our gizmos without a PhD, a smartphone, any special tools or permits.
Are we ready to press the launch button yet? That’s for next time. ®