Slack bots have the keys to your processes. What could go wrong? Well...
They're bots, not freakin' Skynet
Posted in DevOps, 28th February 2018 09:04 GMT
It’s almost impossible to talk about Agile software development without mentioning bots. If you think that's a bit of a stretch, try talking about DevOps without some sort of collaboration tool. Then realise that the two are beautifully linked.
The collaborative nature of Agile means solutions like Slack are a natural fit, and Slack is also a wonderful, easy platform for bots. All of our processes are now automated, our build issues scream alerts into channels and in the same workspace we can throw a .gif into #fun for some comic relief. Happy days. When it all works.
There's Slack for that
It seems that Slack has become the answer to everything. It's where you communicate with your co-workers, regardless of where they are. It's how you tell your boss that you're staying at home because you're sick. And it's now where you modify JIRA issues, review commit summaries and view application exceptions. Nearly 250 Slack bots live under the Developer Tools category, to help software devs do all their things. There are definite benefits to consolidating all your systems into one central notification interface, especially when that isn't your Inbox. But are bots just holding together a house of cards?
The dangers of obscuring the details
On the surface, it seems that feeding bots into Slack is raising the level of visibility of what's going on during the software development process. That must be a good thing. It is, to a point, and we usually reach that point when things go wrong. Remember, it's a bot – it's not AI. It takes input from one system and puts it into another system. And systems can fail.
Process steps can get stuck. The worst part about "stuck" is that it doesn't always result in an error. It's easy to alert that a process ran and failed; the hardest thing to alert on is something that hasn't happened when it should have. "Stuck" can mean "I went off and started this process... and the human hasn't noticed that I haven't returned any results yet, three hours later." Stuck happens for all kinds of reasons, and it's more likely when just part of a system is down rather than the whole thing: an API call may still be received, but the one component that fails to execute may never trigger an alert, especially if the bot doesn't flag that as an issue.
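That "hasn't happened yet" case is awkward to alert on precisely because there's no error event to hook into. One workable pattern is a watchdog that records when each bot-triggered job starts and periodically asks which ones have gone quiet past a deadline. A minimal sketch in Python, with all names hypothetical:

```python
import time

class StuckWatchdog:
    """Tracks started jobs and flags any with no result after a deadline."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.started = {}  # job_id -> start timestamp

    def start(self, job_id, now=None):
        # Record the moment the bot kicked off the job.
        self.started[job_id] = time.time() if now is None else now

    def finish(self, job_id):
        # A result came back; stop watching this job.
        self.started.pop(job_id, None)

    def stuck(self, now=None):
        # Jobs that started but have produced nothing past the deadline.
        now = time.time() if now is None else now
        return [j for j, t in self.started.items()
                if now - t > self.timeout_s]
```

A periodic task (cron, or another bot) would call `stuck()` and shout into a channel about anything it returns, which turns "nothing happened" into an event you can actually see.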
Bots are also rather precious and don't handle failure well. If the system throws an exception, sometimes you'll just get a failed message in return. It's then up to you to get your hands dirty and figure out what broke the process and how to get it back on track. That sounds like a normal, expected troubleshooting process, but now you'll have to work out whether the bot is broken, the system is broken, or your code has broken everything. You might even be faced with a manual rollback, if that's not automated, and that's trickier to do when the bots have been automating all the builds.
One of the scariest scenarios is Slack itself going down. It doesn't happen often, but it does happen. If your team is used to doing everything through this one communication channel, they might be playing Solitaire until it comes back online. You only have to watch social media to see how many organisations now rely heavily on Slack's availability. That's not helped by it being the "go-to" platform for bot developers and software tool integrations, most of whom aren't even looking at building for alternative platforms. The proliferation of bots on Slack is a blessing and a curse. So much to choose from... as long as it's on Slack.
So what's a dev to do?
You could ignore the risk completely and fight the fires when they happen. Or you could take a leaf from the book of systems administration and prepare for the worst.
Back up all the things. Yes, I know your software repositories and build environments are in the Cloud. But when the Cloud is just someone else's computer, I wouldn't be relying on their backups to protect my IP (and all that time we've spent). This isn't really about backing up Slack, it's about knowing you have a copy of the data that sits inside all of those connected systems (including GitHub), just in case a human does something stupid. No, GitHub itself is not a backup. Also check whether each of those systems backs itself up to an external location automatically, or whether it just makes your data accessible via an API, which you'll have to script against yourself. And check the backup actually runs. And test that you can restore it.
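On that "check it actually runs" point: assuming your backup job drops its output into a known local path (a hypothetical setup, adjust for your own), a freshness check like the one below will catch the script that has silently stopped running, which is exactly the failure mode nobody notices until restore day:

```python
import os
import time

def backup_is_fresh(path, max_age_hours=26):
    """Return True if the backup at `path` exists and was written recently.

    A daily job that hasn't written anything in ~26 hours has probably
    stopped running; alerting on that catches silent failures long before
    you actually need the backup.
    """
    if not os.path.exists(path):
        return False
    age_s = time.time() - os.path.getmtime(path)
    return age_s <= max_age_hours * 3600
```

Wire the `False` case into an alert (yes, even a Slack one, though ideally not only a Slack one) and you've closed the loop.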
Understand the processes behind the automation. Learn what happens behind the scenes when you tell Bot X to execute Y command. Then teach that to the other software developers. Document it, even. Understanding the processes makes it much easier to know where to start looking if things go wrong and know what Google search results are irrelevant.
Understand the systems behind the bots. What does Team City actually do? How does GitHub work? In the event of a bot freak-out or a Slack outage, you might have to run some of these steps manually. Knowing how to use the interfaces of the systems themselves could come in handy.
Have your manual checks and balances in place. Benchmark and understand what is normal. If you execute a bot command, should you see a response within two minutes, within half an hour, or at all? What else can you check to validate the output of that command? Also identify whether some steps are too complicated to be trusted to a bot and need a human eye, like verifying bug fixes, where a human can spot collateral damage even when the bug itself is technically fixed.
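The benchmarking step can be as simple as keeping a history of past response times per command and flagging anything that strays far from its own baseline. A hedged sketch, with hypothetical names and a threshold you'd tune for your own team:

```python
from statistics import median

def flag_slow_commands(baseline_s, latest_s, factor=3.0):
    """Compare each command's latest response time against its own baseline.

    `baseline_s` maps command name -> list of past response times (seconds);
    `latest_s` maps command name -> most recent response time. A command is
    flagged when its latest run took more than `factor` times the median of
    its history - a cheap "is this still normal?" check to run alongside
    the bots.
    """
    flagged = []
    for cmd, latest in latest_s.items():
        history = baseline_s.get(cmd)
        if not history:
            continue  # no baseline recorded yet; nothing to compare against
        if latest > factor * median(history):
            flagged.append(cmd)
    return flagged
```

The point isn't the arithmetic; it's that "two minutes or half an hour?" stops being folklore and becomes a number someone wrote down.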
Teach the children
Bots are great for automating menial tasks and long-winded processes, but they're also really good at preventing us from having to think. Without passing on our understanding of how the work happens, we risk repeating a history of human "single points of failure", where one person in the team is the only one who knows how to fix things when they break. You'll still find that person in large enterprise environments, clinging on to legacy systems that they've been patching since before Y2K. We owe it to our junior developers to teach them not just what to do, but why we do it and what takes place to make it all happen, with or without a bot.
We'll be covering DevOps at our Continuous Lifecycle London 2018 event. Full details, including early bird tickets, right here.