Mine all the data, they said. It will be worth your while, they said

When instrumentation goes too far

Good developers instrument their applications. Good ops teams monitor everything. This near-fetishisation of telemetry has been extended to DevOps, where it now risks becoming something of a problem.

It's good to collect data on your applications and your infrastructure. An important part of Doing IT Unstupidly is baselining: gathering a bunch of data that tells you how things behave when they're doing what you expect them to, monitoring them in production, and then freaking out when they deviate from your baseline.
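If you want to see how little the basic idea takes, here's a minimal sketch in Python – the latency figures and the three-standard-deviations threshold are invented purely for illustration:

```python
import statistics

# Baseline: measurements taken while the system was behaving as expected.
# In real life this comes from your monitoring history, not a list literal.
baseline_latency_ms = [102, 98, 110, 95, 105, 99, 101, 97, 104, 100]

mean = statistics.mean(baseline_latency_ms)
stdev = statistics.stdev(baseline_latency_ms)

def check(sample_ms, tolerance=3.0):
    """Freak out (politely) when a production sample drifts too far from baseline."""
    if abs(sample_ms - mean) > tolerance * stdev:
        print(f"ALERT: {sample_ms}ms is more than {tolerance} standard deviations from baseline")
    else:
        print(f"OK: {sample_ms}ms is within the expected range")

check(103)   # fine
check(250)   # deviates from baseline -> alert
```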

Telemetry can be complicated. A DevOps team would be responsible for capturing application and infrastructure events, trapping various errors, examining network connections, looking at endless streams of real-time performance monitoring data and collecting feedback from security solutions.

But even a modest IT estate is capable of producing gigabytes of raw telemetry data a day. Some big data advocates – and I'm talking about the sales and marketing types, not engineers – will tell you that you should take all of that data, shove it into a database and then run lots of analytics against it. There's gold in that data, you'll be told. Somehow, magically, it will save you money. This is not exactly correct.

Signal-to-noise ratio

We've all met the Signal-to-Noise Ratio demon face to face: something goes splork and you're inundated with alerts that tell you nothing. Inventing new curses while cutting through the monitoring software-generated insanity is a rite of passage.

Fine-tuning a monitoring application and building filters is annoying but doable if you're a specialist working on an IT team with a narrow remit. When you're working on a DevOps team whose remit is "all the things" and the scale of your deployments is "do not attempt without scripts", this becomes another conversation entirely.

Most conversations about Security Information and Event Management (SIEM) are loud ones. Usually these conversations contain inventive new curse words.

In Klingon.

Diet Big Data

To those who live and breathe DevOps, none of this is news. All one needs to do to verify this is look at what has become the tool for telemetry collection in the DevOps world: the open-source Elastic Stack. Once upon a time, Logstash gathered all the telemetry and dumped it into Elasticsearch. Kibana read the information out of Elasticsearch and displayed pretty analytics. This was the ELK (Elasticsearch, Logstash and Kibana) stack.

This was great, because Logstash comes with many input plugins and lots of options to filter inputs. The SNR problem became less of a problem.
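To give a flavour of the far end of that pipeline, here's a rough Python sketch using the official elasticsearch client. The host, index name and event fields are made up for illustration; in a real ELK deployment Logstash does the shipping, not your own script:

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch  # pip install elasticsearch

# Hypothetical local cluster; swap in your own endpoint and credentials.
es = Elasticsearch("http://localhost:9200")

# A structured log event of the kind Logstash would normally ship.
event = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "host": "web-01",
    "level": "ERROR",
    "message": "upstream timed out while reading response header",
    "service": "checkout",
}

# Index the document; Kibana dashboards then query this index.
# (Older client versions take body= instead of document=.)
es.index(index="app-logs", document=event)
```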

Unfortunately, Logstash – which is built in Java – is a bit of a resource hog, so Beats were developed as lightweight telemetry collectors. Beats collect the data; Logstash aggregates, transforms, and adds structure to it.

Ultimately, even this proved to be too much data. Beats gained the ability to filter the data before sending it to Logstash.
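The principle is simple enough to sketch. The Python below is not Beats configuration (Beats handles this with YAML-defined processors); it just illustrates the idea of dropping uninteresting events at the collector, before they ever reach the aggregation layer:

```python
# Illustrative only: drop noisy events at the edge before shipping them on.
INTERESTING_LEVELS = {"WARN", "ERROR", "FATAL"}

def should_ship(event: dict) -> bool:
    """Keep warnings and errors; silently drop routine chatter."""
    return event.get("level") in INTERESTING_LEVELS

events = [
    {"level": "DEBUG", "message": "cache hit"},
    {"level": "INFO", "message": "request served in 12ms"},
    {"level": "ERROR", "message": "database connection refused"},
]

shipped = [e for e in events if should_ship(e)]
print(f"shipping {len(shipped)} of {len(events)} events")  # shipping 1 of 3 events
```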

In addition, the dataset in Elasticsearch can easily become so large that indexing becomes a problem. For this reason multiple node types exist, with the ingest node being a special type that pre-processes documents before indexing takes place.
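For the curious, ingest pipelines are defined over the same REST API as everything else in the stack. A rough sketch via the Python client, with a pipeline name and grok pattern invented for the example (recent clients take keyword arguments; older ones take a single body dict):

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# Define an ingest pipeline that pre-processes documents before indexing:
# parse the raw message into fields, then throw away the bulky original.
es.ingest.put_pipeline(
    id="parse-app-logs",  # pipeline name invented for this example
    description="Structure raw log lines before they hit the index",
    processors=[
        {"grok": {"field": "message",
                  "patterns": ["%{LOGLEVEL:level} %{GREEDYDATA:detail}"]}},
        {"remove": {"field": "message"}},
    ],
)

# Documents indexed through the pipeline are pre-processed on an ingest node.
es.index(index="app-logs", pipeline="parse-app-logs",
         document={"message": "ERROR payment gateway unreachable"})
```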

A history in evolution

When ops teams ruled the IT landscape with an iron fist, monitoring tools like Splunk, SolarWinds and ArcSight were in vogue. These applications were resource-hungry behemoths designed to live on their own hardware, behind the corporate firewall, with unfettered access to the infrastructure systems they monitored.

And infrastructure monitoring was the point: operating systems and a few key enterprise systems like databases and mail servers were instrumented, but none of these made great platforms for application developers. Back around the turn of the millennium it was more than enough to be an infrastructure monitoring solution.

Eventually, monitoring broke into two paths. The first took a bigger security focus and became SIEM tools. AlienVault gets at least some of the blame for stirring the pot and forcing everyone else to evolve.

SIEM tools collect as much data as possible and then perform aggressive analytics on that data. They all run into (or have run into) problems with both performance and SNR. Most SIEM vendors have built some filtering capability into their products, though it may come with an additional cost or even be an entirely separate product to subscribe to.

Truly solving the problem, however, would require redesigning their applications. For some vendors this would mean jettisoning decades of code as they moved to a more modular platform. Some SIEM vendors are relying on third-party options like Big Panda or Elastic Stack to provide interim filtering capability. Others are engaged in significant redesign efforts.

The CERN approach

The second group of tools went in a completely different direction. They nerded out over building the ultimate tool for collecting information of various sorts, indexing it, searching it and displaying it.

This group took the approach to data adopted by CERN, host of the Large Hadron Collider. That approach is: don't collect data that you don't need. CERN would love to have ridiculously powerful computers made out of unobtainium. In reality, however, they throw away more data than they store.

CERN can generate 1PB of data every second, which has resulted in them turning to machine learning to help them sort through it. In fact, data scientists only get to record about 1 in 40,000 collisions at the Large Hadron Collider (LHC).

So if CERN – probably the pre-eminent, non-spook example of big data on the planet – is throwing away 99.9975 per cent of its data, then the Elastic Stack approach of "lots of layered filters with varying levels of efficiency and capability" might have some legs to it.

Especially when you think about what it would take to handle the complete instrumentation of something like Netflix. In fact, don't think about that too hard. That's just... you know what? I'm going to go have a little lie down.

Logging at Netflix scale

Speaking of Netflix, they use the Elastic Stack. The reason, they joke, is that they're "a log generating company that happens to stream movies". In January 2014, Netflix was generating over 70 billion log events a day, with a growth curve that looked exponential.

One of Netflix's challenges was that they needed a telemetry platform that could cope with the creation of new types of events every day. They also needed a solution that could do automatic sharding and replication. Elasticsearch has proved useful here.

This isn't to say that the Elastic Stack solved all of Netflix's problems. In addition to the Elastic Stack, Netflix has relied on more traditional Big Data tools such as Hadoop, Hive and more. Netflix also heavily customised Elasticsearch in order for it to meet their needs.

Netflix has also committed considerable resources to creating their own tools. Tools they have in turn shared with the world. Enter Suro, Edda, the Netflix Simian Army, FIDO, and many, many more.

And just as the existence of the Elastic Stack is forcing SIEM tools to evolve, Netflix's engineers taking the time out to explain how they cope with their problems has forced the Elastic Stack to evolve.

The salient bit

The important takeaway from all of this is that when you start dealing in data volumes like this you have to change how you think about telemetry. We are now well beyond the days where simply hoovering up as much data as you can and shoving it into big, static databases will get the job done.

Today we not only need to be judicious about the data we collect, but we need to structure that data before we store it. The storage solution we choose has to be flexible enough that we can change the schema on the fly as we introduce new event types, or it has to be a schema-less storage solution.
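As a trivial, hand-rolled illustration of what "structure before storage" means – the log line and field names below are made up – compare the raw string with what actually gets stored:

```python
import re
from datetime import datetime, timezone

# Illustrative only: give a raw log line explicit structure before it is stored,
# rather than dumping the unparsed string into a database and hoping for the best.
RAW = '203.0.113.7 - - [10/Jan/2018:13:55:36 +0000] "GET /checkout HTTP/1.1" 500 1024'

LOG_PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] "(?P<request>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

match = LOG_PATTERN.match(RAW)
event = {
    "ingested_at": datetime.now(timezone.utc).isoformat(),
    "client": match.group("client"),
    "request": match.group("request"),
    "status": int(match.group("status")),
    "bytes": int(match.group("bytes")),
    "event_type": "http_access",   # new event types just add new fields
}
print(event)
```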

Even if you hate the term DevOps, none of us can escape IT automation. In big shops and small, IT is getting automated, and this means both an increase in instrumentation and the collapse of IT specialties into broader, more generic categories.

In other words: this application and infrastructure instrumentation and telemetry problem is going to park itself on all of our doorsteps eventually. So it's probably worth having a serious think about how you plan to solve it sooner rather than later. ®
