Heroku tech change leaves customer with bill-shock
PaaS clouds not easy to run if you don't RTFM
The cloud is not as easy or as simple as its providers' marketing departments may want you to believe – that's the moral of the story of a startup and its platform provider Heroku.
"In mid-2010, Heroku quietly redesigned its routing system, and the change – nowhere documented, nowhere instrumented – radically degraded throughput on the platform," James Somers, an engineer at Rap Genius, wrote. "Dollar for dollar, a dyno became worth a fraction of its former self."
A dyno is the fundamental resource unit that Heroku deals in. Any time spent waiting for a request to reach a dyno directly impacts performance.
The change that caused cost-overruns at Rap Genius was Heroku's switch from an "intelligent routing" network structure to a "random routing" structure in mid-2010. This altered how requests were sent to dynos.
Instead of making sure that requests immediately found their way to an available compute resource – a dyno – Heroku began assigning the requests to random dynos, Somers wrote. This meant that some requests would end up being queued, which could lower throughput – a critical issue for large web applications.
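To see why random assignment causes queuing, a toy simulation helps. The sketch below is not Heroku's router: the function names, the workload, and the rule that each dyno serves one request at a time are all assumptions made for illustration. "Intelligent" routing is modeled here as sending each request to whichever dyno will be free soonest.

```python
import random


def simulate(num_dynos, arrivals, service_times, policy, seed=0):
    """Return the mean time requests spend queued before a dyno serves them.

    Hypothetical model: each dyno handles one request at a time, in order.
    """
    rng = random.Random(seed)
    free_at = [0.0] * num_dynos  # when each dyno clears its current backlog
    total_wait = 0.0
    for arrival, service in zip(arrivals, service_times):
        if policy == "random":
            # Pick any dyno, busy or not.
            dyno = rng.randrange(num_dynos)
        elif policy == "intelligent":
            # Assumed stand-in for the old behaviour: earliest-free dyno.
            dyno = min(range(num_dynos), key=free_at.__getitem__)
        else:
            raise ValueError(f"unknown policy: {policy}")
        start = max(arrival, free_at[dyno])
        total_wait += start - arrival
        free_at[dyno] = start + service
    return total_wait / len(arrivals)


def make_workload(n, rate, seed=1):
    """Invented workload: mostly fast requests with an occasional slow one."""
    rng = random.Random(seed)
    now = 0.0
    arrivals, service_times = [], []
    for _ in range(n):
        now += rng.expovariate(rate)  # random gaps between requests
        arrivals.append(now)
        service_times.append(3.0 if rng.random() < 0.05 else 0.1)
    return arrivals, service_times


if __name__ == "__main__":
    arrivals, service_times = make_workload(n=5000, rate=40.0)
    for policy in ("intelligent", "random"):
        wait = simulate(20, arrivals, service_times, policy)
        print(f"{policy:12s} mean queue wait: {wait:.3f}s")
```

In this model the dynos and the workload are identical for both policies. Only the choice of dyno changes, so any difference in queue wait comes from requests landing behind a slow request while other dynos sit idle.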
The company asked Heroku what was going on, and an engineer told the startup that the reported numbers reflected only the time it took to serve a request after it had left the queue. Rap Genius had presumed Heroku's "intelligent routing" meant there would never be any queuing.
As of this Thursday, Heroku's site says this about how it routes requests: "Incoming web traffic is automatically routed to web dynos, with intelligent distribution of load instantly as you scale." If you click through to the routing page, however, it clearly states "the routing mesh uses a random selection algorithm for HTTP request load balancing across web processes."
Where the situation gets complicated is that Heroku did – slowly – reference the change from intelligent routing to random routing in its documentation. But Rap Genius feels they got a raw deal because Heroku did not explicitly reach out to the young startup and tell it about the change.
Heroku also continues to use the term "intelligent routing", which previously described sending each request to an available dyno, though it now routes requests randomly.
Other developers have run into this problem, and Heroku has acknowledged that it has done a poor job of telegraphing the change to users. Developer Tim Watson ran into the same issue as Rap Genius in mid-2011 and queried Heroku, and the company's CTO Adam Wiggins said:
You're correct, the routing mesh does not behave in quite the way described by the docs. We're working on evolving away from the global backlog concept in order to provide better support for different concurrency models, and the docs are no longer accurate. The current behavior is not ideal, but we're on our way to a new model which we'll document fully once it's done.
Some industry insiders were sympathetic to Rap Genius's performance hit, but bridled at the aggressive tone the startup used in its blog post.
"Pay attention to your app and performance," former Heroku operations director Mark Imbriaco told The Register via Twitter. "This shouldn't sneak up on you after that long," he said.
"I understand why [Rap Genius] is upset with the performance they see, it sucks," he wrote.
Rap Genius's cofounder Tom Lehman admitted in conversation with The Register that "there are definitely ways to mitigate this and Rap Genius should do more of them, but the problem is still really bad."
The startup feels misled by Heroku, and wrote the blog post after Heroku rebuffed its attempts to get a price reduction while it restructured its site.
The Register has seen an email from Heroku's CTO to Rap Genius that indicates Heroku advised the company to re-engineer its application.
"I'm convinced that the best path forward is for one of your developers to work closely with [name redacted by Reg to preserve anonymity] to modernize and optimize your web stack," Wiggins writes. "If you invest this time I think it's very likely you'll end up with an app that performs the way you want it to at a price within your budget."
At present the situation is unresolved: Rap Genius is clamoring for developers to email Heroku's support desk, while Heroku has remained silent apart from a single statement sent to The Register.
"Our customers' success is our top priority," the company wrote. "We are working hard to get to the bottom of this situation and give our customers a clear and transparent understanding of our next steps. We'll provide more information as soon as possible on our blog."
From The Reg's point of view, the events illustrate the troubling nature of modern cloud infrastructure: platform providers promise to take on much of the development work a company would otherwise have to do itself, but if the company does not pay close attention to its platform provider, architectural changes can cause cost overruns.
It's a problem that's only going to get worse, and if – like the startup in this story – a company discovers a fault only after its application has garnered significant traction, then moving away from the platform can be difficult.
"Moving off of Heroku only gets harder," Lehman said.
Rap Genius is a community wiki for the etymology and cultural significance of lyrics in rap music and other works of art. It has around 15 million users a month, according to the company. Rappers such as Kidd Kidd, Pusha T, and Bryant Dope are verified members of the site.