Microsoft's Azure Kubernetes Service mucked my cluster!

Redmond blames user error, invites further feedback to improve its service


Microsoft's Azure Kubernetes Service (AKS) was launched to world+dog in June, but a few disgruntled customers say the managed container confection isn't fully baked yet.

In a blog post published on Monday, Prashant Deva, creator of an app and infrastructure monitoring service called DripStat, savaged AKS, calling it "an alpha service marked as GA [generally available] by Microsoft."

Deva said he moved his company's production workload to AKS last month, and has been plagued by random DNS failures for domains outside of Azure and hostnames inside the Azure Virtual Network.

He characterized the response from Microsoft support – advice not to use excessive memory and CPU resources – as ridiculous, and said Microsoft failed to respond when told the DNS issues occurred mainly during application startup, when memory and CPU usage are minimal.

Then there was the AKS Kubernetes Dashboard, which crashed after a few days and required a reboot of the Kubernetes API Server to fix. And this happened, Deva said, on a daily basis, which meant the constant filing of support tickets.

Have you tried turning your infrastructure off and on again?

When Docker containers crashed, the underlying virtual machine would fail too, according to Deva. Recovery required manually rebooting the VM from the Azure portal. He described the response he got from Azure Support thus: "Yeah this is your problem. Just make sure your containers never crash."

He recounts an unrecoverable cluster crash, and claims the service-level agreement (SLA), covering the VMs underlying AKS but not AKS itself, was violated.

"Azure Support has been the worst support experience of my life," he said, noting that he's moved to Google Cloud Platform for its Kubernetes service. "...Ignoring the SLA violation is downright fraudulent behavior."

Reached via Twitter's private message system, Deva said his experience was limited to AKS and didn't reflect other Azure services.

"This has been very poorly handled by Microsoft," he told The Register. "The worst part is them trying to blame the user for issues on their end."

In an email to El Reg, a Microsoft spokesperson attributed the problem to Deva running workloads without a memory limit:

"In the course of an in-depth engagement by our engineering team, we determined that the customer’s workloads had been overscheduled on the nodes in his cluster, crowding out system services and causing undesirable behavior.

"We provided recommendations for how the customer could prevent this from reoccurring and have made corresponding improvements in AKS to ensure that customers cannot inadvertently get into this situation again. We are also continuing to invest in providing better diagnostic and monitoring tools so that customers and our own support engineers can more quickly determine what might be causing problems in a customer’s environment. We are always concerned if a customer has an issue with AKS and we will use this feedback to continue to improve the service and our support process."
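
For the curious, here is roughly what "overscheduled" means in practice. In Kubernetes, a container's memory cap is set via resources.limits.memory in its spec; a container without one can grow until the node runs dry. The Python below is a back-of-the-envelope sketch, not Microsoft's tooling and not anything taken from Deva's cluster. The Container class, the audit_node function, and every number in it are hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Container:
    """Hypothetical stand-in for a container scheduled on a node."""
    name: str
    memory_limit_mib: Optional[int] = None  # None means no memory limit was set


def audit_node(allocatable_mib: int,
               system_reserved_mib: int,
               containers: List[Container]) -> dict:
    """Flag a node whose workloads could crowd out system services.

    allocatable_mib and system_reserved_mib are illustrative inputs,
    not values read from any real cluster.
    """
    # Containers with no limit can consume memory without bound.
    unbounded = [c.name for c in containers if c.memory_limit_mib is None]
    # Sum only the limits that were actually declared.
    limits_total = sum(c.memory_limit_mib or 0 for c in containers)
    # Memory left for workloads once system services get their share.
    budget = allocatable_mib - system_reserved_mib
    overcommitted = limits_total > budget
    return {
        "unbounded": unbounded,
        "limits_total_mib": limits_total,
        "budget_mib": budget,
        "overcommitted": overcommitted,
        "at_risk": bool(unbounded) or overcommitted,
    }


if __name__ == "__main__":
    report = audit_node(
        allocatable_mib=4096,
        system_reserved_mib=1024,
        containers=[Container("api", 2048), Container("worker")],
    )
    print(report)
```

In this made-up example the declared limits fit within the budget, yet the node is still flagged as at risk because one container has no limit at all, which is the situation the Microsoft statement describes.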

An individual posting under the name QiKe, claiming to be an engineering lead on AKS, offered a similar explanation in a post to Hacker News.

Deva is not the only AKS customer to report misadventures. Colin Jemmott, senior data scientist at Seismic Software, observed via Twitter, "This matches my experience with @Azure managed Kubernetes (AKS)."

In late June, Wojciech Barczyński, a senior software engineer at SMACC, a deep learning and finance biz, described a number of issues that arose using AKS. He hasn't jumped ship, but he advises people to skip "first bumpy GA months" and wait until the service becomes more stable.

"The AKS team gets more and more experience with time and the growing number of clients," he observed. "So, the service improves fast."

At the same time, AKS has fans. One person chiming in on the Hacker News thread remarked, "I've had wildly different results. My shop wasn't large by any means but Azure worked pretty much perfectly for us."

We should all be so fortunate. ®




Biting the hand that feeds IT © 1998–2018