November 2, 2018: It could have been just another Friday morning for Murali Krishnan. But it wasn’t! This Friday started with a bang – an alert email and a text message informing him of a Code Red situation at work.
“App outage. All systems down. Houston, we have a problem.”
Murali was then the VP of Digital Platforms (Software Engineering) at Starbucks. He had been party to a few systemic glitches before. Nothing matched the intensity of this November 2.
The elated marketing division at the coffee giant was launching a planned new line of holiday drinks and cups. Millions of doting, loyal customers rushed to be the first to get their hands on the freshly introduced coffees and cups, taking down the app in the process. The engineering team had anticipated the holiday rush. They had updated the architecture and increased the capacity to support the load.
For retail companies, the holidays are when Wall Street and Main Street join forces, either to elevate a brand or to plunge it into despair. The direction depends on the merit points the business earns for itself in the run-up to the holidays.
Swathes of eager customers welcome new brands into their consideration if their current favorites slow down. Holiday results also weigh heavily on the sentiment around the stock well into the New Year. A small deviation in earnings per share is enough for the stock to lose millions in market capitalization within minutes. Starbucks, with all the right intentions, ran into challenges with mobile orders on this day.
Murali and team would not have it.
About systems and scale
By midday, things had started to cool down a bit. Traffic had subsided. Systems slowly recovered from the shocks they had to bear early in the morning.
The post-mortem had begun.
“We were looking at a web that binds billions of requests from millions of users against thousands of stores. There was a complex software system of front-end, back-end, middleware, and store edge systems. A 50-member crew, who specialized in their niches, huddled in the aftermath. We started analyzing our engineering foundations, like system diagrams and software documentation,” says Murali. There were many theories on what caused the failures.
Murali recalled, “I remember one of our MVP engineers being on a vacation in Vietnam. Unfortunately, she had to log in remotely to investigate.”
He continued, “During our reflections in the postmortem, I realized we didn’t have the quality of the conversations we wanted. In scenarios like these, I have always relied on the 6-Sigma approach to uncover the root causes of failure.”
We probed him more on his interpretations of quality conversations.
“You see, our system diagrams were outdated. The team of 50, they were experts in their fields, but their knowledge was in their brain, not on paper. Also, our metrics were not standardized. From the HQ, we could see that one store had 1000 orders waiting in a queue outside. The store would tell us they were expecting just 300. Not that the team wasn’t working hard, but their directions were not aligned. Everybody was speaking a language that only they could understand. Communication protocols were outdated. The heroic moments like these need a 24, 48, 72 hour communication SLA, within which we could share the root causes of the failure, with executives. The absence of a guiding framework meant that we were slipping behind.”
Through the sheer might of engineering will, the team was able to restore all systems within 72 hours. The residual impact lingered on in Murali’s mind.
Enter a continuous learning paradigm
There were many techno-functional reasons behind the disruption. More importantly, this was a learning opportunity, according to Murali.
Some key conclusions Murali arrived at after the postmortem exercise were:
- On-call engineers had to be rotated. Before the outage, they had only one engineer serving these calls. The result was silo-ed information.
- The engineering unit of more than 200 people had to exchange memos on critical incidents with everybody. This ensured the organization had a detailed view of the root causes behind all failures.
- Not just ICs, Directors and VPs also needed to master the inner workings of the systems and the ways in which they interacted.
- The systems were complex and therefore needed better documentation and collaboration to solve problems.
Many industry pundits have observed that working in a dynamic engineering environment is both funny and challenging. By the time you finish lunch, the fast-changing nature of the business ensures that the snapshot of the systems you had before lunch has changed dramatically.
This meant continuous learning was the only legitimate way to transform the engineering culture. The static nature of mere classroom training was not sufficient for the dynamic, inherited, and complex engineering workflows. Worse yet, despite several online courses being available, none would train his team in company-, product-, and process-specific skills.
Learning experts often observe that when people learn in a hurry, they tend to forget in a hurry too.
As in all visionary learning programs, stakeholders needed a clear hypothesis of how point A leads all the way to point B. In Starbucks’s context, points A and B were extremely black and white.
The 4 needle-moving metrics were:
1: Number of orders processed through digital channels (web + mobile)
2: Hours of downtime
3: Number of failures
4: Mean time to failure (improvement from days → weeks → months)
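As a rough illustration of how the four metrics above could be computed from an order log and an incident log, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Incident` record, the function names, the channel labels, and the sample data are hypothetical and do not describe Starbucks's actual tooling or figures. Mean time to failure is approximated here as total uptime in a reporting window divided by the number of failures.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical channel labels; "web" and "mobile" count as digital.
DIGITAL_CHANNELS = {"web", "mobile"}


@dataclass(frozen=True)
class Incident:
    """One outage, from the moment it starts until service is restored."""
    start: datetime
    end: datetime


def digital_order_count(order_channels: list[str]) -> int:
    """Metric 1: number of orders placed through digital channels."""
    return sum(1 for channel in order_channels if channel in DIGITAL_CHANNELS)


def downtime_hours(incidents: list[Incident]) -> float:
    """Metric 2: total hours of downtime across all incidents."""
    return sum((i.end - i.start).total_seconds() for i in incidents) / 3600


def failure_count(incidents: list[Incident]) -> int:
    """Metric 3: number of failures."""
    return len(incidents)


def mean_time_to_failure_days(
    incidents: list[Incident], window_start: datetime, window_end: datetime
) -> Optional[float]:
    """Metric 4: average uptime (in days) per failure within the window.

    Returns None when there were no failures, since the average is undefined.
    """
    if not incidents:
        return None
    window_hours = (window_end - window_start).total_seconds() / 3600
    uptime_hours = window_hours - downtime_hours(incidents)
    return uptime_hours / len(incidents) / 24


# Made-up sample data, for illustration only.
sample_orders = ["mobile", "web", "in_store", "mobile"]
sample_incidents = [
    Incident(datetime(2018, 11, 2, 8, 0), datetime(2018, 11, 2, 14, 0)),
    Incident(datetime(2018, 11, 20, 10, 0), datetime(2018, 11, 20, 12, 0)),
]
window = (datetime(2018, 11, 1), datetime(2018, 12, 1))

print("digital orders:", digital_order_count(sample_orders))
print("downtime hours:", downtime_hours(sample_incidents))
print("failures:", failure_count(sample_incidents))
print("MTTF (days):", mean_time_to_failure_days(sample_incidents, *window))
```

Reporting all four numbers from one shared definition is one way to get the consistent, standardized metrics the postmortem found missing; the exact definitions a team picks matter less than everyone using the same ones.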
To improve these metrics, Murali was acutely aware that Starbucks had to adopt some foundational principles: a standardized documentation framework, updated system diagrams that reflect the architectural realities of today, and consistent engineering metrics reported across the board.
He also knew that each new product release, bug fix, patch and upgrade is an opportunity to learn, reflect and communicate. He now wanted his engineers to be everyday learning nodes for their teams.
Introduction to marginal gains
“Murali, I think you might find this interesting.”
Dawn Larsen, VP, TechOps at Starbucks, was Murali’s comrade in arms. Since that fateful crash day, Murali and Dawn had been working together to discover an orchestration schema for their continuous learning agenda.
That’s when Dawn stumbled upon the inspirational story of Dave Brailsford, who applied the construct of ‘marginal gains’ to lead the British cycling team to their first Tour de France victory in 100 years, and their first Olympic Gold since 1908.
During the ten-year span from 2007 to 2017, with Brailsford at the helm, British cyclists won 178 world championships and 66 Olympic or Paralympic gold medals, and captured 5 Tour de France victories in what is widely regarded as the most successful run in cycling history.
But what was this marginal gains concept?
The math behind marginal gains is profound – if you improve by 1% every day for a year, you’ll end up 37 times better by the time you’re done.
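The arithmetic behind that claim is simple compounding: a 1% daily gain multiplies performance by 1.01 each day, so after 365 days the factor is 1.01 raised to the power 365. A minimal Python sketch to check it follows; the function name `compound_factor` is an illustrative choice, not something taken from the book.

```python
def compound_factor(daily_change: float, days: int) -> float:
    """Return the overall multiplier after compounding a daily change.

    daily_change is a fraction, e.g. 0.01 for a 1% daily improvement
    or -0.01 for a 1% daily decline.
    """
    return (1 + daily_change) ** days


# 1% better every day for a year: roughly 37.78 times the starting level.
print(round(compound_factor(0.01, 365), 2))

# 1% worse every day for a year: close to zero (about 0.03).
print(round(compound_factor(-0.01, 365), 2))
```

The second call shows the flip side of the same formula: small daily declines compound just as relentlessly as small daily gains.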
Here’s an excerpt from “Atomic Habits” that explains how the concept worked out in the cycling team.
“Brailsford and his coaches began by making small adjustments you might expect from a professional cycling team. They redesigned the bike seats to make them more comfortable.
When these and hundreds of other small improvements accumulated, the results came faster than anyone could have imagined.”
Murali was hooked. The more he researched the marginal gains concept, the more he was convinced that the approach would reap sweet fruit for his endeavor. 1% gains were synonymous with continuous learning, and incremental improvements were his key to compounding performance enhancement.
And thus began the journey that would lead to a zero-outage environment in 2019, 2020 and 2021.
Partnering with L&D
“When we shared our dilemmas with the learning team, the first thing that came to us was to review all existing training material. After auditing the material for a day, we realized there were two big issues with existing literature. One, all material was authored for the purpose of classroom training only, and two, it was dated.”
Having concluded that just classroom training would not be conducive to Murali’s vision of continuous learning, he started with baby steps towards everyday marginal gains.
“I started writing weekly emails to the team. These emails focused on our goals to increase orders & reduce downtime hours.”
“The email also articulated all the highlights and lowlights of the week. New initiatives were celebrated, and an orderly compliance framework was established for Site Reliability Engineers.”
Next up were weekly workshops around various engineering themes relevant to Starbucks’s systems. Murali, along with the L&D team, ensured there was sufficient buzz around each workshop.
“These workshops were methodical reflections of the week that went by. Outages hadn’t stopped, but each instance was capsuled into bite-sized learnings, which were delivered via workshops. The way we built a climax around these events was four-pronged. The workshop is just around the corner, the workshop has started, the workshop has so many participants, and finally, the workshop is over – here are the learnings.”
A few days down the road, these workshops were coupled with weekly service learnings with on-call engineers. These have strong parallels to peer code reviews, but for Site Reliability operations.
Murali is especially proud of one initiative – Ask Murali Anything.
“You see, Starbucks, being the coffee innovator, had a cultural pillar that united employees: a coffee tasting ritual. Once every fortnight, employees would cheerfully gather around a round table and take sips of the latest brew on offer. We found an opportunity to turn this into an AMA, but with a twist. We named it ‘Ask Murali Anything’.”
“In every AMA, we redesigned our goals based on our progress last week. We celebrated small and large wins, made heroes out of on-call engineers, and micro-wins were celebrated as if we scaled Everest…”
“…what started as sparsely attended AMAs turned into full-houses soon.”
There is sufficient evidence that marginal gains were a viable strategy in Murali’s arsenal for building a sturdy engineering culture. For the next three years, Starbucks never faced an app failure during peak season. Would such a transformation have been possible had Murali opted for the tried-and-tested route of conventional classroom learning? That is open to debate. For now, it’s safe to say that everyday 1% improvement paid off in both principle and profit for Starbucks.