Microsoft’s DevOps story

Posted by: Sandeep Chadda , on 3/22/2018, in Category DevOps
Abstract: Learn how Microsoft built VSTS (now known as Azure DevOps) based on the DevOps learning of its users, and get inspired to implement DevOps in your organization.

This is a story about the Cloud & Enterprise engineering group at Microsoft that encompasses products & services such as PowerBI, Azure, Visual Studio, Visual Studio Team Services (now Azure DevOps), Windows Server, and System Center. It is a story about transformation of Microsoft's culture, organization structure, technology, and processes to allow enterprises to deliver faster, continuously learn, and experiment.

Editorial Note: VSTS has now been renamed to Azure DevOps. Please read "Azure DevOps" wherever you encounter VSTS. Read more at:

"VSTS is now Azure DevOps. What has changed and why?" 

This is also a story of 84,000+ Microsoft engineers (as of Feb 2018) who have moved to Visual Studio Team Services (VSTS, now known as Azure DevOps) to create unparalleled value for their customers using DevOps.

DevOps – Frequently Asked Questions

Here are a few frequently asked questions about DevOps that this article attempts to answer:

What is DevOps?

  • Where do we start with DevOps?
  • Is Agile-based product management enough for DevOps?
  • If we continuously deploy, would that constitute DevOps?
  • Can we achieve DevOps in 1-2 years?

Organization changes

  • Should DevOps be a top-down or a bottom-up approach?
  • Do we need to modify our team structure to achieve DevOps?
  • Can the teams continue to operate as-is without any major changes, or do we need to revisit the org structure?

Engineering investments

  • How do we ensure that our teams can write code and deploy it faster? How would testing look in this new world of faster delivery?
  • How do we work on features that span multiple sprints?
  • How do you expose features to a few early adopters to get early feedback?
  • If there are many teams working on a product, how do they make sure that the product is always shippable?
  • How does integration testing work in a DevOps enabled engineering system?
  • How long does it take you to deploy a change of one line of code or configuration?

How do we measure success?

  • Our intent was to experiment, learn, and fail fast, but how do we measure failure or success?
  • If we want to ensure we deploy often, then how do we measure the service metrics?

PS: Some of the anecdotes used in this article are from the author's own experiences at Microsoft.

 

Microsoft’s Mindset

I start with Microsoft’s mindset.

Though this is not the first thing that we tackled at Microsoft, it is important to understand the mindset of the Microsoft employees to better appreciate the transformation in the organization.

As Satya Nadella writes in his book "Hit Refresh", the new Microsoft is built on the scaffolding of a growth mindset and collaboration.

There is a transformation from "I know everything" to "I want to learn everything".

All employees at Microsoft are encouraged every day to experiment, learn, and continuously improve. It is a transformation from celebrating success to celebrating failure, learning from those mistakes, and course correcting to add value for customers. This has fostered a start-up culture within Microsoft which lives and breathes on maximizing user value.

Now, you really can't experiment much and react fast if people are not thinking that way, if the processes are not tuned for it, and if the toolset is not an enabler.

This process, toolset, and mindset is what I refer to as DevOps.

Agile Transformation | Kanban

Across Microsoft and particularly within the Cloud + Enterprise engineering group, we have been focusing on shipping software more frequently. Therefore, it was critical for us to gradually transform from a traditional waterfall model with lots of phases and a multi-year release cycle, to a rapid 3-week sprint cycle.

So, let's first understand why Agile?

Earlier, we planned our product releases many years in advance and tried to anticipate what customers would be asking for by the time the product shipped. Time and again we found ourselves answering genuine feature requests from our customers with an ETA (estimated time of arrival) several years out. This was not unusual, since the whole software industry worked this way. The question, however, was whether there was scope for improvement and whether there was wastage in the system.

We frequently found ourselves in uncomfortable territory where a genuine customer request would only be fulfilled after a few years, since our release cycles were years apart. Adhering to release deadlines was an interesting challenge as well.

We soon realized that the biggest issue with the waterfall model was that we were delaying the value that we delivered to our customers. Our first priority was to reduce the cycle time from ideation to delivering customer value, and thus we adopted Agile.

We trained everyone on Agile and these trainings continue to happen for those joining the organization afresh. The key focus has been on the flow of value to our customers.

Agile did not happen to us overnight. When we first moved to Agile, we experimented with two-week and four-week sprints. We also experimented with 5 sprints of three weeks each followed by a stabilization sprint. This experiment had some interesting outcomes.

stabilization-phase

Figure 1: Experiment with stabilization phase was not successful

Because of the stabilization phase, one of the teams, "Team A", continued to develop features and kept its debt in check. Another team, "Team B", went overboard with feature work and planned to reduce its debt in the stabilization phase. You would assume that "Team B" worked to get its technical debt down in the stabilization phase while "Team A" continued on the path of feature development, but it did not turn out that way.

Even "Team A" had to reduce "Team B's" debt.

Now imagine what "Team A's" motivation to do the right thing would be. We were almost penalizing "Team A". We learnt quickly from this experiment and got rid of the explicit "stabilization phase". Now, the teams are expected to keep the debt in check on an ongoing basis with each sprint. We have a strong foundation of measuring team health to ensure teams do not pile on debt.

I will talk about this later in the “Build measure learn” section of this article.

Team structure and planning

We follow Scrum, but we customized it to fit the scale that we needed. For example, a team is often referred to as a feature crew that is responsible for owning a product feature. Just to give an example, we have a feature crew for “Kanban boards” and another that manages “pull requests” and all Git-related workflows.

A feature crew (FC) typically comprises a product owner and the engineering team. All FCs in Microsoft are driven by the dual principle of team autonomy and organizational alignment.

Team Autonomy: It is the freedom that each team in Microsoft enjoys in defining its own goals to maximize customer impact.

Organization Alignment: It is the alignment of the team’s goals with the organization’s objectives. Think of it as the glue between operations and strategy that acts as the guiding light for each Microsoft employee.

For example:

A feature crew owns its own backlog. It decides what will be shipped in the next few sprints. It decides the backlog based on the conversations that each team is having with its customers and partners. This is team autonomy.

At the same time, all feature crews within a product are aligned to speak the same taxonomy in the organization. For example, as I write this article, all 45 feature crews for Visual Studio Team Services are on Sprint 131. We all know what we will ship in sprints 132 and 133, and this is the information we share with our customers in our release notes and plans. This makes our processes efficient when managing conversations with our customers or other teams.

Reference: Sprint 130 release notes

https://docs.microsoft.com/en-us/vsts/release-notes/2018/feb-14-vsts

All feature crews are multidisciplinary and typically comprise 8-12 engineers, 1 engineering manager, and 1 program manager. Most feature crews stay together for 1 to 1.5 years, and many stay together much longer. Each feature crew also has a support structure, which may comprise user experience designers, researchers, and others who work alongside the feature crew.

microsoft-crew-redmond

Figure 2: A typical feature crew seated together in a team room - Microsoft, Redmond office

While feature crews are formed with a specific objective, there are moments when another feature crew may be asked to balance work for a short duration.

A feature crew always sits together in a team room. Think of a team room as a dedicated room where all members of the feature crew sit together and can talk freely. Each team room has a focus room for quick huddles of 3-4 members. We encourage conversations in the team room, but focus rooms allow a focused conversation when the whole team does not need to be involved or disturbed. Each team room has a television that displays the team's Kanban board or important metrics that the team wants to track. The Kanban board comes in handy during stand-ups, when we simply stand up in our rooms to discuss the day and the work.

The product manager is the thread between customers and the feature crew. She is responsible for ensuring the product backlog is up to date. Every sprint, we have a ritual of sending a sprint email to all feature crews and leadership. Each sprint mail typically contains 3 sections:

1. What we did in this sprint

2. What we plan to do in the next sprint

3. Sprint videos celebrating the value we delivered in the sprint

Here is mine from a few sprints back:

sample-sprint-email

Figure 3: Sample sprint email including customer value delivered to our customers

The planning exercise is very lightweight. The feature crew is driven by an 18-month vision, which lays the foundation or objective of the feature crew's stint. This 18-month epic may be split into chunks of features or spikes that are the fundamental drivers for a team's backlog. The 18-month epic is typically split into 6-month planning cycles.

After every three sprints, the feature crew leads get together for a “feature chat” to discuss what is coming up next. These feature chats are also shared with leadership and serve to keep feature crew priorities in alignment with overall product objectives. Because we plan this way, you will often hear feature crews say that a certain feature ask is not in the next 3-sprint plan or 6-month plan.

The feature crew then delivers features every sprint. With each sprint we communicate to our customers regarding the work completed and this also serves as our feedback channel.

User voice - When the feature crew completes a feature from the user voice you will typically see comments updated by the product manager

You can view the Wiki full text search user voice here:

https://visualstudio.uservoice.com/forums/330519-visual-studio-team-services/suggestions/19952845-wiki-fulltextsearch

vsts-user-voice

Figure 4: VSTS User Voice that helps capture user feature requests (sample: Wiki search)

Release notes - All feature crews update release notes for all features that get shipped every sprint

https://docs.microsoft.com/en-us/vsts/release-notes/2017/nov-28-vsts#wiki

wiki-search

Figure 5: Release notes published publicly (sample: wiki search)

Blogs - All major announcements are blogged as well. Here’s an example:

https://blogs.msdn.microsoft.com/devops/2017/12/01/announcing-public-preview-of-wiki-search/

blog-post-snippet

Figure 6: Blog post snippet (sample: wiki search)

This may look like a lot of work, but it is not, because all this information flows right into my work items on the Kanban board in VSTS. I can prioritize work based on the User Voice feedback for each feature, and my release notes flow from the work item into the public forum. I can therefore get feedback from my customers and reach out to them just by updating my work item.

wiki-search-work-item

Figure 7: Wiki search work item shows integration with user voice and release notes

 

 

Engineering discipline and branching strategy

Earlier, we had an elaborate branch structure, and developers could only promote code that met a stringent definition. Promotion went through gated check-ins that included the following steps:

  1. “Get latest” from the trunk
  2. Build the system with the new changesets
  3. Run the build policies
  4. Merge the changes

The complex branch structure resulted in several issues:

  1. Code sat unmerged in branches for long periods
  2. Significant merge debt built up
  3. A massive number of merge conflicts had accumulated by the time code was ready to be merged
  4. The reconciliation process was long

There was a lot of wastage in the system. Eventually we flattened the branch structure and issued guidance on the use of temporary branches. We also minimized the time between a check-in and the changeset becoming available to every other developer. I talk more about this in the “Continuous everything” section.

The next step was to move to Git. This move from centralized version control to distributed version control was a big step in our journey. Typically, the code flows as follows:

  1. A work item is created for a bug or a user story
  2. Developer creates a topic branch from within the work item
  3. Commits are made to the branch and then the branch is cleaned up when the changes are merged into the master.

This way all committed code lives in the master branch, and the pull-request workflow combines both code review and the policy gates.

work-item-status

Figure 8: Work item showing commit, PR, build, and release status from the work item itself.

This ensures 100% traceability of work while keeping merges easy, continuous, and small. It not only keeps the master branch pristine but also means the code is still fresh in everyone’s mind when it merges, which helps in resolving bugs quickly.

Control the exposure of the feature

One problem that we faced going down this route of short-lived branches and a per-sprint deployment cadence was: "How do I manage master when I am working on a big feature that spans multiple sprints?"

We had two options:

1. Continue to work on the feature on a separate feature branch and merge it to master when done. But that would take us back to the days of merge conflicts, an unstable master, and long reconciliation times.

2. Use feature flags, which is the option we chose. A feature flag is a mechanism to control production exposure of any feature to any user or group of users. As a team working on a feature that spans multiple sprints, we register a feature flag with the feature flag service and set the default to OFF. This means that the feature is hidden from everyone. Meanwhile, the developers keep merging changes to the master branch, with the incomplete feature sitting behind the feature flag. This saves us the cost of late integration testing and merge conflicts when the feature eventually goes live. When we are ready to let the feature go live, we turn the flag ON in production or the environments of our choice.

You would often hear us say that a feature is currently being dogfooded and will be enabled for other customers soon. What this means is that we have turned ON the feature flag in our dogfood environment and are gathering feedback. The feature is nearly production ready. We will react to the feedback and turn ON the feature flag for the rest of the world soon. Once the feature flag (FF) is turned ON for everyone, we clean it up.

Feature flags are extremely powerful. They not only ensure that our master branch stays pristine, but also let us modify a feature without any visible production impact. A feature flag can also be an effective rollback strategy that does not require deploying any code to production.

By allowing progressive exposure control, feature flags also provide one form of testing in production. We will typically expose new capabilities initially to ourselves, then to our early adopters, and then to increasingly larger circles of customers. Monitoring the performance and usage allows us to ensure that there is no issue at scale in the new service components.
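The register, default-OFF, and progressive-exposure steps described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the actual VSTS feature flag service: the class `FeatureFlagService`, its methods, the ring names, and the flag name `wiki_search` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical exposure rings, ordered from smallest to largest audience.
RINGS = ["internal", "early_adopters", "everyone"]


@dataclass
class FeatureFlagService:
    """Toy in-memory flag registry (illustrative only)."""

    # Maps flag name -> widest ring that can see the feature (None means OFF).
    _flags: Dict[str, Optional[str]] = field(default_factory=dict)

    def register(self, name: str) -> None:
        # A newly registered flag is OFF for everyone.
        self._flags[name] = None

    def expose_to(self, name: str, ring: str) -> None:
        # Widen (or narrow) exposure without deploying any code.
        if name not in self._flags:
            raise KeyError(f"unknown flag: {name}")
        if ring not in RINGS:
            raise ValueError(f"unknown ring: {ring}")
        self._flags[name] = ring

    def turn_off(self, name: str) -> None:
        # Rollback: hide the feature again, still without a deployment.
        if name not in self._flags:
            raise KeyError(f"unknown flag: {name}")
        self._flags[name] = None

    def is_enabled(self, name: str, user_ring: str) -> bool:
        widest = self._flags.get(name)
        if widest is None:
            return False
        # A user sees the feature if their ring is within the exposed range.
        return RINGS.index(user_ring) <= RINGS.index(widest)


flags = FeatureFlagService()
flags.register("wiki_search")               # default OFF
flags.expose_to("wiki_search", "internal")  # dogfood first
print(flags.is_enabled("wiki_search", "internal"))   # True
print(flags.is_enabled("wiki_search", "everyone"))   # False
```

Application code would then guard the unfinished feature with a call such as `flags.is_enabled("wiki_search", ring_of_current_user)`, so merging to master never changes what customers see until the flag is widened.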

You can also try one of the extensions on our marketplace to manage feature flags:

https://marketplace.visualstudio.com/items?itemName=launchdarkly.launchdarkly-extension

Read more about feature flags

https://www.visualstudio.com/learn/progressive-experimentation-feature-flags/

Organization structure

Our organizational chart was arranged by discipline, i.e. PMs reported to PMs, developers reported to developers, and testers reported to testers.

As we got into smaller cycles and quicker iterations, keeping developers and testers as separate disciplines created barriers. We started seeing wastage in the system during the handshake between testers and developers.

Some of the issues that we faced were:

  • If the testers were out, developers didn’t know how to test
  • Some testers knew how to fix an issue, but they could not since they were “testers” and not "developers"
  • If the developer was out, the "tester" did not have much to do

All this created wastage in the system and slowed the delivery of customer value.

As a company, we made a strategic change to transform into combined engineering. We took the developer discipline and the tester discipline and put them together as one engineering discipline.

Now engineering is responsible for

  • how we build features
  • building them with quality

Now, we hire engineers who write code, test code, and deploy code. This has helped us reduce, or should I say remove, the unnecessary handshake that was causing severe delays in the system, especially when the need of the hour was to go fast and deliver faster.

PS: This strategy worked great for Microsoft. However, when you plan to implement similar strategies, you should first see whether you can identify wastage in your own system and then decide what best fits your organization's needs.

The old model left us with a lot of functional tests that took extremely long to run and were flaky. When developers were made accountable for testing code, we simultaneously invested in making the test infrastructure better and in making tests faster and more reliable. We soon established zero tolerance for flaky tests, and testing moved closer to the code. This is often referred to as SHIFT LEFT in Microsoft parlance.

This again ties back to the initial requirement to keep the master branch clean and shippable, and we were able to do so effectively by shifting our tests left. We now run more than 71,000 L0 tests with every PR that merges code into the master branch. It takes approximately 6 minutes 30 seconds to run all these tests.

test-results

Figure 9: There are more than 71k tests that run with each PR. Screenshot from VSTS.

The number was a little over 60,000 a few months back, so the rate of increase is significant. This level of automation helps us keep the master branch always shippable. We constantly measure the health of the master branch, as you can see in the image below.

release-branch-runs

Figure 10: We always track the health of our release branch and master branch. This widget is hosted on a VSTS dashboard.

Build measure learn

If you work at Microsoft, you will be inundated with acronyms. BML (Build measure learn) is one of them.

BML is both a process and a mindset.

When I joined Microsoft, I was told not to be afraid of taking risks and experimenting as long as I can build, measure, and learn. Also, I was told that making mistakes is fine if we expect, respect, and investigate them.

And that is the philosophy of BML as well. Whatever we build, we measure it and learn from it.

Build - The experiments we run are based on hypotheses. Once a hypothesis is established and accepted, we execute the experiment, taking care not to spend many engineering cycles on implementing it, since we would like to fail fast.

Measure - So what do we measure? Pretty much everything. When we create features or experiments, we measure as much as we can to validate availability, performance, usage, and troubleshooting. On any given day we capture almost 60GB of data and we use this data to enhance our features, to troubleshoot issues, or measure the health of our service.

Learn - Learning from what we measure is the most integral aspect of the whole journey.

For example, when I was creating VSTS Wiki (an easy way to write engineering and project related content in VSTS), I had to set an attachment size limit for adding content to Wiki. This limit would determine the size of the images and docs you can attach to a wiki page. I did not know what size of attachments users would prefer in a wiki page, so I experimented. I went with a meager 5MB attachment size, with the hypothesis that users would only add images and small docs to a wiki page. My hypothesis was invalidated in the first month after the Wiki launch: we started seeing a pattern where 90% of the attachments were less than 18MB, and that gave us the magical number for our attachment size. VSTS Wiki now supports attachments of up to 18MB, and most of my users are happy with it.
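The "90% of attachments were less than 18MB" observation is a percentile calculation over measured attachment sizes. Here is a minimal sketch of that arithmetic using the nearest-rank method. The function name and the sample sizes are made up for illustration; they are not the telemetry or tooling the team actually used.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Return the smallest value such that at least pct% of values are <= it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    # Nearest-rank definition: rank = ceil(pct / 100 * n), 1-based.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up attempted attachment sizes in MB (illustrative data only).
attempted_sizes_mb = [0.2, 0.5, 1, 2, 3, 4, 6, 9, 17, 40]

# A size limit that would accommodate 90% of attempted uploads.
print(percentile_nearest_rank(attempted_sizes_mb, 90))  # 17
```

Choosing a high percentile rather than the maximum keeps the limit from being driven by a handful of outliers, such as the single 40MB upload in the sample data.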

You can read about all these investments in the DID YOU KNOW section of my blog here: https://blogs.msdn.microsoft.com/devops/2018/01/12/link-wiki-pages-and-work-items-write-math-formulas-in-wiki-keyboard-shortcuts-and-more/

We also continuously monitor health. It could be the service health, team health, engineering health, or feature health.

TEAM HEALTH

Just to give some perspective, here is the engineering health of my team.

team-health-widget

Figure 11: Team health widget in VSTS Dashboards

As you can see, we are at 5 bugs per engineer, which is the maximum threshold we maintain for all feature crews. It means that I have to ask my team to stop all feature work and get our bug count as low as possible. By the time this article was published, we were down to 2 bugs per engineer.
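The bug cap rule is simple arithmetic, and a sketch makes the trigger explicit. The function names below are hypothetical, and the choice to pause feature work when the ratio reaches the cap (rather than only when it exceeds it) is my reading of the paragraph above, not a documented policy.

```python
BUG_CAP_PER_ENGINEER = 5  # maximum threshold quoted in the article


def bugs_per_engineer(open_bugs: int, engineers: int) -> float:
    if engineers <= 0:
        raise ValueError("a feature crew needs at least one engineer")
    return open_bugs / engineers


def should_pause_feature_work(open_bugs: int, engineers: int,
                              cap: float = BUG_CAP_PER_ENGINEER) -> bool:
    # Assumption: reaching the cap is enough to pause feature work.
    return bugs_per_engineer(open_bugs, engineers) >= cap


# A hypothetical crew of 10 engineers.
print(should_pause_feature_work(open_bugs=50, engineers=10))  # True
print(should_pause_feature_work(open_bugs=20, engineers=10))  # False
```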

USER HEALTH

We have integrated all user feedback from public forums such as https://developercommunity.visualstudio.com/spaces/21/index.html into VSTS. The pie chart below shows all VSTS Wiki feedback from my users. You can see that my user conversations are healthy, since we have only 11 feedback tickets, all of which are either fixed and pending a release or under investigation.

vsts-dashboard

Figure 12: VSTS Dashboard widget indicating user feedback health

USER VALUE FLOW RATE

The chart below shows the rate at which my team delivers user value. It shows a cycle time of 55 days per work item for bugs and stories. This is not as good as I would have liked, and I need to improve it.

team-cycle-time

Figure 13: VSTS Dashboard widget indicating team cycle time

While writing this article, I investigated and found that the cause is stale bugs that sit in the resolved state without being closed. I have added another widget on the VSTS dashboard to identify stale items on my board. This will help ensure that my cycle time does not stay as high as 55 days going forward.
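To make the relationship between cycle time and stale items concrete, here is a rough sketch of both calculations. The `WorkItem` fields, the 14-day staleness threshold, and the sample dates are assumptions for illustration; the VSTS dashboard widgets compute these from real work item history.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class WorkItem:
    title: str
    state: str                            # e.g. "Active", "Resolved", "Closed"
    started: date
    closed: Optional[date] = None         # set only when the item is closed
    state_changed: Optional[date] = None  # date of the last state change


def average_cycle_time_days(items: List[WorkItem]) -> Optional[float]:
    """Mean number of days from start to close, over closed items only."""
    durations = [(i.closed - i.started).days for i in items if i.closed is not None]
    if not durations:
        return None
    return sum(durations) / len(durations)


def stale_resolved_items(items: List[WorkItem], today: date,
                         max_days: int = 14) -> List[WorkItem]:
    """Items resolved more than max_days ago that were never closed."""
    return [
        i for i in items
        if i.state == "Resolved"
        and i.state_changed is not None
        and (today - i.state_changed).days > max_days
    ]


items = [
    WorkItem("Fix link", "Closed", date(2018, 1, 1), closed=date(2018, 1, 11)),
    WorkItem("Add search", "Closed", date(2018, 1, 1), closed=date(2018, 1, 31)),
    WorkItem("Old bug", "Resolved", date(2018, 1, 2), state_changed=date(2018, 1, 5)),
]
print(average_cycle_time_days(items))                                   # 20.0
print([i.title for i in stale_resolved_items(items, date(2018, 3, 1))])  # ['Old bug']
```

An item that lingers in the resolved state inflates the measured cycle time once it is finally closed, which is why surfacing stale items early helps bring the number down.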

Continuous everything

I already talked about continuously testing our code with every merge to master. You can call it continuous testing, which is powered by continuous integration. We also continuously deploy our code on the Monday after each sprint.

We always deploy our code to the canary instance, or dogfood environment, first. Our dogfood environment is as pristine as production, since that is the environment on which VSTS is built. If we have buggy code in VSTS, then we can't use VSTS to fix VSTS, so the stakes are high even for our dogfood environment.

devops-environment

Figure 14: We deploy to Ring0 (our canary instance) first and you can see that there is a deliberate delay of a day between Ring0 and Ring1 which is our next set of users.

We deliberately leave a few hours before continuing the deployment to other environments, so that feature crews have enough time to react to any issues. Typically, the deployment across all our VSTS users completes in 5-6 days, while we ensure 99.9% service availability.
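The ring-by-ring rollout can be sketched as a loop that deploys to one ring, checks health, and only then moves on. This is a simplified illustration: `roll_out`, the `deploy` and `is_healthy` callbacks, and the ring name `Ring2` are hypothetical, and the deliberate wait between rings is reduced to a comment.

```python
from typing import Callable, List, Sequence


def roll_out(rings: Sequence[str],
             deploy: Callable[[str], None],
             is_healthy: Callable[[str], bool]) -> List[str]:
    """Deploy ring by ring; stop at the first ring that looks unhealthy.

    Returns the rings that were deployed and passed the health check.
    """
    completed: List[str] = []
    for ring in rings:
        deploy(ring)
        # A real pipeline would wait here (hours or a day) so that
        # feature crews have time to react before the next ring.
        if not is_healthy(ring):
            break  # halt the rollout; later rings never receive the change
        completed.append(ring)
    return completed


deployed: List[str] = []
result = roll_out(
    rings=["Ring0", "Ring1", "Ring2"],
    deploy=deployed.append,
    is_healthy=lambda ring: ring != "Ring1",  # simulate a problem in Ring1
)
print(deployed)  # ['Ring0', 'Ring1']
print(result)    # ['Ring0']
```

Stopping at the first unhealthy ring limits the impact of a bad change to the smallest audience that has seen it.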

The goal of continuous everything is to shorten cycle time. Continuous Integration drives the ongoing merging and testing of code, which leads to finding defects early. Other benefits include less time wasted on fighting merge issues and rapid feedback for development teams. Continuous Delivery of software solutions to production and testing environments helps us to quickly fix bugs and respond to ever-changing business requirements.

There is much more to talk about regarding security, collecting data, our KPIs, live site culture, limiting the impact of failure using circuit breakers, and so on. I would love to get feedback on this write-up to see what else you would be interested in, and I will write about it. Consider this article my MVP of DevOps.

Take away

If you had the patience to read this far, then first I want to applaud you. Next, I want to leave you with some key takeaways:

This is Microsoft's journey, based on the gaps we found in our system, and the priorities were set based on what we felt was hurting us the most. If you or your customers are trying to onboard to DevOps, then it is important to understand your own needs and pain points. The problems or pain points should drive the solutions.

1. DevOps is a journey of continuous improvement and we are still on that journey. We have come a long way to be where we are, but we are still finding flaws and waste, and we improve on them … every day.

2. DevOps is about your people, processes, and tools, all steeped in the culture and mindset of the organization and driven by business goals. We need all these ingredients to come up with a great DevOps success story.

Who wrote this article?

An entire feature crew that agreed it was important for others to learn from the mistakes we made at Microsoft. Happy DevOps journey!

 

This article was reviewed by Roopesh Nair, Aseem Bansal, Ravi Shanker, Subodh Sohoni and Suprotim Agarwal.

This article has been editorially reviewed by Suprotim Agarwal.

Author
Sandeep is a practicing program manager with Microsoft | Visual Studio Team Services. When he is not running or spending time with his family, he would like to believe that he is playing a small part in changing the world by creating products that make the world more productive. You can connect with him on twitter: @sandeepchads. You can view what he is working on right now by subscribing to his blog: aka.ms/devopswiki/.