Azure DevOps Search – Deep Dive

Posted by: Imran Siddique , on 2/24/2020, in Category DevOps

Abstract: Azure DevOps Search (Search) service is one of the microservices of Azure DevOps that powers its search functionality and makes it easy to locate information across all your projects using just a web browser. This article discusses the inner workings of DevOps Search.

Search service of Azure DevOps makes it easy to locate information across all your projects, from any computer or mobile device, using just a web browser.

In this article, we will do a deep dive to see:

how the Azure DevOps Search service is designed,
how it functions,
the different capabilities it has to offer and
how DevOps Search can be leveraged to build more impactful enterprise apps.

Azure DevOps Search – Introduction

The Azure DevOps service consists of dozens of microservices communicating with each other to give the user a consistent and feature rich experience.

Azure DevOps Search (Search) service is one of the microservices of Azure DevOps that powers its search functionality.

The Search service supports searching different Azure DevOps entities such as code, work items, wikis, and packages. Its unique proposition comes from providing semantic relevance for query results and deep filters at query time. More examples and information can be found in the search documentation.

Figure 1: A sample code search result window

The Search service is a completely hosted solution that supports a scale of billions of documents, running into petabytes of index data spread across multiple Azure regions and Elasticsearch clusters. The platform also supports critical enterprise-ready features such as honoring security permissions, multi-tenancy, and GDPR compliance.

In this article, we will talk more about how the platform is architected to support both the search functionality as well as the service fundamentals at scale.

Why Azure DevOps Search?

Four of the biggest needs that Azure DevOps Search faced were:

1) Availability: Search is such an integral part of the product capability that there is an inherent need to ensure it is almost always available

2) Scale: With the large scale of users and usage in Azure DevOps, scale was always at the back of our minds (I am from the Azure DevOps Search team) while we were designing the architecture

3) Performance: While achieving high availability and tremendous scale, we could not compromise on performance, and we wanted most of the important queries to return in under a second

4) Complexity: The way users search for code and work items is very different from an average search request. Search supports several complex scenarios. For example, in code search you can find a code file based on a comment you wrote in it by typing “comment:todo”, and in work item search you can find a bug, user story, or feature based on its assignment, state, creation time, and the thousands of other filters that you would associate with a work item.

Azure DevOps Search – Architecture

The Search service platform is based on a common framework layer that powers all the other Azure DevOps services. The index data is maintained in Elasticsearch indices.

Search service has two major processing pipelines – Indexing pipeline and Query pipeline.

Indexing pipeline is the set of components that come together to support pulling content from other Azure DevOps services, processing it to add annotated semantic information, and pushing it to the Elasticsearch indices.

Figure 2: Azure DevOps Search higher level architecture

The Query pipeline provides a REST endpoint for the Azure DevOps portal and external tools to search. It performs key functions such as identity validations, authorization checks and retrieving the relevant and accessible content from the Elasticsearch index. The Elasticsearch indices themselves are hosted on Azure VMs that handle the ingestion and query of documents.

Indexing pipeline

Figure 3: Index Pipeline workflow

Crawlers

The first stage of indexing is to trigger the crawling of the contents (code in the case of code search, work-item details in the case of work-item search, and so on) once an account is onboarded. Onboarding, in the context of an Azure DevOps account, means enabling the search functionality for the users of the account.

Multiple types of crawlers are available and hosted in the Search service. Each entity has its own crawler implementation, and for some entities there is more than one implementation.

Crawling happens on Azure job agents, which are Azure worker roles where all our background processing happens. The crawling either happens in chunks (split across multiple executions of the same job) or in a single execution of the job. This is to ensure fairness across multiple accounts running in parallel and ensuring resources are utilized efficiently.
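
The chunked mode can be pictured as a resumable job: each execution crawls a bounded number of items and records where the next execution should resume. The following is a minimal Python sketch under that assumption. The names `crawl_chunk`, `crawl_all` and `chunk_size`, and the idea of crawling a plain list of paths, are hypothetical and only illustrate the pattern; they are not the service's actual implementation.

```python
from typing import List, Optional, Tuple


def crawl_chunk(
    paths: List[str], start: int, chunk_size: int
) -> Tuple[List[str], Optional[int]]:
    """Crawl at most chunk_size items beginning at index start.

    Returns the crawled items and the index to resume from,
    or None when the whole entity has been crawled.
    """
    end = min(start + chunk_size, len(paths))
    crawled = paths[start:end]
    next_start = end if end < len(paths) else None
    return crawled, next_start


def crawl_all(paths: List[str], chunk_size: int) -> List[List[str]]:
    """Simulate repeated executions of the same job until nothing is left."""
    chunks = []
    position: Optional[int] = 0
    while position is not None:
        crawled, position = crawl_chunk(paths, position, chunk_size)
        chunks.append(crawled)
    return chunks
```

Because each execution returns after one chunk, the job agent is free to serve another account before the next chunk of the same account is picked up.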

Incremental crawling is triggered using notifications whenever there are changes in the system. For instance, in case of code search whenever there are pushes or changes on a given repository, there is a notification which is sent from Version Control service of Azure DevOps to Search service. Search service then reacts to the notification and retrieves the contents of the code files that are changed.

Incremental crawling can also be processed in chunks. Once the contents are crawled, they are processed by the next layer – parsing.

Parsers

Once the contents are available from the crawler, the documents are passed through the parsing layer. In this phase, the documents are parsed from different angles to extract more meaningful information, which is then indexed as well.

For example, in case of code search, Search service uses language specific parsers for C, C++, Java and C# that generate the partial Abstract Syntax Tree (AST) for each file in the repository during indexing time. Parsers take the bare files and generate semantic token information for the file and add them to the document content that needs to be indexed.

For example, when a C++ code file is being processed, the class and method tokens within the file are also parsed and added to the document mapping information for Elasticsearch. The document mapping for code files in Elasticsearch today holds not just the content of the file, but also per-term code token information.

Parsers run out-of-process to ensure isolation (for security reasons) as well as the ability to host language specific runtimes. Parsing failures cause fallback to text parsing to ensure that the file is still text searchable.
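
The fallback behavior can be sketched as follows. This is an illustration only: it uses Python's built-in `ast` module as a stand-in for a language-specific parser (the service itself parses C, C++, Java and C#), it runs in-process rather than out-of-process, and the document shape (`content`, `words`, `code_tokens`, `parser`) is a hypothetical one chosen for the example.

```python
import ast
import re
from typing import Dict, List


def parse_python_tokens(source: str) -> Dict[str, List[str]]:
    """Extract class and function names with Python's own ast module."""
    tree = ast.parse(source)  # raises SyntaxError on malformed code
    tokens: Dict[str, List[str]] = {"class": [], "method": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            tokens["class"].append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            tokens["method"].append(node.name)
    return tokens


def parse_document(source: str) -> Dict[str, object]:
    """Return an indexable document; fall back to plain text on failure."""
    doc: Dict[str, object] = {
        "content": source,
        "words": re.findall(r"\w+", source),
    }
    try:
        doc["code_tokens"] = parse_python_tokens(source)
        doc["parser"] = "semantic"
    except (SyntaxError, ValueError):
        # The file stays text searchable even though no AST could be built.
        doc["code_tokens"] = {}
        doc["parser"] = "text"
    return doc
```

A file that fails to parse still produces a document with its raw content and words, so a plain text query can find it; only the semantic filters are unavailable for that file.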

Feeders

Once the parsed content is available, the documents are fed into the Elasticsearch indices via the feeders. Feeders convert the parsed content into an Elasticsearch compatible mapping, batch multiple parsed files into an Elasticsearch indexing request, and index them.

To ensure that the cluster doesn’t get overwhelmed with a huge set of indexing requests at the same time, there are throttling mechanisms to control the indexing throughput across multiple job agents.
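
As a rough illustration of the batching step, the sketch below groups parsed documents into request bodies in the newline-delimited JSON format accepted by the Elasticsearch bulk API (an action line followed by a source line for every document). The function name, the `id` field and the index name are assumptions made for the example, and throttling is deliberately left out.

```python
import json
from typing import Dict, Iterable, Iterator, List


def to_bulk_bodies(
    docs: Iterable[Dict[str, object]], index: str, batch_size: int
) -> Iterator[str]:
    """Yield newline-delimited JSON bodies, one per bulk request."""
    lines: List[str] = []
    count = 0
    for doc in docs:
        # Action line: which index and document ID to write to.
        action = {"index": {"_index": index, "_id": doc["id"]}}
        # Source line: the document itself, without the hypothetical id field.
        source = {k: v for k, v in doc.items() if k != "id"}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
        count += 1
        if count == batch_size:
            yield "\n".join(lines) + "\n"  # bulk bodies end with a newline
            lines, count = [], 0
    if lines:
        yield "\n".join(lines) + "\n"
```

A throttle would sit around the code that sends these bodies, for example by limiting how many requests may be in flight across job agents.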

Query pipeline

The search experience is available for Azure DevOps users both from the Azure DevOps portal as well as the REST APIs. The Azure DevOps portal experience is built on top of these REST APIs exposed by the Search service.
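
For readers who want to call the REST API from their own tools, here is a minimal sketch that only builds a code search request and does not send it. The host name, route and body fields follow the public Code Search REST reference as best understood at the time of writing; the `api_version` default, the organization, project and repository names are assumptions and should be checked against the current documentation.

```python
import json
from typing import Dict, List, Tuple


def build_code_search_request(
    organization: str,
    project: str,
    search_text: str,
    filters: Dict[str, List[str]],
    top: int = 25,
    api_version: str = "7.0",  # assumption: verify against the REST reference
) -> Tuple[str, str]:
    """Return (url, json_body) for a code search POST request."""
    url = (
        f"https://almsearch.dev.azure.com/{organization}/{project}"
        f"/_apis/search/codesearchresults?api-version={api_version}"
    )
    body = {
        "searchText": search_text,
        "$skip": 0,
        "$top": top,
        "filters": filters,
    }
    return url, json.dumps(body)
```

For example, `build_code_search_request("fabrikam", "MyProject", "comment:todo", {"Repository": ["MyRepo"]})` produces a request for TODO comments in one repository. Sending it requires an authenticated POST, typically with a personal access token.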

The incoming search requests go through multiple processing stages such as validation and transformation. The request is first validated to ensure the information available is correct, supported, and meets all the security and throttling criteria. The request is then transformed and optimized so that it carries information about the index and shard where the search will happen, the filters that need to be applied, the boosting that will be carried out, the fields that will be retrieved, and so on.

Figure 4: Query Pipeline workflow

Search service supports full fidelity read permission on searches!

This means that even a preview of a search result is not shown if the user doesn’t have permission to read it. This is supported for queries that are scoped across multiple projects and repos as well. The results returned from Elasticsearch are filtered to ensure that only the results the user has access to are returned.
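
The result filtering step can be pictured as a simple post-filter over the hits. The sketch below assumes, purely for illustration, that each hit carries `project` and `repository` fields and that the caller's readable (project, repository) pairs have already been resolved; how the service actually evaluates permissions is not described here.

```python
from typing import Dict, List, Set, Tuple


def trim_results(
    hits: List[Dict[str, str]], readable: Set[Tuple[str, str]]
) -> List[Dict[str, str]]:
    """Keep only hits whose (project, repository) the caller can read."""
    return [
        hit for hit in hits
        if (hit["project"], hit["repository"]) in readable
    ]
```

Hits from repositories outside the readable set are dropped entirely, so neither their file names nor their previews reach the caller.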

Search service supports queries scoped at different levels like account / project for most of the entity types and in some cases even more granular scopes like repository/path. It also supports searching across multi-selected entity instances at the same time.

Elasticsearch based Indexing Platform

Figure 5: A typical Azure DevOps Elasticsearch cluster

Cluster topology

Elasticsearch indices are stored in Azure Premium storage blobs and supported via nodes hosted on Windows based Azure IaaS VMs.

Each Elasticsearch cluster contains 3 master nodes, 3+ client nodes (depending on the indexing and query load on the cluster) and 3+ data nodes (depending on the size of the indices). Our largest clusters have 80+ data nodes and have an index utilization (amount of indexed data that is queried) of ~70%.

To ensure Elasticsearch runs smoothly with Azure, Elasticsearch’s node allocation awareness attributes are configured to honor the availability sets (fault domain and update domain) within Azure. These settings ensure that a given set of primary + replica is always available during unplanned outages or planned upgrades.
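
As an illustration of what such a configuration might look like, the sketch below generates per-node settings. `cluster.routing.allocation.awareness.attributes` is a standard Elasticsearch setting and custom node attributes use the `node.attr.` prefix in recent versions, but the attribute names `fault_domain` and `update_domain` are assumptions chosen for this example, not the names used by the service.

```python
from typing import Dict


def node_settings(fault_domain: int, update_domain: int) -> Dict[str, str]:
    """Settings for one node's elasticsearch.yml, expressed as a flat dict."""
    return {
        # Hypothetical attribute names describing where the VM is placed.
        "node.attr.fault_domain": str(fault_domain),
        "node.attr.update_domain": str(update_domain),
        # Ask the allocator to spread shard copies across these attributes.
        "cluster.routing.allocation.awareness.attributes":
            "fault_domain,update_domain",
    }


def render_yaml(settings: Dict[str, str]) -> str:
    """Render the flat settings as 'key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in settings.items())
```

With attributes like these in place, Elasticsearch tries to avoid placing a primary and its replicas on nodes that share the same attribute values.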

Data nodes: 8 cores, 28 GB RAM, 56 GB SSD

Master nodes: 2 cores, 7 GB RAM, 14 GB SSD

Client nodes: 2 cores, 7 GB RAM, 14 GB SSD

Indices have primary + 2 replicas, with a quorum based write consistency model. Index refresh is set to a minute.
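
The arithmetic behind a quorum with one primary and two replicas is worth spelling out. The sketch below uses the generic majority formula; it illustrates the idea and is not a statement about how any particular Elasticsearch version evaluates write consistency.

```python
def write_quorum(replicas: int) -> int:
    """Majority of all shard copies (1 primary + replicas)."""
    copies = 1 + replicas
    return copies // 2 + 1


def can_survive(replicas: int, copies_lost: int) -> bool:
    """True if writes still reach quorum after losing some copies."""
    return (1 + replicas) - copies_lost >= write_quorum(replicas)
```

With two replicas there are three copies and the quorum is two, so writes keep succeeding when one copy is unavailable, for example during a planned upgrade of one update domain.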

The Search service has Elasticsearch clusters deployed in multiple Azure regions, with at least one cluster in each region supported by Azure DevOps. This helps ensure data sovereignty is honored, as the index data for accounts within a given Azure DevOps region is stored within the same region.

Index/Data model

The mapping for documents inside Elasticsearch contains some information that is similar for all entity types in the Search service and some which are entity specific. All the documents have metadata information like account/project they belong to. Each entity can have additional metadata information like the repository a document belongs to, in case of code.

Each document also has a set of information that uniquely distinguishes it from other documents. For instance, work-items have a work-item Id associated with them that uniquely identifies a work-item in an account. Similarly, a combination of branch name, file path, file name and content hash uniquely identifies a code file in a given repository of an Azure DevOps account. The document Id of the Elasticsearch document is built using some of the information mentioned above.
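
One simple way to build a deterministic document Id from such identifying fields is to hash them together. The sketch below is an assumption about how this could be done, not the scheme the service uses; the function name, the choice of SHA-256 and the inclusion of a repository identifier are all illustrative.

```python
import hashlib


def code_document_id(
    repository_id: str, branch: str, file_path: str, content_hash: str
) -> str:
    """Derive a deterministic document ID from identifying fields."""
    # A NUL separator cannot appear in the fields, so joins stay unambiguous.
    key = "\x00".join([repository_id, branch, file_path, content_hash])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Because the Id depends only on the identifying fields, re-indexing the same file version overwrites the existing document instead of creating a duplicate.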

The mapping also contains entity specific information that helps in enabling the search experience for the given entity type.

For instance, in case of code search, the code token information for a given term (say class “Foo”), along with its positional information is stored as a term vector payload in the index. The entire content of the file, including operators, is stored in the file content to support full text search.

Routing

The default index routing ensures that data in a single entity instance goes to the same shard, and wherever possible, data from multiple entity instances of a given account goes to the same shard as well. This doesn’t suffice for very large entity instances or accounts, which have millions of documents that can’t sit on the same shard. Based on different heuristics, when certain entity instances are deemed large, those instances are split across multiple shards to ensure shards don’t become too big.
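
The routing idea can be sketched as follows. Elasticsearch picks a shard by hashing the routing value modulo the number of primary shards; this example substitutes `zlib.crc32` as a stable stand-in hash, and the key layout, the `large_entities` set and the `spread` parameter are hypothetical.

```python
import zlib
from typing import Set


def shard_for(routing_key: str, num_shards: int) -> int:
    """Map a routing key to a shard with a stable hash."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards


def routing_key(
    account: str, entity: str, doc_id: str,
    large_entities: Set[str], spread: int,
) -> str:
    """Route small accounts by account; fan large entities out over spread keys."""
    if entity in large_entities:
        bucket = zlib.crc32(doc_id.encode("utf-8")) % spread
        return f"{account}/{entity}/{bucket}"
    return account
```

All documents of a small account share one routing key and therefore one shard, which keeps account-scoped queries cheap, while a large repository is spread over at most `spread` routing keys.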

Handling growth

A single account typically sits in a single index on Elasticsearch split across multiple shards of that index based on size. It is also possible for some very large accounts to have multiple indices dedicated to them.

At any given point in time, there are a few tens of indices that are marked “active”, so new accounts can be indexed into them. Based on certain account/entity instance size heuristics, indices are deemed “full” and are closed to the addition of new accounts. Existing accounts continue to grow within the same indices once assigned. When there are no active indices available, a new set of active indices is created automatically to support new account additions.
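
A toy model of this assignment policy is sketched below. It uses a fixed cap on the number of accounts per index as a stand-in for the size heuristics, and the class names, index naming and `capacity` parameter are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Index:
    name: str
    capacity: int  # hypothetical cap on accounts, standing in for size heuristics
    accounts: List[str] = field(default_factory=list)

    @property
    def active(self) -> bool:
        """An index accepts new accounts until it reaches its cap."""
        return len(self.accounts) < self.capacity


class IndexAllocator:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.indices: List[Index] = []

    def assign(self, account: str) -> str:
        """Place a new account in an active index, creating one if needed."""
        for index in self.indices:
            if index.active:
                index.accounts.append(account)
                return index.name
        index = Index(f"index-{len(self.indices)}", self.capacity)
        index.accounts.append(account)
        self.indices.append(index)
        return index.name
```

Once an account is assigned it stays in its index; only new accounts are steered toward indices that are still active.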

Periodically, jobs run to determine if some shards/indices are really “large” because of high growth of accounts on that shard, and selected accounts on that shard are marked for “move” to a new index to ensure they don’t become a bottleneck and influence other accounts on that index. These moves are then orchestrated by the trigger/monitor job that handles re-indexing, to ensure the number of moves at any given point of time is regulated/throttled.

Monitoring is built in to indicate capacity crunch or spare capacity. This helps react by increasing/decreasing nodes in the cluster.

Search Service Fundamentals

Multi-tenancy

Event/Job processing pipeline

The indexing pipeline is a shared job execution model across multiple accounts that are hosted within the Search service. Jobs are scheduled per entity instance (for example – repository in case of code, project in case of work-item and so on) to handle any complete/incremental changes that are detected for that entity instance.

To ensure index consistency, the event processing pipeline has robust locking semantics that ensures that only a single operation (indexing, metadata change processing etc.) is running for a given entity instance at a given point of time.
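
The single-operation-per-entity rule can be illustrated with a small in-process lock table. A real multi-machine service would need a distributed lock or lease, so treat this purely as a sketch of the semantics; the class and method names are hypothetical.

```python
import threading
from typing import Dict


class EntityLocks:
    """Allow at most one running operation per entity instance."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._held: Dict[str, str] = {}  # entity id -> operation holding it

    def try_acquire(self, entity_id: str, operation: str) -> bool:
        """Return True if the operation may start, False if one is running."""
        with self._guard:
            if entity_id in self._held:
                return False
            self._held[entity_id] = operation
            return True

    def release(self, entity_id: str) -> None:
        """Mark the entity instance as free again."""
        with self._guard:
            self._held.pop(entity_id, None)
```

An operation that fails to acquire the lock would typically be re-queued and retried after the running operation releases it.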

Metadata changes, addition of new projects and repositories are also controlled at a per account level, to ensure semantic consistency of the account’s information.

Each entity treats its accounts differently, so the locking semantics don’t span across entities for the same account. Indexing is typically done in a single job for an entity instance, but it can be dynamically expanded to multiple parallel jobs (if the change to be processed or the entity instance itself is very large).

Resource Utilization Management

Multi-tenancy (support for multiple accounts within the same service) is handled throughout the indexing pipeline, and inside Elasticsearch indices. The service has support for ensuring effective resource sharing across accounts (basically avoidance of a single account hogging the entire pipeline, starvation etc.) through job schedulers that look at the current job load, pending job queue and total available resources, before allocating new job resources to an account for indexing.
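
One simple policy that has this effect is to always give the next free job slot to the waiting account that currently runs the fewest jobs. The sketch below illustrates that policy only; the function name and data shapes are hypothetical, and the real scheduler weighs more signals than this.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def allocate_jobs(
    pending: Dict[str, Deque[str]],
    running: Dict[str, int],
    free_slots: int,
) -> List[Tuple[str, str]]:
    """Hand out free slots, always to the account running the fewest jobs.

    Granted jobs are removed from the pending queues.
    """
    running = dict(running)  # do not mutate the caller's counters
    granted: List[Tuple[str, str]] = []
    while free_slots > 0:
        waiting = [a for a, q in pending.items() if q]
        if not waiting:
            break
        # Ties are broken by account name to keep the result deterministic.
        account = min(waiting, key=lambda a: (running.get(a, 0), a))
        granted.append((account, pending[account].popleft()))
        running[account] = running.get(account, 0) + 1
        free_slots -= 1
    return granted
```

A large account with a long queue still makes progress, but it cannot take a slot while a smaller account with fewer running jobs is waiting.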

Every job run also executes in a time-bound manner to ensure it doesn’t continue to hog resources while starving another account for a very long time, yielding every so often to ensure that a job resource can be allocated to another account if needed.

Similar mechanisms are applied at the entity level as well, to ensure jobs for a given entity type don’t hog resources needed by jobs of another entity type.

Shared Indices

Inside the Elasticsearch indices, data across multiple accounts/entity instance is shared and stored in a single index. This helps with reducing the total number of indices and shards (partitions) that need to be managed and caters to many small accounts that don’t have a lot of data.

At the same time, for large accounts or entity instances, the Search service scales to support dedicated indices, so the effects of noisy neighbors are minimized. Whether an account or entity instance counts as large is determined heuristically. Shared indices have a cap on the maximum number of accounts/entity instances that are accepted, to ensure there is room for growth.

The indices at entity type level are different for each entity and the same is not shared across entity types. This gives room for each entity type to have its own indexing and querying characteristics; also, how it wants to group the accounts’ data for optimal query performance.

Monitoring and Deployment

Logs from Elasticsearch are pumped into the Azure core monitoring system and Microsoft homegrown log analysis systems. Deployments are completely integrated into the Azure DevOps Release Management system.

Azure DevOps Search – Best Practices

Code Search

You can use code type filters to search for specific kinds of code such as definitions, references, functions, comments, strings, namespaces, and more. You can use Code Search to narrow down your results to exact code type matches. This is useful when all you want to do is quickly get to the implementation of (say) an API your code might be taking a dependency on!
You can narrow your search by using project, repository, path, file name, and other filter operators. This will help you achieve your desired results even faster. Start with a higher-level search if you don’t know where the results would be and keep filtering till you have a subset of results to browse through and work on.
You can use wildcards to widen your search and Boolean operators to fine-tune it. This will ensure you get to the results you desire even when you are not sure of the exact term you are looking for.
When you find an item of interest, simply place the cursor on it and use the shortcut menu to quickly search for that text across all your projects and files. This will help you find more information about an item of interest faster and with minimal effort.
Similarly, you can also easily trace how your code works by using the shortcut menu to search for related items such as definitions and references – directly from inside a file or from the search results.

Work Item Search

You can use a text search across all fields to efficiently locate relevant work items. This is useful when you are trying to (say) search for all work items that had a similar exception trace!
You can also use the quick in-line search filters on any work item field to narrow down to a list of work items in seconds. The dropdown list of suggestions helps complete your search faster.

Wiki Search

When you search from Wiki, you’ll automatically navigate to wiki search results. Text search across the wiki is supported by the search platform.

Know More

If you would like to see what Search looks like in action, you can watch the video here! In this video, Biju Venugopal (Principal PM Manager @ Microsoft) walks us through a demo of Search and talks through important aspects of the service.

This article was co-authored by Imran and Mahathi and technically reviewed by Subodh Sohoni.

This article has been editorially reviewed by Suprotim Agarwal.

Author

Imran Siddique is a software engineer specializing in software & distributed systems development. He has over 11 years of experience designing and architecting different Microsoft cloud services. Imran is passionate about distributed systems, designing at scale and engineering improvements. Connect with him on LinkedIn.
