Inside Google's 2-Billion-Line Monorepo

Google is renowned for its massive scale in terms of users, infrastructure, and products. But one of the most mind-boggling statistics about the tech giant is the sheer size of its codebase. Google's single source code repository, which includes code for all its projects, totals a staggering 2 billion lines of code.

To put that in perspective, by one rough estimate, if you printed out Google's codebase, the stack of paper would be over 1,400 feet tall – higher than the Empire State Building. The codebase is so large that it's difficult to wrap one's head around its scale. It encompasses code for all of Google's products and services, from Search and Maps to Gmail and YouTube.

What's even more remarkable is that virtually all of this code lives in a single repository. Google uses a monolithic source control architecture, known as a "monorepo", rather than splitting its codebase across hundreds or thousands of smaller repositories. This unorthodox approach offers some compelling advantages at immense scale.

The Monorepo Advantage

Maintaining a single repository for billions of lines of code is no easy feat, but Google believes the benefits outweigh the challenges. Key advantages of a monorepo include:

  1. Unified versioning – With all code versioned together, it's easier to manage API changes, coordinate refactors, and track down bugs that span multiple systems. There's a single source of truth.

  2. Simplified dependency management – Libraries and frameworks can be shared across the entire codebase with consistent versioning. This reduces code duplication and dependency conflicts.

  3. Universal code search – Developers can search and analyze the entire codebase at once, making it easier to find examples, uncover usage patterns, and catch issues.

  4. Large-scale refactoring – Having all the code in one place enables codebase-wide cleanup, updates, and API migrations. Refactors can be made with confidence.

  5. Extensive code reuse – With the entire codebase at their fingertips, engineers can more easily discover and leverage existing solutions. This speeds up development and encourages best practice sharing.

Google's monorepo approach is enabled by heavy investments in custom tooling and infrastructure. Let's take a look at the scale involved.

Monorepo by the Numbers

The scale of Google's monorepo is almost inconceivable, as illustrated by these stats:

Metric                              Value
Lines of code                       2 billion+
Code commits per day                40,000+
Presubmit checks per day            150,000+
Automated refactorings per day      5,000+
Unique source files                 9 million+
Unique directories                  1 million+
Source file size (uncompressed)     20+ terabytes

Keeping the monorepo running smoothly requires a staggering amount of compute power and storage. Changes undergo extensive automated testing at scale before being committed. Google's tooling leverages thousands of machines to perform parallel builds and testing.

Programming Language Breakdown

While Google's monorepo contains code in dozens of programming languages, here's an approximate breakdown of the most common ones:

Language        Share of Codebase
C++             40-50%
Java            30-35%
Python          8-10%
Go              5-7%
JavaScript      3-5%
Others          3-5%

The monorepo's language composition has evolved over time as different languages waxed and waned in popularity. For instance, languages like Go and Kotlin have seen strong growth in recent years.

Custom Build and Version Control

To wrangle its monorepo at multi-billion line scale, Google has developed highly optimized proprietary build and version control systems.

Google's build system, known internally as Blaze (an open-source version is released as Bazel), distributes compilation and testing across thousands of machines and runs builds with massive parallelism. It performs incremental builds that recompile and retest only the code affected by a change. The build system exposes a configuration language that lets developers declare build targets, their dependencies, and custom rules.
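To make the idea of an incremental build concrete, here is a minimal sketch in Python. It assumes the build graph is available as a plain dictionary; the target names and the `affected_targets` helper are hypothetical and are not taken from Google's tooling. Given a set of changed targets, it walks the reverse dependency edges to find everything that would need to be rebuilt and retested.

```python
from collections import defaultdict, deque

# Hypothetical build graph: each target maps to the targets it depends on.
DEPS = {
    "//search:server": ["//search:ranking", "//base:strings"],
    "//search:ranking": ["//base:math"],
    "//maps:tiles": ["//base:math", "//base:strings"],
    "//base:math": [],
    "//base:strings": [],
}


def affected_targets(deps, changed):
    """Return every target that must be rebuilt when `changed` targets change."""
    # Invert the graph so we can walk from a dependency to its dependents.
    rdeps = defaultdict(set)
    for target, target_deps in deps.items():
        for dep in target_deps:
            rdeps[dep].add(target)

    # Breadth-first walk over reverse edges, starting from the changed targets.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in rdeps[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


if __name__ == "__main__":
    for name in sorted(affected_targets(DEPS, ["//base:math"])):
        print(name)
```

Everything outside the returned set can be served from cache, which is what keeps a build proportional to the size of the change rather than the size of the repository.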

For version control, Google created a custom system called Piper, which is optimized for housing ultra-large repositories. Piper scales by sharding the monorepo's metadata and content across hundreds of servers in Google's data centers. This distributed architecture allows it to handle the monorepo's 40,000+ commits per day, and makes checkout and browsing responsive even at multi-billion line scale.
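As a rough illustration of what sharding by path can look like, the sketch below hashes each file path to one of a fixed number of shards. This is a generic technique and an assumption made for illustration only: nothing here describes Piper's actual partitioning scheme, and the `shard_for` and `group_by_shard` helpers are hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for(path, num_shards):
    """Map a repository path to a shard index in [0, num_shards)."""
    # A cryptographic hash is used only because it is stable across runs
    # and spreads similar-looking paths evenly.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def group_by_shard(paths, num_shards):
    """Group paths by the shard that would store them."""
    shards = defaultdict(list)
    for path in paths:
        shards[shard_for(path, num_shards)].append(path)
    return dict(shards)


if __name__ == "__main__":
    paths = ["search/ranker.cc", "maps/tiles.cc", "gmail/inbox.java"]
    for shard, members in sorted(group_by_shard(paths, 4).items()):
        print(shard, members)
```

A simple modulo scheme like this reshuffles most paths whenever the shard count changes, so a production system would more likely use range-based or consistent hashing; the sketch only shows the basic idea of spreading one logical repository across many servers.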

Code Ownership and Review

With 25,000+ engineers all committing to the same codebase, Google relies on well-defined code ownership policies and procedures to maintain order and quality.

Every directory in the monorepo has a designated owner or set of owners who are responsible for its code. Owners have the final say on design decisions and are accountable for their directory's code health. Ownership is recorded in OWNERS files checked into the directories themselves, so tooling can determine who must approve a given change.
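Directory-based ownership implies a simple lookup rule: to find who is responsible for a file, start at its directory and walk up the tree until some directory lists owners. The sketch below assumes the ownership data has already been loaded into a dictionary; the data, the usernames, and the `owners_for` helper are hypothetical.

```python
import posixpath

# Hypothetical ownership data: directory -> set of owner usernames.
# The empty string stands for the repository root.
OWNERS = {
    "": {"root-admins"},
    "search": {"alice", "bob"},
    "search/ranking": {"carol"},
}


def owners_for(path, owners=OWNERS):
    """Return the owners of the nearest enclosing directory that lists any."""
    directory = posixpath.dirname(path)
    while True:
        if directory in owners:
            return owners[directory]
        if directory == "":
            # Reached the root without finding an owner entry.
            return set()
        directory = posixpath.dirname(directory)


if __name__ == "__main__":
    for path in ["search/ranking/model.cc", "search/index/build.cc", "maps/tiles.cc"]:
        print(path, sorted(owners_for(path)))
```

Resolving to the nearest enclosing directory lets a team override ownership for a subtree while everything else falls back to the parent.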

All code commits in the monorepo are subject to peer review before being merged. Google has built powerful code review tooling that allows reviewers to see the full context of changes, including dependencies and prior refactorings. Reviewers look for bugs, style nits, and opportunities to improve code health. Code reviews serve as an important quality gate and knowledge sharing mechanism.

Testing at Scale

Comprehensive automated testing is essential to maintaining confidence in changes made to Google's enormous monorepo. Code commits are required to include tests, which are run at scale before and after the code is merged.

Google's TAP (Test Automation Platform) runs hundreds of millions of test cases per day across its entire codebase. TAP can run unit tests, integration tests, fuzz tests, performance tests, and more. It provides detailed feedback to developers about failures and code coverage.

Google also makes extensive use of presubmit checks that are run before a commit is merged. These include tests, linters, and static analysis tools that check for common issues. Commits that fail presubmit checks are blocked from being merged.
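The gating behavior described above can be sketched in a few lines of Python. This is a toy model, not Google's presubmit system: a change is represented as a dictionary of file contents, and the two checks (`check_no_tabs`, `check_line_length`) and the `run_presubmit` runner are hypothetical stand-ins for real linters and tests.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class CheckResult:
    name: str
    passed: bool
    details: str = ""


# A change is modeled as {file path: new file contents}.
Change = Dict[str, str]
Check = Callable[[Change], CheckResult]


def check_no_tabs(change: Change) -> CheckResult:
    """Lint-style check: fail if any file contains a tab character."""
    offenders = sorted(path for path, text in change.items() if "\t" in text)
    return CheckResult("no_tabs", not offenders, ", ".join(offenders))


def check_line_length(change: Change, limit: int = 100) -> CheckResult:
    """Style check: fail if any line is longer than `limit` characters."""
    offenders = sorted(
        path for path, text in change.items()
        if any(len(line) > limit for line in text.splitlines())
    )
    return CheckResult("line_length", not offenders, ", ".join(offenders))


def run_presubmit(change: Change, checks: List[Check]) -> Tuple[bool, List[CheckResult]]:
    """Run every check; the change may merge only if all of them pass."""
    results = [check(change) for check in checks]
    return all(result.passed for result in results), results


if __name__ == "__main__":
    change = {"search/ranker.py": "def rank():\n\treturn 1\n"}
    ok, results = run_presubmit(change, [check_no_tabs, check_line_length])
    for result in results:
        print(result.name, "PASS" if result.passed else "FAIL", result.details)
    print("mergeable" if ok else "blocked")
```

Running every check instead of stopping at the first failure gives the author a complete list of problems in one round trip.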

Enabling Collaboration and Code Health

Google's monorepo approach is as much about enabling a collaborative engineering culture as it is about technical advantages. With all of Google's code in one place, any engineer can discover, read, and learn from the work of teams across the company.

The monorepo also allows Google to maintain a high bar for code health across the entire codebase. Google has a dedicated "Code Health" team that builds tools for improving the maintainability, understandability, and efficiency of its code. Some examples of code health tooling include:

  • Static analysis tools that detect bugs, performance issues, and style violations
  • Automated refactoring tools that can make sweeping codebase updates
  • Dead code detection and removal
  • Dependency analysis and visualization
  • Documentation generation and standardization
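To give a flavor of the simplest item on this list, the sketch below detects dead code in a single Python module using the standard `ast` module. It is a deliberately naive illustration and not how Google's tools work: the `unused_functions` helper and the `SAMPLE` module are hypothetical, and the check looks only at names within one file.

```python
import ast

# Hypothetical module to analyze.
SAMPLE = '''
def used():
    return 1

def unused():
    return 2

print(used())
'''


def unused_functions(source):
    """Return names of top-level functions never referenced in `source`."""
    tree = ast.parse(source)
    defined = {
        node.name for node in tree.body if isinstance(node, ast.FunctionDef)
    }
    referenced = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            referenced.add(node.id)      # plain references such as used()
        elif isinstance(node, ast.Attribute):
            referenced.add(node.attr)    # attribute references such as mod.used
    return sorted(defined - referenced)


if __name__ == "__main__":
    print(unused_functions(SAMPLE))
```

A real tool would need whole-codebase reference data, since a function unused in its own file may be called from elsewhere, and it would also have to handle a function that only refers to itself. Having every caller in one repository is what makes such an analysis feasible.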

By investing heavily in code health tooling and best practices, Google can keep its monorepo maintainable as it grows by millions of lines of code each year.

Monorepo Challenges and Tradeoffs

While Google has found great success with its monorepo, this model comes with some notable challenges and downsides:

  • Scaling version control and build systems to billion-line scale requires major investments in custom tooling and infrastructure. Not every organization can afford that.
  • Codebase-wide changes, while powerful, require careful coordination and can be hard to stage. Mistakes can have far-reaching impact.
  • Maintaining code health in an enormous monorepo requires active effort and tooling. Without proper maintenance, the codebase can accumulate cruft over time.
  • Onboarding new hires to navigate and contribute to the monorepo can involve a steep learning curve. Documentation and mentoring are key.
  • The flexibility to choose per-project toolchains, dependencies, and release cycles is more limited in a monorepo model.

Companies considering adopting a monorepo model have to weigh these tradeoffs against the potential benefits. Many companies run a hybrid model, with some code in a monorepo and some in separate repositories. The right model depends on an organization's scale, culture, and development practices.

Monorepo Pioneers

Google is not alone in adopting the monorepo approach. Several other prominent tech companies also use monorepos for major parts of their codebase:

  • Facebook – The main Facebook app is a gigantic monorepo of over 6 million files.
  • Twitter – Twitter's core services, including its main app and API service, live in a monorepo.
  • Microsoft – Major parts of Office 365 and Azure DevOps Services share monorepos.
  • Etsy – Etsy's main website and services are a single monorepo with over 7.5 million lines of code.

These companies have also built extensive tooling and infrastructure to make their monorepos manageable at scale. While the monorepo model is still relatively uncommon, it has gained mindshare in recent years as a viable approach to large-scale codebase management.

The Future of the Monorepo

As Google's codebase continues to grow every year, the company keeps evolving its monorepo tooling and practices. Some key areas of investment include:

  • Scaling code search and analysis to handle the ever-expanding codebase
  • Improving the speed and resource efficiency of the build system
  • Enhancing static analysis to catch more bugs and anti-patterns
  • Automating more codebase refactorings to improve code health
  • Streamlining the developer experience with better tools and documentation

Looking ahead, it's clear that monorepos will continue to be a key part of Google's engineering culture and practices. The monorepo model has served Google well in enabling its rapid growth and innovation. As the tech giant's codebase marches towards 3 billion lines and beyond, its monorepo will be the foundation that empowers Google's engineers to collaborate and build at massive scale.
