Open Source Insights – What 860 Million GitHub Event Logs Reveal

As a developer who has spent the better part of a decade building software and participating in open source communities, I’m always eager to zoom out and understand the larger patterns shaping the ecosystem. Open source is a pillar of modern software development, yet we’re still in the early days of learning how to measure and manage it at scale.

This is why the recent GitHub 2020 Digital Insight Report from X-lab caught my attention. The researchers collected a massive dataset – 860 million GitHub events from 2020 representing the activity of over 14 million developers across more than 54 million projects. By applying thoughtful metrics and models to this anonymized data, the report surfaces insights into the state and trajectory of open source that are informative for individual developers, project maintainers and technology leaders alike.

GitHub Activity Accelerated in 2020

Diving into the topline numbers, 2020 saw a significant acceleration of GitHub activity across all dimensions:

Metric                2020     2019     % Increase
Total GitHub Events   860M     610M     42.6%
Active Projects       54.21M   39.73M   36.4%
Active Developers     14.54M   11.94M   21.8%
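As a quick sanity check, the year-over-year percentages can be recomputed from the rounded figures in the table. This is just a sketch of the arithmetic, not the report's own calculation, and the function name is my own:

```python
def pct_increase(current: float, previous: float) -> float:
    """Year-over-year growth as a percentage of the previous value."""
    return (current - previous) / previous * 100


# Rounded figures from the table above, in millions: (2020, 2019).
figures = {
    "Total GitHub Events": (860, 610),
    "Active Projects": (54.21, 39.73),
    "Active Developers": (14.54, 11.94),
}

for metric, (y2020, y2019) in figures.items():
    print(f"{metric}: {pct_increase(y2020, y2019):.1f}%")
```

From these rounded inputs, projects and developers reproduce the reported 36.4% and 21.8%, while events come out at about 41.0% rather than 42.6%. My guess is that the reported figure was computed from unrounded event counts.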

Despite the global pandemic putting a damper on so many areas of life, open source collaboration managed to maintain momentum and even thrive in a year where digital interaction became more critical than ever.

Taking a closer look at the composition of the 860 million events logged in 2020:

  • Pushes were the most common event type, accounting for nearly 40% of all events
  • Pull requests accounted for 20% of events, followed by issues at 15%
  • Comments on issues and pull requests made up another 15% of events
  • Create and delete events (e.g. creating a branch or tag) rounded out the remaining 10%

This distribution suggests that the typical GitHub workflow of pushing code, opening pull requests for review, and filing/discussing issues remains the dominant mode of collaboration. It would be interesting to compare this breakdown across different programming languages and frameworks to see if different communities have different collaboration norms.
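A breakdown like the one above boils down to counting events by type. Here is a minimal sketch, assuming each event is a dictionary with a "type" field that uses GitHub's public event type names (PushEvent, PullRequestEvent and so on). The sample records are made up for illustration, and I don't know how the report's pipeline is actually implemented:

```python
from collections import Counter


def event_shares(events):
    """Return each event type's share of all events, as a percentage."""
    counts = Counter(event["type"] for event in events)
    total = sum(counts.values())
    return {etype: 100 * n / total for etype, n in counts.most_common()}


# Tiny hypothetical sample; real input would be hundreds of millions of records.
sample = [
    {"type": "PushEvent"},
    {"type": "PushEvent"},
    {"type": "PullRequestEvent"},
    {"type": "IssuesEvent"},
    {"type": "IssueCommentEvent"},
]

for etype, share in event_shares(sample).items():
    print(f"{etype}: {share:.0f}%")
```

Running the same function per language or per framework would give the cross-community comparison I mentioned.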

Velocity Varies Widely Across Projects

One powerful way to assess a project’s development velocity is to measure the typical time from opening a pull request to merging it. This "time to merge" metric captures the end-to-end throughput of the review and integration process.

To calculate this, the researchers took all pull requests merged in 2020 and computed the median time between each pull request’s creation and its merge. Segmenting projects by language revealed some notable differences:

Language     Median Time to Merge (Hours)
C            6.1
C++          7.4
Java         8.6
Python       10.2
JavaScript   11.5
Ruby         14.3

C and C++ projects tended to have the fastest median merge times at just over 6 and 7 hours respectively. JavaScript and Ruby projects on the other hand had the slowest merge times at over 11 and 14 hours. These differences could reflect varying levels of testing, code review rigor, or contributor responsiveness across language communities. They could also be skewed by differences in the size and complexity of pull requests.
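To make the metric concrete, here is a rough sketch of a per-language median time to merge. The record fields ("language", "created_at", "merged_at") and the sample values are my own assumptions for illustration, not the report's schema or data:

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_hours_to_merge(pull_requests):
    """Median hours from creation to merge, grouped by project language."""
    hours_by_language = defaultdict(list)
    for pr in pull_requests:
        if pr["merged_at"] is None:
            continue  # only merged pull requests count toward the metric
        created = datetime.fromisoformat(pr["created_at"])
        merged = datetime.fromisoformat(pr["merged_at"])
        hours = (merged - created).total_seconds() / 3600
        hours_by_language[pr["language"]].append(hours)
    return {lang: median(hours) for lang, hours in hours_by_language.items()}


# Hypothetical pull requests with ISO 8601 timestamps in UTC.
sample = [
    {"language": "C", "created_at": "2020-03-01T09:00:00+00:00", "merged_at": "2020-03-01T15:00:00+00:00"},
    {"language": "C", "created_at": "2020-03-02T10:00:00+00:00", "merged_at": "2020-03-02T12:00:00+00:00"},
    {"language": "C", "created_at": "2020-03-03T08:00:00+00:00", "merged_at": "2020-03-03T18:00:00+00:00"},
    {"language": "Ruby", "created_at": "2020-03-01T09:00:00+00:00", "merged_at": "2020-03-02T01:00:00+00:00"},
    {"language": "Ruby", "created_at": "2020-03-04T09:00:00+00:00", "merged_at": "2020-03-04T21:00:00+00:00"},
    {"language": "Python", "created_at": "2020-03-05T09:00:00+00:00", "merged_at": None},
]

print(median_hours_to_merge(sample))
```

Using the median rather than the mean keeps a handful of pull requests that sat open for months from dominating the result.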

Projects also vary in the volume of pull requests they receive. Plotting the number of merged pull requests against the number of project contributors reveals an unsurprising positive correlation:

[Image: Scatter plot showing positive trend between contributor count and merged PR count]

In general, projects with more contributors tend to merge more pull requests, indicating a virtuous cycle between community size and output. However, the correlation is far from perfect, with many small projects achieving high merge volumes and many large projects seeing relatively low traffic. This suggests that raw contributor counts are not a foolproof predictor of project health and that smaller, more focused communities can often be as productive as larger ones.
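One way to put a number on "positive but far from perfect" is a Pearson correlation coefficient. The sketch below uses made-up contributor and merged PR counts purely for illustration. I don't know which statistic, if any, the report used:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical projects: contributor counts and merged pull request counts.
contributors = [2, 5, 10, 50, 200, 800]
merged_prs = [300, 20, 150, 90, 1200, 700]

print(f"r = {pearson(contributors, merged_prs):.2f}")
```

A value well below 1 matches the picture in the scatter plot: community size helps, but small projects with high merge volumes and large projects with low ones pull the relationship apart.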

Global Differences in Development Schedules

One of the report’s most interesting analyses looked at when GitHub activity occurs over the course of the day and week. By binning events by hour and aggregating activity across time zones, the researchers could infer the typical work schedules of developers around the world.
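The binning step can be sketched as follows. I'm assuming each event carries a UTC timestamp plus a UTC offset for the developer, in a field I've called "utc_offset_hours". That field is hypothetical, and how the researchers actually attributed events to time zones is not something I can confirm:

```python
from datetime import datetime, timedelta, timezone


def activity_grid(events):
    """Count events per (weekday, hour) in each developer's local time.

    Returns a 7x24 grid: rows are weekdays (0 = Monday), columns are hours.
    """
    grid = [[0] * 24 for _ in range(7)]
    for event in events:
        utc_time = datetime.fromisoformat(event["created_at"])
        local_tz = timezone(timedelta(hours=event["utc_offset_hours"]))
        local_time = utc_time.astimezone(local_tz)
        grid[local_time.weekday()][local_time.hour] += 1
    return grid


# Hypothetical events: the same UTC morning lands on different local days.
sample = [
    {"created_at": "2020-06-01T01:30:00+00:00", "utc_offset_hours": 8},
    {"created_at": "2020-06-01T02:00:00+00:00", "utc_offset_hours": -5},
]

grid = activity_grid(sample)
print(grid[0][9], grid[6][21])  # Monday 09:00 bucket, Sunday 21:00 bucket
```

Each row of the grid corresponds to one row of a weekday-by-hour heatmap.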

The global heatmap of hourly activity shows clear diurnal patterns:

[Image: Heatmap of hourly activity]

Activity tends to peak during "business hours" between 9am and 6pm, then decline over the evening before bottoming out in the early morning hours. Weekends see markedly less activity than weekdays, suggesting that a lot of open source work happens during the traditional workweek.

However, these aggregate trends mask significant regional variation. Comparing event volumes across continental regions reveals that:

  • North American developers tend to be most active between 10am and 7pm local time
  • South American developers maintain similar schedules but tend to work later into the night
  • European developers hit peak activity a bit earlier, with a spike between 9am and 5pm
  • African and Middle Eastern developers are active between 8am and 6pm with less of an evening drop-off
  • Asian developers maintain the most varied schedules:
    • Indian developers are active as early as 8am but maintain activity late into the night
    • Chinese developers tend to be most active between 9am and 7pm but also work weekends
    • Japanese and Korean developers have less pronounced peaks and troughs

These differences reflect the diversity of cultures, organizational norms and individual preferences across the global open source community. They also surface some of the challenges of asynchronous collaboration across time zones, with developers in Asia often having to bridge time gaps with collaborators in Europe and the Americas. Tools like GitHub Actions, which can automate workflows across time zones, may help mitigate some of these challenges.

GitHub Actions Is a Game Changer

Speaking of GitHub Actions, the report reveals just how rapidly the workflow automation platform has grown since becoming generally available in 2019. The number of GitHub Actions events grew 5x in 2020 to over 100 million, with 22% of active repositories now using Actions.

The most common use cases for Actions are:

  1. Continuous Integration / Continuous Deployment (CI/CD)
  2. Code formatting and linting
  3. Code analysis and test coverage
  4. Dependency management and updates
  5. Release management and publishing

The surge in Actions usage reflects the growing importance of automation in modern software development. By codifying workflows as declarative YAML scripts, Actions make it easy to enforce consistent standards and reduce toil across a team or community. Many projects are even replacing bespoke legacy CI/CD setups with standardized Actions-based workflows.

Interestingly, many of the most popular Actions come from the developer community rather than GitHub itself, although the single most widely used one does not. The actions/setup-node Action, which sets up a Node.js environment and is maintained under GitHub’s own actions organization, leads with over 10 million executions. Popular community-developed Actions focus on publishing to registries like npm or Docker Hub, reflecting the importance of packaging workflows.

I’ve personally found Actions enormously helpful in maintaining projects. By automating routine tasks like linting, testing and dependency updates, I can spend more time on higher-level project strategy and community engagement. The flourishing Actions ecosystem is a testament to the power of open source to rapidly build and disseminate new innovations.

Measuring Influence in the Open Source Graph

Alongside the quantitative analysis of GitHub events, the report introduces several novel metrics and models for understanding influence and relationships in the open source ecosystem. Chief among these is the OpenGalaxy visualization, a network graph that captures collaborations between developers and projects.

In the OpenGalaxy graph, nodes represent projects and edges represent developers who have contributed to multiple projects. The size of each node corresponds to the project’s influence score, a metric that incorporates both the number of contributors and the number of projects they contribute to. Projects are also color-coded by language to reveal clusters of related technologies.
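To give a feel for the structure, here is a toy version of such a graph. The edge weights count shared developers, and the score simply sums, for each project, how many projects each of its contributors works on. That scoring rule is my own stand-in, since the exact formula behind the report's influence score isn't spelled out here, and the developer and project names are made up:

```python
from collections import defaultdict
from itertools import combinations


def build_project_graph(contributions):
    """Map each project pair to the number of developers they share."""
    edges = defaultdict(int)
    for projects in contributions.values():
        for a, b in combinations(sorted(projects), 2):
            edges[(a, b)] += 1
    return dict(edges)


def influence_scores(contributions):
    """Hypothetical score: for each project, sum over its contributors
    the number of projects each contributor works on."""
    scores = defaultdict(int)
    for projects in contributions.values():
        for project in projects:
            scores[project] += len(projects)
    return dict(scores)


# Hypothetical developer -> projects mapping.
contributions = {
    "alice": {"proj-a", "proj-b"},
    "bob": {"proj-a", "proj-b", "proj-c"},
    "carol": {"proj-c"},
}

print(build_project_graph(contributions))
print(influence_scores(contributions))
```

Even in this toy form, a project touched by widely active developers scores higher than one with the same headcount of single-project contributors, which is the intuition behind treating hubs as influential.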

Visualizing the open source ecosystem in this way provides a richer view of how projects and developers interact than raw event counts or contributor lists. Highly influential projects like VS Code, React and TensorFlow have an outsized impact not just because of their large userbases but because they are hubs that connect developers across many organizations and domains. Smaller projects can also punch above their weight by serving as bridges between otherwise disconnected communities.

As an example, here is the OpenGalaxy view of the Python ecosystem:

[Image: Python OpenGalaxy graph]

The Python galaxy has several supermassive stars like the core CPython interpreter, NumPy and TensorFlow. But it also has constellations of data science libraries, web frameworks, and developer tools that form coherent subcommunities. Understanding these relationships can help allocate resources, identify key contributors, and track the flow of ideas across projects.

Healthy Communities Have Diverse Contributions

Another theme that emerges from the report is the importance of diversity in open source communities. The researchers found that projects with a mix of contributors from different regions, time zones and organizations tended to be more active and influential than those with more homogeneous communities.

This finding is consistent with sociological theories of collective intelligence, which suggest that diverse groups are better at solving complex problems than even the smartest individuals. Open source provides a platform for harnessing cognitive diversity at a global scale by allowing anyone with an internet connection to inspect code, file issues and submit patches. GitHub sustains this model by providing a standard toolchain for distributed collaboration.

However, creating a welcoming and inclusive environment is not just a matter of providing access. Project maintainers must also actively encourage participation from a wide range of contributors and be mindful of alienating language, assumptions and power dynamics. Tools like Codes of Conduct, mentoring programs and localization efforts can help make projects more approachable and equitable.

Achieving diversity is a challenge that requires ongoing effort but the payoff is immense. The report’s data shows that open source communities are already extraordinarily global:

Region          % of Total Contributors
North America   34%
Europe          31%
Asia            23%
South America   6%
Oceania         3%
Africa          2%

While North America remains the epicenter of open source activity, the majority of contributors now come from other regions. Sustaining this global participation is key to the long-term vitality and innovation of the ecosystem.

Parting Thoughts and Future Work

The GitHub 2020 Digital Insight Report is a treasure trove of novel insights into the current state of open source. By combining macro-level event analysis with thoughtful metrics design and community modeling, it provides a comprehensive picture of how developers collaborate on GitHub. Analysis at this scale and depth would simply not be possible without access to GitHub’s massive corpus of behavioral data.

At the same time, it’s important to acknowledge the limitations of this kind of data-driven approach. Metrics like developer activity scores and time to merge are useful indicators but they don’t tell the whole story of a project’s health or a developer’s contributions. Quantitative measures must be combined with qualitative assessments that account for the context and norms of each community.

The report is also largely descriptive rather than prescriptive. While it surfaces many interesting patterns and trends, it doesn’t delve deeply into the causal factors driving them or make strong recommendations for how to act on the insights. The implicit thesis is that transparency and awareness of ecosystem-level dynamics are themselves valuable for decision making.

There are many opportunities to build on this work to make the insights more actionable:

  • Developing predictive models of project health based on activity metrics
  • Identifying early warning signs of community dysfunction or contributor burnout
  • Evaluating the impact of different governance models and incentive structures
  • Modeling the flow of knowledge and talent between projects and organizations
  • Recommending mentors, collaborators or funding opportunities based on the open source graph
  • Forecasting emerging trends and technologies based on growth patterns

As the report authors note, this kind of applied open source science is still in its infancy. Efforts like this report that combine large-scale data analysis with domain expertise can accelerate the maturation of the field. Giving open source communities more sophisticated tools to understand themselves will become even more important as the commercial and strategic significance of open source continues to grow.

On a personal note, reading the report gave me a renewed appreciation for the scale and dynamism of the open source movement I’ve dedicated my career to advancing. I’ve always believed that open source is not just a better way to build software, but a better way to collaborate and solve problems together. Seeing that conviction backed by data is truly inspiring.

It also underscores the responsibility we have as open source developers and leaders to be good stewards of the communities we participate in. Understanding the health and sustainability of open source ecosystems at a global scale is key to unlocking their full potential. I’m excited to see how this kind of data-driven approach evolves in the coming years.
