How I used Python to help me choose an organization for Google Summer of Code '19


For an aspiring open source contributor and software developer, the Google Summer of Code (GSoC) program presents an exciting opportunity to gain experience, make an impact, and jumpstart your career. With over 200 participating organizations each year, though, the hardest part can be figuring out where to direct your efforts for the best chance of success.

When I decided to pursue GSoC in 2019, I knew I wanted to be strategic in my organization selection to maximize my odds of getting accepted. But the thought of manually researching hundreds of organizations, reading through their past projects, and tracking stats was daunting. There had to be a better way!

As a full-stack developer with a love for Python, I had a lightbulb moment: what if I used my programming skills to automate the research process and uncover insights about each organization? I could work smarter, not harder, and let data guide my decision. Code to the rescue!

Understanding the GSoC Landscape

Before diving into the technical details, let's take a step back and outline how the GSoC program works. GSoC is an annual program sponsored by Google that matches student developers with open source organizations. Accepted students spend the summer writing code and contributing to the organization's projects, while receiving a stipend and gaining invaluable experience.

Organizations range from well-known names like Apache and Mozilla to smaller projects across all areas of technology. Each org creates its own list of project ideas and receives a certain number of student slots to allocate. Students submit proposals to their desired orgs, and the orgs then evaluate the proposals and choose which students to accept.

Competition can be fierce, with some orgs receiving hundreds of proposals for only a handful of slots. It's therefore crucial to apply to orgs that are not only a good technical fit for your skills and interests, but also ones where you have a realistic shot at standing out and getting selected.

Some factors that can influence an organization's competitiveness and selection process:

  • Popularity and brand recognition of the org
  • Number of slots available
  • Number of proposals submitted
  • Breadth and difficulty of project ideas
  • Past acceptance rates
  • Emphasis on student qualifications vs. proposal quality

Having data on these factors across multiple years could provide valuable insight into which organizations to target. The stage was set for my Python-powered research project!

Leveraging Python for Data Extraction and Analysis

Faced with the challenge of gathering GSoC data for analysis, I knew Python would be the perfect tool for the job. The extensive ecosystem of libraries makes tasks like web scraping and data manipulation a breeze. Here are the key tools I used:

  • Requests: A simple and elegant library for making HTTP requests in Python. This allowed me to programmatically download the HTML content of GSoC web pages.

  • Beautiful Soup: A library for parsing HTML and XML documents. Once I had the raw HTML of a page, Beautiful Soup made it easy to extract the relevant data by searching for specific tags and attributes.

  • Pandas: A powerful data manipulation library that provides DataFrame objects for working with structured data. Pandas allowed me to take the extracted data and organize it into a tabular format for analysis.

With my toolkit assembled, I was ready to embark on my data expedition. The journey consisted of three main steps:

  1. Scrape the list of organizations for each year
  2. For each organization, scrape the list of project ideas
  3. Combine the data into a structured format for analysis

Step 1: Scraping the organization lists

To get a list of the participating organizations for each year, I needed to scrape the GSoC archive pages. The URL structure conveniently included the year, making it easy to programmatically access each page.

Here's a simplified version of the code to extract the organization names and links for a given year:


import requests
from bs4 import BeautifulSoup

def get_orgs(year):
    # Build the archive page URL for the given year and fetch it
    url = f'https://summerofcode.withgoogle.com/archive/{year}/organizations/'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')

    # Each organization is rendered as an <li> "card" on the archive page
    orgs = []
    for li in soup.select('li.organization-card__container'):
        org = {
            'name': li.select_one('h4.organization-card__name').text,
            # The href is relative, so prepend the site's base URL
            'link': 'https://summerofcode.withgoogle.com' + li.select_one('a')['href']
        }
        orgs.append(org)

    return orgs

The process breaks down as follows:

  1. Build the archive page URL for the given year
  2. Send a GET request to fetch the HTML content
  3. Parse the HTML using Beautiful Soup
  4. Find all the <li> elements with class organization-card__container, which correspond to each org
  5. For each org <li>, extract the name from the <h4> element and the relative link from the <a> element
  6. Combine the relative link with the base URL to get the full URL for the org's page
  7. Append the org name and link to the orgs list
  8. Return the list of orgs for the given year

By wrapping this in a loop and merging the results, I was able to compile the full list of organizations across all years of the GSoC archive. The real code includes some extra functionality, like error handling and storing the results, but this conveys the core idea.
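
Here's a rough sketch of what that outer loop might look like (the year range here is an assumption for illustration; adjust it to the archive years you want to cover):

all_orgs = {}
for year in range(2016, 2019):  # assumed year range, for illustration
    try:
        all_orgs[year] = get_orgs(year)
    except requests.RequestException as e:
        print(f'Failed to fetch {year}: {e}')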

Step 2: Scraping the project idea lists

Armed with the list of organization names and links, I next set out to scrape the project ideas proposed by each org. Fortunately, the project lists were located on each org's page with a consistent structure.

Here's the simplified code to extract the project titles and descriptions for a given org:


def get_org_projects(url):
    # Fetch and parse the organization's page
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')

    # Each project idea is rendered as an <li> "card" on the org page
    projects = []
    for li in soup.select('li.project-card__container'):
        project = {
            'title': li.select_one('h4.project-card__name').text,
            'description': li.select_one('div.project-card__description').text
        }
        projects.append(project)

    return projects

The steps mirror the process for scraping the org list:

  1. Send a GET request to the org's URL to fetch the HTML content
  2. Parse the HTML using Beautiful Soup
  3. Find all the <li> elements with class project-card__container, corresponding to each project idea
  4. For each project <li>, extract the title from the <h4> element and the description from the <div> element
  5. Append the project title and description to the projects list
  6. Return the list of projects for the given org

Then, I simply had to loop through the orgs, call get_org_projects() for each org link, and store the results alongside the org name and year.
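
As a sketch, assuming the all_orgs dict built in step 1, that nesting looks like:

# Build {year: {org_name: [project dicts]}} for later analysis
data = {}
for year, orgs in all_orgs.items():
    data[year] = {}
    for org in orgs:
        data[year][org['name']] = get_org_projects(org['link'])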

Step 3: Combining the data for analysis

With the hard part of data collection complete, the final step was to combine the scraped data into a structured format suitable for analysis. This is where the Pandas library shines.

I created a DataFrame (essentially a table) with columns for year, organization name, project title, and project description. Each row represented a single project idea, with the corresponding org and year information included.


import pandas as pd

# Flatten the nested {year: {org: [projects]}} structure into one row per project
rows = []
for year, orgs in data.items():
    for org, projects in orgs.items():
        for project in projects:
            rows.append({
                'Year': year,
                'Organization': org,
                'Project Title': project['title'],
                'Project Description': project['description']
            })

df = pd.DataFrame(rows, columns=['Year', 'Organization', 'Project Title', 'Project Description'])

df.head()

The resulting DataFrame provided a tidy and query-able representation of the GSoC project landscape over the years. With the full power of Pandas now at my fingertips, it was time to dig into the data and uncover insights!

Analyzing the Results

The DataFrame made it trivial to slice and dice the data to answer key questions about each organization, such as:

  • How many years has the org participated in GSoC?
  • How many project ideas do they typically propose each year?
  • What types of projects and technologies do they focus on?

To determine an org's consistency, I grouped the DataFrame by organization and counted the number of distinct years. Orgs with a count equal to the number of years analyzed were deemed "consistent" participants.


# Number of distinct years in which each org participated
consistency = df.groupby('Organization')['Year'].nunique()
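
To pull out the "consistent" orgs explicitly, the count can then be compared against the total number of years in the dataset; a small follow-up sketch:

# Orgs that appeared in every year covered by the DataFrame
n_years = df['Year'].nunique()
consistent_orgs = consistency[consistency == n_years].index.tolist()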

To gauge an org's level of activity, I grouped by organization and year and counted the number of projects. This yielded a breakdown of how many project ideas each org proposed each year.


# Number of project ideas per org per year
activity = df.groupby(['Organization', 'Year']).size()
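
A follow-up aggregation turns this into a typical-ideas-per-year figure for each org (a sketch):

# Average number of project ideas per participating year, most active first
avg_activity = activity.groupby(level='Organization').mean().sort_values(ascending=False)
avg_activity.head(10)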

Finally, to summarize the types of projects each org focuses on, I performed keyword extraction on the project descriptions using the NLTK library. This revealed the most frequent themes and technologies across an org's projects.


import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time setup: nltk.download('punkt') and nltk.download('stopwords')
def extract_keywords(text):
    # Tokenize, lowercase, and drop non-alphabetic tokens and English stopwords
    stop_words = set(stopwords.words('english'))
    words = [w.lower() for w in word_tokenize(text)
             if w.isalpha() and w.lower() not in stop_words]
    # Keep the ten most frequent remaining words
    freq = nltk.FreqDist(words)
    return [w for w, _ in freq.most_common(10)]

df['Keywords'] = df['Project Description'].apply(extract_keywords)
keywords = df.groupby('Organization')['Keywords'].sum()
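
Because summing Python lists concatenates them, each entry of keywords ends up as one long list of terms per org. A quick pass with collections.Counter then surfaces the dominant themes (a sketch):

from collections import Counter

# Reduce each org's combined keyword list to its ten most common terms
top_themes = keywords.apply(lambda terms: [w for w, _ in Counter(terms).most_common(10)])
print(top_themes.head())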

These insights painted a much clearer picture of each organization's involvement in GSoC. I could now easily identify the most consistent orgs, gauge their capacity for students based on the number of project ideas, and determine whether my skills and interests aligned with their focus areas.

Conclusion

By leveraging my Python skills to scrape and analyze GSoC data, I transformed the overwhelming task of researching hundreds of organizations into a data-driven decision process. The insights I gained guided me toward the orgs that best fit my background and goals, maximizing my chances of success in the GSoC application process.

But more than that, this project demonstrates the power of combining programming skills with open data to work smarter and make better decisions. In a world increasingly defined by data, the ability to efficiently gather, process, and draw meaning from information is invaluable.

Whether you're a student trying to navigate the GSoC landscape, a researcher seeking to understand trends, or a professional looking to hone your skills, I encourage you to think about how you can leverage your programming abilities to simplify and enhance your pursuits. The possibilities are endless, and the benefits are real.

As for me, my GSoC organization selection process was a resounding success thanks to this project. I was accepted to my top choice and had an incredible summer of learning and growth.

But the real reward was the experience of using my programming powers for good. And that's a feeling I'll carry with me far beyond GSoC.

So go forth, harness the power of Python and data, and work smarter in all that you do!
