Using BigQuery & GitHub to Build a Giant Dataset of JavaScript Functions

One of the creative perks of working at BrightMinded is the opportunity to experience the future and work with hot technology. There is an exciting culture of learning and collaboration that encourages us to find bright solutions to some of the biggest challenges. 

In this project, I got to explore the gigantic GitHub dataset of open source code and extract a database of JavaScript functions with the aim of exploring patterns in functional code.

How Big is BigQuery?

Not only does BigQuery expose a familiar SQL API, but it is also possible to include regular expressions, run external JavaScript through user-defined functions and even call C functions using WebAssembly!

The GitHub Dataset

The GitHub dataset contains all the source code that is clearly marked with an open-source license, including commit messages and file metadata. There have been a lot of fun studies of the dataset since its release, including Expressions of Emotion in Commit Messages, Profanity by Programming Language and, of course, the infamous Spaces vs Tabs.

Key Stats:

  • Over 2 billion file paths
  • 165 million open source code file contents
  • 145 million commits
  • 48 thousand F-bombs

A sample query to find all the code containing the words “This should never happen”:

SELECT count(*) FROM [bigquery-public-data:github_repos.sample_contents]
WHERE NOT binary
AND content CONTAINS 'This should never happen'

Extracting JavaScript Functions

Step 1 — All The JavaScript

Queries are priced based on the amount of data read (the bytes in every column the query touches), regardless of the number of rows returned, so it is advisable to build a smaller table containing only the data needed.

Query to find all the JavaScript (.js) files:

SELECT contents.content AS content, files.path AS path
FROM [bigquery-public-data:github_repos.contents] contents
INNER JOIN [bigquery-public-data:github_repos.files] files
ON files.id = contents.id
WHERE files.path LIKE '%.js'
AND contents.binary = false

Step 2 — Function Extraction

The naive approach to finding JavaScript functions would be to locate the keyword “function” and then track the opening and closing of the curly braces that follow it. However, this quickly breaks down due to the nature of JS and its ‘nested’ and ‘pass by reference’ functions, JSON notation, and regular expressions, all of which can contain braces that do not mark the start or end of a function body.
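To make the failure concrete, here is a minimal sketch of the brace-counting approach. The function name naiveExtract and both sample inputs are made up for illustration; this is not the code used in the project. It handles a simple function correctly but is fooled by a closing brace inside a string literal.

```javascript
// Naive extractor: find the keyword "function", then count braces
// until they balance. It has no idea what a string or regex is.
function naiveExtract(source) {
  const results = [];
  let searchFrom = 0;
  while (true) {
    const start = source.indexOf("function", searchFrom);
    if (start === -1) break;
    const open = source.indexOf("{", start);
    if (open === -1) break;
    let depth = 0;
    let end = -1;
    for (let i = open; i < source.length; i++) {
      if (source[i] === "{") depth++;
      else if (source[i] === "}") depth--;
      if (depth === 0) {
        end = i;
        break;
      }
    }
    if (end === -1) break; // braces never balanced
    results.push(source.slice(start, end + 1));
    searchFrom = end + 1;
  }
  return results;
}

const simple = "function add(a, b) { return a + b; }";
const tricky = 'function close() { return "}"; }';

const simpleResult = naiveExtract(simple);
const trickyResult = naiveExtract(tricky);

console.log(simpleResult[0] === simple); // true: the whole function is captured
console.log(trickyResult[0] === tricky); // false: cut off at the "}" inside the string
```

Because the sketch resumes its search after each match, it also never reports nested functions on their own, which is another reason to prefer a real parser.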

The solution is to parse the code and traverse the resulting syntax tree. Luckily there are some great resources out there that do just this, such as Esprima.
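As a rough sketch of what traversing the tree looks like, the example below walks an ESTree-style tree (the node format Esprima produces) and collects every function node, including nested ones. To keep it dependency-free, the tree is written out by hand rather than generated by a parser, and collectFunctions is a hypothetical helper, not part of Esprima.

```javascript
// Hand-written ESTree-style tree for:
//   function outer(a) { var inner = function (b) { return b; }; }
const tree = {
  type: "Program",
  body: [
    {
      type: "FunctionDeclaration",
      id: { type: "Identifier", name: "outer" },
      params: [{ type: "Identifier", name: "a" }],
      body: {
        type: "BlockStatement",
        body: [
          {
            type: "VariableDeclaration",
            kind: "var",
            declarations: [
              {
                type: "VariableDeclarator",
                id: { type: "Identifier", name: "inner" },
                init: {
                  type: "FunctionExpression",
                  id: null,
                  params: [{ type: "Identifier", name: "b" }],
                  body: {
                    type: "BlockStatement",
                    body: [
                      {
                        type: "ReturnStatement",
                        argument: { type: "Identifier", name: "b" },
                      },
                    ],
                  },
                },
              },
            ],
          },
        ],
      },
    },
  ],
};

const FUNCTION_TYPES = new Set([
  "FunctionDeclaration",
  "FunctionExpression",
  "ArrowFunctionExpression",
]);

// Depth-first walk: visit every object and array in the tree and
// record a summary of each node whose type marks it as a function.
function collectFunctions(node, found = []) {
  if (node === null || typeof node !== "object") return found;
  if (Array.isArray(node)) {
    for (const child of node) collectFunctions(child, found);
    return found;
  }
  if (FUNCTION_TYPES.has(node.type)) {
    found.push({
      name: node.id ? node.id.name : null, // anonymous functions have no id
      params: node.params.map((p) => (p.type === "Identifier" ? p.name : p.type)),
    });
  }
  for (const key of Object.keys(node)) collectFunctions(node[key], found);
  return found;
}

const found = collectFunctions(tree);
console.log(found);
```

Since the parser has already worked out where strings, regular expressions and object literals begin and end, the walk only needs to look at node types and never at raw braces.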

An example using an npm module to extract functions from a string:

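// Load the function-extractor module from npm.
const functionExtractor = require("function-extractor");
// Sample source: two top-level functions, the first containing a nested function expression.
const f = `var a=0; function test(a, b){b=function(){return 99}; return a+b} function test2(a){a=2; return a+ 4}`;
// Parse the source string and collect the functions it contains.
const functions = functionExtractor.parse(f);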

Step 3 — The Final Query

With some modifications to the syntax tree generator, I was able to write a “function extractor” module which parsed the functions out of a string of code, including the function names, parameters, code block, and surrounding comments. This was then included in the SQL query which used the BigQuery JSON functions to build a table of results.

CREATE TEMP FUNCTION extractFunctionsWithComments(str STRING)
RETURNS ARRAY<STRING>
LANGUAGE js
OPTIONS (library=["gs://github-data/functionExtractor.js"])
AS """
    return extractFunctions(str);
""";

SELECT JSON_EXTRACT_SCALAR(func, "$.name") AS name,
JSON_EXTRACT_SCALAR(func, "$.params") AS params,
JSON_EXTRACT_SCALAR(func, "$.comments") AS comments,
JSON_EXTRACT_SCALAR(func, "$.blockStart") AS blockStart,
JSON_EXTRACT_SCALAR(func, "$.block") AS block,
contents.path
FROM `github_js.github_contents` AS contents
CROSS JOIN UNNEST(extractFunctionsWithComments(contents.content)) AS func
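The query expects each array element returned by the UDF to be a JSON string with name, params, comments, blockStart and block fields. The sketch below shows one way the library side could serialise its results into that shape. The helper toUdfRows, the input record layout and the sample values are all assumptions for illustration, not the actual functionExtractor.js code. One detail worth noting: JSON_EXTRACT_SCALAR returns NULL when the selected value is an array or object, so the parameter list is joined into a single string here.

```javascript
// Hypothetical serialiser: turns extracted function records into the
// ARRAY<STRING> of JSON documents that the SQL above unpacks.
// Field names mirror the JSON paths used in the query.
function toUdfRows(records) {
  return records.map((fn) =>
    JSON.stringify({
      name: fn.name,
      params: fn.params.join(", "), // a plain string, readable by JSON_EXTRACT_SCALAR
      comments: fn.comments,
      blockStart: fn.blockStart,
      block: fn.block,
    })
  );
}

// Made-up sample record; blockStart is an illustrative value only.
const rows = toUdfRows([
  {
    name: "test2",
    params: ["a"],
    comments: "",
    blockStart: 0,
    block: "{a=2; return a+ 4}",
  },
]);

const parsed = JSON.parse(rows[0]);
console.log(parsed.name, parsed.params);
```

Returning strings rather than structured values keeps the UDF signature simple, at the cost of a JSON_EXTRACT_SCALAR call per column in the query.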

Results

The query took just under 4 hours to run and returned about 8 billion functions, of which 65 million were unique, with a total of 7 million function names!

Figure: BigQuery results table

Longest JavaScript Function Names

With the results, it was easy to find the longest function names used in JavaScript. Here’s a selection of the longest function names (discarding anything that contained underscores or special characters or that appeared to be machine generated).
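The filtering step can be sketched roughly as follows. The helper longestPlainNames and the sample names are invented for illustration and are not taken from the results table. The sketch only covers the mechanical part of the filter (no underscores or special characters); judging whether a name looks machine generated still needs a human eye.

```javascript
// Keep names made of letters and digits only, drop duplicates,
// then sort longest first (ties broken alphabetically).
function longestPlainNames(names, limit) {
  const plain = /^[A-Za-z][A-Za-z0-9]*$/;
  return [...new Set(names)]
    .filter((name) => plain.test(name))
    .sort((a, b) => b.length - a.length || a.localeCompare(b))
    .slice(0, limit);
}

// Made-up sample input.
const sample = [
  "getUserProfileById",
  "a",
  "__webpack_require__",
  "$scope",
  "handleClick",
  "handleClick",
];

const top = longestPlainNames(sample, 2);
console.log(top);
```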

The Future

We have only just scratched the surface of this huge dataset, and I look forward to running further experiments that could use a similar process, including:

  • The use of BigQuery ML to explore relationships between functions and their comments
  • An exploration of “All the JSON” and “All the Arrays”
  • Code statistics after minification using UglifyJS