Introduction

Today github announced new BigQuery tables that let you query data about github repos using the BigQuery. I found about it from hacker news. Also see announcement from Google. The killer new feature is the ability to query the contents of all files stored on github.

I will use it to list top packages required in elisp files on github. If you are impatient you can jump directly to Results. To run a query go to BigQuery console and click “Compose Query”.

In contrast to old BigQuery with top starred repos I will list all packages required in any elisp file, rather than just repositories written in emacs lisp. This will include built in emacs packages or packages not stored on github.

Revision history of this blog post is stored on github.

Bias towards repos with license

As this hacker news post mentions, only repos with license are indexed. That will bias the result towards packages rather than personal emacs configs people put on github.

Query on sample_contents

sample_contents only contains 10% sample files from top 130k repos according to stars. It also contain pre-joined data between tables contents and files. It’s good for data exploration before running the bigger query on the full dataset.

SELECT
  line,
  COUNT(DISTINCT(sample_repo_name)) as count
FROM (
  SELECT
    SPLIT(content, '\n') line,
    sample_repo_name
  FROM
    [bigquery-public-data:github_repos.sample_contents]
  WHERE
    (sample_path LIKE '%.el'
      OR sample_path LIKE '%.emacs')
  HAVING
    line LIKE '(require%)')
GROUP BY
  1
ORDER BY
  count DESC
LIMIT
  10;

Top 10 results:

line count
(require ‘cl-lib) 32
(require ‘ert) 22
(require ‘cl) 20
(require ‘comint) 19
(require ‘org) 14
(require ‘font-lock) 13
(require ‘compile) 11
(require ‘eieio) 10
(require ‘package) 9
(require ‘dash) 9

Since the top entry got only 32 hits it made me worry that sample size is too small.

Only 266 emacs repos are included in sample_contents:

SELECT
  COUNT(DISTINCT(sample_repo_name))
FROM [bigquery-public-data:github_repos.sample_contents]
WHERE
(sample_path LIKE '%.el' OR
    sample_path LIKE '%.emacs')
266

That made me think that I need to query the unsampled data.

Query on full, unsampled, data

contents contains all files, but require join with files to get file path. My query:

SELECT
  TOP(line, 5000),
  COUNT(*) AS c
FROM (
  SELECT
    SPLIT(contents.content, '\n') line,
    contents.id AS id
  FROM
    [bigquery-public-data:github_repos.contents] AS contents
  JOIN (
    SELECT
      path,
      id
    FROM
      [bigquery-public-data:github_repos.files]
    WHERE
      path LIKE '%.el'
      OR path LIKE '%.emacs') AS files
  ON
    (contents.id == files.id)
  WHERE
    contents.content CONTAINS '(require \''
  HAVING
    LEFT(line, 10) == "(require '"
    AND RIGHT(line, 1) == ")");

Thanks to all optimizations getting top 5000 packages takes only 60 seconds.

I have been getting a few errors, including Error: Resources exceeded during query execution. for a while. It required a couple of BigQuery tricks to make it work

  • TOP/COUNT is faster than GROUP BY/ORDER/LIMIT.
  • Filtering data prior to join in sub-query reduces memory usage.
  • Regexps and globs are expensive. LEFT/RIGHT is much faster.
  • Avoid scanning files without .el or .emacs extension (thanks to filtering done by join).

Results

Full list of top 5000 is at https://github.com/kozikow/kozikow-blog/blob/master/github_emacs_packages_5000.csv .

Packages with 1000+ usages:

package count
(require ‘cl-lib) 18901
(require ‘org) 12144
(require ‘cl) 11695
(require ‘ob) 11073
(require ‘ert) 7978
(require ‘semantic) 6684
(require ‘easymenu) 5821
(require ‘comint) 5255
(require ‘eieio) 4691
(require ‘gnus) 4446
(require ‘font-lock) 4272
(require ‘quail) 3991
(require ‘helm) 3839
(require ‘dash) 3650
(require ‘calc-macs) 3602
(require ‘compile) 3524
(require ‘gnus-util) 3418
(require ‘erc) 3371
(require ‘calc-ext) 3295
(require ‘thingatpt) 3294
(require ‘org-compat) 3282
(require ‘org-macs) 3268
(require ‘shimbun) 3019
(require ‘url-parse) 2921
(require ‘format-spec) 2696
(require ‘company) 2607
(require ‘package) 2512
(require ‘ring) 2399
(require ‘message) 2334
(require ‘cc-mode) 2329
(require ‘python) 2243
(require ‘nnheader) 2232
(require ‘mm-util) 2227
(require ‘custom) 2223
(require ‘calendar) 2190
(require ‘nnoo) 2187
(require ‘gnus-sum) 2044
(require ‘mh-e) 2022
(require ‘ox) 1978
(require ‘gyp) 1964
(require ‘magit) 1953
(require ‘ede) 1900
(require ‘json) 1874
(require ‘dired) 1849
(require ‘nnmail) 1799
(require ‘button) 1781
(require ‘tramp) 1693
(require ‘slime) 1629
(require ‘etags) 1619
(require ‘ansi-color) 1618
(require ‘wid-edit) 1611
(require ‘url) 1566
(require ‘mm-decode) 1562
(require ‘gnus-art) 1543
(require ‘helm-help) 1538
(require ‘semantic/format) 1504
(require ‘outline) 1495
(require ‘imenu) 1493
(require ‘ob-eval) 1457
(require ‘ob-core) 1419
(require ‘url-util) 1396
(require ‘ecb-util) 1374
(require ‘pcomplete) 1357
(require ‘url-vars) 1337
(require ‘speedbar) 1336
(require ‘widget) 1273
(require ‘esh-util) 1260
(require ‘s) 1239
(require ‘helm-utils) 1234
(require ‘auto-complete) 1232
(require ‘xml) 1231
(require ‘semantic/db) 1219
(require ‘haskell-mode) 1201
(require ‘semantic/analyze) 1195
(require ‘ewoc) 1152
(require ‘gnus-range) 1151
(require ‘eshell) 1065
(require ‘cider-client) 1040
(require ‘eieio-base) 1037
(require ‘find-func) 1032
(require ‘semantic/ctxt) 1008
(require ‘help-mode) 1006
(require ‘gnus-int) 1001

Things to improve

In sampled query I counted unique repos that reference given library. In full query I just counted number of times given line appeared in any emacs file. It is better to count unique repos, but calculating distinct repos on a full contents would require additional join.

7 responses

  1. […] github contents recently got onto big query. See my other post about it: https://kozikow.wordpress.com/2016/06/29/top-emacs-packages-used-in-github-repos/ […]

  2. […] file contents query-able on Google BigQuery. My first post with the general introduction about it: https://kozikow.wordpress.com/2016/06/29/top-emacs-packages-used-in-github-repos/ […]

  3. […] github contents recently got query-able by the Google BigQuery. See my other post about it: https://kozikow.wordpress.com/2016/06/29/top-emacs-packages-used-in-github-repos/ […]

  4. […] Top emacs packages used in github repos (Reddit) […]

  5. Oscar Najera Avatar

    I would also test against the use-package keyword. Many of us use it more often than require

    1. kozikow Avatar

      Thanks! I’ll fix. I am writing new post with clustering packages, so I will do it correctly there. I expect results will be similar (except if use-package users would use different packages than require users).

Leave a comment