What are the most popular OSS data projects of 2021?

Pete Soderling
Apr 4, 2021 · 2 min read

We ran our Data Council 2021 OSS Data Tools Community Survey in February, and were thrilled with the response from the community. We received more than 500 individual responses from more than a dozen countries, containing 1,133 tool entries in total.

By our count, there were 214 different tools mentioned. For me, this was one of the largest confirmations that we’re still in a time of massive innovation at all layers of the data stack. (Might be time for another one of these posts!)

A couple of notes on how we cleaned and interpreted the data:

  1. We attempted to normalize the project names for accuracy wherever possible. However, in limited cases respondents mentioned a commercial company name rather than a specific OSS project. If there is only one known OSS project associated with that company, we normalized that vote into a vote for the OSS project (e.g. ‘Coiled’ became ‘Dask’). However, if there were multiple OSS projects related to the company (e.g. a vote for “Databricks”, which is ambiguous as it could refer to Spark or Delta Lake) we left the data as-is, since these cases were limited.
  2. If you look through the summarized results (or the raw data here), you’ll note that in the long tail of projects that received a single vote there are some that aren’t necessarily pure data projects (hello Java & GitHub!). To us, it didn’t make sense to prune these out, as they didn’t skew the main results in any way. We kept them in to reflect the organic output of the community.
  3. Regardless of point #2, the majority of long tail votes were for projects that are specifically focused on working with data — and some are even large and entrenched projects (like Neo4j, Redis & Keras), though they didn’t receive significant attention in the survey from our community.
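
For readers who want to apply the same cleaning rule to the raw data, here is a minimal sketch of what note #1 describes. It is not the script we actually used: the function names (`normalize_vote`, `top_tools`), the two lookup tables, and the sample votes are all hypothetical, seeded only with the Coiled and Databricks examples mentioned above.

```python
from collections import Counter

# Hypothetical alias table: company name -> its single known OSS project.
# Only the 'Coiled' -> 'Dask' example comes from the post.
COMPANY_TO_PROJECT = {"coiled": "Dask"}

# Company names tied to several OSS projects; these votes are left as-is.
AMBIGUOUS_COMPANIES = {"databricks"}


def normalize_vote(raw: str) -> str:
    """Map a company name to its OSS project when the mapping is unambiguous."""
    name = raw.strip()
    key = name.lower()
    if key in AMBIGUOUS_COMPANIES:
        return name  # could mean Spark or Delta Lake, so do not guess
    return COMPANY_TO_PROJECT.get(key, name)


def top_tools(votes, n=20):
    """Tally normalized votes and return the n most mentioned tools."""
    counts = Counter(normalize_vote(v) for v in votes if v.strip())
    return counts.most_common(n)


# Invented sample entries, for illustration only (not survey data).
sample_votes = ["Dask", "Coiled", "Databricks", "Spark"]
print(top_tools(sample_votes, n=3))
```

Single-vote entries such as Java or GitHub pass through untouched in this sketch, which matches the choice in note #2 not to prune the long tail.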

In order of popularity, the top 20 OSS data tools are listed here:

Thank you to all who participated, shared, tweeted — and to those OSS authors and community leaders who cajoled your networks to do the same. If you’re interested in the full data you can check out the summary data or the raw data.

Footnote: while our survey focused purely on open-source data tools, our friends at Kaggle also produce an awesome State of Data Science & ML survey each year that covers a broader range of topics & themes. (If you’re curious, we suggest you check out their 2020 edition.)

Pete Soderling

Engineer, Biz hacker, Geek, Investor. Founder of Data Council & Data Community Fund. I help engineers start companies.