Sailing In Uncharted Waters: Why Global, Grassroots Communities are Able to Open Up New Datasets

OpenAQ
Dec 19, 2017

This post is loosely based on a presentation in this session at the Fall 2017 American Geophysical Union Meeting. Here are the slides for that talk.

Open-source, open data communities are able to surface unique datasets and convene powerful communities, but their paths often lie in uncharted waters.

This post is written by OpenAQ co-founder Christa Hasenkopf.

Global, grassroots open data communities are uniquely beneficial to the public. They’re typically tiny and don’t have nearly the ‘firepower’ of larger, more traditional organizations in most arenas. But they are able to transparently surface datasets that traditional, better-funded, larger organizations simply aren’t…yet. And once those datasets are made open and freely available to the public, they offer a unique platform around which to convene multiple, diverse sectors across geographies to tackle an issue.

Only in the last few years have available technical tools and the falling cost of cloud computing made it feasible for these communities to emerge. This has given rise to exciting new possibilities for data-wrangling and community-building by non-traditional entities, but these new groups face challenges in navigating sustainable courses through these uncharted waters. This post discusses why I believe global, grassroots communities have been able to open up data where more traditional avenues haven’t, and talks a little about sustainability. Disclosure: Much — if not all — of the post pulls from my experiences with OpenAQ, simply because that’s the main vantage point from which I’ve existed in this space. Your own path will vary. Please share it below!

So why aren’t traditional organizations opening up some powerful datasets?

It just doesn’t make sense for them to do so in a lot of cases. I think it breaks down into four main reasons:

(1) Opening some datasets is outside of organizational mandates.

Many organizations have no mandate to gather certain types of datasets and make them open — especially for the purpose of making them freely available and fostering a community building on top of and around them, rather than using them for a particular use-case. For instance, I can’t imagine it is feasible for a company gathering real-time air quality data generated by governments — the same dataset aggregated by OpenAQ — to spend substantial resources wrangling the messy data and publishing it in a nice, neat, open form when doing so doesn’t support their long-term stability, or in some cases even undermines it. This is probably especially true if the company is a start-up, where money, staff, and time are even more precious.

(2) Opening the datasets is politically difficult, or the information sits in legally murky waters.

Some datasets are politically sensitive for certain groups to gather or to publicly acknowledge they use, even if they do so privately. For instance, it would be odd for one government to publicly aggregate and share other governments’ real-time air quality data without their explicit permission. Large international non-governmental organizations would have to undertake some pretty heavy lifting to get scores of countries to agree to share their data in the same interface. Meanwhile, entities that do aggregate the data in automated ways similar to our own may worry about the undefined legal space around aggregating and transparently attributing some government data, depending on how those data are shared and what language, if any, describes their openness (for a US-centric example, here is a recent court case in California discussing data-scraping).

(3) Creating the datasets requires a technical capability that is sometimes not available, or perhaps not emphasized or understood by organizational decision-makers.

Even in cases where there is organizational will (i.e. funding and mandate) to unleash freely open data gathered through transparent means, a lack of technical capability or insight can get in the way. For instance, it is one thing to “value open data” as an organizational policy, and another to interpret that to mean providing a robust programmatic means of accessing information that will respond to a given community’s evolving needs.

For instance, we have had conversations with several well-meaning organizations that describe their projects as making data more open, but in reality those projects visualize the data on a website without making it easily accessible to anyone downstream. Both visualizing data and making it accessible are important, but they are different, distinct acts.
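
To make that distinction concrete, here is a minimal sketch of what programmatic access looks like: a script that pulls measurements as structured data, rather than reading values off a chart. The endpoint and parameters are illustrative, modeled on OpenAQ’s public v1 API, and the response fields shown are assumptions rather than a guaranteed schema.

```python
import requests

# Pull recent PM2.5 measurements for one city as structured data.
# Endpoint and parameters are illustrative, modeled on OpenAQ's
# public v1 API; the field names below are assumptions, not a contract.
resp = requests.get(
    "https://api.openaq.org/v1/measurements",
    params={"city": "Delhi", "parameter": "pm25", "limit": 100},
    timeout=30,
)
resp.raise_for_status()

# Each record arrives as data a downstream user can analyze, store,
# or re-visualize however they need.
for m in resp.json()["results"]:
    print(m["location"], m["date"]["utc"], m["value"], m["unit"])
```

A visualization answers one question for one audience; an endpoint like this lets every downstream audience ask its own.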

We’ve also been told that individuals at a large international organization think OpenAQ is a “cool app.” This is a well-intentioned compliment, but calling a platform that provides programmatic data access “an app” is about as fundamental a misunderstanding as going to a wholesale grocer and expecting to be handed a dinner menu. These aren’t subtle, nit-picky distinctions; they matter. Even if technical capacity at an organization is strong, if the decision-makers who allocate funds and define projects don’t understand the basics of a strong data-sharing pipeline, the potential open data power those entities could harvest, given their rich resources, will not be realized.

Apps, open data platforms that harmonize disparate data sources, and the originating data sources themselves are all key components of an effective open data pipeline that maximizes the utility of publicly-shared data for a variety of audiences. Understanding the distinct role of each piece is necessary to build the most robust pipeline possible.

(4) It can be easier and more exciting to invest in data-generating devices, or in the public-facing end products of those devices, than in the part that harmonizes data for more uses downstream.

Picture the open data pipeline as a spectrum. The far left represents physical stuff you can actually deploy to generate data: sensors, monitors, things you actually touch. The far right represents a nice end-product the public can appreciate: an app, a beautiful data visualization, a solid policy-relevant scientific analysis. Both ends have tangible, easy-to-explain outcomes. But the middle part is tricky. It often takes logistical, technical, and/or diplomatic wrangling to harmonize multiple data sources from different devices and entities. Meanwhile, you enable others to create cool outputs by connecting them with originating data sources, but you can take credit neither for generating the data in the first place nor for the resultant public-facing outputs. Nor do you likely have purview over the output systems, so you can’t robustly understand their full reach and impact. And if a downstream user doesn’t cite you in some way or otherwise contact you, you may not even know about a cool use-case. You might also be tempted to bypass the middle part altogether and build a smaller-scale, often device-specific app or other use-case directly from individual data sources.
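
To give a flavor of that middle-part wrangling, here is a hedged sketch of the adapter pattern a harmonizing platform might use. The two source formats and all field names are hypothetical; OpenAQ’s actual open-source fetch pipeline works source-by-source in a similar spirit, but this is not its code.

```python
from datetime import datetime, timezone

def normalize_source_a(record):
    # Hypothetical source A: flat fields, ISO 8601 timestamps.
    return {
        "location": record["station_name"],
        "parameter": record["pollutant"].lower(),
        "value": float(record["concentration"]),
        "unit": record["units"],
        "date_utc": record["timestamp_utc"],
    }

def normalize_source_b(record):
    # Hypothetical source B: nested fields, epoch-second timestamps.
    return {
        "location": record["site"]["name"],
        "parameter": record["measure"].lower(),
        "value": float(record["data"]["val"]),
        "unit": record["data"]["units"],
        "date_utc": datetime.fromtimestamp(
            record["epoch"], tz=timezone.utc
        ).isoformat(),
    }

# One adapter per source; every downstream user sees a single schema.
ADAPTERS = {"source_a": normalize_source_a, "source_b": normalize_source_b}
```

The adapters are dull to write and endless to maintain, which is exactly why this layer is so often skipped, and why it is so valuable when someone does it.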

For all of these reasons, making a data-aggregating platform with the core purpose of harmonizing disparate datasets so others can be awesome is a hard sell for many organizations.

Then how is it possible for grassroots communities to aggregate and harmonize open datasets?

In short, grassroots communities are able to get around or avoid most of the issues facing more traditional organizations, especially initially. They also tend to organically form around a particular problem — and that’s what gives them the convening force they need to emerge. OpenAQ is possible because the need to freely and transparently access real-time air quality data from governments is large enough that individuals with relevant skill sets are willing to help out and build the platform — that is also one of the core reasons why we are open-source. Individuals would rather invest a little time contributing to an existing project they can benefit from than attempt to build the whole thing from scratch and then maintain it themselves.

But it is also extraordinarily challenging for global, grassroots communities opening up data to sustain themselves — especially when they’re giving away the most obvious thing they could charge for: data! In the long term, I believe traditional organizations’ mandates will grow to include fostering global, grassroots communities that open up data in ways currently considered non-traditional. In the meantime, it’s the job of organizations like ours to show we’re worth that expanded mandate and to help them find mechanisms to make it happen.

For now, we’re beginning to find funding avenues beyond “seed” grants, like contract work to shore up our backend infrastructure, add new data sources, and track down metadata for larger organizations using our platform to power their efforts. We have also received grants for workshops and are launching an organizational sponsorship program, through which companies and non-profit organizations can support our work and the community doing powerful work on top of it. These sorts of funding structures, in addition to our core mission, have lent themselves best to a non-profit business model for our work.

There truly are a million different sustainability paths similar communities in this space could pursue in these uncharted waters, and the only “right” one is the path that lets you pursue your mission, enables impact, and is fun. If you are a grassroots community opening up data, what does your path look like?


OpenAQ

We host real-time air quality data on a free and open data platform because people do amazing things with it. Find us at openaq.org.