We are in a new Gold Rush. Rapidly improving technology in both the collection and storage of data has created vast troves of unexplored information.
If we focus our gaze towards the municipal government, we can see the same proliferation of data, only slower. Local governments have been collecting data for centuries, but until recently it was not always accessible, or even considered “data.” Since the 19th century, Boston has been issuing and recording building permits. Through a massive digitization effort these permits are now accessible in an online database.[3] Not only are governments turning to modern methods of data storage, but they are also creating applications to encourage citizens to engage with their local governments. Mobile and web applications will hopefully facilitate greater interaction between citizen and government.[4] Each and every one of these citizen-to-government interactions is recorded and stored in a database, though not all are open and accessible to the citizen scientist.
Boston has built a few mobile applications for its residents. Notable among these apps are BOS:311,[5] ParkBoston,[6] the less popular Boston PayTix,[7] and the new Blue Bikes.[8] Through BOS:311, residents can communicate directly with the Department of Public Works by recording an issue, its location, and even an image of the issue. Blue Bikes trips, 311 requests, and much more are provided to the public via Analyze Boston, Boston’s data portal.[9]
This new availability of data has unintentionally altered the way in which scientists interact with data. For the purposes of scientific inquiry, scientists and analysts have historically been close to the data generation process.
While we as residents and citizens might interact with governmental agencies, we don’t do so as scientists. Likewise, governmental agencies engage with citizens as governments, not scientists. As such, much, if not all, of the open and public data that we use within urban informatics, and the digital humanities more broadly, was not generated with the express purpose of being analyzed. This inherently changes the way in which one approaches analysis.
In approaching data of this nature, researchers have begun embracing exploratory data analysis (EDA). EDA is extremely useful for developing insights from data for which there were no a priori hypotheses. In their influential book, R for Data Science,[10] Garrett Grolemund and Hadley Wickham describe the inductive approach of exploratory data analysis:
“Data exploration is the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again.”[11]
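To make the look, hypothesize, test, repeat cycle concrete, here is a minimal sketch in Python using only the standard library. The records, field names, and helper functions (`count_by`, `median_by`) are all invented for illustration; they are not real BOS:311 data and do not reflect the actual schema published on Analyze Boston.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical records, invented for illustration only. These are NOT real
# BOS:311 data, and the field names are assumptions.
REQUESTS = [
    {"type": "Pothole", "neighborhood": "Dorchester", "days_to_close": 2},
    {"type": "Pothole", "neighborhood": "Roxbury", "days_to_close": 4},
    {"type": "Street Light", "neighborhood": "Dorchester", "days_to_close": 10},
    {"type": "Street Light", "neighborhood": "Allston", "days_to_close": 14},
    {"type": "Pothole", "neighborhood": "Allston", "days_to_close": 3},
    {"type": "Graffiti", "neighborhood": "Roxbury", "days_to_close": 7},
]


def count_by(records, field):
    """Look at the data: how often does each value of `field` occur?"""
    return Counter(r[field] for r in records)


def median_by(records, group_field, value_field):
    """Quickly test a hunch: compare the median of `value_field` across groups."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_field]].append(r[value_field])
    return {g: median(vals) for g, vals in groups.items()}


# Step 1: look. Which request types dominate?
type_counts = count_by(REQUESTS, "type")

# Step 2: hypothesize and test. Are potholes closed faster than street lights?
closing_time = median_by(REQUESTS, "type", "days_to_close")

# Step 3: repeat with a new question. Does closing time vary by neighborhood?
neighborhood_time = median_by(REQUESTS, "neighborhood", "days_to_close")

print(type_counts.most_common())
print(closing_time)
print(neighborhood_time)
```

Each step takes only a line or two, so a new question can be asked as soon as the previous answer suggests one. With real open data the same loop would typically begin by loading a downloaded file rather than a hard-coded list.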
When researchers set out to test a hypothesis they often become closely involved with the data generation process. In this scenario, researchers are more likely to have preconceived hypotheses and expectations of what they may find hidden in their data.
This is often not the case when working with open data. We do not always know at the outset what we are looking for. With open data, and really with any data, you never know what you may find if you begin to dig.