
When Machines Read the News: Data and Journalism in the United States, 1920-2020

This thesis examines and historicizes the assortment of tools and practices (material, epistemic, and institutional) that developed over the last century in U.S.-based newsrooms as news organizations made first sporadic, then increasingly deliberate, attempts to incorporate data-driven methods of information gathering, classification, archiving, and distribution into their organizational operations.

Methodologically, the thesis presents, first, a historical narrative that reaches from the early decades of the twentieth century into the early 2020s and, second, empirical evidence drawn from five case studies. One case study is historical and is explored in the third chapter through previously unconsulted archival material. The other four are recent or current: two involved computational data collection and web scraping (chapters four and five), one relied on ethnographic embedding, and one on interviews (interwoven with the previous two and likewise featured in chapters four and five).

The thesis concludes by arguing that, at the very least, current and future organizational histories of journalism ought to take more readily into account the approaches and findings of the history of technology and the sociology of scientific knowledge, especially because understanding the contemporary epistemic and technological intrusions of computer science, statistics, data science, and software development into journalism requires exploring both the parallels and the fault lines between these domains. The Conclusion then speculates on the potential future trajectories such convergences might take and poses questions intended to be generative, both analytical and (mildly) normative.

These include: Can news organizations maintain a unique position among technology companies, intelligence services, and private data brokers, one in which public and personal data can be responsibly collected, analyzed, and made transparent? Should the multilingual journalistic media corpus (text and images alike) constitute a significant part of the training data used by generative language models and computer vision algorithms? Should investigative reporters work alongside computer scientists, statisticians, geographers, and data scientists, or should they incorporate the skills of those domains into their own area of expertise? Does it benefit news organizations to rely on external data-analytic products for their work, or should they develop their own proprietary ones?

And finally (and very broadly): What is the role of news stories today, not only in the traditional sense of framing and giving an account of current events, but also as material that automatically becomes the data ingested into the machines that “read,” “make sense of,” and invariably produce much of the public intelligence on which humans rely?

Identifier: oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/cf7e-xd37
Date: January 2023
Creators: Ivancsics, Bernat
Source Sets: Columbia University
Language: English
Type: Theses
