Machines Don't Read The Way We Do
Letter from the Executive Director, June 2022
Fundamentally, machines don’t read like humans read. This is true not only in the way information is interpreted, but also in how content must be structured so that machines can consume it. There are exceptions, I grant, but humans are far more adept at moving quickly between consuming content in one form and another. Shifting from text recognition to image recognition to video or sound recognition is the kind of pattern-switching that people handle easily and machines do not. Even in the world of textual analysis, “reading” a corpus of literature requires consistency across forms to be efficient and effective at scale. According to a 2020 report, the time dedicated to data loading and cleansing, the processes necessary before data can be analyzed, represents 45% of overall project time in data analysis. This will vary by domain or dataset, but fundamentally, information needs to be consistently structured so that machines can consume it in meaningful ways. Of course, this process can be made more efficient through the use of standards.
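As a small illustration of what that cleansing step looks like in practice, here is a minimal sketch, in Python, of the sort of normalization a text-mining pipeline might apply before any analysis begins. The function name and the particular rules are hypothetical examples, not a standard; real pipelines encode many more such rules, which is precisely why the preprocessing burden is so large.

```python
import re
import unicodedata

def normalize_document(raw_text: str) -> str:
    """Reduce one document to a consistent form before analysis.

    Hypothetical example rules: unify Unicode forms, straighten
    curly quotes, rejoin words hyphenated across line breaks, and
    collapse runs of whitespace.
    """
    text = unicodedata.normalize("NFKC", raw_text)
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    # Rejoin words split by end-of-line hyphenation, e.g. "infor-\nmation".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse all remaining runs of whitespace to single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

if __name__ == "__main__":
    sample = "Machines don\u2019t read infor-\nmation   the way  we do."
    print(normalize_document(sample))
    # -> Machines don't read information the way we do.
```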
At the start of last month’s NISO Hot Topic Virtual Conference on Text and Data Mining, I engaged in a wide-ranging conversation with Petr Knoth, head of the Big Scientific Data and Text Analytics Group within the Knowledge Media Institute at the Open University. We discussed the key elements of the process of textual analysis from the perspective of a researcher and from that of an organization aggregating a tremendous amount of data and providing it to researchers; Petr and his team do both. We started with the many applications of textual analysis, from using it for discovery because no human can read all the relevant literature, to discerning unique connections among the literature, to assessing content for plagiarism detection. The conversation extended from there to the role of repositories in aggregating this information and the need for wider standards adoption to make these kinds of approaches more easily implementable.
Facilitating this consumption by researchers involves more than simply authors marking up content, even if progress could be made on that front. Later in the program, the topic of “data lakes” came up in various contexts. COAR, for example, brings together a vast network of open access repositories. At other points in the program, Prathik Roy, product director for data solutions and strategy at Springer Nature; John A. Walsh, director of the HathiTrust Research Center; and Nathan Kelber, director of the Text Analysis Pedagogy Institute at JSTOR Labs, all spoke about allowing researchers access to published content in different ways.
Could we envision a world in which researchers didn’t each have to assemble their own repository of relevant content, cobbled together from the collection of resources to which their institution might have TDM access? Imagine how valuable it would be if a trusted third party could provide researchers with an analyzable corpus of all scholarly literature. Of course, the licensing arrangements might be challenging, and adequate controls would have to be instituted. Yet such a system might offer publishers more security than they have now: the content they wished to control would be computable without researchers having to crawl it and build their own bespoke copies. It would also allow researchers to focus more of their time on analytical approaches and the resulting output rather than on gathering and normalizing the content they wish to process.
Can authors play a more important role in this process? Authors are often focused primarily on the content, not on the structure of the document or on the consistency of its metadata and linkages to content outside the document. Yet for content to be machine consumable, it needs to be in a format that computational systems can access. Content needs to be structured to include appropriate identifiers, metadata, and semantic linking so the document can be efficiently understood. As Petr suggested, this needn’t presume that the author will know, or be in a position to include, the correct DOI name string or the appropriate data element URI. Rather, he envisioned authoring tools or manuscript submission systems assisting the author with automated drop-down lists and dialog boxes that ask things like, “Do you mean this referent and this related identifier?”
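To make that idea concrete, the snippet below is a minimal sketch of the kind of look-up such a submission system might perform behind a dialog box. It assumes the public Crossref REST API as one possible identifier source; the function name and the choice of service are illustrative, not anything Petr or the panelists specified.

```python
import json
import urllib.parse
import urllib.request

def suggest_dois(citation_text: str, limit: int = 5) -> list[dict]:
    """Return candidate DOIs for a free-text citation.

    Illustrative only: queries Crossref's bibliographic search and
    returns the top matches, from which an authoring tool could
    build a "Do you mean...?" drop-down for the author.
    """
    query = urllib.parse.urlencode(
        {"query.bibliographic": citation_text, "rows": limit}
    )
    url = f"https://api.crossref.org/works?{query}"
    with urllib.request.urlopen(url) as response:
        items = json.load(response)["message"]["items"]
    return [
        {
            "doi": item.get("DOI"),
            "title": (item.get("title") or ["(untitled)"])[0],
        }
        for item in items
    ]

if __name__ == "__main__":
    # Present the candidates as a dialog box would.
    for candidate in suggest_dois("text and data mining of scholarly literature"):
        print(f'Do you mean "{candidate["title"]}" ({candidate["doi"]})?')
```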
Computational analysis demands a tremendous amount of content, as well as normalized formats against which analytical tools can be applied. Standards can play an important role in each of these processes. Beyond the formats themselves, standards can also provide frameworks for identifying and addressing biases inherent in the data, in the training set, or in the algorithms used to study the data. Organizations of all sorts need to support the normalization of these data to reduce the burden on the users who apply text and data mining approaches.
Sincerely,
Todd Carpenter
Executive Director, NISO