In past eras of limited publicity and professional multipliers, journalists dominated the procurement of viewpoint diversity, lending relevant contenders’ ideas visibility alongside hegemonic voices. Today, the algorithm-amplified activities of countless voices and audiences govern the visibility of competing viewpoints in the digital public sphere. However, as I shall argue, the enormous expansion and fragmentation of publicity has merely shifted, but not diminished, the importance of journalism. As a point of departure, I submit that viewpoint diversity is suspended between two poles: the hegemony of one viewpoint, and the entropy of all available viewpoints – most of which are unsubstantiated or redundant. Accordingly, exposing audiences to relevant contestation is as critical as organizing the cacophony of voices and discriminating relevant from irrelevant diversity. In this effort, journalists navigate an information environment that is heavily pre-structured by algorithms.
In this paper, I review the existing literatures on viewpoint diversity and digital journalism to identify key contributions and conflicts that algorithms create for the management of relevant diversity. I propose that algorithms play a highly ambivalent role depending on which criterion of relevant diversity is considered. With regard to identifying diverse experiences, algorithms may be capable of identifying viewpoints shared by many in society, but remain limited in exposing other groups to these experiences. Herein thus lies a key journalistic contribution. At the same time, these algorithms also wash singular or contrived experiences to the fore where these appeal to common fears and stereotypes. With regard to identifying diverse knowledge, next, algorithms are largely useless: They detect only what is believed or contested, not what is important to know, and their proxies for truth – the beliefs of authorities and crowds – are dubious at best. Professional journalism remains indispensable for determining what claims valuably complement existing knowledge. Still, algorithms can flag widely held beliefs that require checking and, if necessary, correcting. With regard to procuring political-normative choices, finally, algorithms are generally useful for crystallizing alternatives, but also easily rigged. However, both their capacities and limitations are rooted in the strategic advocacy of political and social groups, and thus make little difference compared to ‘old-fashioned’ journalistic work. While journalists are no longer responsible for rendering contention public, their contribution far exceeds the pointing-out of relevant contributions. By bridging distinct ‘filter bubbles’, verifying pertinent claims and exposing baseless viewpoints, journalism is at least as critical for managing excessive diversity as for amplifying suppressed contention.
Most contemporary computational approaches to text analysis suffer from severe validity problems as the heterogeneity of analyzed discourse increases. Capitalizing on their ability to treat large numbers of documents efficiently, many tools have imposed tacit assumptions about the homogeneity of texts and expressions, which can result in consequential biases if they are violated. For instance, in bag-of-words approaches, variability in the length of texts tends to inflate the weight of long texts in a sample, banking on many, relatively uninformative co-occurrences rather than the comparatively informative contents of shorter texts (e.g., tweets). Where analyses are shifted to the paragraph level or smaller units instead, syntactic styles and linguistic conventions recreate the same bias toward longer units, while links between adjacent units are lost. Stylistic differences and formal document contents also bias the recognition of word use patterns, grouping documents and terms to reflect distinct wordings rather than meanings, while synonyms and circumlocutions are separated. Similarly, named entities are regularly referred to differently in different discursive settings, fragmenting their recognition. Each erroneous distinction between commensurable contents, in turn, results in a multiplication of matrix rows or network nodes, diluting the recognition of semantic patterns and cloaking linkages between them. All these challenges, finally, are trumped by the inability of most existing approaches to handle texts written in different languages. While advanced tools and dictionaries exist to map entities and anaphora, parse grammatical relations, and extract subtle patterns in a reliable fashion, these tools are notoriously hard to transfer across languages.
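The length bias in document-level co-occurrence counting can be illustrated with a small sketch. It is not drawn from any particular tool: the function names and the toy documents are assumptions made for illustration. Because a text with n distinct terms yields n(n-1)/2 unordered term pairs, a moderately long article can contribute nearly all pairs in a sample that also contains a tweet.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_pairs(tokens):
    """Count each unordered pair of distinct terms once per document."""
    counts = Counter()
    for a, b in combinations(sorted(set(tokens)), 2):
        counts[(a, b)] += 1
    return counts


def pair_share(documents):
    """Share of all document-level term pairs contributed by each document."""
    sizes = [sum(cooccurrence_pairs(doc).values()) for doc in documents]
    total = sum(sizes)
    return [size / total for size in sizes]


# Toy data (hypothetical): a 4-term tweet and a 40-term article.
tweet = "pm resigns after vote".split()
article = [f"w{i}" for i in range(40)]

shares = pair_share([tweet, article])
print(shares)  # the article accounts for well over 99% of all pairs
```

Any weighting scheme that treats each pair as one observation therefore lets the long text dominate unless counts are normalized by document length or restricted to a local window.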
Unless the analyzed material can be brought into one shared language, the investigation of heterogeneous, multi-lingual discourse remains limited to a manual comparison of findings generated in structurally disjunct analyses. To avoid challenges to the validity of analysis, most applications to date have focused on rather constrained samples of discourse text – at considerable cost to the reach and theoretical relevance of generated findings. Why, for instance, would we restrict our analysis of social media debates about an electoral race to tweets written in one titular language? In most countries, sizeable populations use additional languages, and cross-platform integration is advancing rapidly (e.g., tweets advertise longer articles and blog posts; discussion boards comment upon current Twitter activity). Likewise, what do we lose if our analysis algorithmically separates contributions using specialist terminology to discuss an issue from those using simpler language? Beyond the need to raise awareness of the often far-reaching implications of tacit homogeneity assumptions hard-coded into existing computational tools, new strategies are needed for evaluating and addressing these limitations. Especially for issues of transnational import, which are discussed by widely diverse audiences, in different languages, across different platforms, an approach is needed that can separate superficial differences in the lexical texture of discourse texts from the underlying patterns in their semantic content. Departing from a discussion of threats to comparative validity in computational text analysis, this paper distinguishes two main levels of heterogeneity effects, which derive from variability rooted in the language, cultural setting, document type and general properties of natural discourse.
On the level of meaning-carrying entities referred to in a text, difficulties in handling polysemic or synonymous expressions derive from different languages (e.g., ‘الوزير الأول’/‘prime minister’/‘statsminister’) and linguistic styles (e.g., ‘the prime minister’, ‘10 Downing Street’, ‘that hag’, ‘PM’), the context-dependent need for explication (e.g., ‘Theresa’, ‘prime minister’ vs. ‘the British prime minister’), and the natural variability in the use of language itself (e.g., ‘premier’, ‘prime minister’, ‘first minister’, ‘head of government’). To ensure a valid treatment of commensurable contents, an elaborate mapping of lexical expressions onto semantic concepts is necessary. On the level of meaningful associations between concepts, again, language-specific and stylistic differences (‘May’s speech held yesterday before the House of Commons’ vs. ‘Merkels gestrige Bundestagsrede’) influence the modes available for grammatically expressing relatedness, as well as the confidence warranted when translating proximity into probable relatedness; other stylistic characteristics (e.g., of document types: tweets, essays, minutes) affect the plausibility of inferring relatedness from co-presence in the same text, and provide meta- and macrostructures that help in inferring meaningful association; and the length and structuring of a document also influence the chance of two entities contained within it being meaningfully related. To establish a valid, equivalent measure of association, thus, we need to consider both proximity and co-presence while adjusting for the specific characteristics of the genre, linguistic code, and type of document. Introducing Jamcode, a dictionary-based tool for comparative semantic text analysis in heterogeneous discourse, the paper finally presents different suitable strategies for overcoming the limitations raised by common homogeneity assumptions.
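The mapping of varied lexical expressions onto one semantic concept can be sketched as follows. This is a minimal illustration under stated assumptions, not Jamcode's dictionary format: the rule structure, the concept label `HEAD_OF_GOVERNMENT`, and the context patterns are all hypothetical. Ambiguous expressions such as ‘PM’ are only coded when a disambiguating context term appears in the same text.

```python
import re

# Hypothetical rules: surface pattern -> concept, with an optional
# context pattern that must also occur somewhere in the text.
RULES = [
    {"pattern": r"\bprime minister\b", "concept": "HEAD_OF_GOVERNMENT", "requires": None},
    {"pattern": r"\bstatsminister\b", "concept": "HEAD_OF_GOVERNMENT", "requires": None},
    {"pattern": r"\bhead of government\b", "concept": "HEAD_OF_GOVERNMENT", "requires": None},
    {"pattern": r"\b10 downing street\b", "concept": "HEAD_OF_GOVERNMENT", "requires": None},
    # 'pm' is ambiguous (e.g. time of day), so it needs a political context term.
    {"pattern": r"\bpm\b", "concept": "HEAD_OF_GOVERNMENT",
     "requires": r"\b(government|parliament|downing)\b"},
]


def code_concepts(text):
    """Return sorted (character offset, concept) pairs for all matching rules."""
    lowered = text.lower()
    hits = []
    for rule in RULES:
        if rule["requires"] and not re.search(rule["requires"], lowered):
            continue  # disambiguating context is missing: skip this rule
        for match in re.finditer(rule["pattern"], lowered):
            hits.append((match.start(), rule["concept"]))
    return sorted(hits)


print(code_concepts("The PM addressed parliament"))
print(code_concepts("Meet me at 5 pm"))
print(code_concepts("Danmarks statsminister taler"))
```

In this sketch, expressions in different languages and styles end up under the same concept label, so later analysis steps compare concepts rather than wordings.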
To handle the various possibilities for expressing the same meaning in different styles and languages, Jamcode provides a simple, Boolean syntax that maps n-grams found within specified, disambiguating context onto an ontology of semantic constructs. Localizing references to such constructs within heterogeneous discourse texts, Jamcode then applies a windowed coding procedure that models the local and global syntactic structure of the document to determine proximal co-occurrences. While the mapping of lexical structures upon semantic entities serves to establish the equivalence of recognized contents, the windowed coding procedure enables the approximate equivalence of associations between recognized contents. In consequence, equivalent statements, frames and repertoires can be recognized across different languages and forms of discourse. Drawing upon a series of validation studies conducted within the framework of the INFOCORE project (which probes the role of media in violent conflict across political debates, news discourse, social media, and strategic communication in six conflicts, eight languages and 11 countries), the paper documents the specific gains in comparative validity, as well as the main trade-offs embedded in the respective operational choices. Highlighting critical conditions and challenges throughout the development and application of the required dictionary and tool, the presentation concludes with a set of research-pragmatic guidelines governing the need for greater or lesser effort to ensure the validity of analysis, and highlights suitable auxiliary strategies that can reduce effort while containing the costs for the analysis.
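The idea of counting only proximal co-occurrences can be sketched as below. This is a simplified stand-in, not Jamcode's actual procedure: it uses a fixed token window and ignores the syntactic structure that the tool is described as modelling. The function name, the `window` parameter, and the concept labels are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def windowed_cooccurrences(coded_tokens, window=5):
    """Count pairs of distinct concepts at most `window` tokens apart.

    coded_tokens holds one entry per token position: a concept label,
    or None where no concept was recognized.
    """
    positions = [(i, c) for i, c in enumerate(coded_tokens) if c is not None]
    counts = Counter()
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b and j - i <= window:
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Toy document (hypothetical): PM and SPEECH are close, PARLIAMENT is far away.
doc = ["PM", None, None, "SPEECH"] + [None] * 10 + ["PARLIAMENT"]

print(windowed_cooccurrences(doc, window=5))   # only PM and SPEECH are linked
print(windowed_cooccurrences(doc, window=20))  # all three pairs are linked
```

The window size is the adjustable element: a narrow window might suit long, loosely structured documents, while a window spanning the whole text might be defensible for a tweet. Which values are appropriate for which genre is an empirical question this sketch does not answer.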