Most contemporary computational approaches to text analysis suffer from severe validity problems as the heterogeneity of the analyzed discourse increases. Capitalizing on their ability to process large numbers of documents efficiently, many tools impose tacit assumptions about the homogeneity of texts and expressions, which can result in consequential biases when these assumptions are violated. For instance, in bag-of-words approaches, variability in text length tends to inflate the weight of long texts in a sample, banking on their many, relatively uninformative co-occurrences rather than the comparatively informative contents of shorter texts (e.g., tweets). Where analyses shift to the paragraph level or smaller units instead, syntactic styles and linguistic conventions recreate the same bias toward longer units, while links between adjacent units are lost. Stylistic differences and formal document conventions also bias the recognition of word use patterns, grouping documents and terms to reflect distinct wordings rather than meanings, while synonyms and circumlocutions are separated. Similarly, named entities are regularly referred to differently in different discursive settings, fragmenting their recognition. Each erroneous distinction between commensurable contents, in turn, multiplies matrix rows or network nodes, diluting the recognition of semantic patterns and cloaking linkages between them. All these challenges, finally, are compounded by the inability of most existing approaches to handle texts written in different languages. While advanced tools and dictionaries exist to map entities and anaphora, parse grammatical relations, and extract subtle patterns in a reliable fashion, these tools are notoriously hard to transfer across languages.
Unless the analyzed material can be brought into one shared language, the investigation of heterogeneous, multi-lingual discourse remains limited to a manual comparison of findings generated in structurally disjunct analyses. To avoid challenges to the validity of analysis, most applications to date have focused on rather constrained samples of discourse text – at considerable cost to the reach and theoretical relevance of the generated findings. Why, for instance, would we restrict our analysis of social media debates about an electoral race to tweets written in one titular language? In most countries, sizeable populations use additional languages, and cross-platform integration is advancing rapidly (e.g., tweets advertise longer articles and blog posts; discussion boards comment upon current Twitter activity). Likewise, what do we lose if our analysis algorithmically separates contributions using specialized terminology to discuss an issue from those using simpler language? Beyond the need to raise awareness of the often far-reaching implications of tacit homogeneity assumptions hard-coded into existing computational tools, new strategies are needed for evaluating and addressing these limitations. Especially for issues of transnational import, which are discussed by widely diverse audiences, in different languages, across different platforms, an approach is needed that can separate superficial differences in the lexical texture of discourse texts from the underlying patterns in their semantic content. Proceeding from a discussion of threats to comparative validity in computational text analysis, this paper distinguishes two main levels of heterogeneity effects, which derive from variability rooted in the language, cultural setting, document type, and general properties of natural discourse.
On the level of meaning-carrying entities referred to in a text, difficulties in handling polysemic or synonymous expressions derive from different languages (e.g., ‘الوزير الأول’/‘prime minister’/‘statsminister’) and linguistic styles (e.g., ‘the prime minister’, ‘10 Downing Street’, ‘that hag’, ‘PM’), the context-dependent need for explication (e.g., ‘Theresa’, ‘prime minister’ vs. ‘the British prime minister’), and the natural variability in the use of language itself (e.g., ‘premier’, ‘prime minister’, ‘first minister’, ‘head of government’). To ensure a valid treatment of commensurable contents, an elaborate mapping of lexical expressions onto semantic concepts is necessary. On the level of meaningful associations between concepts, again, language-specific and stylistic differences (‘May’s speech held yesterday before the House of Commons’ vs. ‘Merkels gestrige Bundestagsrede’, ‘Merkel’s speech yesterday in the Bundestag’) influence the modes available for grammatically expressing relatedness, as well as the confidence warranted when translating proximity into probable relatedness; other stylistic characteristics (e.g., of document types: tweets, essays, minutes) affect the plausibility of inferring relatedness from co-presence in the same text, and provide meta- and macrostructures that help infer meaningful association; and also the length and structuring of a document influences the chance that two entities contained within it are meaningfully related. To establish a valid, equivalent measure of association, thus, we need to consider both proximity and co-presence while adjusting for the specific characteristics of the genre, linguistic code, and type of document. Introducing Jamcode, a dictionary-based tool for comparative semantic text analysis in heterogeneous discourse, the paper finally presents different suitable strategies for overcoming the limitations raised by common homogeneity assumptions.
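The first level – mapping variable lexical expressions onto shared semantic concepts, with context-dependent disambiguation – can be illustrated with a minimal sketch. All lexicon entries, names, and the matching logic below are illustrative assumptions for exposition; they do not reproduce Jamcode's actual dictionary format or syntax.

```python
# Illustrative sketch: mapping surface forms in several languages and
# styles onto one semantic concept, with disambiguating context cues
# for ambiguous forms. Entries are hypothetical examples, not Jamcode's.

CONCEPT_LEXICON = {
    "UK_PRIME_MINISTER": {
        # unambiguous surface variants (different languages and styles)
        "variants": ["prime minister", "10 downing street",
                     "head of government", "statsminister"],
        # ambiguous forms that only count near a disambiguating cue
        "requires_context": {"theresa": ["british", "uk", "downing"]},
    },
}

def map_to_concepts(tokens):
    """Return the concepts referenced by a token sequence (simplified:
    substring matching over the whole text, no real windowing)."""
    text = " ".join(t.lower() for t in tokens)
    hits = []
    for concept, spec in CONCEPT_LEXICON.items():
        if any(v in text for v in spec["variants"]):
            hits.append(concept)
            continue
        for form, cues in spec.get("requires_context", {}).items():
            if form in text and any(c in text for c in cues):
                hits.append(concept)
                break
    return hits
```

With such a mapping, ‘the British prime minister’ and a bare ‘Theresa’ near ‘British’ resolve to the same concept, while an unqualified ‘Theresa’ in an unrelated context does not.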
To handle the various possibilities of expressing the same meaning in different styles and languages, Jamcode provides a simple, Boolean syntax that maps n-grams found within specified, disambiguating contexts onto an ontology of semantic constructs. Localizing references to such constructs within heterogeneous discourse texts, Jamcode then applies a windowed coding procedure that models the local and global syntactic structure of the document to determine proximal co-occurrences. While the mapping of lexical structures onto semantic entities serves to establish the equivalence of recognized contents, the windowed coding procedure enables the approximate equivalence of associations between recognized contents. In consequence, equivalent statements, frames, and repertoires can be recognized across different languages and forms of discourse. Drawing upon a series of validation studies conducted within the framework of the INFOCORE project (which probes the role of media in violent conflict across political debates, news discourse, social media, and strategic communication in six conflicts, eight languages, and 11 countries), the paper documents the specific gains in comparative validity, as well as the main trade-offs embedded in the respective operational choices. Highlighting critical conditions and challenges throughout the development and application of the required dictionary and tool, the presentation concludes with a set of research-pragmatic guidelines governing the need for greater or lesser effort to ensure the validity of analysis, and highlights suitable auxiliary strategies that can reduce effort while containing the costs for the analysis.
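The windowed coding idea – inferring probable association from proximity, with thresholds adjusted to the document type – can be sketched as follows. The window sizes and document types below are hypothetical parameters chosen for illustration, not Jamcode's actual settings.

```python
# Illustrative sketch: counting proximal co-occurrences of recognized
# concepts within a document-type-specific token window. A very large
# window for tweets approximates whole-text co-presence, reflecting the
# idea that co-presence in a short text is itself informative.
from collections import Counter
from itertools import combinations

WINDOW_BY_DOCTYPE = {"tweet": 10**6, "news": 15, "essay": 8}  # assumed values

def windowed_cooccurrences(mentions, doctype="news"):
    """mentions: list of (token_position, concept) pairs.
    Returns counts of unordered concept pairs whose mentions lie
    within the window associated with the document type."""
    window = WINDOW_BY_DOCTYPE.get(doctype, 10)
    pairs = Counter()
    for (p1, c1), (p2, c2) in combinations(mentions, 2):
        if c1 != c2 and abs(p2 - p1) <= window:
            pairs[tuple(sorted((c1, c2)))] += 1
    return pairs
```

Under these assumptions, two concepts 40 tokens apart would be coded as associated in a tweet but not in a news article, operationalizing the point that the same distance warrants different confidence in different genres.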
One critical function that news narratives perform is orienting action: Providing a selective, coherent account of events, they suggest what needs to be done, coordinating and motivating public agendas. The importance of news narratives’ agendas for action has been particularly salient in the coverage of conflict (Wolfsfeld 1997, Robinson et al. 2010): Conflict spawns heated debates wherein advocated courses of action collide, while audiences rely heavily on various media to comprehend ongoing events. Keeping track of the cacophony of agendas advanced in print and online newspapers and magazines, social media, and other public discourse confronts news readers, journalists, decision makers, and scholars alike with a major challenge. Computer-assisted analyses have the potential to help comprehend conflict news, distilling agendas for action and possibly predicting the mobilization of consensus and collective action (Snow and Benford 1988). This paper presents the INFOCORE consortium’s ongoing efforts at automatically capturing agendas in conflict discourse, employing NLP technology and statistical analysis. We demonstrate the utility and potential of our approach using coverage of the Syrian chemical weapons crisis in 2013.