Same, Same? Ensuring Comparative Equivalence in the Semantic Analysis of Heterogeneous, Multilingual Corpora

Baden, Christian

Citation:

Baden, C. (2017). Same, Same? Ensuring Comparative Equivalence in the Semantic Analysis of Heterogeneous, Multilingual Corpora. In ICA Annual Conference. San Diego, CA.

Abstract:

Most contemporary computational approaches to text analysis suffer from severe validity problems as the heterogeneity of analyzed discourse increases. Capitalizing on their ability to treat large numbers of documents efficiently, many tools have imposed tacit assumptions about the homogeneity of texts and expressions, which can result in consequential biases if they are violated. For instance, in bag-of-words approaches, variability in the length of texts tends to inflate the weight of long texts in a sample, banking on many, relatively uninformative co-occurrences rather than the comparatively informative contents of shorter texts (e.g., tweets). Where analyses are shifted to the paragraph level or smaller units instead, syntactic styles and linguistic conventions recreate the same bias toward longer units, while links between adjacent units are lost. Stylistic differences and formal document contents also bias the recognition of word use patterns, grouping documents and terms to reflect distinct wordings rather than meanings, while synonyms and paraphrases are separated. Similarly, named entities are regularly referred to differently in different discursive settings, fragmenting their recognition. Each erroneous distinction between commensurable contents, in turn, results in a multiplication of matrix rows or network nodes, diluting the recognition of semantic patterns and cloaking linkages between them. All these challenges, finally, are trumped by the inability of most existing approaches to handle texts written in different languages. While advanced tools and dictionaries exist to map entities and anaphora, parse grammatical relations, and extract subtle patterns in a reliable fashion, these tools are notoriously hard to transfer across languages.
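The length bias described above can be made concrete with a small sketch. It assumes a plain document-level bag-of-words model in which every pair of distinct terms in a text counts as one co-occurrence; the two example texts and the function name `pair_count` are invented for illustration and are not taken from the paper.

```python
from itertools import combinations

def pair_count(tokens):
    """Count unordered pairs of distinct terms in one document."""
    vocabulary = sorted(set(tokens))
    return sum(1 for _ in combinations(vocabulary, 2))

# Invented example texts: a short tweet and a longer news sentence.
tweet = "pm resigns after vote".split()
article = (
    "the prime minister spoke before parliament yesterday about trade "
    "policy and budget reform while critics questioned her record"
).split()

n_tweet = pair_count(tweet)      # 4 distinct terms
n_article = pair_count(article)  # 18 distinct terms

print(n_tweet, n_article)
```

Because the number of pairs grows roughly with the square of the number of distinct terms, a text about four times as long contributes far more than four times as many co-occurrences, most of them between terms that merely share a document.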
Unless the analyzed material can be brought into one shared language, the investigation of heterogeneous, multilingual discourse remains limited to a manual comparison of findings generated in structurally disjunct analyses. To avoid challenges to the validity of analysis, most applications to date have focused on rather constrained samples of discourse text – at considerable costs for the reach and theoretical relevance of generated findings. Why, for instance, would we restrict our analysis of social media debates about an electoral race to tweets written in one titular language? In most countries, sizeable populations use additional languages, and cross-platform integration is advancing rapidly (e.g., tweets advertise longer articles and blog posts; discussion boards comment upon current Twitter activity). Likewise, what do we lose if our analysis algorithmically separates contributions using specialized terminology to discuss an issue from those using simpler language? Beyond the need to raise awareness of the often far-reaching implications of tacit homogeneity assumptions hard-coded into existing computational tools, new strategies are needed for evaluating and addressing these limitations. Especially for issues of transnational import, which are discussed by widely diverse audiences, in different languages, across different platforms, an approach is needed that can separate superficial differences in the lexical texture of discourse texts from the underlying patterns in their semantic content. Starting from a discussion of threats to comparative validity in computational text analysis, this paper distinguishes two main levels of heterogeneity effects, which derive from variability rooted in the language, cultural setting, document type and general properties of natural discourse.
On the level of meaning-carrying entities referred to in a text, difficulties in handling polysemic or synonymous expressions derive from different languages (e.g., ‘الوزير الأول’/‘prime minister’/‘statsminister’) and linguistic styles (e.g., ‘the prime minister’, ‘10 Downing Street’, ‘that hag’, ‘PM’), the context-dependent need for explication (e.g., ‘Theresa’, ‘prime minister’ vs. ‘the British prime minister’), and the natural variability in the use of language itself (e.g., ‘premier’, ‘prime minister’, ‘first minister’, ‘head of government’). To ensure a valid treatment of commensurable contents, an elaborate mapping of lexical expressions onto semantic concepts is necessary. On the level of meaningful associations between concepts, again, language-specific and stylistic differences (‘May’s speech held yesterday before the House of Commons’ vs. ‘Merkels gestrige Bundestagsrede’) influence the modes available for grammatically expressing relatedness, as well as the confidence warranted when translating proximity into probable relatedness; other stylistic characteristics (e.g., of document types: tweets, essays, minutes) affect the plausibility of inferring relatedness from co-presence in the same text, and provide meta- and macrostructures that help infer meaningful association; and the length and structuring of a document also influence the chance of two entities contained within it being meaningfully related. To establish a valid, equivalent measure of association, thus, we need to consider both proximity and co-presence while adjusting for the specific characteristics of the genre, linguistic code, and type of document. Introducing Jamcode, a dictionary-based tool for comparative semantic text analysis in heterogeneous discourse, the paper finally presents different suitable strategies for overcoming the limitations raised by common homogeneity assumptions.
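The mapping of lexical expressions onto semantic concepts can be illustrated with a minimal sketch. It is not Jamcode's actual syntax or implementation: the rule format, the concept label `PRIME_MINISTER`, the context word lists, and the function `code_concepts` are all assumptions made for this example. Each rule maps an expression onto a concept, optionally only if a disambiguating context word occurs in the same text.

```python
import re

# Hypothetical rules: (concept, expression pattern, required context words or None).
# The context set disambiguates expressions such as 'PM' (prime minister vs. afternoon).
RULES = [
    ("PRIME_MINISTER", r"\bprime minister\b", None),
    ("PRIME_MINISTER", r"\bhead of government\b", None),
    ("PRIME_MINISTER", r"\bstatsminister\b", None),
    ("PRIME_MINISTER", r"\bpm\b", {"cabinet", "government", "downing"}),
]

def code_concepts(text, rules=RULES):
    """Return sorted (character_offset, concept) pairs for all matching rules."""
    lowered = text.lower()
    words = set(re.findall(r"\w+", lowered))
    hits = []
    for concept, pattern, context in rules:
        # Skip a rule whose disambiguating context is absent from the text.
        if context is not None and not (context & words):
            continue
        for match in re.finditer(pattern, lowered):
            hits.append((match.start(), concept))
    return sorted(hits)

print(code_concepts("The PM met the cabinet."))
print(code_concepts("The meeting starts at 3 pm today."))
print(code_concepts("Danmarks statsminister talte i dag."))
```

In this sketch, differently worded and differently languaged references end up under one concept label, while the ambiguous 'pm' in the second text is left uncoded because none of its context words is present.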
To handle the various ways of expressing the same meaning in different styles and languages, Jamcode provides a simple, Boolean syntax that maps n-grams found within specified, disambiguating context onto an ontology of semantic constructs. Localizing references to such constructs within heterogeneous discourse texts, Jamcode then applies a windowed coding procedure that models the local and global syntactic structure of the document to determine proximal co-occurrences. While the mapping of lexical structures onto semantic entities serves to establish the equivalence of recognized contents, the windowed coding procedure enables the approximate equivalence of associations between recognized contents. In consequence, equivalent statements, frames and repertoires can be recognized across different languages and forms of discourse. Drawing upon a series of validation studies conducted within the framework of the INFOCORE project (which probes the role of media in violent conflict across political debates, news discourse, social media, and strategic communication in six conflicts, eight languages and 11 countries), the paper documents the specific gains in comparative validity, as well as the main trade-offs embedded in the respective operational choices. Highlighting critical conditions and challenges throughout the development and application of the required dictionary and tool, the presentation concludes with a set of research-pragmatic guidelines governing the need for greater or lesser effort to ensure the validity of analysis, and highlights suitable auxiliary strategies that can reduce effort while containing the costs for the analysis.
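A windowed coding step of the kind described above might be sketched as follows. This is a deliberately simplified illustration, not Jamcode's procedure: it uses a fixed token distance per document type instead of modeling syntactic structure, and the window sizes, concept labels, and the function `proximal_cooccurrences` are assumptions made for this example.

```python
from itertools import combinations

# Hypothetical window sizes in tokens; a larger window treats more of a short
# text as one context, a smaller one demands closer proximity in long texts.
WINDOWS = {"tweet": 40, "news": 12}

def proximal_cooccurrences(coded, doc_type, windows=WINDOWS):
    """Return the set of concept pairs whose token positions lie within the window.

    coded: list of (token_position, concept) pairs for one document.
    """
    window = windows[doc_type]
    pairs = set()
    for (pos_a, a), (pos_b, b) in combinations(sorted(coded), 2):
        if a != b and abs(pos_b - pos_a) <= window:
            pairs.add(tuple(sorted((a, b))))
    return pairs

# Invented coding result: three concepts found at token positions 2, 8 and 40.
coded = [(2, "PRIME_MINISTER"), (8, "PARLIAMENT"), (40, "BREXIT")]

news_pairs = proximal_cooccurrences(coded, "news")
tweet_pairs = proximal_cooccurrences(coded, "tweet")
print(news_pairs)
print(len(tweet_pairs))
```

With the narrower news window, only the two concepts mentioned close together are linked; with the wider window, all three are. Adjusting the window to genre and document type is one simple way to keep the association measure comparable across heterogeneous texts.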

Last updated on 04/26/2017

Prof. Christian Baden

The Department of Communication and Journalism
