class: center, middle, inverse, title-slide

# rectr
## Reproducible Extraction of Cross-lingual Topics using R
### Chung-hong Chan
### 2020-05-09

---

# Multilingual data: Topic model

* Welche Schäden Covid-19 im Körper anrichtet?
* 警與狗配圖教「狗」字 太陽島:是警犬 沿用11年
* Macron veut prolonger les droits des intermittents du spectacle jusqu'à août 2021
* Bevrijdende signaal in Europese voetbalcrisis komt uit Berlijn
* 現地調査は「正しい雰囲気で」 コロナ起源巡り中国大使
* ‘사퇴 거부’ 양정숙ㆍ시민당 맞고소 진흙탕 싸움
* الشفاء من فيروس كورونا: روايات مرضى لا يستطيعون التخلص من الفيروس
* דו"ח פנימי בסין: העוינות בין בייג'ין לוושינגטון עלולה להידרדר לעימות מזוין

---

# Path of least resistance

.small[.footnote[
[1] De Vries, E., Schoonvelde, M., & Schumacher, G. (2018). No longer lost in translation: Evidence that Google Translate works for comparative bag-of-words text applications. Political Analysis, 26(4), 417–430.

[2] Reber, U. (2019). Overcoming Language Barriers: Assessing the Potential of Machine Translation and Topic Modeling for the Comparative Analysis of Multilingual Text Corpora. Communication Methods and Measures, 13(2), 102–125.
]
]

---

# Context matters

--

.pull-left[
<img src = "kater1.png" height = 300>
]

--

.pull-right[
<img src = "kater2.png" height = 300>
]

---

# Reproducibility matters

.pull-left[

]

.pull-right[

]

.center[Do you remember them?]

???

killedbygoogle.com

---

# Our proposal

.pull-left[
1. A 5-step process
2. Human validation
]

.pull-right[
<img src = "humaneval.png" width = 500>
]

In an easy-to-use <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 581 512"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> package: rectr

---

# An example

3391 articles with the keyword **Paris** from Nov 1 2015 to Dec 31 2015 from The New York Times (English), Süddeutsche Zeitung (German) and Le Figaro (French).

Data requirements:

1. character vector of content, e.g. `paris$content`
2. character vector of language codes, e.g. `paris$lang`

```r
c("fr", "de", "en")
```

---

# Step 0: Download word embeddings

.pull-left[
```r
get_ft("fr")
get_ft("de")
get_ft("en")
```
]

.pull-right[
<img src = "ft_download.webp" width = 500>
]

You can archive the downloaded word embeddings, which keeps the analysis reproducible.

---

# Step 1: Read word embeddings

```r
emb <- read_ft(c("fr", "de", "en"))
```

---

# Step 2: Create a multilingual corpus

Create a quanteda-compatible multilingual corpus

```r
paris_corpus <- create_corpus(paris$content, paris$lang)
paris_corpus
```

```
## Corpus consisting of 3,391 documents and 1 docvar.
## text1 :
## "Avec plus de 100 000 cas par an, c'est la 3e cause de mort d..."
##
## text2 :
## "LE FIGARO. - Lors des derniers Entretiens de Bichat (sur l'é..."
##
## text3 :
## "L'ancien ambassadeur en Iran, analyste de politique interna..."
##
## text4 :
## "Le président de l'agence de communication Tilder a monté l'e..."
##
## text5 :
## "Lancé voici un an par la Commission européenne, présenté par..."
##
## text6 :
## "Après 18 mois de psychodrame, le groupe américain prend enfi..."
##
## [ reached max_ndoc ... 3,385 more documents ]
```

---

# Step 3: Create document-feature matrix with word embeddings

```r
paris_dfm <- transform_dfm_boe(paris_corpus, emb)
```

```r
paris_dfm
```

```
## dfm with a dimension of 3391 x 300 and fr/en/de language(s).
```

---

# Step 4: Filter the DFM by k

`k`: number of topics

```r
paris_dfm_filtered <- filter_dfm(paris_dfm, paris_corpus, k = 5)
paris_dfm_filtered
```

```
## dfm with a dimension of 3391 x 11 and fr/en/de language(s).
## Filtered with k = 5
```

---

# Step 5: Fit a Gaussian Mixture Model

```r
paris_gmm <- calculate_gmm(paris_dfm_filtered, seed = 42)
paris_gmm
```

```
## 5-topic rectr model trained with a dfm with a dimension of 3391 x 11 and fr/en/de language(s).
## Filtered with k = 5
```

```r
dim(paris_gmm$theta)
```

```
## [1] 3391 5
```

---

# Articles with high `\(\theta_{t}\)`
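
Since `paris_gmm$theta` is a documents-by-topics matrix (3391 x 5 in Step 5), the articles with the highest value for a topic can be found by sorting one column. The following is a minimal base R sketch on a small made-up matrix standing in for `paris_gmm$theta`; the helper `top_documents()` and the toy numbers are illustrative assumptions, not part of rectr.

```r
# Hypothetical stand-in for paris_gmm$theta:
# rows are documents, columns are topics (values are made up).
theta <- matrix(
  c(0.90, 0.05, 0.05,
    0.10, 0.80, 0.10,
    0.20, 0.10, 0.70,
    0.60, 0.30, 0.10,
    0.05, 0.90, 0.05),
  ncol = 3, byrow = TRUE
)

# Illustrative helper (not a rectr function): row indices of the
# n documents with the highest theta for a given topic.
top_documents <- function(theta, topic, n = 2) {
  head(order(theta[, topic], decreasing = TRUE), n)
}

# Documents ranked highest for topic 2
top_documents(theta, topic = 2)

# Most likely topic for each document (column with the largest theta per row)
likely_topic <- apply(theta, 1, which.max)
likely_topic
```

With the fitted model, something like `paris$content[top_documents(paris_gmm$theta, topic = 1)]` might be used to read those articles, assuming the rows of `theta` follow the order of `paris$content`.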
---

# The team

* Chung-hong Chan (U Mannheim)
* Jing Zeng (UZH)
* Hartmut Wessler (U Mannheim)
* Marc Jungblut (LMU München)
* Kasper Welbers (VU Amsterdam)
* Joseph Bajjalieh (UIUC)
* Wouter van Atteveldt (VU Amsterdam)
* Scott Althaus (UIUC)

Part of the [Responsible Terrorism Coverage](https://responsibleterrorismcoverage.org/) Project

<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"/></svg> : resteco

---

background-image: url(https://media.giphy.com/media/llKJGxQ1ESmac/source.gif)
background-position: center
background-size: cover
class: hide-logo, center, hide-footer, middle

.imagelab[ICA Computational Methods Top Paper]

---

class: center, hide-logo

# Available now!
<img src ="rectr_logo.png" width = 400> <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 496 512"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg> : [chainsawriot/rectr](https://github.com/chainsawriot/rectr) | Slides: [chainsawriot.github.io/ica2020_rectr](https://chainsawriot.github.io/ica2020_rectr)