Features Extraction To Improve Comparable Tweet corpora Building

Title: Features Extraction To Improve Comparable Tweet corpora Building
Publication Type: Conference Proceedings
Year of Conference: 2016
Authors: Hajjem, M, C., L
Conference Name: Journées internationales d'Analyse statistique des Données Textuelles
Conference Location: Nice
Keywords: Tweet Clustering ; Ambiguity Estimation ; Twitter mining ; Comparable corpora construction ; Comparability

This paper deals with building comparable corpora from Twitter. In particular, we focus on evaluating
the relevance of tweets. Because the Twitter microblogging service is very popular, tweets can be
considered a new data source for comparable corpora. One possible way to build comparable
corpora from Twitter is to extract tweets that are written in two selected languages and share a
specific topic, in order to construct a multilingual corpus. However, mining relevant tweets poses
a real challenge: how can we extract only the tweets most relevant to a specific topic from the
huge number of collected tweets? To this end, we propose an unsupervised machine
learning approach that improves the quality of the collected textual data by identifying which
messages, i.e., tweets, address the specific topic. Several tweet representations are used to filter the
extracted messages. The main goal of this relevance estimation process is to improve the degree of
comparability between the extracted bilingual tweet corpora.
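
To make the idea of relevance filtering concrete, here is a minimal sketch. It is not the method of the paper, whose tweet representations and learning algorithm are not described in this abstract. It assumes a plain bag-of-words representation and keeps a tweet when its cosine similarity to a small set of topic seed terms reaches a threshold. The function names, the seed terms, the sample tweets, and the threshold value are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent a tweet as a term-frequency vector (assumed representation)."""
    # \w+ matches Unicode word characters, so accented letters are kept
    # and the '#' of a hashtag is dropped.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_relevant(tweets, topic_terms, threshold=0.2):
    """Keep tweets whose similarity to the topic seed terms reaches the threshold.

    The threshold of 0.2 is an arbitrary illustrative value, not a tuned one.
    """
    topic = Counter(term.lower() for term in topic_terms)
    return [tw for tw in tweets if cosine(bag_of_words(tw), topic) >= threshold]


if __name__ == "__main__":
    # Hypothetical tweets and seed terms, for illustration only.
    sample = [
        "Climate summit opens in Paris #climate",
        "Best pizza in town tonight",
        "New climate policy announced",
    ]
    for tweet in filter_relevant(sample, ["climate", "summit", "policy"]):
        print(tweet)
```

Applying the same filter to the tweets collected in each language, with seed terms in that language, would yield two topic-focused subsets. Whether this raises the comparability of the resulting corpora would still have to be measured separately.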