Ph.D. Thesis Description
Our main interest in this research is to improve the retrieval of Arabic web documents and to achieve better retrieval performance. Several Arabic linguistic resources exist, but Information Retrieval (IR) systems neither use nor explore them, because they are unstructured and cannot be exploited directly and effectively. We therefore devote considerable effort to converting such resources into more structured and formal representations. Moreover, text mining techniques are needed to mine these resources and extract useful knowledge for morpho-semantic query expansion or semantic indexing of documents.
This thesis is threefold. First, we work on building linguistic resources. Second, we exploit these resources in the IR process. Finally, we evaluate the linguistic factors that have a direct impact on the effectiveness and performance of IR systems.
The Arabic language is characterized by its complex structure and semantics. Interpreting an Arabic passage and understanding its meaning requires several steps of linguistic processing that take into account morphology, syntax, and then semantics. Arabic exhibits deep ambiguity: both morphological and semantic ambiguities persist despite the available NLP (Natural Language Processing) tools. For this reason, we demonstrate how morphological disambiguation of Arabic contributes to the semantic disambiguation of Arabic terms, by applying and comparing the existing analysis tools in an IR context.
Moreover, traditional search engines do not consider the semantics of the queries or of the searched documents during the retrieval process. To handle Arabic documents semantically, external semantic resources such as dictionaries, corpora, or ontologies are needed. Such resources are integrated in dedicated formal representations such as semantic spaces or graph-based representations. To handle both the morphology and the semantics of Arabic, we implement knowledge extraction techniques and text mining methods. Morphological and semantic knowledge is used either for query enrichment and expansion or for the indexing of documents, by investigating the morphological and semantic dependencies and relations between terms. Most of the approaches we experimented with have shown a significant improvement over all bag-of-words baselines.
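As an illustration only, query expansion over a graph-based representation of term relations could be sketched as follows. This is a minimal sketch, not the method used in the thesis: the relation graph, the transliterated terms, the weights, the `min_weight` threshold, and the `expand_query` function are all hypothetical and chosen for the example.

```python
# Hypothetical toy relation graph: term -> [(related term, relation type, weight)].
# Terms are transliterated Arabic; the weights are invented for illustration.
RELATIONS = {
    "kitab": [("kutub", "morphological", 0.9), ("maktaba", "semantic", 0.6)],
    "madrasa": [("madaris", "morphological", 0.9), ("talim", "semantic", 0.5)],
}


def expand_query(terms, relations, min_weight=0.55):
    """Return a weighted query: original terms plus sufficiently related terms."""
    # Original query terms always keep full weight.
    expanded = {term: 1.0 for term in terms}
    for term in terms:
        for related, _relation_type, weight in relations.get(term, []):
            # Keep only relations at or above the (assumed) threshold,
            # and never lower a weight that is already assigned.
            if weight >= min_weight:
                expanded[related] = max(expanded.get(related, 0.0), weight)
    return expanded


if __name__ == "__main__":
    print(expand_query(["kitab", "madrasa"], RELATIONS))
```

In this sketch, morphological variants and semantically related terms are added to the query with a lower weight than the original terms, so a ranking function can still favor exact matches.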