Arabic IR

Traditional information retrieval (IR) research was carried out mainly on English and was driven by the annual Text Retrieval Conferences (TREC) sponsored by NIST (the National Institute of Standards and Technology). NIST has accumulated large amounts of standard data (text collections, queries, and relevance judgments) so that IR researchers can compare their techniques on common data sets. More recently, IR researchers have taken a real interest in languages other than English. TREC now includes multilingual data, and other organizations sponsor similar annual evaluations for European languages (CLEF) and Asian languages (NTCIR: Chinese, Japanese, and Korean). Arabic has been included in the TREC cross-lingual track and in the TDT (Topic Detection and Tracking) evaluations. The availability of standard Arabic data sets from NIST and the Linguistic Data Consortium (LDC) has in turn spurred rapid progress in information retrieval and other natural language processing for Arabic. Arabic is an interesting case for IR because it is a highly inflected language. In this sub-topic, we study several problems related to IR systems (lemmatization, morphological analysis, indexing), using the Hadith corpora as a knowledge base.

Hybrid indexing tool for Arabic information retrieval system

The goal of this project is to enhance an existing hybrid indexing tool to improve its efficiency (speed, resources) and effectiveness (recall, precision, …). The JARIR group has been developing an Arabic IRS (Information Retrieval System) based on a hybrid index (Ben Guirat et al. 2016). The proposed approach builds a multilevel index whose hierarchical structure represents the semantic relations between the different word forms (root, verbal pattern and stem). Given the existing tool, this project aims to:
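As a rough illustration of the multilevel index described above, the sketch below nests postings under root, then pattern, then stem, so that a query can be answered at any of the three levels. It is not the JARIR tool: the class name `MultilevelIndex`, its methods, and the transliterated sample entries are all assumptions made for the example, and a real system would obtain root, pattern and stem from a morphological analyzer.

```python
from collections import defaultdict


class MultilevelIndex:
    """Toy three-level index: root -> pattern -> stem -> set of document ids.

    Hypothetical structure for illustration only; it does not reproduce
    the hybrid index of Ben Guirat et al. (2016).
    """

    def __init__(self):
        self._tree = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))

    def add(self, root, pattern, stem, doc_id):
        """Record that `stem` (built from `root` with `pattern`) occurs in `doc_id`."""
        self._tree[root][pattern][stem].add(doc_id)

    def lookup(self, root, pattern=None, stem=None):
        """Return document ids under `root`, optionally narrowed by pattern and stem."""
        result = set()
        # .get() avoids creating an empty entry for an unknown root.
        for pat, stems in self._tree.get(root, {}).items():
            if pattern is not None and pat != pattern:
                continue
            for st, docs in stems.items():
                if stem is not None and st != stem:
                    continue
                result |= docs
        return result


# Illustrative transliterated entries (assumed, not taken from the corpus).
index = MultilevelIndex()
index.add("ktb", "fa3ala", "kataba", 1)
index.add("ktb", "fa33ala", "kattaba", 2)
index.add("drs", "fa3ala", "darasa", 3)

print(index.lookup("ktb"))                    # all documents under the root
print(index.lookup("ktb", pattern="fa3ala"))  # narrowed to one pattern
```

Querying at the root level trades precision for recall, while narrowing to a pattern or stem does the opposite; that trade-off is one reason to keep all three levels in a single structure.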

Hadith NER

Description

This resource includes a set of gazetteer lists useful for NER (Named Entity Recognition) and Arabic text processing applications. It is composed of seven files corresponding to seven Arabic Named Entities (NE) extracted from hadith books.

Al-Bukhari NER

Description

This resource includes a set of gazetteer lists useful for NER (Named Entity Recognition) and Arabic text processing applications. It is composed of seven files corresponding to seven Arabic Named Entity (NE) types. Each file is tab-separated and provides the frequency of each NE extracted from the book Sahih Al-Bukhari.
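A tab-separated gazetteer of this kind could be loaded along the following lines. This is a minimal sketch that assumes each row holds the entity in the first column and its frequency in the second; the actual column order, the function name `load_gazetteer`, and the sample rows (including their counts) are assumptions made for illustration and are not taken from the resource.

```python
import csv
import io


def load_gazetteer(lines):
    """Parse rows of 'entity<TAB>frequency' into a dict of entity -> count.

    Assumed layout: entity in column 1, integer frequency in column 2.
    Rows that do not match (headers, blank or malformed lines) are skipped.
    """
    counts = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue
        entity, freq = row[0].strip(), row[1].strip()
        if not entity or not freq.isdigit():
            continue
        # Sum counts in case the same entity appears on several rows.
        counts[entity] = counts.get(entity, 0) + int(freq)
    return counts


# Made-up sample rows standing in for one gazetteer file.
sample = io.StringIO("مكة\t12\nالمدينة\t30\n")
gazetteer = load_gazetteer(sample)
print(sorted(gazetteer.items(), key=lambda item: item[1], reverse=True))
```

To read one of the real files, pass an open file handle instead of the in-memory sample, for example `open(path, encoding="utf-8", newline="")` with `path` pointing at the downloaded file.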

Kunuz Corpus (Alpha version)

This resource is an XMLized version of Sahih Albukhari, the most authentic hadith book. It has the following characteristics:

Selected references (Information retrieval)

This is a bibliographic list collected by Ibrahim Bounhas for Information Retrieval related fields.