Kunuz Corpus (Alpha version)



This resource is an XML-encoded version of Sahih Al-Bukhari, widely regarded as the most authentic hadith collection. It has the following characteristics:

  • At the macro-logical level, the collection is structured hierarchically into sections (see Structure.xml). From this structure, we extracted a list of domains (Domains.txt) and an index linking domains to hadiths (Domain-Index.txt).
  • At the micro-logical level, the hadiths are fully tagged; in particular, narrator names, Quranic verses, and the actual text of each hadith (metn) are marked up (see the "Micro-Documents" folder). This resource may be used for Named Entity Recognition (NER) applications and, more generally, for information extraction from Arabic documents. The main tags are summarized as follows:
Tag               Significance
<Metn> or <M>     The actual text of the hadith (المتن)
<Sanad>           Chain of narrators (السند)
<S>               Section
<TI>              Title of a section
<Q>               Quranic verse
<C>               Poetry
<R>               Narrator
<bkh>             The name of the author (i.e. Al-Bukhari)
<qwl>             Indicates that the hadith contains a saying of the Prophet (PBUH)
<pbuh>            The name of the Prophet (PBUH)
<proph>           Names of prophets
<mlk>             Names of angels
<trb>             Names of Islamic doctrines
<rel>             Names of religions and holy books
<plc>             Names of places
<wem>             Names of women
<cmtbkh>          Comments of the author (i.e. Al-Bukhari)
<cmt>             Comments (e.g. definitions)
<fnarmetn>        First narrator and metn
<poepart>         Part of a poem
<qpart>           Names of Quranic surahs or parts
<cmtcre>          Credibility comment
<qvpart>          Part of a Quranic verse

The attribute "l" of each section tag indicates its level; e.g., <S l="1"> stands for a level-one section and <S l="2"> for a sub-section.
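As a rough illustration of how the tags and the "l" attribute might be read programmatically, the following Python sketch parses a small inline sample with the standard library. The sample text, the nesting of the elements, and the helper names (section_titles, narrators, metn_texts) are assumptions made for illustration only; the actual files in the "Micro-Documents" folder may be organized differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment using tags from the table above; the real
# documents may nest these elements differently.
SAMPLE = (
    '<S l="1"><TI>Top-level section</TI>'
    '<S l="2"><TI>Sub-section</TI>'
    '<Sanad><R>Narrator A</R> from <R>Narrator B</R></Sanad>'
    '<Metn>Text of the hadith citing <Q>a verse</Q>.</Metn>'
    '</S></S>'
)

def section_titles(root):
    """Return (level, title) pairs for every <S> element, in document order."""
    return [
        (int(s.get("l", "0")), (s.findtext("TI") or "").strip())
        for s in root.iter("S")
    ]

def narrators(root):
    """Return the text of every <R> (narrator) element."""
    return [r.text.strip() for r in root.iter("R") if r.text]

def metn_texts(root):
    """Return the full text of each <Metn> or <M> element, nested tags flattened."""
    texts = []
    for tag in ("Metn", "M"):
        for m in root.iter(tag):
            texts.append("".join(m.itertext()).strip())
    return texts

root = ET.fromstring(SAMPLE)
print(section_titles(root))
print(narrators(root))
print(metn_texts(root))
```

Flattening nested tags with itertext() keeps embedded Quranic verses inside the metn text; an NER application would more likely walk the child elements to preserve the entity boundaries.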

  • We also provide full documents containing the hadiths and their explanations in the TREC XML format (Full-Documents-Trec folder).
  • The collection is designed for assessing Information Retrieval (IR) systems. We collected a set of standard topics (Queries.xml) and relevance judgments (Qrels.txt) following the standard topic development and sampling procedures of IR evaluation campaigns.
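To show how relevance judgments are typically used to score a retrieval run, here is a minimal Python sketch that computes precision at k. It assumes that Qrels.txt follows the conventional four-column TREC qrels layout (topic, iteration, document id, relevance), which this page does not confirm; the sample lines, the document ids, and the function names are hypothetical.

```python
from collections import defaultdict

def parse_qrels(lines):
    """Parse TREC-style qrels lines: 'topic iteration doc_id relevance' (assumed layout)."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, rel = parts
        qrels[topic][doc_id] = int(rel)
    return dict(qrels)

def precision_at_k(ranked_doc_ids, judged, k):
    """Fraction of the top-k ranked documents judged relevant (relevance > 0)."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if judged.get(doc_id, 0) > 0)
    return hits / k

# Hypothetical judgments and system ranking for topic "1".
sample = ["1 0 doc-12 1", "1 0 doc-40 0", "1 0 doc-7 1", "2 0 doc-3 1"]
qrels = parse_qrels(sample)
p = precision_at_k(["doc-12", "doc-40", "doc-99", "doc-7"], qrels["1"], 4)
print(p)
```

Unjudged documents (doc-99 here) are counted as non-relevant, which is the usual convention when judgments come from a sampled pool.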

The resource is freely available to the research community. Click the following links to download:

It is distributed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. When using it, you are encouraged to cite:

I. Bounhas, "On the Usage of a Classical Arabic Corpus as a Language Resource: Related Research and Key Challenges," ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 18, no. 3, Article 23, 2019.

I. Bounhas and S. Ben Guirat, "KUNUZ: a Multi-purpose Reusable Test Collection for Classical Arabic Document Engineering," in Proceedings of the 16th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA), Abu Dhabi, UAE, November 03-07, 2019.

The project aims to build a multilingual standard IR collection. Queries and Qrels (relevance judgments) will be published later.


Access conditions: