In 2009, the South African National HLT Network (NHN) funded a technology audit to form a clear profile of research and development activities in the human language technology field in South Africa. This audit served as the basis for the RMA Index, a list of South African resources with their relevant metadata (such as developer details and specifications). Some of these resources are included in the RMA Catalogue and are therefore available for download.

Collections in this community

  • Resource Catalogue [349]

    A collection of language resources available for download from the RMA of SADiLaR. The collection mostly consists of resources developed with funding from the Department of Arts and Culture.
  • Resource Index [411]

    A collection of language resource metadata, mostly collected during the NHN-funded technology audit of 2009 and the SADiLaR technology audit of 2018. Not all resources in this collection are available for download.
  • Student Data Repository [6]

    A collection of language resources made available as part of the output of post-graduate study programs.

Recent Submissions

  • African Wordnet version 1.0 

    Griesel, Marissa (UNISA, 2022-09-20)
    Developed using the expand model with Princeton WordNet 3.1 as basis. Please see https://africanwordnet.wordpress.com/ for all details on the project. ...
  • Ex Machina: Using NLP and statistical learning models to model eyewitness statements and choosing behaviour 

    Nortje, Alicia, et al. (Sadilar, 2019-05-07)
    This curated database includes data from various empirical studies where eyewitness statements and descriptions were collected. The original studies, ...
  • Autshumato English-Tshivenḓa Parallel Corpora 

    McKellar, Cindy (North-West University; Centre for Text Technology (CTexT), 2023-12-12)
    Aligned parallel corpora for the following language pair: English-Tshivenḓa. Data was crawled from various multilingual government websites, sourced ...
  • Autshumato Monolingual Tshivenḓa Corpus 

    McKellar, Cindy (North-West University; Centre for Text Technology (CTexT), 2023-12-12)
    Monolingual corpus for Tshivenḓa. The data is given as a single UTF-8 text file, with each segment on a newline.
  • Morphologically annotated corpus for isiNdebele 

    Gaustad, Tanja (Centre for Text Technology (CTexT), 2024-01-31)
    NCHLT corpus of morphologically annotated tokens in isiNdebele converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data ...
  • Morphologically annotated corpus for isiXhosa 

    Gaustad, Tanja (Centre for Text Technology (CTexT), 2024-01-31)
    NCHLT corpus of morphologically annotated tokens in isiXhosa converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is ...
  • Morphologically annotated corpus for isiZulu 

    Gaustad, Tanja (Centre for Text Technology (CTexT), 2024-01-31)
    NCHLT corpus of morphologically annotated tokens in isiZulu converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is ...
  • Morphologically annotated corpus for Siswati 

    Gaustad, Tanja (Centre for Text Technology (CTexT), 2024-01-31)
    NCHLT corpus of morphologically annotated tokens in Siswati converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is ...
  • Morphologically annotated corpus for Sesotho 

    Gaustad, Tanja (Centre for Text Technology (CTexT), 2024-01-31)
    NCHLT corpus of morphologically annotated tokens in Sesotho converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is ...
  • Morphologically annotated corpus for Sepedi 

    Gaustad, Tanja (Centre for Text Technology (CTexT), 2024-01-31)
    NCHLT corpus of morphologically annotated tokens in Sepedi converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is ...
