Digitisation guidelines

Creating digital resources and collections through digitisation is an important step in developing larger-scale data collections. Such collections give researchers and developers access to physical objects that would otherwise be difficult to reach or could be lost entirely. With this in mind, SADiLaR is developing and reusing processes and standards that will allow us to create digital collections for the South African context in a systematic and sustained fashion, both to support research activities in the field of digital humanities and to help institutions with limited resources develop digital collections.

Although SADiLaR has a vested interest in access to large digital collections of data in South African languages, the focus of the Centre is specifically on fostering access to digital resources that will support various types of research related to the use of South African languages. Accordingly, two principles governing SADiLaR's digitisation efforts are relevant to these guidelines.

Firstly, the primary aim of the Centre is to provide access to digital language resources, not to perform the archival functions typically managed and implemented by libraries. The Centre cannot, and should not, replicate the work already performed by various universities and libraries around the country. Instead, SADiLaR will focus on making use of resources that have already been developed for archival purposes, in order to facilitate research initiatives based on these digital resources. There are, however, several smaller institutions that lack the technical knowledge and capacity to perform digitisation activities. In such cases, SADiLaR will try to support these institutions with technical and other resources, including the archiving and distribution of the digital resources.

Secondly, SADiLaR will not necessarily determine the prioritisation of digitisation efforts. Instead, the Centre will try to establish a framework and process for collecting and distributing metadata about physical objects that have not yet been digitised but that researchers need in digital form. This facilitation will take the form of a finding aid: an open Language Resource Index of collections that may be digitised, through which researchers can request the digitisation of specific works.

Based on these principles, the following guidelines set out the basic information and processes required to convert an analogue source into a digital object that can be used in the creation of collections for access and research purposes.

Please see the Metadata guidelines section for more information on metadata.

Digitisation standards

According to Groenewald and Klapwijk (2010), a digital object is any self-contained unit of information in a digital format that consists of data and the procedures required to manipulate the data. The digital object contains both the electronic data, in a format suited to the source, and the metadata associated with the object. Because the size and quality of a digital object are determined by the amount of information it contains, there are different digital formats for different types of digital objects, and the choice of format depends on the purpose and scope of the digitisation effort. Since the aim of SADiLaR is not archival, the formats proposed in this guideline do not represent the highest quality digital objects, but rather those formats that allow digital access to the information contained in the original analogue version. Furthermore, SADiLaR is primarily interested in the preservation and distribution of, and access to, language data, which means that two types of data are mainly relevant to the digitisation guidelines described here, namely text data and audio data.
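The two parts of a digital object described above can be pictured as a simple record that pairs a data file with its metadata. The following Python sketch is illustrative only: the class name DigitalObject, its fields, and the example metadata keys are assumptions made for this example and are not prescribed by SADiLaR or by Groenewald and Klapwijk (2010).

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class DigitalObject:
    """Hypothetical record pairing electronic data with its metadata."""

    data_file: Path   # the electronic data, e.g. a scan or a recording
    data_type: str    # "text" or "audio", the two types covered in these guidelines
    metadata: dict = field(default_factory=dict)  # descriptive metadata (keys are illustrative)

    def is_complete(self) -> bool:
        """A digital object needs both a data file and at least some metadata."""
        return bool(self.data_file.name) and bool(self.metadata)


# Example values are invented for illustration only.
page = DigitalObject(
    data_file=Path("example_page_001.tif"),
    data_type="text",
    metadata={"title": "Example title", "language": "zu"},
)
print(page.is_complete())  # True
```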

Text data

The digitisation of text data usually involves scanning a document into an image, converting the image into machine-readable text through optical character recognition (OCR), and displaying the result in either JPEG or PDF format with the associated text. The textual information is further enhanced by markup, usually in the form of metadata as described in the following section. The recommendations for the required formats are as follows:

  • Scanning of the master text image to TIFF at 400 dpi;
  • Conversion to JPEG/PDF;
  • Optical character recognition with the relevant language model. Language models for the South African languages are available and can be adapted for particularly problematic texts; and
  • Addition of relevant metadata.
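
To give a rough sense of what a 400 dpi master scan implies for storage, the sketch below computes the pixel dimensions and uncompressed size of a single page. The A4 page size and the 24-bit colour depth are assumptions chosen for illustration; the recommendations above specify only the TIFF format and the resolution, and actual TIFF files may be smaller if compression is used.

```python
MM_PER_INCH = 25.4


def scan_dimensions(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Return the (width, height) in pixels of a page scanned at the given dpi."""
    return (round(width_mm / MM_PER_INCH * dpi), round(height_mm / MM_PER_INCH * dpi))


def uncompressed_bytes(width_px: int, height_px: int, bits_per_pixel: int = 24) -> int:
    """Return the raw image size in bytes, ignoring file headers and compression."""
    return width_px * height_px * bits_per_pixel // 8


# Assumed example: an A4 page (210 mm x 297 mm) scanned in 24-bit colour at 400 dpi.
width_px, height_px = scan_dimensions(210, 297, 400)
size = uncompressed_bytes(width_px, height_px)
print(width_px, height_px)           # 3307 4677
print(f"{size / 1_000_000:.1f} MB")  # 46.4 MB
```

Under these assumptions a few hundred pages already amount to many gigabytes of master images, which is one reason the smaller JPEG/PDF derivatives are used for access.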

Audio data

Sound is stored in various file formats, many of which are software dependent, and this software may be proprietary. It is preferable that digital audio be converted to a non-proprietary format that can be reused with different software applications. The recommended master format is lossless 24 bit / 96 kHz WAV, which will be converted to MP3 for web and research representations. The audio data must also be furnished with relevant metadata for access and discovery purposes.
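
As a minimal sketch of how the recommended WAV format could be verified, the following uses Python's standard wave module to read a file's header and compare it with the 24 bit / 96 kHz recommendation. The function names check_master_wav and bytes_per_second are hypothetical and chosen for this example; conversion to MP3 is not shown because it requires an external encoder.

```python
import wave

RECOMMENDED_SAMPLE_RATE = 96_000  # 96 kHz
RECOMMENDED_SAMPLE_WIDTH = 3      # 24 bits = 3 bytes per sample


def check_master_wav(path: str) -> list[str]:
    """Return a list of deviations from the recommended format (empty if none)."""
    problems = []
    # Note: the wave module reads uncompressed PCM only; other encodings raise wave.Error.
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != RECOMMENDED_SAMPLE_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 96000 Hz")
        if wav.getsampwidth() != RECOMMENDED_SAMPLE_WIDTH:
            problems.append(f"sample width is {wav.getsampwidth() * 8} bit, expected 24 bit")
    return problems


def bytes_per_second(channels: int) -> int:
    """Uncompressed data rate of a recording in the recommended format."""
    return RECOMMENDED_SAMPLE_RATE * RECOMMENDED_SAMPLE_WIDTH * channels


print(bytes_per_second(2))  # 576000 bytes per second for stereo
```

Calling check_master_wav on a file that matches the recommendation returns an empty list; any deviation is reported as a short message.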