The SIGNOR Corpus of SZJ

Within the SIGNOR project, a balanced and representative corpus of Slovene Sign Language (SZJ) was built. The corpus can be queried in its transcribed version, which provides an avatar demonstration of each sign. The full corpus cannot currently be published because of data protection issues; however, permissions for publication are being collected so that the recordings can be released too.

The project was funded by the Slovenian National Research Agency (ARRS), project code J6-4081, and ran from July 2011 to June 2014.

Corpus compilation

Most of the recording sessions took place on the premises of local deaf clubs; only in some cases did the mobile recording team visit the informant at home. Recordings of the pupils of the Deaf Institute Ljubljana were made on the institute's premises with the written consent of the parents and the institute director. Each recording session with an individual informant consisted of three parts:

  • spontaneous signing about basic personal information; the interviewer encourages the informant to introduce themselves and sign about their family, occupation, deafness, education etc.
  • signing after watching an elicitation video on a general topic; the informant is asked to summarize or comment on the video just seen
  • signing after watching an elicitation video on a specialised topic, or free signing about a specialised subject chosen by the informant.

At present our inventory of recordings contains data from 80 informants, which represents between 5 and 10 percent of SZJ users in Slovenia. The representativeness of our corpus seems satisfactory, but further research will show whether more data is needed for specific subgroups.

Annotation

Our annotation scheme is largely based on the experience gained by the German Sign Language (DGS) corpus project. The recordings are annotated using the iLex tool.

The SIGNOR annotation scheme includes the following annotation layers:

  • Segmentation or tokenization. Here the signing stream is split into individual signs which are stored within iLex as time intervals in a specific recording.
  • Glossing or lemmatisation. Each sign in SZJ is assigned a unique semantic tag (e.g. MAMA1), which is distinguished from others by its specific form and reference to a specific concept.
  • Mouthing. The comprehension of a specific sign often depends on the mouthing, which may articulate a certain word or its beginning, or provide some other indication of the intended meaning.
  • HamNoSys transcription. HamNoSys (Hamburg Notation System) is a special system of graphical sign notation. It uses dedicated symbols to note the form, location and movement of the hands.
  • Meaning. Each sign represents a lexeme with a single or multiple (lexicalised) meanings specified in the semantic database. The database of meanings was adapted from the Slovene WordNet, sloWNet.
  • Compositional meaning. At all previous levels the signs are annotated individually; here we annotate the meaning composed of several individual signs, for example DELATI1 + ŽENSKA1 = delavka [work + woman = female worker].
  • Segmentation into utterances. Signed text does not contain punctuation or other explicit markers of sentence boundaries, so segmentation into utterances can only be performed using a combination of structural and semantic cues.
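
The layers listed above can be pictured as a small data model. The following Python sketch is only an illustration and does not reflect how iLex stores annotations internally: all class names, field names, the recording identifier and the time values are hypothetical, and milliseconds are an assumed unit.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SignToken:
    """One segmented sign: a time interval in a specific recording."""
    recording_id: str                 # hypothetical recording identifier
    start_ms: int                     # interval start (milliseconds assumed)
    end_ms: int                       # interval end (milliseconds assumed)
    gloss: str                        # unique tag such as "MAMA1"
    mouthing: Optional[str] = None    # word or word beginning, if annotated
    hamnosys: Optional[str] = None    # HamNoSys transcription, if annotated
    meanings: List[str] = field(default_factory=list)  # senses from the semantic database


@dataclass
class CompositionalUnit:
    """A meaning composed of several individual signs."""
    tokens: List[SignToken]
    meaning: str


@dataclass
class Utterance:
    """A stretch of signing delimited by structural and semantic cues."""
    tokens: List[SignToken]
    compounds: List[CompositionalUnit] = field(default_factory=list)

    def glosses(self) -> List[str]:
        # Return the gloss sequence in signing order.
        return [token.gloss for token in self.tokens]


# Made-up intervals, for illustration only (not corpus data).
work = SignToken("rec_001", 1200, 1650, "DELATI1")
woman = SignToken("rec_001", 1650, 2100, "ŽENSKA1")
compound = CompositionalUnit([work, woman], "delavka")
utterance = Utterance([work, woman], [compound])
print(utterance.glosses())
```

Keeping the compositional meaning in a separate object, rather than on the individual tokens, mirrors the distinction drawn in the scheme between sign-level layers and layers that span several signs.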

The corpus annotation is not yet finished; we are currently working on the validation of ambiguous glosses and on segmentation into utterances.

Results

Sign frequency data was partly used for the online dictionary of SZJ that is currently being developed by the Slovenian Association of Deaf and Hard of Hearing.

An analysis of the lexical properties of SZJ has been performed; see Publications.

A search platform has been developed providing access to corpus annotations and frequency data.
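
As a rough illustration of how sign frequency data can be derived from gloss annotations, the following Python sketch counts gloss occurrences. It is not the code of the search platform; the function name `gloss_frequencies` and the sample sequence are hypothetical and serve only to show the counting step.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def gloss_frequencies(glosses: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (gloss, count) pairs, most frequent first; ties are ordered by gloss."""
    counts = Counter(glosses)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up token sequence, for illustration only (not corpus data).
sample = ["MAMA1", "DELATI1", "ŽENSKA1", "MAMA1", "DELATI1", "MAMA1"]
ranking = gloss_frequencies(sample)
for gloss, count in ranking:
    print(gloss, count)
```

Because glosses carry a numeric suffix (MAMA1), counting at the gloss level keeps different signs for the same concept apart.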

People

University of Ljubljana,
Faculty of Arts,
Department of Translation Studies

Assoc. Prof. Dr. Špela Vintar, project leader

mag. Boštjan Jerko, researcher

Marjetka Kulovec, researcher

Publications

Korpus slovenskega znakovnega jezika [Corpus of Slovene Sign Language]. Proceedings ISJT2012.

Compiling the Slovene Sign Language Corpus. Proceedings LREC2012.

Korpus in slovnica SZJ [Corpus and grammar of SZJ]. Iz sveta tišine, Vol. XXXII, No. 11, November 2011.

Prvi leksikalni podatki o slovenskem znakovnem jeziku iz korpusa Signor [First lexical data on Slovene Sign Language from the Signor corpus]. Proceedings ISJT2014.

Lexical Properties of Slovene Sign Language: A Corpus-Based Study. Sign Language Studies, 15:2, in press.