Author:
Jelena Kuvač Kraljević, Gordana Hržica
Email:
jkuvac@erf.hr; gordana.hrzica@erf.hr
Summary
Interest in spoken-language corpora has increased over the past two decades, leading to
the development of new corpora and the discovery of new facets of spoken language.
These types of corpora represent the most comprehensive data source about the language
of ordinary speakers. Such corpora are based on spontaneous, unscripted speech characterised
by a variety of styles, registers and dialects.
The aim of this paper is to present the Croatian Adult Spoken Language Corpus (HrAL),
its structure and its possible applications in different linguistic subfields. HrAL was built by
sampling spontaneous conversations among 617 speakers from all Croatian counties, and
it comprises more than 250,000 tokens and more than 100,000 types. Data were collected
during three time slots: from 2010 to 2012, from 2014 to 2015 and during 2016.
HrAL is now available within TalkBank, a large database of spoken-language corpora
covering different languages (https://talkbank.org), in the Conversational Analyses
corpora within the subsection titled Conversational Banks. Data were transcribed,
coded and segmented using the transcription format Codes for Human Analysis of
Transcripts (CHAT) and the Computerised Language Analysis (CLAN) suite of
programmes within the TalkBank toolkit. Speech streams were segmented into
communication units (C-units) based on syntactic criteria. Most transcripts were linked
to their source audio recordings. TalkBank is publicly and freely accessible, i.e. all data stored in it can be shared
by the wider community in accordance with the basic rules of TalkBank.
HrAL provides information about spoken grammar and lexicon, discourse skills, error
production and productivity in general. It may be useful for sociolinguistic research and
studies of synchronic language changes in Croatian.
Key words
Croatian Adult Spoken Language Corpus (HrAL); language sampling; spontaneous speech corpora