Transcript corpora for linguistic exploration and research
Corpora.li currently hosts two searchable speech corpora. A corpus is a structured collection of texts that lets researchers and learners explore real language use. The tools below support research into two specific varieties of English: contemporary Singapore English speech and contemporary English legal speech.
The YouTube Corpus of Singapore English Podcasts (YCSEP) contains approximately 8.4 million words (620 hours of speech) from public podcasts from Singapore. This web app allows you to search transcripts, sort results, and listen to and download aligned audio segments.
Visit YCSEP
A static version of the corpus is available at doi.org/10.7910/DVN/B7JRID.
For more information, please see Coats, Steven, Alessandro Basile, Cameron Morin, and Robert Fuchs. (Forthcoming). The YouTube Corpus of Singapore English Podcasts. English World-Wide.
The Corpus of Recorded Investigative, Media, and Evidence-based Proceedings (C.R.I.M.E.) is a searchable database of more than 133 million words (19,377 hours) of investigative interviews, courtroom proceedings, and "true crime" content. Search transcripts, play corresponding audio in context, and download the transcripts and audio.
Visit C.R.I.M.E.
A static version is available at doi.org/10.7910/DVN/MLMB6E.
For more information, please see Coats, Steven, and Dana Roemling. (Forthcoming). CRIME: The Corpus of Recorded Investigative, Media, and Evidence-based Proceedings.
The resources hosted at corpora.li contain material that may be protected by copyright. They are made available for purposes such as research, scholarship, and teaching according to the provisions of Directive (EU) 2019/790 on copyright and related rights in the Digital Single Market, as well as under the "Fair Use" doctrine of US copyright law (17 U.S.C. § 107).
The creation of corpora.li, YCSEP, and CRIME has received support from the European Union – NextGenerationEU instrument and funding from the Research Council of Finland under grant number 358727, to dariah.fi.