CHLT Corpora

Members of the Center for Human Language Technology (CHLT) have access to the following speech and text corpora:

Catalog ID  Description                                                Format
LDC94T4A    UN Parallel Text (Complete)                                3 DVD
LDC95T7     Penn Treebank, Release 2                                   download
LDC96L16    CALLHOME Spanish Lexicon                                   download
LDC96L17    CALLHOME Japanese Lexicon                                  download
LDC96S35    CALLHOME Spanish Speech                                    1 CD
LDC96T17    CALLHOME Spanish Transcripts                               download
LDC99T41    Spanish Newswire Text, Volume 2                            1 CD
LDC99L22    Egyptian Colloquial Arabic Lexicon                         download
LDC2002L49  Buckwalter Arabic Morphological Analyzer Version 1.0       download
LDC2003T10  Syntactically Annotated Idioms Dictionary                  download
LDC2005S25  Santa Barbara Corpus of Spoken American English            1 DVD
LDC2005S26  CSLU: 22 Languages Corpus                                  2 DVD
LDC2005T01  Chinese Treebank 5.0                                       download
LDC2005T06  Chinese News Translation Text Part 1                       download
LDC2005T10  Chinese English News Magazine Parallel Text                1 CD
LDC2005T12  English Gigaword Second Edition                            2 DVD
LDC2005T13  CCGbank                                                    download
LDC2005T14  Chinese Gigaword Second Edition                            1 DVD
LDC2005T23  Chinese Proposition Bank 1.0                               download
LDC2005T28  HARD 2004 Text                                             1 DVD
LDC2005T33  BBN Pronoun Coreference and Entity Type Corpus             online
LDC2005T35  ANC Second Release                                         2 DVD
LDC2006S34  Russian through Switched Telephone Network (RuSTeN)        1 DVD
LDC2006S42  Korean Broadcast News Speech                               1 DVD
LDC2006T04  Multiple Translation Chinese (MTC) Part 4                  download
LDC2006T12  Spanish Gigaword First Edition                             1 DVD
LDC2006T13  Web 1T 5-gram Version 1                                    6 DVD
LDC2006T17  French Gigaword First Edition                              1 DVD
LDC2007S08  CSLU: Foreign Accented English Release 1.2                 1 DVD
LDC2007S15  Nationwide Speech Project                                  1 DVD
LDC2007T02  English Chinese Translation Treebank v 1.0                 download
LDC2007T09  ISI Chinese-English Automatically Extracted Parallel Text  download
LDC2007T40  Arabic Gigaword Third Edition                              1 DVD
LDC2008S03  STC-TIMIT 1.0                                              1 DVD
LDC2008S05  2005 NIST Language Recognition Evaluation                  1 DVD
N/A         British National Corpus - XML Edition                      2 DVD