CHLT Corpora
Members of the Center for Human Language Technology have access to the following speech and text corpora:
| Catalog ID | Description | Format |
|---|---|---|
| LDC94T4A | UN Parallel Text (Complete) | 3 DVD |
| LDC95T7 | Penn Treebank, Release 2 | download |
| LDC96L16 | CALLHOME Spanish Lexicon | download |
| LDC96L17 | CALLHOME Japanese Lexicon | download |
| LDC96S35 | CALLHOME Spanish Speech | 1 CD |
| LDC96T17 | CALLHOME Spanish Transcripts | download |
| LDC99T41 | Spanish Newswire Text, Volume 2 | 1 CD |
| LDC99L22 | Egyptian Colloquial Arabic Lexicon | download |
| LDC2002L49 | Buckwalter Arabic Morphological Analyzer Version 1.0 | download |
| LDC2003T10 | Syntactically Annotated Idioms Dictionary | download |
| LDC2005S25 | Santa Barbara Corpus of Spoken American English | 1 DVD |
| LDC2005S26 | CSLU: 22 Languages Corpus | 2 DVD |
| LDC2005T01 | Chinese Treebank 5.0 | download |
| LDC2005T06 | Chinese News Translation Text Part 1 | download |
| LDC2005T10 | Chinese English News Magazine Parallel Text | 1 CD |
| LDC2005T12 | English Gigaword Second Edition | 2 DVD |
| LDC2005T13 | CCGbank | download |
| LDC2005T14 | Chinese Gigaword Second Edition | 1 DVD |
| LDC2005T23 | Chinese Proposition Bank 1.0 | download |
| LDC2005T28 | HARD 2004 Text | 1 DVD |
| LDC2005T33 | BBN Pronoun Coreference and Entity Type Corpus | online |
| LDC2005T35 | ANC Second Release | 2 DVD |
| LDC2006S34 | Russian through Switched Telephone Network (RuSTeN) | 1 DVD |
| LDC2006S42 | Korean Broadcast News Speech | 1 DVD |
| LDC2006T04 | Multiple Translation Chinese (MTC) Part 4 | download |
| LDC2006T12 | Spanish Gigaword First Edition | 1 DVD |
| LDC2006T13 | Web 1T 5-gram Version 1 | 6 DVD |
| LDC2006T17 | French Gigaword First Edition | 1 DVD |
| LDC2007S08 | CSLU: Foreign Accented English Release 1.2 | 1 DVD |
| LDC2007S15 | Nationwide Speech Project | 1 DVD |
| LDC2007T02 | English Chinese Translation Treebank v 1.0 | download |
| LDC2007T09 | ISI Chinese-English Automatically Extracted Parallel Text | download |
| LDC2007T40 | Arabic Gigaword Third Edition | 1 DVD |
| LDC2008S03 | STC-TIMIT 1.0 | 1 DVD |
| LDC2008S05 | 2005 NIST Language Recognition Evaluation | 1 DVD |
| ELRA-S0004 | BDLEX | 1 DVD |
| N/A | British National Corpus - XML Edition | 2 DVD |
