Linguistics Tools

Many corpora come with tools for extracting, searching, or otherwise manipulating the data in the corpus. Please check the "About" page and/or the corpus index page for information about tools specific to the corpus of interest to you.


  

Tools (8)
Analysis Tools (2)
IMS Open Corpus Workbench
The IMS Open Corpus Workbench (CWB) is a collection of open-source tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations. Its central component is the flexible and efficient query processor CQP.

Official CQP demos:
The official demos are hosted by the Computational Corpus Linguistics Group at FAU Erlangen-Nürnberg, Germany, and use the sample encoded corpora available on the CWB SourceForge site.
  • DICKENS (English, 3.4M tokens)

    A collection of novels by Charles Dickens used as the main example corpus in the CQP Query Language Tutorial.

  • BUNDESTAG (German, 5.7M tokens)

    Debates of the German parliament (1994–1998) with rich morphosyntactic annotation and shallow parsing. Suitable as a substitute for the smaller GLAW-NEW corpus of law texts in the CQP Query Language Tutorial.

  • EUROPARL (6 languages, ca. 40M tokens each)

    Web GUI for the annotated Europarl Corpus, Version 3, containing debates of the European Parliament from the years 1996–2006 (currently, only six languages are included in the GUI). This interface also supports the simplified CEQL syntax, aligned context display, and word lists with automatic generation of translation candidates. The Europarl corpus will be used by future editions of the CQP Query Language Tutorial to introduce query and display options for aligned corpora.


Instructions: Corpus Workbench is available for download from SourceForge, along with support packages, the web GUI (CQPweb), and sample encoded corpora. The project site also provides documentation, and a YouTube channel with 27 tutorial videos (as of April 2018) is available.
TranscriberAG
TranscriberAG is a tool for assisting the manual annotation of speech signals. It provides a user-friendly graphical interface for segmenting long-duration speech recordings, transcribing them, and labeling speech turns, topic changes, and acoustic conditions.

The original Transcriber was developed with the scripting language Tcl/Tk and C extensions. It relies on the Snack sound extension, which allows support for most common audio formats, and on the tcLex lexer generator. TranscriberAG, by contrast, is developed in C++ using the GTK+ library for the GUI and AGlib for annotation file management. It runs on multiple platforms (Windows XP, Mac OS X, and Linux).
Instructions: Visit the TranscriberAG site for download and installation instructions.
Compression Tools (1)
7-Zip
Corpora consist in large part of data files, many of which are compressed to save space. You may need to uncompress these files before using them. 7-Zip unpacks zip, gzip, bzip2, tar, and rar files, and provides several other features. It is available under the GNU LGPL. Mac, Linux, and Unix users do not require a separate tool, since compression and decompression tools are built into the operating system.
Instructions: Visit the 7-Zip site for download and installation instructions.
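
For users who prefer to script the decompression step, the following is a minimal sketch in Python using only the standard library. It is not part of 7-Zip; the function name `unpack` and the suffix-based handling of single-file gzip and bzip2 files are assumptions made for illustration. Rar files are not covered, since the Python standard library has no rar support.

```python
import bz2
import gzip
import shutil
import tarfile
import zipfile
from pathlib import Path


def unpack(archive: Path, dest: Path) -> None:
    """Unpack a zip, tar, gzip, or bzip2 file into the directory dest.

    Hypothetical helper for illustration; only unpack archives you trust.
    """
    dest.mkdir(parents=True, exist_ok=True)
    name = archive.name.lower()

    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
    elif tarfile.is_tarfile(archive):
        # tarfile.open detects gzip/bzip2-compressed tarballs on its own.
        with tarfile.open(archive) as tf:
            tf.extractall(dest)
    elif name.endswith(".gz"):
        # Single compressed file: drop the ".gz" suffix for the output name.
        with gzip.open(archive, "rb") as src, open(dest / archive.name[:-3], "wb") as out:
            shutil.copyfileobj(src, out)
    elif name.endswith(".bz2"):
        # Single compressed file: drop the ".bz2" suffix for the output name.
        with bz2.open(archive, "rb") as src, open(dest / archive.name[:-4], "wb") as out:
            shutil.copyfileobj(src, out)
    else:
        raise ValueError(f"unsupported archive type: {archive}")
```

A call such as `unpack(Path("corpus.tar.gz"), Path("corpus_data"))` would extract into `corpus_data`; both paths here are placeholders.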
Concordance Tools (1)
AntConc
AntConc is a freeware concordance program for Windows (Windows Vista, 7, and 8), Mac OS X (10.4 through 10.8), and Linux (tested on Ubuntu 10).
Instructions: Visit the AntConc site for download and access to several other related tools.
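
To illustrate what a concordance program produces, here is a minimal keyword-in-context (KWIC) sketch in Python. It does not reflect how AntConc is implemented; the function name `kwic`, the simple `\w+` tokenization, and the `width` parameter are assumptions made for illustration.

```python
import re
from typing import List, Tuple


def kwic(text: str, keyword: str, width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, match, right context) for each hit of keyword.

    Matching is case-insensitive; context is measured in tokens, not characters.
    """
    tokens = re.findall(r"\w+", text)  # crude tokenizer: runs of word characters
    target = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits
```

Each returned tuple corresponds to one concordance line, with the keyword in the middle and up to `width` tokens of context on either side.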
Resource Lists (2)
Statistical natural language processing and corpus-based computational linguistics resources
Stanford University provides an annotated list of resources with numerous tools (some free, some downloadable) as well as links to other resources.
University Centre for Computer Corpus Research on Language Tools
The University Centre for Computer Corpus Research on Language at Lancaster University (UK) provides a list of tools developed there that are available for use.
SPHERE Tools (1)
SPHERE
SPHERE is an audio format developed by the National Institute of Standards and Technology (NIST). It is used in many LDC corpora, including Fisher English Training, HCRC Map Task, and CALLHOME. Visit NIST's page for SPHERE and related tools or LDC tools for converting files from SPHERE to other audio formats.
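
A SPHERE file begins with a plain-text header: the line `NIST_1A`, a line giving the header size in bytes (commonly 1024), then one `name -type value` line per field, ending with `end_head`. The sketch below reads that header in Python. It is an illustration only, not a replacement for the NIST or LDC tools; the function name `read_sphere_header` is hypothetical, and the code does not decode the audio samples that follow the header.

```python
from typing import Dict, Union

HeaderValue = Union[int, float, str]


def read_sphere_header(path) -> Dict[str, HeaderValue]:
    """Parse the text header of a NIST SPHERE file into a dictionary."""
    with open(path, "rb") as f:
        if f.readline().strip() != b"NIST_1A":
            raise ValueError("not a SPHERE file (missing NIST_1A marker)")
        header_size = int(f.readline().strip())  # total header length in bytes
        f.seek(0)
        raw = f.read(header_size).decode("ascii", errors="replace")

    fields: Dict[str, HeaderValue] = {}
    for line in raw.splitlines()[2:]:  # skip the marker and size lines
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)    # integer field
        elif ftype == "-r":
            fields[name] = float(value)  # real-valued field
        else:
            fields[name] = value         # string field, written as -sN
    return fields
```

Fields such as `sample_rate`, `channel_count`, and `sample_coding` in the returned dictionary indicate how the audio data is stored and therefore which conversion options are needed.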
Transcription Tools (1)
ELAN
ELAN is a professional tool for the creation of complex annotations on video and audio resources.
Instructions: ELAN is free to use and runs on Mac OS X, Windows 7, Windows 8, and Linux. The software and instructions are available for download at http://tla.mpi.nl/tools/tla-tools/elan/download/