From Wikipedia, the free encyclopedia
Corpus linguistics is the study of language as expressed in
samples (corpora) of "real world" text. This method represents a
digestive approach, deriving from observed text a set of abstract
rules by which a natural language is governed or by which it
relates to another language. Originally compiled by hand, corpora
are now largely derived by an automated process, whose output is
then corrected. The core of a corpus is the derivation of a set
of part-of-speech tags, representing a formal overview of the
various types of words and word relationships in a given
language.
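As a rough illustration of what a part-of-speech-tagged corpus
looks like in practice, the following Python sketch parses a tiny
hand-tagged sample in a word/TAG format and counts how often each
tag occurs. The sample sentences, the tag labels (DET, NOUN, VERB,
ADJ, ADP) and the function names are assumptions made for this
example; they are not taken from any particular corpus or tag set.

```python
from collections import Counter

# A tiny hand-tagged sample; real corpora hold millions of tokens.
# The word/TAG format and the tag labels are illustrative assumptions.
TAGGED_SAMPLE = """
the/DET dog/NOUN chased/VERB the/DET cat/NOUN
a/DET small/ADJ cat/NOUN sat/VERB on/ADP the/DET mat/NOUN
"""


def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the last slash, so a word that
        # itself contains a slash is still handled.
        word, _, tag = token.rpartition("/")
        pairs.append((word.lower(), tag))
    return pairs


def tag_frequencies(pairs):
    """Count how many tokens carry each part-of-speech tag."""
    return Counter(tag for _, tag in pairs)


def words_with_tag(pairs, tag):
    """Count the word forms that occur with a given tag."""
    return Counter(word for word, t in pairs if t == tag)


pairs = parse_tagged(TAGGED_SAMPLE)
print(tag_frequencies(pairs).most_common())
print(words_with_tag(pairs, "NOUN").most_common())
```

The same counting approach scales to a full corpus by reading the
tagged text from files instead of an inline string; automated
taggers produce the tags, which are then corrected as described
above.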
Computational methods were once viewed as a holy grail of
linguistic research, one that would ultimately yield a rule set
for natural language processing and machine translation at a high
level. This has not been the case, and since the cognitive
revolution, cognitive linguistics has been largely critical of
many claimed practical uses for corpora. However, as
computational capacity and speed have increased, the use of
corpora to study language and term relationships en masse has
gained some respectability.
The corpus approach runs counter to Noam Chomsky's view that real
language is riddled with performance-related errors, thus
requiring careful analysis of small speech samples obtained in a
highly controlled laboratory setting. Corpus linguistics does
away with Chomsky's competence/performance split; adherents
believe that reliable language analysis is best carried out on
field-collected samples, in natural contexts and with minimal
experimental interference.[citation needed]
History
A landmark in modern corpus linguistics was the publication by
Henry Kucera and Nelson Francis of Computational Analysis of
Present-Day American English in 1967, a work based on the
analysis of the Brown Corpus, a carefully compiled selection of
then-current American English totalling about a million words
drawn from a wide variety of sources. Kucera and Francis
subjected it to a variety of computational analyses, from which
they compiled a rich and variegated opus combining elements of
linguistics, language teaching, psychology, statistics, and
sociology. A further key publication was Randolph Quirk's
"Towards a description of English Usage" (1960, Transactions of
the Philological Society, 40-61), in which he introduced the
Survey of English Usage.
Shortly thereafter, the Boston publisher Houghton-Mifflin
approached Kucera to supply a million-word, three-line citation
base for its new American Heritage Dictionary, the first
dictionary to be compiled using corpus linguistics. The AHD took
the innovative step of combining prescriptive elements (how
language should be used) with descriptive information (how it
actually is used).
Other publishers followed suit. The British publisher Collins'
COBUILD dictionaries, designed for users learning English as a
foreign language, were compiled using the Bank of English.
The Brown Corpus has also spawned a number of similarly
structured corpora: the LOB Corpus (1960s British English),
Kolhapur (Indian English), Wellington (New Zealand English), ACE
(Australian English), the Frown Corpus (early 1990s American
English), and the FLOB Corpus (1990s British English). Other
corpora represent many languages, varieties and modes. They
include the British National Corpus, a 100-million-word
collection of a range of spoken and written texts, created in the
1990s by a consortium of publishers, universities (Oxford and
Lancaster) and the British Library. A project is under way to
create an American National Corpus.
References
Journals
There are several international peer-reviewed journals dedicated
to corpus linguistics, for example Corpora, Corpus Linguistics
and Linguistic Theory, ICAME Journal and the International
Journal of Corpus Linguistics.
Book Series
Book series in this field include Language and Computers, Studies
in Corpus Linguistics and English Corpus Linguistics.
Other
- Biber, Douglas, Susan Conrad and Randi Reppen. Corpus
  Linguistics: Investigating Language Structure and Use.
  Cambridge: Cambridge UP, 1998. ISBN 0-521-49957-7
See also
- Concordance (KWIC)
- Collocation
- Collostructional analysis
- Linguistic Data Consortium
- Keyword (linguistics)
- Machine translation
- Search engines: they access the "web corpus".
- Semantic prosody
- Text corpus
- Translation memory
External links
- Bookmarks for Corpus-based Linguists: very comprehensive site
  with categorized and annotated links to language corpora,
  software, references, etc.
- Corpora discussion list
- Manuel Barbera's overview site
- Przemek Kaszubski's list of references
- Corpus4u Community
- McEnery and Wilson's Corpus Linguistics Page
- Research and Development Unit for English Studies
- The Centre for Corpus Linguistics at Birmingham University
- Gateway to Corpus Linguistics on the Internet: an annotated
  guide to corpus resources on the web
- Biomedical corpora
- Linguistic Data Consortium, currently the premier distributor
  of corpora
- Stefan Th. Gries's Corpus Linguistics with R list