Digital Corpora and Databases:
New Horizons in Slavic Linguistics

Dr. Elisabeth Seitz, University of Tuebingen, Germany
(elisabeth.seitz@uni-tuebingen.de)


This paper was presented as a lecture at the invitation of the Department of Slavonic Languages and Literatures at the University of Ljubljana, Slovenia, on 19 March 1998.


TABLE OF CONTENTS

1. Corpora & Databases - Introduction
2. Corpora on Local Machines vs. Corpora on the WWW
2.1 Why Use Machine-Readable Corpora?
2.2 Corpus Encoding
2.3 Network Serving of Texts
3. Corpora: Annotated or Unannotated?
3.1 Part-of-Speech Annotation
3.2 Lemmatisation
3.3 Parsing
3.4 Semantics
3.5 Discoursal and text linguistic annotation
3.6 Phonetic transcription
3.7 Prosody
3.8 Problem-oriented tagging
4. Local Query & Concordance Software
5. Slavic Text Corpora on the WWW
6. Bibliography

1. Corpora & Databases - Introduction

Definition of a corpus

(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2defin.htm)

The concept of carrying out research on written or spoken texts is not restricted to corpus linguistics. Indeed, individual texts are often used for many kinds of literary and linguistic analysis - the stylistic analysis of a poem, or a conversation analysis of a tv talk show. However, the notion of a corpus as the basis for a form of empirical linguistics is different from the examination of single texts in several fundamental ways.

In principle, any collection of more than one text can be called a corpus (corpus being Latin for "body", hence a corpus is any body of text). But the term "corpus", when used in the context of modern linguistics, tends most frequently to have more specific connotations than this simple definition. The following list describes the four main characteristics of the modern corpus.

Why use a corpus? (Ball 1997)

Linguistics: to study linguistic competence or performance as revealed in naturally-occurring data. Most applications will require or lead to the creation of annotated text.

Diachronic linguistics: texts are all we have; introspection worthless; better to analyze a systematic collection of data than to reuse/reanalyze others' examples.

Computational linguistics: to train/test a natural language processing system on a representative sample of the kinds of texts the system is expected to process; to build large lexicons in a given domain ...

Applied linguistics:
  First/second language acquisition research: supplement/replace elicitation, as in 'Linguistics' above
  Language teaching/learning: language for specific purposes (e.g. use newspaper corpora, corpora of scientific texts); to prepare vocabulary lists based on high-frequency lexical items; to prepare CLOZE tests; to answer ad hoc learner questions ('What's the difference between few and a few?'); to discover facts about language

Taxonomies of corpora (Ball 1997)

A. by medium: printed, electronic text, digitized speech, video (e.g. for ASL), mixed

B. by design method: balanced, pyramidal, opportunistic

C. language variables:

D. language states: synchronic vs. diachronic (e.g. Brown vs. Helsinki Diachronic corpus)

E. Plain vs. annotated

perfectly plain: e.g. Project Gutenberg texts, produced by scanning; no information about text (usually, not even edition)
marked up for formatting attributes: e.g. page breaks, paragraphs, font sizes, italics, etc.
annotated with identifying information, e.g. edition date, author, genre, register, etc.
annotated for part of speech, syntactic structure, discourse information, etc.

2. Corpora on Local Machines vs. Corpora on the WWW

Why Use Machine-Readable Corpora?

Local corpora: stored on a machine without a connection to the WWW, accessible only locally
Online corpora: located on a WWW server, accessible from outside

An example of a corpus not accessible on the Web is the Uppsala Corpus, a Russian language corpus that must be bought from the institution that created it and downloaded via ftp.

Corpus Encoding

The guidelines of the Text Encoding Initiative (TEI) (http://etext.virginia.edu/TEI.html) use the Standard Generalized Markup Language (SGML) to create a single coherent framework (the TEI DTD, or Document Type Definition), which can be used to encode almost any scholarly textual resource in a manner that is hardware-, software-, and application-independent.

TEI Lite (http://www.uic.edu/orgs/tei/lite//) is a simplified version of the Text Encoding Initiative (TEI) Guidelines, which are addressed to anyone who wants to interchange information stored in an electronic form.

As explained in the TEI Lite introduction, the TEI Guidelines provide a means of making explicit certain features of a text in such a way as to aid the processing of that text by computer programs running on different machines. This process of making explicit we call markup or encoding. Any textual representation on a computer uses some form of markup; the TEI came into being partly because of the enormous variety of mutually incomprehensible encoding schemes currently besetting scholarship, and partly because of the expanding range of scholarly uses now being identified for texts in electronic form.

Structure Markup in TEI Lite for Slovene Translation of Plato's Republic (TELRI Newsletter No. 5)

The Slovene translation of Plato's 'Republic' was keyed in by the Ljubljana 2 site (Institute for Slovene Language "Fran Ramovš", Slovene Academy for Sciences and Arts, Ljubljana, Slovenia) in the text editor Eva. This version served as the starting point for encoding the document in TEI Lite conformant markup. The up-translation was begun at the summer Nancy workshop and finished in Ljubljana. In total, the process took about three days. The up-translation was accomplished partly via search and replace operations and macros in the editor Emacs, and partly via small Perl programs.

The TEI Guidelines use the Standard Generalized Markup Language (SGML) to define their encoding scheme. SGML is an international standard (ISO 8879), used increasingly throughout the information processing industries, which makes possible a formal definition of an encoding scheme, in terms of elements and attributes, and rules governing their appearance within a text. In selecting from the several hundred SGML elements defined by the full TEI scheme, a useful 'starter set' has been identified comprising the elements which almost every user should know about.

Non-Roman Character and Glyph Presentation

One of SGML's features as an encoding meta-language is that it specifies a device-independent character encoding mechanism. SGML entities represent non-ASCII characters in the SGML source file, and the viewing application software substitutes the correct presentational glyph. The Transactions contains non-English terms and symbols which are not part of the standard ASCII character set. This component of the project will be of even greater importance when working on the Russian texts.
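The substitution step can be illustrated with a minimal Python sketch. The three entity names follow the ISO Latin 2 entity set; the table ENTITY_TO_CHAR, the function name and the sample string are assumptions made for illustration and are not taken from any of the projects described here.

```python
import re

# Illustrative subset of SGML character entities (assumed table, not a full set).
ENTITY_TO_CHAR = {
    "ccaron": "č",
    "scaron": "š",
    "zcaron": "ž",
}

# An entity reference is written &name; in the source file.
ENTITY_RE = re.compile(r"&([A-Za-z0-9]+);")


def resolve_entities(text: str) -> str:
    """Replace known entity references; leave unknown ones untouched."""
    def substitute(match):
        return ENTITY_TO_CHAR.get(match.group(1), match.group(0))
    return ENTITY_RE.sub(substitute, text)


print(resolve_entities("Fran Ramov&scaron;"))
```

A real viewer would load the complete entity sets declared in the DTD rather than a hand-written table.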

Network Serving of Texts

The choice of SGML encoding allows for great flexibility in providing texts for network delivery. Because it is an encoding that separates presentation and formatting information from structure and content information, display devices of different kinds can be supported more easily. SGML's structured nature allows for the transmission of document fragments rather than the entire text. The tagging also allows for more focused information retrieval applications on the texts.

3. Corpora: Annotated or Unannotated?

A GLANCE AT TAGGING SCHEMA

The following diagram attempts to depict the various approaches to automatic POS tagging. In reality, the picture is much more complicated, since many tagging systems use aspects of some or all of these approaches (Van Guilder 1995):

Annotated vs. Unannotated Corpora:
(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2encode.htm)

If a corpus is said to be unannotated, it appears in its existing raw state of plain text, whereas an annotated corpus has been enhanced with various types of linguistic information. Unsurprisingly, the utility of the corpus is increased when it has been annotated, making it no longer a body of text where linguistic information is implicitly present, but one which may be considered a repository of linguistic information. The implicit information has been made explicit through the process of concrete annotation.

For example, the form "gives" contains the implicit part-of-speech information "third person singular present tense verb" but it is only retrieved in normal reading by recourse to our pre-existing knowledge of the grammar of English. However, in an annotated corpus the form "gives" might appear as "gives_VVZ", with the code VVZ indicating that it is a third person singular present tense (Z) form of a lexical verb (VV). Such annotation makes it quicker and easier to retrieve and analyse information about the language contained in the corpus.

Leech's Maxims of Annotation
(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2maxims.htm)

  1. It should be possible to remove the annotation from an annotated corpus in order to revert to the raw corpus. At times this can be a simple process - for example removing every character after an underscore e.g. "Claire_NP1 collects_VVZ shoes_NN2" would become "Claire collects shoes". However, the prosodic annotation of the London-Lund corpus is interspersed within words - for example "g/oing" indicates a rising pitch on the first syllable of the word "going", meaning that the original words cannot be so easily reconstructed.
  2. It should be possible to extract the annotations by themselves from the text. This is the flip side of maxim 1. Taking points 1 and 2 together, the annotated corpus should allow the maximum flexibility for manipulation by the user.
  3. The annotation scheme should be based on guidelines which are available to the end user. Most corpora have a manual which contains full details of the annotation scheme and guidelines issued to the annotators. This enables the user to understand fully what each instance of annotation represents without resorting to guesswork, and to understand in cases of ambiguity why a particular annotation decision was made at that point. You might want to briefly look at an example of the guidelines for part-of-speech annotation of the BNC, although this page has restricted access.
  4. It should be made clear how and by whom the annotation was carried out. A corpus may be annotated manually, either by a single person or by a number of different people; alternatively the annotation may be carried out automatically by a computer program whose output may or may not be corrected by human beings.
  5. The end user should be made aware that the corpus annotation is not infallible, but simply a potentially useful tool. Any act of corpus annotation is, by definition, also an act of interpretation, either of the structure of the text or of its content.
  6. Annotation schemes should be based as far as possible on widely agreed and theory-neutral principles. For example, parsed corpora often adopt a basic context-free phrase structure grammar rather than implementing a narrower specific grammatical theory such as Chomsky's Principles and Parameters framework.
  7. No annotation scheme has the a priori right to be considered as a standard. Standards emerge through practical consensus.
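Maxims 1 and 2 can be made concrete with a short Python sketch for the simple word_TAG format quoted in maxim 1. The function names are illustrative, and the sketch assumes that tags never contain an underscore; it does not cover interspersed annotation of the London-Lund kind.

```python
def split_token(token):
    """Split 'word_TAG' at the last underscore; untagged tokens get tag None."""
    word, separator, tag = token.rpartition("_")
    return (word, tag) if separator else (token, None)


def strip_annotation(tagged_text):
    """Maxim 1: revert to the raw corpus by dropping every tag."""
    return " ".join(split_token(t)[0] for t in tagged_text.split())


def extract_annotation(tagged_text):
    """Maxim 2: keep only the tags, in text order."""
    tags = (split_token(t)[1] for t in tagged_text.split())
    return [tag for tag in tags if tag is not None]


sample = "Claire_NP1 collects_VVZ shoes_NN2"
print(strip_annotation(sample))
print(extract_annotation(sample))
```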

One of the first distinctions which can be made among POS taggers is in terms of the degree of automation of the training and tagging process. The terms commonly applied to this distinction are supervised vs. unsupervised. Supervised taggers typically rely on pre-tagged corpora to serve as the basis for creating any tools to be used throughout the tagging process, for example: the tagger dictionary, the word/tag frequencies, the tag sequence probabilities and/or the rule set.

Unsupervised models, on the other hand, are those which do not require a pretagged corpus but instead use sophisticated computational methods to automatically induce word groupings (i.e. tag sets) and based on those automatic groupings, to either calculate the probabilistic information needed by stochastic taggers or to induce the context rules needed by rule-based systems. Each of these approaches has pros and cons. (Van Guilder 1995)

The following table outlines the differences between these two approaches (Van Guilder 1995):

SUPERVISED                                                        | UNSUPERVISED
selection of tagset/tagged corpus                                 | induction of tagset using untagged training data
creation of dictionaries using tagged corpus                      | induction of dictionary using training data
calculation of disambiguation tools; may include:                 | induction of disambiguation tools; may include:
  word frequencies                                                |   word frequencies
  affix frequencies                                               |   affix frequencies
  tag sequence probabilities                                      |   tag sequence probabilities
  "formulaic" expressions                                         |   -
tagging of test data using dictionary information                 | tagging of test data using induced dictionaries
disambiguation using statistical, hybrid or rule-based approaches | disambiguation using statistical, hybrid or rule-based approaches
calculation of tagger accuracy                                    | calculation of tagger accuracy
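The supervised column can be sketched in a few lines of Python. This is a deliberately simple baseline (each word receives the tag it carried most often in training), not the method of any particular tagger; the two training sentences, the test sentence and the default tag are invented for illustration, with tags borrowed from the COBUILD list.

```python
from collections import Counter, defaultdict

# Invented pre-tagged training corpus: lists of (word, tag) pairs.
TRAIN = [
    [("the", "AT"), ("old", "JJ"), ("dog", "NN"), ("barked", "VBD")],
    [("the", "AT"), ("dogs", "NNS"), ("barked", "VBD"), ("loudly", "RB")],
]
# Invented hand-tagged test sentence used to measure accuracy.
GOLD = [("the", "AT"), ("old", "JJ"), ("dogs", "NNS"), ("slept", "VBD")]


def train(tagged_sentences):
    """Create the tagger dictionary: word -> frequencies of its tags."""
    lexicon = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            lexicon[word.lower()][tag] += 1
    return dict(lexicon)


def tag_words(words, lexicon, default_tag="NN"):
    """Give each word its most frequent training tag; unknown words get the default."""
    result = []
    for word in words:
        counts = lexicon.get(word.lower())
        result.append((word, counts.most_common(1)[0][0] if counts else default_tag))
    return result


def accuracy(predicted, gold):
    """Share of tokens whose predicted tag equals the gold tag."""
    correct = sum(1 for (_, p), (_, g) in zip(predicted, gold) if p == g)
    return correct / len(gold)


lexicon = train(TRAIN)
predicted = tag_words([word for word, _ in GOLD], lexicon)
print(predicted, accuracy(predicted, GOLD))
```

The unknown word "slept" falls back to the default tag and is therefore mistagged, which is the kind of case that affix frequencies and tag sequence probabilities are meant to handle.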

Types of Annotation

(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2types.htm)

  1. Part of Speech annotation
  2. Lemmatisation
  3. Parsing
  4. Semantics
  5. Discoursal and text linguistic annotation
  6. Phonetic transcription
  7. Prosody
  8. Problem-oriented tagging

1. Part-of-Speech Annotation

The general purpose of a part-of-speech tagger is to associate each word in a text with its morphosyntactic category (represented by a tag), as in the following example:

 This+PRON is+VAUX_3SG a+DET sentence+NOUN_SG .+SENT  

An Example from the Spoken English Corpus with C7 tagset: (source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2posex.htm)

Perdita&NN1-NP0; ,&PUN; covering&VVG; the&AT0; bottom&NN1; of&PRF; the&AT0; lorries&NN2; with&PRP; straw&NN1; to&TO0; protect&VVI; the&AT0; ponies&NN2; '&POS; feet&NN2; ,&PUN; suddenly&AV0; heard&VVD-VVN; Alejandro&NN1-NP0; shouting&VVG; that&CJT; she&PNP; better&AV0; dig&VVB; out&AVP; a&AT0; pair&NN0; of&PRF; clean&AJ0; breeches&NN2; and&CJC; polish&VVB; her&DPS; boots&NN2; ,&PUN; as&CJS; she&PNP; 'd&VM0; be&VBI; playing&VVG; in&PRP; the&AT0; match&NN1; that&DT0; afternoon&NN1; .&PUN;

The codes used are:

All the tags here contain three characters. Tags have been attached to words by the use of TEI entity references delimited by & and ;. Some of the words (such as heard) have two tags assigned to them. These are known as portmanteau tags and have been assigned to help the end user in cases where there is a strong chance that the computer might otherwise have selected the wrong part of speech from the choices available to it (this corpus has not been corrected by hand).
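A minimal Python sketch of how such entity-style tags might be read back, and how portmanteau tags can be singled out. The regular expression and the function names are assumptions fitted to the excerpt above, not part of any official corpus tool.

```python
import re

# word&TAG; or word&TAG1-TAG2; (the second form is a portmanteau tag)
TOKEN_RE = re.compile(r"(\S+?)&([A-Z0-9]+(?:-[A-Z0-9]+)?);")


def parse_tagged(text):
    """Return (word, [tags]) pairs; a portmanteau tag yields two tags."""
    return [(word, tag.split("-")) for word, tag in TOKEN_RE.findall(text)]


def portmanteau_words(text):
    """Words for which the tagger kept two candidate tags."""
    return [word for word, tags in parse_tagged(text) if len(tags) > 1]


sample = "suddenly&AV0; heard&VVD-VVN; Alejandro&NN1-NP0; shouting&VVG;"
print(parse_tagged(sample))
print(portmanteau_words(sample))
```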

COBUILD Part-of-speech tags (source: http://titania.cobuild.collins.co.uk/form.html)

The corpus has been tagged automatically with a statistical tagger. You can specify a search on word/TAG combinations by appending an oblique stroke and a part-of-speech tag. POS tags must be in uppercase. Here are some major POS tags:

NOUN a macro tag: stands for any noun tag
VERB a macro tag: stands for any verb tag
NN common noun
NNS noun plural
JJ adjective
AT definite and indefinite article
RB adverb
VB base-form verb
VBN past participle verb
VBG -ing form verb
VBD past tense verb

Morphosyntactic Tagging of 16th Century Slovene texts:

(TRU_NT77) Glih taku ta (ATTR1.1:)Turska (SUB1.2:/5:>4:)vera doli iemle inu she bo stuprou naprei manshi perhaiala. Sakai ty eni (2SUB1.6:_4:)Bashi (UNDsub)inu Turki nih (AKKOBJ1.2:)Otroke skriuaie kerszhuio. Inu ty eni (ATTR-1.1:)-mladi (SUB1.21)Turki (ATTR1.1:)vogerski desheli (taku sta meni pred enim leitom dua (SUB 2.6:>12)-Studenta is (ATTR1.1:)Ardelske deshele, is Sybenburga, poueidala) se vshulah ta (AKKOBJ1.1:) Cate-hi-sem vuzhe, Inu ty (pronSUB1.5:/8:)eni ozhitu to (AKKOBJ1.2:)vero (ATTR2.1:)-Kerszhansko sposnaio, inu se puste od Turkou, skamene- (2INF2.4:_6:)possuti (UNDverb)inu vmoriti.

(TRU_DP) (pronAKKOBJ1.5:)Tiga, praui (pronSUB2.1:)on , (SUB1.1:>1:)MoiSes nei dal, (pronSUB1.4:)on le to (AKKOBJ1.1:)PoStauo oSnanuie, Ampag ta Syn (ATTR2.1:)Boshy IeSus CriStus, ta (HTpronSUB1.2:>27/ 29/-37/-40/44)iSti Sam ie no tu (2ATTR1.2:_1:)prauu nebesku (4AKKOBJ1.22/24_21/23_17/19_15/13_8:/6:) blagu, odpuSzhane (ATTR1.1:)vSeh Grehou, to Prauizo (UNDsub)inu brumo, (HTSUB1.3:)kir pred Bugo— vela, inu (ATTR1.1:)vSe tu, (pronAKKOBJ1.2:)kar (HTSU=PR)moramo (INF2.1:)imeiti, aku (HTSU=PR)hozhmo vnebeSSa (INF2.2)priti, dobil (UNDverb)inu Saslushil per Bugi, Inu tu (pronAKKOBJ1.3:/6:)iStu (ATTR2.2:)vSe no— ponuie (UNDverb)inu naprei daie vtim Euangeliu, inu (SU=PR)vely no— vSem (pronAKKOBJ1.1:)tu (INF2.4:)vSeti Sto Vero od nega SabSton.

AKKOBJ = Accusative (direct) Object
ATTR = adjectival or participial Attribute - Noun
INF = Infinitive Construction
pron = pronominal
SUB = Subject - Predicate
2SUB = double Subject
UNDsub = double construction x1 AND x2 with x = noun
UNDverb = double construction x1 AND x2 with x = verb
.1, .2, .3 etc. = x is 1/ 2/ 3... graphic units away from y
1. = x stands before y
2. = x stands after y
/ = parallel syntactic construction x/y1-x/y2
_ = parallel syntactic construction x1_y-x2_y
> = non-finite verb form > finite


The result of quantitative Analysis of Subject-Predicate-Frames can be presented as follows:

2. Lemmatisation

(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2lemma.htm)

Lemmatisation is a procedure of finding the lemma (i.e. textual representation of the basic form and its grammatical categorisation) and the grammatical categorisation of the given, analysed form. The textual form is treated unilaterally, as a string of letters between delimiters (Szafran 1997, 437). It is closely allied to the identification of parts-of-speech and involves the reduction of the words in a corpus to their respective lexemes. Lemmatisation allows the researcher to extract and examine all the variants of a particular lexeme without having to input all the possible variants, and to produce frequency and distribution information for the lexeme. See the example below - the fourth column contains the lemmatised words:

N12:0510g - PPHS1m	He	he
N12:0510h - VVDv	studied	study
N12:0510i - AT		the	the
N12:0510j - NN1c	problem	problem
N12:0510k - IF		for	for
N12:0510m - DD221	a	a
N12:0510n - DD222	few	few
N12:0510p - NNT2	seconds	second
N12:0520a - CC		and	and
N12:0520b - VVDv	thought	think
N12:0520c - IO		of	of
N12:0520d - AT1		a	a
N12:0520e - NNc		means	means
N12:0520f - IIb		by	by
N12:0520g - DDQr	which	which
N12:0520h - PPH1	it	it
N12:0520i - VMd		might	may
N12:0520j - VB0		be	be
N12:0520k - VVNt	solved	solve
N12:0520m - YF		+.	-
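As a rough illustration of how the lemma column supports frequency counts and the retrieval of variants, the following Python sketch reads a few rows copied from the example above. The column layout (reference, dash, tag, word, lemma) is inferred from the excerpt, and the function names are illustrative.

```python
from collections import Counter, defaultdict

# A few rows copied from the example; whitespace-separated fields.
SAMPLE = """\
N12:0510h - VVDv studied study
N12:0510m - DD221 a a
N12:0510p - NNT2 seconds second
N12:0520b - VVDv thought think
N12:0520d - AT1 a a
N12:0520i - VMd might may
"""


def read_rows(text):
    """Return (reference, tag, word, lemma) tuples for well-formed lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 5:
            reference, _dash, tag, word, lemma = fields
            rows.append((reference, tag, word, lemma))
    return rows


def lemma_frequencies(rows):
    """How often each lexeme occurs, whatever its surface form."""
    return Counter(lemma for _, _, _, lemma in rows)


def forms_by_lemma(rows):
    """All surface variants attested for each lexeme."""
    forms = defaultdict(set)
    for _, _, word, lemma in rows:
        forms[lemma].add(word)
    return dict(forms)


rows = read_rows(SAMPLE)
print(lemma_frequencies(rows).most_common(3))
print(forms_by_lemma(rows)["think"])
```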

Slavic example: Tokarski's Index (Szafran 1997, 437)

Tokarski's Index is an a tergo sorted list of all possible endings of words in Polish together with endings of the corresponding base-form and their grammatical categorisation. Both endings are considered here strictly unilaterally as strings of characters only. The list consists of ca. 18,000 entries. The Index is based on Słownik Języka Polskiego PAN edited by W. Doroszewski (cf. Doroszewski (1959-1969)). There are four types of entries but we exclude here one of the groups as not useful for our purposes. Entries in that group collect information only from other groups of entries, and show this information in a shorter and more compact way.

The vast majority of entries in the Index are basic entries. These entries consist of four columns. This is illustrated in (1):

(1)	-ac	mII N	pajac, kac, plac, pac, materac, palac (10)
 	-ac	ż II lG	-aca	plac, mac, prac, tac (10)

Each entry contains the ending of the described forms (first column), the grammatical categorisation of the basic form and of the described form (second column, before and after the space respectively), the ending of the related basic forms (third column), and examples (fourth column).

Another group of entries are called referential entries, which have a different format:

(2)  .ich 	A 10	⇒	.im	A 4	srogich, pruskich, zadnich, twoich, czyich (3,300)
(Szafran 1997, 438)

Instead of introducing a group of basic entries for describing forms we can often use a referential entry. It points to another group of entries in the Index. Finally, a third, very small, group of entries are called the special entries. One such entry is given in (3):

(3) Ceś [2 formy]	C + )eś	jakeś, osiołeś, którymeś, winieneś, jużeś

They describe a case - occurring relatively often in Polish - of compound forms consisting of two parts.

Dealing with Homonyms (Szafran 1997, 439)

A homonym is a word with the same textual shape (spelling) as another word but with a different meaning and origin. Let us consider the word białka. According to the described procedure, in the successive steps of our algorithm we consider the forms: białka, iałka, ałka, łka, ka, a. For the forms białka, iałka, ałka there are no basic entries in the Index. For the form łka, according to the Index (cf. entries in lines 41-47 on the 58th page of the Index), we have the following output for the basic forms (leaving aside the grammatical categorisation of the analysed form for the moment):

(4)		(białka, ndm)	-	indeclinable
		(białek, mIII)	-	substantive, masculine
		(białko, nII)	-	substantive, neuter
		(białka, ż III)	-	substantive, feminine
		(białka, blp)	-	substantive, plurale tantum
		(białki, A)	-	adjective
		(białkać, I)	-	verb

Finally, for the forms ka and a we have again no basic entries. Potentially all of the obtained basic forms may be the correct Polish lexemes. Without additional information it is impossible to decide whether they belong to contemporary Polish or not.
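The successive-ending procedure can be sketched in Python as follows. TOY_INDEX is a hypothetical stand-in for the Index holding only the entries needed for this one example, transcribed from (4) on the assumption that the Polish diacritics are as restored here; the real Index has ca. 18,000 entries and a richer format.

```python
def endings(form):
    """All endings of a form, longest first: białka, iałka, ..., a."""
    return [form[i:] for i in range(len(form))]


# Hypothetical miniature index: ending of the analysed form ->
# list of (ending of the basic form, category of the basic form).
TOY_INDEX = {
    "łka": [
        ("łka", "ndm"), ("łek", "mIII"), ("łko", "nII"), ("łka", "ż III"),
        ("łka", "blp"), ("łki", "A"), ("łkać", "I"),
    ],
}


def candidate_lemmas(form, index):
    """Try every ending in turn and build the basic forms the index allows."""
    candidates = []
    for ending in endings(form):
        stem = form[: len(form) - len(ending)]
        for base_ending, category in index.get(ending, []):
            candidates.append((stem + base_ending, category))
    return candidates


print(endings("białka"))
for lemma, category in candidate_lemmas("białka", TOY_INDEX):
    print(lemma, category)
```

As the text notes, the lookup only proposes candidates; deciding which of them are real Polish lexemes needs additional information, for instance a lexicon check.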

3. Parsing

(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2parse.htm)

Parsing involves the procedure of bringing basic morphosyntactic categories into high-level syntactic relationships with one another. This is probably the most commonly encountered form of corpus annotation after part-of-speech tagging. Parsed corpora are sometimes known as treebanks. This term alludes to the tree diagrams or "phrase markers" used in parsing. Probably the best-known treebank is the Penn Treebank (http://www.ldc.upenn.edu/ldc/online/treebank/index.html):

[Figure: PENN Treebank]

The BabelSystem creates Tree-Diagrams of German Sentences

Example 1: Babel Query (Babel Online: http://www.dfki.de/~stefan/Babel/Interaktiv/):

[Figure: Babel Query]

Example 2: Babel Query Result:

[Figure: Babel Query Result]

Sometimes the bracket-based annotations are displayed with indentations so that they resemble the properties of a tree diagram (a system used by the Penn Treebank project). For instance:

[S
     [NP Claudia NP]
     [VP sat
            [PP on
                  [NP a stool NP]
            PP]
      VP]
S]
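The labelled bracketing above can be turned back into a tree with a small stack-based reader. This Python sketch assumes exactly the convention shown (an opening token such as [NP and a matching closing token NP]); other treebank formats differ, and the function names are illustrative.

```python
def parse_brackets(text):
    """Parse '[S [NP Claudia NP] ... S]' into nested (label, children) tuples."""
    root = ("ROOT", [])
    stack = [root]
    for token in text.split():
        if token.startswith("["):
            node = (token[1:], [])          # open a new constituent
            stack[-1][1].append(node)
            stack.append(node)
        elif token.endswith("]"):
            if len(stack) == 1:
                raise ValueError("closing bracket without opening: " + token)
            node = stack.pop()              # close the innermost constituent
            if node[0] != token[:-1]:
                raise ValueError("mismatched labels: " + node[0] + " / " + token)
        else:
            stack[-1][1].append(token)      # an ordinary word
    if len(stack) != 1:
        raise ValueError("unclosed constituent: " + stack[-1][0])
    return root[1][0]


def leaves(node):
    """Recover the raw words from a parsed tree, left to right."""
    words = []
    for child in node[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


tree = parse_brackets("[S [NP Claudia NP] [VP sat [PP on [NP a stool NP] PP] VP] S]")
print(tree)
print(leaves(tree))
```

Because the reader splits on whitespace, it accepts the indented layout as well as the one-line form.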

Because automatic parsing (via computer programs) has a lower success rate than part-of-speech annotation, it is often either post-edited by human analysts or carried out by hand (although possibly with the help of parsing software). The disadvantage of manual parsing, however, is inconsistency, especially where more than one person is parsing or editing the corpus, which can often be the case on large projects. The solution is more detailed guidelines, but even then ambiguities can occur where more than one interpretation is possible.

4. Semantics

(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2semant.htm)

Two types of semantic annotation can be identified:

1. The marking of semantic relationships between items in the text, for example the agents or patients of particular actions. This has scarcely begun to be widely accepted at the time of writing, although some forms of parsing capture much of its import.

2. The marking of semantic features of words in the text, essentially the annotation of word senses in one form or another. This has quite a long history, dating back to the 1960s.

There is no universal agreement about which semantic features ought to be annotated - in fact, in the past much of the annotation was motivated by social scientific theories of, for instance, social interaction. However, Sedelow and Sedelow (1969) made use of Roget's Thesaurus, in which words are organised into general semantic categories.

The example below (Wilson, forthcoming) is intended to give the reader an idea of the types of categories used in semantic tagging:

And		00000000
the		00000000
soldiers	23241000
platted		21072000
a		00000000
crown		21110400
of		00000000
thorns		13010000
and		00000000
put		21072000
it		00000000
on		00000000
his		00000000
head		21030000
and		00000000
they		00000000
put		21072000
on		00000000
him		00000000
a		00000000
purple		31241100
robe		21110321

The numeric codes stand for:

00000000        Low content word (and, the, a, of, on, his, they etc)
13010000        Plant life in general
21030000        Body and body parts
21072000        Object-oriented physical activity (e.g. put)
21110321        Men's clothing: outer clothing
21110400        Headgear
23231000        War and conflict: general
31241100        Colour

The semantic categories are represented by 8-digit numbers - the one above is based on that used by Schmidt (1993) and has a hierarchical structure, in that it is made up of three top level categories, which are themselves subdivided, and so on.
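A short Python sketch shows how such codes might be used in practice: looking up the category of each word and filtering out low content words. The lookup table copies entries from the key above; the function name and the four-token sample are illustrative, and no assumption is made about how the eight digits divide into hierarchy levels.

```python
# Category key copied from the list above (subset).
CATEGORY = {
    "00000000": "Low content word",
    "13010000": "Plant life in general",
    "21030000": "Body and body parts",
    "21072000": "Object-oriented physical activity",
    "21110321": "Men's clothing: outer clothing",
    "21110400": "Headgear",
    "31241100": "Colour",
}

LOW_CONTENT = "00000000"

# Four tokens taken from the tagged example.
TAGGED = [("a", "00000000"), ("crown", "21110400"),
          ("of", "00000000"), ("thorns", "13010000")]


def content_words(tagged):
    """Keep words with a real semantic tag and spell out their category."""
    return [(word, CATEGORY.get(code, "unknown code"))
            for word, code in tagged if code != LOW_CONTENT]


print(content_words(TAGGED))
```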

5. Discoursal and Text Linguistic Annotation

(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2discour.htm)

Aspects of language at the levels of text and discourse are among the least frequently annotated features in corpora. However, occasionally such annotations are applied.

Discourse tags

Stenström (1984) annotated the London-Lund spoken corpus with 16 "discourse tags". They included categories such as:

       "apologies"	e.g. sorry, excuse me 
       "greetings"	e.g. hello 
       "hedges"		e.g. kind of, sort of thing 
       "politeness"	e.g. please 
       "responses"	e.g. really, that's right 

Despite their potential role in the analysis of discourse these kinds of annotation have never become widely used, possibly because the linguistic categories are context-dependent and their identification in texts is a greater source of dispute than other forms of linguistic phenomena.

Anaphoric annotation

Cohesion is the vehicle by which elements in text are linked together, through the use of pronouns, repetition, substitution and other devices. Halliday and Hasan's "Cohesion in English" (1976) was considered to be a turning point in linguistics, as it was the most influential account of cohesion. Anaphoric annotation is the marking of pronoun reference - our pronoun system can only be realised and understood by reference to large amounts of empirical data, in other words, corpora.

Anaphoric annotation can only be carried out by human analysts, since one of the aims of the annotation is to train computer programs with this data to carry out the task. There are only a few instances of corpora which have been anaphorically annotated; one of these is the Lancaster/IBM anaphoric treebank, an example of which is given below:

A039 1 v (1 [N Local_JJ atheists_NN2 N] 1) [V want_VV0 (2 [N the_AT (9 Charlotte_N1 9) Police_NN2 Department_NNJ N] 2) [Ti to_TO get_VV0 rid_VVN of_IO [N 3

The above text has been part-of-speech tagged and skeleton parsed, as well as anaphorically annotated. The following codes explain the annotation:

6. Phonetic Transcription

(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2phonet.htm)

Spoken language corpora can also be transcribed using a form of phonetic transcription. Not many examples of publicly available phonetically transcribed corpora exist at the time of writing. This is possibly because phonetic transcription is a form of annotation which needs to be carried out by humans rather than computers. Such humans have to be well skilled in the perception and transcription of speech sounds. Phonetic transcription is therefore a very time-consuming task.

Another problem is that phonetic transcription works on the assumption that the speech signal can be divided into single, clearly demarcated "sounds", while in fact, these "sounds" do not have such clear boundaries, therefore what phonetic transcription takes to be the same sound, might be different according to context.

Nevertheless, phonetically transcribed corpora are extremely useful to the linguist who lacks the technological tools and expertise for the laboratory analysis of recorded speech. One such example is the MARSEC corpus, which is derived from the Lancaster/IBM Spoken English Corpus and has been manipulated by the Universities of Lancaster and Leeds. The MARSEC corpus will include a phonetic transcription.

7. Prosody

(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2prosody.htm)

Prosody refers to all aspects of the sound system above the level of segmental sounds e.g. stress, intonation and rhythm. The annotations in prosodically annotated corpora typically follow widely accepted descriptive frameworks for prosody such as that of O'Connor and Arnold (1961). Usually, only the most prominent intonations are annotated, rather than the intonation of every syllable. The example below is taken from the London-Lund corpus:


1 8 14 1470 1 1 A 11 ^what a_bout a cigar\ette# .	/
1 8 15 1480 1 1 A 20 *((4 sylls))*	/
1 8 14 1490 1 1 B 11 *I ^w\on't have one th/anks#* - - -	/
1 8 14 1500 1 1 A 11 ^aren't you .going to sit d/own# -	/
1 8 14 1510 1 1 B 11 ^[/\m]# -	/
1 8 14 1520 1 1 A 11 ^have my _coffee in p=eace# - - -	/
1 8 14 1530 1 1 B 11 ^quite a nice .room to !s\it in ((actually))#	/
1 8 14 1540 1 1 B 11 *^\isn't* it#	/
1 5 15 1550 1 1 A 11 *^y/\es#* - - - 	/

The codes used in this example are:

# end of tone group
^ onset
/ rising nuclear tone
\ falling nuclear tone
/\ rise-fall nuclear tone
_ level nuclear tone
[ ] enclose partial words and phonetic symbols
. normal stress
! booster: higher pitch than preceding prominent syllable
= booster: continuance
(( )) unclear
* * simultaneous speech
- pause of one stress unit

Problems of Prosodic Corpora

  1. Judgements are inherently of an impressionistic nature. For example, the level of a tone movement is a difficult matter to agree upon. Some listeners may perceive a fall in pitch, while others may perceive a slight rise after the fall.
  2. Consistency is difficult to maintain, especially if more than one person annotates the corpus. (This can be alleviated to some degree by having two people both annotate a small part of the corpus.)
  3. Recoverability is difficult (see Leech's 1st Maxim) since prosodic features are carried by syllables rather than whole words - annotations appear within the words themselves making it difficult for software to retrieve the raw corpus.
  4. Sometimes special graphics characters are used to indicate prosodic phenomena. However, not all computers and printers can handle such characters. TEI guidelines for text encoding will hopefully alleviate these difficulties.
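The recoverability problem (point 3) can be illustrated with a Python sketch that tries to strip the prosodic marks from the text field of the London-Lund example. The set of marks is taken from the code list above; the sketch assumes that these characters never occur as ordinary text, which is true of the sample lines but would, for instance, also delete genuine full stops in other material.

```python
import re

# Marks that occur inside or next to words: ^ / \ # = ! _ * [ ] ( ) .
MARKS = re.compile(r"[\^/\\#=!_*\[\]().]")


def strip_prosody(text):
    """Remove prosodic marks and free-standing pause dashes from a text field."""
    cleaned = MARKS.sub("", text)
    words = [token for token in cleaned.split() if token != "-"]
    return " ".join(words)


SAMPLES = [
    r"^what a_bout a cigar\ette# .",
    r"*I ^w\on't have one th/anks#* - - -",
    r"^quite a nice .room to !s\it in ((actually))#",
]
for line in SAMPLES:
    print(strip_prosody(line))
```

Even on this small sample the result is only an approximation of the raw text: an unclear stretch such as ((4 sylls)) would come out as "4 sylls", which is not something the speaker said.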

8. Problem-Oriented Tagging

(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2problem.htm)

Problem-oriented tagging (as described by de Haan (1984)) is the phenomenon whereby users will take a corpus, either already annotated, or unannotated, and add to it their own form of annotation, oriented particularly towards their own research goal. This differs in two ways from the other types of annotation we have examined in this session.

1. It is not exhaustive. Not every word (or sentence) is tagged - only those which are directly relevant to the research. This is something which problem-oriented tagging has in common with anaphoric annotation.

2. Annotation schemes are selected, not for broad coverage and theory-neutrality, but for the relevance of the distinctions which they make to the specific questions that the researcher wishes to ask of his/her data.
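
Both properties can be seen in a minimal sketch. It takes the learner question quoted in the introduction ("What's the difference between few and a few?") as the research goal and tags only the occurrences of "few", leaving every other token untagged. The tag names A_FEW and FEW and the function name are hypothetical, chosen for this illustration only.

```python
def tag_few(tokens):
    """Tag only the tokens relevant to the 'few' vs. 'a few' question.

    Returns (token, tag) pairs; the tag is None for every token that is
    irrelevant to the research question (non-exhaustive annotation).
    """
    tagged = []
    for i, token in enumerate(tokens):
        if token.lower() == "few":
            previous = tokens[i - 1].lower() if i > 0 else ""
            # Hypothetical tags: the only distinction made is the one
            # the research question needs.
            tagged.append((token, "A_FEW" if previous == "a" else "FEW"))
        else:
            tagged.append((token, None))
    return tagged


for token, tag in tag_few("He has a few friends but few enemies".split()):
    print(token if tag is None else f"{token}_{tag}")
```

The scheme is useless for general part-of-speech work, but it separates exactly the two uses the researcher wants to compare.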

4. Local Query & Concordance Software

4.1. Gofer

Example 1: Boolean search option for the terms "Slouen*", "Sloven*", "Crain*"...:

[Screenshot: Gofer Query]

Example 2: Search result of the above query:

[Screenshot: Gofer Result]
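
The kind of query shown in Example 1 can be approximated with a short sketch. It does not reproduce Gofer's own matching; it only illustrates a Boolean OR over truncated search terms, using Python's standard fnmatch module. The function name and the three sample lines are invented for this illustration.

```python
from fnmatch import fnmatchcase


def boolean_or_search(lines, patterns):
    """Return (line number, line) for every line in which at least one
    word matches at least one of the wildcard patterns."""
    hits = []
    for number, line in enumerate(lines, start=1):
        # Strip common punctuation so that "Crain," still matches "Crain*".
        words = [word.strip('.,;:!?()"') for word in line.split()]
        if any(fnmatchcase(word, pattern) for word in words for pattern in patterns):
            hits.append((number, line))
    return hits


# Invented sample lines, not taken from any corpus.
sample = [
    "Slovenes and Croats",
    "a letter from Crain,",
    "nothing relevant here",
]
for number, line in boolean_or_search(sample, ["Slouen*", "Sloven*", "Crain*"]):
    print(number, line)
```

Matching is case-sensitive here; a historical corpus with unstable capitalisation would probably need case folding before the comparison.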

4.2. Monolingual Concordances:

4.2.1 CONC

Example 1: P. Trubar, Ena Dolga Predguvor (1557), word concordance, Include: all words

[Screenshot: Tru DP Conc word concordance]

Example 2: Same text base as in (1), word concordance, Include: all words containing "ei"

[Screenshot: Tru DP Conc]
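
What a word concordance with an "Include" filter does can be sketched in a few lines. This is not CONC's implementation, only a minimal keyword-in-context routine. The function name, the width parameter and the English sample sentence are assumptions made for the illustration.

```python
def kwic(tokens, include, width=3):
    """Build keyword-in-context rows.

    `include` is a function deciding which tokens become keywords;
    `width` is the number of context tokens kept on each side.
    """
    rows = []
    for i, token in enumerate(tokens):
        if include(token):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


# Invented sample sentence; the filter mirrors "all words containing 'ei'".
tokens = "the weight of the veil".split()
for left, keyword, right in kwic(tokens, lambda word: "ei" in word, width=2):
    print(f"{left:>10} | {keyword} | {right}")
```

Passing `lambda word: True` as the filter gives the "all words" concordance of Example 1.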

Example 3: Concordance plus frequency index (Truber, Dolga Predguvor):

[Screenshot: Tru DP Conc]

Example 4: Concordance plus frequency index (Truber, Dolga Predguvor, POS-tagged), Include "SUB":

[Screenshot: Tru DP Conc SUB + Index]

Example 5: Concordance plus frequency index (Truber, Dolga Predguvor, POS-tagged), Include "UND":

[Screenshot: Tru DP Conc UND + Index]
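
A frequency index restricted to one tag, as in Example 4, can be sketched as follows. The representation of the tagged text as (word, tag) pairs, the function name and the sample data are assumptions for this illustration; only the tag "SUB" is taken from the example above, and "VRB" is a made-up second tag.

```python
from collections import Counter


def frequency_index(tagged_tokens, tag):
    """Count the word forms carrying `tag`, most frequent first.

    Word forms are lower-cased; ties are broken alphabetically.
    """
    counts = Counter(word.lower() for word, t in tagged_tokens if t == tag)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented sample data in (word, tag) form.
tagged = [("word", "SUB"), ("is", "VRB"), ("Word", "SUB"), ("god", "SUB")]
for word, count in frequency_index(tagged, "SUB"):
    print(f"{count:3d}  {word}")
```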

4.2.2 FreeText Browser

Example 1: Truber, Introduction to the Slovene Translation of the New Testament 1577:

[Screenshot: FreeText Search]

4.3. Multilingual Parallel Concordance: PARACONC (by Michael Barlow)

Example 1: Genesis 1,1 in Polish and Croatian, search for "dzien":

[Screenshot: ParaConc Search]
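
The principle behind a parallel concordance is that the two texts are aligned segment by segment, so that a hit in one language brings up the corresponding segment in the other. The sketch below assumes the alignment already exists as a list of pairs; it does not reflect how ParaConc stores or aligns its texts. The function name and the short phrases are invented and do not quote any Bible edition.

```python
def parallel_search(aligned_segments, term):
    """Return every (source, target) pair whose source segment contains
    `term`, ignoring case."""
    term = term.lower()
    return [(source, target) for source, target in aligned_segments
            if term in source.lower()]


# Invented, diacritic-free sample pairs (Polish, Croatian).
aligned = [
    ("dzien pierwszy", "dan prvi"),
    ("wieczor i poranek", "vecer i jutro"),
    ("dzien drugi", "dan drugi"),
]
for polish, croatian in parallel_search(aligned, "dzien"):
    print(f"{polish:<20} {croatian}")
```

Searching on the Polish side returns the Croatian counterparts without any knowledge of Croatian morphology; the quality of the result depends entirely on the quality of the alignment.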

5. Slavic Text Corpora on the WWW

5.1. Slavic Text Corpora, online-query available

5.1.1. The Oslo Corpus of Bosnian Texts

Access restricted to researchers (http://www.tekstlab.uio.no/Bosnian/Corpus.html)

Example 1: Search mask, query for all verbs containing "ovati":

[Screenshot: Oslo Bosnian Corpus Query]

Example 2: Query results

[Screenshot: Oslo Bosnian Corpus Result]

5.1.2. Institute of the Czech National Corpus

(http://mathesius.ff.cuni.cz/US/cnc/)

Example 1: Query mask, search for "tenhle":

[Screenshot: CNC Query]

Example 2: Result of corpus query for "tenhle":

[Screenshot: CNC Query Result]

5.1.3. IPI PAN Corpus Search

Text base: New & Old Testament, Prose texts, newspaper "Zycie Warszawy" (http://www.ipipan.waw.pl/mmgroup/Korpus/searchpage.html)

Example 1: Search mask, query for "slowo Boga"

[Screenshot: IPIPAN_Query]

Example 2: Search results

[Screenshot: IPIPAN_Query]

5.1.4. Tübingen Slavic Corpora Online-Query

(currently still only experimental, no public access available)

Example 1: TUSLA query

[Screenshot: TUSLA Query]

Example 2: TUSLA query result

[Screenshot: TUSLA Query]



Slavic Text Corpora, online-query not available:

Multilingual: Multext East Corpus (Multilingual Text & Corpora for Eastern and Central European Languages, 1995-97) (http://nl.ijs.si/ME/CD/)

OCS:

Slovene:

Serbian/ Croatian:

Bulgarian:

Russian:

Acoustic Databases:

In Stage of Planning:

6. Bibliography

Adams, L. D. (adams@cs.pitt.edu) & David J. Birnbaum (djb@kathleen.slavic.pitt.edu) (1996): "The Relationship of Russian Rhyme to Russian Orthography (Modularization, Implementation, Report)". WebSite: Perspectives on Computer Programming for the Humanities. http://clover.slavic.pitt.edu/~djbpitt/rhyme/tt.html
Apresjan, Ju. D., I. M. Boguslavsky, L. L. Iomdin, A. V. Lazourski, L. G. Mitjushin, V. Z. Sannikov, L. L. Tsinman (1992): Lingvisticheskij protsessor dlja slozhnyx informatsionnyx sistem. (A linguistic processor for advanced information systems.) Moskva.
Ball, C. N. (1997): [Online-]Tutorial: Concordances and Corpora. Department of Linguistics, Georgetown University, Washington DC. http://www.georgetown.edu/cball/corpora/tutorial.html
Boguslavsky, I. (1995): A bi-directional Russian-to-English machine translation system (ETAP-3). In: Proceedings of the Machine Translation Summit V. Luxembourg.
Boguslavsky, I., L. Tsinman (1992): "Semantics in a linguistic processor." In: Computers and Artificial Intelligence, vol. 11, N 4, 385-408.
Cermak, F. (1992): "Pocitacova lexikografie (pocitacovy fond cestiny)". In: Slovo a slovesnost 53, 41-48.
Cermak, F. (1995): "Jazykovy korpus: Prostredek a zdroj poznani". In: Slovo a slovesnost 56, 119-140.
Cermak, F. (1996): "The Czech National Corpus: A Brief Survey of the Current State". In: TELRI Newsletter 4 (1996) http://www.ids-mannheim.de/telri/newsletter/newsl4.html
Erjavec, T., N. Ide, D. Tufis (1997): Encoding and Parallel Alignment of Linguistic Corpora in Six Central and Eastern European Languages. Presented at the Joint International Conference of the ACH-ALLC '97, June 1997.
Erjavec, T. (1997): Racunalniske zbirke besedil. In: Jezik in Slovstvo, 42/2-3, 81-96.
Erjavec, T., N. Ide, V. Petkevic, J. Veronis (1996): Multext-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. In: Proceedings of the First European TELRI Seminar: Language Resources for Language Technology, 87-98.
Erjavec, T. & P. Tancig (1990): An Integrated System for Morphological Analysis of the Slovenian Language. In: CoLing '90, Conference Proceedings, 293-298.
Hladká, B. & J. Hajic (1995): A Simple Czech and English Probabilistic Tagger: A Comparison. In: Proceedings of the First European TELRI Seminar, Tihany, Hungary.
Jakopin, P. (1997) (with A. Bizjak): Part-of-Speech Tagging in the Slovenian Translation of Plato's Republic. In: TELRI Newsletter 5 (April 1997). http://www.ids-mannheim.de/telri/newsletter/newsl5.html
Jakopin, P. (1995): EVA - A Textual Data Processing Tool. In: Proceedings of the first TELRI seminar: Language Resources for Language Technology, Tihany 1995.
Johansson, S. (1991): "Times change and so do corpora". In: Aijmer, K. and Altenberg, B. (eds.) English Corpus Linguistics: Studies in Honour of Jan Svartvik, 305-14. London.
Leech, G. (1991): "The state of the art in corpus linguistics". In: Aijmer, K. and Altenberg, B. (eds.) English Corpus Linguistics: Studies in Honour of Jan Svartvik, 8-29. London.
Leech, G. (1992): "Corpora and theories of linguistic performance". In: Svartvik, J. (ed.) Directions in Corpus Linguistics, 105-22. Berlin.
Leech, G. and A. Wilson (1994): "Morphosyntactic Annotation". EAGLES Document EAG-CSG/IR-T3.1 (Version of October 1994). Pisa: EAGLES Consortium.
McEnery, T. & A. Wilson (1996): Corpus Linguistics. Edinburgh.
Seitz, E. (forthcoming): Digitale Textcorpora und Datenbanken: Neue Horizonte in der slavistischen Linguistik. In: Bulletin des Verbandes der Hochschullehrer für Slavistik (VHS) 1998.
Sinclair, J. (1991): Corpus, Concordance, Collocation. Oxford.
Szafran, K. (1997): "Automatic Lemmatisation of Texts in Polish - Is It Possible?" In: Junghanns, U. & Zybatow, G. (Hrsg.): Formale Slavistik (= Leipziger Schriften zur Kultur-, Literatur-, Sprach- und Übersetzungswissenschaft; 7) Frankfurt a.M., 437-441.
Szafran, K. (1994): Automatyczna analiza fleksyjna tekstu polskiego (na podstawie schematycznego indeksu a tergo Jana Tokarskiego). Rozprawa doktorska. Wydział Polonistyki UW, Warszawa.
Szpakowicz, S. (1978): Automatyczna analiza składniowa polskich zdan pisanych. Rozprawa doktorska. Instytut Informatyki UW, Warszawa.
Tokarski, J. (1993): Schematyczny indeks a tergo polskich form wyrazowych. Opracowanie i redakcja: Zygmunt Saloni. Warszawa: Wydawnictwo Naukowe PWN.
Van Guilder, L. (1995): Automated Part of Speech Tagging: A Brief Overview. Handout for LING361, Fall 1995. Georgetown University, http://www.georgetown.edu/cball/ling361/tagging_overview.html



© Dr. Elisabeth Seitz, Universität Tübingen (elisabeth.seitz@uni-tuebingen.de) Last Update: 22.9.1998