- 1. Corpora & Databases - Introduction
- 2. Corpora on Local Machines vs. Corpora on the WWW
- 2.1 Why Use Machine-Readable Corpora?
- 2.2 Corpus Encoding
- 2.3 Network Serving of Texts
- 3. Corpora: Annotated or Unannotated?
- 3.1 Part-of-Speech Annotation
- 3.2 Lemmatisation
- 3.3 Parsing
- 3.4 Semantics
- 3.5 Discoursal and text linguistic annotation
- 3.6 Phonetic transcription
- 3.7 Prosody
- 3.8 Problem-oriented tagging
- 4. Local Query & Concordance Software
- 5. Slavic Text Corpora on the WWW
- 6. Bibliography
(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2defin.htm)
The concept of carrying out research on written or spoken texts is not restricted to corpus linguistics. Indeed, individual texts are often used for many kinds of literary and linguistic analysis - the stylistic analysis of a poem, or a conversation analysis of a TV talk show. However, the notion of a corpus as the basis for a form of empirical linguistics is different from the examination of single texts in several fundamental ways.
In principle, any collection of more than one text can be called a corpus (corpus being Latin for "body", hence a corpus is any body of text). But the term "corpus" when used in the context of modern linguistics tends most frequently to have more specific connotations than this simple definition. The following list describes the four main characteristics of the modern corpus.
Why use a corpus? (Ball 1997)
Linguistics: to study linguistic competence or performance as revealed in naturally-occurring data. Most applications will require or lead to the creation of annotated text.
Diachronic linguistics: texts are all we have; introspection worthless; better to analyze a systematic collection of data than to reuse/reanalyze others' examples.
Computational linguistics: to train/test a natural language processing system on a representative sample of the kinds of texts the system is expected to process; to build large lexicons in a given domain ...
Applied linguistics:
- First/second language acquisition research: supplement/replace elicitation, as in 'Linguistics' above
- Language teaching/learning: language for specific purposes (e.g. use newspaper corpora, corpora of scientific texts); to prepare vocabulary lists based on high-frequency lexical items; to prepare cloze tests; to answer ad hoc learner questions ('What's the difference between few and a few?'); to discover facts about language
Taxonomies of corpora (Ball 1997)
A. by medium: printed, electronic text, digitized speech, video (e.g. for ASL), mixed
B. by design method: balanced, pyramidal, opportunistic
C. language variables:
D. language states: synchronic vs. diachronic (e.g. Brown vs. Helsinki Diachronic corpus)
E. Plain vs. annotated
perfectly plain: e.g. Project Gutenberg texts, produced by scanning; no information about text (usually, not even edition)
marked up for formatting attributes: e.g. page breaks, paragraphs, font sizes, italics, etc.
annotated with identifying information, e.g. edition date, author, genre, register, etc.
annotated for part of speech, syntactic structure, discourse information, etc.
Local corpora: placed on a machine without connection to the WWW, accessible only locally
Online corpora: located on a WWW server, accessible from outside
An example of a corpus not accessible on the Web is the Uppsala Corpus, a Russian language corpus that must be bought from the institution that created it and downloaded via FTP.
The guidelines of the Text Encoding Initiative (TEI) (http://etext.virginia.edu/TEI.html) use the Standard Generalized Markup Language (SGML) to create a single coherent framework (the TEI DTD, or Document Type Definition), which can be used to encode almost any scholarly textual resource in a manner that is hardware-, software-, and application-independent.
TEI Lite (http://www.uic.edu/orgs/tei/lite//) is a simplified version of the Text Encoding Initiative (TEI) Guidelines, which are addressed to anyone who wants to interchange information stored in an electronic form.
As explained in the TEI Lite introduction, the TEI Guidelines provide a means of making explicit certain features of a text in such a way as to aid the processing of that text by computer programs running on different machines. This process of making explicit we call markup or encoding. Any textual representation on a computer uses some form of markup; the TEI came into being partly because of the enormous variety of mutually incomprehensible encoding schemes currently besetting scholarship, and partly because of the expanding range of scholarly uses now being identified for texts in electronic form.
Structure Markup in TEI Lite for Slovene Translation of Plato's Republic (TELRI Newsletter No. 5)
The Slovene translation of Plato's 'Republic' was keyed in by the Ljubljana 2 site (Institute for Slovene Language "Fran Ramovs", Slovene Academy for Sciences and Arts, Ljubljana, Slovenia) in the text editor Eva. This version served as the starting point for encoding the document in TEI Lite conformant markup. The task of the up-translation was begun at the summer Nancy workshop, and finished in Ljubljana. In total, the process took about three days. The up-translation was accomplished partly via search and replace operations and macros in the editor Emacs, and partly via small Perl programs.
The TEI Guidelines use the Standard Generalized Markup Language (SGML) to define their encoding scheme. SGML is an international standard (ISO 8879), used increasingly throughout the information processing industries, which makes possible a formal definition of an encoding scheme, in terms of elements and attributes, and rules governing their appearance within a text. In selecting from the several hundred SGML elements defined by the full TEI scheme, a useful 'starter set' has been identified comprising the elements which almost every user should know about.
Non-Roman Character and Glyph Presentation
One of SGML's features as an encoding meta-language is that it specifies a device-independent character encoding mechanism. SGML entities represent non-ASCII characters in the SGML source file, and the viewing application software substitutes the correct presentational glyph for each entity. The Transactions contains non-English terms and symbols which are not part of the standard ASCII character set. This component of the project will be of even greater importance when working on the Russian texts.
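The substitution step described above can be illustrated with a minimal sketch. The entity names used here (`scaron`, `ccaron`, `zcaron`) follow the ISO 8879 Latin 2 entity set; the mapping table, the function name `resolve_entities` and the sample string are assumptions made for illustration and are not taken from the project described here.

```python
import re

# Assumed, illustrative subset of SGML character entities (ISO 8879 Latin 2 names).
ENTITY_TO_CHAR = {
    "scaron": "\u0161",  # s with caron
    "ccaron": "\u010d",  # c with caron
    "zcaron": "\u017e",  # z with caron
}

# An entity reference has the shape &name;
ENTITY_PATTERN = re.compile(r"&([A-Za-z0-9]+);")


def resolve_entities(text, table=ENTITY_TO_CHAR):
    """Replace known &name; references by their character; leave unknown ones untouched."""
    def substitute(match):
        name = match.group(1)
        return table.get(name, match.group(0))
    return ENTITY_PATTERN.sub(substitute, text)


print(resolve_entities("Ramov&scaron; dr&zcaron;ava"))
```

The source file stays pure ASCII, and only the last step (here the call to `resolve_entities`) needs to know which glyphs the display device can render.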
The following diagram attempts to depict the various approaches to automatic POS tagging. In reality, the picture is much more complicated, since many tagging systems use aspects of some or all of these approaches (Van Guilder 1995):
Annotated vs. Unannotated Corpora:
(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2encode.htm)
If a corpus is said to be unannotated, it appears in its existing raw state of plain text, whereas an annotated corpus has been enhanced with various types of linguistic information. Unsurprisingly, the utility of the corpus is increased when it has been annotated, making it no longer a body of text where linguistic information is implicitly present, but one which may be considered a repository of linguistic information. The implicit information has been made explicit through the process of concrete annotation.
For example, the form "gives" contains the implicit part-of-speech information "third person singular present tense verb" but it is only retrieved in normal reading by recourse to our pre-existing knowledge of the grammar of English. However, in an annotated corpus the form "gives" might appear as "gives_VVZ", with the code VVZ indicating that it is a third person singular present tense (Z) form of a lexical verb (VV). Such annotation makes it quicker and easier to retrieve and analyse information about the language contained in the corpus.
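To show why explicit tags make retrieval easier, here is a minimal sketch that reads tokens in the `word_TAG` form mentioned above and collects all words carrying a given tag. The function names and the sample sentence are hypothetical; only the `gives_VVZ` convention comes from the text.

```python
def split_tagged_token(token, separator="_"):
    """Split 'gives_VVZ' into ('gives', 'VVZ'); the tag follows the last separator."""
    word, _, tag = token.rpartition(separator)
    if not word:
        raise ValueError(f"no tag found in {token!r}")
    return word, tag


def words_with_tag(tagged_text, tag):
    """Return every word in a whitespace-separated tagged text that carries the given tag."""
    pairs = (split_tagged_token(token) for token in tagged_text.split())
    return [word for word, token_tag in pairs if token_tag == tag]


# Hypothetical sample sentence in word_TAG notation.
sample = "she_PPHS1 gives_VVZ it_PPH1 and_CC takes_VVZ"
print(words_with_tag(sample, "VVZ"))
```

In an unannotated text the same query would require a list of all third person singular verb forms, or a grammar of English.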
Leech's Maxims of Annotation
(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2maxims.htm)
One of the first distinctions which can be made among POS taggers is in terms of the degree of automation of the training and tagging process. The terms commonly applied to this distinction are supervised vs. unsupervised. Supervised taggers typically rely on pre-tagged corpora to serve as the basis for creating any tools to be used throughout the tagging process, for example: the tagger dictionary, the word/tag frequencies, the tag sequence probabilities and/or the rule set.
Unsupervised models, on the other hand, are those which do not require a pretagged corpus but instead use sophisticated computational methods to automatically induce word groupings (i.e. tag sets) and based on those automatic groupings, to either calculate the probabilistic information needed by stochastic taggers or to induce the context rules needed by rule-based systems. Each of these approaches has pros and cons. (Van Guilder 1995)
The following table outlines the differences between these two approaches (Van Guilder 1995):
| Supervised | Unsupervised |
| --- | --- |
| selection of tagset/tagged corpus | induction of tagset using untagged training data |
| creation of dictionaries using tagged corpus | induction of dictionary using training data |
| calculation of disambiguation tools. may include: | induction of disambiguation tools. may include: |
| word frequencies | word frequencies |
| affix frequencies | affix frequencies |
| tag sequence probabilities | tag sequence probabilities |
| "formulaic" expressions | - |
| tagging of test data using dictionary information | tagging of test data using induced dictionaries |
| disambiguation using statistical, hybrid or rule based approaches | disambiguation using statistical, hybrid or rule based approaches |
| calculation of tagger accuracy | calculation of tagger accuracy |
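The supervised steps (tagged corpus, dictionary with word/tag frequencies, tagging of test data, accuracy) can be sketched in a few lines. This is a deliberately simple frequency-only tagger, not the method of any particular system discussed by Van Guilder; the toy training data, the tag names and the default tag are assumptions, and no contextual disambiguation is attempted.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Build the tagger dictionary: for each word, the frequency of each tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return counts


def tag_words(words, counts, default_tag="NN"):
    """Give each word its most frequent training tag; unknown words get default_tag."""
    tagged = []
    for word in words:
        tag_counts = counts.get(word.lower())
        tag = tag_counts.most_common(1)[0][0] if tag_counts else default_tag
        tagged.append((word, tag))
    return tagged


def accuracy(predicted, gold):
    """Share of tokens whose predicted tag equals the gold tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical pre-tagged training corpus.
training = [
    [("the", "AT"), ("match", "NN"), ("ended", "VBD")],
    [("they", "PP"), ("match", "VB"), ("the", "AT"), ("colours", "NNS")],
    [("the", "AT"), ("match", "NN"), ("began", "VBD")],
]
model = train_unigram_tagger(training)
result = tag_words(["the", "match", "began"], model)
print(result)
print(accuracy(result, [("the", "AT"), ("match", "NN"), ("began", "VBD")]))
```

An unsupervised tagger would have to induce both the tag inventory and the `counts` table from untagged text, which is why it needs the more sophisticated methods mentioned above.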
Types of Annotation
(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2types.htm)
This+PRON is+VAUX_3SG a+DET sentence+NOUN_SG .+SENT
An Example from the Spoken English Corpus with C7 tagset: (source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2posex.htm)
Perdita&NN1-NP0; ,&PUN; covering&VVG; the&AT0; bottom&NN1; of&PRF; the&AT0; lorries&NN2; with&PRP; straw&NN1; to&TO0; protect&VVI; the&AT0; ponies&NN2; '&POS; feet&NN2; ,&PUN; suddenly&AV0; heard&VVD-VVN; Alejandro&NN1-NP0; shouting&VVG; that&CJT; she&PNP; better&AV0; dig&VVB; out&AVP; a&AT0; pair&NN0; of&PRF; clean&AJ0; breeches&NN2; and&CJC; polish&VVB; her&DPS; boots&NN2; ,&PUN; as&CJS; she&PNP; 'd&VM0; be&VBI; playing&VVG; in&PRP; the&AT0; match&NN1; that&DT0; afternoon&NN1; .&PUN;
The codes used are:
COBUILD Part-of-speech tags (source: http://titania.cobuild.collins.co.uk/form.html)
The corpus has been tagged automatically with a statistical tagger. You can specify a search on word/TAG combinations by appending an oblique stroke and a part-of-speech tag. POS tags must be in uppercase. Here are some major POS tags:
NOUN a macro tag: stands for any noun tag
VERB a macro tag: stands for any verb tag
NN common noun
NNS noun plural
JJ adjective
AT definite and indefinite article
RB adverb
VB base-form verb
VBN past participle verb
VBG -ing form verb
VBD past tense verb
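The word/TAG query convention, including the two macro tags, can be sketched as follows. This is not the COBUILD implementation, which is not documented here. The expansion of `NOUN` and `VERB` covers only the tags listed above, and the function names are hypothetical.

```python
# Assumed expansion of the macro tags, restricted to the tags listed above.
MACRO_TAGS = {
    "NOUN": {"NN", "NNS"},
    "VERB": {"VB", "VBN", "VBG", "VBD"},
}


def parse_query(query):
    """Split 'polish/VERB' into ('polish', 'VERB'); the tag part is optional."""
    word, _, tag = query.partition("/")
    if tag and not tag.isupper():
        raise ValueError("POS tags must be in uppercase")
    return word, tag


def matches(query, token_word, token_tag):
    """Check whether one tagged corpus token satisfies a word/TAG query."""
    word, tag = parse_query(query)
    if word.lower() != token_word.lower():
        return False
    if not tag:
        return True
    allowed = MACRO_TAGS.get(tag, {tag})
    return token_tag in allowed


print(matches("polish/VERB", "polish", "VB"))
print(matches("polish/NOUN", "polish", "VB"))
```

A query without a tag, such as `polish`, matches the word form regardless of its part of speech.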
Morphosyntactic Tagging of 16th Century Slovene texts:
(TRU_NT77) Glih taku ta (ATTR1.1:)Turska (SUB1.2:/5:>4:)vera doli iemle inu she bo stuprou naprei manshi perhaiala. Sakai ty eni (2SUB1.6:_4:)Bashi (UNDsub)inu Turki nih (AKKOBJ1.2:)Otroke skriuaie kerszhuio. Inu ty eni (ATTR-1.1:)-mladi (SUB1.21)Turki (ATTR1.1:)vogerski desheli (taku sta meni pred enim leitom dua (SUB 2.6:>12)-Studenta is (ATTR1.1:)Ardelske deshele, is Sybenburga, poueidala) se vshulah ta (AKKOBJ1.1:) Cate-hi-sem vuzhe, Inu ty (pronSUB1.5:/8:)eni ozhitu to (AKKOBJ1.2:)vero (ATTR2.1:)-Kerszhansko sposnaio, inu se puste od Turkou, skamene- (2INF2.4:_6:)possuti (UNDverb)inu vmoriti.
(TRU_DP) (pronAKKOBJ1.5:)Tiga, praui (pronSUB2.1:)on , (SUB1.1:>1:)MoiSes nei dal, (pronSUB1.4:)on le to (AKKOBJ1.1:)PoStauo oSnanuie, Ampag ta Syn (ATTR2.1:)Boshy IeSus CriStus, ta (HTpronSUB1.2:>27/ 29/-37/-40/44)iSti Sam ie no tu (2ATTR1.2:_1:)prauu nebesku (4AKKOBJ1.22/24_21/23_17/19_15/13_8:/6:) blagu, odpuSzhane (ATTR1.1:)vSeh Grehou, to Prauizo (UNDsub)inu brumo, (HTSUB1.3:)kir pred Bugo— vela, inu (ATTR1.1:)vSe tu, (pronAKKOBJ1.2:)kar (HTSU=PR)moramo (INF2.1:)imeiti, aku (HTSU=PR)hozhmo vnebeSSa (INF2.2)priti, dobil (UNDverb)inu Saslushil per Bugi, Inu tu (pronAKKOBJ1.3:/6:)iStu (ATTR2.2:)vSe no— ponuie (UNDverb)inu naprei daie vtim Euangeliu, inu (SU=PR)vely no— vSem (pronAKKOBJ1.1:)tu (INF2.4:)vSeti Sto Vero od nega SabSton.
AKKOBJ = Accusative (direct) Object
ATTR = adjectival or participial Attribute - Noun
INF = Infinitive Construction
pron = pronominal
SUB = Subject - Predicate
2SUB = double Subject
UNDsub = double construction x1 AND x2 with x = noun
UNDverb = double construction x1 AND x2 with x = verb
.1, .2, .3 etc. = x is 1/ 2/ 3... graphic units away from y
1. = x stands before y
2. = x stands after y
/ = parallel syntactic construction x/y1-x/y2
_ = parallel syntactic construction x1_y-x2_y
> = non-finite verb form > finite
The result of quantitative Analysis of Subject-Predicate-Frames can be presented as follows:
3.2 Lemmatisation
(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2lemma.htm)
Lemmatisation is a procedure of finding the lemma (i.e. textual representation of the basic form and its grammatical categorisation) and the grammatical categorisation of the given, analysed form. The textual form is treated unilaterally, as a string of letters between delimiters (Szafran 1997, 437). It is closely allied to the identification of parts-of-speech and involves the reduction of the words in a corpus to their respective lexemes. Lemmatisation allows the researcher to extract and examine all the variants of a particular lexeme without having to input all the possible variants, and to produce frequency and distribution information for the lexeme. See the example below - the last column contains the lemmatised words:
N12:0510g - PPHS1m He he
N12:0510h - VVDv studied study
N12:0510i - AT the the
N12:0510j - NN1c problem problem
N12:0510k - IF for for
N12:0510m - DD221 a a
N12:0510n - DD222 few few
N12:0510p - NNT2 seconds second
N12:0520a - CC and and
N12:0520b - VVDv thought think
N12:0520c - IO of of
N12:0520d - AT1 a a
N12:0520e - NNc means means
N12:0520f - IIb by by
N12:0520g - DDQr which which
N12:0520h - PPH1 it it
N12:0520i - VMd might may
N12:0520j - VB0 be be
N12:0520k - VVNt solved solve
N12:0520m - YF +. -
Slavic example: Tokarski's Index (Szafran 1997, 437)
Tokarski's Index is an a tergo sorted list of all possible endings of words in Polish together with endings of the corresponding base-form and their grammatical categorisation. Both endings are considered here strictly unilaterally as strings of characters only. The list consists of ca. 18,000 entries. The Index is based on Słownik Języka Polskiego PAN edited by W. Doroszewski (cf. Doroszewski (1959-1969)). There are four types of entries but we exclude here one of the groups as not useful for our purposes. Entries in that group collect information only from other groups of entries, and show this information in a shorter and more compact way.
The vast majority of entries in the Index are basic entries. These entries consist of four columns. This is illustrated in (1):
(1) -ac mII N pajac, kac, plac, pac, materac, palac (10)
-ac żII lG -aca plac, mac, prac, tac (10)
Each entry contains an ending of the described forms (first column), an ending of related basic forms (third column), the grammatical categorisation of the form (second column after the space) and grammatical categorisation of the basic form (second column before the space) and examples (fourth column).
Another group of entries are called referential entries, which have a different format:
(2) .ich A 10 ⇒ .im A 4 srogich, pruskich, zadnich, twoich, czyich (3,300) (Szafran 1997, 438)
Instead of introducing a group of basic entries for describing forms we can often use a referential entry. It points to another group of entries in the Index. Finally, a third, very small, group of entries are called the special entries. One such entry is given in (3):
(3) Ceś [2 formy] C + )eś jakeś, osiołeś, którymeś, winieneś, jużeś
They describe a case - occurring relatively often in Polish - of compound forms consisting of two parts.
Dealing with Homonyms (Szafran 1997, 439)
A homonym is a word with the same textual shape (spelling) as another word but with a different meaning and origin. Let us consider the word białka. According to the described procedure, in the successive steps of our algorithm we consider the forms białka, iałka, ałka, łka, ka, a. For the forms białka, iałka, ałka there are no basic entries in the Index. For the form łka, according to the Index (cf. entries in lines 41-47 on the 58th page of the Index), we have the following output for the basic forms (leaving aside the grammatical categorisation of the analysed form for the moment):
(4)
(białka, ndm) - indeclinable
(białek, mIII) - substantive, masculine
(białko, nII) - substantive, neuter
(białka, żIII) - substantive, feminine
(białka, blp) - substantive, plurale tantum
(białki, A) - adjective
(białkać, I) - verb
Finally, for the forms ka and a we have again no basic entries. Potentially all of the obtained basic forms may be the correct Polish lexemes. Without additional information it is impossible to decide whether they belong to contemporary Polish or not.
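The lookup procedure just described (try successively shorter endings of the analysed form and collect every base form the index offers) can be sketched as follows. The miniature index below is a hypothetical stand-in for Tokarski's Index: it holds only the one ending discussed above, and its layout as a Python dictionary is an assumption, not the format of the printed Index.

```python
# Hypothetical miniature a tergo index:
# ending of the analysed form -> list of (ending of the base form, categorisation of the base form)
TOY_INDEX = {
    "łka": [
        ("łka", "ndm"), ("łek", "mIII"), ("łko", "nII"), ("łka", "żIII"),
        ("łka", "blp"), ("łki", "A"), ("łkać", "I"),
    ],
}


def endings(form):
    """Return the successive endings of a form, longest first."""
    return [form[i:] for i in range(len(form))]


def candidate_lemmas(form, index):
    """Collect all (base form, categorisation) pairs licensed by any ending of the form."""
    candidates = []
    for ending in endings(form):
        stem = form[: len(form) - len(ending)]
        for base_ending, category in index.get(ending, []):
            candidates.append((stem + base_ending, category))
    return candidates


print(endings("białka"))
for lemma, category in candidate_lemmas("białka", TOY_INDEX):
    print(lemma, category)
```

The sketch returns all seven candidates and makes no attempt to choose among them, which mirrors the point made above: the index alone cannot decide which base forms are real Polish lexemes.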
3.3 Parsing
(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2parse.htm)
Parsing involves the procedure of bringing basic morphosyntactic categories into high-level syntactic relationships with one another. This is probably the most commonly encountered form of corpus annotation after part-of-speech tagging. Parsed corpora are sometimes known as treebanks. This term alludes to the tree diagrams or "phrase markers" used in parsing. Probably the best-known treebank is the Penn Treebank (http://www.ldc.upenn.edu/ldc/online/treebank/index.html):
The BabelSystem creates Tree-Diagrams of German Sentences
Example 1: Babel Query (Babel Online: http://www.dfki.de/~stefan/Babel/Interaktiv/):
Example 2: Babel Query Result:
Sometimes the bracket-based annotations are displayed with indentations so that they resemble the properties of a tree diagram (a system used by the Penn Treebank project). For instance:
[S
   [NP Claudia NP]
   [VP sat
      [PP on
         [NP a stool NP]
      PP]
   VP]
S]
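Because every constituent is opened with `[LABEL` and closed with `LABEL]`, an annotation of this kind can be read back into a tree with a simple stack. The following is a minimal sketch for exactly this labelled-bracket notation; the function names are hypothetical and the real Penn Treebank file format differs in detail.

```python
def parse_brackets(text):
    """Turn '[S [NP Claudia NP] ... S]' into nested (label, children) tuples."""
    stack = [("ROOT", [])]
    for token in text.split():
        if token.startswith("["):
            # An opening bracket carries the constituent label.
            stack.append((token[1:], []))
        elif token.endswith("]"):
            # A closing bracket must repeat the label of the open constituent.
            label, children = stack.pop()
            if not stack or token[:-1] != label:
                raise ValueError(f"unexpected closing bracket {token!r}")
            stack[-1][1].append((label, children))
        else:
            # Anything else is a word of the text.
            stack[-1][1].append(token)
    if len(stack) != 1 or len(stack[0][1]) != 1:
        raise ValueError("unbalanced or empty annotation")
    return stack[0][1][0]


def leaves(node):
    """Return the words of a parsed tree in text order."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [word for child in children for word in leaves(child)]


tree = parse_brackets("[S [NP Claudia NP] [VP sat [PP on [NP a stool NP] PP] VP] S]")
print(tree)
print(leaves(tree))
```

Since the parser splits on any whitespace, the indented and the one-line form of the same annotation yield the same tree.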
Because automatic parsing (via computer programs) has a lower success rate than part-of-speech annotation, it is often either post-edited by human analysts or carried out by hand (although possibly with the help of parsing software). The disadvantage of manual parsing, however, is inconsistency, especially where more than one person is parsing or editing the corpus, which can often be the case on large projects. The solution is more detailed guidelines, but even then ambiguities can occur where more than one interpretation is possible.
3.4 Semantics
(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2semant.htm)
Two types of semantic annotation can be identified:
1. The marking of semantic relationships between items in the text, for example the agents or patients of particular actions. This has scarcely begun to be widely accepted at the time of writing, although some forms of parsing capture much of its import.
2. The marking of semantic features of words in the text, essentially the annotation of word senses in one form or another. This has quite a long history, dating back to the 1960s.
There is no universal agreement about which semantic features ought to be annotated - in fact in the past much of the annotation was motivated by social scientific theories of, for instance, social interaction. However, Sedelow and Sedelow (1969) made use of Roget's Thesaurus - in which words are organised into general semantic categories.
The example below (Wilson, forthcoming) is intended to give the reader an idea of the types of categories used in semantic tagging:
And 00000000
the 00000000
soldiers 23241000
platted 21072000
a 00000000
crown 21110400
of 00000000
thorns 13010000
and 00000000
put 21072000
it 00000000
on 00000000
his 00000000
head 21030000
and 00000000
they 00000000
put 21072000
on 00000000
him 00000000
a 00000000
purple 31241100
robe 21110321
The numeric codes stand for:
00000000 Low content word (and, the, a, of, on, his, they etc)
13010000 Plant life in general
21030000 Body and body parts
21072000 Object-oriented physical activity (e.g. put)
21110321 Men's clothing: outer clothing
21110400 Headgear
23231000 War and conflict: general
31241100 Colour
The semantic categories are represented by 8-digit numbers - the one above is based on that used by Schmidt (1993) and has a hierarchical structure, in that it is made up of three top level categories, which are themselves subdivided, and so on.
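A short sketch shows how such codes can be used once they are in the text: low content words are filtered out, the remaining words are labelled, and related categories can be recognised by the digits they share. The label table is a subset of the legend above. Reading a shared leading digit string as "same branch of the hierarchy" is an assumption, since the exact digit layout of the scheme is not given here.

```python
# Subset of the category legend given above.
CATEGORY_LABELS = {
    "00000000": "Low content word",
    "13010000": "Plant life in general",
    "21030000": "Body and body parts",
    "21072000": "Object-oriented physical activity",
    "21110321": "Men's clothing: outer clothing",
    "21110400": "Headgear",
    "31241100": "Colour",
}


def read_pairs(tagged_text):
    """Read alternating 'word code word code ...' input into (word, code) pairs."""
    items = tagged_text.split()
    return list(zip(items[0::2], items[1::2]))


def content_words(pairs):
    """Drop low content words and attach the category label to the rest."""
    return [
        (word, CATEGORY_LABELS.get(code, "unlisted code"))
        for word, code in pairs
        if code != "00000000"
    ]


def shared_prefix(code_a, code_b):
    """Return the leading digits two codes have in common."""
    length = 0
    for digit_a, digit_b in zip(code_a, code_b):
        if digit_a != digit_b:
            break
        length += 1
    return code_a[:length]


sample = "a 00000000 crown 21110400 of 00000000 thorns 13010000 a 00000000 purple 31241100 robe 21110321"
print(content_words(read_pairs(sample)))
print(shared_prefix("21110400", "21110321"))
```

Here "crown" (Headgear) and "robe" (Men's clothing) share the first five digits, whereas "crown" and "thorns" differ from the first digit on.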
3.5 Discoursal and Text Linguistic Annotation
(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2discour.htm)
Aspects of language at the levels of text and discourse are among the least frequently annotated features in corpora. However, occasionally such annotations are applied.
Discourse tags
Stenström (1984) annotated the London-Lund spoken corpus with 16 "discourse tags". They included categories such as:
"apologies" e.g. sorry, excuse me
"greetings" e.g. hello
"hedges" e.g. kind of, sort of thing
"politeness" e.g. please
"responses" e.g. really, that's right
Despite their potential role in the analysis of discourse these kinds of annotation have never become widely used, possibly because the linguistic categories are context-dependent and their identification in texts is a greater source of dispute than other forms of linguistic phenomena.
Anaphoric annotation
Cohesion is the vehicle by which elements in text are linked together, through the use of pronouns, repetition, substitution and other devices. Halliday and Hasan's "Cohesion in English" (1976) was considered to be a turning point in linguistics, as it was the most influential account of cohesion. Anaphoric annotation is the marking of pronoun reference - our pronoun system can only be realised and understood by reference to large amounts of empirical data, in other words, corpora.
Anaphoric annotation can only be carried out by human analysts, since one of the aims of the annotation is to train computer programs with this data to carry out the task. There are only a few instances of corpora which have been anaphorically annotated; one of these is the Lancaster/IBM anaphoric treebank, an example of which is given below:
A039 1 v (1 [N Local_JJ atheists_NN2 N] 1) [V want_VV0 (2 [N the_AT (9 Charlotte_N1 9) Police_NN2 Department_NNJ N] 2) [Ti to_TO get_VV0 rid_VVN of_IO [N 3
The above text has been part-of-speech tagged and skeleton parsed, as well as anaphorically annotated. The following codes explain the annotation:
Spoken language corpora can also be transcribed using a form of phonetic transcription. Not many examples of publicly available phonetically transcribed corpora exist at the time of writing. This is possibly because phonetic transcription is a form of annotation which needs to be carried out by humans rather than computers. Transcribers have to be highly skilled in the perception and transcription of speech sounds. Phonetic transcription is therefore a very time-consuming task.
Another problem is that phonetic transcription works on the assumption that the speech signal can be divided into single, clearly demarcated "sounds", while in fact, these "sounds" do not have such clear boundaries, therefore what phonetic transcription takes to be the same sound, might be different according to context.
Nevertheless, phonetically transcribed corpora are extremely useful to the linguist who lacks the technological tools and expertise for the laboratory analysis of recorded speech. One such example is the MARSEC corpus, which is derived from the Lancaster/IBM Spoken English Corpus and has been manipulated by the Universities of Lancaster and Leeds. The MARSEC corpus will include a phonetic transcription.
Prosody refers to all aspects of the sound system above the level of segmental sounds e.g. stress, intonation and rhythm. The annotations in prosodically annotated corpora typically follow widely accepted descriptive frameworks for prosody such as that of O'Connor and Arnold (1961). Usually, only the most prominent intonations are annotated, rather than the intonation of every syllable. The example below is taken from the London-Lund corpus:
The codes used in this example are:
Problems of Prosodic Corpora
Problem-oriented tagging (as described by de Haan (1984)) is the phenomenon whereby users will take a corpus, either already annotated, or unannotated, and add to it their own form of annotation, oriented particularly towards their own research goal. This differs in two ways from the other types of annotation we have examined in this session.
1. It is not exhaustive. Not every word (or sentence) is tagged - only those which are directly relevant to the research. This is something which problem-oriented tagging has in common with anaphoric annotation.
2. Annotation schemes are selected, not for broad coverage and theory-neutrality, but for the relevance of the distinctions which they make to the specific questions that the researcher wishes to ask of his/her data.
Example 2.: Search Result of the above Query:
Example 2.: same text base as in (1), Word-Concordance, Include: All words containing "ei"
Example 3: Concordance plus Frequency-Index (Truber, Dolga Predguvor):
Example 4: Concordance plus Frequency-Index (Truber, Dolga Predguvor POS-tagged), Include "SUB":
Example 5: Concordance plus Frequency-Index (Truber, Dolga Predguvor POS-tagged), Include "UND":
Example 1.: Search Mask, Query for all verbs containing "ovati":
Example 2.: Query Results
Example 1.: Query Mask, Search for "tenhle":
Example 2.: Result of Corpus Query for "tenhle":
Example 1.: Search Mask, Query for "slowo Boga"
Example 2.: Search Results
Example 1: TUSLA Query
Example 2: TUSLA Query Result
Slavic Text Corpora, online-query not available:
Multilingual:
=> Multext East Corpus, multilingual (Multilingual Text & Corpora for Eastern and Central European Languages, 1995-97) (http://nl.ijs.si/ME/CD/)
OCS:
Slovene:
Serbian/ Croatian:
Bulgarian:
Russian:
Acoustic Databases:
In Stage of Planning:
3.6 Phonetic Transcription
(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2phonet.htm)
3.7 Prosody
(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2prosody.htm)
1 8 14 1470 1 1 A 11 ^what a_bout a cigar\ette# . /
1 8 15 1480 1 1 A 20 *((4 sylls))* /
1 8 14 1490 1 1 B 11 *I ^w\on't have one th/anks#* - - - /
1 8 14 1500 1 1 A 11 ^aren't you .going to sit d/own# - /
1 8 14 1510 1 1 B 11 ^[/\m]# - /
1 8 14 1520 1 1 A 11 ^have my _coffee in p=eace# - - - /
1 8 14 1530 1 1 B 11 ^quite a nice .room to !s\it in ((actually))# /
1 8 14 1540 1 1 B 11 *^\isn't* it# /
1 5 15 1550 1 1 A 11 *^y/\es#* - - - /
# end of tone group
^ onset
/ rising nuclear tone
\ falling nuclear tone
/\ rise-fall nuclear tone
_ level nuclear tone
[] enclose partial words and phonetic symbols
. normal stress
! booster: higher pitch than preceding prominent syllable
= booster: continuance
(( )) unclear
* * simultaneous speech
- pause of one stress unit
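The codes listed above can be processed mechanically, for example to count tone groups or to recover the plain wording of an utterance. The sketch below assumes that each line starts with eight whitespace-separated reference fields with the speaker in the seventh, which is inferred from the example lines and not from the corpus manual; the function names are hypothetical.

```python
import re

# All prosodic symbols from the code list, plus the line-final slash.
PROSODIC_MARKS = re.compile(r"[\^/\\_=!#*()\[\].\-]")
# Unclear stretches that are given only as a syllable count, e.g. ((4 sylls)).
UNCLEAR_COUNT = re.compile(r"\(\(\d+ sylls\)\)")


def parse_line(line):
    """Split a corpus line into (speaker, transcription); assumes 8 leading reference fields."""
    fields = line.split(None, 8)
    return fields[6], fields[8]


def tone_group_count(transcription):
    """Count tone groups by their end marker '#'."""
    return transcription.count("#")


def plain_words(transcription):
    """Strip the prosodic annotation and return the bare words."""
    text = UNCLEAR_COUNT.sub(" ", transcription)
    text = PROSODIC_MARKS.sub("", text)
    return text.split()


line = r"1 8 14 1470 1 1 A 11 ^what a_bout a cigar\ette# . /"
speaker, transcription = parse_line(line)
print(speaker, tone_group_count(transcription), plain_words(transcription))
```

Stripping the marks discards exactly the information a prosodic corpus exists to provide, so a step like `plain_words` is useful only for tasks such as building a word index alongside the annotated text.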
3.8 Problem-Oriented Tagging
(source: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2problem.htm)
4. Local Query & Concordance Software
4.1. Gofer
Example 1.: Boolean Search-Option for the Terms "Slouen*", "Sloven*", "Crain*"...:
4.2. Monolingual Concordances:
4.2.1 CONC
Example 1.: P. Trubar, Ena Dolga Predguvor (1557), Word-Concordance, Include: All words
4.2.2 FreeText Browser
Example 1.: Truber, Introduction to the Slovene Translation of the New Testament 1577:
4.3. Multilingual Parallel Concordance: PARACONC (by Michael Barlow)
Example 1.: Genesis 1,1 in Polish and Croatian, Search for "dzien":
5. Slavic Text Corpora on the WWW
5.1. Slavic Text Corpora, online-query available
5.1.1. The Oslo Corpus of Bosnian Texts
Access restricted to researchers (http://www.tekstlab.uio.no/Bosnian/Corpus.html)
5.1.2. Institute of the Czech National Corpus
(http://mathesius.ff.cuni.cz/US/cnc/)
5.1.3. IPI PAN Corpus Search
Text base: New & Old Testament, Prose texts, newspaper "Zycie Warszawy" (http://www.ipipan.waw.pl/mmgroup/Korpus/searchpage.html)
5.1.4. Tübingen Slavic Corpora Online-Query
(currently still only experimental, no public access available)
6. Bibliography
Adams, L. D. (adams@cs.pitt.edu) & David J. Birnbaum (djb@kathleen.slavic.pitt.edu) (1996): "The Relationship of Russian Rhyme to Russian Orthography (Modularization, Implementation, Report)". WebSite: Perspectives on Computer Programming for the Humanities. http://clover.slavic.pitt.edu/~djbpitt/rhyme/tt.html
Apresjan, Ju. D., I. M. Boguslavsky, L. L. Iomdin, A. V. Lazourski, L. G. Mitjushin, V. Z. Sannikov, L. L. Tsinman (1992): Lingvisticheskij protsessor dlja slozhnyx informatsionnyx sistem. (A linguistic processor for advanced information systems.) Moskva.
Ball, C. N. (1997): [Online-]Tutorial: Concordances and Corpora. Department of Linguistics, Georgetown University, Washington DC. http://www.georgetown.edu/cball/corpora/tutorial.html
Boguslavsky, I. (1995): A bi-directional Russian-to-English machine translation system (ETAP-3). In: Proceedings of the Machine Translation Summit V. Luxembourg.
Boguslavsky, I., L. Tsinman (1992): "Semantics in a linguistic processor." In: Computers and Artificial Intelligence, vol. 11, N 4, 385-408.
Cermak, F. (1992): "Pocitacova lexikografie (pocitacovy fond cestiny)". In: Slovo a slovesnost 53, 41-48.
Cermak, F. (1995): "Jazykovy korpus: Prostredek a zdroj poznani". In: Slovo a slovesnost 56, 119-140.
Cermak, F. (1996): "The Czech National Corpus: A Brief Survey of the Current State". In: TELRI Newsletter 4 (1996) http://www.ids-mannheim.de/telri/newsletter/newsl4.html
Erjavec, T. , N. Ide, D. Tufis (1997): Encoding and Parallel Alignment of Linguistic Corpora in Six Central and Eastern European Languages. Presented at the Joint International Conference of the ACH-ALLC '97, June 1997.
Erjavec, T. (1997): Racunalniske zbirke besedil. In: Jezik in Slovstvo, 42/2-3, 81-96.
Erjavec, T., N. Ide, V. Petkevic, J. Veronis (1996): Multext-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. In: Proceedings of the First European TELRI Seminar: Language Resources for Language Technology, 87-98.
Erjavec, T. & P. Tancig (1990): An Integrated System for Morphological Analysis of the Slovenian Language. In: CoLing '90, Conference Proceedings, 293-298.
Hladká, B. & J. Hajic (1995): TELRI, Proceedings of the First European Seminar: A Simple Czech and English Probabilistic Tagger: A Comparison. Tihany, Hungary.
Jakopin, P. (1997) (with A. Bizjak): Part-of-Speech Tagging in the Slovenian Translation of Plato's Republic. In: TELRI Newsletter 5 (April 1997). http://www.ids-mannheim.de/telri/newsletter/newsl5.html
Jakopin, P. (1995): EVA - A Textual Data Processing Tool. In: Proceedings of the first TELRI seminar: Language Resources for Language Technology, Tihany 1995.
Johansson, S. (1991): "Times change and so do corpora". In: Aijmer, K. and Altenberg, B. (eds.) English Corpus Linguistics: Studies in Honour of Jan Svartvik, 305-14. London.
Leech, G. (1991): "The state of the art in corpus linguistics". In: Aijmer, K. and Altenberg, B. (eds.) English Corpus Linguistics: Studies in Honour of Jan Svartvik, 8-29. London.
Leech, G. (1992): "Corpora and theories of linguistic performance". In: Svartvik, J. (ed.) Directions in Corpus Linguistics, 105-22. Berlin.
Leech, G. and A. Wilson (1994): "Morphosyntactic Annotation". EAGLES Document EAG-CSG/IR-T3.1 (Version of October 1994). Pisa: EAGLES Consortium.
McEnery, T. & A. Wilson (1996): Corpus Linguistics. Edinburgh.
Seitz, E. (forthcoming): Digitale Textcorpora und Datenbanken: Neue Horizonte in der slavistischen Linguistik, in: Bulletin des Verbandes der Hochschullehrer für Slavistik (VHS) 1998
Sinclair, J. (1991): Corpus, Concordance, Collocation. Oxford.
Szafran, K. (1997): "Automatic Lemmatisation of Texts in Polish - Is It Possible?" In: Junghanns, U. & Zybatow, G. (Hrsg.): Formale Slavistik (= Leipziger Schriften zur Kultur-, Literatur-, Sprach- und Übersetzungswissenschaft; 7) Frankfurt a.M., 437-441.
Szafran, K. (1994): Automatyczna analiza fleksyjna tekstu polskiego (na podstawie schematycznego indeksu a tergo Jana Tokarskiego). Rozprawa doktorska. Wydział Polonistyki UW, Warszawa.
Szpakowicz, S. (1978): Automatyczna analiza składniowa polskich zdan pisanych. Rozprawa doktorska. Instytut Informatyki UW, Warszawa.
Tokarski, J. (1993): Schematyczny indeks a tergo polskich form wyrazowych. Opracowanie i redakcja: Zygmunt Saloni. Warszawa: Wydawnictwo Naukowe PWN.
Van Guilder, L. (1995): Automated Part of Speech Tagging: A Brief Overview. Handout for LING361, Fall 1995. Georgetown University, http://www.georgetown.edu/cball/ling361/tagging_overview.html
© Dr. Elisabeth Seitz, Universität Tübingen (elisabeth.seitz@uni-tuebingen.de) Last Update: 22.9.1998