Corpus Structure & Corpus Segments

1. Structure of Corpus

Click on the model area to enlarge it.

Download pdf manual

2. Structure of Data Implementation into Databases

Click on the model area to enlarge it.

3. Fictional Topography Models - Character Wandering Network Models - Place Density

The models use vectorised maps created in Adobe Illustrator. The purpose of these maps is to provide a basic topographic foundation onto which layers of fictional topographies, network models and place density models are projected. For the fictional prose topography of the second half of the 19th century, Karl Kořistka's Plan of Prague with the Extent of the Floods of 1845 and 1872, published in 1873, serves as the base model.
Character Wandering Network Models show the connections between the places where a character or characters move. Click on Network Model Type Visualization to see the network model types.
The place density map shows each location according to its frequency load in a given text. The size of the circle is calculated as 2πr, where r is the relative frequency of the toponym.

4. Frequency of Toponyms and Locations – Prague and Non-Prague Locations – Sacred Spaces – GIS

The toponym frequency graphs are obtained by filtering toponyms from a special TXT file, which is extracted manually from the text. Automatic name detection using Named Entity Recognition (for Czech names, see Czech Named Entity Corpus 1.0) turns out to be not entirely suitable, as it does not recognise some propria that are not referred to by their explicit name. The frequencies of synonymous namings of a toponym are grouped under a single name; for example, in Arbes's Saint Xavier the St. Nicholas Cathedral in Malá Strana is referred to in several ways, e.g. "St. Nicholas", "St. Nicholas Cathedral", "Lesser Town Cathedral". The list is available in Open Data.

By locations, we generally mean Prague and non-Prague locations. Here we distinguish whether the action takes place in Prague (in this case, regardless of the specific topographical location) or outside Prague. This criterion naturally varies with the urban development of the city, especially from the end of the 19th century onwards. An interpretation that takes this aspect into account must therefore consider individual locations in relation to the historical transformation of Prague's topography. For example, Smíchov in Arbes's romances is a space beyond the borders of Prague.

The sacred places form the following categories: temple, ancient temple, cathedral, monastery, church, parts of the church, interior of the church, rectory and cemetery. These categories are chosen according to entries from the thematic Thesaurus of the Czech Language.

GIS maps show the location of fictional places according to coordinates (longitude, latitude). Places that still exist are distinguished from those that have disappeared.

5. Sentence Lengths - Boxplots - TTR/MATTR - Entropy - Frequency of Text Segments - Narrative Rhythm

This section contains information about the analyzed text segments. They are listed on the y-axis of the Sentence Lengths chart, which shows the average sentence lengths of single-character segments for each text, as well as across the entire author corpus (lower area of the chart). The second graph contains boxplots that show the dispersion of sentence lengths in a text. A boxplot represents the values between the first quartile (Q25) and the third quartile (Q75), i.e. the spread of the middle 25%-75% of values. Inside the boxplot is the median (Q50), which marks the middle value of the ordered numerical sequence. Outliers are also part of the plots.


Example:
1, 2, 5 ,8, 15, 16, 16, 17, 18, 20, 25, 30
This ordered numerical series of measured values (in our case, the individual sentence lengths in a text) has 12 items. The quartile positions are Q25 = item 3, Q50 = item 6, Q75 = item 9. The boxplot covers the area between Q25 and Q75, i.e. the values 5, 8, 15, 16, 16, 17, 18, with Q50 = 16. The median is the middle value of the ordered set, not the arithmetic mean, which in this case is approximately 14.4. The resulting boxplot shows a higher concentration of values in the third quartile, i.e. in the range between 50% and 75%:
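The quartile positions and the median/mean distinction above can be sketched in Python. This is a minimal illustration using the position-based quartile convention of the example (item 3, item 6, item 9 of 12); interpolating quartile methods would give slightly different cut points.

```python
from statistics import mean

lengths = [1, 2, 5, 8, 15, 16, 16, 17, 18, 20, 25, 30]  # ordered sentence lengths

n = len(lengths)                                          # 12 items
q25_pos, q50_pos, q75_pos = n // 4, n // 2, 3 * n // 4    # positions 3, 6, 9

box = lengths[q25_pos - 1:q75_pos]   # values from item 3 to item 9
median = lengths[q50_pos - 1]        # item 6 -> 16
arithmetic_mean = mean(lengths)      # ~14.4, different from the median

print(box, median, round(arithmetic_mean, 1))
```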



The measurement of text segments is based on distinguishing the basic speech types in the narrative and some selected parts. This distinction does not pursue a narratological goal, but seeks to delimit the speech segments of a narrative as formally as possible. Such a procedure is necessary if subjective interpretation is to be eliminated as far as possible when delineating speech segments. We therefore only use categories marked by relatively explicit signals to label these segments (the tagset is available in the Open Data section). Specifically, we distinguish:



Although the above segmentation is far from ideal, since some interpretation is not entirely avoidable when delineating these parts, an effort has been made to keep the delimitation process as transparent as possible and restricted to clearly distinguishable segments. The aim, as stated, is not a narratological analysis of the segments, but their formal delineation for the purpose of quantitative analysis. At the same time, this delineation must work at least with the basic categories of literary scholarship. The quantitative analysis of these segments, expressed in word counts, makes it possible to observe the frequency stratification of segments in individual texts (relative frequencies are used for comparison), the variation in their use and their functional load. Each segment can also be further analysed in terms of other quantitative criteria such as sentence lengths, word types, concordances or the most frequent lexis. While the tools for this type of analysis are no longer publicly available as part of this project, the text segments themselves are available as TXT files in the Open Data section for further processing.
The analysis of the speech segments includes a rhythmic analysis of the narrative segments, based on measuring the size of the corresponding narrative block in word counts and its position in the text.


Type-token ratio (TTR) is the ratio of unique words (types) to all lexical units (tokens).
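A minimal sketch of how TTR, and its moving-average variant MATTR mentioned in the section heading, can be computed. The window size below is an illustrative choice, not the project's actual setting.

```python
def ttr(tokens):
    """Type-token ratio: unique words divided by all tokens."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=100):
    """Moving-average TTR: mean of TTR over all windows of a fixed size."""
    if len(tokens) <= window:
        return ttr(tokens)
    spans = [tokens[i:i + window] for i in range(len(tokens) - window + 1)]
    return sum(ttr(s) for s in spans) / len(spans)

sample = "the dog saw the cat and the cat saw the dog".split()
print(ttr(sample))             # 5 types / 11 tokens
print(mattr(sample, window=4))
```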


Entropy expresses the degree of disorder of a system. The higher the value, the less ordered the system, i.e. the less predictable its units.
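This can be illustrated as Shannon entropy over a token-frequency distribution (a sketch; the project's exact unit of measurement is not specified in this section):

```python
from collections import Counter
from math import log2

def entropy(tokens):
    """Shannon entropy (in bits) of the token distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

print(entropy(["a", "a", "a", "a"]))   # 0.0 -- one repeated unit, fully predictable
print(entropy(["a", "b", "c", "d"]))   # 2.0 -- four equiprobable units
```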

6. Sentiment Analysis - Word Cluster Models

Sentiment analysis is carried out on the basis of the list of lemmas made available by Kateřina Veselovská and Ondřej Bojar within the SubLex 1.0 project. In the course of the Czech prose corpus project, we are currently also working with a lexicon excerpted from the Thesaurus of the Czech Language, specifically with entries containing lexemes expressing emotions. A provisional list is available in Open Data.
Word cluster models show the set of the most frequent autosemantic words in a given work. The size of a word in the model corresponds to its frequency load. Synsemantic and certain other words are removed during the analysis; their list is in the Open Data section. Cluster models are used to track the main motifs.

The Clusters of Sentiment section allows you to search for thematic clusters of emotions in each prose work of the author corpus. Clusters are defined according to the Thesaurus of the Czech Language by Aleš Klégr et al.

7. Search for Word Types by Simplified Tags, Percentage of Word Types

The search for word types is performed in a database of texts morphologically tagged with the MorphoDiTa tool developed by Milan Straka and Jana Straková, or with UDPipe. After entering a collapsed tag (the first position of a 15-position tag) for a particular word type, the corresponding lemmas are retrieved, sorted in descending order of frequency. On the right, pie charts with the percentages of word types in a given work can be viewed. Each author subcorpus can be searched with respect to a given work. Databases of lemmatized and morphologically tagged texts are accessible in the Open Data section.
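The word-type percentages can be illustrated with a small sketch. It assumes Prague-style positional tags, in which the first of 15 positions encodes the part of speech (N = noun, A = adjective, V = verb, ...); the example tags are invented for illustration.

```python
from collections import Counter

# Invented example tags; the first character is the collapsed word-type tag.
tags = ["NNFS1-----A----", "AAFS1----1A----", "VB-S---3P-AA---", "NNNS4-----A----"]

pos_counts = Counter(tag[0] for tag in tags)
percentages = {pos: 100 * count / len(tags) for pos, count in pos_counts.items()}
print(percentages)   # {'N': 50.0, 'A': 25.0, 'V': 25.0}
```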

8. Search by Complete Position Tags

To generate morphological tags, use this tool. After the tag is generated, replace its last two characters (.*) with a single hyphen (-). For example, to find all adjectives that are feminine, singular, accusative, 1st degree and affirmative, the generator produces the tag AAFS4----1A---.*. Replacing the trailing .* with - yields AAFS4----1A----. Enter the tag in this form into the search engine. The output is the relevant tokens with their frequencies. It is essential that all tag positions are filled. For example, NNFP4-----A---.* is a correctly specified complete morphological tag; after replacing the trailing .* with -, it can be searched. By contrast, the tag NNF..-----.---.* is incomplete: even after the same replacement, it will not work.
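The suffix replacement described above can be sketched as a small helper (an illustration of the rule, not part of the search engine itself):

```python
def searchable_tag(generated: str) -> str:
    """Replace the trailing '.*' produced by the tag generator with '-'."""
    if generated.endswith(".*"):
        return generated[:-2] + "-"
    return generated

print(searchable_tag("AAFS4----1A---.*"))   # AAFS4----1A----
print(searchable_tag("NNFP4-----A---.*"))   # NNFP4-----A----
```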

9. Concordance (KWIC)

Concordance search (5 positions on the left, 5 positions on the right of the searched word including punctuation) is performed in the database of lemmatized texts. Within each author sub-corpus, concordances can be searched with respect to the respective work and also in the whole author corpus. It is also possible to search the entire author corpora and the corpus of all authors.
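A concordance window of this kind can be sketched as follows (a minimal illustration with invented text; the actual search runs over the lemmatized database):

```python
def kwic(tokens, word, span=5):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for i, token in enumerate(tokens):
        if token == word:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            hits.append((left, token, right))
    return hits

tokens = "on the way to the old church , he stopped".split()
left, keyword, right = kwic(tokens, "church")[0]
print(" ".join(left), "[" + keyword + "]", " ".join(right))
```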

10. Collocation

Collocations in the right context are computed using association measures: logDice, MI-score and T-score.
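The three measures can be illustrated with their common definitions (following the widely used Sketch Engine formulas, not necessarily the project's exact implementation). Here f_xy is the co-occurrence frequency, f_x and f_y the individual word frequencies, and n the corpus size; all counts below are invented.

```python
from math import log2, sqrt

def mi_score(f_xy, f_x, f_y, n):
    """Mutual information: log2 of observed over expected co-occurrence."""
    return log2(f_xy * n / (f_x * f_y))

def t_score(f_xy, f_x, f_y, n):
    """T-score: (observed - expected) scaled by sqrt of the observed count."""
    return (f_xy - f_x * f_y / n) / sqrt(f_xy)

def log_dice(f_xy, f_x, f_y):
    """logDice: 14 + log2 of the Dice coefficient."""
    return 14 + log2(2 * f_xy / (f_x + f_y))

# Invented counts: pair occurs 30x, words 100x and 200x, corpus of 100,000 tokens.
print(round(mi_score(30, 100, 200, 100_000), 2))   # 7.23
print(round(t_score(30, 100, 200, 100_000), 2))    # 5.44
print(round(log_dice(30, 100, 200), 2))            # 11.68
```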

11. Lemma Search & Listing of Unique Forms - Frequency Lists

The search for lemmas and individual word forms is carried out in the database of lemmatized texts. After entering the desired lemma in the appropriate search field, basic information about the lemma is displayed, i.e. absolute and relative frequencies, as well as the word forms (unique tokens) that occur in the text. The output includes a complete frequency list of all lemmas, downloadable as a PDF for the respective text.

12. Open Data

After logging into the database on the home page or via this link, the following data are available to the user. Data will be continuously added and updated:


The structure of the JSON file containing the dictionary for thematic sentiment analysis:


{
     "NAME OF EMOTION"   :    [ emotional lemmas ]
}

The structure of the JSON files containing the whole corpus:

{
     "TITLE"  :   {
                      "AUTHOR"       :   author name,
                      "BORN"         :   author's date of birth,
                      "DEATH"        :   author's date of death,
                      "1. PUB"       :   first publication of the title,
                      "ACTUAL PUB"   :   actual publication used,
                      "TEXT"         :   text of the title,
                      "LEMMA"        :   lemmas,
                      "MORPHO TAGS"  :   morphological tags
                  }
}
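Reading such a file could look like this (a sketch only: the title, the author's dates and the reduced set of keys are illustrative placeholders, not actual corpus content):

```python
import json

# Illustrative stand-in for one of the corpus JSON files.
corpus = json.loads("""
{
  "Svaty Xaverius": {
      "AUTHOR": "Jakub Arbes",
      "BORN": "1840",
      "DEATH": "1914",
      "TEXT": "...",
      "LEMMA": "...",
      "MORPHO TAGS": "..."
  }
}
""")

# Each top-level key is a title; its value holds the metadata and text layers.
for title, record in corpus.items():
    print(title, "-", record["AUTHOR"])
```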


The structure of the CSV file (column, description, dtype):


author              author names; object
year_birth          author's year of birth; int64
year_dead           author's year of death; int64
title               titles of works; object
year_published      year of first publication; int64
num_sentences       number of sentences; int64
num_tokens          number of tokens; int64
num_lemmas          number of lemmas; int64
longest_sentence    length of the longest sentence; int64
longest_word        length of the longest word; int64

Recommendation: to load the table correctly in Python, use:

import pandas as pd
df = pd.read_csv("tabulka_dat.csv").drop(columns=["Unnamed: 0"])

13. Composition of Corpus

For individual works, the cartographic databases may not always correlate with the linguistic corpus, because not all works of the excerpted authors are thematically focused on Prague topography. The purpose of the linguistic corpus, however, is to cover, as far as possible, the entire prose corpus of a given author, including prose works that do not explicitly thematise Prague topography. The linguistic corpus thus serves both as a functional complement to the database of cartographic models and as a resource for independent searches in the authors' prose corpora.
For a preview of the list of texts being processed, click here.

14. Word in Context

Searches the right and left context based on a choice of two words. The search engine determines whether the two words you enter occur within a context of no more than 3 positions to the right or left of the first word.
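The proximity condition can be sketched as follows (an illustration of the rule as described, with an invented sentence; the actual search engine implementation may differ):

```python
def within_context(tokens, first, second, span=3):
    """True if `second` occurs within `span` positions of any occurrence of `first`."""
    first_pos = [i for i, t in enumerate(tokens) if t == first]
    second_pos = [i for i, t in enumerate(tokens) if t == second]
    return any(0 < abs(i - j) <= span for i in first_pos for j in second_pos)

tokens = "the old church near the bridge".split()
print(within_context(tokens, "church", "bridge"))   # True  (3 positions apart)
print(within_context(tokens, "old", "bridge"))      # False (4 positions apart)
```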

15. Texts Database

The text database contains raw texts that are part of the corpus. The database is gradually being added to. The texts can be obtained e.g. by web scraping.

16. Relationships between Characters in Fictional Worlds

The charts show the relationships between the characters in each episode. Relationships between characters are realized by dialogue or narration referring to dialogue between characters.

17. Dispersion of Narrative Segments on the Timeline

The graph shows the quantitative distribution of the selected segments on the timeline.

18. Basic Statistics

The graph shows the quantitative distribution of lemmas, tokens, average sentence lengths and median sentence lengths. Selecting an author's name from the menu below displays the relevant quantitative and statistical values.

19. Stylometry

Within the search of the whole corpus, it is possible to view selected graphs of stylometric analysis that show the intensity of network connectivity between parts (network graph) and a clear structuring of cluster relationships (dendrogram graph). Both charts are generated by Stylo (Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R: a package for computational text analysis. R Journal 8(1): 107-121. https://journal.r-project.org/archive/2016/RJ-2016-007/index.html)

20. How to Quote

Změlík, Richard: Literary Cartographic and Quantitative Models of Czech Novels from the 19th to 21st Century. FF UP, Olomouc. Available at: https://korpusprozy.com/ [cited: day, month, year]




last update: