Labs: flashcards | natural-language-processing

for the sake of our homework … $p(w_1,w_2,\dots,w_n)\approx \prod_i p(w_i)$ (unigram)

or $\approx\prod_i p(w_{i}\mid w_{i-1})$ (bigram)

task

create text corpus
- popular corpora
  - Brown, CNK, PDT
  - Common Crawl, Internet Archive, arXiv
  - RedPajama
- what we can use
  - books (Gutenberg)
  - Wikipedia, W2C
- we should store the data as plaintext, one UTF-8 file per language, at least 2 languages (one of them should be English); the size should be at least 500k tokens (words)
unigram statistics
- guess the 10 most frequent words
- compute the 10 most frequent words
- show the least frequent words
- show the unigram probabilities of everything
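
The unigram steps above could be sketched roughly as follows. This is only a minimal sketch on a toy token list; the function name `unigram_stats` is made up for illustration, and a real run would read the tokenized corpus file instead.

```python
from collections import Counter

def unigram_stats(tokens):
    """Return (counts, probs): raw counts and relative frequencies p(w)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = {w: c / total for w, c in counts.items()}
    return counts, probs

# Toy input (assumption); replace with the tokens of the real corpus.
tokens = "the cat sat on the mat and the dog sat too".split()
counts, probs = unigram_stats(tokens)

# 10 most frequent words, highest count first.
top10 = counts.most_common(10)

# Least frequent words: sort by count, ties broken alphabetically.
rarest = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:10]

for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{p:.4f}")
```

On a real corpus the rarest words are mostly hapax legomena (count 1), so there are many ties at the bottom of the list.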
tokenize data
- Sacremoses
- we may use truecase
bigram stats
- guess the top 10 most frequent bigrams
- function for joint & conditional probability
  - for most frequent unigrams (or some words)
- show top 10 bigrams
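
A possible shape for the joint and conditional probability functions, as a sketch only. The names `bigram_counts`, `joint_prob` and `cond_prob` are hypothetical, the estimates are plain relative frequencies (no smoothing), and sentence boundaries are ignored for simplicity.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent pairs and how often each word occurs as left context."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])  # the last token never starts a bigram
    return bigrams, contexts

def joint_prob(w1, w2, bigrams):
    """p(w1, w2): share of the bigram (w1, w2) among all bigrams."""
    total = sum(bigrams.values())
    return bigrams[(w1, w2)] / total if total else 0.0

def cond_prob(w2, w1, bigrams, contexts):
    """p(w2 | w1) = count(w1, w2) / count(w1 as left context)."""
    return bigrams[(w1, w2)] / contexts[w1] if contexts[w1] else 0.0

# Toy input (assumption); replace with the tokens of the real corpus.
tokens = "the cat sat on the mat and the dog sat too".split()
bigrams, contexts = bigram_counts(tokens)

top10_bigrams = bigrams.most_common(10)
print(joint_prob("the", "cat", bigrams))           # p(the, cat)
print(cond_prob("cat", "the", bigrams, contexts))  # p(cat | the)
```

Unseen bigrams get probability 0 here, which makes the product $\prod_i p(w_i\mid w_{i-1})$ zero for any sentence containing one.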

homework assignment

corpora

information retrieval – assignment

implement simple IR system
we will be given a document collection and queries
the goal is to process the queries, generate the output, and evaluate the results
steps
- inverted index
- boolean query operators (AND, OR, AND NOT)
- query set
- evaluation (precision, recall)
- submission
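
The first steps might look roughly like this. It is a sketch under assumptions, not the reference solution: the document ids and texts are invented, postings are Python sets rather than sorted lists, and all function names are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def postings(index, term):
    """Posting set of a term; empty set for unknown terms."""
    return index.get(term, set())

# Boolean operators on posting sets.
def op_and(a, b):
    return a & b

def op_or(a, b):
    return a | b

def op_and_not(a, b):
    return a - b

def precision_recall(retrieved, relevant):
    """Precision = hits / |retrieved|, recall = hits / |relevant|."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented toy collection (assumption).
docs = {
    "d1": "cheap flights to prague",
    "d2": "prague castle tours",
    "d3": "cheap hotels",
}
index = build_index(docs)

# (cheap OR prague) AND NOT tours
result = op_and_not(
    op_or(postings(index, "cheap"), postings(index, "prague")),
    postings(index, "tours"),
)
print(sorted(result))
print(precision_recall(result, relevant={"d1"}))
```

Sets keep the sketch short; with sorted posting lists the same operators would be implemented as linear merges.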
the test data should not be distributed
the topics are quite rich – we need only the num and the query
- the description and the narrative were used for human evaluation of the document relevance
there are parallel queries in both languages, we should use them for the respective documents
we index only the textual content of the documents
the files are in SGML, so an XML parser might not work; we can use regex instead
the minimum is to do lowercasing and punctuation removal
- we can also do lemmatisation, stemming, etc., but it is not necessary
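
A sketch of regex-based extraction plus the minimal normalization. The tag names `DOC`, `DOCNO` and `TEXT` are assumptions for illustration only; the real collection may use different tags, so the patterns would need adjusting.

```python
import re

# Hypothetical tag names; adjust to the actual collection.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

def normalize(text):
    """Lowercase, replace punctuation by spaces, split on whitespace."""
    text = text.lower()
    # [^\w\s] matches anything that is neither a word character nor
    # whitespace; note that underscores count as word characters.
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()

def parse_docs(sgml):
    """Return {doc id: list of normalized tokens from the textual content}."""
    docs = {}
    for body in DOC_RE.findall(sgml):
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # skip documents without an id
        text = " ".join(TEXT_RE.findall(body))
        docs[docno.group(1).strip()] = normalize(TAG_RE.sub(" ", text))
    return docs

sample = (
    "<DOC>\n<DOCNO> D-001 </DOCNO>\n<TITLE>Ignored</TITLE>\n"
    "<TEXT>\n<P>Hello, World!</P>\n</TEXT>\n</DOC>"
)
print(parse_docs(sample))
```

Splitting "it's" into "it" and "s" is a side effect of this simple punctuation rule; a proper tokenizer would handle such cases better.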
details tomorrow on the website

Levenshtein edit distance

basic operations: insert, delete, replace
Damerau-Levenshtein: transposition as a fourth possible operation
weighted edit distance – weights correspond to the distance on the keyboard (if we want to detect mistyping/misspelling)
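
A minimal dynamic-programming sketch covering the three variants. With `transpositions=True` it computes the restricted Damerau-Levenshtein distance (optimal string alignment), not the unrestricted one. The `keyboard_cost` function and its set of neighbouring keys are invented for illustration; only the substitution cost is weighted here.

```python
def edit_distance(a, b, transpositions=False, sub_cost=lambda x, y: 1):
    """Edit distance between strings a and b.

    Insert and delete cost 1; replacing x by y costs sub_cost(x, y).
    With transpositions=True, swapping two adjacent characters costs 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            cost = 0 if same else sub_cost(a[i - 1], b[j - 1])
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or match)
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[rows - 1][cols - 1]

# Hypothetical keyboard weighting: neighbouring keys are cheaper to confuse.
NEIGHBOURS = {("a", "s"), ("s", "a")}

def keyboard_cost(x, y):
    return 0.5 if (x, y) in NEIGHBOURS else 1

print(edit_distance("kitten", "sitting"))
print(edit_distance("ab", "ba"), edit_distance("ab", "ba", transpositions=True))
print(edit_distance("cat", "cst", sub_cost=keyboard_cost))
```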