Labs
- for the sake of our homework … $p(w_1,w_2,\dots,w_n)\approx\prod_i p(w_i)$ (unigram)
- or $\approx\prod_i p(w_i\mid w_{i-1})$ (bigram)
- n-gram language models
- task
- create text corpus
- popular corpora
- Brown, CNK, PDT
- Common Crawl, Internet Archive, arXiv
- RedPajama
- what we can use
- books (Gutenberg)
- Wikipedia, W2C
- we should store the data as plaintext in UTF-8, one file per language; at least 2 languages (one of them English); the size should be at least 500k tokens (words)
- unigram statistics
- guess the 10 most frequent words
- compute the 10 most frequent words
- show the least frequent words
- show the unigram probabilities of all words
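The unigram statistics can be sketched as follows. This is only a minimal illustration, assuming the corpus is already tokenized into a list of strings; the function name `unigram_stats` and the toy sentence are made up for the example, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter


def unigram_stats(tokens):
    """Return (counts, probabilities) for a list of tokens.

    Probabilities are plain relative frequencies (maximum likelihood).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = {word: count / total for word, count in counts.items()}
    return counts, probs


# Toy data; whitespace splitting is a stand-in for a real tokenizer.
tokens = "the cat sat on the mat and the dog sat too".split()
counts, probs = unigram_stats(tokens)

# Most frequent words.
print(counts.most_common(3))
# Least frequent words (ascending by count).
print(sorted(counts.items(), key=lambda item: item[1])[:3])
# Unigram probability of every word.
for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{word}\t{p:.3f}")
```

On a real corpus the same function would be called on the token list read from the per-language file.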
- tokenize data
- Sacremoses
- we may use truecase
- bigram stats
- guess the top 10 most frequent bigrams
- function for joint & conditional probability
- for most frequent unigrams (or some words)
- show top 10 bigrams
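A possible shape for the joint and conditional probability functions is sketched below. It assumes unsmoothed relative-frequency estimates, and the names `bigram_model`, `joint` and `conditional` are hypothetical. The conditional probability divides by the number of times the first word occurs as a left context, so the last token of the corpus is not counted there.

```python
from collections import Counter


def bigram_model(tokens):
    """Build joint and conditional bigram estimators from a token list."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])  # words that have a successor
    n_bigrams = sum(bigrams.values())

    def joint(w1, w2):
        # p(w1, w2): share of this pair among all adjacent pairs
        return bigrams[(w1, w2)] / n_bigrams if n_bigrams else 0.0

    def conditional(w2, w1):
        # p(w2 | w1) = c(w1, w2) / c(w1 as left context)
        return bigrams[(w1, w2)] / contexts[w1] if contexts[w1] else 0.0

    return bigrams, joint, conditional


tokens = "the cat sat on the mat and the dog sat too".split()
bigrams, joint, conditional = bigram_model(tokens)

print(bigrams.most_common(10))   # top bigrams
print(joint("the", "cat"))       # p(the, cat)
print(conditional("cat", "the")) # p(cat | the)
```

For the most frequent unigrams, `conditional` can be evaluated over every observed successor to see which continuations dominate.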
- homework assignment
- estimate character 3-gram (character triple) probabilities from data
- get test data, compute its probability
- calculate cross-entropy of the test data
- calculate probability that test data belong to the language
- deadline in approximately 4 weeks
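One way the homework steps might fit together is sketched below. Several things here are assumptions and not part of the assignment text: add-one smoothing (with one extra slot reserved for characters unseen in training), a uniform prior over languages when turning likelihoods into a probability that the text belongs to a language, and all function names and toy training strings.

```python
import math
from collections import Counter


def train_char_trigrams(text):
    """Count character trigrams and their two-character contexts."""
    n = len(text) - 2
    trigrams = Counter(text[i:i + 3] for i in range(n))
    contexts = Counter(text[i:i + 2] for i in range(n))
    return trigrams, contexts, set(text)


def trigram_prob(model, context, ch):
    """p(ch | context) with add-one smoothing (an assumption)."""
    trigrams, contexts, vocab = model
    v = len(vocab) + 1  # +1 reserves mass for unseen characters
    return (trigrams[context + ch] + 1) / (contexts[context] + v)


def cross_entropy(model, text):
    """Average negative log2 probability per predicted character."""
    n = len(text) - 2
    if n <= 0:
        raise ValueError("text must have at least 3 characters")
    log_prob = sum(
        math.log2(trigram_prob(model, text[i:i + 2], text[i + 2]))
        for i in range(n)
    )
    return -log_prob / n


def language_posterior(models, text):
    """P(language | text), assuming a uniform prior over languages."""
    n = len(text) - 2
    log_probs = {lang: -cross_entropy(m, text) * n for lang, m in models.items()}
    top = max(log_probs.values())  # subtract the max to avoid underflow
    weights = {lang: 2.0 ** (lp - top) for lang, lp in log_probs.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Toy training texts; real models would be trained on the 500k-token corpora.
models = {
    "en": train_char_trigrams("the quick brown fox jumps over the lazy dog and the cat "),
    "cs": train_char_trigrams("příliš žluťoučký kůň úpěl ďábelské ódy a pes spal "),
}
test_text = "the dog and the fox"
for lang, model in models.items():
    print(lang, round(cross_entropy(model, test_text), 3))
print(language_posterior(models, test_text))
```

The probability of the test data under one model is `2 ** (-cross_entropy * n)`, where `n` is the number of predicted characters; working in log space avoids underflow on longer texts.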
- corpora
- TEITOK Universal Dependencies
- querying in CQL
- information retrieval – assignment
- implement simple IR system
- we will be given a document collection and queries
- the goal is to process the queries, generate the output, and evaluate the results
- steps
- inverted index
- boolean query operators (AND, OR, AND NOT)
- query set
- evaluation (precision, recall)
- submission
- the test data should not be distributed
- the topics are quite rich – we need only the num and the query
- the description and the narrative were used for human evaluation of document relevance
- there are parallel queries in both languages, we should use them for the respective documents
- we index only the textual content of the documents
- the files are in SGML, so an XML parser might not work; we can use regular expressions instead
- the minimum is to do lowercasing and punctuation removal
- we can also do lemmatisation, stemming, etc., but it is not necessary
- details tomorrow on the website
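The core of such an IR system (inverted index, boolean operators, precision and recall) might look like the sketch below. The document ids, texts and relevance judgements are invented, and the input is assumed to be already extracted plain text; parsing the SGML files and the topic files is left out. Normalization is the stated minimum: lowercasing and punctuation removal.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return dict(index)


def postings(index, term):
    return index.get(term.lower(), set())


def op_and(index, a, b):
    return postings(index, a) & postings(index, b)


def op_or(index, a, b):
    return postings(index, a) | postings(index, b)


def op_and_not(index, a, b):
    return postings(index, a) - postings(index, b)


def precision_recall(retrieved, relevant):
    """Set-based precision and recall; 0.0 when a denominator is empty."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy collection.
docs = {
    "d1": "Prague is the capital of Czechia.",
    "d2": "Vienna is the capital of Austria.",
    "d3": "Prague has many bridges.",
}
index = build_index(docs)
print(op_and(index, "capital", "prague"))
print(op_or(index, "prague", "vienna"))
print(op_and_not(index, "capital", "prague"))
print(precision_recall({"d1", "d2"}, {"d1", "d3"}))
```

Sets keep the boolean operators to one line each; sorted posting lists with a merge step would be the more memory-conscious alternative for a larger collection.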
- Levenshtein edit distance
- basic operations: insert, delete, replace
- Damerau-Levenshtein: transposition as a fourth possible operation
- weighted edit distance – weights correspond to the distance on the keyboard (if we want to detect mistyping/misspelling)
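The edit distances above can be computed with the usual dynamic-programming table. The sketch below uses unit costs for all operations; with `transpositions=True` it computes the restricted variant of Damerau-Levenshtein (often called optimal string alignment), which allows swapping two adjacent characters but never edits a substring twice. The function name and parameter are chosen for this example only.

```python
def edit_distance(a, b, transpositions=False):
    """Levenshtein distance; optionally allow adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the prefixes a[:i] and b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or match)
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[-1][-1]


print(edit_distance("kitten", "sitting"))
print(edit_distance("form", "from"))
print(edit_distance("form", "from", transpositions=True))
```

A weighted variant would replace the constant costs with a function of the characters involved, for example a replace cost that grows with the distance between the two keys on the keyboard.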