
Exam

1. Introduction to Machine Learning

Easy: Explain how reinforcement learning differs from supervised and unsupervised learning in terms of the type of input the learning algorithms use to improve model performance.

1. Introduction to Machine Learning

Medium: Explain why we need separate training and test data. What is generalization, and how does the concept relate to underfitting and overfitting?

1. Introduction to Machine Learning

Medium: Define the prediction function of a linear regression model and write down the $L^2$-regularized mean squared error loss.
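
A compact sketch of one common formulation (weight vector $\boldsymbol w$, bias $b$, regularization strength $\lambda$; the exact averaging constants may differ from the lectures):

$y(\boldsymbol x; \boldsymbol w, b) = \boldsymbol x^T \boldsymbol w + b, \qquad L = \frac{1}{N}\sum_{i=1}^N \big(y(\boldsymbol x_i) - t_i\big)^2 + \frac{\lambda}{2}\lVert \boldsymbol w \rVert^2$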

1. Introduction to Machine Learning

Medium: Starting from the unregularized sum of squares error of a linear regression model, show how the explicit solution can be obtained, assuming $\boldsymbol X^T \boldsymbol X$ is invertible.
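
A minimal numpy sketch of the resulting closed-form (normal-equations) solution; the function name and the assumption that a bias column is already appended to X are illustrative:

    import numpy as np

    def linear_regression_fit(X, t):
        """Closed-form least-squares solution w = (X^T X)^{-1} X^T t.

        Assumes X already contains a bias column and X^T X is invertible;
        np.linalg.solve is used instead of an explicit inverse for stability.
        """
        return np.linalg.solve(X.T @ X, X.T @ t)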

2. Linear Regression, SGD

Medium: Describe standard gradient descent and compare it to stochastic (i.e., online) gradient descent and minibatch stochastic gradient descent. Explain what it is used for in machine learning.

2. Linear Regression, SGD

Easy: Explain possible intuitions behind $L^2$ regularization.

2. Linear Regression, SGD

Easy: Explain the difference between hyperparameters and parameters.

2. Linear Regression, SGD

Medium: Write an $L^2$-regularized minibatch SGD algorithm for training a linear regression model, including the explicit formulas (i.e., formulas you would need to code it with numpy) of the loss function and its gradient.
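
A minimal numpy sketch under illustrative hyperparameter values (batch size, learning rate, and regularization strength are placeholders, and bias handling is simplified):

    import numpy as np

    def sgd_linear_regression(X, t, batch_size=16, epochs=100, lr=0.01, l2=0.1, seed=42):
        """Minibatch SGD for L2-regularized linear regression (illustrative sketch).

        Loss on a batch B: 1/|B| * sum_i 1/2 (x_i^T w - t_i)^2 + l2/2 * ||w||^2
        Gradient:          1/|B| * X_B^T (X_B w - t_B) + l2 * w
        """
        generator = np.random.RandomState(seed)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            permutation = generator.permutation(len(X))
            for start in range(0, len(X), batch_size):
                batch = permutation[start:start + batch_size]
                predictions = X[batch] @ w
                gradient = X[batch].T @ (predictions - t[batch]) / len(batch) + l2 * w
                w -= lr * gradient
        return w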

2. Linear Regression, SGD

Medium: Does the SGD algorithm for linear regression always find the best solution on the training data? If yes, explain under what conditions it happens; if not, explain why it is not guaranteed to converge. What properties of the error function does this depend on?

2. Linear Regression, SGD

Medium: After training a model with SGD, you ended up with a low training error and a high test error. Using the learning curves, explain what might have happened and what steps you might take to prevent this from happening.

2. Linear Regression, SGD

Medium: You were given a fixed training set and a fixed test set, and you are supposed to report model performance on that test set. You need to decide what hyperparameters to use. How will you proceed and why?

2. Linear Regression, SGD

Easy: What methods can be used to normalize feature values? Explain why it is useful.

3. Perceptron, Logistic Regression

Medium: Define binary classification, write down the perceptron algorithm, and show how a prediction is made for a given data instance $\boldsymbol x$.
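
A minimal numpy sketch of the perceptron algorithm, assuming targets in {-1, +1} and a bias feature already appended to the inputs:

    import numpy as np

    def perceptron(X, t, max_epochs=100):
        """Perceptron for binary classification with targets t in {-1, +1}.

        Prediction for an instance x is sign(x^T w); weights are updated only
        on misclassified examples and the loop stops once all examples are
        classified correctly (guaranteed if the data are linearly separable).
        """
        w = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            updated = False
            for x_i, t_i in zip(X, t):
                if t_i * (x_i @ w) <= 0:   # misclassified (or on the boundary)
                    w += t_i * x_i         # move the boundary towards the correct side
                    updated = True
            if not updated:
                break
        return w

    # prediction for new data: np.sign(X_new @ w)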

3. Perceptron, Logistic Regression

Hard: For discrete random variables, define entropy, cross-entropy, and Kullback-Leibler divergence, and prove the Gibbs inequality (i.e., that KL divergence is non-negative).

3. Perceptron, Logistic Regression

Easy: Explain the notion of likelihood in maximum likelihood estimation. What likelihood are we estimating in machine learning, and why do we do it?

3. Perceptron, Logistic Regression

Hard: Describe maximum likelihood estimation as minimizing NLL, cross-entropy, and KL divergence and explain whether they differ or are the same and why.

3. Perceptron, Logistic Regression

Easy: Provide an intuitive justification for why cross-entropy is a good optimization objective in machine learning. What distributions do we compare in cross-entropy? Why is it good when the cross-entropy is low?

3. Perceptron, Logistic Regression

Medium: Considering the binary logistic regression model, write down its parameters (including their size) and explain how we decide what classes the input data belong to (including the explicit formula for the sigmoid function).

3. Perceptron, Logistic Regression

Hard: Write down an $L^2$-regularized minibatch SGD algorithm for training a binary logistic regression model, including the explicit formulas (i.e., formulas you would need to code it in numpy) of the loss function and its gradient (saying just $\nabla$ is not enough).
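
A minimal numpy sketch, analogous to the linear-regression version above; targets are assumed to be in {0, 1} and the hyperparameter values are placeholders:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def sgd_logistic_regression(X, t, batch_size=16, epochs=100, lr=0.01, l2=0.1, seed=42):
        """Minibatch SGD for L2-regularized binary logistic regression, targets in {0, 1}.

        Loss on a batch B: 1/|B| * sum_i -log p(t_i | x_i) + l2/2 * ||w||^2
        Gradient:          1/|B| * X_B^T (sigmoid(X_B w) - t_B) + l2 * w
        """
        generator = np.random.RandomState(seed)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            permutation = generator.permutation(len(X))
            for start in range(0, len(X), batch_size):
                batch = permutation[start:start + batch_size]
                gradient = X[batch].T @ (sigmoid(X[batch] @ w) - t[batch]) / len(batch) + l2 * w
                w -= lr * gradient
        return w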

4. Multiclass Logistic Regression, Multilayer Perceptron

Medium: Define mean squared error and show how it can be derived using MLE. What assumptions do we make during such derivation?

4. Multiclass Logistic Regression, Multilayer Perceptron

Medium: Considering a $K$-class logistic regression model, write down its parameters (including their size) and explain how we decide what classes the input data belong to (including the formula for the softmax function).

4. Multiclass Logistic Regression, Multilayer Perceptron

Easy: Explain the relationship between the sigmoid function and softmax.

$\sigma(x)=\text{softmax}([x,0])_0=\frac{e^x}{e^x+e^0}=\frac{1}{1+e^{-x}}$

4. Multiclass Logistic Regression, Multilayer Perceptron

Easy: Show that the softmax function is invariant to a constant shift.

$\text{softmax}(z+c)_i=\frac{e^{z_i+c}}{\sum_j e^{z_j+c}}=\frac{e^{z_i}}{\sum_j e^{z_j}}\cdot\frac{e^c}{e^c}=\text{softmax}(z)_i$

4. Multiclass Logistic Regression, Multilayer Perceptron

Hard: Write down an $L^2$-regularized minibatch SGD algorithm for training a $K$-class logistic regression model, including the explicit formulas (i.e., formulas you would use to code it in numpy) of the loss function and its gradient.
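
A minimal numpy sketch of the $K$-class version; the weight matrix W has shape (D, K), targets are class indices, and the bias is omitted for brevity:

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)   # constant-shift invariance, for stability
        exp_z = np.exp(z)
        return exp_z / exp_z.sum(axis=-1, keepdims=True)

    def sgd_softmax_regression(X, t, K, batch_size=16, epochs=100, lr=0.01, l2=0.1, seed=42):
        """Minibatch SGD for L2-regularized K-class logistic regression (sketch).

        Gradient of the NLL on a batch B:
            1/|B| * X_B^T (softmax(X_B W) - onehot(t_B)) + l2 * W
        """
        generator = np.random.RandomState(seed)
        W = np.zeros((X.shape[1], K))
        for _ in range(epochs):
            permutation = generator.permutation(len(X))
            for start in range(0, len(X), batch_size):
                batch = permutation[start:start + batch_size]
                probs = softmax(X[batch] @ W)
                probs[np.arange(len(batch)), t[batch]] -= 1   # subtract one-hot targets
                W -= lr * (X[batch].T @ probs / len(batch) + l2 * W)
        return W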

4. Multiclass Logistic Regression, Multilayer Perceptron

Medium: Prove that decision regions of a multiclass logistic regression are convex.

4. Multiclass Logistic Regression, Multilayer Perceptron

Medium: Considering a single-layer MLP with $D$ input neurons, $H$ hidden neurons, $K$ output neurons, hidden activation $f$, and output activation $a$, list its parameters (including their size) and write down how the output is computed.

4. Multiclass Logistic Regression, Multilayer Perceptron

Medium: List the definitions of frequently used MLP output layer activations (the ones producing parameters of a Bernoulli distribution and a categorical distribution). Then, write down three commonly used hidden layer activations (sigmoid, tanh, ReLU). Explain why identity is not a suitable activation for hidden layers.

5. MLP, Softmax as MaxEnt classifier, F1 score

Hard: Considering a single-layer MLP with $D$ input neurons, a ReLU hidden layer with $H$ units, and a softmax output layer with $K$ units, write down the explicit formulas (i.e., formulas you would use to code it in numpy) of the gradient of all the MLP parameters (two weight matrices and two bias vectors), assuming input $\boldsymbol x$, target $t$, and negative log likelihood loss.
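
A minimal numpy sketch of the forward and backward pass for a single example; shapes and variable names are illustrative:

    import numpy as np

    def mlp_gradients(x, t, W1, b1, W2, b2):
        """Gradients of the NLL loss for one example (x, t), for an MLP with a ReLU
        hidden layer and a softmax output. Shapes: W1 (D, H), b1 (H,), W2 (H, K), b2 (K,).
        """
        # forward pass
        hidden_pre = x @ W1 + b1
        hidden = np.maximum(hidden_pre, 0)            # ReLU
        logits = hidden @ W2 + b2
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                          # softmax

        # backward pass
        d_logits = probs.copy()
        d_logits[t] -= 1                              # softmax + NLL gives p - onehot(t)
        grad_W2 = np.outer(hidden, d_logits)
        grad_b2 = d_logits
        d_hidden = W2 @ d_logits
        d_hidden_pre = d_hidden * (hidden_pre > 0)    # ReLU derivative
        grad_W1 = np.outer(x, d_hidden_pre)
        grad_b1 = d_hidden_pre
        return grad_W1, grad_b1, grad_W2, grad_b2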

5. MLP, Softmax as MaxEnt classifier, F1 score

Medium: Formulate the computation of MLP as a computation graph. Explain how such a graph can be used to compute the gradients of the parameters in the back-propagation algorithm.

5. MLP, Softmax as MaxEnt classifier, F1 score

Medium: Formulate the Universal approximation theorem and explain in words what it says about the multilayer perceptron.

5. MLP, Softmax as MaxEnt classifier, F1 score

Medium: How do we search for a minimum of a function $f(\boldsymbol x): \mathbb{R}^D \rightarrow \mathbb{R}$ subject to equality constraints $g_1(\boldsymbol x)=0, \ldots, g_m(\boldsymbol x)=0$?

5. MLP, Softmax as MaxEnt classifier, F1 score

Medium: Prove which categorical distribution with $N$ classes has maximum entropy.

5. MLP, Softmax as MaxEnt classifier, F1 score

Hard: Consider the derivation of softmax using the maximum entropy principle, assuming we have a dataset of $N$ examples $(x_i, t_i)$, $x_i \in \mathbb{R}^D$, $t_i \in \{1, 2, \ldots, K\}$. Formulate the three conditions we impose on the searched $\pi: \mathbb{R}^D \rightarrow \mathbb{R}^K$, and write down the Lagrangian to be minimized. Explain in words the interpretation of the conditions.

5. MLP, Softmax as MaxEnt classifier, F1 score

Medium: Define precision (including true positives, false positives, and related counts), recall, the $F_1$ score, and the $F_\beta$ score (we stated several formulations of the $F_1$ and $F_\beta$ scores; any one of them will do).
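
A tiny sketch computing the metrics from the confusion-matrix counts (one of the equivalent formulations; $F_1$ is the beta=1 special case):

    def precision_recall_f(tp, fp, fn, beta=1.0):
        """Precision, recall, and F_beta from counts of true positives,
        false positives, and false negatives."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        return precision, recall, f_beta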

5. MLP, Softmax as MaxEnt classifier, F1 score

Medium: Explain the difference between micro-averaged and macro-averaged $F_1$ scores. List a few examples of when you would use them.

5. MLP, Softmax as MaxEnt classifier, F1 score

Easy: Explain (using examples) why accuracy is not a suitable metric for unbalanced target classes, e.g., for a diagnostic test for a contagious disease.

6. Representing Text (TF-IDF, Word2Vec)

Easy: Explain how the TF-IDF weight of a given document-term pair is computed.
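
A sketch of one common TF-IDF variant (raw term frequency times logarithmic inverse document frequency); exact weighting schemes vary between sources:

    import math
    from collections import Counter

    def tf_idf(documents):
        """TF-IDF weights for a list of tokenized documents (one common variant).

        tf(term, doc) = count of term in doc
        idf(term)     = log(N / df(term)), df = number of documents containing term
        """
        n_docs = len(documents)
        document_frequency = Counter()
        for doc in documents:
            document_frequency.update(set(doc))

        weights = []
        for doc in documents:
            term_frequency = Counter(doc)
            weights.append({
                term: count * math.log(n_docs / document_frequency[term])
                for term, count in term_frequency.items()
            })
        return weights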

6. Representing Text (TF-IDF, Word2Vec)

Easy: What is Zipf's law? Explain how it can be used to provide intuitive justification for using the logarithm when computing IDF.

6. Representing Text (TF-IDF, Word2Vec)

Medium: Define conditional entropy and mutual information, write down the relation between them, and finally prove that mutual information is zero if and only if the two random variables are independent (you do not need to prove statements about $D_\textrm{KL}$).

6. Representing Text (TF-IDF, Word2Vec)

Medium: Show that TF-IDF terms can be considered portions of suitable mutual information.

6. Representing Text (TF-IDF, Word2Vec)

Easy: Explain the concept of word embedding in the context of MLP and how it relates to representation learning.

6. Representing Text (TF-IDF, Word2Vec)

Medium: Describe the skip-gram model trained using negative sampling. What is it used for? What are the input and output of the algorithm?

6. Representing Text (TF-IDF, Word2Vec)

Easy: How would you train a part-of-speech tagger (i.e., you want to assign each word to its part of speech) if you could only use pre-trained word embeddings and an MLP classifier?

7. K Nearest Neighbors, Naive Bayes

Medium: Describe how prediction is performed by $k$-nearest neighbors, both for regression and classification. Define the $L_p$ norm and describe uniform, inverse, and softmax weighting.
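
A minimal numpy sketch of prediction for a single query point; the weighting options and the classification/regression split are simplified for illustration:

    import numpy as np

    def knn_predict(X_train, t_train, x, k=5, p=2, weighting="uniform", n_classes=None):
        """k-nearest-neighbors prediction for one query x (illustrative sketch).

        Distances use the L_p norm; weights are uniform, inverse-distance, or a
        softmax of negative distances. Classification returns a weighted vote,
        regression a weighted mean of the neighbors' targets.
        """
        distances = np.linalg.norm(X_train - x, ord=p, axis=1)
        neighbors = np.argsort(distances)[:k]

        if weighting == "uniform":
            weights = np.ones(k)
        elif weighting == "inverse":
            weights = 1 / distances[neighbors]
        else:                                   # softmax of negative distances
            scores = -distances[neighbors]
            weights = np.exp(scores - scores.max())
        weights /= weights.sum()

        if n_classes is not None:               # classification: weighted vote
            votes = np.zeros(n_classes)
            np.add.at(votes, t_train[neighbors], weights)
            return votes.argmax()
        return weights @ t_train[neighbors]     # regression: weighted mean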

7. K Nearest Neighbors, Naive Bayes

Medium: Show that $L^2$-regularization can be obtained from a suitable prior by Bayesian inference (from the MAP estimate).

7. K Nearest Neighbors, Naive Bayes

Medium: Write down how $p(C_k | \boldsymbol x)$ is approximated in a Naive Bayes classifier, explicitly state the Naive Bayes assumption, and show how the prediction is performed.

7. K Nearest Neighbors, Naive Bayes

Medium: Considering a Gaussian Naive Bayes, describe how probabilities $p(x_d | C_k)$ are modeled (what distribution and which parameters it has) and how we estimate it during fitting.

7. K Nearest Neighbors, Naive Bayes

Medium: Considering a Bernoulli Naive Bayes, describe how probabilities $p(x_d | C_k)$ are modeled (what distribution and which parameters it has) and how we estimate it during fitting.

7. K Nearest Neighbors, Naive Bayes

Medium: What measures can we take to prevent numeric instabilities in the Naive Bayes classifier, particularly if the probability density is too high in Gaussian Naive Bayes and there are zero probabilities in Bernoulli Naive Bayes?

7. K Nearest Neighbors, Naive Bayes

Easy: What is the difference between discriminative and (classical) generative models?

8. Correlation, Model Combination

Medium: Prove that independent discrete random variables are uncorrelated.

8. Correlation, Model Combination

Medium: Write down the definition of covariance and Pearson correlation coefficient $\rho$, including its range.

8. Correlation, Model Combination

Medium: Explain how Spearman's rank correlation coefficient and Kendall's rank correlation coefficient are computed (there is no need to describe the Pearson correlation coefficient).

8. Correlation, Model Combination

Easy: Describe setups where a correlation coefficient might be a good evaluation metric.

8. Correlation, Model Combination

Easy: Describe under what circumstance correlation can be used to assess the validity of evaluation metrics.

8. Correlation, Model Combination

Medium: Define Cohen's $\kappa$ and explain what it is used for when preparing data for machine learning.

8. Correlation, Model Combination

Easy: Assume you have collected data for classification by letting people annotate data instances. How do you estimate a reasonable range for classifier performance?

8. Correlation, Model Combination

Hard: Considering an averaging ensemble of $M$ models, prove the relation between the average mean squared error of the ensemble and the average error of the individual models, assuming the model errors have zero means and are uncorrelated. Use a formula to explain what uncorrelated errors mean in this context.

8. Correlation, Model Combination

Medium: Explain knowledge distillation: what it is used for, describe how it is done. What is the loss function? How does it differ from standard training?

9. Decision Trees, Random Forests

Medium: In a regression decision tree, state what values are kept in internal nodes, define the squared error criterion, and describe how a leaf is split during training (without discussing splitting constraints).

9. Decision Trees, Random Forests

Medium: Explain the CART algorithm for constructing a decision tree. Explain the relationship between the loss function that is optimized during the decision tree construction and the splitting criterion that is optimized during node splitting.

9. Decision Trees, Random Forests

Medium: In a $K$-class classification decision tree, state what values are kept in internal nodes, define the Gini index, and describe how a node is split during training (without discussing splitting constraints).

9. Decision Trees, Random Forests

Medium: In a $K$-class classification decision tree, state what values are kept in internal nodes, define the entropy criterion, and describe how a node is split during training (without discussing splitting constraints).

9. Decision Trees, Random Forests

Hard: For binary classification using decision trees, derive the Gini index from a squared error loss.

9. Decision Trees, Random Forests

Hard: For $K$-class classification using decision trees, derive the entropy criterion from a non-averaged NLL loss.

9. Decision Trees, Random Forests

Medium: Describe how a random forest is trained (including bagging and a random subset of features) and how prediction is performed for regression and classification.

10. Gradient Boosted Decision Trees

Easy: Explain the main differences between random forests and gradient-boosted decision trees.

10. Gradient Boosted Decision Trees

Medium: Explain the intuition for second-order optimization using Newton's root-finding method or Taylor expansions.

10. Gradient Boosted Decision Trees

Hard: Write down the loss function that we optimize in gradient-boosted decision trees while constructing the $t^\mathrm{th}$ tree. Then, define $g_i$ and $h_i$ and show the value $w_\mathcal{T}$ of the optimal prediction in node $\mathcal{T}$ and the criterion used during node splitting.
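
A hedged sketch of the usual second-order formulation (one common variant; exact constants and regularization terms may differ from the lectures). At step $t$ we approximately minimize

$L^{(t)} \approx \sum_i \big[g_i\, f_t(\boldsymbol x_i) + \tfrac{1}{2} h_i\, f_t(\boldsymbol x_i)^2\big] + \tfrac{\lambda}{2} \sum_{\mathcal T} w_{\mathcal T}^2,$

where $g_i$ and $h_i$ are the first and second derivatives of the per-example loss with respect to the previous ensemble prediction $\hat y_i^{(t-1)}$. The optimal leaf value and per-node criterion are then

$w_{\mathcal T} = -\frac{\sum_{i \in \mathcal T} g_i}{\lambda + \sum_{i \in \mathcal T} h_i}, \qquad c_{\mathcal T} = -\tfrac{1}{2}\,\frac{\big(\sum_{i \in \mathcal T} g_i\big)^2}{\lambda + \sum_{i \in \mathcal T} h_i},$

and a split is chosen to minimize $c_{\mathcal T_\text{left}} + c_{\mathcal T_\text{right}}$ (i.e., to maximize the decrease relative to the parent's $c_{\mathcal T}$).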

10. Gradient Boosted Decision Trees

Medium: For $K$-class classification, describe how to perform prediction with a gradient boosted decision tree trained for $T$ time steps (how the individual trees perform prediction and how the $K \cdot T$ trees are combined to produce the predicted categorical distribution).

10. Gradient Boosted Decision Trees

Easy: What type of data are gradient boosted decision trees suitable for, as opposed to a multilayer perceptron? Explain the intuition for why this is the case.

11. SVD, PCA, k-means

Medium: Formulate the SVD decomposition of a matrix $\boldsymbol X$ and describe the properties of the individual parts of the decomposition. Explain what the reduced version of SVD is.

11. SVD, PCA, k-means

Medium: Formulate the Eckart-Young theorem. Provide an interpretation of what the theorem says and why it is useful.

11. SVD, PCA, k-means

Medium: Explain how to compute the PCA of dimension $M$ using the SVD decomposition of a data matrix $\boldsymbol X$, and why it works.
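
A minimal numpy sketch, assuming the rows of X are examples; the principal components are the top right singular vectors of the centered data matrix:

    import numpy as np

    def pca_svd(X, M):
        """PCA of dimension M via SVD of the centered data matrix (sketch).

        Rows of Vt are the right singular vectors = principal components,
        ordered by decreasing singular value (explained variance).
        """
        X_centered = X - X.mean(axis=0)
        U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
        components = Vt[:M]                       # top M principal directions
        return X_centered @ components.T, components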

11. SVD, PCA, k-means

Hard: Given a data matrix $\boldsymbol X$, write down the algorithm for computing the PCA of dimension $M$ using the power iteration algorithm.

11. SVD, PCA, k-means

Easy: List at least two applications of SVD or PCA.

11. SVD, PCA, k-means

Hard: Describe the $K$-means algorithm, including the kmeans++ initialization. What is it used for? What is the loss function that the algorithm optimizes? What can you say about the algorithm's convergence?
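
A minimal numpy sketch of K-means with kmeans++ initialization; the fixed iteration count and the handling of empty clusters are simplifications:

    import numpy as np

    def kmeans(X, K, iterations=100, seed=42):
        """K-means with kmeans++ initialization (illustrative sketch).

        Minimizes the sum of squared distances of points to their assigned
        centers; each step does not increase the loss, so the algorithm
        converges to a local optimum (not necessarily the global one).
        """
        generator = np.random.RandomState(seed)

        # kmeans++ init: sample new centers proportionally to squared distance
        centers = [X[generator.randint(len(X))]]
        for _ in range(K - 1):
            distances = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
            centers.append(X[generator.choice(len(X), p=distances / distances.sum())])
        centers = np.array(centers, dtype=float)

        for _ in range(iterations):
            # assignment step: nearest center for every point
            assignments = np.argmin(
                ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
            # update step: each center becomes the mean of its cluster
            for k in range(K):
                if np.any(assignments == k):
                    centers[k] = X[assignments == k].mean(axis=0)
        return centers, assignments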

11. SVD, PCA, k-means

Medium: Name at least two clustering algorithms. What is their main principle? How do they differ?

12. Statistical Hypothesis Testing, Model Comparison

Medium: Considering statistical hypothesis testing, define type I errors and type II errors (in terms of the null hypothesis). Finally, define what a significance level is.

12. Statistical Hypothesis Testing, Model Comparison

Medium: Explain what a test statistic and a p-value are.

12. Statistical Hypothesis Testing, Model Comparison

Medium: Write down the steps of a statistical hypothesis test, including a definition of a p-value.

  1. Formulate the null hypothesis $H_0$, and possibly an alternative hypothesis $H_1$.
  2. Choose a test statistic (test criterion).
  3. Compute the observed value of the test statistic for the measured data.
  4. Compute the $p$-value, which is the probability that the test statistic attains a value at least as extreme as the observed one, assuming $H_0$ holds.
  5. Reject $H_0$ (in favor of $H_1$) if the $p$-value is smaller than the chosen significance level $\alpha$ (commonly $\alpha=5\,\%$).

12. Statistical Hypothesis Testing, Model Comparison

Medium: Explain the differences between a one-sample test, a two-sample test, and a paired test.

12. Statistical Hypothesis Testing, Model Comparison

Medium: When considering the multiple comparison problem, define the family-wise error rate and prove the Bonferroni correction, which allows limiting the family-wise error rate by a given $\alpha$.

12. Statistical Hypothesis Testing, Model Comparison

Medium: For a trained model and a given test set with $N$ examples and metric $E$, write down how to estimate 95% confidence intervals using bootstrap resampling.
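
A minimal numpy sketch, assuming `metric(predictions, targets)` returns a scalar score; the number of resamples is a placeholder:

    import numpy as np

    def bootstrap_confidence_interval(metric, predictions, targets, resamples=1000, seed=42):
        """95% confidence interval of a metric via bootstrap resampling of the test set.

        The test set of size N is resampled with replacement `resamples` times,
        the metric is recomputed on each resample, and the 2.5% and 97.5%
        percentiles of the resulting scores form the interval.
        """
        generator = np.random.RandomState(seed)
        scores = []
        for _ in range(resamples):
            indices = generator.choice(len(targets), size=len(targets), replace=True)
            scores.append(metric(predictions[indices], targets[indices]))
        return np.percentile(scores, [2.5, 97.5])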

12. Statistical Hypothesis Testing, Model Comparison

Medium: For two trained models and a given test set with $N$ examples and metric $E$, explain how to perform a paired bootstrap test of whether the first model is better than the other.
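
A minimal numpy sketch of one possible formulation: evaluate both models on the same bootstrap resamples and report how often the first model is not better:

    import numpy as np

    def paired_bootstrap_test(metric, predictions_a, predictions_b, targets,
                              resamples=1000, seed=42):
        """Paired bootstrap test that model A is better than model B (sketch).

        Both models are evaluated on identical bootstrap resamples of the test
        set; the returned estimate is the fraction of resamples on which A is
        not better than B (small values support A being better).
        """
        generator = np.random.RandomState(seed)
        not_better = 0
        for _ in range(resamples):
            indices = generator.choice(len(targets), size=len(targets), replace=True)
            score_a = metric(predictions_a[indices], targets[indices])
            score_b = metric(predictions_b[indices], targets[indices])
            if score_a <= score_b:
                not_better += 1
        return not_better / resamples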

12. Statistical Hypothesis Testing, Model Comparison

Medium: For two trained models and a given test set with $N$ examples and metric $E$, explain how to perform a random permutation test of whether the first model is better than the other, with a significance level $\alpha$.
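
A minimal numpy sketch, assuming the metric decomposes into per-example scores (e.g., accuracy), so that swapping the two models' predictions on an example amounts to flipping the sign of their score difference:

    import numpy as np

    def paired_permutation_test(scores_a, scores_b, resamples=10_000, seed=42):
        """Random permutation (sign-flip) test that model A is better than model B.

        scores_a[i] and scores_b[i] are per-example metric values of the two
        models on the same test example i. Under the null hypothesis the models
        are interchangeable, so their scores can be swapped independently per
        example; the p-value is the fraction of permutations whose difference
        is at least as large as the observed one.
        """
        generator = np.random.RandomState(seed)
        differences = np.asarray(scores_a) - np.asarray(scores_b)
        observed = differences.mean()
        count = 0
        for _ in range(resamples):
            signs = generator.choice([-1, 1], size=len(differences))
            if np.mean(signs * differences) >= observed:
                count += 1
        p_value = count / resamples
        return p_value   # reject the null at level alpha if p_value < alpha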

13. Machine Learning Ethics, Final Summary

Medium: Explain the difference between deontological and utilitarian ethics. List examples of how these theoretical frameworks can be applied in machine learning ethics.

13. Machine Learning Ethics, Final Summary

Easy: List at least two potential ethical problems related to data collection.

13. Machine Learning Ethics, Final Summary

Easy: List at least two potential ethical problems that can originate in model evaluation.

13. Machine Learning Ethics, Final Summary

Easy: List at least one example of an ethical problem that can originate in model design or model development.

13. Machine Learning Ethics, Final Summary

Easy: Under what circumstances could train-test mismatch be an ethical problem?

Hooray, you're done! 🎉
If my flashcards helped you, you can buy me a beer.