DL in NLP 2020. Spring. Quiz 3
Neural Networks. Part 2
Some questions may not be mentioned explicitly in the lecture, but you can still use logic and Google.
Credit for some questions:
cs231n.stanford.edu
* Indicates required question
Email *
Github account *
What is the derivative of a sigmoid function?
sigmoid(x) * (1 - sigmoid(x))
sigmoid(x)
x^2 * sigmoid(x)
sigmoid(x)^2 - sigmoid(x)
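One way to test a candidate derivative is a finite-difference comparison. A minimal sketch (not part of the original form) checking the sigmoid(x) * (1 - sigmoid(x)) candidate:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Candidate derivative: sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare against a centered finite difference at a few points
h = 1e-5
for x in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(numeric - sigmoid_grad(x)) < 1e-8
```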
How do we compute gradients in backpropagation algorithm?
They are estimated with finite differences
They are computed symbolically and represented in a closed form
They are computed with the rule of derivative of the composition of functions
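The rule in question is the chain rule: backpropagation multiplies the local derivatives of composed functions. A toy sketch (the loss and values are illustrative) for loss = (sigmoid(w * x) - t)^2, verified against a finite difference:

```python
import math

def forward_backward(w, x, t):
    # Forward pass through the composition
    z = w * x                       # linear step
    s = 1.0 / (1.0 + math.exp(-z))  # sigmoid
    loss = (s - t) ** 2             # squared error
    # Backward pass: chain rule multiplies the local derivatives
    dloss_ds = 2.0 * (s - t)
    ds_dz = s * (1.0 - s)
    dz_dw = x
    dloss_dw = dloss_ds * ds_dz * dz_dw
    return loss, dloss_dw

# Sanity check with a centered finite difference
h = 1e-6
_, grad = forward_backward(0.5, 2.0, 1.0)
lp, _ = forward_backward(0.5 + h, 2.0, 1.0)
lm, _ = forward_backward(0.5 - h, 2.0, 1.0)
assert abs((lp - lm) / (2 * h) - grad) < 1e-6
```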
Default choice of nonlinearity
Sigmoid
Tanh
ReLU
Maxout
ELU
Why?
Fast computation
It is OX-symmetrical
It is OY-symmetrical
It produces more complex functions with fewer layers
It does not saturate in the positive region
It converges faster (in practice)
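The saturation point can be illustrated numerically (a sketch, not from the original form): for large positive inputs the sigmoid gradient collapses toward zero, while the ReLU gradient stays at 1.

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    # Subgradient of max(0, x)
    return 1.0 if x > 0 else 0.0

# Sigmoid saturates: almost no gradient flows back for large |x|.
# ReLU does not saturate in the positive region.
print(sigmoid_grad(10.0))  # ~4.5e-5
print(relu_grad(10.0))     # 1.0
```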
What are the main drawbacks of the ReLU activation function?
It's not symmetric around 0
It's not smooth and cannot be differentiated
It can zero out all the gradients from some point in the training process
y = max(0, x @ W + b), dout is the downstream gradient, @ is matrix multiplication, all other operations are element-wise. What is d(loss)/dW?
W @ max(0, y) * dout
x @ W
x.T @ (dout * (y > 0))
W @ max(0, y) * dout + dy / db
x.T @ (dout * (y > 0)) + dy / db
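Candidate expressions for d(loss)/dW can be gradient-checked numerically. A minimal sketch (shapes and the proxy loss sum(dout * y) are illustrative assumptions) testing x.T @ (dout * (y > 0)):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))      # batch of 4, 3 input features
W = rng.normal(size=(3, 5))
b = rng.normal(size=(5,))
dout = rng.normal(size=(4, 5))   # downstream gradient d(loss)/dy

y = np.maximum(0, x @ W + b)
# Candidate: route dout through the ReLU mask, then through the matmul
dW = x.T @ (dout * (y > 0))

def loss(W_):
    # Proxy loss whose gradient w.r.t. y is exactly dout
    return np.sum(dout * np.maximum(0, x @ W_ + b))

# Centered finite difference on one entry of W
h = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += h
Wm[0, 0] -= h
numeric = (loss(Wp) - loss(Wm)) / (2 * h)
assert abs(numeric - dW[0, 0]) < 1e-4
```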
Gradients with respect to x, y, z, w. Green numbers show the forward pass; the red number is the downstream gradient. Format your answer according to the pattern: x.xx, y.yy, z.zz, w.ww. Example answer: -3.00, 9.60, 18.66, -1.00
What is a good way of weights initialization?
All 0's
Small random numbers
Normal distribution
All = constant > 0
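The problem with constant (e.g. all-zero) initialization is symmetry: every hidden unit computes the same function and receives the same gradient, so the units never differentiate. A small NumPy sketch (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))  # a small batch

# All-zero init: every hidden unit produces the same (zero) activation,
# so every unit would also receive the same gradient.
W_zero = np.zeros((4, 16))
h_zero = np.tanh(x @ W_zero)
assert np.all(h_zero == 0)

# Small random numbers break the symmetry: units compute different things.
W_rand = 0.01 * rng.normal(size=(4, 16))
h_rand = np.tanh(x @ W_rand)
assert not np.allclose(h_rand[:, 0], h_rand[:, 1])
```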
Where is the place of the BatchNorm layer in the FFNN architecture?
Before the activation function
After the activation function
Before the input layer
The BatchNorm layer normalizes data over which axis?
Over instance axis
Over feature axis
Both
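The axis question is easy to pin down in code. In a minimal BatchNorm forward pass (a sketch for a 2-D (batch, features) input; the eps value is the usual numerical-stability assumption), the mean and variance are taken over the instance (batch) axis, yielding one statistic per feature:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Statistics are computed over the batch axis
    # (axis=0), producing one mean/variance per feature.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.normal(size=(64, 10))
out = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
# Each feature column is now approximately zero-mean, unit-variance
assert np.allclose(out.mean(axis=0), 0.0, atol=1e-6)
assert np.allclose(out.std(axis=0), 1.0, atol=1e-2)
```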
What is a better way of searching neural net's hyperparameters?
Grid search
Random search
Gradient descent
Why?
Good combinations of hyperparameters are unlikely
Random search produces more diverse sets of hyperparameter values
Gradient descent allows the algorithm to converge faster
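The diversity argument can be made concrete (a sketch with made-up hyperparameter ranges): with a budget of nine trials, a 3x3 grid tries only three distinct values per hyperparameter, while random search draws nine distinct values per axis.

```python
import random

random.seed(0)
budget = 9

# Grid search: 3 x 3 grid -> only 3 distinct values per hyperparameter
grid_lr = [1e-4, 1e-3, 1e-2]
grid_reg = [1e-5, 1e-4, 1e-3]
grid_trials = [(lr, reg) for lr in grid_lr for reg in grid_reg]

# Random search: each trial draws fresh values (log-uniform here)
# -> 9 distinct values per hyperparameter for the same budget
random_trials = [
    (10 ** random.uniform(-4, -2), 10 ** random.uniform(-5, -3))
    for _ in range(budget)
]

print(len({lr for lr, _ in grid_trials}))    # 3 distinct learning rates
print(len({lr for lr, _ in random_trials}))  # 9 distinct learning rates
```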
Can we use different learning rates at different layers of a neural network?
Yes
No
When should neural network training be stopped?
When train loss becomes constant
When train loss is zero
When validation loss starts to increase
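The "validation loss starts to increase" criterion is usually implemented as early stopping with a patience counter. A minimal sketch (the loss sequence and patience value are illustrative):

```python
def train_with_early_stopping(val_losses, patience=2):
    # Stop once validation loss has not improved for `patience` epochs.
    best, bad_epochs, stop_epoch = float("inf"), 0, len(val_losses)
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                stop_epoch = epoch
                break
    return stop_epoch

# Validation loss bottoms out at epoch 2, then starts increasing:
# with patience=2, training stops at epoch 4.
print(train_with_early_stopping([1.0, 0.7, 0.5, 0.6, 0.8, 1.1]))  # 4
```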
What should be done if train loss is much less than validation loss?
Probably nothing at all: the model has learned everything it can, and the gap may just be noise in the dataset
Increase regularization
Collect more data
Check that your train one-hot encoding is consistent with your validation one-hot encoding
Check for a data leak
Make a hyperparameter search
Reduce model capacity
Check that all labels are in the training data
Try to change learning rate or to schedule it differently
Check your data preprocessing algorithm
Your questions about the lecture (if any)