DL in NLP 2020. Spring. Quiz 3
Neural Networks. Part 2
Some questions may not be mentioned explicitly in the lecture, but you can still use logic and Google.
Credit for some questions:
cs231n.stanford.edu
* Indicates required question
Email *
Github account *
What is the derivative of a sigmoid function?
sigmoid(x) * (1 - sigmoid(x))
sigmoid(x)
x^2 * sigmoid(x)
sigmoid(x)^2 - sigmoid(x)
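One way to test a candidate derivative is a finite-difference comparison. A minimal sketch (not part of the original form) checking the sigmoid(x) * (1 - sigmoid(x)) candidate:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Candidate derivative: sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare against a centered finite difference at a few points
h = 1e-5
for x in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(numeric - sigmoid_grad(x)) < 1e-8
```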
How do we compute gradients in backpropagation algorithm?
They are estimated with finite differences
They are computed symbolically and represented in a closed form
They are computed with the rule of derivative of the composition of functions
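The rule in question is the chain rule: backpropagation multiplies the local derivatives of composed functions. A toy sketch (the loss and values are illustrative) for loss = (sigmoid(w * x) - t)^2, verified against a finite difference:

```python
import math

def forward_backward(w, x, t):
    # Forward pass through the composition
    z = w * x                       # linear step
    s = 1.0 / (1.0 + math.exp(-z))  # sigmoid
    loss = (s - t) ** 2             # squared error
    # Backward pass: chain rule multiplies the local derivatives
    dloss_ds = 2.0 * (s - t)
    ds_dz = s * (1.0 - s)
    dz_dw = x
    dloss_dw = dloss_ds * ds_dz * dz_dw
    return loss, dloss_dw

# Sanity check with a centered finite difference
h = 1e-6
_, grad = forward_backward(0.5, 2.0, 1.0)
lp, _ = forward_backward(0.5 + h, 2.0, 1.0)
lm, _ = forward_backward(0.5 - h, 2.0, 1.0)
assert abs((lp - lm) / (2 * h) - grad) < 1e-6
```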
Default choice of nonlinearity
Sigmoid
Tanh
ReLU
Maxout
ELU
Why?
Fast computation
It is OX-symmetrical
It is OY-symmetrical
It produces more complex functions with fewer layers
It does not saturate in the positive region
It converges faster (in practice)
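The saturation point can be illustrated numerically (a sketch, not from the original form): for large positive inputs the sigmoid gradient collapses toward zero, while the ReLU gradient stays at 1.

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    # Subgradient of max(0, x)
    return 1.0 if x > 0 else 0.0

# Sigmoid saturates: almost no gradient flows back for large |x|.
# ReLU does not saturate in the positive region.
print(sigmoid_grad(10.0))  # ~4.5e-5
print(relu_grad(10.0))     # 1.0
```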
What are the main drawbacks of the ReLU activation function?
It's not symmetric around 0
It's not smooth and cannot be differentiated
It can zero out all the gradients from some point in the training process
y = max(0, x @ W + b), dout is the downstream gradient, @ is matrix multiplication, all other operations are element-wise. What is d(loss)/dW?
W @ max(0, y) * dout
x @ W
x.T @ (dout * (y > 0))
W @ max(0, y) * dout + dy / db
x.T @ (dout * (y > 0)) + dy / db
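Candidate expressions for d(loss)/dW can be gradient-checked numerically. A minimal sketch (shapes and the proxy loss sum(dout * y) are illustrative assumptions) testing x.T @ (dout * (y > 0)):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))      # batch of 4, 3 input features
W = rng.normal(size=(3, 5))
b = rng.normal(size=(5,))
dout = rng.normal(size=(4, 5))   # downstream gradient d(loss)/dy

y = np.maximum(0, x @ W + b)
# Candidate: route dout through the ReLU mask, then through the matmul
dW = x.T @ (dout * (y > 0))

def loss(W_):
    # Proxy loss whose gradient w.r.t. y is exactly dout
    return np.sum(dout * np.maximum(0, x @ W_ + b))

# Centered finite difference on one entry of W
h = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += h
Wm[0, 0] -= h
numeric = (loss(Wp) - loss(Wm)) / (2 * h)
assert abs(numeric - dW[0, 0]) < 1e-4
```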
Gradients with respect to x, y, z, w. Green numbers show the forward pass; the red number is the downstream gradient. Format your answer according to the pattern: x.xx, y.yy, z.zz, w.ww. Example answer: -3.00, 9.60, 18.66, -1.00
What is a good way of weights initialization?
All 0's
Small random numbers
Normal distribution
All = constant > 0
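The problem with constant (e.g. all-zero) initialization is symmetry: every hidden unit computes the same function and receives the same gradient, so the units never differentiate. A small NumPy sketch (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))  # a small batch

# All-zero init: every hidden unit produces the same (zero) activation,
# so every unit would also receive the same gradient.
W_zero = np.zeros((4, 16))
h_zero = np.tanh(x @ W_zero)
assert np.all(h_zero == 0)

# Small random numbers break the symmetry: units compute different things.
W_rand = 0.01 * rng.normal(size=(4, 16))
h_rand = np.tanh(x @ W_rand)
assert not np.allclose(h_rand[:, 0], h_rand[:, 1])
```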
Where is the place of the BatchNorm layer in the FFNN architecture?
Before the activation function
After the activation function
Before the input layer
The BatchNorm layer normalizes data over which axis?
Over instance axis
Over feature axis
Both
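The axis question is easy to pin down in code. In a minimal BatchNorm forward pass (a sketch for a 2-D (batch, features) input; the eps value is the usual numerical-stability assumption), the mean and variance are taken over the instance (batch) axis, yielding one statistic per feature:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Statistics are computed over the batch axis
    # (axis=0), producing one mean/variance per feature.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.normal(size=(64, 10))
out = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
# Each feature column is now approximately zero-mean, unit-variance
assert np.allclose(out.mean(axis=0), 0.0, atol=1e-6)
assert np.allclose(out.std(axis=0), 1.0, atol=1e-2)
```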
What is a better way of searching neural net's hyperparameters?
Grid search
Random search
Gradient descent
Why?
Good combinations of hyperparameters are unlikely
Random search produces more diverse sets of hyperparameter values
Gradient descent allows the algorithm to converge faster
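The diversity argument can be made concrete (a sketch with made-up hyperparameter ranges): with a budget of nine trials, a 3x3 grid tries only three distinct values per hyperparameter, while random search draws nine distinct values per axis.

```python
import random

random.seed(0)
budget = 9

# Grid search: 3 x 3 grid -> only 3 distinct values per hyperparameter
grid_lr = [1e-4, 1e-3, 1e-2]
grid_reg = [1e-5, 1e-4, 1e-3]
grid_trials = [(lr, reg) for lr in grid_lr for reg in grid_reg]

# Random search: each trial draws fresh values (log-uniform here)
# -> 9 distinct values per hyperparameter for the same budget
random_trials = [
    (10 ** random.uniform(-4, -2), 10 ** random.uniform(-5, -3))
    for _ in range(budget)
]

print(len({lr for lr, _ in grid_trials}))    # 3 distinct learning rates
print(len({lr for lr, _ in random_trials}))  # 9 distinct learning rates
```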
Can we use different learning rates at different layers of a neural network?
Yes
No
When should neural network training be stopped?
When train loss becomes constant
When train loss is zero
When validation loss starts to increase
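The "validation loss starts to increase" criterion is usually implemented as early stopping with a patience counter. A minimal sketch (the loss sequence and patience value are illustrative):

```python
def train_with_early_stopping(val_losses, patience=2):
    # Stop once validation loss has not improved for `patience` epochs.
    best, bad_epochs, stop_epoch = float("inf"), 0, len(val_losses)
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                stop_epoch = epoch
                break
    return stop_epoch

# Validation loss bottoms out at epoch 2, then starts increasing:
# with patience=2, training stops at epoch 4.
print(train_with_early_stopping([1.0, 0.7, 0.5, 0.6, 0.8, 1.1]))  # 4
```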
What should be done if train loss is much less than validation loss?
Probably nothing at all: the model has learned everything it can, and the gap may just be noise in the dataset
Increase regularization
Collect more data
Check that your train one-hot encoding is consistent with your validation one-hot encoding
Check for a data leak
Make a hyperparameter search
Reduce model capacity
Check that all labels are in the training data
Try to change learning rate or to schedule it differently
Check your data preprocessing algorithm
Your questions about the lecture (if any)