Configuration: L8-L8-D8-D4 (L: LSTM, D: Dense)
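To make this configuration concrete, the following is a minimal sketch of the network in Keras. The input shape, optimizer, and the activations of all layers except the last Dense layer (which the discussion below identifies as ReLU) are illustrative assumptions, not details taken from the source.

```python
# Minimal sketch of the L8-L8-D8-D4 configuration.
# Input shape and optimizer are hypothetical placeholders.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

timesteps, n_features = 30, 1  # assumed window size and feature count

model = Sequential([
    LSTM(8, return_sequences=True, input_shape=(timesteps, n_features)),  # L8
    LSTM(8),                       # L8
    Dense(8, activation="relu"),   # D8
    Dense(4, activation="relu"),   # D4: the last Dense layer before the output
    Dense(1),                      # scalar output compared against the target
])
model.compile(optimizer="adam", loss="mse")  # MSE, matching the loss curves below
```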

1) Evolution of weights
• Panel B1: The thickness of the lines decreases significantly over time, indicating a major reduction in the magnitude of these parameters during training and, hence, a smaller contribution of this node to the next layer's output.
• Panel A2: Several negative weights switch to positive right after the first epoch, resulting in an all-positive set of parameters in later epochs and, hence, a positive contribution to the following layer.
• However, in Panel B2, the majority of the parameters change remarkably: the originally thick lines become thinner, while the originally thin, positive lines switch to negative and grow in magnitude (a sketch of how such per-epoch snapshots can be recorded follows this list). There is a corresponding movement in the training/testing loss line chart, where both the training and testing MSE curves show a turning point around the 5th or 6th epoch.
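The per-epoch weight evolution shown in these panels can be captured with a simple recording hook; the sketch below assumes Keras and stores each layer's kernel after every epoch. The class name and storage layout are hypothetical, not the authors' implementation.

```python
# Record every layer's kernel at the end of each epoch so that weight
# magnitude (line thickness) and sign (line orientation) can be plotted
# over training. WeightHistory is a hypothetical helper.
from tensorflow.keras.callbacks import Callback

class WeightHistory(Callback):
    def __init__(self):
        super().__init__()
        self.snapshots = []  # snapshots[epoch][layer] -> kernel array

    def on_epoch_end(self, epoch, logs=None):
        kernels = [layer.get_weights()[0].copy()
                   for layer in self.model.layers if layer.get_weights()]
        self.snapshots.append(kernels)

# Hypothetical usage:
# history = WeightHistory()
# model.fit(X_train, y_train, validation_data=(X_test, y_test),
#           epochs=10, callbacks=[history])
```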
2) Learning process
• Panel A3: In the early stage of the training process (epoch 3), the scatterplots for training MSE and testing MSE both contain a vertical formation of outputs. This can be explained by the ReLU activation function: a negative input results in a zero output, as can be seen in the first, third, and fourth nodes of the last Dense layer right before the final output (illustrated in the sketch after this list). At this stage, training has only just begun and the parameters are not yet properly tuned.
• Panel B3: By epoch 10, the outputs align with the target much more closely. Notice that in the last Dense layer, the only node with a positive weight is the second from the top; its outputs align in the same direction as the target, whereas the other three nodes have the opposite orientation, hence their negative weights.
This observation reflects the nature of neural networks and machine learning in general: in the process of minimizing the loss, positive contributions are rewarded and negative contributions are penalized, as formalized below.
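The vertical band in Panel A3 follows directly from the definition of ReLU, as the small sketch below illustrates: every negative pre-activation collapses to exactly zero, so many different inputs share the same output. The numbers are made up for illustration.

```python
# ReLU clamps all negative pre-activations to zero, which is what
# produces the vertical formation of outputs early in training.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

pre_activations = np.array([-2.1, -0.7, -0.1, 0.4, 1.7])
print(relu(pre_activations))  # [0.  0.  0.  0.4 1.7]
```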
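The reward/penalty behaviour is, in essence, the gradient-descent update: each weight is pushed in whichever direction reduces the loss, so nodes whose outputs oppose the target end up with negative weights. A minimal statement, assuming plain gradient descent with learning rate \(\eta\) (the actual optimizer is not specified in the source):

```latex
w \leftarrow w - \eta \, \frac{\partial \mathcal{L}_{\mathrm{MSE}}}{\partial w}
```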
Without proper arrangement and visual representation, we would not be able to discern these visual characteristics of the parameters in complex neural networks.