

Journal of Information Science and Engineering, Vol. 36, No. 5, pp. 1007-1019


Vanishing Gradient Analysis in Stochastic Diagonal Approximate Greatest Descent Optimization


HONG HUI TAN AND KING HANN LIM
Department of Electrical and Computer Engineering
Curtin University
CDT 250, 98009 Miri, Sarawak, Malaysia
E-mail: tan.honghui@postgrad.curtin.edu.my


Deep learning neural networks are often applied to high-complexity classification problems by stacking multiple hidden layers between the input and output. The measured error is backpropagated layer by layer through the network, with the gradient gradually vanishing due to the differentiation of the activation function. In this paper, Stochastic Diagonal Approximate Greatest Descent (SDAGD) is proposed to tackle the vanishing gradient issue in deep learning neural networks using an adaptive step length derived from second-order derivative information. The trajectory of the proposed SDAGD optimizer is demonstrated on three-dimensional error surfaces, i.e. (a) a hilly error surface with two local minima and one global minimum; (b) a deep Gaussian trench simulating the drastic gradient changes of a ravine topography; and (c) a plateau terrain with a small initial gradient. As a result, SDAGD converges at the fastest rate to the global minimum without interference from the vanishing gradient issue, compared to other benchmark optimizers such as Gradient Descent (GD), AdaGrad and AdaDelta. Experiments are conducted on saturated and unsaturated activation functions with sequentially added hidden layers to evaluate how well the proposed optimizer mitigates the vanishing gradient. The experimental results show that SDAGD obtains good performance on the tested deep feedforward networks, whereas stochastic GD obtains worse misclassification error when the network has more than three hidden layers due to the vanishing gradient issue. SDAGD mitigates the vanishing gradient by adaptively controlling the per-layer step-length elements using second-order information. With a constant training-iteration setup, SDAGD with ReLU achieves the lowest misclassification rate of 1.77% compared to the other optimization methods.
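The abstract does not reproduce the SDAGD update rule itself; the minimal Python sketch below only illustrates the general idea of scaling each weight's step by diagonal second-order (curvature) information. The names lr and mu, and the use of a Gauss-Newton-style diagonal curvature estimate, are assumptions for illustration rather than the paper's exact formulation.

    import numpy as np

    # Illustrative sketch only: the exact SDAGD step-length rule is defined in
    # the paper body, not in this abstract. lr, mu and the diagonal curvature
    # estimate are assumed here for demonstration purposes.
    def diagonal_second_order_step(grad, diag_curvature, lr=0.1, mu=1e-2):
        """Per-element adaptive step built from diagonal second-order information."""
        # Large curvature -> smaller step; small curvature (plateau) -> larger
        # step, which is how per-layer step-length control can counter the
        # shrinking gradients of deep saturated layers.
        return -lr * grad / (diag_curvature + mu)

    # Example: a plateau-like layer (tiny gradients, tiny curvature) still
    # receives a usable update because the step is normalised by curvature.
    grad = np.array([1e-4, 5e-4, -2e-4])
    curv = np.array([1e-3, 2e-3, 1e-3])
    print(diagonal_second_order_step(grad, curv))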


Keywords: Stochastic diagonal approximate greatest descent, vanishing gradient, learning rate tuning, activation function, adaptive step-length
