
Methods

In the last few years, Machine Learning has seen rapid growth in popularity thanks to the development of the Deep Learning framework (LeCun et al., 2015), which has consistently improved the ability to provide accurate classification and recognition on large datasets. Among Deep Learning networks, Convolutional Neural Networks (CNNs) are specialized to process data that come in the form of arrays (1D arrays for time-series data, such as seismic signals, or 2D arrays for images). A CNN is structured as a series of stages composed of two types of layers: convolutional layers and pooling layers. Each unit of a convolutional layer computes a weighted sum over a local patch of channels of the previous layer (a convolution with a local kernel), which is then passed through a non-linearity, such as the rectified linear unit (ReLU). The pooling layer replaces the output of the net at a given location with a summary statistic of the nearby outputs. The CNN architecture leverages two ideas that help improve the machine learning system: sparse connectivity (accomplished by making the kernel smaller than the input) and parameter sharing (accomplished by tied weights, with the same kernel applied at different locations).

The CNN model used for the training process is summarized in Tab. 1, together with the kernel sizes of the 5 convolutional layers. The convolutional layers use ReLU as the activation function. We used the max-pooling method for the pooling layers, computing the maximum value over the neighborhood. As usual, the last layer is a fully connected layer with the softsign function as activation function, while the output layer uses a sigmoid activation. We chose cross-entropy as the loss function to be minimized during the training phase, as is usual for binary classification tasks.

We divided the dataset into a training set, a validation set and a test set. The training process is performed with a variant of the stochastic gradient descent algorithm, the ADAM algorithm (Goodfellow et al., 2016), an optimization strategy with adaptive step sizes that exploits estimates of the first and second moments of the gradient. We use mini-batches of 512 seismograms, the default learning rate η = 10⁻³, and a numerical stabilization parameter ε = 10⁻³. Fig. 2 shows a representation of the network used. We use early stopping to prevent overfitting: training stops automatically if the validation loss does not decrease for 3 consecutive epochs, and the best set of learned model weights (the one with the lowest validation loss) is saved. In each training, we doubled the training dataset by flipping the seismograms and assigning the flipped traces the opposite label, in order to equalize the number of traces with upward and downward polarity.
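The setup described above maps onto a standard Keras workflow. The sketch below is a minimal, non-authoritative reconstruction under stated assumptions, not the authors' code: the trace length, number of filters per layer, kernel sizes and dense-layer width are placeholders (the actual values are those reported in Tab. 1), while the optimizer settings (η = 10⁻³, ε = 10⁻³), batch size, loss, early stopping with patience 3, and the polarity-flip augmentation follow the text.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, callbacks

# Placeholder hyperparameters: the real values are given in Tab. 1 of the paper.
TRACE_LEN = 400                    # samples per seismogram (assumed)
FILTERS = [32, 32, 64, 64, 128]    # filters per convolutional layer (assumed)
KERNELS = [5, 5, 3, 3, 3]          # kernel sizes per convolutional layer (assumed)

def build_model():
    """Five Conv1D+ReLU stages with max-pooling, a softsign dense layer,
    and a sigmoid output for binary (up/down) polarity classification."""
    model = models.Sequential()
    model.add(layers.Input(shape=(TRACE_LEN, 1)))
    for n_filters, k in zip(FILTERS, KERNELS):
        model.add(layers.Conv1D(n_filters, k, padding="same", activation="relu"))
        model.add(layers.MaxPooling1D(pool_size=2))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="softsign"))  # fully connected layer (width assumed)
    model.add(layers.Dense(1, activation="sigmoid"))    # output: probability of upward polarity
    return model

def augment(x, y):
    """Double the training set: flip each trace (sign inversion) and assign
    the opposite polarity label, balancing upward and downward examples."""
    return np.concatenate([x, -x]), np.concatenate([y, 1.0 - y])

model = build_model()
model.compile(
    optimizer=optimizers.Adam(learning_rate=1e-3, epsilon=1e-3),  # eta = 1e-3, eps = 1e-3
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Early stopping: halt when the validation loss has not improved for
# 3 consecutive epochs and keep the weights with the lowest validation loss.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                     restore_best_weights=True)

# x_train, y_train, x_val, y_val are assumed to be prepared elsewhere.
# x_train, y_train = augment(x_train, y_train)
# model.fit(x_train, y_train, batch_size=512,
#           validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])
```

In this sketch the test set is held out entirely; only the validation split drives the early-stopping criterion, consistent with the three-way split described in the text.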
