Week 5 — Predicting Birth Coordinates of Music Using a Neural Network

Muhammet Subaşı
5 min read · May 16, 2021

In this glorious week, we want to talk about our neural network implementation for predicting the geographical origin of music.

https://realpython.com/learning-paths/machine-learning-python/

Introduction

Today we will talk about our neural network model and the algorithms and methods we use to find the origin of music around the world.

Feature Importance

Feature importance is a really important area for neural networks. Questions such as which variables contribute most to the predictions, whether correlations exist, and whether there are possible causal relationships help us understand the success of neural networks in mimicking real intelligence, because we humans also consider these relationships in the decisions we make in our real lives.

Accordingly, for feature importance we used eli5's Permutation Importance, which has built-in support for common ML frameworks such as scikit-learn and Keras.

The idea behind Permutation Importance is to measure how much the score, whatever metric we are tracking, decreases when a feature is made unavailable. This method is also known as “Mean Decrease Accuracy” (MDA). The implementation replaces each feature, one at a time, with a column of noise (its shuffled values) and observes how the accuracy changes. This is especially useful for datasets with a large number of features like ours.
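As a rough sketch, the eli5 usage looks like this (here model stands for an already-fitted regressor, and X_val / y_val are placeholder names for a held-out validation split, not our exact code):

import eli5
from eli5.sklearn import PermutationImportance

# Shuffle each feature column in turn and record the mean score decrease.
perm = PermutationImportance(model, random_state=1).fit(X_val, y_val)

# In a notebook, show_weights renders the ranking table; the raw values
# are also available as perm.feature_importances_.
eli5.show_weights(perm, feature_names=list(X_val.columns))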

Below is the feature importance ranking of our dataset. We will look at this table and select the most appropriate number of features.

Model

We decided to use a Multilayer Perceptron (MLP) regressor, because our aim was to predict two numeric values, latitude and longitude, from the numeric feature values of music. Our Multilayer Perceptron (MLP) model consists of one input layer, two hidden layers with 12 nodes each, and an output layer with 2 nodes: latitude and longitude.
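A minimal sketch of this architecture with scikit-learn's MLPRegressor (X_train, y_train, and X_test are placeholder names; y_train has two columns, latitude and longitude):

from sklearn.neural_network import MLPRegressor

# hidden_layer_sizes lists only the hidden layers; the input layer and
# the 2-node output layer are inferred from the shapes of X and y.
model = MLPRegressor(hidden_layer_sizes=(12, 12))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)  # shape (n_samples, 2): one (lat, lon) pair per row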

After predicting latitude and longitude values, we needed to calculate the distance between the predicted coordinates and the actual coordinates. For this purpose we used the function below:

from math import cos, asin, sqrt

def distance(lat1, lon1, lat2, lon2):
    # Haversine formula: great-circle distance in km between two lat/lon points.
    p = 0.017453292519943295  # pi / 180, converts degrees to radians
    a = (0.5 - cos((lat2 - lat1) * p) / 2
         + cos(lat1 * p) * cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2)
    return 12742 * asin(sqrt(a))  # 12742 = Earth's diameter in km (2 * 6371)

This function calculates the distance in kilometers between two points on Earth's surface. We couldn't use traditional measures such as Euclidean distance, because distance on Earth's surface is not linear: the planet's shape is an irregular ellipsoid. Moreover, the distance between two consecutive longitudes is not constant; it decreases from the equator to the poles.
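As a quick sanity check, the function can be called on two known cities, and, hypothetically, averaged over true and predicted coordinate pairs (y_test and y_pred below are placeholder names):

# Paris (48.8566 N, 2.3522 E) to London (51.5074 N, 0.1278 W): roughly 344 km.
print(distance(48.8566, 2.3522, 51.5074, -0.1278))

# Mean prediction error in km, pairing actual and predicted coordinates.
errors = [distance(t_lat, t_lon, p_lat, p_lon)
          for (t_lat, t_lon), (p_lat, p_lon) in zip(y_test, y_pred)]
print(sum(errors) / len(errors))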

Optimized Hyperparameters

In order to improve our model's performance, we have to optimize its hyperparameters. Since we are using MLPRegressor, we have a lot of parameters to optimize. Under this header, we are going to explain every parameter that we optimized. Let's start!

random_state: This parameter seeds the random number generation for weight and bias initialization. As you all know, there is no good or bad value for this parameter, so we chose 1.

hidden_layer_sizes: This parameter determines the number of hidden layers in the model and the number of nodes in each hidden layer. The best values depend on the data's input and output sizes. Since our dataset is not that complicated, we chose 2 as our number of hidden layers. To determine the number of nodes per hidden layer, we applied the (number of inputs + number of outputs)^0.5 + (1 to 10) rule of thumb, tested the mean squared error for all of the resulting candidates (sketched below), and found that the best combination is 2 hidden layers with 12 nodes each.
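A sketch of that search (n_features and the train/validation splits are placeholders for our actual data):

import math
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

base = round(math.sqrt(n_features + 2))  # (number of inputs + number of outputs)^0.5
for extra in range(1, 11):               # the "+ (1 to 10)" part
    size = base + extra
    mlp = MLPRegressor(hidden_layer_sizes=(size, size), random_state=1)
    mlp.fit(X_train, y_train)
    print(size, mean_squared_error(y_val, mlp.predict(X_val)))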

activation: For the activation function there are four candidates: ‘relu’, ‘identity’, ‘logistic’, and ‘tanh’. We tried all of them and found that the ‘identity’ activation function is best for our model. As we mentioned before, our dataset is not complicated and is fairly linear, so this activation function makes sense.

solver: For the weight-optimization solver, the options are ‘adam’, ‘sgd’, and ‘lbfgs’. Among these three, we found that ‘adam’ is the best for our model and dataset: since our dataset is not that small, ‘lbfgs’ would not be efficient, while ‘adam’ fits our data pretty well.

batch_size: We tried different batch sizes, such as 64, 128, 256, 400, and 500, and found that a batch size of 128 gives us the best mean squared error on our data.

alpha: This is the L2 regularization penalty parameter. We tried different values here as well, such as 0.001, 0.1, 1, 100, and 1000, and found that 1000 is the best value for our model. Since this is a regression problem and our target values are spread far apart, this makes sense.
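We tuned these values mostly one at a time; as a compact sketch, the same kind of search could also be written with scikit-learn's GridSearchCV (data names are placeholders):

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

param_grid = {
    'activation': ['relu', 'identity', 'logistic', 'tanh'],
    'solver': ['adam', 'sgd', 'lbfgs'],
    'batch_size': [64, 128, 256, 400, 500],
    'alpha': [0.001, 0.1, 1, 100, 1000],
}
search = GridSearchCV(
    MLPRegressor(hidden_layer_sizes=(12, 12), random_state=1),
    param_grid,
    scoring='neg_mean_squared_error',
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_)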

max_iter: It’s the max number of iterations for the model. When using the ‘adam’ solver this parameter becomes epoch size for the model. Like the others, we tried with a different number of max_iter’s and we see that this can be quite different because 4000 and 200 give almost similar results in terms of mean squared error so we can say that, we have to deal with this parameter until the end of the project.

Until next week, we will keep trying different parameters and approaches on our dataset and model, and finally share the results for our task.
