First, we framed the problem of finding the geographical origin of a piece of music as a classification problem. To do so, we converted the longitude and latitude values into country labels.
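As a minimal sketch of how such a conversion could work, the snippet below assigns each coordinate pair to the country whose reference point is closest by great-circle distance. This is an assumption for illustration and not necessarily the procedure used in this project. The `REFERENCE_POINTS` table, the helper names, and the approximate coordinates are all hypothetical.

```python
import math

# Hypothetical lookup table: approximate (latitude, longitude) in degrees
# of one reference point per country. Illustrative values only.
REFERENCE_POINTS = {
    "Turkey": (39.93, 32.86),   # roughly Ankara
    "Belize": (17.25, -88.77),  # roughly Belmopan
    "Greece": (37.98, 23.73),   # roughly Athens
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points on Earth."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def country_label(lat, lon, reference_points=REFERENCE_POINTS):
    """Return the country whose reference point is nearest to (lat, lon)."""
    return min(
        reference_points,
        key=lambda name: haversine_km(lat, lon, *reference_points[name]),
    )


# Toy coordinates standing in for the dataset's latitude/longitude columns.
coordinates = [(39.91, 32.83), (17.25, -88.76), (37.97, 23.72)]
labels = [country_label(lat, lon) for lat, lon in coordinates]
print(labels)
```

A real pipeline would need a complete table of reference points, or a reverse-geocoding service, in place of the three entries used here.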
We split the dataset into 80% training and 20% test data. We expected higher prediction accuracy because we thought there was enough data for training. We think the reason for the lower results is that the data is not distributed evenly across countries: some countries appear rarely in the dataset and others appear frequently. For example, Belize appears approximately 10 times, whereas Turkey appears more than 60 times.
The results we obtained are as follows:
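To make the imbalance concrete, the sketch below counts how often each label occurs, computes the accuracy of always predicting the most frequent class, and performs an 80/20 split that keeps each country's share roughly equal in both parts. The label counts are hypothetical toy values loosely modelled on the Belize and Turkey figures above, and `stratified_split` is an illustrative helper, not the split used in this project.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=1):
    """Return (train_idx, test_idx) with each label represented proportionally."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idx = by_label[label]
        rng.shuffle(idx)
        # Keep at least one test sample per class so every class is evaluated.
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy labels with a strong imbalance between countries.
labels = ["Turkey"] * 60 + ["Belize"] * 10 + ["Greece"] * 30

counts = Counter(labels)
# Accuracy obtained by always predicting the most frequent country.
majority_baseline = max(counts.values()) / len(labels)

train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)

print("class counts:", dict(counts))
print("majority-class baseline accuracy:", majority_baseline)
print("test-set class counts:", dict(test_counts))
```

Comparing a model's accuracy against the majority-class baseline shows how much of the score comes from the frequent countries alone.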
Random Forest Classifier
Our first algorithm is Random Forest, an ensemble learning algorithm that combines multiple hypotheses to make better predictions. We tried different parameters for this model; the best-performing ones were:
rfc = RandomForestClassifier(n_estimators=1000, random_state=1, min_samples_leaf=1)
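The idea of combining multiple hypotheses can be illustrated with a simple majority vote over the predictions of several base models. This is only a conceptual sketch: the per-tree predictions are made up, `majority_vote` is a hypothetical helper, and scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine one predicted label per base model into a single label.

    Ties are broken in favour of the label that appears first in the list.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


# Hypothetical predictions of three trees for three samples.
tree_predictions = [
    ["Turkey", "Greece", "Turkey"],
    ["Belize", "Belize", "Greece"],
    ["Greece", "Turkey", "Greece"],
]
ensemble = [majority_vote(votes) for votes in tree_predictions]
print(ensemble)
```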
Logistic Regression
We used logistic regression for our second attempt. Logistic regression is a statistical model that uses a logistic function to model the probability of an outcome. It is most naturally suited to binary dependent variables, but it also works quite well for multi-class prediction. We tried different hyperparameters and found the following to give the highest accuracy:
lr = LogisticRegression(max_iter=500, random_state=1, multi_class='multinomial')
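The step from the binary logistic function to the multi-class case can be sketched with the softmax function, which turns one linear score per class into probabilities that sum to one. The class names and scores below are hypothetical; in a fitted model the scores would come from the learned weights.

```python
import math


def softmax(scores):
    """Convert a list of real-valued scores into probabilities."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def logistic(z):
    """Binary logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))


classes = ["Belize", "Greece", "Turkey"]
scores = [0.5, 1.0, 2.0]  # hypothetical per-class linear scores
probs = softmax(scores)
predicted = classes[probs.index(max(probs))]
print(dict(zip(classes, probs)), "->", predicted)

# With two classes and one score fixed at 0, softmax reduces to the
# logistic function, which is why the binary model is a special case.
two_class = softmax([1.5, 0.0])[0]
print(two_class, logistic(1.5))
```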
Support Vector Machines
Support vector machines are supervised learning models used for classification and regression. Since we converted our longitude and latitude values into country labels, we chose Support Vector Classification (SVC).
The best prediction accuracy for the SVC algorithm was 42.4%. We obtained this result with a penalty parameter (C) of 0.1, a linear kernel, and gamma set to 'scale' (gamma only affects the rbf, poly, and sigmoid kernels, so it has no effect with a linear kernel). Increasing the penalty parameter generally improves prediction results, but we got the best result at the smallest value we tried.
svc = svm.SVC(kernel='linear', gamma='scale', C=.1)
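A search over the penalty parameter can be sketched with cross-validation, as below. The data is a synthetic stand-in generated with `make_classification`, the list of `C` values is hypothetical, and the `StandardScaler` step is an addition that the original experiment may not have used, so the scores say nothing about the music dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the audio features (X) and country labels (y).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=1
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

c_values = [0.01, 0.1, 1.0, 10.0]  # hypothetical candidate values
scores = {}
for c in c_values:
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=c))
    scores[c] = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()

best_c = max(scores, key=scores.get)
for c in c_values:
    print(f"C={c}: mean CV accuracy = {scores[c]:.3f}")
print("best C:", best_c)
```

Averaging over several folds gives a steadier estimate than a single 80/20 split, which matters when some classes have only a handful of samples.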
The results reported here are accuracy scores, and they are not as high as we expected. Next week we will decide how to proceed: we will either try to improve these results with feature selection and deep learning, or we will change our perspective on the problem.
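Because the countries are unevenly represented, overall accuracy can hide weak performance on rare classes. The sketch below computes a confusion-matrix count table and per-class recall for a small, made-up set of predictions. The labels and predictions are hypothetical and are not results from this project.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    cm = confusion_counts(y_true, y_pred)
    support = Counter(y_true)
    return {label: cm[(label, label)] / n for label, n in support.items()}


# Hypothetical true labels and predictions for ten test samples.
y_true = ["Turkey"] * 6 + ["Belize"] * 2 + ["Greece"] * 2
y_pred = ["Turkey"] * 6 + ["Turkey", "Belize"] + ["Turkey", "Greece"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
macro_recall = sum(recall.values()) / len(recall)

print("accuracy:", accuracy)
print("per-class recall:", recall)
print("macro-averaged recall:", macro_recall)
```

In this toy case the frequent class is always recognised while the rare classes are missed half the time, which the single accuracy figure does not reveal.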