Week 2: Describe Dataset & Analyzing Related Works

Apr 18, 2021

We spent this week diving deeper into the related work and getting to know the data.

Team: Muhammet Subaşı, Harun Bürkük, Mustafa Korkmazlar

source: https://www.boldbi.com/blog/visualize-geographical-data-using-map-dashboard

First, why did we choose this dataset? It promises a challenge because it involves working with geospatial data. Geospatial data (also known as “spatial data”) describes features or objects on the Earth’s surface. Whether man-made or natural, if it is tied to a specific location on the globe, it’s geospatial.

Analyzing Related Work

The authors of the paper we reviewed start by describing the problem. They cast it as a regression problem (predicting a spherical coordinate: latitude and longitude) rather than a classification problem (predicting a country/area). The reason is that there are a large number of countries/areas and only a small number of examples per country/area.

Secondly, they describe what makes the task challenging. The problem is technically interesting because of the special characteristics of global geographical positions: longitude is discontinuous (it wraps around at ±180°), and latitude/longitude grid cells of equal degree size cover different areas, growing larger toward the equator. For these reasons they decided not to use linear regression. Instead, they applied K-Nearest Neighbors and random forest regression for prediction.
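To make these two issues concrete, here is a small sketch of our own (not code from the paper). It assumes a spherical Earth with a radius of 6371 km; the helper names `haversine_km` and `spherical_mean` are ours. It shows that naively averaging two longitudes on either side of the ±180° line lands on the wrong side of the globe, and that one degree of longitude is much shorter at 60° latitude than at the equator.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding error for near-antipodal points.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(h)))


def spherical_mean(points):
    """Average (lat, lon) points by summing their 3D unit vectors."""
    x = y = z = 0.0
    for lat, lon in points:
        phi, lam = math.radians(lat), math.radians(lon)
        x += math.cos(phi) * math.cos(lam)
        y += math.cos(phi) * math.sin(lam)
        z += math.sin(phi)
    mean_lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    mean_lon = math.degrees(math.atan2(y, x))
    return mean_lat, mean_lon


# Two points on the equator, just east and west of the ±180° line.
points = [(0.0, 179.0), (0.0, -179.0)]

# Naive arithmetic mean of the longitudes: the opposite side of the globe.
naive_lon = sum(lon for _, lon in points) / len(points)

# Mean taken on the sphere: stays next to the ±180° line.
mean_lat, mean_lon = spherical_mean(points)

# The two points are only 2 degrees apart, not 358.
gap_km = haversine_km(0.0, 179.0, 0.0, -179.0)

# Width of one degree of longitude at the equator and at 60° north.
width_equator_km = haversine_km(0.0, 0.0, 0.0, 1.0)
width_60n_km = haversine_km(60.0, 0.0, 60.0, 1.0)

print(naive_lon, mean_lon, gap_km, width_equator_km, width_60n_km)
```

A distance-based method such as KNN can use a great-circle distance like this directly, which is one reason it fits this kind of target better than a plain linear model on raw latitude and longitude.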

In conclusion, the paper tried both random forest (RFC) and KNN, and KNN gave the more accurate results. Both RFC and KNN also performed better than the permutation-test baseline, so we can infer that there is a connection between the audio features and geographical location.
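The idea behind a permutation test can be sketched as follows. This is a generic illustration under our own assumptions, not the paper's exact procedure: the data is synthetic, the single feature `x` and the helper `one_nn_error` are hypothetical, and a 1-nearest-neighbour predictor stands in for the real models. The targets are shuffled so that any link between features and locations is broken, and the model's real error is compared against the errors obtained on the shuffled data.

```python
import math
import random


def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) tuples given in degrees."""
    (lat1, lon1), (lat2, lon2) = a, b
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    h = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(h)))


def one_nn_error(train_x, train_y, test_x, test_y):
    """Mean great-circle error (km) of a 1-nearest-neighbour predictor."""
    total = 0.0
    for x, target in zip(test_x, test_y):
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        total += haversine_km(train_y[nearest], target)
    return total / len(test_x)


rng = random.Random(0)


def make_sample(n):
    """Synthetic data: one feature that fully determines the location."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [(-60.0 + 120.0 * x, -150.0 + 300.0 * x) for x in xs]
    return xs, ys


train_x, train_y = make_sample(60)
test_x, test_y = make_sample(30)

# Error with the true pairing of features and locations.
real_error = one_nn_error(train_x, train_y, test_x, test_y)

# Errors after shuffling the training locations (feature-location link destroyed).
n_permutations = 200
permuted_errors = []
for _ in range(n_permutations):
    shuffled = train_y[:]
    rng.shuffle(shuffled)
    permuted_errors.append(one_nn_error(train_x, shuffled, test_x, test_y))

# Share of permutations that did at least as well as the real model.
better_or_equal = sum(e <= real_error for e in permuted_errors)
p_value = (1 + better_or_equal) / (1 + n_permutations)

print(f"real error: {real_error:.0f} km, p-value: {p_value:.3f}")
```

If the real error is clearly lower than almost all of the shuffled errors, the features carry information about location, which is the kind of conclusion drawn above.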

Another interesting result of the paper: even though Greece and Taiwan have the same number of tracks, Greece gives more accurate results because of the smaller area.

Describe Dataset

Below, you can see the distribution of countries in the dataset and on a world map. To estimate feature importances, we used the ExtraTreeClassifier model from the sklearn library. Most features have importances in roughly the same range, though some are more important than others, and most are above 0.010. You can see the results below.
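For readers who want to reproduce this kind of ranking, here is a minimal sketch on synthetic data, not our actual notebook. Note that sklearn ships both `sklearn.tree.ExtraTreeClassifier` (a single randomized tree) and `sklearn.ensemble.ExtraTreesClassifier` (an ensemble of them); the sketch assumes the ensemble version. The data, the number of features and the labels are all made up for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the audio features (hypothetical data).
rng = np.random.default_rng(0)
n_samples, n_features = 300, 8
X = rng.normal(size=(n_samples, n_features))

# Hypothetical labels that depend only on features 0 and 1.
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Fit the ensemble and read the impurity-based importances.
model = ExtraTreesClassifier(n_estimators=200, random_state=0)
model.fit(X, y)
importances = model.feature_importances_

# Print the features from most to least important.
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"feature_{idx}: {importances[idx]:.3f}")
```

The importances returned by `feature_importances_` are normalized to sum to 1, so a threshold such as 0.010 only makes sense relative to the number of features in the dataset.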