In this article, we will try to determine the most popular venues in Athens, Greece (where I also live).
Athens, the capital and largest city of Greece, is one of the world’s oldest cities, with a recorded history spanning over 3,000 years. It was a center for the arts, learning and philosophy, and the birthplace of democracy, with great cultural and political impact on the European continent.
Because of its ancient monuments, works of art, landmarks and museums, Athens remains one of the most famous and attractive destinations for visitors from all over the world. As such, it is interesting to explore the variety of venues around the city centre and gain insights into the most popular places. To do that, we will perform clustering analysis with the “sklearn” library to find regions that are similar in terms of their venues and, finally, create a map of the regions and their corresponding clusters.
Clustering, belonging to Unsupervised Machine Learning algorithms, will be beneficial for this type of analysis as we will be able to find similarities for unknown regions and categorize the different areas of Athens city centre.
That would be very helpful for both tourists and travel agencies, which could better plan trips and provide more personalized offers according to each traveler’s needs.
Furthermore, this analysis would be beneficial for business owners to better understand the different regions and select the appropriate place to open or relocate their business.
The analysis will be performed with the help of Wikipedia, some Python libraries and Foursquare, the famous location data platform.
Starting with the location data, the regions of Athens will be acquired from Wikipedia and passed to “geopy” in order to return their coordinates (latitudes and longitudes). “geopy” does not provide information about every region in Athens, but it covers enough of them for our purposes.
#### get all the regions on Athens from "https://en.wikipedia.org/wiki/Athens"
regions = 'Omonoia, Syntagma, Exarcheia, Agios Nikolaos, Neapolis, Lykavittos, Lofos Strefi, Lofos Finopoulou, Lofos Filopappou, Pedion Areos, Metaxourgeio, Aghios Kostantinos, Larissa Station, Kerameikos, Psiri, Monastiraki, Gazi, Thission, Kapnikarea, Aghia Irini, Aerides, Anafiotika, Plaka, Acropolis, Pnyka, Makrygianni, Lofos Ardittou, Zappeion, Aghios Spyridon, Pangrati, Kolonaki, Dexameni, Evaggelismos, Gouva, Aghios Ioannis, Neos Kosmos, Koukaki, Kynosargous, Fix, Ano Petralona, Kato Petralona, Rouf, Votanikos, Profitis Daniil, Akadimia Platonos, Kolonos, Kolokynthou, Attikis Square, Lofos Skouze, Sepolia, Kypseli, Aghios Meletios, Nea Kypseli, Gyzi, Polygono, Ampelokipoi, Panormou-Gerokomeio, Pentagono, Ellinorosson, Nea Filothei, Ano Kypseli, Tourkovounia-Lofos Patatsou, Lofos Elikonos, Koliatsou, Thymarakia, Kato Patisia, Treis Gefyres, Aghios Eleftherios, Ano Patisia, Kypriadou, Menidi, Prompona, Aghios Panteleimonas, Pangrati, Goudi, Vyronas, Ilisia'
# split the "regions" string correctly
athens_regions = regions.split(', ')
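One detail worth noting: the Wikipedia-derived string above lists “Pangrati” twice, so it may be worth deduplicating the list before geocoding, to avoid querying the same region twice. A minimal sketch, with a short toy list standing in for the full string:

```python
# toy list standing in for the full region string above
# (note that "Pangrati" repeats, just as it does in the real string)
regions = 'Omonoia, Syntagma, Pangrati, Kolonaki, Pangrati, Goudi'
athens_regions = regions.split(', ')

# dedupe while preserving order, so each region is geocoded only once
seen = set()
unique_regions = [r for r in athens_regions if not (r in seen or seen.add(r))]
print(unique_regions)
```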
#### in order to get the coordinates, we will pass every region to "geopy",
#### so we construct the following:
# for coordinates:
latitudes = []
longitudes = []
# unfortunately, "geopy" does not have coordinates for all regions
# (we will handle this with "try-except"), so we store such regions here:
regions_no_geodata = []
# these are the final regions, for which we finally have coordinates
final_athens_regions = []
# "geopy":
geolocator = Nominatim(user_agent="athens")
for region in athens_regions:
    try:
        location = geolocator.geocode('{}, Athens'.format(region))
        latitudes.append(location.latitude)
        longitudes.append(location.longitude)
        final_athens_regions.append(region)
    except AttributeError:
        regions_no_geodata.append(region)
print('In total, we will work with {} regions in Athens.'.format(len(final_athens_regions)))
# result:
In total, we will work with 52 regions in Athens.
# transform into a dataframe by combining regions with their coordinates
athens_df = pd.DataFrame(data={'Region':final_athens_regions, 'Latitude':latitudes, 'Longitude':longitudes})
athens_df
Let’s now see all these regions on the map, with the “folium” library:
# we will get coordinates for Athens and, then, see all of its regions on the map, with the help of "folium"
address = 'Athens'
geolocator = Nominatim(user_agent="athens")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Athens are Latitude: {} and Longitude: {}.'.format(latitude, longitude))
# create map of Athens using latitude and longitude values
athens_map = folium.Map(location=[latitude, longitude], zoom_start=10)
# add markers to map
for lat, lng, region in zip(athens_df['Latitude'], athens_df['Longitude'], athens_df['Region']):
    label = folium.Popup('{}'.format(region), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(athens_map)
athens_map
Next is the venues’ data acquisition from Foursquare. A developer account had already been created and credentials were provided. With the API requests, we are able to get all venues for every region, along with their coordinates and category type.
# we pass the credentials in order to acquire location data from Foursquare
CLIENT_ID = '..' # Foursquare ID
CLIENT_SECRET = '..' # Foursquare Secret
ACCESS_TOKEN = '..' # Foursquare Access Token
VERSION = '20180604'
venues_list = []
for name, lat, lng in zip(final_athens_regions, latitudes, longitudes):
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, 3000, 200)
    results = requests.get(url).json()["response"]['groups'][0]['items']
    venues_list.extend([
        [name, v['venue']['name'], v['venue']['id'],
         v['venue']['location']['lat'],
         v['venue']['location']['lng'],
         v['venue']['categories'][0]['name']
        ] for v in results])
nearby_venues = pd.DataFrame(venues_list)
nearby_venues.columns = ['Region', 'Venue', 'Venue ID',
'Venue Latitude', 'Venue Longitude',
'Venue Primary Category']
# let's see the venues in every region
nearby_venues
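Since the Foursquare call itself requires valid credentials, the parsing step can be exercised offline on a mock item that imitates the nested shape of the “explore” response (the venue below is made up):

```python
# a mock item imitating the nested structure of a Foursquare "explore" response item
# (the venue itself is invented for illustration)
items = [{'venue': {'name': 'Sample Museum', 'id': 'abc123',
                    'location': {'lat': 37.968, 'lng': 23.728},
                    'categories': [{'name': 'History Museum'}]}}]

# same extraction logic as in the loop above, for a single region name
rows = [['Makrygianni', v['venue']['name'], v['venue']['id'],
         v['venue']['location']['lat'],
         v['venue']['location']['lng'],
         v['venue']['categories'][0]['name']] for v in items]
print(rows)
```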
# now, let's see the 20 most popular types of venues
nearby_venues['Venue Primary Category'].value_counts().head(20)
Now, we will make a few decisions about the data.
- We will select the 20 most popular types of venues, so as to perform a representative analysis.
- We are not interested in Gyms (not a tourist attraction).
- We will combine “cafe” and “coffee shops” into one category.
- We will combine “Meze Restaurant” and “Greek Restaurant” into one category (as “meze” refers to traditional Greek food).
# construct "venues"
venues = list(nearby_venues['Venue Primary Category'].value_counts().head(20).index)
# delete "Gym"
venues.remove('Gym')
# select only rows with specific venues according to "venues"
specific_nearby_venues = nearby_venues[nearby_venues['Venue Primary Category'].isin(venues)].copy()
# reset the index of the dataframe, as we selected rows
specific_nearby_venues.reset_index(drop=True, inplace=True)
# convert "Coffee Shop" to "Cafe" and "Meze Restaurant" to "Greek Restaurant"
specific_nearby_venues['Venue Primary Category'] = np.where(specific_nearby_venues['Venue Primary Category']=='Coffee Shop', 'Café',
np.where(specific_nearby_venues['Venue Primary Category']=='Meze Restaurant','Greek Restaurant', specific_nearby_venues['Venue Primary Category']))
specific_nearby_venues
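As a side note, the nested “np.where” calls can also be expressed as a single dictionary passed to pandas’ “replace”, which scales better if more categories ever need merging. A small sketch on a toy dataframe:

```python
import pandas as pd

# toy dataframe standing in for specific_nearby_venues
df = pd.DataFrame({'Venue Primary Category': ['Coffee Shop', 'Meze Restaurant', 'Bar']})

# one mapping dict instead of nested np.where calls
mapping = {'Coffee Shop': 'Café', 'Meze Restaurant': 'Greek Restaurant'}
df['Venue Primary Category'] = df['Venue Primary Category'].replace(mapping)
print(df['Venue Primary Category'].tolist())
```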
We will perform one-hot encoding in order to transform the venue category of every row into features of 0s and 1s. This is a necessary step when working with categorical features.
# one hot encoding
athens_onehot = pd.get_dummies(specific_nearby_venues[['Venue Primary Category']], prefix="", prefix_sep="")
# add the region column back to the dataframe
athens_onehot['Region'] = specific_nearby_venues['Region']
# move region column as the first column
fixed_columns = [athens_onehot.columns[-1]] + list(athens_onehot.columns[:-1])
athens_onehot = athens_onehot[fixed_columns]
athens_onehot.head()
We see that we have multiple rows per region (equal to the number of venues in this region). So, for clustering, we have to group the data so as to have one row per region and compare their venues to find how similar they are.
# we can see how many venues of every type we have in every region
athens_grouped = athens_onehot.groupby('Region').sum().reset_index()
athens_grouped
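To make the one-hot-then-group step concrete, here is the same pipeline on a toy dataframe of three venues in two made-up regions:

```python
import pandas as pd

# toy stand-in for specific_nearby_venues: three venues, two regions
toy = pd.DataFrame({'Region': ['Plaka', 'Plaka', 'Gazi'],
                    'Venue Primary Category': ['Café', 'Bar', 'Café']})

# one-hot encode the category, then add the region back
onehot = pd.get_dummies(toy[['Venue Primary Category']], prefix='', prefix_sep='')
onehot['Region'] = toy['Region']

# one row per region, with per-category venue counts
grouped = onehot.groupby('Region').sum().reset_index()
print(grouped)
```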
Now, we construct a dataframe showing the 10 most common venue types for every region (not a necessary step for clustering, but useful for inspection):
# sort venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Region']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Region'] = athens_grouped['Region']
for ind in np.arange(athens_grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(athens_grouped.iloc[ind, :], num_top_venues)
venues_sorted
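The helper above can be sanity-checked on a single hand-made row (the counts below are invented):

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # same helper as above: skip the first (Region) entry, sort the counts
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# one made-up region row: 'Region' first, then venue counts
row = pd.Series({'Region': 'Plaka', 'Café': 9, 'Bar': 4, 'Hotel': 7})
top2 = list(return_most_common_venues(row, 2))
print(top2)
```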
Before clustering, we have to scale the data, because clustering is based on distances: otherwise, very popular venue types would “dominate” the less popular ones and distort the results.
athens_grouped_clustering = athens_grouped.drop('Region', axis=1)
athens_grouped_clustering_scaled = MinMaxScaler().fit_transform(athens_grouped_clustering.values)
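The effect of min-max scaling is easy to see on a small made-up matrix, where one column (say, cafés) has much larger counts than another (say, theatres): after scaling, both columns span exactly [0, 1], so neither dominates the distance computation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# made-up counts per region: column 0 = cafés (large), column 1 = theatres (small)
counts = np.array([[40.0, 2.0],
                   [10.0, 8.0],
                   [25.0, 5.0]])

# each column is rescaled independently to the [0, 1] range
scaled = MinMaxScaler().fit_transform(counts)
print(scaled)
```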
Clustering, as stated in the beginning, is an Unsupervised Learning technique: it is not based on labeled data, and we do not know in advance the number of clusters that would give the best performance.
So, we will try to find the best number of clusters based on inertia (a measure of how internally coherent clusters are) and the “elbow method”.
# try every number of clusters from 1 to 40 with a "for" loop,
# compute the inertia and store the results
number_clusters = 41
kmeans_tests = [KMeans(n_clusters=cluster, init='k-means++', n_init=10, random_state=0) for cluster in range(1, number_clusters)]
scores = [kmeans_tests[test].fit(athens_grouped_clustering_scaled).inertia_ for test in range(len(kmeans_tests))]
plt.plot(range(1, number_clusters), scores)
plt.xlabel('# Clusters')
plt.ylabel('Inertia')
plt.show()
According to the “elbow method”, an appropriate number of clusters lies between 4 and 10: beyond that range, inertia no longer decreases significantly as the number of clusters increases (i.e. performance barely improves), so there is no point in working with more clusters.
However, for simplicity reasons, let’s set the number of clusters to 4.
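As a cross-check on the elbow choice, one could also scan the silhouette score over a range of k and pick the peak. A minimal sketch on synthetic data (four tight 2D blobs standing in for the scaled Athens matrix):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic stand-in for athens_grouped_clustering_scaled: four tight blobs
rng = np.random.default_rng(0)
centers = [(0, 0), (1, 1), (0, 1), (1, 0)]
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(20, 2)) for c in centers])

# silhouette score for each candidate k (needs k >= 2)
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the true number of blobs
```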
# set number of clusters
kclusters = 4
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, init='k-means++', n_init=10, random_state=0)
kmeans.fit(athens_grouped_clustering_scaled)
clusters = kmeans.predict(athens_grouped_clustering_scaled)
The dataset has many features (i.e. every region has multiple venues) and it is not easy to plot the data.
For this reason, we introduce PCA (Principal Component Analysis), another Unsupervised Learning technique, which reduces the number of features by constructing new ones that retain as much of the original variance as possible. In this way, we can also view the data in two dimensions.
# we specify 2 principal components (as new features) to keep the model simple and, also, view the results in 2D
pca = PCA(n_components=2)
data_reduced = pca.fit_transform(athens_grouped_clustering_scaled)
# construct a dataframe with PC1, PC2 and the assigned cluster for every region
reduced_df = pd.DataFrame(data_reduced, columns=['PC1','PC2'])
reduced_df['cluster'] = clusters
# plot the clusters
fig = plt.figure(figsize=(7,5))
sns.scatterplot(data=reduced_df, x='PC1', y='PC2', hue='cluster', palette='bright')
We see that 4 clusters are formed (as specified) and they are reasonably distinct from each other.
# compute the variance ratio explained by PCA:
pca.explained_variance_ratio_
# result:
array([0.36809032, 0.23998701])
In total, the two new features explain about 61% of the variance in the original data. This is reasonable, considering that each region is described by 17 venue categories.
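If more explained variance were needed, one could fit PCA without limiting the number of components and read off how many are required from the cumulative ratio. A sketch on a random stand-in matrix with the same 52×17 shape as the Athens data:

```python
import numpy as np
from sklearn.decomposition import PCA

# random stand-in with the same shape as the Athens matrix: 52 regions x 17 categories
rng = np.random.default_rng(0)
X = rng.normal(size=(52, 17))

# fit all components and accumulate the explained-variance ratios
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# smallest number of components reaching ~90% of the variance
n_for_90 = int(np.argmax(cumulative >= 0.90)) + 1
print(n_for_90)
```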
Another metric for clustering is the Silhouette Score, which is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It ranges from -1 (worst) to 1 (best), while a score of 0 indicates overlapping clusters.
# compute the silhouette score
silhouette_score(athens_grouped_clustering_scaled, clusters)
# result:
0.2677782885403751
For the final part of our analysis, we have 1) Parallel Plots and 2) Folium Map.
The parallel plots show the similarities (in terms of features) between regions of the same cluster, while the folium map places every region on the map, colored according to its cluster. To prepare for both, we attach each region’s cluster label to the scaled data (the parallel-plot results will be discussed later).
# first, define the same colors, in order to be aligned with the previous cluster plot
colors_d = {0:'#0529f7', 1:'#f7b205', 2:'#05f709', 3:'#ff0000'}
# drop the region name
athens_grouped_scaled = pd.DataFrame(athens_grouped_clustering_scaled, columns=athens_grouped.drop(['Region'], axis=1).columns)
# add the cluster labels
athens_grouped_scaled['cluster'] = clusters
# define the parallel plots for every cluster
plt.figure(figsize=(24,2))
parallel_coordinates(athens_grouped_scaled[athens_grouped_scaled['cluster']==0], 'cluster', color=colors_d[0])
plt.figure(figsize=(24,2))
parallel_coordinates(athens_grouped_scaled[athens_grouped_scaled['cluster']==1], 'cluster', color=colors_d[1])
plt.figure(figsize=(24,2))
parallel_coordinates(athens_grouped_scaled[athens_grouped_scaled['cluster']==2], 'cluster', color=colors_d[2])
plt.figure(figsize=(24,2))
parallel_coordinates(athens_grouped_scaled[athens_grouped_scaled['cluster']==3], 'cluster', color=colors_d[3])
These plots (which look like spaghetti!) might seem very messy but they can provide some interesting insights.
Every plot represents a cluster (see label at the top-right) and shows how many venues of all types each cluster has, according to the density of the lines.
- Cluster 0 (blue) consists mainly of Bars, Falafel Restaurants, Kafenios (traditional Greek coffee houses), Historic Sites and some Theaters. These seem to be regions in the centre of Athens, with its historic places and monuments.
- Cluster 1 (orange) seems balanced overall, having a bit of everything but no Historic sites or Hotels. It seems to refer to more suburban areas rather than the city centre.
- Cluster 2 (green) seems similar to cluster 0 (blue) but with more Historic Sites, Ice Cream Shops, Plazas, Pizza Places and Souvlaki Shops, and fewer Kafenios, Theaters and Movie Theaters. This also seems to be in the centre of Athens.
- Cluster 3 (red) mainly consists of Restaurants, Food places and Theaters, probably outside the centre. It resembles cluster 1 (orange) but with fewer options.
Finally, let’s see the clusters on the map and check our thoughts.
# add clustering labels in order to combine names of regions, coordinates and cluster labels
venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
# merge data with venues to add latitude/longitude for each region
athens_merged = athens_df.join(venues_sorted.set_index('Region'), on='Region')
athens_merged.head()
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(athens_merged['Latitude'], athens_merged['Longitude'], athens_merged['Region'], athens_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=colors_d[cluster],
        fill=True,
        fill_color=colors_d[cluster],
        fill_opacity=0.7).add_to(map_clusters)
# see the clusters on the map!
map_clusters
Indeed, clusters 0 (blue) and 2 (green) are located in the centre of Athens, confirming what their parallel plots indicated.