A Delightful Tale of Two Cities! (Storytelling Using Data!)

12 min read · Oct 16, 2020

An analysis comparing two global, multicultural, and cosmopolitan cities found at the heart of two European nations

Keywords: Data science, Machine learning, Python, Web scraping, Foursquare

Image credit: https://thecamerasview.files.wordpress.com/2015/09/dscf3183.jpg

Image Credit: https://www.widest.com/wp-content/uploads/Aerial-view-of-London-at-night.jpg

This post covers the methodology and analysis used for the final capstone project in the IBM Data Science Professional course. The detailed report, code and results can be found on GitHub and are linked towards the end of the post.

1. Introduction

Picking between London and Paris is always a hard decision, as both are truly global, multicultural, and cosmopolitan cities found at the heart of two European nations. Along with being two of Europe’s most important diplomatic centres, they are major centres for finance, commerce, sciences, fashion, arts, culture and gastronomy. Both London (capital of the United Kingdom) and Paris (capital of France) have a rich history and are two of the most visited and sought-after cities in Europe. London is the largest city within the UK and stands on the River Thames in South East England. Paris, on the other hand, is located in the north-central part of France. Like London, it also stands along a river, the Seine.

Our goal is to compare the two cities to see how similar or dissimilar they are. Such techniques allow users to identify similar neighborhoods among cities based on the amenities or services offered locally, and thus can help in understanding local area activities, where the hubs of different activities are, how citizens experience the city, and how they utilize its resources.

What kind of clientele would benefit from such an analysis?

A potential job seeker with transferable skills may wish to search for jobs in select cities which provide the most suitable match for their qualifications and experience in terms of salaries, social benefits, or even in terms of a culture fit for expats.
Further, a person buying or renting a home in a new city may want to look for recommendations for locations in the city similar to other cities known to them.
Similarly, a large corporation looking to expand its locations to other cities might benefit from such an analysis.
Many within-city urban planning computations might also benefit from modelling a city’s relationship to other cities.

Business Problem

The aim is to help tourists choose their destinations depending on the experiences that the neighborhoods have to offer and what they would want to have. This model also helps people make decisions if they are thinking about migrating to London or Paris or even if they wish to relocate neighborhoods within the city. Our findings will help stakeholders make informed decisions and address any concerns they have, including the different kinds of cuisines, provision stores and what the city has to offer.

2. Data Description

We require geographical location data for both London and Paris. Postal codes in each city serve as a starting point. Using Postal codes, we can find out the neighborhoods, boroughs, venues and their most popular venue categories.

London

To derive our solution, we scrape our data from https://en.wikipedia.org/wiki/List_of_areas_of_London

This Wikipedia page has information about all the neighbourhoods; we limit it to London.

borough: Name of Neighborhood
town: Name of the borough
post_code: Postal codes for London.

This Wikipedia page lacks information about geographical locations. To solve this problem, we use the ArcGIS API.

ArcGIS API

ArcGIS Online enables you to connect people, locations, and data using interactive maps. Work with smart, data-driven styles and intuitive analysis tools that deliver location intelligence. Share your insights with the world or specific groups.

More specifically, we use ArcGIS to get the geographical locations of the neighbourhoods of London. Adding the following columns to our initial data set prepares our data.

latitude: Latitude for Neighborhood
longitude: Longitude for Neighborhood

Paris

To derive our solution, we leverage JSON data available at https://www.data.gouv.fr/fr/datasets/r/e88c6fda-1d09-42a0-a069-606d3259114e

The JSON file has data about all the neighbourhoods in France; we limit it to Paris.

postal_code: Postal codes for France
nom_comm: Name of Neighborhoods in France
nom_dept: Name of the boroughs, equivalent to towns in France
geo_point_2d: Tuple containing the latitude and longitude of the Neighborhoods.

Foursquare API Data

We will need data about the different venues in the neighbourhoods of each borough. To gain that information, we will use Foursquare location information. Foursquare is a location data provider with information about all manner of venues and events within an area of interest. Such information includes venue names, locations, menus and even photos. As such, the Foursquare location platform will be used as the sole source of venue data, since all of the required venue information can be obtained through its API.

After finding the list of neighbourhoods, we then connect to the Foursquare API to gather information about venues inside each neighbourhood. For each neighbourhood, we have chosen the radius to be 500 meters.

The data retrieved from Foursquare contains information on venues within a specified distance of the longitude and latitude of the postcodes. The information obtained per venue is as follows:

Neighbourhood: Name of the Neighborhood
Neighbourhood Latitude: Latitude of the Neighborhood
Neighbourhood Longitude: Longitude of the Neighborhood
Venue: Name of the Venue
Venue Latitude: Latitude of Venue
Venue Longitude: Longitude of Venue
Venue Category: Category of Venue

Based on all the information collected for both London and Paris, we have sufficient data to build our model. We cluster the neighbourhoods together based on similar venue categories. We then present our observations and findings. Using this data, our stakeholders can make the necessary decisions.

3. Methodology

We will be creating our model with the help of Python, so we start by importing all the required packages. The code is available on GitHub to follow along.

import pandas as pd
import requests
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium
from sklearn.cluster import KMeans

Package breakdown:

pandas: Collecting and manipulating the JSON and HTML data, then analysing it
requests: Handling HTTP requests
numpy: Numerical array operations
matplotlib: Colouring the markers on the generated maps
folium: Generating maps of London and Paris
sklearn: Importing the K-Means machine learning model

The approach taken here is to explore each of the cities individually, plot the map to show the neighbourhoods being considered and then build our model by clustering all of the similar neighbourhoods together and finally plot the new map with the clustered neighbourhoods. We draw insights and then compare and discuss our findings.

Data Collection

In the data collection stage, we begin with collecting the required data for the cities of London and Paris. We need data that has the postal codes, neighborhoods and boroughs specific to each of the cities.

To collect data for London, using pandas, we scrape the List of areas of London Wikipedia page to take the 2nd table:

The data looks like this:

To collect data for Paris, we download the JSON file containing all the postal codes of France from https://www.data.gouv.fr/fr/datasets/r/e88c6fda-1d09-42a0-a069-606d3259114e

Using Pandas, we load the table after reading the JSON file:

List of districts in the city of Paris (Districts 1–4 are combined as 1).

Data Pre processing

For London, we replace the spaces with underscores in the column titles. The borough column has numbers within square brackets, which we remove using:

london data preprocess
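The two London clean-up steps can be sketched as follows. This is a minimal illustration rather than the project's actual code: the sample rows and the exact Wikipedia column names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical rows in the shape of the scraped table (values are illustrative).
df = pd.DataFrame({
    "Location": ["Abbey Wood", "Acton"],
    "London borough": ["Bexley, Greenwich [7]", "Ealing, Hammersmith and Fulham[8]"],
    "Post town": ["LONDON", "LONDON"],
    "Postcode district": ["SE2", "W3, W4"],
})

# Replace spaces in the column titles with underscores.
df.columns = df.columns.str.replace(" ", "_")

# Strip bracketed footnote numbers such as "[7]" from the borough column.
df["London_borough"] = (
    df["London_borough"]
    .str.replace(r"\s*\[\d+\]", "", regex=True)
    .str.strip()
)
```

After this step the borough names no longer carry Wikipedia's footnote markers, so later grouping and matching on the borough is not thrown off by them.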

For Paris, we break down each of the nested fields and create the dataframe that we need:
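One possible way to flatten the nested fields is pandas' json_normalize. The sketch below assumes each record keeps its attributes under a "fields" key; that layout, the helper name flatten_fields and the sample records are assumptions for illustration, not taken from the project code.

```python
import pandas as pd

# Hypothetical records mimicking the assumed layout of the downloaded JSON.
sample_records = [
    {"recordid": "a1", "fields": {"postal_code": "75001",
                                  "nom_comm": "PARIS-1ER-ARRONDISSEMENT",
                                  "nom_dept": "PARIS",
                                  "geo_point_2d": [48.8626, 2.3363]}},
    {"recordid": "b2", "fields": {"postal_code": "69001",
                                  "nom_comm": "LYON-1ER-ARRONDISSEMENT",
                                  "nom_dept": "RHONE",
                                  "geo_point_2d": [45.7699, 4.8292]}},
]

WANTED = ["postal_code", "nom_comm", "nom_dept", "geo_point_2d"]


def flatten_fields(records):
    """Flatten the nested 'fields' dictionaries into ordinary columns."""
    flat = pd.json_normalize(records)            # nested keys become 'fields.<name>'
    flat = flat[["fields." + name for name in WANTED]]
    flat.columns = WANTED                        # drop the 'fields.' prefix
    return flat


df_sample = flatten_fields(sample_records)
```

Lists such as geo_point_2d are left intact in a single cell, which matches the later step of pulling latitude and longitude out of that column.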

Feature Selection

For both of our datasets, we need only the borough, neighbourhood, postal codes and geolocations (latitude and longitude). So we end up selecting the columns that we need by:

feature selection

Feature Engineering

Both of our Datasets contain information related to all the cities in the country. We can narrow down and further process the data by selecting only the neighborhoods of ‘London’ and ‘Paris’.

feature engineering
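As a rough sketch of the selection and filtering steps for London, assuming the cleaned column names London_borough, Post_town and Postcode_district and using made-up sample rows:

```python
import pandas as pd

# Hypothetical cleaned table; the Dial_code column stands in for columns we do not need.
london_raw = pd.DataFrame({
    "Location": ["Abbey Wood", "Bickley", "Acton"],
    "London_borough": ["Greenwich", "Bromley", "Ealing"],
    "Post_town": ["LONDON", "BROMLEY", "LONDON"],
    "Postcode_district": ["SE2", "BR1", "W3"],
    "Dial_code": ["020", "020", "020"],
})

# Feature selection: keep only the columns used later and give them short names.
df1 = london_raw[["London_borough", "Post_town", "Postcode_district"]].rename(
    columns={
        "London_borough": "borough",
        "Post_town": "town",
        "Postcode_district": "post_code",
    }
)

# Feature engineering: keep only the rows whose post town is London.
df1 = df1[df1["town"].str.contains("LONDON")].reset_index(drop=True)
```

The same pattern would apply to Paris by filtering nom_dept for 'PARIS' instead.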

Looking over our London dataset, we can see that it lacks geolocation data. We need to obtain the missing coordinates for our neighbourhoods, which we do by leveraging the ArcGIS API. With its help, we can get the latitude and longitude for each entry in our London neighbourhood data.

Defining London ArcGIS geocode function to return latitude and longitude.
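A geocode helper along these lines might look like the sketch below. It relies on the geocode function from the ArcGIS API for Python; the address template, the tuple return type and the optional geocode_fn parameter (added so the helper can be tried with a stand-in geocoder) are assumptions for illustration.

```python
def get_x_y_uk(postcode, geocode_fn=None):
    """Return (latitude, longitude) for a London postcode, or (None, None)."""
    if geocode_fn is None:
        # Imported lazily so the sketch can be tried without the arcgis package.
        from arcgis.gis import GIS
        from arcgis.geocoding import geocode
        GIS()  # anonymous connection; makes the default geocoder available
        geocode_fn = geocode
    # The address template is an assumption about how to scope the lookup.
    results = geocode_fn(address="{}, London, England, GBR".format(postcode))
    if not results:
        return (None, None)
    location = results[0]["location"]  # ArcGIS reports x = longitude, y = latitude
    return (location["y"], location["x"])


# Stand-in geocoder with made-up coordinates, used only to show the call shape.
def fake_geocode(address):
    return [{"location": {"x": -0.12, "y": 51.49}}]


lat, lng = get_x_y_uk("SE2", geocode_fn=fake_geocode)
```

Called without geocode_fn, the helper would contact the ArcGIS service, so it needs the arcgis package and network access.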

Passing postal codes of London to get the geographical coordinates

coordinates_latlng_uk = geo_coordinates_uk.apply(lambda x: get_x_y_uk(x))

We proceed by merging our source data with the geographical coordinates to make our dataset ready for the next stage:

london_merged = pd.concat([df1,lat_uk.astype(float), lng_uk.astype(float)], axis=1)
london_merged.columns= ['borough','town','post_code','latitude','longitude']
london_merged

The resulting dataframe after feature engineering on the London dataset

As for our Paris dataset, we don’t need to get the Geo coordinates using an external data source or collect it with the ArcGIS API call since we already have it stored in the geo_point_2d column as a tuple in the df_paris dataframe.

We just need to extract the latitude and longitude from the column:

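# paris_latlng is assumed to hold the geo_point_2d values as strings like "[lat, lng]"

# Latitude: the text before the comma, without the opening bracket
paris_lat = paris_latlng.apply(lambda x: x.split(',')[0])
paris_lat = paris_lat.apply(lambda x: x.lstrip('['))

# Longitude: the text after the comma, without the closing bracket
paris_lng = paris_latlng.apply(lambda x: x.split(',')[1])
paris_lng = paris_lng.apply(lambda x: x.rstrip(']'))

# Convert both to floats and wrap them in single-column dataframes
paris_geo_lat = pd.DataFrame(paris_lat.astype(float))
paris_geo_lat.columns = ['Latitude']

paris_geo_lng = pd.DataFrame(paris_lng.astype(float))
paris_geo_lng.columns = ['Longitude']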

We then create our Paris dataset with the required information:

paris_combined_data = pd.concat([df_paris.drop('geo_point_2d', axis=1), paris_geo_lat, paris_geo_lng], axis=1)
paris_combined_data

The resulting dataframe after feature engineering on the Paris dataset

Note: Both the datasets have been properly processed and formatted. Since the same steps are applied to both the datasets of London and Paris, we will be discussing the code for only the London dataset for simplicity.

Visualizing the Neighborhoods of London and Paris

London:

Now that our datasets are ready, using the Folium package, we can visualize the maps of London and Paris with the neighbourhoods that we collected.

To proceed with the next part, we need to define Foursquare API credentials.

Using Foursquare API, we are able to get the venue and venue categories around each neighbourhood in London.

We define a function to get the nearby venues in each neighbourhood. This will help us get the venue categories, which are important for our analysis.

Function to get the venues of both London and Paris neighbourhoods
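A venue-collection function of this kind could be sketched as below. It assumes Foursquare's legacy v2 venues/explore endpoint and its response layout, which may no longer be available to new accounts; the credentials are placeholders, and the function names and the fetch parameter are hypothetical.

```python
import pandas as pd
import requests

CLIENT_ID = "YOUR_CLIENT_ID"          # placeholder credential
CLIENT_SECRET = "YOUR_CLIENT_SECRET"  # placeholder credential
VERSION = "20180605"                  # API version date (assumed value)
LIMIT = 100                           # maximum venues returned per neighbourhood

COLUMNS = [
    "Neighbourhood", "Neighbourhood Latitude", "Neighbourhood Longitude",
    "Venue", "Venue Latitude", "Venue Longitude", "Venue Category",
]


def fetch_explore_items(lat, lng, radius=500):
    """Query the assumed v2 explore endpoint and return its list of venue items."""
    params = {
        "client_id": CLIENT_ID, "client_secret": CLIENT_SECRET, "v": VERSION,
        "ll": "{},{}".format(lat, lng), "radius": radius, "limit": LIMIT,
    }
    response = requests.get(
        "https://api.foursquare.com/v2/venues/explore", params=params, timeout=30
    )
    response.raise_for_status()
    return response.json()["response"]["groups"][0]["items"]


def get_nearby_venues(names, latitudes, longitudes, radius=500,
                      fetch=fetch_explore_items):
    """Build one row per venue found within `radius` metres of each neighbourhood."""
    rows = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        for item in fetch(lat, lng, radius):
            venue = item["venue"]
            categories = venue.get("categories") or [{}]  # some venues have none
            rows.append((
                name, lat, lng,
                venue["name"],
                venue["location"]["lat"],
                venue["location"]["lng"],
                categories[0].get("name"),
            ))
    return pd.DataFrame(rows, columns=COLUMNS)
```

Keeping the HTTP call in its own function means the row-building logic can be checked with a fake fetch function before spending any API quota.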

Getting the venues in London

top 100 venues list in london

Sampling our data

Wow, we have scraped together 10629 records for venues. This will definitely make the clustering interesting.

We now need to see how many unique venue categories there are for further processing:

getting frequent venues list

There are 293 unique venue categories.

Seeing 293 categories just goes to show how diverse and interesting the place is.

We need to encode our venue categories to get a better result from our clustering.

Since we are trying to find out what are the different kinds of venue categories present in each neighbourhood and then calculate the top 10 common venues to base our similarity on, we use the One Hot Encoding to work with our categorical data type of the venue categories. This helps to convert the categorical data into numeric data.

We won’t be using label encoding in this situation since label encoding might cause our machine learning model to have a bias or a sort of ranking which we are trying to avoid by using One Hot Encoding.

We perform one-hot encoding and then calculate the mean of the grouped venue categories for each of the neighbourhoods.

Data with the mean values for each of the Neighborhoods after one-hot encoding
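The encoding and averaging step can be illustrated with a toy example. The venue rows below are made up; only the column names follow the ones listed earlier.

```python
import pandas as pd

# Toy venue list: two neighbourhoods, two categories.
venues = pd.DataFrame({
    "Neighbourhood": ["Bexley", "Bexley", "Ealing", "Ealing"],
    "Venue Category": ["Pub", "Park", "Pub", "Pub"],
})

# One-hot encode the category: one 0/1 column per category, one row per venue.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# The mean of each 0/1 column per neighbourhood is that category's share of venues.
grouped = onehot.groupby("Neighbourhood").mean().reset_index()
```

In this toy data, half of Bexley's venues are pubs and half are parks, while all of Ealing's venues are pubs, so the grouped rows hold 0.5/0.5 and 0.0/1.0 respectively.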

Top Venues in the Neighborhoods

In our next step, we need to sort, rank and label the top venue categories in each neighbourhood.

Let’s define a function to get the top venue categories in the neighbourhood

Top Common venues in each Neighborhood ranked.
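A ranking function of this kind might be sketched as follows. The function name, the output column labels and the frequency values are assumptions for illustration.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    """Return the category names with the highest frequencies in one grouped row."""
    frequencies = row.iloc[1:].astype(float)  # skip the neighbourhood name
    ranked = frequencies.sort_values(ascending=False)
    return ranked.index.values[:num_top_venues]


# Toy grouped frequencies (made-up values).
grouped = pd.DataFrame({
    "Neighbourhood": ["Bexley", "Ealing"],
    "Park": [0.2, 0.1],
    "Pub": [0.5, 0.3],
    "Cafe": [0.3, 0.6],
})

num_top_venues = 2
columns = ["Neighbourhood"] + [
    "Most Common Venue #{}".format(rank + 1) for rank in range(num_top_venues)
]
rows = [
    [row["Neighbourhood"], *return_most_common_venues(row, num_top_venues)]
    for _, row in grouped.iterrows()
]
venues_sorted = pd.DataFrame(rows, columns=columns)
```

Each row of venues_sorted then lists a neighbourhood followed by its categories in descending order of frequency.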

4. Model Building — K Means

Moving on to the most exciting part: model building! We will be using the K-Means clustering machine learning algorithm to cluster similar neighbourhoods together. There are many different numbers of clusters that we could select; we go with 5 clusters to keep the model as optimized as possible.

Our model has labelled each of the neighbourhoods, so we add the labels to our dataset.
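The clustering and labelling steps can be sketched on toy data as below. The frequency table is made up, and random_state and n_init are assumed settings chosen only to make the example repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy grouped frequencies: neighbourhoods A and B are nearly identical.
grouped = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C", "D", "E", "F"],
    "Pub":    [1.00, 0.98, 0.0, 0.0, 0.0, 0.0],
    "Park":   [0.00, 0.02, 1.0, 0.0, 0.0, 0.0],
    "Cafe":   [0.00, 0.00, 0.0, 1.0, 0.0, 0.0],
    "Bistro": [0.00, 0.00, 0.0, 0.0, 1.0, 0.0],
    "Museum": [0.00, 0.00, 0.0, 0.0, 0.0, 1.0],
})

k_num_clusters = 5

# Cluster on the frequency columns only; the name is not a feature.
features = grouped.drop(columns="Neighbourhood")
kmeans = KMeans(n_clusters=k_num_clusters, random_state=0, n_init=10).fit(features)

# Attach the label assigned to each neighbourhood.
labelled = grouped.copy()
labelled.insert(0, "Cluster Labels", kmeans.labels_)
```

Note that scikit-learn numbers the clusters from 0, so five clusters produce the labels 0 to 4. In this toy data the near-identical neighbourhoods A and B end up sharing a label.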

We then join London_merged dataframe with our neighborhood venues sorted dataframe to add latitude & longitude for each of the neighbourhoods to prepare it for visualization.

Visualizing the clustered Neighborhoods

Our data is processed, and the missing data has been collected and compiled. The model is built. All that remains is to see the clustered neighbourhoods on the map. Again, we use the Folium package to do so.
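One way to draw such a map is to give each cluster its own colour and add one circle marker per neighbourhood. The sketch below is an assumption about how this could be done rather than the project's code; the helper names are hypothetical, while the column names borough, latitude, longitude and Cluster Labels follow the dataframes described above.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors


def cluster_colours(k):
    """Return k hex colours spread evenly over matplotlib's rainbow colormap."""
    return [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]


def draw_cluster_map(df, centre, k):
    """Plot every neighbourhood in df as a circle coloured by its cluster label."""
    import folium  # imported here so the colour helper works without folium

    palette = cluster_colours(k)
    cluster_map = folium.Map(location=centre, zoom_start=11)
    for _, row in df.iterrows():
        label = int(row["Cluster Labels"])  # labels are expected to run 0..k-1
        folium.CircleMarker(
            location=[row["latitude"], row["longitude"]],
            radius=5,
            popup="{} (cluster {})".format(row["borough"], label),
            color=palette[label],
            fill=True,
            fill_color=palette[label],
            fill_opacity=0.7,
        ).add_to(cluster_map)
    return cluster_map


palette = cluster_colours(5)
```

In a notebook, draw_cluster_map(london_data_nonan, [51.5074, -0.1278], 5) would return a map object that renders inline; the centre coordinates here are an approximate value for central London.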

We drop all rows whose cluster label is NaN (Not a Number) to prevent data skew:

london_data_nonan = london_data.dropna(subset=['Cluster Labels'])

Map of clustered neighbourhoods of London

Map of clustered neighbourhoods of Paris

Examining our Clusters

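We could examine our clusters by filtering on the Cluster Labels column. Note that scikit-learn's KMeans numbers its labels from 0, so with five clusters the raw labels run from 0 to 4; the filter values should be adjusted accordingly if the labels have not been shifted by one beforehand.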

Cluster 1

london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 1, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Cluster 2

london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 2, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Cluster 3

london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 3, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Cluster 4

london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 4, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Cluster 5

london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 5, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Results and Discussion

The neighbourhoods of London are very multicultural. There are a lot of different cuisines, including Indian, Italian, Turkish and Chinese. London seems to take a step further in this direction by having a lot of restaurants, bars, juice bars, coffee shops, fish and chip shops and breakfast spots. It has a lot of shopping options too, with flea markets, flower shops, fish markets, fishing stores and clothing stores. The main modes of transport seem to be buses and trains. For leisure, the neighbourhoods have lots of parks, golf courses, zoos, gyms and historic sites. Overall, the city of London offers a multicultural, diverse and certainly entertaining experience.

Paris is relatively small geographically. It has a wide variety of cuisines and eateries, including French, Thai, Cambodian, Asian and Chinese. There are a lot of hangout spots, including many restaurants, bars and bistros. Public transport in Paris includes buses, bikes, boats and ferries. For leisure and sightseeing, there are a lot of plazas, trails, parks, historic sites, clothing shops, art galleries and museums. Overall, Paris seems like a relaxing vacation spot with a mix of lakes, historic spots and a wide variety of cuisines to try out.

Conclusion

The purpose of this project was to explore the cities of London and Paris and see how attractive they are to potential tourists and migrants. We explored both cities based on their postal codes, extracted the common venues present in each of the neighbourhoods, and finally clustered similar neighbourhoods together.

We could see that the neighbourhoods in both cities have a wide variety of experiences to offer, each unique in its own way. The cultural diversity is quite evident, which also gives a sense of inclusion.

Both Paris and London seem to offer a vacation stay or a romantic getaway with a lot of places to explore, beautiful landscapes, amazing food and a wide variety of culture. Overall, it’s up to the stakeholders to decide which experience they would prefer and which would be more to their liking.

The detailed code is available on GitHub. Thanks for reading!