Creating Visualizations with SQL query Data
Introduction
Hello Everyone!
In today’s post, we will learn how to create a database and how to use SQL query! Let’s begin
Create a Database
We will begin by importing two necessary packages: pandas and sqlite3. Both of these will allow us to create a SQL query.
import pandas as pd
import sqlite3
Next, we will read in our data! We have three CSV files we will read in, and they will all be their own separate table.
Let’s read in temps_stacked.csv, which contains the columns ID, Year, Month, and temp.
temps = pd.read_csv("temps_stacked.csv")
temps.head()
ID | Year | Month | Temp | |
---|---|---|---|---|
0 | ACW00011604 | 1961 | 1 | -0.89 |
1 | ACW00011604 | 1961 | 2 | 2.36 |
2 | ACW00011604 | 1961 | 3 | 4.72 |
3 | ACW00011604 | 1961 | 4 | 7.73 |
4 | ACW00011604 | 1961 | 5 | 11.28 |
Next, let’s read in countries.csv, which contains the country’s abbreviation and the country name.
countries = pd.read_csv('countries.csv')
#renames columns to get rid of white spaces
countries = countries.rename(columns = {"FIPS 10-4": "FIPS-10-4"})
#renames Name column to Country
countries = countries.rename(columns = {"Name": "Country"})
countries.head()
FIPS-10-4 | ISO 3166 | Country | |
---|---|---|---|
0 | AF | AF | Afghanistan |
1 | AX | - | Akrotiri |
2 | AL | AL | Albania |
3 | AG | DZ | Algeria |
4 | AQ | AS | American Samoa |
Lastly, we will read in stations-metadata.csv, which contains ID, Latitude, Longitude, the station level, and the name of the city.
stations = pd.read_csv('station-metadata.csv')
stations.head()
ID | LATITUDE | LONGITUDE | STNELEV | NAME | |
---|---|---|---|---|---|
0 | ACW00011604 | 57.7667 | 11.8667 | 18.0 | SAVE |
1 | AE000041196 | 25.3330 | 55.5170 | 34.0 | SHARJAH_INTER_AIRP |
2 | AEM00041184 | 25.6170 | 55.9330 | 31.0 | RAS_AL_KHAIMAH_INTE |
3 | AEM00041194 | 25.2550 | 55.3640 | 10.4 | DUBAI_INTL |
4 | AEM00041216 | 24.4300 | 54.4700 | 3.0 | ABU_DHABI_BATEEN_AIR |
We create the database by opening a connection using sqlite3.connect. We then read in our three tables before we close the connection.
conn = sqlite3.connect("temps.db")
#creating database with three tables
temps.to_sql("temperatures", conn, if_exists = "replace", index=False)
countries.to_sql("countries", conn, if_exists = "replace", index=False)
stations.to_sql("stations", conn, if_exists = "replace", index=False)
#closing connection
conn.close()
Congradulations! You’ve just created database! Now, let’s create a SQL query function using data from our database.
Write a Query Function
Let’s write our query function. We begin by naming the function and inputting our desired parameters which, in this case, includes our desired country, our desired year range, and desired month. Please see the function docstring for additional information.
def query_climate_database(country, year_begin, year_end, month):
'''Returns a data frame containing the name, coordinates, country, year, month, and temperature
for a specified country, year range, and month
@param country: string; name of a country for which data should be returned
@param year_begin: integer; earliest year for which data should be returned
@param: year_end: integer; latest year for which data should be returned
@param: month: integer; month for which data should be returned
@return: dataframe containing data that matches specifications
'''
conn = sqlite3.connect("temps.db")
#selects name, latitude, and longitude from stations, country from countries, and
#year, month, and temp from temperatures
#joins stations, temperatures, and countries dataframes
cmd = "SELECT S.name, S.latitude, S.longitude, C.country, T.year, T.month, T.temp \
FROM temperatures T \
LEFT JOIN stations S ON T.id = S.id \
LEFT JOIN countries C ON SUBSTRING(T.id,1,2) = C.'FIPS-10-4' \
WHERE (year >=" + str(year_begin) + ") AND (year <=" + str(year_end) + ") AND (month=" + str(month) + ") AND (country= \"" + str(country) + "\")"
df = pd.read_sql(cmd, conn)
#closing connection
conn.close()
#returns dataframe with specifications in place
return df
We create a SQL query, turn it into a dataframe, and return that dataframe. It should tell us the the name, latittude, longitude, country, year, month, and temperature for a specified country, in a specified year range, for a specified month.
Let’s test it out to see if it works. We will look at the data from India for the month of January from years 1988 to 2020.
query_climate_database(country="India",
year_begin = 1980,
year_end = 2020,
month = 1)
NAME | LATITUDE | LONGITUDE | Country | Year | Month | Temp | |
---|---|---|---|---|---|---|---|
0 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1980 | 1 | 23.48 |
1 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1981 | 1 | 24.57 |
2 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1982 | 1 | 24.19 |
3 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1983 | 1 | 23.51 |
4 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1984 | 1 | 24.81 |
... | ... | ... | ... | ... | ... | ... | ... |
3147 | DARJEELING | 27.050 | 88.270 | India | 1983 | 1 | 5.10 |
3148 | DARJEELING | 27.050 | 88.270 | India | 1986 | 1 | 6.90 |
3149 | DARJEELING | 27.050 | 88.270 | India | 1994 | 1 | 8.10 |
3150 | DARJEELING | 27.050 | 88.270 | India | 1995 | 1 | 5.60 |
3151 | DARJEELING | 27.050 | 88.270 | India | 1997 | 1 | 5.70 |
3152 rows × 7 columns
Success! Let’s test it out again, but we want to see the data for China in the month of February from 2000 to 2010.
query_climate_database(country="China",
year_begin = 2000,
year_end = 2010,
month = 2)
NAME | LATITUDE | LONGITUDE | Country | Year | Month | Temp | |
---|---|---|---|---|---|---|---|
0 | KING'S_PARK | 22.333 | 114.167 | China | 2000 | 2 | 15.22 |
1 | KING'S_PARK | 22.333 | 114.167 | China | 2001 | 2 | 18.37 |
2 | KING'S_PARK | 22.333 | 114.167 | China | 2002 | 2 | 18.92 |
3 | MOHE | 52.133 | 122.517 | China | 2000 | 2 | -22.75 |
4 | MOHE | 52.133 | 122.517 | China | 2001 | 2 | -26.50 |
... | ... | ... | ... | ... | ... | ... | ... |
3974 | YUANLING | 28.470 | 110.400 | China | 2006 | 2 | 5.93 |
3975 | YUANLING | 28.470 | 110.400 | China | 2007 | 2 | 10.71 |
3976 | YUANLING | 28.470 | 110.400 | China | 2008 | 2 | 5.63 |
3977 | YUANLING | 28.470 | 110.400 | China | 2009 | 2 | 9.99 |
3978 | YUANLING | 28.470 | 110.400 | China | 2010 | 2 | 7.56 |
3979 rows × 7 columns
Great! Now that we have the data we want, let’s create some interesting visualizations with it!
Write a Geographic Scatter Function for Yearly Temperature Increases
For this visualization, we are interested in creating a map that displays the yearly temperature increase per city in a country. To do this, we must first calculate the yearly temperature increase per city, specifically the estimated yearly increase in temperature in Celsius. We’ll complete this task by running a linear regression model with Temperature as our target variable and Year as our predictor. The slope calculated by the model is what we are looking for.
from sklearn.linear_model import LinearRegression
def coef(data_group):
'''Fits variables "Year" and "Temp" from data_group to a linear regression model
@data_group: dataframe
@return: integer value; slope of the linear regression model
'''
X = data_group[["Year"]]
y = data_group["Temp"]
LR = LinearRegression()
#fit to model
LR.fit(X, y)
slope = LR.coef_[0]
return slope
Now that we have a function to calculate estimated yearly temperature increase, let’s create a function that will allow us to plot thiese estimates by city for a country.
We begin by importing the necessary package: plotly. Our parameters this time is the same as our parameters for our query function, but with a parameter min_obs and kwargs (which allows us to make additional arguments) added on. Please see the docstring for more specific information.
Since we want our input to match the minimum number of years required, we’ll throw a value error in the case where the difference between our latest year and earliest year is less than our min_obs parameter.
Next, we’ll use our previous function to get the data we want. Once this is done, we’ll take the dataframe returned by the function and group it by name, latitude, and longitude and apply the coef function we created to it. This should give us another dataframe, but with the estimates yearly temperature increase for the cities within a country, along with its longitude and latitude. Before we plot the values, we must round them to the nearest thousandths place, and rename the column to Estimated Yearly Increase (°C).
Moving onto the plot, we’ll be taking our data and inputting the latitude and longitude values of each city so that we can plot them on the map. We’ll set the color parameter equal to the Estimated Yearly Increase (°C).
import pandas as pd
from plotly import express as px
def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
'''Returns a map with cities in specified country plotted, color based on average yearly
temperature change
@param country: string; name of a country for which data should be returned
@param year_begin: integer; earliest year for which data should be returned
@param: year_end: integer; latest year for which data should be returned
@param: month: integer; month for which data should be returned
@param: min_obs: integer: minimum required number of years of data for any given station
@param **kwargs: additional keyword arguments passed to px.scatter_mapbox()
@return: map with specified data plotted
'''
months = {1:"January",
2:"February",
3:"March",
4:"April",
5:"May",
6:"June",
7:"July",
8:"August",
9:"September",
10:"October",
11:"November",
12:"December"}
#throws value error if number of years is less than the minimum number of observations
if (year_end - year_begin) < min_obs:
raise ValueError("The amount of years does not satisfy number of minimum observations")
#calls on function query_climate_database to get dataframe with specifications
df = query_climate_database(country=country,
year_begin = year_begin,
year_end = year_end,
month = month)
#group by name, latitude, and longitude; applies coef function
coefs = df.groupby(["NAME", "LATITUDE", "LONGITUDE"]).apply(coef)
coefs = coefs.reset_index()
#rename column with slopes
coefs = coefs.rename(columns={0:"Estimated Yearly Increase (°C)"})
#rounds values to nearest thousandths place
coefs["Estimated Yearly Increase (°C)"] = round(coefs["Estimated Yearly Increase (°C)"], 3)
fig = px.scatter_mapbox(coefs,
#latitude of city
lat = "LATITUDE",
#lonitude of city
lon = "LONGITUDE",
#colors map based on increase
color="Estimated Yearly Increase (°C)",
#message displayed when hovered over
hover_name = "NAME",
#centers colorbar at 0
color_continuous_midpoint=0,
#aspect ratio
height = 250,
#title
title="Estimates of yearly increase in temperature in " + months[month] + " for stations in " + country + ", years " + str(year_begin) + "-" + str(year_end),
#additional arguments specified by user
**kwargs)
fig.update_layout(margin={"r":0, "l": 0, "b":0})
return(fig.show())
Now that we have our function, let’s test it out! Let’s see what the map of the Estimated Yearly Increase for India in the month of January between 1980 and 2020 with minimum of 10 years looks like. For additional arguments, we’ll choose a map style, color scale style, and zoom.
color_map = px.colors.diverging.RdGy_r
fig = temperature_coefficient_plot("India", 1980, 2020, 1,
min_obs = 10,
zoom = 2,
mapbox_style="carto-positron",
color_continuous_scale=color_map)
Let’s try it out with a different country! Now, we’ll see what the map of the Estimated Yearly Increase for China in the month of February between 2000 and 2010 with minimum of 10 years looks like. We will also use a different color scale this time.
fig = temperature_coefficient_plot("China", 2000, 2010, 2,
min_obs = 10,
zoom = 2,
mapbox_style="carto-positron",
color_continuous_scale="inferno")
Now we will be creating two other plots using plotly and the imported data!
Plot 1
For our first plot, we answer the following question:
How does the average temperature change in a given timeframe between different cities in a country?
Let’s begin by creating our first plot!
We begin by creating another SQL query to get the data we want for our plot. In order to answer our question, we will need a country, beginning year, and ending year.
def country_temp_database(country, year_begin, year_end):
'''Returns a data frame containing the name, year and temperature for a specified country in a specified year range
@param country: string; name of a country for which data should be returned
@param year_begin: integer; earliest year for which data should be returned
@param: year_end: integer; latest year for which data should be returned
@return: dataframe containing data that matches specifications
'''
conn = sqlite3.connect("temps.db")
cmd = "SELECT S.name, T.month, C.country, T.year, T.temp \
FROM temperatures T \
LEFT JOIN stations S ON T.id = S.id \
LEFT JOIN countries C ON SUBSTRING(T.id,1,2) = C.'FIPS-10-4' \
WHERE (year >=" + str(year_begin) + ") AND (year <=" + str(year_end) + ") AND (country= \"" + str(country) + "\")"
df = pd.read_sql(cmd, conn)
#closing connection
conn.close()
#returns dataframe with specifications in place
return df
Now that we have our SQL query, let’s test it out on a country. We will be using Greece:
country_temp_database("Greece", 1990, 2005)
NAME | Month | Country | Year | Temp | |
---|---|---|---|---|---|
0 | CORFU | 1 | Greece | 1990 | 9.21 |
1 | CORFU | 2 | Greece | 1990 | 11.48 |
2 | CORFU | 3 | Greece | 1990 | 12.69 |
3 | CORFU | 4 | Greece | 1990 | 14.96 |
4 | CORFU | 5 | Greece | 1990 | 19.09 |
... | ... | ... | ... | ... | ... |
5602 | KALAMAIWAS_167240 | 8 | Greece | 2005 | 26.30 |
5603 | KALAMAIWAS_167240 | 9 | Greece | 2005 | 23.20 |
5604 | KALAMAIWAS_167240 | 10 | Greece | 2005 | 18.10 |
5605 | KALAMAIWAS_167240 | 11 | Greece | 2005 | 13.70 |
5606 | KALAMAIWAS_167240 | 12 | Greece | 2005 | 11.00 |
5607 rows × 5 columns
Next, we want to create a function to create our plot. To condense our plot, we will allow to user to choose to see the top n occuring cities. Please see the code comments for the process of creating this function.
def avg_city_temp(country, year_begin, year_end, n, **kwargs):
'''Returns line graph of average temperatures of cities in specified country over specified time rangr
@param country: string; name of a country for which data should be returned
@param year_begin: integer; earliest year for which data should be returned
@param year_end: integer; latest year for which data should be returned
@param n: integer; n number of cities occuring most in data to be used in graph
@param **kwargs: additional keyword arguments passed to px.scatter_mapbox()
@return: line graph with specifications plotted
'''
#returns list of top n cities that occur the most in the data
cities = country_temp_database(country, year_begin, year_end).groupby(["NAME"])["Country"].count().sort_values(ascending=False).head(n).index.to_list()
#returns dataframe of name, country, year, temp for specified country in specified year range
data = country_temp_database(country, year_begin, year_end)
#groups data by city by year
grouped_data = data[data['NAME'].isin(cities)].groupby(["NAME", "Year"])["Temp"].mean()
#save to CSV file; while transform was an option, it resulted in a strange graph
grouped_data.to_csv(country + ".csv")
#read CSV file
new_data = pd.read_csv(country + ".csv")
#create line graph
fig = px.line(new_data,
x="Year",
y="Temp",
#each line is dif color corresponding to city
color="NAME",
#title specific to country
title="Average Temperature of Top " + str(n) + " Occuring Cities in " + country + " Data per Year from " + str(year_begin) + "-" + str(year_end),
**kwargs)
fig.show()
Again, let’s test this function out on Greece! Let’s say we want to see the change in average temperature for the top 10 occuring cities in the Greece data from 1990-2005. We will call the function as follows:
avg_city_temp("Greece", 1990, 2005, 10)
Great, it works! Let’s move onto plot 2.
Plot 2
For our second plot, we answer the following question:
How does station level affect the average temperature of different cities between different countries?
Similar to our first plot, we will be creating another SQL query to give us the information we need to create our second plot. In this case, we want the country, city names, station levels, and temperatures. We will also limit the number of countries chosen by the user to 3. We will also limit it down to one year.
def stations_database(country1, country2, country3, year):
'''Returns a data frame containing the country, cities, station levels and temperature for 3 countries chosen by user
@param country1: string; country chosen by user
@param country2: string; country chosen by user
@param country3: string; country chosen by user
@param year: integer; year
@return: dataframe containing data that matches specifications
'''
conn = sqlite3.connect("temps.db")
cmd = "SELECT C.country, S.name, S.STNELEV, T.temp \
FROM temperatures T \
LEFT JOIN stations S ON T.id = S.id \
LEFT JOIN countries C ON SUBSTRING(S.id,1,2) = C.'FIPS-10-4' \
WHERE (year =" + str(year) + ") AND (country= \"" + str(country1) + "\") OR (country= \"" + str(country2) + "\") OR (country= \"" + str(country3) + "\")"
df = pd.read_sql(cmd, conn)
#closing connection
conn.close()
#returns dataframe with specifications in place
return df
Let’s test it out on three random cities for the year 1990:
stations_database("China", "India", "Greece", 1990)
Country | NAME | STNELEV | Temp | |
---|---|---|---|---|
0 | China | HONG_KONG_OBSERVATORY | 62.0 | 15.98 |
1 | China | HONG_KONG_OBSERVATORY | 62.0 | 16.57 |
2 | China | HONG_KONG_OBSERVATORY | 62.0 | 18.94 |
3 | China | HONG_KONG_OBSERVATORY | 62.0 | 21.70 |
4 | China | HONG_KONG_OBSERVATORY | 62.0 | 24.92 |
... | ... | ... | ... | ... |
116479 | India | DIU | 15.0 | 26.90 |
116480 | India | DIU | 15.0 | 27.80 |
116481 | India | DIU | 15.0 | 27.09 |
116482 | India | DIU | 15.0 | 25.85 |
116483 | India | DIU | 15.0 | 23.20 |
116484 rows × 4 columns
Great, it works!
Now, let’s create a function that will give us our figure. Within our function, we must first clean up our data by creating a column for the average temperature per city. Since the mean temperature for a city is consistent throughout the cities’ entires, we will be removing all but one instance of each city in each country. We then create 3 different dataframes for each country, which we will use to construct our plot. We make three subplots, and each scatterplot is added by adding a trace. Once we update our titles and margins, we’re good to go!
def temp_per_stn(country1, country2, country3, year, n, **kwargs):
'''Returns 3 scatterplots for 3 countries chosen by the user. Each plot shows the mean temperature vs. station level (log).
@param country1: string; country chosen by user
@param country2: string; country chosen by user
@param country3: string; country chosen by user
@param year: integer; year
@param n: number of cities in each country user would like to see (first n cities in data)
@return: Scatterplots that match specifications
'''
## clean data to use for plot
#call on stations_database for three countries
df = stations_database(country1, country2, country3, year)
#get mean for each station level, make new column
df["mean_temp"] = df.groupby(['NAME'])["Temp"].transform(np.mean)
#same mean for each city, so drop city if already in dataframe
df = df.drop_duplicates(subset=['NAME'])
#create three different dataframes based on country, n=# of cities user wants to see
df1 = df[df["Country"] == country1].head(n)
df2 = df[df["Country"] == country2].head(n)
df3 = df[df["Country"] == country3].head(n)
#make subplots
fig = sp.make_subplots(rows=1, cols=3)
#add trace for first country
fig.add_trace(go.Scatter(x=df1['STNELEV'],
y=df1["mean_temp"],
mode='markers',
name=country1),row=1, col=1
)
#add trace for second country
fig.add_trace(go.Scatter(x=df2['STNELEV'],
y=df2["mean_temp"],
mode='markers',
name=country2),row=1, col=2
)
#add trace for third country
fig.add_trace(go.Scatter(x=df3['STNELEV'],
y=df3["mean_temp"],
mode='markers',
name=country3),row=1, col=3
)
#add x-axis title for each plot
for i in [1,2,3]:
fig.update_xaxes(title_text="Station Level (log)", type="log", row=1, col=i)
#add y-axis title
fig.update_yaxes(title_text="Mean Temperature", row=1, col=1)
#update title and legend
fig.update_layout(title_text="Mean Temperature per Station Level for " + str(n) + " Cities in " + country1 + ", " + country2 + " and " + country3,
showlegend=False)
fig.update_layout(margin={"r":0, "l": 0, "b":0})
fig.show()
Now, let’s try it our on 3 countries:
temp_per_stn("China", "India", "Greece", 1990, 10)
It works!
That’s it for today. Thank you so much for reading and I’ll see you next time!