Creating Visualizations with SQL query Data

Introduction

Hello Everyone!

In today’s post, we will learn how to create a database and how to use SQL query! Let’s begin

Create a Database

We will begin by importing two necessary packages: pandas and sqlite3. Both of these will allow us to create a SQL query.

import pandas as pd
import sqlite3

Next, we will read in our data! We have three CSV files we will read in, and they will all be their own separate table.

Let’s read in temps_stacked.csv, which contains the columns ID, Year, Month, and temp.

temps = pd.read_csv("temps_stacked.csv")
temps.head()

	ID	Year	Month	Temp
0	ACW00011604	1961	1	-0.89
1	ACW00011604	1961	2	2.36
2	ACW00011604	1961	3	4.72
3	ACW00011604	1961	4	7.73
4	ACW00011604	1961	5	11.28

Next, let’s read in countries.csv, which contains the country’s abbreviation and the country name.

countries = pd.read_csv('countries.csv')
#renames columns to get rid of white spaces
countries = countries.rename(columns = {"FIPS 10-4": "FIPS-10-4"})
#renames Name column to Country
countries = countries.rename(columns = {"Name": "Country"})
countries.head()

	FIPS-10-4	ISO 3166	Country
0	AF	AF	Afghanistan
1	AX	-	Akrotiri
2	AL	AL	Albania
3	AG	DZ	Algeria
4	AQ	AS	American Samoa

Lastly, we will read in stations-metadata.csv, which contains ID, Latitude, Longitude, the station level, and the name of the city.

stations = pd.read_csv('station-metadata.csv')
stations.head()

	ID	LATITUDE	LONGITUDE	STNELEV	NAME
0	ACW00011604	57.7667	11.8667	18.0	SAVE
1	AE000041196	25.3330	55.5170	34.0	SHARJAH_INTER_AIRP
2	AEM00041184	25.6170	55.9330	31.0	RAS_AL_KHAIMAH_INTE
3	AEM00041194	25.2550	55.3640	10.4	DUBAI_INTL
4	AEM00041216	24.4300	54.4700	3.0	ABU_DHABI_BATEEN_AIR

We create the database by opening a connection using sqlite3.connect. We then read in our three tables before we close the connection.

conn = sqlite3.connect("temps.db")

#creating database with three tables
temps.to_sql("temperatures", conn, if_exists = "replace", index=False)
countries.to_sql("countries", conn, if_exists = "replace", index=False)
stations.to_sql("stations", conn, if_exists = "replace", index=False)

#closing connection
conn.close()

Congradulations! You’ve just created database! Now, let’s create a SQL query function using data from our database.

Write a Query Function

Let’s write our query function. We begin by naming the function and inputting our desired parameters which, in this case, includes our desired country, our desired year range, and desired month. Please see the function docstring for additional information.

def query_climate_database(country, year_begin, year_end, month):
    
   '''Returns a data frame containing the name, coordinates, country, year, month, and temperature
   for a specified country, year range, and month
   
   @param country: string; name of a country for which data should be returned
   @param year_begin: integer; earliest year for which data should be returned
   @param: year_end: integer; latest year for which data should be returned
   @param: month: integer; month for which data should be returned
   @return: dataframe containing data that matches specifications
   '''

    conn = sqlite3.connect("temps.db")

    #selects name, latitude, and longitude from stations, country from countries, and 
    #year, month, and temp from temperatures
    #joins stations, temperatures, and countries dataframes 
    cmd = "SELECT S.name, S.latitude, S.longitude, C.country, T.year, T.month, T.temp \
            FROM temperatures T \
            LEFT JOIN stations S ON T.id = S.id \
            LEFT JOIN countries C ON SUBSTRING(T.id,1,2) = C.'FIPS-10-4' \
            WHERE (year >=" + str(year_begin) + ") AND (year <=" + str(year_end) + ") AND (month=" + str(month) + ") AND (country= \"" + str(country) + "\")"

    df = pd.read_sql(cmd, conn)

    #closing connection
    conn.close()

    #returns dataframe with specifications in place
    return df

We create a SQL query, turn it into a dataframe, and return that dataframe. It should tell us the the name, latittude, longitude, country, year, month, and temperature for a specified country, in a specified year range, for a specified month.

Let’s test it out to see if it works. We will look at the data from India for the month of January from years 1988 to 2020.

query_climate_database(country="India",
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)

	NAME	LATITUDE	LONGITUDE	Country	Year	Month	Temp
0	PBO_ANANTAPUR	14.583	77.633	India	1980	1	23.48
1	PBO_ANANTAPUR	14.583	77.633	India	1981	1	24.57
2	PBO_ANANTAPUR	14.583	77.633	India	1982	1	24.19
3	PBO_ANANTAPUR	14.583	77.633	India	1983	1	23.51
4	PBO_ANANTAPUR	14.583	77.633	India	1984	1	24.81
...	...	...	...	...	...	...	...
3147	DARJEELING	27.050	88.270	India	1983	1	5.10
3148	DARJEELING	27.050	88.270	India	1986	1	6.90
3149	DARJEELING	27.050	88.270	India	1994	1	8.10
3150	DARJEELING	27.050	88.270	India	1995	1	5.60
3151	DARJEELING	27.050	88.270	India	1997	1	5.70

3152 rows × 7 columns

Success! Let’s test it out again, but we want to see the data for China in the month of February from 2000 to 2010.

query_climate_database(country="China",
                       year_begin = 2000, 
                       year_end = 2010,
                       month = 2)

	NAME	LATITUDE	LONGITUDE	Country	Year	Month	Temp
0	KING'S_PARK	22.333	114.167	China	2000	2	15.22
1	KING'S_PARK	22.333	114.167	China	2001	2	18.37
2	KING'S_PARK	22.333	114.167	China	2002	2	18.92
3	MOHE	52.133	122.517	China	2000	2	-22.75
4	MOHE	52.133	122.517	China	2001	2	-26.50
...	...	...	...	...	...	...	...
3974	YUANLING	28.470	110.400	China	2006	2	5.93
3975	YUANLING	28.470	110.400	China	2007	2	10.71
3976	YUANLING	28.470	110.400	China	2008	2	5.63
3977	YUANLING	28.470	110.400	China	2009	2	9.99
3978	YUANLING	28.470	110.400	China	2010	2	7.56

3979 rows × 7 columns

Great! Now that we have the data we want, let’s create some interesting visualizations with it!

Write a Geographic Scatter Function for Yearly Temperature Increases

For this visualization, we are interested in creating a map that displays the yearly temperature increase per city in a country. To do this, we must first calculate the yearly temperature increase per city, specifically the estimated yearly increase in temperature in Celsius. We’ll complete this task by running a linear regression model with Temperature as our target variable and Year as our predictor. The slope calculated by the model is what we are looking for.

from sklearn.linear_model import LinearRegression

def coef(data_group):
    '''Fits variables "Year" and "Temp" from data_group to a linear regression model 
    
    @data_group: dataframe
    @return: integer value; slope of the linear regression model
    '''
    X = data_group[["Year"]]
    y = data_group["Temp"]
    LR = LinearRegression()
    #fit to model
    LR.fit(X, y)
    slope = LR.coef_[0]
    return slope

Now that we have a function to calculate estimated yearly temperature increase, let’s create a function that will allow us to plot thiese estimates by city for a country.

We begin by importing the necessary package: plotly. Our parameters this time is the same as our parameters for our query function, but with a parameter min_obs and kwargs (which allows us to make additional arguments) added on. Please see the docstring for more specific information.

Since we want our input to match the minimum number of years required, we’ll throw a value error in the case where the difference between our latest year and earliest year is less than our min_obs parameter.

Next, we’ll use our previous function to get the data we want. Once this is done, we’ll take the dataframe returned by the function and group it by name, latitude, and longitude and apply the coef function we created to it. This should give us another dataframe, but with the estimates yearly temperature increase for the cities within a country, along with its longitude and latitude. Before we plot the values, we must round them to the nearest thousandths place, and rename the column to Estimated Yearly Increase (°C).

Moving onto the plot, we’ll be taking our data and inputting the latitude and longitude values of each city so that we can plot them on the map. We’ll set the color parameter equal to the Estimated Yearly Increase (°C).

import pandas as pd
from plotly import express as px

def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
    '''Returns a map with cities in specified country plotted, color based on average yearly
    temperature change
    
    @param country: string; name of a country for which data should be returned
    @param year_begin: integer; earliest year for which data should be returned
    @param: year_end: integer; latest year for which data should be returned
    @param: month: integer; month for which data should be returned
    @param: min_obs: integer: minimum required number of years of data for any given station
    @param **kwargs: additional keyword arguments passed to px.scatter_mapbox()
    @return: map with specified data plotted 
    '''
    months = {1:"January",
             2:"February",
             3:"March",
             4:"April",
             5:"May",
             6:"June",
             7:"July",
             8:"August",
             9:"September",
             10:"October",
             11:"November",
             12:"December"}
    
    #throws value error if number of years is less than the minimum number of observations
    if (year_end - year_begin) < min_obs:
        raise ValueError("The amount of years does not satisfy number of minimum observations")
    
    #calls on function query_climate_database to get dataframe with specifications
    df = query_climate_database(country=country,
                       year_begin = year_begin, 
                       year_end = year_end,
                       month = month)
    
    #group by name, latitude, and longitude; applies coef function 
    coefs = df.groupby(["NAME", "LATITUDE", "LONGITUDE"]).apply(coef)

    coefs = coefs.reset_index()
    #rename column with slopes 
    coefs = coefs.rename(columns={0:"Estimated Yearly Increase (°C)"})
    #rounds values to nearest thousandths place
    coefs["Estimated Yearly Increase (°C)"] = round(coefs["Estimated Yearly Increase (°C)"], 3)
    
    fig = px.scatter_mapbox(coefs,  
                        #latitude of city
                        lat = "LATITUDE",
                        #lonitude of city
                        lon = "LONGITUDE", 
                        #colors map based on increase
                        color="Estimated Yearly Increase (°C)",
                        #message displayed when hovered over
                        hover_name = "NAME", 
                        #centers colorbar at 0
                        color_continuous_midpoint=0,
                        #aspect ratio
                        height = 250, 
                        #title
                        title="Estimates of yearly increase in temperature in " + months[month] + " for stations in " + country + ", years " + str(year_begin) + "-" + str(year_end),
                        #additional arguments specified by user
                        **kwargs) 

    fig.update_layout(margin={"r":0, "l": 0, "b":0}) 
    return(fig.show())

Now that we have our function, let’s test it out! Let’s see what the map of the Estimated Yearly Increase for India in the month of January between 1980 and 2020 with minimum of 10 years looks like. For additional arguments, we’ll choose a map style, color scale style, and zoom.

color_map = px.colors.diverging.RdGy_r 

fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)

Let’s try it out with a different country! Now, we’ll see what the map of the Estimated Yearly Increase for China in the month of February between 2000 and 2010 with minimum of 10 years looks like. We will also use a different color scale this time.

fig = temperature_coefficient_plot("China", 2000, 2010, 2, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale="inferno")

Now we will be creating two other plots using plotly and the imported data!

Plot 1

For our first plot, we answer the following question:

How does the average temperature change in a given timeframe between different cities in a country?

Let’s begin by creating our first plot!

We begin by creating another SQL query to get the data we want for our plot. In order to answer our question, we will need a country, beginning year, and ending year.

def country_temp_database(country, year_begin, year_end):
    
    '''Returns a data frame containing the name, year and temperature for a specified country in a specified year range
   
    @param country: string; name of a country for which data should be returned
    @param year_begin: integer; earliest year for which data should be returned
    @param: year_end: integer; latest year for which data should be returned
    @return: dataframe containing data that matches specifications
    '''

    conn = sqlite3.connect("temps.db")
        
    cmd = "SELECT S.name, T.month, C.country, T.year, T.temp \
            FROM temperatures T \
            LEFT JOIN stations S ON T.id = S.id \
            LEFT JOIN countries C ON SUBSTRING(T.id,1,2) = C.'FIPS-10-4' \
            WHERE (year >=" + str(year_begin) + ") AND (year <=" + str(year_end) + ") AND (country= \"" + str(country) + "\")"

    df = pd.read_sql(cmd, conn)

    #closing connection
    conn.close()

    #returns dataframe with specifications in place
    return df

Now that we have our SQL query, let’s test it out on a country. We will be using Greece:

country_temp_database("Greece", 1990, 2005)

	NAME	Month	Country	Year	Temp
0	CORFU	1	Greece	1990	9.21
1	CORFU	2	Greece	1990	11.48
2	CORFU	3	Greece	1990	12.69
3	CORFU	4	Greece	1990	14.96
4	CORFU	5	Greece	1990	19.09
...	...	...	...	...	...
5602	KALAMAIWAS_167240	8	Greece	2005	26.30
5603	KALAMAIWAS_167240	9	Greece	2005	23.20
5604	KALAMAIWAS_167240	10	Greece	2005	18.10
5605	KALAMAIWAS_167240	11	Greece	2005	13.70
5606	KALAMAIWAS_167240	12	Greece	2005	11.00

5607 rows × 5 columns

Next, we want to create a function to create our plot. To condense our plot, we will allow to user to choose to see the top n occuring cities. Please see the code comments for the process of creating this function.

def avg_city_temp(country, year_begin, year_end, n, **kwargs):
    '''Returns line graph of average temperatures of cities in specified country over specified time rangr
    
    @param country: string; name of a country for which data should be returned
    @param year_begin: integer; earliest year for which data should be returned
    @param year_end: integer; latest year for which data should be returned
    @param n: integer; n number of cities occuring most in data to be used in graph
    @param **kwargs: additional keyword arguments passed to px.scatter_mapbox()
    @return: line graph with specifications plotted
    '''
    
    #returns list of top n cities that occur the most in the data
    cities = country_temp_database(country, year_begin, year_end).groupby(["NAME"])["Country"].count().sort_values(ascending=False).head(n).index.to_list()
    
    #returns dataframe of name, country, year, temp for specified country in specified year range
    data = country_temp_database(country, year_begin, year_end)
    
    #groups data by city by year
    grouped_data = data[data['NAME'].isin(cities)].groupby(["NAME", "Year"])["Temp"].mean()
    
    #save to CSV file; while transform was an option, it resulted in a strange graph
    grouped_data.to_csv(country + ".csv")
    
    #read CSV file
    new_data = pd.read_csv(country + ".csv")
    
    #create line graph
    fig = px.line(new_data, 
                  x="Year", 
                  y="Temp", 
                  #each line is dif color corresponding to city
                  color="NAME", 
                  #title specific to country
                  title="Average Temperature of Top " + str(n) + " Occuring Cities in " +  country + " Data per Year from " + str(year_begin) + "-" + str(year_end),
                  **kwargs)
    
    fig.show()

Again, let’s test this function out on Greece! Let’s say we want to see the change in average temperature for the top 10 occuring cities in the Greece data from 1990-2005. We will call the function as follows:

avg_city_temp("Greece", 1990, 2005, 10)

Great, it works! Let’s move onto plot 2.

Plot 2

For our second plot, we answer the following question:

How does station level affect the average temperature of different cities between different countries?

Similar to our first plot, we will be creating another SQL query to give us the information we need to create our second plot. In this case, we want the country, city names, station levels, and temperatures. We will also limit the number of countries chosen by the user to 3. We will also limit it down to one year.

def stations_database(country1, country2, country3, year):
    
    '''Returns a data frame containing the country, cities, station levels and temperature for 3 countries chosen by user
   
    @param country1: string; country chosen by user
    @param country2: string; country chosen by user
    @param country3: string; country chosen by user
    @param year: integer; year
    @return: dataframe containing data that matches specifications
    '''

    conn = sqlite3.connect("temps.db")
    
    cmd = "SELECT C.country, S.name, S.STNELEV, T.temp \
            FROM temperatures T \
            LEFT JOIN stations S ON T.id = S.id \
            LEFT JOIN countries C ON SUBSTRING(S.id,1,2) = C.'FIPS-10-4' \
            WHERE (year =" + str(year) + ") AND (country= \"" + str(country1) + "\") OR (country= \"" + str(country2) + "\") OR (country= \"" + str(country3) + "\")"

    df = pd.read_sql(cmd, conn)

    #closing connection
    conn.close()

    #returns dataframe with specifications in place
    return df

Let’s test it out on three random cities for the year 1990:

stations_database("China", "India", "Greece", 1990)

	Country	NAME	STNELEV	Temp
0	China	HONG_KONG_OBSERVATORY	62.0	15.98
1	China	HONG_KONG_OBSERVATORY	62.0	16.57
2	China	HONG_KONG_OBSERVATORY	62.0	18.94
3	China	HONG_KONG_OBSERVATORY	62.0	21.70
4	China	HONG_KONG_OBSERVATORY	62.0	24.92
...	...	...	...	...
116479	India	DIU	15.0	26.90
116480	India	DIU	15.0	27.80
116481	India	DIU	15.0	27.09
116482	India	DIU	15.0	25.85
116483	India	DIU	15.0	23.20

116484 rows × 4 columns

Great, it works!

Now, let’s create a function that will give us our figure. Within our function, we must first clean up our data by creating a column for the average temperature per city. Since the mean temperature for a city is consistent throughout the cities’ entires, we will be removing all but one instance of each city in each country. We then create 3 different dataframes for each country, which we will use to construct our plot. We make three subplots, and each scatterplot is added by adding a trace. Once we update our titles and margins, we’re good to go!

def temp_per_stn(country1, country2, country3, year, n, **kwargs):
    '''Returns 3 scatterplots for 3 countries chosen by the user. Each plot shows the mean temperature vs. station level (log).
   
    @param country1: string; country chosen by user
    @param country2: string; country chosen by user
    @param country3: string; country chosen by user
    @param year: integer; year
    @param n: number of cities in each country user would like to see (first n cities in data)
    @return: Scatterplots that match specifications
    '''
    
    ## clean data to use for plot
    #call on stations_database for three countries
    df = stations_database(country1, country2, country3, year)
    #get mean for each station level, make new column
    df["mean_temp"] = df.groupby(['NAME'])["Temp"].transform(np.mean)
    #same mean for each city, so drop city if already in dataframe
    df = df.drop_duplicates(subset=['NAME'])

    #create three different dataframes based on country, n=# of cities user wants to see
    df1 = df[df["Country"] == country1].head(n)
    df2 = df[df["Country"] == country2].head(n)
    df3 = df[df["Country"] == country3].head(n)
    
    #make subplots
    fig = sp.make_subplots(rows=1, cols=3) 
    
    #add trace for first country
    fig.add_trace(go.Scatter(x=df1['STNELEV'], 
                             y=df1["mean_temp"], 
                             mode='markers', 
                             name=country1),row=1, col=1
    )

    #add trace for second country
    fig.add_trace(go.Scatter(x=df2['STNELEV'], 
                               y=df2["mean_temp"], 
                               mode='markers', 
                               name=country2),row=1, col=2
    )

    #add trace for third country
    fig.add_trace(go.Scatter(x=df3['STNELEV'], 
                               y=df3["mean_temp"], 
                               mode='markers', 
                               name=country3),row=1, col=3
    )
    
    #add x-axis title for each plot
    for i in [1,2,3]:
        fig.update_xaxes(title_text="Station Level (log)", type="log", row=1, col=i)
        
    #add y-axis title
    fig.update_yaxes(title_text="Mean Temperature", row=1, col=1)
    
    #update title and legend
    fig.update_layout(title_text="Mean Temperature per Station Level for " + str(n) + " Cities in " + country1 + ", " + country2 + " and " + country3,
                     showlegend=False)
    
    fig.update_layout(margin={"r":0, "l": 0, "b":0}) 

    fig.show()

Now, let’s try it our on 3 countries:

temp_per_stn("China", "India", "Greece", 1990, 10)

It works!

That’s it for today. Thank you so much for reading and I’ll see you next time!

Written on April 24, 2022