Creating Visualizations with SQL query Data

Introduction

Hello Everyone!

In today’s post, we will learn how to create a database and how to use SQL query! Let’s begin

Create a Database

We will begin by importing two necessary packages: pandas and sqlite3. Both of these will allow us to create a SQL query.

import pandas as pd
import sqlite3

Next, we will read in our data! We have three CSV files we will read in, and they will all be their own separate table.

Let’s read in temps_stacked.csv, which contains the columns ID, Year, Month, and temp.

temps = pd.read_csv("temps_stacked.csv")
temps.head()
ID Year Month Temp
0 ACW00011604 1961 1 -0.89
1 ACW00011604 1961 2 2.36
2 ACW00011604 1961 3 4.72
3 ACW00011604 1961 4 7.73
4 ACW00011604 1961 5 11.28

Next, let’s read in countries.csv, which contains the country’s abbreviation and the country name.

countries = pd.read_csv('countries.csv')
#renames columns to get rid of white spaces
countries = countries.rename(columns = {"FIPS 10-4": "FIPS-10-4"})
#renames Name column to Country
countries = countries.rename(columns = {"Name": "Country"})
countries.head()
FIPS-10-4 ISO 3166 Country
0 AF AF Afghanistan
1 AX - Akrotiri
2 AL AL Albania
3 AG DZ Algeria
4 AQ AS American Samoa

Lastly, we will read in stations-metadata.csv, which contains ID, Latitude, Longitude, the station level, and the name of the city.

stations = pd.read_csv('station-metadata.csv')
stations.head()
ID LATITUDE LONGITUDE STNELEV NAME
0 ACW00011604 57.7667 11.8667 18.0 SAVE
1 AE000041196 25.3330 55.5170 34.0 SHARJAH_INTER_AIRP
2 AEM00041184 25.6170 55.9330 31.0 RAS_AL_KHAIMAH_INTE
3 AEM00041194 25.2550 55.3640 10.4 DUBAI_INTL
4 AEM00041216 24.4300 54.4700 3.0 ABU_DHABI_BATEEN_AIR

We create the database by opening a connection using sqlite3.connect. We then read in our three tables before we close the connection.

conn = sqlite3.connect("temps.db")

#creating database with three tables
temps.to_sql("temperatures", conn, if_exists = "replace", index=False)
countries.to_sql("countries", conn, if_exists = "replace", index=False)
stations.to_sql("stations", conn, if_exists = "replace", index=False)

#closing connection
conn.close()

Congradulations! You’ve just created database! Now, let’s create a SQL query function using data from our database.

Write a Query Function

Let’s write our query function. We begin by naming the function and inputting our desired parameters which, in this case, includes our desired country, our desired year range, and desired month. Please see the function docstring for additional information.

def query_climate_database(country, year_begin, year_end, month):
    
   '''Returns a data frame containing the name, coordinates, country, year, month, and temperature
   for a specified country, year range, and month
   
   @param country: string; name of a country for which data should be returned
   @param year_begin: integer; earliest year for which data should be returned
   @param: year_end: integer; latest year for which data should be returned
   @param: month: integer; month for which data should be returned
   @return: dataframe containing data that matches specifications
   '''

    conn = sqlite3.connect("temps.db")

    #selects name, latitude, and longitude from stations, country from countries, and 
    #year, month, and temp from temperatures
    #joins stations, temperatures, and countries dataframes 
    cmd = "SELECT S.name, S.latitude, S.longitude, C.country, T.year, T.month, T.temp \
            FROM temperatures T \
            LEFT JOIN stations S ON T.id = S.id \
            LEFT JOIN countries C ON SUBSTRING(T.id,1,2) = C.'FIPS-10-4' \
            WHERE (year >=" + str(year_begin) + ") AND (year <=" + str(year_end) + ") AND (month=" + str(month) + ") AND (country= \"" + str(country) + "\")"

    df = pd.read_sql(cmd, conn)

    #closing connection
    conn.close()

    #returns dataframe with specifications in place
    return df

We create a SQL query, turn it into a dataframe, and return that dataframe. It should tell us the the name, latittude, longitude, country, year, month, and temperature for a specified country, in a specified year range, for a specified month.

Let’s test it out to see if it works. We will look at the data from India for the month of January from years 1988 to 2020.

query_climate_database(country="India",
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)
NAME LATITUDE LONGITUDE Country Year Month Temp
0 PBO_ANANTAPUR 14.583 77.633 India 1980 1 23.48
1 PBO_ANANTAPUR 14.583 77.633 India 1981 1 24.57
2 PBO_ANANTAPUR 14.583 77.633 India 1982 1 24.19
3 PBO_ANANTAPUR 14.583 77.633 India 1983 1 23.51
4 PBO_ANANTAPUR 14.583 77.633 India 1984 1 24.81
... ... ... ... ... ... ... ...
3147 DARJEELING 27.050 88.270 India 1983 1 5.10
3148 DARJEELING 27.050 88.270 India 1986 1 6.90
3149 DARJEELING 27.050 88.270 India 1994 1 8.10
3150 DARJEELING 27.050 88.270 India 1995 1 5.60
3151 DARJEELING 27.050 88.270 India 1997 1 5.70

3152 rows × 7 columns

Success! Let’s test it out again, but we want to see the data for China in the month of February from 2000 to 2010.

query_climate_database(country="China",
                       year_begin = 2000, 
                       year_end = 2010,
                       month = 2)
NAME LATITUDE LONGITUDE Country Year Month Temp
0 KING'S_PARK 22.333 114.167 China 2000 2 15.22
1 KING'S_PARK 22.333 114.167 China 2001 2 18.37
2 KING'S_PARK 22.333 114.167 China 2002 2 18.92
3 MOHE 52.133 122.517 China 2000 2 -22.75
4 MOHE 52.133 122.517 China 2001 2 -26.50
... ... ... ... ... ... ... ...
3974 YUANLING 28.470 110.400 China 2006 2 5.93
3975 YUANLING 28.470 110.400 China 2007 2 10.71
3976 YUANLING 28.470 110.400 China 2008 2 5.63
3977 YUANLING 28.470 110.400 China 2009 2 9.99
3978 YUANLING 28.470 110.400 China 2010 2 7.56

3979 rows × 7 columns

Great! Now that we have the data we want, let’s create some interesting visualizations with it!

Write a Geographic Scatter Function for Yearly Temperature Increases

For this visualization, we are interested in creating a map that displays the yearly temperature increase per city in a country. To do this, we must first calculate the yearly temperature increase per city, specifically the estimated yearly increase in temperature in Celsius. We’ll complete this task by running a linear regression model with Temperature as our target variable and Year as our predictor. The slope calculated by the model is what we are looking for.

from sklearn.linear_model import LinearRegression

def coef(data_group):
    '''Fits variables "Year" and "Temp" from data_group to a linear regression model 
    
    @data_group: dataframe
    @return: integer value; slope of the linear regression model
    '''
    X = data_group[["Year"]]
    y = data_group["Temp"]
    LR = LinearRegression()
    #fit to model
    LR.fit(X, y)
    slope = LR.coef_[0]
    return slope

Now that we have a function to calculate estimated yearly temperature increase, let’s create a function that will allow us to plot thiese estimates by city for a country.

We begin by importing the necessary package: plotly. Our parameters this time is the same as our parameters for our query function, but with a parameter min_obs and kwargs (which allows us to make additional arguments) added on. Please see the docstring for more specific information.

Since we want our input to match the minimum number of years required, we’ll throw a value error in the case where the difference between our latest year and earliest year is less than our min_obs parameter.

Next, we’ll use our previous function to get the data we want. Once this is done, we’ll take the dataframe returned by the function and group it by name, latitude, and longitude and apply the coef function we created to it. This should give us another dataframe, but with the estimates yearly temperature increase for the cities within a country, along with its longitude and latitude. Before we plot the values, we must round them to the nearest thousandths place, and rename the column to Estimated Yearly Increase (°C).

Moving onto the plot, we’ll be taking our data and inputting the latitude and longitude values of each city so that we can plot them on the map. We’ll set the color parameter equal to the Estimated Yearly Increase (°C).

import pandas as pd
from plotly import express as px

def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
    '''Returns a map with cities in specified country plotted, color based on average yearly
    temperature change
    
    @param country: string; name of a country for which data should be returned
    @param year_begin: integer; earliest year for which data should be returned
    @param: year_end: integer; latest year for which data should be returned
    @param: month: integer; month for which data should be returned
    @param: min_obs: integer: minimum required number of years of data for any given station
    @param **kwargs: additional keyword arguments passed to px.scatter_mapbox()
    @return: map with specified data plotted 
    '''
    months = {1:"January",
             2:"February",
             3:"March",
             4:"April",
             5:"May",
             6:"June",
             7:"July",
             8:"August",
             9:"September",
             10:"October",
             11:"November",
             12:"December"}
    
    #throws value error if number of years is less than the minimum number of observations
    if (year_end - year_begin) < min_obs:
        raise ValueError("The amount of years does not satisfy number of minimum observations")
    
    #calls on function query_climate_database to get dataframe with specifications
    df = query_climate_database(country=country,
                       year_begin = year_begin, 
                       year_end = year_end,
                       month = month)
    
    #group by name, latitude, and longitude; applies coef function 
    coefs = df.groupby(["NAME", "LATITUDE", "LONGITUDE"]).apply(coef)

    coefs = coefs.reset_index()
    #rename column with slopes 
    coefs = coefs.rename(columns={0:"Estimated Yearly Increase (°C)"})
    #rounds values to nearest thousandths place
    coefs["Estimated Yearly Increase (°C)"] = round(coefs["Estimated Yearly Increase (°C)"], 3)
    
    fig = px.scatter_mapbox(coefs,  
                        #latitude of city
                        lat = "LATITUDE",
                        #lonitude of city
                        lon = "LONGITUDE", 
                        #colors map based on increase
                        color="Estimated Yearly Increase (°C)",
                        #message displayed when hovered over
                        hover_name = "NAME", 
                        #centers colorbar at 0
                        color_continuous_midpoint=0,
                        #aspect ratio
                        height = 250, 
                        #title
                        title="Estimates of yearly increase in temperature in " + months[month] + " for stations in " + country + ", years " + str(year_begin) + "-" + str(year_end),
                        #additional arguments specified by user
                        **kwargs) 

    fig.update_layout(margin={"r":0, "l": 0, "b":0}) 
    return(fig.show())

Now that we have our function, let’s test it out! Let’s see what the map of the Estimated Yearly Increase for India in the month of January between 1980 and 2020 with minimum of 10 years looks like. For additional arguments, we’ll choose a map style, color scale style, and zoom.

color_map = px.colors.diverging.RdGy_r 

fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)

Let’s try it out with a different country! Now, we’ll see what the map of the Estimated Yearly Increase for China in the month of February between 2000 and 2010 with minimum of 10 years looks like. We will also use a different color scale this time.

fig = temperature_coefficient_plot("China", 2000, 2010, 2, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale="inferno")

Now we will be creating two other plots using plotly and the imported data!

Plot 1

For our first plot, we answer the following question:

How does the average temperature change in a given timeframe between different cities in a country?

Let’s begin by creating our first plot!

We begin by creating another SQL query to get the data we want for our plot. In order to answer our question, we will need a country, beginning year, and ending year.

def country_temp_database(country, year_begin, year_end):
    
    '''Returns a data frame containing the name, year and temperature for a specified country in a specified year range
   
    @param country: string; name of a country for which data should be returned
    @param year_begin: integer; earliest year for which data should be returned
    @param: year_end: integer; latest year for which data should be returned
    @return: dataframe containing data that matches specifications
    '''

    conn = sqlite3.connect("temps.db")
        
    cmd = "SELECT S.name, T.month, C.country, T.year, T.temp \
            FROM temperatures T \
            LEFT JOIN stations S ON T.id = S.id \
            LEFT JOIN countries C ON SUBSTRING(T.id,1,2) = C.'FIPS-10-4' \
            WHERE (year >=" + str(year_begin) + ") AND (year <=" + str(year_end) + ") AND (country= \"" + str(country) + "\")"

    df = pd.read_sql(cmd, conn)

    #closing connection
    conn.close()

    #returns dataframe with specifications in place
    return df

Now that we have our SQL query, let’s test it out on a country. We will be using Greece:

country_temp_database("Greece", 1990, 2005)
NAME Month Country Year Temp
0 CORFU 1 Greece 1990 9.21
1 CORFU 2 Greece 1990 11.48
2 CORFU 3 Greece 1990 12.69
3 CORFU 4 Greece 1990 14.96
4 CORFU 5 Greece 1990 19.09
... ... ... ... ... ...
5602 KALAMAIWAS_167240 8 Greece 2005 26.30
5603 KALAMAIWAS_167240 9 Greece 2005 23.20
5604 KALAMAIWAS_167240 10 Greece 2005 18.10
5605 KALAMAIWAS_167240 11 Greece 2005 13.70
5606 KALAMAIWAS_167240 12 Greece 2005 11.00

5607 rows × 5 columns

Next, we want to create a function to create our plot. To condense our plot, we will allow to user to choose to see the top n occuring cities. Please see the code comments for the process of creating this function.

def avg_city_temp(country, year_begin, year_end, n, **kwargs):
    '''Returns line graph of average temperatures of cities in specified country over specified time rangr
    
    @param country: string; name of a country for which data should be returned
    @param year_begin: integer; earliest year for which data should be returned
    @param year_end: integer; latest year for which data should be returned
    @param n: integer; n number of cities occuring most in data to be used in graph
    @param **kwargs: additional keyword arguments passed to px.scatter_mapbox()
    @return: line graph with specifications plotted
    '''
    
    #returns list of top n cities that occur the most in the data
    cities = country_temp_database(country, year_begin, year_end).groupby(["NAME"])["Country"].count().sort_values(ascending=False).head(n).index.to_list()
    
    #returns dataframe of name, country, year, temp for specified country in specified year range
    data = country_temp_database(country, year_begin, year_end)
    
    #groups data by city by year
    grouped_data = data[data['NAME'].isin(cities)].groupby(["NAME", "Year"])["Temp"].mean()
    
    #save to CSV file; while transform was an option, it resulted in a strange graph
    grouped_data.to_csv(country + ".csv")
    
    #read CSV file
    new_data = pd.read_csv(country + ".csv")
    
    #create line graph
    fig = px.line(new_data, 
                  x="Year", 
                  y="Temp", 
                  #each line is dif color corresponding to city
                  color="NAME", 
                  #title specific to country
                  title="Average Temperature of Top " + str(n) + " Occuring Cities in " +  country + " Data per Year from " + str(year_begin) + "-" + str(year_end),
                  **kwargs)
    
    fig.show()

Again, let’s test this function out on Greece! Let’s say we want to see the change in average temperature for the top 10 occuring cities in the Greece data from 1990-2005. We will call the function as follows:

avg_city_temp("Greece", 1990, 2005, 10)

Great, it works! Let’s move onto plot 2.

Plot 2

For our second plot, we answer the following question:

How does station level affect the average temperature of different cities between different countries?

Similar to our first plot, we will be creating another SQL query to give us the information we need to create our second plot. In this case, we want the country, city names, station levels, and temperatures. We will also limit the number of countries chosen by the user to 3. We will also limit it down to one year.

def stations_database(country1, country2, country3, year):
    
    '''Returns a data frame containing the country, cities, station levels and temperature for 3 countries chosen by user
   
    @param country1: string; country chosen by user
    @param country2: string; country chosen by user
    @param country3: string; country chosen by user
    @param year: integer; year
    @return: dataframe containing data that matches specifications
    '''

    conn = sqlite3.connect("temps.db")
    
    cmd = "SELECT C.country, S.name, S.STNELEV, T.temp \
            FROM temperatures T \
            LEFT JOIN stations S ON T.id = S.id \
            LEFT JOIN countries C ON SUBSTRING(S.id,1,2) = C.'FIPS-10-4' \
            WHERE (year =" + str(year) + ") AND (country= \"" + str(country1) + "\") OR (country= \"" + str(country2) + "\") OR (country= \"" + str(country3) + "\")"

    df = pd.read_sql(cmd, conn)

    #closing connection
    conn.close()

    #returns dataframe with specifications in place
    return df

Let’s test it out on three random cities for the year 1990:

stations_database("China", "India", "Greece", 1990)
Country NAME STNELEV Temp
0 China HONG_KONG_OBSERVATORY 62.0 15.98
1 China HONG_KONG_OBSERVATORY 62.0 16.57
2 China HONG_KONG_OBSERVATORY 62.0 18.94
3 China HONG_KONG_OBSERVATORY 62.0 21.70
4 China HONG_KONG_OBSERVATORY 62.0 24.92
... ... ... ... ...
116479 India DIU 15.0 26.90
116480 India DIU 15.0 27.80
116481 India DIU 15.0 27.09
116482 India DIU 15.0 25.85
116483 India DIU 15.0 23.20

116484 rows × 4 columns

Great, it works!

Now, let’s create a function that will give us our figure. Within our function, we must first clean up our data by creating a column for the average temperature per city. Since the mean temperature for a city is consistent throughout the cities’ entires, we will be removing all but one instance of each city in each country. We then create 3 different dataframes for each country, which we will use to construct our plot. We make three subplots, and each scatterplot is added by adding a trace. Once we update our titles and margins, we’re good to go!

def temp_per_stn(country1, country2, country3, year, n, **kwargs):
    '''Returns 3 scatterplots for 3 countries chosen by the user. Each plot shows the mean temperature vs. station level (log).
   
    @param country1: string; country chosen by user
    @param country2: string; country chosen by user
    @param country3: string; country chosen by user
    @param year: integer; year
    @param n: number of cities in each country user would like to see (first n cities in data)
    @return: Scatterplots that match specifications
    '''
    
    ## clean data to use for plot
    #call on stations_database for three countries
    df = stations_database(country1, country2, country3, year)
    #get mean for each station level, make new column
    df["mean_temp"] = df.groupby(['NAME'])["Temp"].transform(np.mean)
    #same mean for each city, so drop city if already in dataframe
    df = df.drop_duplicates(subset=['NAME'])

    #create three different dataframes based on country, n=# of cities user wants to see
    df1 = df[df["Country"] == country1].head(n)
    df2 = df[df["Country"] == country2].head(n)
    df3 = df[df["Country"] == country3].head(n)
    
    #make subplots
    fig = sp.make_subplots(rows=1, cols=3) 
    
    #add trace for first country
    fig.add_trace(go.Scatter(x=df1['STNELEV'], 
                             y=df1["mean_temp"], 
                             mode='markers', 
                             name=country1),row=1, col=1
    )

    #add trace for second country
    fig.add_trace(go.Scatter(x=df2['STNELEV'], 
                               y=df2["mean_temp"], 
                               mode='markers', 
                               name=country2),row=1, col=2
    )

    #add trace for third country
    fig.add_trace(go.Scatter(x=df3['STNELEV'], 
                               y=df3["mean_temp"], 
                               mode='markers', 
                               name=country3),row=1, col=3
    )
    
    #add x-axis title for each plot
    for i in [1,2,3]:
        fig.update_xaxes(title_text="Station Level (log)", type="log", row=1, col=i)
        
    #add y-axis title
    fig.update_yaxes(title_text="Mean Temperature", row=1, col=1)
    
    #update title and legend
    fig.update_layout(title_text="Mean Temperature per Station Level for " + str(n) + " Cities in " + country1 + ", " + country2 + " and " + country3,
                     showlegend=False)
    
    fig.update_layout(margin={"r":0, "l": 0, "b":0}) 

    fig.show()

Now, let’s try it our on 3 countries:

temp_per_stn("China", "India", "Greece", 1990, 10)

It works!

That’s it for today. Thank you so much for reading and I’ll see you next time!

Written on April 24, 2022