By Raphael Deykin and Yali Tiomkin
Final Data Science Tutorial - CMPS 3660 - Introduction to Data Science - Proessor Mattei
Table of Contents:
On January 12, 2016, Governor John Bel Edwards signed his first executive order to start the process of Medicaid expansion in Louisiana. He promised this would be one of his frist acts as governor, which he immediately took action to turn into reality.
Medicaid Expansion in Louisiana is called Healthy Louisiana. Today, there are over 730,000 residents enrolled in Healthy Louisiana. Additionally, almost 2 millian Louisiana residents are enrolled in Medicaid. Although a much higher number than originally expected, the COVID pandemic pushed hundreds of thousands of people to join Medicaid. It proved to be a true safety net during the pandemic.
Medicaid Expansion allows any eligible adult to enroll under the following criteria:
This project aims to explore the differences in health insurance by county in Louisiana, comparing data from pre-Medicaid expansion to post-Medicaid expansion. As you can see in the picture above, although the parishes are similar in size by land area, they differ tremendously by population, wealth, races, income variation, political agendas, and so many more details.
We are aiming to explore the differences Medicaid expansion made on insurance rates.
Specifically:
How large of a difference did Medicaid expansion make on insurance rates?
Which features of the population played the largest role in determining their insurance rates?
How did the adoption of Medicaid expansion change the makeup of the insured and the uninsured populations in Louisiana?
Our datasets are both from the US Census Bureau, which acquired the data through community surveys that led to population estimates. These are from the Official US Government data, so this is the most reliable and accurate data out there. To explore differences in Medicaid expansion, we chose datasets from 2 different years, one pre-Medicaid expansion and one post-Medicaid expansion. The pre-Medicaid expansion data comes from 2015. The post-Medicaid expansion data comes from 2021. The 2021 data includes a portion of the total parishes in Louisiana, so we used the similar parishes from 2015.
The data collected represents an array of details about the population of each parish. This includes:
AGE: Separated into categories which represent minors, adults, and elderly persons.
SEX: Female vs Male (No option for non-binary persons)
RACE: Options for white, hispanic/latino, black, asian, hawaiian, or several races
LIVING ARRANGEMENTS: Options include a family household (married couple), only a male or female reference person, or a non-family household
NATIVITY AND US CITIZENSHIP STATUS: Native vs. foreign born, and if foreign born specifies if they are a citizen or not
DISABILITY STATUS: Specifies if the person is living with any disability
EDUCATIONAL ATTAINMENT: For persons over the age of 26, did they obtain less than a high school degree, a high school degree, some college or associate's degree, or bachelor's or higher
EMPLOYMENT STATUS: For population between ages 19-64, if in labor force specifies if they are employed
WORK EXPERIENCE: Specifies if in the last year, person worked full time, less than full time, or not at all
HOUSEHOLD INCOME: Total household population separated to categories, in 2021-inflation adjusted dollars
RATIO OF INCOME TO POVERTY LEVEL IN THE PAST 12 MONTHS: if below poverty line, where does the household stand in terms of percentage below that line
We extracted our data from the US Census Bureau Website found at https://data.census.gov/ . The website has options for a plethora of surveys, as well as options for states, cities, counties, years, and a variety of topics. After finding the two datasets we chose to explore, we decided to upload these to google drive in order to retain the unchanged data. We upload these csvs to our analysis below.
from google.colab import drive
drive.mount('/content/drive')
!pip install censusdata
!pip install geopandas
%cd /content/drive/My Drive
!git clone https://github.com/rdeykin/CMPS_3160_Project.git
%cd CMPS_3160_Project/
!git pull
%cd Data/
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.pyplot import figure
import seaborn as sns
import censusdata
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
health21 = pd.read_csv("../Data/ACSST1Y2021.S2701-2022-10-10T235340.csv")
health15= pd.read_csv("../Data/ACSST1Y2015.S2701-2022-11-04T214721.csv")
health21.head
health15.head
As you can tell, these datasets have some odd numbers and characters that need to be dropped. Here we begin our analysis of the health insurance datasets. We start by dropping the nonfunctional headers, then renaming the columns.
health15.drop(health15.filter(like="Margin"),axis=1,inplace=True)
#dropping the margin of error columns, unnecessary for our analysis
health21.drop(labels=[1,14,17,27,34,39,42,48,54,59,66],axis=0,inplace=True)
health15.drop(labels=[1,15,18,28,35,40,43,49,55,60,67],axis=0,inplace=True)
# dropping unusable columns, mostly titles such as AGE or RACE titles
health21=health21.rename(columns={'Label (Grouping)': 'Label',
'Ascension Parish, Louisiana!!Total!!Estimate':"Ascension Total",
'Ascension Parish, Louisiana!!Insured!!Estimate': 'Ascension Insured',
'Ascension Parish, Louisiana!!Percent Insured!!Estimate': 'Ascension Insured %',
'Ascension Parish, Louisiana!!Uninsured!!Estimate': 'Ascension Uninsured',
'Ascension Parish, Louisiana!!Percent Uninsured!!Estimate': 'Ascension Uninsured %',
'Bossier Parish, Louisiana!!Total!!Estimate':"Bossier Total",
'Bossier Parish, Louisiana!!Insured!!Estimate': 'Bossier Insured',
'Bossier Parish, Louisiana!!Percent Insured!!Estimate': 'Bossier Insured %',
'Bossier Parish, Louisiana!!Uninsured!!Estimate': 'Bossier Uninsured',
'Bossier Parish, Louisiana!!Percent Uninsured!!Estimate': 'Bossier Uninsured %',
'Caddo Parish, Louisiana!!Total!!Estimate':"Caddo Total",
'Caddo Parish, Louisiana!!Insured!!Estimate': 'Caddo Insured',
'Caddo Parish, Louisiana!!Percent Insured!!Estimate': 'Caddo Insured %',
'Caddo Parish, Louisiana!!Uninsured!!Estimate': 'Caddo Uninsured',
'Caddo Parish, Louisiana!!Percent Uninsured!!Estimate': 'Caddo Uninsured %',
'Calcasieu Parish, Louisiana!!Total!!Estimate':"Calcasieu Total",
'Calcasieu Parish, Louisiana!!Insured!!Estimate': 'Calcasieu Insured',
'Calcasieu Parish, Louisiana!!Percent Insured!!Estimate': 'Calcasieu Insured %',
'Calcasieu Parish, Louisiana!!Uninsured!!Estimate': 'Calcasieu Uninsured',
'Calcasieu Parish, Louisiana!!Percent Uninsured!!Estimate': 'Calcasieu Uninsured %',
'East Baton Rouge Parish, Louisiana!!Total!!Estimate':"East Baton Rouge Total",
'East Baton Rouge Parish, Louisiana!!Insured!!Estimate': 'East Baton Rouge Insured',
'East Baton Rouge Parish, Louisiana!!Percent Insured!!Estimate': 'East Baton Rouge Insured %',
'East Baton Rouge Parish, Louisiana!!Uninsured!!Estimate': 'East Baton Rougen Uninsured',
'East Baton Rouge Parish, Louisiana!!Percent Uninsured!!Estimate': 'East Baton Rouge Uninsured %',
'Iberia Parish, Louisiana!!Total!!Estimate':"Iberia Total",
'Iberia Parish, Louisiana!!Insured!!Estimate': 'Iberia Insured',
'Iberia Parish, Louisiana!!Percent Insured!!Estimate': 'Iberia Insured %',
'Iberia Parish, Louisiana!!Uninsured!!Estimate': 'Iberia Uninsured',
'Iberia Parish, Louisiana!!Percent Uninsured!!Estimate': 'Iberia Uninsured %',
'Jefferson Parish, Louisiana!!Total!!Estimate':"Jefferson Total",
'Jefferson Parish, Louisiana!!Insured!!Estimate': 'Jefferson Insured',
'Jefferson Parish, Louisiana!!Percent Insured!!Estimate': 'Jefferson Insured %',
'Jefferson Parish, Louisiana!!Uninsured!!Estimate': 'Jefferson Uninsured',
'Jefferson Parish, Louisiana!!Percent Uninsured!!Estimate': 'Jefferson Uninsured %',
'Lafayette Parish, Louisiana!!Total!!Estimate':"Lafayette Total",
'Lafayette Parish, Louisiana!!Insured!!Estimate': 'Lafayette Insured',
'Lafayette Parish, Louisiana!!Percent Insured!!Estimate': 'Lafayette Insured %',
'Lafayette Parish, Louisiana!!Uninsured!!Estimate': 'Lafayette Uninsured',
'Lafayette Parish, Louisiana!!Percent Uninsured!!Estimate': 'Lafayette Uninsured %',
'Lafourche Parish, Louisiana!!Total!!Estimate':"Lafourche Total",
'Lafourche Parish, Louisiana!!Insured!!Estimate': 'Lafourche Insured',
'Lafourche Parish, Louisiana!!Percent Insured!!Estimate': 'Lafourche Insured %',
'Lafourche Parish, Louisiana!!Uninsured!!Estimate': 'Lafourche Uninsured',
'Lafourche Parish, Louisiana!!Percent Uninsured!!Estimate': 'Lafourche Uninsured %',
'Livingston Parish, Louisiana!!Total!!Estimate':"Livingston Total",
'Livingston Parish, Louisiana!!Insured!!Estimate': 'Livingston Insured',
'Livingston Parish, Louisiana!!Percent Insured!!Estimate': 'Livingston Insured %',
'Livingston Parish, Louisiana!!Uninsured!!Estimate': 'Livingston Uninsured',
'Livingston Parish, Louisiana!!Percent Uninsured!!Estimate': 'Livingston Uninsured %',
'Orleans Parish, Louisiana!!Total!!Estimate':"Orleans Total",
'Orleans Parish, Louisiana!!Insured!!Estimate': 'Orleans Insured',
'Orleans Parish, Louisiana!!Percent Insured!!Estimate': 'Orleans Insured %',
'Orleans Parish, Louisiana!!Uninsured!!Estimate': 'Orleans Uninsured',
'Orleans Parish, Louisiana!!Percent Uninsured!!Estimate': 'Orleans Uninsured %',
'Ouachita Parish, Louisiana!!Total!!Estimate':"Ouachita Total",
'Ouachita Parish, Louisiana!!Insured!!Estimate': 'Ouachita Insured',
'Ouachita Parish, Louisiana!!Percent Insured!!Estimate': 'Ouachita Insured %',
'Ouachita Parish, Louisiana!!Uninsured!!Estimate': 'Ouachita Uninsured',
'Ouachita Parish, Louisiana!!Percent Uninsured!!Estimate': 'Ouachita Uninsured %',
'Rapides Parish, Louisiana!!Total!!Estimate':"Rapides Total",
'Rapides Parish, Louisiana!!Insured!!Estimate': 'Rapides Insured',
'Rapides Parish, Louisiana!!Percent Insured!!Estimate': 'Rapides Insured %',
'Rapides Parish, Louisiana!!Uninsured!!Estimate': 'Rapides Uninsured',
'Rapides Parish, Louisiana!!Percent Uninsured!!Estimate': 'Rapides Uninsured %',
'St. Landry Parish, Louisiana!!Total!!Estimate':"St. Landry Total",
'St. Landry Parish, Louisiana!!Insured!!Estimate': 'St. Landry Insured',
'St. Landry Parish, Louisiana!!Percent Insured!!Estimate': 'St. Landry Insured %',
'St. Landry Parish, Louisiana!!Uninsured!!Estimate': 'St. Landry Uninsured',
'St. Landry Parish, Louisiana!!Percent Uninsured!!Estimate': 'St. Landry Uninsured %',
'St. Tammany Parish, Louisiana!!Total!!Estimate':"St. Tammany Total",
'St. Tammany Parish, Louisiana!!Insured!!Estimate': 'St. Tammany Insured',
'St. Tammany Parish, Louisiana!!Percent Insured!!Estimate': 'St. Tammany Insured %',
'St. Tammany Parish, Louisiana!!Uninsured!!Estimate': 'St. Tammany Uninsured',
'St. Tammany Parish, Louisiana!!Percent Uninsured!!Estimate': 'St. Tammany Uninsured %',
'Tangipahoa Parish, Louisiana!!Total!!Estimate':"Tangipahoa Total",
'Tangipahoa Parish, Louisiana!!Insured!!Estimate': 'Tangipahoa Insured',
'Tangipahoa Parish, Louisiana!!Percent Insured!!Estimate': 'Tangipahoa Insured %',
'Tangipahoa Parish, Louisiana!!Uninsured!!Estimate': 'Tangipahoa Uninsured',
'Tangipahoa Parish, Louisiana!!Percent Uninsured!!Estimate': 'Tangipahoa Uninsured %',
'Terrebonne Parish, Louisiana!!Total!!Estimate':"Terrebonne Total",
'Terrebonne Parish, Louisiana!!Insured!!Estimate': 'Terrebonne Insured',
'Terrebonne Parish, Louisiana!!Percent Insured!!Estimate': 'Terrebonne Insured %',
'Terrebonne Parish, Louisiana!!Uninsured!!Estimate': 'Terrebonne Uninsured',
'Terrebonne Parish, Louisiana!!Percent Uninsured!!Estimate': 'Terrebonne Uninsured %',
'United States!!Total!!Estimate':"US Total",
'United States!!Insured!!Estimate': 'US Insured',
'United States!!Percent Insured!!Estimate': 'US Insured %',
'United States!!Uninsured!!Estimate': 'US Uninsured',
'United States!!Percent Uninsured!!Estimate': 'US Uninsured %'})
health15=health15.rename(columns={'Label (Grouping)': 'Label',
'Ascension Parish, Louisiana!!Total!!Estimate':"Ascension Total",
'Ascension Parish, Louisiana!!Insured!!Estimate': 'Ascension Insured',
'Ascension Parish, Louisiana!!Percent Insured!!Estimate': 'Ascension Insured %',
'Ascension Parish, Louisiana!!Uninsured!!Estimate': 'Ascension Uninsured',
'Ascension Parish, Louisiana!!Percent Uninsured!!Estimate': 'Ascension Uninsured %',
'Bossier Parish, Louisiana!!Total!!Estimate':"Bossier Total",
'Bossier Parish, Louisiana!!Insured!!Estimate': 'Bossier Insured',
'Bossier Parish, Louisiana!!Percent Insured!!Estimate': 'Bossier Insured %',
'Bossier Parish, Louisiana!!Uninsured!!Estimate': 'Bossier Uninsured',
'Bossier Parish, Louisiana!!Percent Uninsured!!Estimate': 'Bossier Uninsured %',
'Caddo Parish, Louisiana!!Total!!Estimate':"Caddo Total",
'Caddo Parish, Louisiana!!Insured!!Estimate': 'Caddo Insured',
'Caddo Parish, Louisiana!!Percent Insured!!Estimate': 'Caddo Insured %',
'Caddo Parish, Louisiana!!Uninsured!!Estimate': 'Caddo Uninsured',
'Caddo Parish, Louisiana!!Percent Uninsured!!Estimate': 'Caddo Uninsured %',
'Calcasieu Parish, Louisiana!!Total!!Estimate':"Calcasieu Total",
'Calcasieu Parish, Louisiana!!Insured!!Estimate': 'Calcasieu Insured',
'Calcasieu Parish, Louisiana!!Percent Insured!!Estimate': 'Calcasieu Insured %',
'Calcasieu Parish, Louisiana!!Uninsured!!Estimate': 'Calcasieu Uninsured',
'Calcasieu Parish, Louisiana!!Percent Uninsured!!Estimate': 'Calcasieu Uninsured %',
'East Baton Rouge Parish, Louisiana!!Total!!Estimate':"East Baton Rouge Total",
'East Baton Rouge Parish, Louisiana!!Insured!!Estimate': 'East Baton Rouge Insured',
'East Baton Rouge Parish, Louisiana!!Percent Insured!!Estimate': 'East Baton Rouge Insured %',
'East Baton Rouge Parish, Louisiana!!Uninsured!!Estimate': 'East Baton Rougen Uninsured',
'East Baton Rouge Parish, Louisiana!!Percent Uninsured!!Estimate': 'East Baton Rouge Uninsured %',
'Iberia Parish, Louisiana!!Total!!Estimate':"Iberia Total",
'Iberia Parish, Louisiana!!Insured!!Estimate': 'Iberia Insured',
'Iberia Parish, Louisiana!!Percent Insured!!Estimate': 'Iberia Insured %',
'Iberia Parish, Louisiana!!Uninsured!!Estimate': 'Iberia Uninsured',
'Iberia Parish, Louisiana!!Percent Uninsured!!Estimate': 'Iberia Uninsured %',
'Jefferson Parish, Louisiana!!Total!!Estimate':"Jefferson Total",
'Jefferson Parish, Louisiana!!Insured!!Estimate': 'Jefferson Insured',
'Jefferson Parish, Louisiana!!Percent Insured!!Estimate': 'Jefferson Insured %',
'Jefferson Parish, Louisiana!!Uninsured!!Estimate': 'Jefferson Uninsured',
'Jefferson Parish, Louisiana!!Percent Uninsured!!Estimate': 'Jefferson Uninsured %',
'Lafayette Parish, Louisiana!!Total!!Estimate':"Lafayette Total",
'Lafayette Parish, Louisiana!!Insured!!Estimate': 'Lafayette Insured',
'Lafayette Parish, Louisiana!!Percent Insured!!Estimate': 'Lafayette Insured %',
'Lafayette Parish, Louisiana!!Uninsured!!Estimate': 'Lafayette Uninsured',
'Lafayette Parish, Louisiana!!Percent Uninsured!!Estimate': 'Lafayette Uninsured %',
'Lafourche Parish, Louisiana!!Total!!Estimate':"Lafourche Total",
'Lafourche Parish, Louisiana!!Insured!!Estimate': 'Lafourche Insured',
'Lafourche Parish, Louisiana!!Percent Insured!!Estimate': 'Lafourche Insured %',
'Lafourche Parish, Louisiana!!Uninsured!!Estimate': 'Lafourche Uninsured',
'Lafourche Parish, Louisiana!!Percent Uninsured!!Estimate': 'Lafourche Uninsured %',
'Livingston Parish, Louisiana!!Total!!Estimate':"Livingston Total",
'Livingston Parish, Louisiana!!Insured!!Estimate': 'Livingston Insured',
'Livingston Parish, Louisiana!!Percent Insured!!Estimate': 'Livingston Insured %',
'Livingston Parish, Louisiana!!Uninsured!!Estimate': 'Livingston Uninsured',
'Livingston Parish, Louisiana!!Percent Uninsured!!Estimate': 'Livingston Uninsured %',
'Orleans Parish, Louisiana!!Total!!Estimate':"Orleans Total",
'Orleans Parish, Louisiana!!Insured!!Estimate': 'Orleans Insured',
'Orleans Parish, Louisiana!!Percent Insured!!Estimate': 'Orleans Insured %',
'Orleans Parish, Louisiana!!Uninsured!!Estimate': 'Orleans Uninsured',
'Orleans Parish, Louisiana!!Percent Uninsured!!Estimate': 'Orleans Uninsured %',
'Ouachita Parish, Louisiana!!Total!!Estimate':"Ouachita Total",
'Ouachita Parish, Louisiana!!Insured!!Estimate': 'Ouachita Insured',
'Ouachita Parish, Louisiana!!Percent Insured!!Estimate': 'Ouachita Insured %',
'Ouachita Parish, Louisiana!!Uninsured!!Estimate': 'Ouachita Uninsured',
'Ouachita Parish, Louisiana!!Percent Uninsured!!Estimate': 'Ouachita Uninsured %',
'Rapides Parish, Louisiana!!Total!!Estimate':"Rapides Total",
'Rapides Parish, Louisiana!!Insured!!Estimate': 'Rapides Insured',
'Rapides Parish, Louisiana!!Percent Insured!!Estimate': 'Rapides Insured %',
'Rapides Parish, Louisiana!!Uninsured!!Estimate': 'Rapides Uninsured',
'Rapides Parish, Louisiana!!Percent Uninsured!!Estimate': 'Rapides Uninsured %',
'St. Landry Parish, Louisiana!!Total!!Estimate':"St. Landry Total",
'St. Landry Parish, Louisiana!!Insured!!Estimate': 'St. Landry Insured',
'St. Landry Parish, Louisiana!!Percent Insured!!Estimate': 'St. Landry Insured %',
'St. Landry Parish, Louisiana!!Uninsured!!Estimate': 'St. Landry Uninsured',
'St. Landry Parish, Louisiana!!Percent Uninsured!!Estimate': 'St. Landry Uninsured %',
'St. Tammany Parish, Louisiana!!Total!!Estimate':"St. Tammany Total",
'St. Tammany Parish, Louisiana!!Insured!!Estimate': 'St. Tammany Insured',
'St. Tammany Parish, Louisiana!!Percent Insured!!Estimate': 'St. Tammany Insured %',
'St. Tammany Parish, Louisiana!!Uninsured!!Estimate': 'St. Tammany Uninsured',
'St. Tammany Parish, Louisiana!!Percent Uninsured!!Estimate': 'St. Tammany Uninsured %',
'Tangipahoa Parish, Louisiana!!Total!!Estimate':"Tangipahoa Total",
'Tangipahoa Parish, Louisiana!!Insured!!Estimate': 'Tangipahoa Insured',
'Tangipahoa Parish, Louisiana!!Percent Insured!!Estimate': 'Tangipahoa Insured %',
'Tangipahoa Parish, Louisiana!!Uninsured!!Estimate': 'Tangipahoa Uninsured',
'Tangipahoa Parish, Louisiana!!Percent Uninsured!!Estimate': 'Tangipahoa Uninsured %',
'Terrebonne Parish, Louisiana!!Total!!Estimate':"Terrebonne Total",
'Terrebonne Parish, Louisiana!!Insured!!Estimate': 'Terrebonne Insured',
'Terrebonne Parish, Louisiana!!Percent Insured!!Estimate': 'Terrebonne Insured %',
'Terrebonne Parish, Louisiana!!Uninsured!!Estimate': 'Terrebonne Uninsured',
'Terrebonne Parish, Louisiana!!Percent Uninsured!!Estimate': 'Terrebonne Uninsured %',
'United States!!Total!!Estimate':"US Total",
'United States!!Insured!!Estimate': 'US Insured',
'United States!!Percent Insured!!Estimate': 'US Insured %',
'United States!!Uninsured!!Estimate': 'US Uninsured',
'United States!!Percent Uninsured!!Estimate': 'US Uninsured %'})
Next, we rename the rows which are not appropriately labeled.
health21
health15
health21.drop(labels=[11,12,13,49,55],axis=0,inplace=True)
health21.loc[2,'Label']="Age: <6"
health21.loc[3,'Label']="Age: 6-17"
health21.loc[4,'Label']="Age: 18-24"
health21.loc[5,'Label']="Age: 25-34"
health21.loc[6,'Label']="Age: 35-44"
health21.loc[7,'Label']="Age: 45-54"
health21.loc[8,'Label']="Age: 55-64"
health21.loc[9,'Label']="Age: 65-74"
health21.loc[10,'Label']="Age: 75+"
health21.loc[55,'Label']='Population 18 to 64 years'
health21.loc[43,'Label']='Population 26 years and over'
health21.loc[67,'Label']='Pop where poverty status determined'
health21.loc[31,'Label']='Male reference, no spouse'
health21.loc[32,'Label']='Female reference, no spouse'
health21.loc[33,'Label']='Non-family household/Other'
health21.loc[45,'Label']='High school grad'
health21.loc[57,'Label']='Worked < full-time in past 12 months'
health21.loc[68,'Label']='< 138% of poverty threshold'
health21.loc[70,'Label']='>= 400% of poverty threshold'
health15.drop(labels=[2,5,11,14],axis=0,inplace=True)
health15.loc[3,'Label']="Age: <6"
health15.loc[4,'Label']="Age: 6-17"
health15.loc[6,'Label']="Age: 18-24"
health15.loc[7,'Label']="Age: 25-34"
health15.loc[8,'Label']="Age: 35-44"
health15.loc[9,'Label']="Age: 45-54"
health15.loc[10,'Label']="Age: 55-64"
health15.loc[12,'Label']="Age: 65-74"
health15.loc[13,'Label']="Age: 75+"
health15.loc[32,'Label']='Male reference, no spouse'
health15.loc[33,'Label']='Female reference, no spouse'
health15.loc[44,'Label']='Population 26 years and over'
health15.loc[56,'Label']='Population 18 to 64 years'
health15.loc[68,'Label']='Pop where poverty status determined'
health15.loc[34,'Label']='Non-family household/Other'
health15.loc[46,'Label']='High school grad'
health15.loc[58,'Label']='Worked < full-time in past 12 months'
health15.loc[69,'Label']='< 138% of poverty threshold'
health15.loc[72,'Label']='>= 400% of poverty threshold'
health21.dtypes
health15.dtypes
Lastly, we transpose the dataset in order for the observations to be seen as columns.
health21.set_index("Label",inplace=True)
health21 = health21.transpose()
health21
health15.set_index("Label",inplace=True)
health15 = health15.transpose()
health15
There is an interesting error that we get when we try to access the column "Male". This is because there are indentations (more specifically non breaking spaces) in the original dataset that have been handled automatically by PANDAS replacing them with the "\xa0". Lets look at an example
health21["\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0Male"]
If we had tried health_insur_LA_Tp["(spacespacespace)Male"], PANDAS would have thrown a KeyError. To us, non-breaking spaces look like regular spaces but as we can see above they are handled internally much differently. To avoid having to type "\ax0" a bunch of times, we are going to rename the columns to exclude these non-breaking spaces.
Thanks to aechase on GitHub and FolksTalk.com for helping to elucidate exactly what was going on here
Links: https://www.folkstalk.com/tech/pandas-xa0-with-code-examples/, https://github.com/pandas-dev/pandas/issues/14579
Now that the data has been roughly cleaned and organized, it is important to check that the datatypes of the variables are numbers so operations can be performed on them.
health21.columns = health21.columns.str.replace('\xa0', '')
health15.columns = health15.columns.str.replace('\xa0', '')
health21.dtypes
health21.dtypes
Looks like they aren't numbers, rather they are strings. Lets cast them as numbers on both datasets so we can perform meaningful operations on them. However, we have to get rid of the other string characters (such as percentage signs and commas) before we can do this.
health21.replace(regex = ",", value = "", inplace = True)
health21.replace(regex = "%", value = "", inplace = True)
health21.replace(regex = "N", value = "", inplace = True)
health21=health21.apply(pd.to_numeric, errors = "coerce")
health21
health15.replace(regex = ",", value = "", inplace = True)
health15.replace(regex = "%", value = "", inplace = True)
health15.replace(regex = "N", value = "", inplace = True)
health15=health15.apply(pd.to_numeric, errors = "coerce")
health15
Now we have cleaned the dataset of all string characters and converted the data to numbers, casting all the missing values as a NaN
At this point, it seems like we would be able to start doing some analysis. However, there are repeat observations (parishes) on the table, but in different units. Essentially, there are three different tables smushed into one. To make our analysis meaningful, we will have to filter the larger tables into smaller ones with the same units.
#isolating the percentages
health21_percent = pd.DataFrame(health21[health21.index.str.contains("%")])
health21_percent.rename(columns={"Civilian noninstitutionalized population": "Percent"}, inplace=True)
health21_percent
# Percentage of Insured
health21_percent_insured = pd.DataFrame(health21_percent[health21_percent.index.str.contains("Insured")])
health21_percent_insured.rename(columns={"Percent": "Insured Percent"}, inplace=True)
health21_percent_insured.rename(index={
"Ascension Insured %": "Ascension",
"Bossier Insured %": "Bossier",
"Caddo Insured %": "Caddo",
"Calcasieu Insured %": "Calcasieu",
"East Baton Rouge Insured %": "East Baton Rouge",
"Iberia Insured %": "Iberia",
"Jefferson Insured %": "Jefferson",
"Lafayette Insured %": "Lafayette",
"Lafourche Insured %": "Lafourche",
"Livingston Insured %": "Livingston",
"Orleans Insured %": "Orleans",
"Ouachita Insured %": "Ouachita",
"Rapides Insured %": "Rapides",
"St. Landry Insured %": "St. Landry",
"St. Tammany Insured %": "St. Tammany",
"Tangipahoa Insured %": "Tangipahoa",
"Terrebonne Insured %": "Terrebonne",
"US Insured %": "US"
},inplace=True)
health21_percent_insured
# Percentage Uninsured
health21_percent_uninsured = pd.DataFrame(health21_percent[health21_percent.index.str.contains("Uninsured")])
health21_percent_uninsured.rename(columns={"Percent": "Uninsured Percent"}, inplace=True)
health21_percent_uninsured.rename(index={
"Ascension Uninsured %": "Ascension",
"Bossier Uninsured %": "Bossier",
"Caddo Uninsured %": "Caddo",
"Calcasieu Uninsured %": "Calcasieu",
"East Baton Rouge Uninsured %": "East Baton Rouge",
"Iberia Uninsured %": "Iberia",
"Jefferson Uninsured %": "Jefferson",
"Lafayette Uninsured %": "Lafayette",
"Lafourche Uninsured %": "Lafourche",
"Livingston Uninsured %": "Livingston",
"Orleans Uninsured %": "Orleans",
"Ouachita Uninsured %": "Ouachita",
"Rapides Uninsured %": "Rapides",
"St. Landry Uninsured %": "St. Landry",
"St. Tammany Uninsured %": "St. Tammany",
"Tangipahoa Uninsured %": "Tangipahoa",
"Terrebonne Uninsured %": "Terrebonne",
"US Uninsured %": "US"
},inplace=True)
health21_percent_uninsured
# Get the numeric insurance data
h = health21[~health21.index.str.contains("%")]
# Get the totals
health21_total = pd.DataFrame(h[h.index.str.contains("Total")])
health21_total.rename(columns={"Civilian noninstitutionalized population": "Total Citizens"}, inplace=True)
health21_total.rename(index={
"Ascension Total": "Ascension",
"Bossier Total": "Bossier",
"Caddo Total": "Caddo",
"Calcasieu Total": "Calcasieu",
"East Baton Rouge Total": "East Baton Rouge",
"Iberia Total": "Iberia",
"Jefferson Total": "Jefferson",
"Lafayette Total": "Lafayette",
"Lafourche Total": "Lafourche",
"Livingston Total": "Livingston",
"Orleans Total": "Orleans",
"Ouachita Total": "Ouachita",
"Rapides Total": "Rapides",
"St. Landry Total": "St. Landry",
"St. Tammany Total": "St. Tammany",
"Tangipahoa Total": "Tangipahoa",
"Terrebonne Total": "Terrebonne",
"US Total": "US"
},inplace=True)
health21_total
# Get numbers of insured
health21_insured = pd.DataFrame(h[h.index.str.contains("Insured")])
health21_insured.rename(index={
"Ascension Insured": "Ascension",
"Bossier Insured": "Bossier",
"Caddo Insured": "Caddo",
"Calcasieu Insured": "Calcasieu",
"East Baton Rouge Insured": "East Baton Rouge",
"Iberia Insured": "Iberia",
"Jefferson Insured": "Jefferson",
"Lafayette Insured": "Lafayette",
"Lafourche Insured": "Lafourche",
"Livingston Insured": "Livingston",
"Orleans Insured": "Orleans",
"Ouachita Insured": "Ouachita",
"Rapides Insured": "Rapides",
"St. Landry Insured": "St. Landry",
"St. Tammany Insured": "St. Tammany",
"Tangipahoa Insured": "Tangipahoa",
"Terrebonne Insured": "Terrebonne",
"US Insured": "US"
},inplace=True)
health21_insured=health21_insured.rename(columns={"Civilian noninstitutionalized population": "Insured Population"})
health21_insured
# Get numbers of Uninsured
health21_uninsured = pd.DataFrame(h[h.index.str.contains("Uninsured")])
health21_uninsured.rename(index={
"Ascension Uninsured": "Ascension",
"Bossier Uninsured": "Bossier",
"Caddo Uninsured": "Caddo",
"Calcasieu Uninsured": "Calcasieu",
"East Baton Rouge Uninsured": "East Baton Rouge",
"Iberia Uninsured": "Iberia",
"Jefferson Uninsured": "Jefferson",
"Lafayette Uninsured": "Lafayette",
"Lafourche Uninsured": "Lafourche",
"Livingston Uninsured": "Livingston",
"Orleans Uninsured": "Orleans",
"Ouachita Uninsured": "Ouachita",
"Rapides Uninsured": "Rapides",
"St. Landry Uninsured": "St. Landry",
"St. Tammany Uninsured": "St. Tammany",
"Tangipahoa Uninsured": "Tangipahoa",
"Terrebonne Uninsured": "Terrebonne",
"US Uninsured": "US"
},inplace=True)
health21_uninsured=health21_uninsured.rename(columns={"Civilian noninstitutionalized population": "Uninsured Population"})
health21_uninsured
health15_percent = pd.DataFrame(health15[health15.index.str.contains("%")])
health15_percent.rename(columns={"Civilian noninstitutionalized population": "Percent"}, inplace=True)
health15_percent
health15_percent_insured = pd.DataFrame(health15_percent[health15_percent.index.str.contains("Insured")])
health15_percent_insured.rename(columns={"Percent": "Insured Percent"}, inplace=True)
health15_percent_insured.rename(index={
"Ascension Insured %": "Ascension",
"Bossier Insured %": "Bossier",
"Caddo Insured %": "Caddo",
"Calcasieu Insured %": "Calcasieu",
"East Baton Rouge Insured %": "East Baton Rouge",
"Iberia Insured %": "Iberia",
"Jefferson Insured %": "Jefferson",
"Lafayette Insured %": "Lafayette",
"Lafourche Insured %": "Lafourche",
"Livingston Insured %": "Livingston",
"Orleans Insured %": "Orleans",
"Ouachita Insured %": "Ouachita",
"Rapides Insured %": "Rapides",
"St. Landry Insured %": "St. Landry",
"St. Tammany Insured %": "St. Tammany",
"Tangipahoa Insured %": "Tangipahoa",
"Terrebonne Insured %": "Terrebonne",
"US Insured %": "US"
},inplace=True)
health15_percent_insured
# Percentage Uninsured
health15_percent_uninsured = pd.DataFrame(health15_percent[health15_percent.index.str.contains("Uninsured")])
health15_percent_uninsured.rename(columns={"Percent": "Uninsured Percent"}, inplace=True)
health15_percent_uninsured.rename(index={
"Ascension Uninsured %": "Ascension",
"Bossier Uninsured %": "Bossier",
"Caddo Uninsured %": "Caddo",
"Calcasieu Uninsured %": "Calcasieu",
"East Baton Rouge Uninsured %": "East Baton Rouge",
"Iberia Uninsured %": "Iberia",
"Jefferson Uninsured %": "Jefferson",
"Lafayette Uninsured %": "Lafayette",
"Lafourche Uninsured %": "Lafourche",
"Livingston Uninsured %": "Livingston",
"Orleans Uninsured %": "Orleans",
"Ouachita Uninsured %": "Ouachita",
"Rapides Uninsured %": "Rapides",
"St. Landry Uninsured %": "St. Landry",
"St. Tammany Uninsured %": "St. Tammany",
"Tangipahoa Uninsured %": "Tangipahoa",
"Terrebonne Uninsured %": "Terrebonne",
"US Uninsured %": "US"
},inplace=True)
health15_percent_uninsured
# Get the numeric insurance data
h = health15[~health15.index.str.contains("%")]
# Get the totals
health15_total = pd.DataFrame(h[h.index.str.contains("Total")])
health15_total.rename(columns={"Civilian noninstitutionalized population": "Total Citizens"}, inplace=True)
health15_total.rename(index={
"Ascension Total": "Ascension",
"Bossier Total": "Bossier",
"Caddo Total": "Caddo",
"Calcasieu Total": "Calcasieu",
"East Baton Rouge Total": "East Baton Rouge",
"Iberia Total": "Iberia",
"Jefferson Total": "Jefferson",
"Lafayette Total": "Lafayette",
"Lafourche Total": "Lafourche",
"Livingston Total": "Livingston",
"Orleans Total": "Orleans",
"Ouachita Total": "Ouachita",
"Rapides Total": "Rapides",
"St. Landry Total": "St. Landry",
"St. Tammany Total": "St. Tammany",
"Tangipahoa Total": "Tangipahoa",
"Terrebonne Total": "Terrebonne",
"US Total": "US"
},inplace=True)
health15_total
# Get numbers of insured
health15_insured = pd.DataFrame(h[h.index.str.contains("Insured")])
health15_insured.rename(index={
"Ascension Insured": "Ascension",
"Bossier Insured": "Bossier",
"Caddo Insured": "Caddo",
"Calcasieu Insured": "Calcasieu",
"East Baton Rouge Insured": "East Baton Rouge",
"Iberia Insured": "Iberia",
"Jefferson Insured": "Jefferson",
"Lafayette Insured": "Lafayette",
"Lafourche Insured": "Lafourche",
"Livingston Insured": "Livingston",
"Orleans Insured": "Orleans",
"Ouachita Insured": "Ouachita",
"Rapides Insured": "Rapides",
"St. Landry Insured": "St. Landry",
"St. Tammany Insured": "St. Tammany",
"Tangipahoa Insured": "Tangipahoa",
"Terrebonne Insured": "Terrebonne",
"US Insured": "US"
},inplace=True)
health15_insured=health15_insured.rename(columns={"Civilian noninstitutionalized population": "Insured Population"})
health15_insured
# Get numbers of Uninsured
health15_insur_LA_uninsured = pd.DataFrame(h[h.index.str.contains("Uninsured")])
health15_insur_LA_uninsured.rename(index={
"Ascension Uninsured": "Ascension",
"Bossier Uninsured": "Bossier",
"Caddo Uninsured": "Caddo",
"Calcasieu Uninsured": "Calcasieu",
"East Baton Rouge Uninsured": "East Baton Rouge",
"Iberia Uninsured": "Iberia",
"Jefferson Uninsured": "Jefferson",
"Lafayette Uninsured": "Lafayette",
"Lafourche Uninsured": "Lafourche",
"Livingston Uninsured": "Livingston",
"Orleans Uninsured": "Orleans",
"Ouachita Uninsured": "Ouachita",
"Rapides Uninsured": "Rapides",
"St. Landry Uninsured": "St. Landry",
"St. Tammany Uninsured": "St. Tammany",
"Tangipahoa Uninsured": "Tangipahoa",
"Terrebonne Uninsured": "Terrebonne",
"US Uninsured": "US"
},inplace=True)
health15_uninsured=health15_insur_LA_uninsured.rename(columns={"Civilian noninstitutionalized population": "Uninsured Population"})
health15_uninsured
Now we have tables we can work with. Lets look at a graph and see if we can pick out any patterns of interest
Above we have outlined the general workflow that will be performed on our other dataset so we can move on to more in depth analysis
Lets start our analysis by doing some EDA on both the post Medicaid expansion dataset and the pre Medicaid expansion dataset. To begin, lets get some initial comparisons of population size and number of insured people.
fig, (ax2,ax1) = plt.subplots(1,2, figsize=(25,10))
fig.suptitle('Population vs Insurance', fontsize=30)
total_pop=pd.DataFrame(health21_total["Total Citizens"])
insured=pd.DataFrame(health21_insured["Insured Population"])
pop_insur=total_pop.join(insured,how="outer")
pop_insur.drop("US",inplace=True)
pop_insur.sort_values(by="Total Citizens",inplace=True)
pop_insur.plot.bar(stacked=True, color={"Total Citizens": "plum","Insured Population": "yellowgreen"},ax = ax1)
ax1.set_xlabel("Louisiana Parishes",fontsize=14)
ax1.set_ylabel("# of Citizens",fontsize=14)
ax1.set_title('Post Medicaid Expansion',fontsize = 25)
total_pop1=pd.DataFrame(health15_total["Total Citizens"])
insured1=pd.DataFrame(health15_insured["Insured Population"])
pop_insur1=total_pop.join(insured,how="outer")
pop_insur1.drop("US",inplace=True)
pop_insur1.sort_values(by="Total Citizens",inplace=True)
pop_insur1.plot.bar(stacked=True,color={"Total Citizens": "plum","Insured Population": "yellowgreen"}, ax = ax2)
ax2.set_xlabel("Louisiana Parishes",fontsize=14)
ax2.set_ylabel("# of Citizens",fontsize=14)
ax2.set_title('Pre Medicaid Expansion',fontsize = 25)
Looking at both of the graphs, it seems like both for pre Medicaid and post Medicaid expansion, the total population in each parish is roughly equal to the number of insured citizens. Does this mean that Medicaid expansion didn't have an effect? Not necessarily, especially because here we are just comparing raw numbers and not percentages.
ages21=health21_insured.sort_values(by="Insured Population",inplace=False)
ages21=ages21.filter(like="Age")
ages21=ages21.transpose()
ages21=ages21["US"]
ages15=health15_insured.sort_values(by="Insured Population",inplace=False)
ages15=ages15.filter(like="Age")
ages15=ages15.transpose()
ages15=ages15["US"]
fig=plt.figure()
ax=fig.add_subplot(111)
ax2=ax.twinx()
width=0.3
ages15.plot(kind='bar',color='plum',ax=ax,width=width,position=1,figsize=(10,7))
ages21.plot(kind='bar',color='yellowgreen',ax=ax,width=width,position=0,figsize=(10,7))
ax.set_ylabel("# Insured Citizens")
ax.set_xlabel("Ages")
ax.set_title("Insured Age Groups Pre v Post Medicaid Expansion",fontsize=15)
ax.legend(["2015","2021"],loc='best')
plt.show()
Something to note here, is that the number of insured citizens for the age group 6-17 is higher in the post Medicaid expansion vs the pre Medicaid expansion graph. Another aspect of this that is important to note is that the numbers of insured people went up post Medicaid expansion. This could be due to population growth but it is possible that Medicaid expansion had a measurable effect.
ages21=health21_uninsured.sort_values(by="Uninsured Population",inplace=False)
ages21=ages21.filter(like="Age")
ages21=ages21.transpose()
ages21=ages21["US"]
ages15=health15_uninsured.sort_values(by="Uninsured Population",inplace=False)
ages15=ages15.filter(like="Age")
ages15=ages15.transpose()
ages15=ages15["US"]
fig=plt.figure()
ax=fig.add_subplot(111)
ax2=ax.twinx()
width=0.3
ages15.plot(kind='bar',color='plum',ax=ax,width=width,position=1,figsize=(10,7))
ages21.plot(kind='bar',color='yellowgreen',ax=ax,width=width,position=0)
ax.set_ylabel("# Uninsured Citizens")
ax.set_xlabel("Ages")
ax.set_title("Uninsured Age Groups Pre v Post Medicaid Expansion",fontsize=20)
ax.legend(["2015","2021"],loc='best')
plt.show()
Again the different is subtle, but it is noticable. The pre Medicaid expansion graph seems to peak at a higher point than the post Medicaid expansion graph. Since the graph is depicting number of uninsured citizens, it seems that Medicaid expansion could possibly have reduced the number of uninsured people, particularly in the age range of 25-34.
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(25,10))
fig.suptitle('Sex vs Insurance', fontsize=30)
sex=health21_insured.sort_values(by="Insured Population",inplace=False)
sex=sex.filter(like="ale")
sex.drop("US",inplace=True)
sex=sex.drop(sex.filter(like="reference"),axis=1)
sex=sex.drop(sex.filter(like="grad"),axis=1)
sex.plot.bar(stacked=False,color={"Male": "royalblue","Female": "lightpink"},ax=ax2)
ax2.set_xlabel("Parish",fontsize=14)
ax2.set_ylabel("# of Insured Citizens",fontsize=14)
ax2.set_title('2021',fontsize = 25)
sex=health15_insured.sort_values(by="Insured Population",inplace=False)
sex=sex.filter(like="ale")
sex.drop("US",inplace=True)
sex=sex.drop(sex.filter(like="reference"),axis=1)
sex=sex.drop(sex.filter(like="grad"),axis=1)
sex.plot.bar(stacked=False,color={"Male": "royalblue","Female": "lightpink"},ax=ax1)
ax1.set_xlabel("Parish",fontsize=14)
ax1.set_ylabel("# of Insured Citizens",fontsize=14)
ax1.set_title('2015',fontsize = 25)
#plt.show()
Here it doesn't look like there is much of a different between the numbers of insured by sex across the different parishes between pre Medicaid expansion and post Medicaid expansion. What is interesting, though, is that females are more insured than males in every single parish in both datasets, and there are many more insured people in East Baton Rouge than there are in Iberia (However this might be because there are just more people in East Baton Rouge).
Raw numbers are great. However, it is possible that the differences between the pre Medicaid expansion and post Medicaid expansion graphs are due to population growth or other outside factors that affect population size. To get aroud this, lets take a look at a comparison of the two datasets but this time lets use percentages instead of the numbers themselves.
ages21=health21_percent_insured.sort_values(by="Insured Percent",inplace=False)
ages21=ages21.filter(like="Age")
ages21=ages21.transpose()
ages21=ages21["US"]
ages15=health15_percent_insured.sort_values(by="Insured Percent",inplace=False)
ages15=ages15.filter(like="Age")
ages15=ages15.transpose()
ages15=ages15["US"]
fig=plt.figure()
ax=fig.add_subplot(111)
ax2=ax.twinx()
width=0.3
ages15.plot(kind='bar',color='plum',ax=ax,width=width,position=1,figsize=(10,7))
ages21.plot(kind='bar',color='yellowgreen',ax=ax,width=width,position=0,figsize=(10,7))
ax.set_ylabel("% Insured Citizens")
ax.set_xlabel("Ages")
ax.set_title("Insured % Age Groups Pre v Post Medicaid Expansion",fontsize=20)
ax.legend(["2015","2021"],loc='best')
plt.show()
These graphs look pretty similar. This might be because there are just subtle differences in the percentages between the two timeframes. Lets look at the uninsurance rates between pre and post Medicaid expansion to see if there are more noticable differences
ages21=health21_percent_uninsured.sort_values(by="Uninsured Percent",inplace=False)
ages21=ages21.filter(like="Age")
ages21=ages21.transpose()
ages21=ages21["US"]
ages15=health15_percent_uninsured.sort_values(by="Uninsured Percent",inplace=False)
ages15=ages15.filter(like="Age")
ages15=ages15.transpose()
ages15=ages15["US"]
fig=plt.figure()
ax=fig.add_subplot(111)
ax2=ax.twinx()
width=0.3
ages15.plot(kind='bar',color='plum',ax=ax,width=width,position=1,figsize=(10,7))
ages21.plot(kind='bar',color='yellowgreen',ax=ax,width=width,position=0,figsize=(10,7))
ax.set_ylabel("% Uninsured Citizens")
ax.set_xlabel("Ages")
ax.set_title("Uninsured % Age Groups Pre v Post Medicaid Expansion",fontsize=20)
ax.legend(["2015","2021"],loc='best')
plt.show()
Now the difference is much more noticable! According to these graphs, it looks like there is a difference in the percentages of people uninsured, perticularly in the 18-34 age ranges. Whats interesting is that it looks like the percentage in the 55-64, under 6, and the 6-18 age ranges actually went up after Medicaid expansion. This warrents more investigation.
import matplotlib.pyplot as plt
fig,ax = plt.subplots(1,1, figsize=(18,10))
sex=health15_percent_uninsured.sort_values(by="Uninsured Percent",inplace=False)
sex=sex.filter(like="ale")
sex.drop("US",inplace=True)
sex=sex.drop(sex.filter(like="reference"),axis=1)
sex=sex.drop(sex.filter(like="grad"),axis=1)
sex1=health21_percent_uninsured.sort_values(by="Uninsured Percent",inplace=False)
sex1.drop("US",inplace=True)
sex1 = sex1.reindex(["Ascension", "St. Tammany", "East Baton Rouge",
"Livingston", "Bossier", "Lafourche",
"Orleans","Lafayette","Calcasieu",
"Rapides","Tangipahoa","Ouachita",
"Caddo", "Jefferson", "Iberia","St. Landry", "Terrebonne"])
sex1=sex1.filter(like="ale")
sex1=sex1.drop(sex1.filter(like="reference"),axis=1)
sex1=sex1.drop(sex1.filter(like="grad"),axis=1)
sex2 = sex1.join(sex, how='outer', lsuffix='_2021', rsuffix='_2015')
sex2=sex2[["Female_2015","Female_2021","Male_2015","Male_2021"]]
sex2.plot.bar(stacked=False, color={"Male_2021": "royalblue","Female_2021": "magenta", "Male_2015": "darkblue", "Female_2015": "plum"}, ax = ax)
ax.set_ylabel("% Uninsured Citizens", fontsize = 15)
ax.set_xlabel("Parish", fontsize = 15)
ax.set_title("Uninsured Percent by Parish", fontsize = 25)
When looking at the rates of uninsured people by parish and stratifying by sex, it looks like women were starting from a better place than men were (They already had lower rates of uninsurance as a group than men did in 2015). Even after Medicaid expansion, it seems like men still had higher rates overall of uninsured people as a group than women did. Despite this, Medicaid expansion seems to have reduced rates of uninsured people in every parish.
While it is helpful to visualize the difference between the various rates in the 2015 and the 2021 dataset, we are more interested in seeing how the adoption of Medicaid expansion changed the makeup of the insured and the uninsured populations in Louisiana.
Below, we are grabbing only the features shared by both datasets (some differ slightly), and putting them in order to be readable.
#uninsured_features = list(set(health21_percent_uninsured.columns) & set(health15_percent_uninsured.columns))
#insured_features = list(set(health21_percent_insured.columns) & set(health15_percent_insured.columns))
uninsured_features=['Population 18 to 64 years','>= 400% of poverty threshold','< 138% of poverty threshold','Pop where poverty status determined',
'$100,000 and over','$75,000 to $99,999','$50,000 to $74,999', '$50,000 to $74,999','$25,000 to $49,999','$25,000 to $49,999','Under $25,000','Total household population','Did not work','Worked < full-time in past 12 months',
'Not in labor force','Unemployed','Employed','In labor force',"Bachelor's degree or higher","Some college or associate's degree",'High school grad','Less than high school graduate','Population 26 years and over',
'No disability','With a disability','Non-family household/Other','Female reference, no spouse','Male reference, no spouse','In other families','In married couple families','In family households',
'White alone, not Hispanic or Latino','Black or African American alone','White alone','Female','Male','Age: 75+','Age: 65-74','Age: 55-64','Age: 45-54', 'Age: 35-44',"Age: 25-34","Age: 18-24",'Uninsured Percent',"Age: <6"]
insured_features=['Population 18 to 64 years','>= 400% of poverty threshold','< 138% of poverty threshold','Pop where poverty status determined',
'$100,000 and over','$75,000 to $99,999','$50,000 to $74,999', '$50,000 to $74,999','$25,000 to $49,999','$25,000 to $49,999','Under $25,000','Total household population','Did not work','Worked < full-time in past 12 months',
'Not in labor force','Unemployed','Employed','In labor force',"Bachelor's degree or higher","Some college or associate's degree",'High school grad','Less than high school graduate','Population 26 years and over',
'No disability','With a disability','Non-family household/Other','Female reference, no spouse','Male reference, no spouse','In other families','In married couple families','In family households',
'White alone, not Hispanic or Latino','Black or African American alone','White alone','Female','Male','Age: 75+','Age: 65-74','Age: 55-64','Age: 45-54', 'Age: 35-44',"Age: 25-34","Age: 18-24",'Insured Percent',"Age: <6"]
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(16,28),sharey=False)
fig.suptitle('Feature Weights Uninsured %', fontsize=20, y=1.02)
Y_train_15 = health15_percent_uninsured[uninsured_features]["Uninsured Percent"].copy()
X_train_15 = health15_percent_uninsured[uninsured_features].drop("Uninsured Percent", axis = 1)
X_train_15.dropna(axis = "columns",inplace = True)
reg = LinearRegression()
reg.fit(X_train_15, Y_train_15)
#Code to determine feature weights comes courtesy of the SK-Learn documentation
# https://inria.github.io/scikit-learn-mooc/python_scripts/dev_features_importance.html
coefs15_u = pd.DataFrame(
reg.coef_,
columns=['Coefficients'], index=X_train_15.columns
)
Y_train_21 = health21_percent_uninsured[uninsured_features]["Uninsured Percent"].copy()
X_train_21 = health21_percent_uninsured[uninsured_features].drop("Uninsured Percent", axis = 1)
X_train_21.dropna(axis = "columns",inplace = True)
reg = LinearRegression()
reg.fit(X_train_21, Y_train_21)
coefs21_u = pd.DataFrame(
reg.coef_,
columns=['Coefficients'], index=X_train_21.columns
)
coefs15_u.plot.barh(ax=ax1,color={"Coefficients": "plum"},figsize=(18,10))
ax1.axvline(x=0, color='.5')
ax1.set_title("2015",fontsize=15)
coefs21_u.plot.barh(ax=ax2,color={"Coefficients": "yellowgreen"}, figsize=(18,10))
ax2.axvline(x=0, color='.5')
ax2.set_title("2021",fontsize=15)
plt.tight_layout()
These graphs show us a lot of interesting points of our data in terms of which features effect the insurance rates in what way. In 2015, we notice that being in a non-family or couple household is a large predictor for a low rate of uninsurance, or a high rate of insurance.
From these graphs, we can determine the most important features for predicting uninsured % in the two years. We will explore the feature importances of insured % more in our next graph below.
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(16,28),sharey=False)
fig.suptitle('Feature Weights Insured %', fontsize=20, y=1.02)
Y_train_15 = health15_percent_insured[insured_features]["Insured Percent"].copy()
X_train_15 = health15_percent_insured[insured_features].drop("Insured Percent", axis = 1)
X_train_15.dropna(axis = "columns",inplace = True)
reg = LinearRegression()
reg.fit(X_train_15, Y_train_15)
coefs15_i = pd.DataFrame(
reg.coef_,
columns=['Coefficients'], index=X_train_15.columns
)
Y_train_21 = health21_percent_insured[insured_features]["Insured Percent"].copy()
X_train_21 = health21_percent_insured[insured_features].drop("Insured Percent", axis = 1)
X_train_21.dropna(axis = "columns",inplace = True)
reg = LinearRegression()
reg.fit(X_train_21, Y_train_21)
coefs21_i = pd.DataFrame(
reg.coef_,
columns=['Coefficients'], index=X_train_21.columns
)
coefs15_i.plot.barh(ax=ax1,color={"Coefficients": "plum"},figsize=(18, 10))
ax1.axvline(x=0, color='.5')
ax1.set_title("2015",fontsize=15)
coefs21_i.plot.barh(ax=ax2, color={"Coefficients": "yellowgreen"},figsize=(18, 10))
ax2.axvline(x=0, color='.5')
ax2.set_title("2021",fontsize=15)
plt.tight_layout()
Here we see the feature weights for predicting insurance. On the left we have the feature importance for predicting insurance in 2015, and on the right we have the same except they're the weights for 2021. It is important to notice that the feature importance graphs for insured % look different than the graphs for feature importance for uninsured %.
#coefs21_i.set_index("Label")
coefs21_i
# copy the data
coefs21_min_max_scaled = coefs21_i.copy()
coefs21_min_max_scaled["Coefficients"] = (coefs21_min_max_scaled["Coefficients"] - coefs21_min_max_scaled["Coefficients"].min()) / (coefs21_min_max_scaled["Coefficients"].max() - coefs21_min_max_scaled["Coefficients"].min())
coefs21_min_max_scaled.plot.pie(subplots=True,figsize=(12,12),legend=False)
The pie chart above was created for another visualization of the coefficients with the largest feature weights with Insurance rates in 2021. First, the coefficients were normalized using min-max scaling. The graph provides a holistic view that all the variables have an importance, yet some are much smaller or larger than others. For example, features such as No disability, Male, College or associate's degree, or being below 400% of the poverty threshold seem important in determining whether a person is insured in 2021. On the other hand, features such as disability have very little correlation with being insured.
This might bring up a question.. does having a disability in Louisiana mean you are less likely to be insured??
coefs15_u.reset_index(inplace=True)
coefs15_uninsured=coefs15_u.copy()
coefs15_u["Coefficients"]=coefs15_u["Coefficients"].abs()
(coefs15_u["Coefficients"]>0.065).value_counts()
new_coefs15_u=coefs15_u[coefs15_u["Coefficients"]>0.065]
new_coefs15_u.dropna(inplace=True)
new_coefs15_u.sort_values(by="Coefficients",ascending=True,inplace=True)
x1=new_coefs15_u["Label"]
y1=new_coefs15_u["Coefficients"]
coefs21_u.reset_index(inplace=True)
coefs21_uninsured=coefs21_u.copy()
coefs21_u["Coefficients"]=coefs21_u["Coefficients"].abs()
(coefs21_u["Coefficients"]>0.048).value_counts()
new_coefs21_u=coefs21_u[coefs21_u["Coefficients"]>0.048]
new_coefs21_u.dropna(inplace=True)
new_coefs21_u.sort_values(by="Coefficients",ascending=True,inplace=True)
x2=new_coefs21_u["Label"]
y2=new_coefs21_u["Coefficients"]
coefs15_uninsured.drop(index=0,inplace=True)
coefs15_uninsured.drop(index=31,inplace=True)
coefs15_uninsured.dropna()
coefs15_uninsured.reset_index(inplace=True)
coefs15_i.reset_index(inplace=True)
coefs15_insured=coefs15_i.copy()
coefs15_i["Coefficients"]=coefs15_i["Coefficients"].abs()
(coefs15_i["Coefficients"]>0.05).value_counts()
new_coefs15_i=coefs15_i[coefs15_i["Coefficients"]>0.05]
new_coefs15_i.dropna(inplace=True)
new_coefs15_i.sort_values(by="Coefficients",ascending=True,inplace=True)
x3=new_coefs15_i["Label"]
y3=new_coefs15_i["Coefficients"]
coefs21_i.reset_index(inplace=True)
coefs21_insured=coefs21_i.copy()
coefs21_i["Coefficients"]=coefs21_i["Coefficients"].abs()
(coefs21_i["Coefficients"]>0.0503).value_counts()
new_coefs21_i=coefs21_i[coefs21_i["Coefficients"]>0.0503]
new_coefs21_i.dropna(inplace=True)
new_coefs21_i.sort_values(by="Coefficients",ascending=True,inplace=True)
x4=new_coefs21_i["Label"]
y4=new_coefs21_i["Coefficients"]
coefs15_insured.drop(index=0,inplace=True)
coefs15_insured.drop(index=31,inplace=True)
coefs15_insured.dropna()
coefs15_insured.reset_index(inplace=True)
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2,figsize=(15,12))
fig.suptitle('Feature Importance by Year and Insured/Uninsured', fontsize=20, y=1.02)
ax1.barh(x1, y1,color={'plum'})
ax1.set_title("2015 Uninsured % Most Important Features")
ax1.set_xlabel("Absolute Value of Feature Weights")
ax2.barh(x2, y2,color={'indigo'})
ax2.set_title("2021 Uninsured % Most Important Features")
ax2.set_xlabel("Absolute Value of Feature Weights")
plt.tight_layout()
ax3.barh(x3, y3,color={'yellowgreen'})
ax3.set_title("2015 Insured % Most Important Features")
ax3.set_xlabel("Absolute Value of Feature Weights")
ax4.barh(x4, y4,color={'darkolivegreen'})
ax4.set_title("2021 Insured % Most Important Features")
ax4.set_xlabel("Absolute Value of Feature Weights")
plt.tight_layout()
As you can see above, the factors differ in which has the highest feature weights depending on year and insured/uninsured %. Our next goal is to take these 5 most important features and use them to predict the 2021 data on the 2015 data. We will explore this in the last section of our analysis.
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(14,14),sharey=False)
fig.suptitle('Deltas', fontsize=20, y=1.02)
coefs_joined_u=coefs15_uninsured.join(other=coefs21_uninsured,how="inner",lsuffix="_2015",rsuffix="_2021")
coefs_joined_u["delta"]=coefs_joined_u["Coefficients_2021"]-coefs_joined_u["Coefficients_2015"]
coefs_joined_u.rename(columns={"Label_2015":"Features"},inplace=True)
coefs_joined_u.set_index("Features",inplace=True)
coefs_joined_u.drop(columns={"Label_2021","Coefficients_2021","Coefficients_2015","index"},inplace=True)
coefs_joined_i=coefs15_insured.join(other=coefs21_insured,how="inner",lsuffix="_2015",rsuffix="_2021")
coefs_joined_i.drop(index=30,inplace=True)
coefs_joined_i["delta"]=coefs_joined_i["Coefficients_2021"]-coefs_joined_i["Coefficients_2015"]
coefs_joined_i.rename(columns={"Label_2015":"Features"},inplace=True)
coefs_joined_i.set_index("Features",inplace=True)
coefs_joined_i.drop(columns={"Label_2021","Coefficients_2021","Coefficients_2015","index"},inplace=True)
coefs_joined_u.plot.barh(ax=ax1,color={"delta": "plum"},figsize=(15, 10))
ax1.axvline(x=0, color='.5')
ax1.set_title("Uninsured",fontsize=15)
ax1.set_xlabel("Delta of Beta Value")
coefs_joined_i.plot.barh(ax=ax2,color={"delta": "yellowgreen"},figsize=(15, 10))
ax2.axvline(x=0, color='.5')
ax2.set_title("Insured",fontsize=15)
ax2.set_xlabel("Delta of Beta Value")
plt.tight_layout()
Here we have the difference in feature weights between 2021 and 2015 for both insured % and uninsured %. One of the most important points on this graph is the delta for "<138 % of poverty threshold" on the uninsured deltas graph.
As is clear, the value for this delta is very negative which means that from 2015 to 2021 being below 138% of the povery threshold became negatively associated with being uninsured.
In plain english, this means that you are LESS likely to be uninsured in 2021 than you were in 2015 if you are below 138% of the poverty threshold. This is important to realize because one of the main tennants of Medicaid expansion is that if you were below 138% of the poverty threshold, you became eligable for Medicaid. This graph is highlighting one of the main effects of Medicaid expansion.
To answer this key question we can use a linear regression.
Here is how the process will work:
We chose 5 factors to explore this linear regression. We chose these for specific reasons described below, but overall we were looking for factors that would impact a person's ability or choice to be insured by Medicaid expansion.
Male - came up in 3 of the feature weights as one of the most important factors.
<138% of poverty threshold - This is the group that should be directly effected by Medicaid Expansion. They are now eligible for Medicaid so we expect to see an increase in their insurance rates.
No disability - Shown in 2 of the feature weights, emphasizes the importance of peole who don't suffer from any disabilities, which may include a majority of healthy adults living in Louisiana. These people have no incentive to be insured because of a specific disability, so they will be effected heavily by Medicaid expansion.
In married couple families - The so-called "normal" households, this also came up in one of our feature weight importance graphs.
Age: 35-44: This is an age group highly effected by Medicaid. They fit the criteria for expansion and din't qualify before. Also seen in a feature importance graph.
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(19,13),sharey=False)
X_train = health15_percent_insured[["Male","< 138% of poverty threshold","No disability","In married couple families","Age: 35-44"]]
Y_train = health15_percent_insured["Insured Percent"]
X_test = health21_percent_insured[["Male","< 138% of poverty threshold","No disability","In married couple families","Age: 35-44"]]
Y_test = health21_percent_insured["Insured Percent"]
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
y_pred = regressor.predict(X_test)
Pred_v_Actual = pd.DataFrame({'Actual': Y_test, 'Predicted': y_pred.flatten()})
Pred_v_Actual["delta"]=Pred_v_Actual["Actual"]-Pred_v_Actual["Predicted"]
Pred_v_Actual_nodelt=Pred_v_Actual.drop(columns={"delta"})
Pred_v_Actual_nodelt.plot.bar(ax=ax1,color={"Actual":"yellowgreen","Predicted":"indigo"})
ax1.set_title("Difference between Predicted and Actual Insurance Rates", fontsize = 10)
ax1.set_ylabel("% Insured Citizens", fontsize = 15)
ax1.set_xlabel("Parish", fontsize = 15)
Pred_v_Actual_=Pred_v_Actual.drop(columns={"Predicted","Actual"})
Pred_v_Actual_.plot.bar(ax=ax2,color="plum")
ax2.set_title("Difference between Predicted and Actual Insurance Rates", fontsize = 10)
ax2.set_ylabel("Delta of % Insured Citizens", fontsize = 15)
ax2.set_xlabel("Parish", fontsize = 15)
These two models show the differences between our prediction for insurance rates of 2021 with the linear regression model and the acutal insurance rates. The left graph shows that all Predicted values were equal to or lower than the Actual values. This shows that actual insurance rates were higher than they wouldn've been without Medicaid!! A great sign!!
The right graph shows an enlarged version of these differences by observing the delta (or difference) between the predicted and observed values. This graph shows us more clearly the differences between parishes, and some parishes such as Rapides and St. Landry were able to highly improve insurance rates thanks to Medicaid. Other parishes such as Ouachita and Jefferson had less of a change, even though we saw a net improvement across all parishes.
Our analysis has been a learning process with insurance and Medicaid expansion. Louisiana is an interesting test case to evaluate since the populations vary here between rural and suburban, and there are huge differences in beliefs such as vaccination, abortion, and more medically-related political controversies. Our analysis has shown that insurance correlates much less with some factors we had expected, yet much more with others. We discovered how large a role gender plays in insurance rates, yet how race plays a smaller role.
As is the case with any predictive model, ours is not perfect. A key assumption that we made about the rates of insurance and uninsurance is that they are in a linear relationship with the features we selected. Because of this assumption, we chose to use a linear regression for our predictions and we got some associations that look strange at first glance.
It is entirely possible that the associations between insurance rates and the features of the dataset are linear, but its more likely that there are non-linear relationships in these data. Furthermore, its entirely possible that there are non-linear relationships between the features themselves that affect rates of insurance.
That is not to say that our model is not capable of reflecting some part of the real world. However, it is important to keep in mind that just because we used "machine learning" doesn't mean that we have unlocked all the secrets of insurance rates in Louisiana. With more resources, it would be worth examining each parish in our dataset individually and seeing what we find. Additionally, it would be worthwhile to explore the use of non-linear models that might more "accurately" capture the relationships of the features.
In conclusion, we were able to examine some interesting aspects of many population features and their relationship to rates of insurance. This is by no means an exhaustive analysis, but it is a good first look at this interesting problem.
%%shell
jupyter nbconvert --to html /content/drive/MyDrive/Colab_Notebooks/Data_Science_Project.ipynb