Since the COVID-19 pandemic began early this year, people have taken various measures to prevent the spread of the disease, such as social distancing, self-isolation, avoiding large gatherings, and, in some countries, mandatory curfews. However, are these the only ways we can stay safe before a vaccine is available to the general public?
In this tutorial, we explore how our diet might help us combat the coronavirus.
For this tutorial, we will use several Python libraries, and we assume that you are familiar with them. Below is a checklist of the modules you need to get started.
# all the modules we need for this tutorial
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
The data we use come from kaggle.com and contain already-collected diet data along with COVID-19 data. In this tutorial, we will use "Food_Supply_Quantity_kg_Data.csv" and "Supply_Food_Data_Descriptions.csv" as our data sources. (All the COVID-19 related data from this source were last updated on 12/3/2020.)
Our first step is to load the data from the CSV files we just obtained into pandas dataframes. One way to do this is with the pd.read_csv() function from the pandas module.
# Import data:
origin_df = pd.read_csv("Food_Supply_Quantity_kg_Data.csv - Sheet1.csv")
origin_df
The following is the description of each food category:
description = pd.read_csv("Supply_Food_Data_Descriptions.csv - Sheet1.csv")
pd.options.display.max_colwidth = 400
description
As you can see, the data above are a bit messy: there are many categories, the categorization of each type of food is very fine-grained, and some categories are redundant. In this step, we will combine and drop some food categories to make the data more concise and easier to understand.
According to USDA's Choose MyPlate program, a daily diet can be categorized into fruits, vegetables, grains, dairy, protein foods, and oils. However, given that countries around the world have different food cultures, we will also add stimulants (such as coffee and tea), spices, and alcoholic beverages to our food categories. Each of our food categories is a combination of the provided food categories; the exact mapping appears in the code below.
We will drop "Miscellaneous" from the original data since the description of this category is very vague.
Moreover, the proportion of each food category in a diet does not reflect the absolute quantity of each food category. A diet might appear healthy from a proportion perspective while the actual quantity of food consumed is very large or very small. This would affect a nation's nutrition status, such as its obesity rate. Therefore, we will also include each country's obesity rate in our dataset.
We also considered including undernourishment rate as one of our nutrition status metrics. However, the undernourishment data provided by this dataset are not accurate or complete enough: for example, many countries' undernourishment rates are recorded either as NaN or just as <2.5. We therefore decided not to use undernourishment rate as one of our nutrition status metrics.
We will also drop all rows that contain NaN values.
Please note: all the values we will be using are percentages (%).
# Create a new df with combinations of variables from the origin_df
data = {'Country': origin_df['Country'],
'Fruits': origin_df['Fruits - Excluding Wine'],
'Vegetables': origin_df['Pulses'] + origin_df['Starchy Roots'] + origin_df['Sugar & Sweeteners'] + origin_df['Sugar Crops'] + origin_df['Vegetables'] + origin_df['Vegetal Products'],
'Grains': origin_df['Cereals - Excluding Beer'],
'Protein': origin_df['Animal Products'] + origin_df['Animal Products'] + origin_df['Eggs'] + origin_df['Fish, Seafood'] + origin_df['Meat'] + origin_df['Offals'],
'Diary': origin_df['Milk - Excluding Butter'],
'Oils': origin_df['Animal fats'] + origin_df['Oilcrops'] + origin_df['Treenuts'] + origin_df['Vegetable Oils'],
'Stimulants': origin_df['Stimulants'],
'Spices': origin_df['Spices'],
'Alcohol': origin_df['Alcoholic Beverages'],
'Obesity': origin_df['Obesity'],
'Undernourished': origin_df['Undernourished'], # not analyzed below, but its NaN values still affect which rows dropna() removes
'COVID_19_Rate': origin_df['Confirmed']}
df = pd.DataFrame(data)
df=df.dropna() # drop all rows with Na.
df
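Because df.dropna() is called without arguments, a row is removed as soon as any of its columns is missing. The sketch below illustrates this on a small made-up table (the rows, values, and variable names are hypothetical and are not taken from the Kaggle data). It also shows how an entry such as "<2.5" behaves when converted to a number, which is one way to check how usable a column like the undernourishment rate is.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the values are made up for illustration only.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Obesity": [20.1, np.nan, 8.3, 25.0],
    "Undernourished": ["<2.5", "10.2", np.nan, "<2.5"],
    "COVID_19_Rate": [1.2, 0.4, 0.1, np.nan],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()

# dropna() with no arguments removes a row if ANY column is missing.
dropped_any = toy.dropna()

# dropna(subset=...) only looks at the listed columns.
dropped_subset = toy.dropna(subset=["Obesity", "COVID_19_Rate"])

# "<2.5" is a string, not a number; errors="coerce" turns it into NaN.
undernourished_numeric = pd.to_numeric(toy["Undernourished"], errors="coerce")

print(missing_per_column)
print(len(dropped_any), len(dropped_subset))
print(undernourished_numeric.isna().sum())
```

Which variant is appropriate depends on which columns the later analysis actually uses; the choice changes how many countries remain in the dataframe.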
In this section, we will plot the data to observe potential trends in our data and see what is useful to analyze.
We will start by visualizing the proportion of each food category in each country's diet using a horizontal stacked bar plot. This is done with the matplotlib.pyplot.barh() function.
plt.figure(figsize=(15, 40)) # Set appropriate window size
# Adding each food proportion to the horizontal stack
plt.barh(df['Country'], df['Fruits'], color='yellow', label = 'Fruits')
plt.barh(df['Country'], df['Vegetables'], left=df['Fruits'], color='green', label = 'Vegetables')
plt.barh(df['Country'], df['Grains'], left=df['Fruits'] + df['Vegetables'], color='tan', label = 'Grains')
plt.barh(df['Country'], df['Protein'], left=df['Fruits'] + df['Vegetables'] + df['Grains'], color='orange', label = 'Protein')
plt.barh(df['Country'], df['Diary'], left=df['Fruits'] + df['Vegetables'] + df['Grains'] + df['Protein'], color='linen', label = 'Dairy')
plt.barh(df['Country'], df['Oils'], left=df['Fruits'] + df['Vegetables'] + df['Grains'] + df['Protein'] + df['Diary'], color='olive', label = 'Oils')
plt.barh(df['Country'], df['Stimulants'], left=df['Fruits'] + df['Vegetables'] + df['Grains'] + df['Protein'] + df['Diary'] + df['Oils'], color='red', label = 'Stimulants')
plt.barh(df['Country'], df['Spices'], left=df['Fruits'] + df['Vegetables'] + df['Grains'] + df['Protein'] + df['Diary'] + df['Oils'] + df['Stimulants'], color='purple', label = 'Spices')
plt.barh(df['Country'], df['Alcohol'], left=df['Fruits'] + df['Vegetables'] + df['Grains'] + df['Protein'] + df['Diary'] + df['Oils'] + df['Stimulants'] + df['Spices'], color='pink', label = 'Alcohol')
plt.title("The Composition of Each Country's Diet")
plt.xlabel("Diet Composition (%)")
plt.ylabel("Country")
res = plt.legend()
As you may notice on the horizontal axis, the total percentage across all food categories exceeds 100% for most countries. This is mostly because some foods are counted in more than one category, so the same food is counted multiple times when we sum the proportions of all food categories. For example, tree nuts can be counted as a source of both oil and protein in a diet, and our Protein column adds "Animal Products" on top of "Eggs", "Fish, Seafood", "Meat", and "Offals".
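The left offsets passed to barh() in the stacked plot are running sums of the preceding categories, and the last running sum is the row total. The sketch below computes both on a small made-up table (the country names and percentages are hypothetical), which gives a quick way to list the rows whose total exceeds 100%.

```python
import pandas as pd

# Hypothetical toy percentages; made up for illustration only.
toy = pd.DataFrame(
    {"Fruits": [10.0, 5.0], "Vegetables": [60.0, 40.0], "Protein": [45.0, 30.0]},
    index=["Country A", "Country B"],
)

# Running sum across categories: the right edge of each stacked segment.
right_edges = toy.cumsum(axis=1)

# The left edge of a segment is the right edge of the previous one (0 for the first).
left_edges = right_edges - toy

# The last running sum is the row total; flag rows whose total exceeds 100%.
totals = right_edges.iloc[:, -1]
over_100 = totals[totals > 100].index.tolist()

print(left_edges)
print(totals)
print(over_100)
```

Applied to the tutorial's dataframe, the same idea would use the nine food category columns in place of the three toy columns.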
The above bar plot shows that vegetables, grains, and protein are the major food categories making up diets across all countries.
Let's start by plotting COVID-19 rate against vegetables, grains, and protein.
We first construct a figure with plt.figure() from the matplotlib module. You can adjust the size of the figure with the figsize parameter. For instance, to set the figure size to 12 by 8, call plt.figure(figsize=(12,8)). This step is optional if you are happy with the default figure size.
Next comes the plotting itself. We want a scatter plot for this analysis, which we can draw with the plt.scatter() function from the matplotlib module.
After plotting the figure, we add the x-label, the y-label, and a title. These labels are important because they tell readers what your plot shows.
With that said, we can start by plotting COVID-19 Rate vs. Vegetables/Grains/Protein:
plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visualization
plt.scatter(df['Vegetables'],df['COVID_19_Rate']) # scatter plot
plt.title("COVID-19 Rate vs. Vegetables Proportion")
plt.xlabel('Vegetables Proportion (%)')
res = plt.ylabel('COVID-19 Rate')
There is possibly a negative correlation between COVID-19 rate and the proportion of vegetables in a diet: as the proportion of vegetables increases, the COVID-19 rate decreases. When the vegetables proportion exceeds approximately 60%, the COVID-19 rate appears to drop drastically. We will explore this relationship further in the analysis section of this tutorial.
plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visualization
plt.scatter(df['Grains'],df['COVID_19_Rate']) # scatter plot
plt.title("COVID-19 Rate vs. Grains Proportion")
plt.xlabel('Grains Proportion (%)')
res = plt.ylabel('COVID-19 Rate')
As with vegetables, there is possibly a negative correlation, and perhaps a stronger one, between COVID-19 rate and the proportion of grains in a diet: as the proportion of grains increases, the COVID-19 rate decreases. However, the relationship does not appear to be exactly linear. We will explore this relationship further in the analysis section of this tutorial.
plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visualization
plt.scatter(df['Protein'],df['COVID_19_Rate']) # scatter plot
plt.title("COVID-19 Rate vs. Protein Proportion")
plt.xlabel('Protein Proportion (%)')
res = plt.ylabel('COVID-19 Rate')
There seems to be a positive correlation between protein proportion and COVID-19 rate: as the proportion of protein increases, the COVID-19 rate increases as well. This runs against the common assumption that protein is an essential source of nutrition. We will explore this further in the analysis section.
Even though diets across these countries are mostly made up of vegetables, grains, and protein, we cannot neglect possible relationships between the other food categories and COVID-19 rate.
Now, we will plot COVID-19 Rate vs. Fruits/Dairy/Oils/Stimulants/Spices/Alcohol:
# Map each remaining dataframe column to the name shown on its plot
# (the column is spelled 'Diary' in df, but the food category is dairy)
other_categories = {'Fruits': 'Fruits', 'Diary': 'Dairy', 'Oils': 'Oils',
                    'Stimulants': 'Stimulants', 'Spices': 'Spices', 'Alcohol': 'Alcohol'}
# Draw one scatter plot per category, all with the same layout
for column, name in other_categories.items():
    plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visualization
    plt.scatter(df[column], df['COVID_19_Rate']) # drawing scatter plot
    plt.title(f"COVID-19 Rate vs. {name} Proportion") # adding title to the plot
    plt.xlabel(f"{name} Proportion (%)") # adding x-label to the plot
    res = plt.ylabel('COVID-19 Rate') # adding y-label to the plot
From the above plots, there seem to be no observable correlations between Fruits/Dairy/Oils/Stimulants/Spices/Alcohol and COVID-19 rate. In most of these plots, the data are concentrated in a narrow horizontal interval and scattered in a random pattern. The exception is COVID-19 Rate vs. Dairy Proportion, where the data are spread at random over the entire domain.
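Visual impressions like these can be backed by a number. As a minimal sketch, the Pearson correlation of every category with the COVID-19 rate can be computed in one call with DataFrame.corrwith(). The table below is made up for illustration (its values and the choice of two columns are hypothetical), so the printed correlations say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data; the numbers are made up for illustration only.
toy = pd.DataFrame({
    "Vegetables": [50.0, 55.0, 60.0, 65.0, 70.0],
    "Spices": [0.3, 0.1, 0.4, 0.2, 0.3],
    "COVID_19_Rate": [2.0, 1.6, 1.1, 0.9, 0.3],
})

# Pearson correlation of every category with the COVID-19 rate,
# sorted from most negative to most positive.
correlations = (
    toy.drop(columns="COVID_19_Rate")
       .corrwith(toy["COVID_19_Rate"])
       .sort_values()
)
print(correlations)
```

A value near -1 or 1 indicates a strong linear association, and a value near 0 indicates little linear association, which matches what a shapeless scatter plot suggests.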
As stated earlier, the food category proportions only show the relative share of each food category in a country's diet, not the absolute quantity. For example, an under-developed country's diet could be mostly vegetable and protein based, yet the amounts of vegetables and protein consumed could be very small compared with other countries. Therefore, we also need to see whether each country's nutrition status is related to COVID-19 rate. In our dataset, obesity rate is our metric of nutrition status.
We will plot COVID-19 Rate vs. Obesity Rate:
# Obesity vs. COVID-19 Rate
plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visualization
plt.scatter(df['Obesity'],df['COVID_19_Rate']) # scatter plot
plt.title('COVID-19 Rate vs. Obesity Rate scatter plot')
plt.xlabel('Obesity Rate (%)')
res = plt.ylabel('COVID-19 Rate')
It appears that there is a potential positive correlation between obesity rate and COVID-19 rate. However, the increase of COVID-19 rate is not gradual. When obesity rate exceeds 15%, COVID-19 rate increases drastically, which indicates this relationship might not be linear.
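When a relationship looks monotonic but not linear, one optional check (not part of the original analysis) is to compare Pearson's correlation with Spearman's rank correlation, which is simply Pearson's correlation computed on the ranks of the values. The sketch below uses made-up numbers that rise slowly and then sharply; the data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with a monotonic but non-linear pattern (made up).
toy = pd.DataFrame({
    "Obesity": [5.0, 10.0, 15.0, 20.0, 25.0, 30.0],
    "COVID_19_Rate": [0.05, 0.1, 0.2, 1.5, 2.0, 4.0],
})

# Pearson measures how well a straight line describes the relationship.
pearson = toy["Obesity"].corr(toy["COVID_19_Rate"])

# Spearman's rank correlation is Pearson's correlation computed on the ranks,
# so it only measures whether the relationship is monotonic.
spearman = toy["Obesity"].rank().corr(toy["COVID_19_Rate"].rank())

print(round(pearson, 3), round(spearman, 3))
```

A Spearman value clearly higher than the Pearson value is a hint that a straight line may understate the association.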
In this section, we will apply machine learning and statistical analysis to explore relationships in the data. We will use a linear regression model to obtain a predictive model of COVID-19 Rate based on diet composition.
After plotting the data, we would like to see whether there really is a strong correlation between COVID-19 rate and vegetables/grains/protein proportions, as we hypothesized. For each of these three relationships separately, we will fit a regression line, get its R-Squared value, and run an Ordinary Least Squares (OLS) regression analysis.
Please note: we will use a significance level of 0.05 for all hypothesis tests below. We will conclude that there is a strong linear relationship between the independent variables and the dependent variable if the model's R-Squared value is greater than or equal to 0.5.
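For readers who want to see what the R-Squared value measures, here is a minimal sketch that computes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. It uses np.polyfit on made-up numbers (the arrays x and y are hypothetical) rather than the tutorial's data. For a model with a single predictor, the result equals the squared Pearson correlation.

```python
import numpy as np

# Hypothetical toy data (made up for illustration only).
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([2.1, 1.7, 1.8, 0.9, 1.0])

# Least-squares line: y is approximated by slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# With one predictor, R^2 equals the squared Pearson correlation.
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)
```

An R-Squared of 0.5 therefore means that the fitted line accounts for half of the variance of the dependent variable in the sample.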
# Get linear regression model
vegetables = np.array(df['Vegetables']).reshape(-1, 1)
reg = LinearRegression().fit(vegetables, df['COVID_19_Rate'])
predicted = reg.predict(vegetables)
r_squared = reg.score(vegetables, df['COVID_19_Rate']) # reg.score returns the R^2 value of the fitted regression model
plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visualization
plt.scatter(df['Vegetables'],df['COVID_19_Rate']) # scatter plot
plt.title("Vegetables category scatter plot")
plt.xlabel('Vegetables Proportion (%)')
res = plt.ylabel('COVID-19 Rate')
# Plot regression line
res = plt.plot(df['Vegetables'], predicted, '-', color='red', label=f"Regression line (R^2 = {r_squared:.3f})")
res = plt.legend()
# OLS Analysis
mod = smf.ols(formula="COVID_19_Rate ~ Vegetables", data=df)
res = mod.fit()
print(res.summary())
# Get linear regression model
grains = np.array(df['Grains']).reshape(-1, 1)
reg = LinearRegression().fit(grains, df['COVID_19_Rate'])
predicted = reg.predict(grains)
r_squared = reg.score(grains, df['COVID_19_Rate']) # reg.score returns the R^2 value of the fitted regression model
plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visualization
plt.scatter(df['Grains'],df['COVID_19_Rate']) # scatter plot
plt.title("Grains category scatter plot")
plt.xlabel('Grains Proportion (%)')
res = plt.ylabel('COVID-19 Rate')
# Plot regression line
res = plt.plot(df['Grains'], predicted, '-', color='red', label=f"Regression line (R^2 = {r_squared:.3f})")
res = plt.legend()
# OLS Analysis
mod = smf.ols(formula="COVID_19_Rate ~ Grains", data=df)
res = mod.fit()
print(res.summary())
# Get linear regression model
protein = np.array(df['Protein']).reshape(-1, 1)
reg = LinearRegression().fit(protein, df['COVID_19_Rate'])
predicted = reg.predict(protein)
r_squared = reg.score(protein, df['COVID_19_Rate']) # reg.score returns the R^2 value of the fitted regression model
plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visualization
plt.scatter(df['Protein'],df['COVID_19_Rate']) # scatter plot
plt.title("Protein category scatter plot")
plt.xlabel('Protein Proportion (%)')
res = plt.ylabel('COVID-19 Rate')
# Plot regression line
res = plt.plot(df['Protein'], predicted, '-', color='red', label=f"Regression line (R^2 = {r_squared:.3f})")
res = plt.legend()
# OLS Analysis
mod = smf.ols(formula="COVID_19_Rate ~ Protein", data=df)
res = mod.fit()
print(res.summary())
The above plots and OLS analyses are consistent with our earlier hypothesis: vegetables and grains proportions have a negative linear relationship with COVID-19 rate, and protein proportion has a positive linear relationship with COVID-19 rate. This claim is supported by the fact that the p-value of the independent variable's coefficient is less than 0.05 in each model, so in each case we reject the null hypothesis that the coefficient is 0.
The R-Squared value of the COVID-19 Rate over Vegetables Proportion model is 0.208, the R-Squared value of the COVID-19 Rate over Grains Proportion model is 0.204, and the R-Squared value of the COVID-19 Rate over Protein Proportion model is 0.292. These values are all below our threshold of 0.5, which suggests that there is no strong linear relationship between vegetables/grains/protein proportions and COVID-19 rate.
Now, let's see if there is a strong linear relationship between obesity rate and COVID-19 rate:
obesity = np.array(df['Obesity']).reshape(-1, 1)
reg = LinearRegression().fit(obesity, df['COVID_19_Rate'])
predicted = reg.predict(obesity)
r_squared = reg.score(obesity, df['COVID_19_Rate']) # reg.score returns the R^2 value of the fitted regression model
plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visualization
plt.scatter(df['Obesity'],df['COVID_19_Rate']) # scatter plot
plt.title("Obesity Rate scatter plot")
plt.xlabel('Obesity Rate (%)')
res = plt.ylabel('COVID-19 Rate')
# Plot regression line
res = plt.plot(df['Obesity'], predicted, '-', color='red', label=f"Regression line (R^2 = {r_squared:.3f})")
res = plt.legend()
# OLS Analysis
mod = smf.ols(formula="COVID_19_Rate ~ Obesity", data=df)
res = mod.fit()
print(res.summary())
Based on the plot and OLS analysis above, there appears to be a positive linear relationship between obesity rate and COVID-19 rate, since the p-value of the independent variable's coefficient is less than 0.05. However, given that the R-Squared value of this model is approximately 0.280, which is below our threshold of 0.5, we conclude that there is no strong linear relationship between obesity rate and COVID-19 rate.
Now, let's see if there exists a strong linear model between COVID-19 rate and vegetables proportion, grains proportion, and protein proportion together.
mod = smf.ols(formula="COVID_19_Rate ~ Vegetables + Grains + Protein", data=df)
res = mod.fit()
print(res.summary())
From the above OLS regression results, we see that although its R-Squared value is still less than 0.5, this model shows a stronger linear relationship between the predictors and the dependent variable than the models with just one food category. However, the p-values of Protein's coefficient and Vegetables' coefficient are both greater than our significance level of 0.05, and the p-value of Protein's coefficient is much larger than that of Vegetables'.
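Rather than reading p-values off the printed summary, they can also be pulled from the fitted results object. The sketch below fits a two-predictor OLS model on randomly generated numbers (the data, the seed, and the names toy, alpha, and summary_table are assumptions for illustration, not the tutorial's dataset) and tabulates each coefficient with its p-value using the params, pvalues, rsquared, and rsquared_adj attributes of the statsmodels results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data generated at random; not the tutorial's dataset.
rng = np.random.default_rng(0)
n = 50
toy = pd.DataFrame({
    "Vegetables": rng.uniform(40, 80, n),
    "Grains": rng.uniform(5, 30, n),
})
toy["COVID_19_Rate"] = (
    3.0 - 0.02 * toy["Vegetables"] - 0.03 * toy["Grains"] + rng.normal(0, 0.3, n)
)

res = smf.ols(formula="COVID_19_Rate ~ Vegetables + Grains", data=toy).fit()

# Collect each coefficient, its p-value, and whether it passes the 0.05 level.
alpha = 0.05
summary_table = pd.DataFrame({
    "coefficient": res.params,
    "p_value": res.pvalues,
    "significant": res.pvalues < alpha,
})
print(summary_table)
print(round(res.rsquared, 3), round(res.rsquared_adj, 3))
```

When comparing models with different numbers of predictors, the adjusted R-Squared is often the more informative figure, because the plain R-Squared cannot decrease when a predictor is added.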
Let's see what happens if we remove Protein from our predictors:
mod = smf.ols(formula="COVID_19_Rate ~ Vegetables + Grains", data=df)
res = mod.fit()
print(res.summary())
It seems that after removing Protein from our model, the p-values of all predictors' coefficients are small enough to reject the null hypothesis that there is no linear relationship between the predictors and the dependent variable.
Let's see what happens if we add obesity rate to our model:
mod = smf.ols(formula="COVID_19_Rate ~ Vegetables + Grains + Obesity", data=df)
res = mod.fit()
print(res.summary())
We can see that by adding obesity rate to our model, the R-Squared value increases slightly, but it is still not large enough to claim that there is a strong linear relationship between the predictors and the dependent variable.
From the above analysis of the correlation between vegetables/grains/protein/obesity rate and COVID-19 rate, we can see that there is no strong linear correlation between these predictors and COVID-19 rate. However, we did discover negative correlations between vegetables/grains proportions and COVID-19 rate, meaning that as these proportions increase, the COVID-19 rate decreases. Protein is the only food category that has a positive correlation with COVID-19 rate: as the protein proportion increases, COVID-19 rate increases as well. Obesity rate is also positively correlated with COVID-19 rate, as we saw in its scatter plot and single-predictor model. Moreover, although our model of COVID-19 rate on vegetables proportion, grains proportion, and obesity rate is not strongly linear, these factors are associated with COVID-19 rate, given that their coefficients' p-values are much less than the significance level of 0.05.
When we first obtained the data, we wanted to show that there is some kind of relationship between people's diet and the likelihood of getting COVID-19. After plotting the dataset, we hypothesized that there might be a linear relationship between COVID-19 rate and the food categories that make up the majority of diets across the world, such as vegetables, grains, and protein. However, our linear regression analysis led us to conclude that these food categories do not have a strong linear correlation with COVID-19 rate. Among the food categories we analyzed, we found a potential negative correlation between the proportions of vegetables and grains in a diet and the likelihood of getting COVID-19, and a potential positive correlation between the protein proportion and the likelihood of getting COVID-19.
Further analysis is needed on this topic, because countless factors, such as each country's public health policy, economic power, population, diet, and food culture, affect the likelihood of being infected with a respiratory disease like COVID-19. Our finding that there is no strong linear relationship between diet and the likelihood of being infected by COVID-19 does not mean there is no relationship at all. We are aware that our data might be flawed, given that the categorization is not very precise. A more rigorous food categorization method is needed to explore the relationship between diet and COVID-19. Also, more advanced regression models and machine learning methods, such as polynomial regression and GMM, could be applied to the data.
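As a starting point for the polynomial regression mentioned above, the sketch below compares a straight-line fit with a quadratic fit using np.polyfit. The data are made up to follow a curved trend, and the helper r_squared is a hypothetical name introduced here; none of this comes from the tutorial's dataset.

```python
import numpy as np

def r_squared(y, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Hypothetical toy data with a curved trend (made up for illustration only).
x = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
y = np.array([0.05, 0.1, 0.2, 1.5, 2.0, 4.0])

# Fit a degree-1 (linear) and a degree-2 (quadratic) polynomial by least squares.
linear_fit = np.polyfit(x, y, deg=1)
quadratic_fit = np.polyfit(x, y, deg=2)

r2_linear = r_squared(y, np.polyval(linear_fit, x))
r2_quadratic = r_squared(y, np.polyval(quadratic_fit, x))
print(r2_linear, r2_quadratic)
```

Note that the in-sample R-Squared never decreases when a higher-degree term is added, so a higher value alone does not show that the more flexible model is better; a held-out set or cross-validation would be needed for that.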
If you need any information regarding COVID-19, please visit the CDC's website.