Analysis of Diet and COVID-19¶

Authors: Miaokun Chen, Haoran Li

Introduction¶

Since the start of the COVID-19 pandemic early this year, people have taken various measurements, such as social distancing, self-isolation, avoid large gathering, and mandetory curfew in some countries, to prevent the spread of the disease. However, is this the only way that we can stay safe before a vaccine is avaliable to the general pubic?

In this tutorial, we are trying to explore how our diet would help us combat the coronavirus.

Libaraies Used¶

For this tutorial, we will be using several python libraries and we will assume that you are familiar with these libaraies. Below is a module checklist for what you need to get started.

numpy
pandas
scikit-learn
matplotlib
statsmodels

# all the modules we need for this tutorial
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

Data Collection¶

The data that we found is from kaggle.com(click to check & download), which contains already collected diet data along with COVID-19 data. In this tutorial, we will be using "Food_Supply_Quantity_kg_Data.csv" and "Supply_Food_Data_Descriptions.csv" as our data source. (All the COVID-19 related data from this source are updated at 12/3/2020.)

Now it comes to our first step, we want to store our collected data from the CSV source we just obtain to the pandas dataframe. Below is one of the ways we can approach it, using the function pd.read_csv() from pandas module.

# Import data:
origin_df = pd.read_csv("Food_Supply_Quantity_kg_Data.csv - Sheet1.csv")
origin_df

The following is the description of each food category:

description = pd.read_csv("Supply_Food_Data_Descriptions.csv - Sheet1.csv")
pd.options.display.max_colwidth = 400
description

Data Processing¶

As you can see, the above data seems to be a bit messy, given there are so many categories and the categorization of each type of food seems to be too precise and some of them are redundent. In this step, we will combine and drop some food categories to make the data more brief and understandable.

According to USDA's Choose MyPlate program ,a daily diet can be categorized into fruits, vegetabels, grains, diary, protein foods, and oils. However, given that countries have different food cultures around the world, we will also add stimulants (such as coffee, and tea), spcies, and alcoholic beverages to our food categories. Our food category will be a combination of the provided food categories:

Fruits: Fruits-excluding wine
Vegatables: Pulses, staychy roots, sugar & sweeteners, sugar crops, vehetables, vegetal products
Grains: Cereals-exclusing beers
Protein Foods: Animal products, aquatic products, other, eggs, seafood, fish, meat, offals.
Diary: Milk-excluding butter.
Oils: Animal fat, oilcrops, treenuts, vegetable oils.
Stimulants: coffee, tea, etc.
Spices: Cloves; Pepper; Pimento; Spices, Other
Alcholoic Drink: Alcohol, Non-Food; Beer; Beverages, Alcoholic; Beverages, Fermented; Wine

We will drop "Miscellaneous" from the original data since the decription of this category is very vague.

Moreover, the proportion of each food category in a diet does not reflect the absolute quantity of each food cateory, which means a diet might appear to be healthy from a proportion prespective, but the actual quantity of the diet might be very large or small. This would effect a nation's nutrition status, such as obesity rate rate. Therefore, we will also include each country's obesity rate in our dataset.

We have also considered including undernourishment rate as part of our nutrition status matrics, but given the undernourishment rate data provided by this data set is not accurate and complete enought, for example, many countries' undernourishment rates are either recorded as Na or just <2.5, we decided not to include undernourishment rate as part of our nutrition status matrics.

We will also drop all rows with Na.

Please note: all the data we will be using are in %

# Create a new df with combinations of variables from the origin_df
data = {'Country': origin_df['Country'], 
        'Fruits': origin_df['Fruits - Excluding Wine'], 
       'Vegetables': origin_df['Pulses'] + origin_df['Starchy Roots'] + origin_df['Sugar & Sweeteners'] + origin_df['Sugar Crops'] + origin_df['Vegetables'] + origin_df['Vegetal Products'], 
       'Grains': origin_df['Cereals - Excluding Beer'], 
       'Protein': origin_df['Animal Products'] + origin_df['Animal Products'] + origin_df['Eggs'] + origin_df['Fish, Seafood'] + origin_df['Meat'] + origin_df['Offals'], 
       'Diary': origin_df['Milk - Excluding Butter'], 
       'Oils': origin_df['Animal fats'] + origin_df['Oilcrops'] + origin_df['Treenuts'] + origin_df['Vegetable Oils'], 
       'Stimulants': origin_df['Stimulants'], 
       'Spices': origin_df['Spices'], 
       'Alcohol': origin_df['Alcoholic Beverages'], 
        'Obesity': origin_df['Obesity'],
        'Undernourished': origin_df['Undernourished'], 
       'COVID_19_Rate': origin_df['Confirmed']}

df = pd.DataFrame(data)
df=df.dropna() # drop all rows with Na. 

df

Data Visulization¶

In this section, we will plot the data to observe potential trends in our data and see what is useful to analyze.

We will start by visualize the proportion of each food category of each country's diet using a horizontal stacked bar plot. This process is done using matplotlib.plot.barh() function.

plt.figure(figsize=(15, 40)) # Set appropriate window size

# Adding each food proportion to the horizontal stack
plt.barh(df['Country'], df['Fruits'], color='yellow', label = 'Fruits')
plt.barh(df['Country'], df['Vegetables'], left=df['Fruits'], color='green', label = 'Vegetables')
plt.barh(df['Country'], df['Grains'], left=df['Fruits'] + df['Vegetables'], color='tan', label = 'Grains')
plt.barh(df['Country'], df['Protein'], left=df['Fruits'] + df['Vegetables'] + df['Grains'], color='orange', label = 'Protein')
plt.barh(df['Country'], df['Diary'], left=df['Fruits'] + df['Vegetables'] + df['Grains'] + df['Protein'], color='linen', label = 'Diary')
plt.barh(df['Country'], df['Oils'], left=df['Fruits'] + df['Vegetables'] + df['Grains'] + df['Protein'] + df['Diary'], color='olive', label = 'Oils')
plt.barh(df['Country'], df['Stimulants'], left=df['Fruits'] + df['Vegetables'] + df['Grains'] + df['Protein'] + df['Diary'] + df['Oils'], color='red', label = 'Stimulants')
plt.barh(df['Country'], df['Spices'], left=df['Fruits'] + df['Vegetables'] + df['Grains'] + df['Protein'] + df['Diary'] + df['Oils'] + df['Stimulants'], color='purple', label = 'Spices')
plt.barh(df['Country'], df['Alcohol'], left=df['Fruits'] + df['Vegetables'] + df['Grains'] + df['Protein'] + df['Diary'] + df['Oils'] + df['Stimulants'] + df['Spices'], color='pink', label = 'Alcohol')

plt.title("The Composition of Each Country's Diet")
plt.xlabel("Diet Composition (%)")
plt.ylabel("Country")

res = plt.legend()

As you can notice on the horizonal axis, that the total percentage all call food categories of most countries exceeds 100%. This is mostly because some food might appear in mutiple categories, which results in counting the same food multiple times when we sum the proportion of all food categories. For example, treenuts can be counted as sources of both oil and protein in a diet.

The above bar plot shows that vegetables, grains, and protein are the major food categories making up diets across all countries.

Let's start by plotting COVID-19 rate over vegetables, grains, and protein.

We can start by constructing a figure by using plt.figure() in the matplotlib module, you can adjust the size of the figure by adding the parameter figsize. For instance, if we want to set the figure size to be 12 by 8, then we can apply the following plt.figure(figsize=(12,8)), please note that this step is totally optional if you are good with the default figure size.
Then it comes to our plotting procedure, we want to draw scatter plot in this case for analysis, to do that we can use the scatter function plt.scatter() from matplotlib module.
After plotting the figure, we want to add the x-label, y-label along with the title for the graph. It is important that you apply title, x-label and y-label to illustrate the subjects of your plot to the people who are seeing it.

With the above being said, we can start by plotting COVID-19 Rate vs. Vegetables/Grains/Protein:

Vegetables:¶

plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visulization
plt.scatter(df['Vegetables'],df['COVID_19_Rate']) # scatter plot
plt.title("COVID-19 Rate vs. Vegetables Proportion")
plt.xlabel('Vegetables Proportion (%)')

res = plt.ylabel('COVID-19 Rate')

It seems that there is possibily a negative correlation between COVID-19 Rate and Vegetables Proportion in a diet. As the proportion of vegetables increases, the COVID-19 rate decreases. It appears that when vegetablee propotion exceeds approximately 60%, the COVID-19 rate drops drastically. We will explore the exact relationship between COVID-19 rate and vegetables protion in the analysis section of this tutorial.

Grains:¶

plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visulization
plt.scatter(df['Grains'],df['COVID_19_Rate']) # scatter plot
plt.title("COVID-19 Rate vs. Grains Proportion")
plt.xlabel('Grains Proportion (%)')
res = plt.ylabel('COVID-19 Rate')

Similar to vegetables, there is possibily a stronger negative correlation between COVID-19 rate and grans proportion in a diet. As the proportion of grains increases, the COVID-19 rate decreases. However, the relationship does not appear to be exactly linear, we will explore more about this relationship in the analysis section of this tutorial.

Protein:¶

plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visulization
plt.scatter(df['Protein'],df['COVID_19_Rate']) # scatter plot
plt.title("COVID-19 Rate vs. Protein Proportion")
plt.xlabel('Protein Proportion (%)')
res = plt.ylabel('COVID-19 Rate')

It seems that there is a positive correlation between protein proportion and COVID-19 rate. As the proportion of protein increases, COVID-19 rate increases as well. However, this does not follow our normal assumption where protein is an essential source of nutrition. We will explore more on this in the analysis section.

Even though diets accross these countries are mostly made up by vegetables, grains, and protein, we cannot neglect the possible relationship between other food categories and COVID-19 rate.

Now, we will plot COVID-19 Rate vs. Fruits/Diary/Oils/Stumulants/Spices/Alcohol:

Fruits:¶

plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visulization
plt.scatter(df['Fruits'],df['COVID_19_Rate']) # drawing scatter plot
plt.title("COVID-19 Rate vs. Fruits Proportion") # adding title to the plot
plt.xlabel('Fruits Proportion (%)') # adding x-label to the plot
res = plt.ylabel('COVID-19 Rate') # adding y-label to the plot

Diary:¶

plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visulization
plt.scatter(df['Diary'],df['COVID_19_Rate']) # scatter plot
plt.title("Diary category scatter plot")
plt.xlabel('Diary proportion (%)')
res = plt.ylabel('COVID-19 Rate')

Oils:¶

plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visulization
plt.scatter(df['Oils'],df['COVID_19_Rate']) # scatter plot
plt.title("Oils category scatter plot")
plt.xlabel('Oils proportion (%)')
res = plt.ylabel('COVID-19 Rate')

Stimulants:¶

plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visulization
plt.scatter(df['Stimulants'],df['COVID_19_Rate']) # scatter plot
plt.title('Stimulants category scatter plot')
plt.xlabel('Stimulants proportion (%)')
res = plt.ylabel('COVID-19 Rate')

Spices:¶

plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visulization
plt.scatter(df['Spices'],df['COVID_19_Rate']) # scatter plot
plt.title('Spices category scatter plot')
plt.xlabel('Spices proportion (%)')
res = plt.ylabel('COVID-19 Rate')

Alcohol:¶

plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visulization
plt.scatter(df['Alcohol'],df['COVID_19_Rate']) # scatter plot
plt.title('Alcohol category scatter plot')
plt.xlabel('Alcohol proportion (%)')
res = plt.ylabel('COVID-19 Rate')

From the above plots, it seems that there no observable correlations between Fruits/Diary/Oils/Stimilants/Spices/Alcohol and COVID-19 rate. They all share the property where most of the data are concentrated on a certain horizontal interval and spreading in random patterns, with the exception of COVID-19 vs. Diary Proportion, where most of the data are spreading out on the entire domain at random.

As we stated earlier, the food category proportion of each country's diet only shows the relative proportion of each food category in a diet. It does not show the abosolute quantity of each food category in a diet. For example, an under-developed country's diet could be majorly vegetables and protein based, but the amount vegetables and protein are very small compared with other countries. Therefore, we also need to see if the nutrition status of each country has relationship with COVID-19 rate. In our dataset, we have obesity rate as our matrics of nutrition status.

We will plot COVID-19 Rate vs. Obesity Rate

Obesity Rate¶

# Obesity vs. COVID-19 Rate
plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visulization
plt.scatter(df['Obesity'],df['COVID_19_Rate']) # scatter plot
plt.title('COVID-19 Rate vs. Obesity Rate scatter plot')
plt.xlabel('Obesity Rate (%)')
res = plt.ylabel('COVID-19 Rate')

It appears that there is a potential positive correlation between obesity rate and COVID-19 rate. However, the increase of COVID-19 rate is not gradual. When obesity rate exceeds 15%, COVID-19 rate increases drastically, which indicates this relationship might not be linear.

Data Analysis¶

In this section, we will apply machine learning and statistical analysis to explore relationships between data. We will use linear regression model to obtain a predictive model of COVID-19 Rate based on diet composition.

After plotting the data, we would like to see if there is really a strong correlation between COVID-19 rate and vegetables/grains/protein proportions as we hypothesized. We will fit a regression line and get their R-Squared value, and use Ordinary Least Square(OLS) regression analysis on these three relationships seperately.

Please note: we will use significance level of 0.05 for all hypothesis tests below. We will conclude there is a strong linear relationship between indepedent variables and depedent variable if the model's R-Squared value is greater than or equal to 0.5

Vegetables¶

# Get linear regression model
vegetables = np.array(df['Vegetables']).reshape(-1, 1)
reg = LinearRegression().fit(vegetables, df['COVID_19_Rate'])
predicted = reg.predict(vegetables)
r_squared = reg.score(vegetables, df['COVID_19_Rate']) # reg.score return the R^2 value of the current regression model

plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visulization
plt.scatter(df['Vegetables'],df['COVID_19_Rate']) # scatter plot
plt.title("Vegetables category scatter plot")
plt.xlabel('Vegetables Proportion (%)')
res = plt.ylabel('COVID-19 Rate')

# Plot regression line
res = plt.plot(df['Vegetables'], predicted, '-',color='red', label = r_squared)
res = plt.legend()

# OLS Analysis
mod = smf.ols(formula="COVID_19_Rate ~ Vegetables", data=df)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          COVID_19_Rate   R-squared:                       0.208
Model:                            OLS   Adj. R-squared:                  0.203
Method:                 Least Squares   F-statistic:                     40.19
Date:                Fri, 18 Dec 2020   Prob (F-statistic):           2.46e-09
Time:                        20:05:11   Log-Likelihood:                -250.84
No. Observations:                 155   AIC:                             505.7
Df Residuals:                     153   BIC:                             511.8
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.6347      0.550      8.424      0.000       3.548       5.722
Vegetables    -0.0650      0.010     -6.340      0.000      -0.085      -0.045
==============================================================================
Omnibus:                       27.069   Durbin-Watson:                   2.333
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               35.957
Skew:                           1.018   Prob(JB):                     1.56e-08
Kurtosis:                       4.191   Cond. No.                         299.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Grains¶

# Get linear regression model
grains = np.array(df['Grains']).reshape(-1, 1)
reg = LinearRegression().fit(grains, df['COVID_19_Rate'])
predicted = reg.predict(grains)
r_squared = reg.score(grains, df['COVID_19_Rate']) # reg.score return the R^2 value of the current regression model

plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visulization
plt.scatter(df['Grains'],df['COVID_19_Rate']) # scatter plot
plt.title("Grains category scatter plot")
plt.xlabel('Grains Proportion (%)')
res = plt.ylabel('COVID-19 Rate')

# Plot regression line
res = plt.plot(df['Grains'], predicted, '-',color='red', label = r_squared)
res = plt.legend()

# OLS Analysis
mod = smf.ols(formula="COVID_19_Rate ~ Grains", data=df)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          COVID_19_Rate   R-squared:                       0.204
Model:                            OLS   Adj. R-squared:                  0.198
Method:                 Least Squares   F-statistic:                     39.11
Date:                Fri, 18 Dec 2020   Prob (F-statistic):           3.81e-09
Time:                        20:05:32   Log-Likelihood:                -251.27
No. Observations:                 155   AIC:                             506.5
Df Residuals:                     153   BIC:                             512.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.4657      0.225     10.967      0.000       2.022       2.910
Grains        -0.1051      0.017     -6.254      0.000      -0.138      -0.072
==============================================================================
Omnibus:                       18.448   Durbin-Watson:                   2.264
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               21.091
Skew:                           0.860   Prob(JB):                     2.63e-05
Kurtosis:                       3.557   Cond. No.                         30.5
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Protein¶

# Get linear regression model
protein = np.array(df['Protein']).reshape(-1, 1)
reg = LinearRegression().fit(protein, df['COVID_19_Rate'])
predicted = reg.predict(protein)
r_squared = reg.score(protein, df['COVID_19_Rate']) # reg.score return the R^2 value of the current regression model

plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visulization
plt.scatter(df['Protein'],df['COVID_19_Rate']) # scatter plot
plt.title("Protein category scatter plot")
plt.xlabel('Protein Proportion (%)')
res = plt.ylabel('COVID-19 Rate')

# Plot regression line
res = plt.plot(df['Protein'], predicted, '-',color='red', label = r_squared)
res = plt.legend()

# OLS Analysis
mod = smf.ols(formula="COVID_19_Rate ~ Protein", data=df)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          COVID_19_Rate   R-squared:                       0.292
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     63.03
Date:                Fri, 18 Dec 2020   Prob (F-statistic):           4.07e-13
Time:                        20:05:49   Log-Likelihood:                -242.18
No. Observations:                 155   AIC:                             488.4
Df Residuals:                     153   BIC:                             494.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.4348      0.226     -1.920      0.057      -0.882       0.013
Protein        0.0555      0.007      7.939      0.000       0.042       0.069
==============================================================================
Omnibus:                       22.735   Durbin-Watson:                   2.464
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               31.199
Skew:                           0.832   Prob(JB):                     1.68e-07
Kurtosis:                       4.437   Cond. No.                         78.6
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

From the above plots and their OLS analysis, it confirmed our hypothesis ealier, where vegetable and grains proportion have negative linear relationship with COVID-19 rate, and protein proportion has a positive linear relationship with COVID-19 rate. This claim can be supported by the fact that the p-values of all indepedent varible's coefficients are less than 0.05, which rejects the null hypothesis where all indepedent variables' coefficients are 0.

R-Squared value of COVID-19 Rate over Vegetables Proportion model is 0.208, R-Squared value of COVID-19 Rate over Grains Proportion model is 0.204, and the R-Squared value of COVID-19 Rate over Protein Proportion model is 0.292. These R-Squared values suggest that there is no strong linear relationships between vegetables/grains/protein proportions and COVID-19 rate.

Now, let's see if there is a strong linear relationship between obesity rate and COVID-19 rate

Obesity¶

obesity = np.array(df['Obesity']).reshape(-1, 1)
reg = LinearRegression().fit(obesity, df['COVID_19_Rate'])
predicted = reg.predict(obesity)
r_squared = reg.score(obesity, df['COVID_19_Rate']) # reg.score return the R^2 value of the current regression model

plt.figure(figsize=(12,8)) # make the figure slightly bigger for better visulization
plt.scatter(df['Obesity'],df['COVID_19_Rate']) # scatter plot
plt.title("Obesity Rate scatter plot")
plt.xlabel('Obesity Rate (%)')
res = plt.ylabel('COVID-19 Rate')

# Plot regression line
res = plt.plot(df['Obesity'], predicted, '-',color='red', label = r_squared)
res = plt.legend()

# OLS Analysis
mod = smf.ols(formula="COVID_19_Rate ~ Obesity", data=df)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          COVID_19_Rate   R-squared:                       0.280
Model:                            OLS   Adj. R-squared:                  0.275
Method:                 Least Squares   F-statistic:                     59.38
Date:                Fri, 18 Dec 2020   Prob (F-statistic):           1.53e-12
Time:                        20:07:00   Log-Likelihood:                -243.50
No. Observations:                 155   AIC:                             491.0
Df Residuals:                     153   BIC:                             497.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.2337      0.209     -1.119      0.265      -0.646       0.179
Obesity        0.0780      0.010      7.706      0.000       0.058       0.098
==============================================================================
Omnibus:                       29.903   Durbin-Watson:                   2.125
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               43.525
Skew:                           1.038   Prob(JB):                     3.54e-10
Kurtosis:                       4.558   Cond. No.                         45.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Based on the plot and OSL analysis above, it appears that there is a positive linear relaionship between obesity rate and COVID-19 rate since the p-value of the indepedent variable's coefficient is < 0.05. However, given that the R-Squared value of the above is approximately 0.280, which is small, we conclude that there is no strong linear relationship between obesity rate and COVID-19 rate.

Now, let's see if there exists a strong linear model between COVID-19 rate, vagatables proportion, grains proportion, and protein proportion.

mod = smf.ols(formula="COVID_19_Rate ~ Vegetables + Grains + Protein", data=df)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          COVID_19_Rate   R-squared:                       0.335
Model:                            OLS   Adj. R-squared:                  0.321
Method:                 Least Squares   F-statistic:                     25.32
Date:                Fri, 18 Dec 2020   Prob (F-statistic):           2.52e-13
Time:                        20:07:31   Log-Likelihood:                -237.33
No. Observations:                 155   AIC:                             482.7
Df Residuals:                     151   BIC:                             494.8
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.6182      2.363      1.531      0.128      -1.051       8.287
Vegetables    -0.0369      0.028     -1.314      0.191      -0.092       0.019
Grains        -0.0735      0.025     -2.992      0.003      -0.122      -0.025
Protein        0.0141      0.023      0.602      0.548      -0.032       0.060
==============================================================================
Omnibus:                       19.698   Durbin-Watson:                   2.436
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               24.493
Skew:                           0.790   Prob(JB):                     4.80e-06
Kurtosis:                       4.138   Cond. No.                     1.60e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.6e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

From the above OLS regression results, we see that although is R-Squared value is still less than 0.5, this model shows a stronger linear relationship between the predictors and depedent varible than models with just one food category. However, the p-value for Protein's coefficient and Vegetables' coefficient are both greater than our significance level of 0.05, and the p-value of Protein's coefficient is much larger than Vegetables'.

Let's what happens if we remove Protein from our predictors:

mod = smf.ols(formula="COVID_19_Rate ~ Vegetables + Grains", data=df)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          COVID_19_Rate   R-squared:                       0.333
Model:                            OLS   Adj. R-squared:                  0.324
Method:                 Least Squares   F-statistic:                     37.96
Date:                Fri, 18 Dec 2020   Prob (F-statistic):           4.26e-14
Time:                        20:08:45   Log-Likelihood:                -237.52
No. Observations:                 155   AIC:                             481.0
Df Residuals:                     152   BIC:                             490.2
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.0070      0.511      9.792      0.000       3.997       6.017
Vegetables    -0.0528      0.010     -5.432      0.000      -0.072      -0.034
Grains        -0.0847      0.016     -5.338      0.000      -0.116      -0.053
==============================================================================
Omnibus:                       19.704   Durbin-Watson:                   2.416
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               24.174
Skew:                           0.801   Prob(JB):                     5.63e-06
Kurtosis:                       4.084   Cond. No.                         310.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

It seems that by removing Protein from our model, the p-value of all predictor's coefficients are small enough to reject the null hypothesis where there is no linear relationships between the predictors and the depedent variable.

Let's what happens if we add obesity rate into our model:

mod = smf.ols(formula="COVID_19_Rate ~ Vegetables + Grains + Obesity", data=df)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          COVID_19_Rate   R-squared:                       0.379
Model:                            OLS   Adj. R-squared:                  0.366
Method:                 Least Squares   F-statistic:                     30.67
Date:                Fri, 18 Dec 2020   Prob (F-statistic):           1.53e-15
Time:                        20:09:07   Log-Likelihood:                -232.03
No. Observations:                 155   AIC:                             472.1
Df Residuals:                     151   BIC:                             484.2
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.1785      0.740      4.298      0.000       1.717       4.640
Vegetables    -0.0389      0.010     -3.777      0.000      -0.059      -0.019
Grains        -0.0560      0.018     -3.173      0.002      -0.091      -0.021
Obesity        0.0407      0.012      3.328      0.001       0.017       0.065
==============================================================================
Omnibus:                       21.561   Durbin-Watson:                   2.315
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               29.624
Skew:                           0.791   Prob(JB):                     3.69e-07
Kurtosis:                       4.443   Cond. No.                         486.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We can see that by adding obesity rate to our model, the R-squared value slightly increases, but still not big enough to claim that there is a strong linear relationship between the predictors and the depedent variable.

From the above analysis on the correlation between vegetables/grains/protein/obesity rate and COVID-19 rate, we can see there is no strong linear correlation between the predictors and COVID-19 rate. However, we did discover is that there are negative correlations between vegetables/grains/obesity rate and COVID-19 rate, meaning that as these factors increases, the COVID-19 rate decreases. Protein is the only food category that has positive correlation with COVID-19 rate: as protein portion increases, COVID-19 rate increases as well. Moreover, although our model between COVID-19 rate and vegetable, grains portions, and obesity is not strongly linear, we can see that these factors do play a role in affecting COVID-rates, given that their coefficients' p-values are much less than the significance level of 0.05.

Insights¶

When we first obtained the data, we want to show that there is some kind of relationship between people's diet and the likelihood of getting COVID-19. After plotting out the dataset, we hypothesized that there might be a linear relationship between the food categories that make up the majority of the diet accross the world, such as vegetables, grains, and protein, and COVID-19 rate. However, when we did a linear regression analysis on these data, we conclude that these food categories do not have a strong linear correlation with COVID-19 rate. Among the few food categories we analyzed, we found that there is a potential negative correlation between the proportions of vegetables and grains in a diet, and the likelihood of getting COVID-19; there is a a potential positive correlation between the protein proportion, and the likelihood of getting COVID-19.

Further analysis is need on this topic, because there are countless factors, such as each country's public health policy, nation's economic power, population, diet, food culture, etc., effect the likelihood of being infected with a respetory disease like COVID-19. Our analysis showing there is no linear relationship between diet and the likelihood of being infected by COVID-19 does not mean there is no relationship at all. We are aware that our data might be flawed given that the categorization is not extremely precise. A more rigorous food categorization method is needed to explore the relationship between diet and COVID-19. Also, more advanced regression models and machine learning methods can be applied to the data such as polynomial regression, GMM, etc.

If you need any information regarding COVID-19, please visit CDC's website

	Country	Alcoholic Beverages	Animal fats	Animal Products	Aquatic Products, Other	Cereals - Excluding Beer	Eggs	Fish, Seafood	Fruits - Excluding Wine	Meat	...	Vegetables	Vegetal Products	Obesity	Undernourished	Confirmed	Deaths	Recovered	Active	Population	Unit (all except Population)
0	Afghanistan	0.0014	0.1973	9.4341	0.0000	24.8097	0.2099	0.0350	5.3495	1.2020	...	6.7642	40.5645	4.5	29.8	0.120684	0.004680	0.095114	0.019903	38928000.0	%
1	Albania	1.6719	0.1357	18.7684	0.0000	5.7817	0.5815	0.2126	6.7861	1.8845	...	11.7753	31.2304	22.3	6.2	1.399542	0.029563	0.701621	0.668358	2838000.0	%
2	Algeria	0.2711	0.0282	9.6334	0.0000	13.6816	0.5277	0.2416	6.3801	1.1305	...	11.6484	40.3651	26.6	3.9	0.191816	0.005555	0.123971	0.062290	44357000.0	%
3	Angola	5.8087	0.0560	4.9278	0.0000	9.1085	0.0587	1.7707	6.0005	2.0571	...	2.3041	45.0722	6.8	25	0.047103	0.001079	0.025026	0.020998	32522000.0	%
4	Antigua and Barbuda	3.5764	0.0087	16.6613	0.0000	5.9960	0.2274	4.1489	10.7451	5.6888	...	5.4495	33.3233	19.1	NaN	0.146939	0.004082	0.135714	0.007143	98000.0	%
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
165	Venezuela (Bolivarian Republic of)	2.5952	0.0403	14.7565	0.0000	12.9253	0.3389	0.9456	7.6460	3.8328	...	4.1474	35.2416	25.2	21.2	0.359808	0.003159	0.342224	0.014425	28645000.0	%
166	Vietnam	1.4591	0.1640	8.5765	0.0042	16.8740	0.3077	2.6392	5.9029	4.4382	...	11.9508	41.4232	2.1	9.3	0.001412	0.000036	0.001248	0.000127	96209000.0	%
167	Yemen	0.0364	0.0446	5.7874	0.0000	27.2077	0.2579	0.5240	5.1344	2.7871	...	3.2135	44.2126	14.1	38.9	0.007433	0.002082	0.005090	0.000262	29826000.0	%
168	Zambia	5.7360	0.0829	6.0197	0.0000	21.1938	0.3399	1.6924	1.0183	1.8427	...	3.4649	43.9789	6.5	46.7	0.096279	0.001942	0.092940	0.001398	18384000.0	%
169	Zimbabwe	4.0552	0.0755	8.1489	0.0000	22.6240	0.2678	0.5518	2.2000	2.6142	...	2.3213	41.8526	12.3	51.3	0.068149	0.001864	0.058151	0.008134	14863000.0	%

	Country	Fruits	Vegetables	Grains	Protein	Diary	Oils	Stimulants	Spices	Alcohol	Obesity	Undernourished	COVID_19_Rate
0	Afghanistan	5.3495	49.8531	24.8097	20.5208	7.5828	0.8788	0.3078	0.0574	0.0014	4.5	29.8	0.120684
1	Albania	6.7861	46.5900	5.7817	40.4478	15.7213	1.5510	0.1055	0.0008	1.6719	22.3	6.2	1.399542
2	Algeria	6.3801	58.4600	13.6816	21.2536	7.6189	1.5237	0.2216	0.0557	0.2711	26.6	3.9	0.191816
3	Angola	6.0005	67.9867	9.1085	13.8971	0.8311	1.1270	0.0508	0.0009	5.8087	6.8	25	0.047103
5	Argentina	6.0435	41.1546	8.4102	47.5780	10.2328	1.2091	0.4378	0.0122	4.2672	28.5	4.6	3.173641
...	...	...	...	...	...	...	...	...	...	...	...	...	...
165	Venezuela (Bolivarian Republic of)	7.6460	45.6981	12.9253	34.8381	9.3920	1.4427	0.1479	0.0009	2.5952	25.2	21.2	0.359808
166	Vietnam	5.9029	56.7364	16.8740	24.9530	0.6069	1.5321	0.2880	0.2074	1.4591	2.1	9.3	0.001412
167	Yemen	5.1344	54.4168	27.2077	15.4265	1.8911	1.2167	0.2199	0.0347	0.0364	14.1	38.9	0.007433
168	Zambia	1.0183	57.1475	21.1938	16.2192	1.7570	2.6236	0.0618	0.0478	5.7360	6.5	46.7	0.096279
169	Zimbabwe	2.2000	52.2355	22.6240	19.9402	4.4310	2.2874	0.1494	0.0252	4.0552	12.3	51.3	0.068149

	Categories	Items
0	Alcoholic Beverages	Alcohol, Non-Food; Beer; Beverages, Alcoholic; Beverages, Fermented; Wine
1	Animal fats	Butter, Ghee; Cream; Fats, Animals, Raw; Fish, Body Oil; Fish, Liver Oil
2	Animal Products	Aquatic Animals, Others; Aquatic Plants; Bovine Meat; Butter, Ghee; Cephalopods; Cream; Crustaceans; Demersal Fish; Eggs; Fats, Animals, Raw; Fish, Body Oil; Fish, Liver Oil; Freshwater Fish; Marine Fish, Other; Meat, Aquatic Mammals; Meat, Other; Milk - Excluding Butter; Molluscs, Other; Mutton & Goat Meat; Offals, Edible; Pelagic Fish; Pigmeat; Poultry Meat
3	Aquatic Products, Other	Aquatic Animals, Others; Aquatic Plants; Meat, Aquatic Mammals
4	Cereals - Excluding Beer	Barley and products; Cereals, Other; Maize and products; Millet and products; Oats; Rice (Milled Equivalent); Rye and products; Sorghum and products; Wheat and products
5	Eggs	Eggs
6	Fish, Seafood	Cephalopods; Crustaceans; Demersal Fish; Freshwater Fish; Marine Fish, Other; Molluscs, Other; Pelagic Fish
7	Fruits - Excluding Wine	Apples and products; Bananas; Citrus, Other; Dates; Fruits, Other; Grapefruit and products; Grapes and products (excl wine); Lemons, Limes and products; Oranges, Mandarines; Pineapples and products; Plantains
8	Meat	Bovine Meat; Meat, Other; Mutton & Goat Meat; Pigmeat; Poultry Meat
9	Milk - Excluding Butter	Milk - Excluding Butter
10	Miscellaneous	Infant food; Miscellaneous
11	Offals	Offals, Edible
12	Oilcrops	Coconuts - Incl Copra; Cottonseed; Groundnuts (Shelled Eq); Oilcrops, Other; Olives (including preserved); Palm kernels; Rape and Mustardseed; Sesame seed; Soyabeans; Sunflower seed
13	Pulses	Beans; Peas; Pulses, Other and products
14	Spices	Cloves; Pepper; Pimento; Spices, Other
15	Starchy Roots	Cassava and products; Potatoes and products; Roots, Other; Sweet potatoes; Yams
16	Stimulants	Cocoa Beans and products; Coffee and products; Tea (including mate)
17	Sugar & Sweeteners	Honey; Sugar (Raw Equivalent); Sugar non-centrifugal; Sweeteners, Other
18	Sugar Crops	Sugar beet; Sugar cane
19	Treenuts	Nuts and products
20	Vegetable Oils	Coconut Oil; Cottonseed Oil; Groundnut Oil; Maize Germ Oil; Oilcrops Oil, Other; Olive Oil; Palm Oil; Palmkernel Oil; Rape and Mustard Oil; Ricebran Oil; Sesameseed Oil; Soyabean Oil; Sunflowerseed Oil
21	Vegetables	Onions; Tomatoes and products; Vegetables, Other
22	Vegetal Products	Alcohol, Non-Food; Apples and products; Bananas; Barley and products; Beans; Beer; Beverages, Alcoholic; Beverages, Fermented; Cassava and products; Cereals, Other; Citrus, Other; Cloves; Cocoa Beans and products; Coconut Oil; Coconuts - Incl Copra; Coffee and products; Cottonseed; Cottonseed Oil; Dates; Fruits, Other; Grapefruit and products; Grapes and products (excl wine); Groundnut Oil; Gr...