Macro-Regressor


Introduction

The Macro-Regressor is a Data Science project conducting exploratory and predictive analysis on two datasets from Food.com. The datasets consist of recipes, each uniquely identified by its Recipe ID. Hence, the two datasets were merged on the recipe ID to form the Recipe Dataframe we will be exploring.
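The merge can be sketched with pandas. The frames below are toy stand-ins for the actual Food.com CSVs (their file names and exact columns are assumptions here, not the project's real data):

```python
import pandas as pd

# Stand-in frames; in the real project the two Food.com CSVs would be
# loaded with pd.read_csv before merging.
recipes = pd.DataFrame({
    "id": [101, 102],
    "name": ["rolachi", "dirty sriracha bloody mary"],
    "n_steps": [10, 5],
})
interactions = pd.DataFrame({
    "recipe_id": [101, 102],
    "rating": [5, 4],
})

# Merge on the recipe ID to form the Recipe Dataframe.
recipe_df = recipes.merge(interactions, left_on="id", right_on="recipe_id")
```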

Recipe Dataframe

The Recipe Dataframe originally consists of 234429 rows and 15 columns.

| Column | Description |
| --- | --- |
| name | The name of the recipe. |
| id | A unique identifier for the recipe. |
| minutes | The total time required to prepare the recipe. |
| contributor_id | The ID of the user who submitted the recipe. |
| submitted | The date when the recipe was submitted. |
| tags | List of associated tags describing the recipe. |
| nutrition | Nutritional information for the recipe: a list containing [calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)], where PDV is the % daily value. |
| n_steps | The number of steps in the recipe. |
| steps | Instructions detailing the steps to prepare the recipe. |
| description | A brief overview or description of the recipe. |
| ingredients | List of ingredients required for the recipe. |
| n_ingredients | The number of ingredients in the recipe. |
| user_id | The ID of the user interacting with the recipe. |
| date | The date related to a specific event or interaction. |
| rating | The user rating for the recipe. |

This dataframe provides several insights on a topic which unites us all: food! Lots can be drawn from the dataframe; however, there's a pressing matter at hand. According to CBS News, obesity rates in the U.S. exceed 40% as of 2024. Hence, understanding our food's nutritional value is key to building a healthier nation. The focus of this project will be on nutrition.

Data Cleaning and Exploratory Data Analysis

The following steps were taken to clean the Dataframe:

1) Transforming each of the nutritional columns into the appropriate macros, i.e., fat, sugar, sodium, protein, saturated fat, and carbohydrates in grams rather than PDV. This is because the recommended PDV differs from person to person. To do this, note that Food.com uses the following PDVs, as recommended by the FDA:

| Macro | PDV |
| --- | --- |
| fat | 78 |
| sugar | 50 |
| sodium | 2300 |
| protein | 50 |
| saturated fat | 20 |
| carbohydrates | 275 |
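The conversion from PDV back to an absolute amount is a one-liner. A minimal sketch (note that the daily values above are in grams, except sodium, which is in milligrams):

```python
# FDA-recommended daily values used by Food.com (from the table above).
# All in grams except sodium, which is in milligrams.
DAILY_VALUES = {
    "fat": 78, "sugar": 50, "sodium": 2300,
    "protein": 50, "saturated fat": 20, "carbohydrates": 275,
}

def pdv_to_amount(pdv: float, macro: str) -> float:
    """Convert a % daily value into an absolute amount (g, or mg for sodium)."""
    return pdv / 100 * DAILY_VALUES[macro]
```

For example, `pdv_to_amount(50, "sugar")` gives 25.0 grams of sugar.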

2) The calories column was oftentimes severely inaccurate. The following formula was used to recalculate the calories column:
\(\hspace{15em}\text{calories} = \text{fat} \cdot 9 +\text{protein} \cdot 4 + \text{carbohydrates} \cdot 4\)
where the units for all of the macros are grams.
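The formula translates directly into a small helper:

```python
def recalculate_calories(fat_g: float, protein_g: float, carbs_g: float) -> float:
    # 9 kcal per gram of fat; 4 kcal per gram of protein or carbohydrates.
    return fat_g * 9 + protein_g * 4 + carbs_g * 4
```

For example, `recalculate_calories(0, 0.5, 5.5)` returns 24.0, matching the bloody mary row in the sample dataframe.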

3) The ingredients and tags columns were transformed into lists rather than string representations of lists.
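Since the raw columns store Python-style list literals as strings, a common approach (assumed here) is to parse them with `ast.literal_eval`:

```python
import ast

import pandas as pd

# The raw columns store Python-style list literals as strings.
df = pd.DataFrame({
    "ingredients": ["['butter', 'eggs', 'flour']"],
    "tags": ["['easy', '30-minutes-or-less']"],
})
for col in ["ingredients", "tags"]:
    df[col] = df[col].apply(ast.literal_eval)  # safe parse into real lists
```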

4) Finally, no imputation was needed. None of the aforementioned columns had NaN or missing values.

Here’s a sample of the resulting Dataframe

| name | calories | sugar | sodium | protein | carbohydrates | fat | saturated fat | n_ingredients | n_steps | ingredients | tags |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dirty sriracha bloody mary | 24 | 9 | 299 | 0.5 | 5.5 | 0 | 0 | 8 | 5 | ['celery salt', 'fresh lemon juice', 'horseradish', 'olive juice', 'sriracha sauce', 'tomato juice', 'vodka', 'worcestershire sauce'] | ['15-minutes-or-less', '3-steps-or-less', 'beverages', 'cocktails', 'course', 'easy', 'for-1-or-2', 'main-ingredient', 'number-of-servings', 'preparation', 'time-to-make', 'tomatoes', 'vegetables'] |
| rolachi | 451.74 | 31 | 828 | 26 | 22 | 28.86 | 7.6 | 8 | 10 | ['green peppers', 'ground beef', 'onion', 'pepper', 'salad oil', 'salt', 'stewed tomatoes', 'zucchini'] | ['3-steps-or-less', '30-minutes-or-less', 'beef', 'easy', 'ground-beef', 'main-ingredient', 'meat', 'preparation', 'time-to-make'] |
| peanut butter truffle cupcakes | 5060.14 | 538.5 | 2898 | 89 | 374 | 356.46 | 157.8 | 14 | 20 | ['baking cocoa', 'baking soda', 'brewed coffee', 'butter', 'buttermilk', 'creamy peanut butter', 'eggs', 'flour', 'heavy whipping cream', 'salt', 'semisweet chocolate', 'sugar', 'vanilla', 'white baking chocolate'] | ['60-minutes-or-less', 'baking', 'cake-fillings-and-frostings', 'cakes', 'course', 'cupcakes', 'desserts', 'equipment', 'oven', 'preparation', 'time-to-make'] |

Note: While there were more columns (reviews, descriptions, etc.), these were not used for the rest of the analysis and were discarded.

Univariate Analysis

Now let's look at the distributions of some of the features.

Note: 2000 bins were used for the following macro distributions in order to view them in sufficient detail.
Here's the distribution of the calories. It has a unique shape but is still monotone decreasing for the most part. Delving deeper into its components (see formula above) leads to a firmer understanding of this distribution.

Looking at the distribution of protein, we see a distribution which almost resembles an exponential distribution, just with a discrete variable. But then why doesn't the calorie distribution also resemble this shape?

The distribution of the fat macro in grams plays a big part. As seen above, it's more common for recipes to have more fat. Additionally, this distribution isn't monotone decreasing, and certain amounts of fat seem to be more common than others.

The distributions of the other numerical columns resembled a normal distribution. Interestingly enough, while the variance in the number of steps is slightly greater, both share roughly the same mean. Here's the distribution of the number of ingredients.

Here’s the distribution of the number of steps.

Bivariate Analysis

Now it’s time to see how some of these features may be correlated.
Perhaps the most obvious correlations which come to mind are those related to specific macros and calories. As you observe the following correlations, you’re encouraged to view the \(R^2\) and \(R\) score to get an idea of which macros are most closely tied to the overall calories.

The trends for sugar, sodium, and saturated fat were similar: all of them had a strong, positive, linear relationship with the calories macro, with few outliers.

Interestingly, as the number of ingredients increases, the amount of time taken to make the recipes (in minutes) tends to decrease, though the correlation is not very strong. This may not be as nutritionally relevant, but it is interesting nonetheless. As expected, however, we see the opposite trend in minutes vs. the number of steps, as recipes requiring more steps tend to take longer as well.

Interesting Aggregates

To understand the relevance of the aggregates, you must be familiar with some details of the ingredients and tags features of the Recipe Dataframe.

Here are a few of the most popular tags and the proportion of recipes which had them.

Similarly, here are some of the most popular ingredients!

Feel free to refer to these as you go through the rest of the analysis and the creation of the model. They provide key insights on nutrition.

Recipe Composition and Ingredients

Here's the recipe composition of recipes with and without the butter ingredient. Observe how the macros are impacted.

| Butter Ingredient | calories | carbohydrates | protein | fat | sugar |
| --- | --- | --- | --- | --- | --- |
| False | 41.7 | 22 | 10 | 13.26 | 10 |
| True | 49.1 | 27.5 | 7.5 | 19.5 | 13.5 |

More interestingly, let's visualize this:

Here’s the impact of adding eggs to the recipes.

| Eggs Ingredient | calories | carbohydrates | protein | fat | sugar |
| --- | --- | --- | --- | --- | --- |
| False | 42.9 | 22 | 9 | 14.82 | 10.5 |
| True | 49.1 | 27.5 | 8.5 | 19.5 | 20.5 |

Recipe Composition and Tags

A similar trend can be observed with tags. Here are the recipes with and without the meat tag.

| Meat Tag | calories | carbohydrates | protein | fat | sugar |
| --- | --- | --- | --- | --- | --- |
| False | 41.4 | 24.75 | 6 | 12.48 | 12 |
| True | 53 | 19.25 | 29 | 25.74 | 8.5 |

Having the opposite effect, here are the recipes with and without the vegetables tag.

| Vegetable Tag | calories | carbohydrates | protein | fat | sugar |
| --- | --- | --- | --- | --- | --- |
| False | 44.7 | 24.75 | 9.5 | 15.6 | 12.5 |
| True | 41.4 | 19.25 | 8 | 14.04 | 9 |
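The aggregates above can be produced with a groupby over an ingredient- or tag-presence flag. A minimal sketch on toy data (median is used here as an assumption; the project's tables may use a different statistic):

```python
import pandas as pd

# Toy stand-in for the cleaned Recipe Dataframe.
df = pd.DataFrame({
    "calories": [300.0, 150.0, 500.0],
    "protein": [10.0, 4.0, 6.0],
    "ingredients": [["butter", "flour"], ["tomato"], ["butter", "sugar"]],
})
df["has_butter"] = df["ingredients"].apply(lambda ings: "butter" in ings)
# Aggregate each macro by ingredient presence.
composition = df.groupby("has_butter")[["calories", "protein"]].median()
```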

Framing a Prediction Problem

These recipes seem to provide in-depth nutritional information. However, what if this weren't the case?
One may want to know the nutritional information of their recipes, but not have adequate information (calories, protein, fat, carbohydrates, sodium, sugar, and saturated fat). In a realistic setting, we'd only know a few things: how long the recipe takes, the number of ingredients associated with the recipe, what the ingredients are, and finally any tags you would associate with the recipe (the vibe of the recipe, if you will).

Hence, we will create a regressor for each of the aforementioned macros.
Here are a few specifications for the regressor, in addition to why we chose them.

  • We will use a linear regression model for the regressor. Due to its parametric nature, we will later be able to interpret the importance of certain features in the model. This can also provide key insights on certain ingredients, tags, and trends based on the number of steps, etc.
  • We will regularize the model; specifically, we will be using Lasso regression. The reason for Lasso is, again, interpretability. Lasso, standing for Least Absolute Shrinkage and Selection Operator, means that our trained model will only retain the parameters which actually impact our predictions.
  • Lastly, since we are predicting each macro, we will be creating a class whose API consists of being able to select which macro you'd like predicted.
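The specifications above could be sketched roughly as follows. This is a hypothetical outline, not the project's actual class; the names `MacroRegressor`, `fit`, and `predict` are illustrative, and the default `alpha` is borrowed from the optimal value reported later:

```python
from sklearn.linear_model import Lasso

class MacroRegressor:
    """Hypothetical sketch: one Lasso model per macro, selectable by name."""

    MACROS = ["calories", "fat", "sugar", "sodium",
              "protein", "saturated fat", "carbohydrates"]

    def __init__(self, alpha: float = 0.0625):
        self.models = {m: Lasso(alpha=alpha) for m in self.MACROS}

    def fit(self, X, y_by_macro):
        # y_by_macro maps each macro name to its target vector.
        for macro, model in self.models.items():
            model.fit(X, y_by_macro[macro])
        return self

    def predict(self, macro: str, X):
        # Select which macro you'd like predicted.
        return self.models[macro].predict(X)
```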

Baseline Model

In the baseline model, we will test this theory. It will be a simpler model: one aimed solely at predicting protein based on the presence of the meat tag, the number of steps, the number of ingredients, and eggs (nominal, and the only categorical feature). Later, we will add additional features and transform the numerical features to better match the model. The model achieved the following errors:

| | MSE |
| --- | --- |
| Train | 441.738 |
| Test | 604.551 |
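The baseline setup can be sketched as below, with synthetic stand-in data in place of the real features (the feature construction and coefficients here are illustrative, not the project's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real features come from the cleaned dataframe.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(0, 2, n),    # meat tag (binary)
    rng.integers(1, 20, n),   # n_steps
    rng.integers(1, 15, n),   # n_ingredients
    rng.integers(0, 2, n),    # eggs ingredient (binary)
])
y = 20 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 3, n)  # toy protein target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
baseline = LinearRegression().fit(X_tr, y_tr)
test_mse = mean_squared_error(y_te, baseline.predict(X_te))
```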

While this model did attain a decently low MSE, there are a few problems with it. Firstly, it's only trained for protein. Encoding the ingredients must be done one at a time; in this case, the eggs ingredient was binary encoded, i.e., each recipe either contained eggs somewhere in its ingredient list or didn't. This ingredient alone wouldn't be effective in predicting the other macros. Additionally, ingredients and tags whose proportions are too small will not be present enough for the model to learn from the recipes which contain them. Hence, it's important to choose the right ingredients. This is also why using Lasso is helpful: due to its shrinkage property, the necessary ingredients and tags will be filtered for each of the macros. Secondly, the n_ingredients and n_steps features don't have a high \(R^2\) value with protein. This means it's highly likely that their relation to the protein macro isn't linear. Hence, we can use sklearn's PolynomialFeatures to determine the right relationship between these numerical features and the macros.

Final Model

In the final model, we will make use of sklearn's GridSearchCV class. This way, we will cross-validate while training to reduce overfitting, allowing our training MSE to more closely match our testing MSE. Additionally, we'll experiment with different hyperparameters, such as the punishment coefficient for Lasso and the polynomial degree for each of the numerical features. Aside from that, we'll be adding the top \(20\) most popular tags and ingredients. After the top \(20\), their proportions get too small to adequately train the Lasso model and would simply introduce noise. Additionally, while this approach may not guarantee that solely the best features are being used to train each model, different features may be useful for different macros. One of the goals of this project is interpretability: after creating the model, we are able to interpret the parameters to determine which features were of service to which macros.
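One way this search could be wired up is sketched below. The pipeline layout and step names are assumptions (here, columns 0 and 1 stand for n_steps and n_ingredients, with the binary tag/ingredient columns passed through); the search ranges mirror those used in the project (degrees 1 to 4, Lasso punishment from \(2^{-5}\) to \(2^{4}\)):

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical layout: columns 0 and 1 are n_steps and n_ingredients;
# the remaining binary tag/ingredient columns pass through untouched.
pipe = Pipeline([
    ("poly", ColumnTransformer([
        ("n_steps", PolynomialFeatures(include_bias=False), [0]),
        ("n_ingredients", PolynomialFeatures(include_bias=False), [1]),
    ], remainder="passthrough")),
    ("lasso", Lasso()),
])

param_grid = {
    "poly__n_steps__degree": [1, 2, 3, 4],
    "poly__n_ingredients__degree": [1, 2, 3, 4],
    "lasso__alpha": [2.0 ** p for p in range(-5, 5)],  # 2^-5 ... 2^4
}
search = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error", cv=5)
```

Calling `search.fit(X, y)` for each macro's target then yields the cross-validated best degree and alpha.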

Results

The following are the MSEs on the test sets.

| Macro | Test MSE |
| --- | --- |
| calories | 418557 |
| fat | 2144.63 |
| sugar | 13047.8 |
| sodium | 1.4756700.1 |
| protein | 566.39 |
| saturated fat | 204.198 |
| carbohydrates | 5490.71 |

As can be seen, there's a significant decrease in the MSE for protein compared to the baseline model. The training MSEs are also lower, with the protein training MSE being \(405.1\).

The following are the optimal hyperparameters for each of the models.

| | calories | fat | sugar | sodium | protein | saturated fat | carbohydrates |
| --- | --- | --- | --- | --- | --- | --- | --- |
| N_Steps Poly Degree | 2 | 2 | 2 | 3 | 1 | 2 | 2 |
| N_Ingredients Poly Degree | 2 | 2 | 3 | 2 | 1 | 2 | 2 |
| Lasso Punishment | 0.0625 | 0.0625 | 0.0625 | 0.5 | 0.0625 | 0.0625 | 0.0625 |

As can be seen, the polynomial degrees for most of the models are greater than 1, suggesting a non-linear relationship between said macros and the numerical features. The optimal punishment coefficient tends to be \(0.0625\) for most of the macros. Note: the polynomial degree ranged from \(1\) to \(4\), and the Lasso punishment coefficient from \(2^{-5}\) to \(2^{4}\).

Interpreting the Parameters

Now, we are able to interpret the impact of each feature in the model. The following table shows the meat tag coefficient for each macro. As expected, it's positive for protein and negative for carbohydrates. Take a look at the Interesting Aggregates section to see why.

| Macro | meat_tag coefficient |
| --- | --- |
| calories | 115.549 |
| fat | 9.74835 |
| sugar | -1.04353 |
| sodium | 251.082 |
| protein | 11.1625 |
| saturated fat | 2.35439 |
| carbohydrates | -4.08609 |

Now, if we were to specialize towards predicting the values of a certain macro, we could do so by selecting only the columns which provide any significant value. In the following, the all_coefficients dataframe contains the coefficients for each of the macros.
```python
all_coefficients[(all_coefficients['protein'] > 1) | (all_coefficients['protein'] < -1)]['protein']
```
This singles out any important ingredient or tag! We get the following series:

| Feature | Protein (grams) |
| --- | --- |
| 15-minutes-or-less_tag | -3.74627 |
| 30-minutes-or-less_tag | -2.27869 |
| 60-minutes-or-less_tag | -2.51325 |
| baking powder_ing | -1.00415 |
| baking soda_ing | -1.69974 |
| course_tag | -5.9061 |
| low-carb_tag | 4.09764 |
| main-dish_tag | 10.6352 |
| meat_tag | 11.1625 |
| n_ingredients | 1.61722 |
| n_steps | 1.09601 |
| sugar_ing | -2.62982 |
| time-to-make_tag | -2.06129 |
| vegetables_tag | -5.13664 |

Hence, future macro regressors can focus solely on one macro at a time and aim to reduce training time by using only the features which are of use. Additional research can be done by interpreting the parameters.

Thank you!