|
1 | 1 | { |
2 | 2 | "cells": [ |
3 | | - { |
4 | | - "cell_type": "markdown", |
5 | | - "metadata": { |
6 | | - "colab_type": "text", |
7 | | - "id": "view-in-github" |
8 | | - }, |
9 | | - "source": [ |
10 | | - "<a href=\"https://colab.research.google.com/github/datacommonsorg/api-python/blob/master/notebooks/analyzing_obesity_prevalence.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" |
11 | | - ] |
12 | | - }, |
13 | | - { |
14 | | - "cell_type": "markdown", |
15 | | - "metadata": { |
16 | | - "id": "srAnaUPPbrH6" |
17 | | - }, |
18 | | - "source": [ |
19 | | - "Copyright 2025 Google LLC.\n", |
20 | | - "SPDX-License-Identifier: Apache-2.0\n", |
21 | | - "\n", |
22 | | - "**Notebook Version** - 2.0.0" |
23 | | - ] |
24 | | - }, |
25 | 3 | { |
26 | 4 | "cell_type": "markdown", |
27 | 5 | "metadata": { |
|
32 | 10 | "\n", |
33 | 11 | "**Objective:** This notebook demonstrates how to use Data Commons to build a linear regression model predicting the prevalence of obesity in US counties.\n", |
34 | 12 | "\n", |
35 | | - "**Background:** Obesity prevalence is known to correlate with various health and socio-economic factors [[1]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198075/)[[2]](https://www.ncbi.nlm.nih.gov/pubmed/26562758). Data for these factors often reside in separate datasets from different government agencies.\n", |
| 13 | + "**Background:** Obesity prevalence is known to correlate with various health and socio-economic factors [[1]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198075/)[[2]](https://www.ncbi.nlm.nih.gov/pubmed/26562758). Data for these factors often reside in separate datasets from different government agencies:\n", |
36 | 14 | "* The Centers for Disease Control and Prevention (CDC) provides health condition prevalence data (e.g., obesity, high blood pressure).\n",
37 | 15 | "* The US Bureau of Labor Statistics (BLS) provides unemployment rates.\n", |
38 | 16 | "* The US Census Bureau provides poverty rates and population counts.\n", |
|
51 | 29 | "*Note:* The US Census also provides unemployment statistics. Using BLS data here is for demonstration purposes. Comparing results using Census unemployment data could be a potential extension." |
52 | 30 | ] |
53 | 31 | }, |
| 32 | + { |
| 33 | + "cell_type": "markdown", |
| 34 | + "metadata": { |
| 35 | + "colab_type": "text", |
| 36 | + "id": "view-in-github" |
| 37 | + }, |
| 38 | + "source": [ |
| 39 | + "<a href=\"https://colab.research.google.com/github/datacommonsorg/api-python/blob/master/notebooks/analyzing_obesity_prevalence.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" |
| 40 | + ] |
| 41 | + }, |
| 42 | + { |
| 43 | + "cell_type": "markdown", |
| 44 | + "metadata": { |
| 45 | + "id": "srAnaUPPbrH6" |
| 46 | + }, |
| 47 | + "source": [ |
| 48 | + "Copyright 2025 Google LLC.\n", |
| 49 | + "SPDX-License-Identifier: Apache-2.0\n", |
| 50 | + "\n", |
| 51 | + "**Notebook Version** - 2.0.0" |
| 52 | + ] |
| 53 | + }, |
54 | 54 | { |
55 | 55 | "cell_type": "markdown", |
56 | 56 | "metadata": { |
57 | 57 | "id": "7SnIECsk7Csw" |
58 | 58 | }, |
59 | 59 | "source": [ |
60 | | - "# 1. Setup Environment\n" |
| 60 | + "## 1. Set up environment\n" |
61 | 61 | ] |
62 | 62 | }, |
63 | 63 | { |
|
66 | 66 | "id": "pysygfoq43NF" |
67 | 67 | }, |
68 | 68 | "source": [ |
69 | | - "### 1.1 Install Libraries\n", |
| 69 | + "### 1.1. Install libraries\n", |
70 | 70 | "\n", |
71 | 71 | "Install the [datacommons-client](https://pypi.org/project/datacommons-client/) library." |
72 | 72 | ] |
|
92 | 92 | } |
93 | 93 | ], |
94 | 94 | "source": [ |
95 | | - "!pip install datacommons-client --upgrade --quiet" |
| 95 | + "!pip install \"datacommons-client[Pandas]\" --upgrade --quiet" |
96 | 96 | ] |
97 | 97 | }, |
98 | 98 | { |
|
101 | 101 | "id": "BtLVyFoN5AiI" |
102 | 102 | }, |
103 | 103 | "source": [ |
104 | | - "### 1.2 Import Dependencies\n", |
| 104 | + "### 1.2. Import dependencies\n", |
105 | 105 | "\n", |
106 | 106 | "Import required libraries for data manipulation, modeling, and plotting.\n" |
107 | 107 | ] |
|
134 | 134 | "id": "ZXzO6qSc5Xk0" |
135 | 135 | }, |
136 | 136 | "source": [ |
137 | | - "### 1.3 Initialize Data Commons Client\n", |
| 137 | + "### 1.3. Initialize Data Commons client\n", |
138 | 138 | "\n", |
139 | 139 | "Initialize the client using your Data Commons API key. Obtain a key from [apikeys.datacommons.org](https://apikeys.datacommons.org/) if you don't have one.\n" |
140 | 140 | ] |
|
158 | 158 | "id": "Ccy9-czCfVTn" |
159 | 159 | }, |
160 | 160 | "source": [ |
161 | | - "## 2. Data Acquisition\n", |
| 161 | + "## 2. Data acquisition\n", |
162 | 162 | "\n", |
163 | 163 | "Fetch statistical observations for the specified variables for all US counties for the year 2021 using the [Python Data Commons API](https://docs.datacommons.org/api/python/v2/)." |
164 | 164 | ] |
|
567 | 567 | "id": "z191ImVmrdds" |
568 | 568 | }, |
569 | 569 | "source": [ |
570 | | - "## 3. Data Preparation\n", |
| 570 | + "## 3. Data preparation\n", |
571 | 571 | "\n", |
572 | 572 | "Process the fetched data for modeling:\n", |
573 | 573 | "\n", |
574 | 574 | "1. **Filter:** Keep only relevant observations based on their `measurementMethod`. For CDC data, this is typically `AgeAdjustedPrevalence`. For Census, `CensusACS5YearSurvey`, and for BLS, `BLSSeasonallyUnadjusted`.\n", |
575 | | - "1. **Select Columns:** Keep only essential columns: `entity`, `entity_name`, `variable`, `value`.\n", |
| 575 | + "1. **Select columns:** Keep only essential columns: `entity`, `entity_name`, `variable`, `value`.\n", |
576 | 576 | "1. **Pivot:** Reshape the dataframe so each variable becomes a column, indexed by county `entity` and `entity_name`.\n", |
577 | | - "1. **Calculate Poverty Rate:** Compute the poverty rate percentage using the population count and the count of people below the poverty level.\n", |
578 | | - "1. **Handle Missing Values:** Drop rows (counties) with any missing values for the selected variables.\n" |
| 577 | + "1. **Calculate poverty rate:** Compute the poverty rate percentage using the population count and the count of people below the poverty level.\n", |
| 578 | + "1. **Handle missing values:** Drop rows (counties) with any missing values for the selected variables.\n" |
579 | 579 | ] |
580 | 580 | }, |
581 | 581 | { |
|
975 | 975 | "id": "-ZGRFaJKdHIO" |
976 | 976 | }, |
977 | 977 | "source": [ |
978 | | - "## 4. Exploratory Data Analysis\n", |
| 978 | + "## 4. Exploratory data analysis\n", |
979 | 979 | "\n", |
980 | 980 | "Visualize the relationships between the target variable (Obesity Prevalence) and the predictor variables (High Blood Pressure Prevalence, Unemployment Rate, Poverty Rate) using scatter plots. This helps assess potential correlations.\n" |
981 | 981 | ] |
|
1102 | 1102 | "id": "Bp52dWJNfYSa" |
1103 | 1103 | }, |
1104 | 1104 | "source": [ |
1105 | | - "## 5. Model Training\n", |
| 1105 | + "## 5. Model training\n", |
1106 | 1106 | "\n", |
1107 | 1107 | "Train a linear regression model to predict obesity prevalence based on the selected predictors.\n", |
1108 | 1108 | "\n", |
|
1111 | 1111 | "$$f_\\theta(x) = \\theta_0 + \\theta_1 (\\text{high blood pressure}) + \\theta_2 (\\text{unemployment}) + \\theta_3(\\text{poverty rate})$$\n", |
1112 | 1112 | "<br>\n", |
1113 | 1113 | "\n", |
1114 | | - "### 5.1 Prepare Features and Target Variable\n", |
| 1114 | + "### 5.1. Prepare features and target variable\n", |
1115 | 1115 | "Define the feature matrix `X` (predictors) and the target vector `Y` (obesity prevalence).\n", |
1116 | 1116 | "\n", |
1117 | 1117 | "Let's start by creating our training and test sets. We'll then train a linear regression model using scikit-learn's [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)."
|
1137 | 1137 | "id": "rmidaLTx_6C9" |
1138 | 1138 | }, |
1139 | 1139 | "source": [ |
1140 | | - "### 5.2 Split Data\n", |
| 1140 | + "### 5.2. Split data\n", |
1141 | 1141 | "\n", |
1142 | | - "Split the data into training and testing sets (80% train, 20% test).\n", |
1143 | | - "\n" |
| 1142 | + "Split the data into training and testing sets (80% train, 20% test)." |
1144 | 1143 | ] |
1145 | 1144 | }, |
1146 | 1145 | { |
|
1176 | 1175 | "id": "hu2t8OAGAGFp" |
1177 | 1176 | }, |
1178 | 1177 | "source": [ |
1179 | | - "### 5.3 Train Linear Regression Model\n", |
| 1178 | + "### 5.3. Train linear regression model\n", |
1180 | 1179 | "\n", |
1181 | 1180 | "Instantiate and train the [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model using the training data.\n", |
1182 | 1181 | "\n" |
|
1217 | 1216 | "id": "dBmThySxaXKp" |
1218 | 1217 | }, |
1219 | 1218 | "source": [ |
1220 | | - "## 6. Model Evaluation\n", |
| 1219 | + "## 6. Model evaluation\n", |
1221 | 1220 | "\n", |
1222 | 1221 | "Assess the performance of the trained model using the Mean Squared Error (MSE) metric and residual analysis.\n", |
1223 | 1222 | "\n", |
1224 | | - "\n", |
1225 | | - "### 6.1 Calculate Mean Squared Error (MSE)\n", |
| 1223 | + "### 6.1. Calculate Mean Squared Error (MSE)\n", |
1226 | 1224 | "\n", |
1227 | 1225 | "Define a function for MSE and calculate it for both the training and test sets. Lower MSE indicates better fit.\n", |
1228 | 1226 | "\n" |
|
1271 | 1269 | "id": "VsGLliuzawPE" |
1272 | 1270 | }, |
1273 | 1271 | "source": [ |
1274 | | - "### 6.2 Analyze Residuals\n", |
| 1272 | + "### 6.2. Analyze residuals\n", |
1275 | 1273 | "\n", |
1276 | 1274 | "Calculate and plot the residuals (difference between predicted and actual values) for the test set. Residuals ideally should be randomly scattered around zero." |
1277 | 1275 | ] |
|
1317 | 1315 | "id": "8VE-arrmbLNL" |
1318 | 1316 | }, |
1319 | 1317 | "source": [ |
1320 | | - "*Evaluation Summary:* The model achieves a test MSE of approximately 10%. The residual plots provide insights into the model's error distribution.\n", |
1321 | | - "\n", |
1322 | | - "\n", |
| 1318 | + "*Evaluation summary:* The model achieves a test MSE of approximately 10 percentage points. The residual plots provide insight into the model's error distribution.\n",
1323 | 1319 | "\n", |
1324 | 1320 | "How well does your model perform? We achieved a test-set MSE of approximately 10 percentage points from the observed obesity prevalence. The model also fit the data with residuals clustered between -20% and 30%, which isn't bad for a simple model with only three explanatory variables."
1325 | 1321 | ] |
|
1330 | 1326 | "id": "qapl33x8fy_A" |
1331 | 1327 | }, |
1332 | 1328 | "source": [ |
1333 | | - "## 7. Conclusion and Next Steps\n", |
| 1329 | + "## 7. Conclusion and next steps\n", |
1334 | 1330 | "This notebook demonstrated the use of Data Commons to efficiently acquire data from multiple sources (CDC, BLS, Census) and build a simple linear regression model to predict obesity prevalence in US counties. Data Commons significantly streamlines the data gathering and integration process.\n", |
1335 | 1331 | "\n", |
1336 | 1332 | "The resulting model, using high blood pressure prevalence, unemployment rate, and poverty rate, provides a baseline prediction.\n", |
1337 | 1333 | "\n", |
1338 | | - "**Potential Improvements & Further Exploration:**\n", |
| 1334 | + "**Potential improvements & further exploration:**\n", |
1339 | 1335 | "\n", |
1340 | 1336 | "* **Add more variables:** Incorporate other variables known or hypothesized to correlate with obesity, such as:\n",
1341 | 1337 | " * `Percent_Person_WithHighCholesterol`\n", |
1342 | 1338 | " * `Percent_Person_WithDiabetes`\n", |
1343 | 1339 | " * Educational attainment levels\n", |
1344 | 1340 | " * Access to healthy food outlets\n", |
1345 | 1341 | " * Physical inactivity rates\n", |
1346 | | - "* **Feature Engineering:** Create new features from existing ones.\n", |
1347 | | - "* **Model Selection:** Experiment with different regression models (e.g., Ridge, Lasso, tree-based models).\n", |
1348 | | - "* **Geographic Analysis:** Explore spatial patterns in obesity prevalence and model errors.\n", |
1349 | | - "* **Alternative Data Sources:** Compare model performance using Census unemployment data instead of BLS data.\n", |
| 1342 | + "* **Feature engineering:** Create new features from existing ones.\n", |
| 1343 | + "* **Model selection:** Experiment with different regression models (e.g., Ridge, Lasso, tree-based models).\n", |
| 1344 | + "* **Geographic analysis:** Explore spatial patterns in obesity prevalence and model errors.\n", |
| 1345 | + "* **Alternative data sources:** Compare model performance using Census unemployment data instead of BLS data.\n", |
1350 | 1346 | "Data Commons provides access to a wide range of variables, enabling exploration of correlations with factors like university counts, crime rates (e.g., arson), or environmental factors (e.g., snowfall), potentially leading to more comprehensive models." |
1351 | 1347 | ] |
1352 | 1348 | } |
|