|
1 | 1 | { |
2 | 2 | "cells": [ |
3 | | - { |
4 | | - "cell_type": "markdown", |
5 | | - "metadata": { |
6 | | - "colab_type": "text", |
7 | | - "id": "view-in-github" |
8 | | - }, |
9 | | - "source": [ |
10 | | - "<a href=\"https://colab.research.google.com/github/datacommonsorg/api-python/blob/master/notebooks/analyzing_obesity_prevalence.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" |
11 | | - ] |
12 | | - }, |
13 | | - { |
14 | | - "cell_type": "markdown", |
15 | | - "metadata": { |
16 | | - "id": "srAnaUPPbrH6" |
17 | | - }, |
18 | | - "source": [ |
19 | | - "Copyright 2025 Google LLC.\n", |
20 | | - "SPDX-License-Identifier: Apache-2.0\n", |
21 | | - "\n", |
22 | | - "**Notebook Version** - 2.0.0" |
23 | | - ] |
24 | | - }, |
25 | 3 | { |
26 | 4 | "cell_type": "markdown", |
27 | 5 | "metadata": { |
|
32 | 10 | "\n", |
33 | 11 | "**Objective:** This notebook demonstrates how to use Data Commons to build a linear regression model predicting the prevalence of obesity in US counties.\n", |
34 | 12 | "\n", |
35 | | - "**Background:** Obesity prevalence is known to correlate with various health and socio-economic factors [[1]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198075/)[[2]](https://www.ncbi.nlm.nih.gov/pubmed/26562758). Data for these factors often reside in separate datasets from different government agencies.\n", |
| 13 | + "**Background:** Obesity prevalence is known to correlate with various health and socio-economic factors [[1]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198075/)[[2]](https://www.ncbi.nlm.nih.gov/pubmed/26562758). Data for these factors often reside in separate datasets from different government agencies:\n", |
36 | 14 | "* The Centers for Disease Control and Prevention (CDC) provides health condition prevalence data (e.g., obesity, high blood pressure).\n",
37 | 15 | "* The US Bureau of Labor Statistics (BLS) provides unemployment rates.\n", |
38 | 16 | "* The US Census Bureau provides poverty rates and population counts.\n", |
|
51 | 29 | "*Note:* The US Census also provides unemployment statistics. Using BLS data here is for demonstration purposes. Comparing results using Census unemployment data could be a potential extension." |
52 | 30 | ] |
53 | 31 | }, |
| 32 | + { |
| 33 | + "cell_type": "markdown", |
| 34 | + "metadata": { |
| 35 | + "colab_type": "text", |
| 36 | + "id": "view-in-github" |
| 37 | + }, |
| 38 | + "source": [ |
| 39 | + "<a href=\"https://colab.research.google.com/github/datacommonsorg/api-python/blob/master/notebooks/analyzing_obesity_prevalence.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" |
| 40 | + ] |
| 41 | + }, |
| 42 | + { |
| 43 | + "cell_type": "markdown", |
| 44 | + "metadata": { |
| 45 | + "id": "srAnaUPPbrH6" |
| 46 | + }, |
| 47 | + "source": [ |
| 48 | + "Copyright 2025 Google LLC.\n", |
| 49 | + "SPDX-License-Identifier: Apache-2.0\n", |
| 50 | + "\n", |
| 51 | + "**Notebook Version** - 2.0.0" |
| 52 | + ] |
| 53 | + }, |
54 | 54 | { |
55 | 55 | "cell_type": "markdown", |
56 | 56 | "metadata": { |
57 | 57 | "id": "7SnIECsk7Csw" |
58 | 58 | }, |
59 | 59 | "source": [ |
60 | | - "# 1. Setup Environment\n" |
| 60 | + "## 1. Set up environment\n" |
61 | 61 | ] |
62 | 62 | }, |
63 | 63 | { |
|
66 | 66 | "id": "pysygfoq43NF" |
67 | 67 | }, |
68 | 68 | "source": [ |
69 | | - "### 1.1 Install Libraries\n", |
| 69 | + "### 1.1. Install libraries\n", |
70 | 70 | "\n", |
71 | 71 | "Install the [datacommons-client](https://pypi.org/project/datacommons-client/) library." |
72 | 72 | ] |
|
92 | 92 | } |
93 | 93 | ], |
94 | 94 | "source": [ |
95 | | - "!pip install datacommons-client --upgrade --quiet" |
| 95 | + "!pip install \"datacommons-client[Pandas]\" --upgrade --quiet" |
96 | 96 | ] |
97 | 97 | }, |
98 | 98 | { |
|
101 | 101 | "id": "BtLVyFoN5AiI" |
102 | 102 | }, |
103 | 103 | "source": [ |
104 | | - "### 1.2 Import Dependencies\n", |
| 104 | + "### 1.2. Import dependencies\n", |
105 | 105 | "\n", |
106 | 106 | "Import required libraries for data manipulation, modeling, and plotting.\n" |
107 | 107 | ] |
|
134 | 134 | "id": "ZXzO6qSc5Xk0" |
135 | 135 | }, |
136 | 136 | "source": [ |
137 | | - "### 1.3 Initialize Data Commons Client\n", |
| 137 | + "### 1.3. Initialize Data Commons client\n", |
138 | 138 | "\n", |
139 | 139 | "Initialize the client using your Data Commons API key. Obtain a key from [apikeys.datacommons.org](https://apikeys.datacommons.org/) if you don't have one.\n" |
140 | 140 | ] |
|
158 | 158 | "id": "Ccy9-czCfVTn" |
159 | 159 | }, |
160 | 160 | "source": [ |
161 | | - "## 2. Data Acquisition\n", |
| 161 | + "## 2. Data acquisition\n", |
162 | 162 | "\n", |
163 | 163 | "Fetch statistical observations for the specified variables for all US counties for the year 2021 using the [Python Data Commons API](https://docs.datacommons.org/api/python/v2/)." |
164 | 164 | ] |
|
567 | 567 | "id": "z191ImVmrdds" |
568 | 568 | }, |
569 | 569 | "source": [ |
570 | | - "## 3. Data Preparation\n", |
| 570 | + "## 3. Data preparation\n", |
571 | 571 | "\n", |
572 | 572 | "Process the fetched data for modeling:\n", |
573 | 573 | "\n", |
574 | 574 | "1. **Filter:** Keep only relevant observations based on their `measurementMethod`. For CDC data, this is typically `AgeAdjustedPrevalence`. For Census, `CensusACS5YearSurvey`, and for BLS, `BLSSeasonallyUnadjusted`.\n", |
575 | | - "1. **Select Columns:** Keep only essential columns: `entity`, `entity_name`, `variable`, `value`.\n", |
| 575 | + "1. **Select columns:** Keep only essential columns: `entity`, `entity_name`, `variable`, `value`.\n", |
576 | 576 | "1. **Pivot:** Reshape the dataframe so each variable becomes a column, indexed by county `entity` and `entity_name`.\n", |
577 | | - "1. **Calculate Poverty Rate:** Compute the poverty rate percentage using the population count and the count of people below the poverty level.\n", |
578 | | - "1. **Handle Missing Values:** Drop rows (counties) with any missing values for the selected variables.\n" |
| 577 | + "1. **Calculate poverty rate:** Compute the poverty rate percentage using the population count and the count of people below the poverty level.\n", |
| 578 | + "1. **Handle missing values:** Drop rows (counties) with any missing values for the selected variables.\n" |
579 | 579 | ] |
580 | 580 | }, |
581 | 581 | { |
|
975 | 975 | "id": "-ZGRFaJKdHIO" |
976 | 976 | }, |
977 | 977 | "source": [ |
978 | | - "## 4. Exploratory Data Analysis\n", |
| 978 | + "## 4. Exploratory data analysis\n", |
979 | 979 | "\n", |
980 | 980 | "Visualize the relationships between the target variable (Obesity Prevalence) and the predictor variables (High Blood Pressure Prevalence, Unemployment Rate, Poverty Rate) using scatter plots. This helps assess potential correlations.\n" |
981 | 981 | ] |
|
1102 | 1102 | "id": "Bp52dWJNfYSa" |
1103 | 1103 | }, |
1104 | 1104 | "source": [ |
1105 | | - "## 5. Model Training\n", |
| 1105 | + "## 5. Model training\n", |
1106 | 1106 | "\n", |
1107 | 1107 | "Train a linear regression model to predict obesity prevalence based on the selected predictors.\n", |
1108 | 1108 | "\n", |
|
1111 | 1111 | "$$f_\\theta(x) = \\theta_0 + \\theta_1 (\\text{high blood pressure}) + \\theta_2 (\\text{unemployment}) + \\theta_3(\\text{poverty rate})$$\n", |
1112 | 1112 | "<br>\n", |
1113 | 1113 | "\n", |
1114 | | - "### 5.1 Prepare Features and Target Variable\n", |
| 1114 | + "### 5.1. Prepare features and target variable\n", |
1115 | 1115 | "Define the feature matrix `X` (predictors) and the target vector `Y` (obesity prevalence).\n", |
1116 | 1116 | "\n", |
1117 | 1117 | "Let's start by creating our training and test sets. We'll then train a linear regression model using scikit-learn's [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)."
|
1137 | 1137 | "id": "rmidaLTx_6C9" |
1138 | 1138 | }, |
1139 | 1139 | "source": [ |
1140 | | - "### 5.2 Split Data\n", |
| 1140 | + "### 5.2. Split data\n", |
1141 | 1141 | "\n", |
1142 | | - "Split the data into training and testing sets (80% train, 20% test).\n", |
1143 | | - "\n" |
| 1142 | + "Split the data into training and testing sets (80% train, 20% test)." |
1144 | 1143 | ] |
1145 | 1144 | }, |
1146 | 1145 | { |
|
1176 | 1175 | "id": "hu2t8OAGAGFp" |
1177 | 1176 | }, |
1178 | 1177 | "source": [ |
1179 | | - "### 5.3 Train Linear Regression Model\n", |
| 1178 | + "### 5.3. Train linear regression model\n", |
1180 | 1179 | "\n", |
1181 | 1180 | "Instantiate and train the [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model using the training data.\n", |
1182 | 1181 | "\n" |
|
1217 | 1216 | "id": "dBmThySxaXKp" |
1218 | 1217 | }, |
1219 | 1218 | "source": [ |
1220 | | - "## 6. Model Evaluation\n", |
| 1219 | + "## 6. Model evaluation\n", |
1221 | 1220 | "\n", |
1222 | 1221 | "Assess the performance of the trained model using the Mean Squared Error (MSE) metric and residual analysis.\n", |
1223 | 1222 | "\n", |
1224 | | - "\n", |
1225 | | - "### 6.1 Calculate Mean Squared Error (MSE)\n", |
| 1223 | + "### 6.1. Calculate Mean Squared Error (MSE)\n", |
1226 | 1224 | "\n", |
1227 | 1225 | "Define a function for MSE and calculate it for both the training and test sets. Lower MSE indicates better fit.\n", |
1228 | 1226 | "\n" |
|
1271 | 1269 | "id": "VsGLliuzawPE" |
1272 | 1270 | }, |
1273 | 1271 | "source": [ |
1274 | | - "### 6.2 Analyze Residuals\n", |
| 1272 | + "### 6.2. Analyze residuals\n", |
1275 | 1273 | "\n", |
1276 | 1274 | "Calculate and plot the residuals (difference between predicted and actual values) for the test set. Residuals ideally should be randomly scattered around zero." |
1277 | 1275 | ] |
|
1317 | 1315 | "id": "8VE-arrmbLNL" |
1318 | 1316 | }, |
1319 | 1317 | "source": [ |
1320 | | - "*Evaluation Summary:* The model achieves a test MSE of approximately 10%. The residual plots provide insights into the model's error distribution.\n", |
1321 | | - "\n", |
1322 | | - "\n", |
| 1318 | + "*Evaluation summary:* The model achieves a test MSE of approximately 10 percentage points. The residual plots provide insight into the model's error distribution.\n",
1323 | 1319 | "\n", |
1324 | 1320 | "How well does your model perform? We achieved a test-set MSE of approximately 10 percentage points from the observed obesity prevalence. The model also fit the data with residuals clustered between -20% and 30%, which isn't bad for a simple model with only three explanatory variables."
1325 | 1321 | ] |
|
1330 | 1326 | "id": "qapl33x8fy_A" |
1331 | 1327 | }, |
1332 | 1328 | "source": [ |
1333 | | - "## 7. Conclusion and Next Steps\n", |
| 1329 | + "## 7. Conclusion and next steps\n", |
1334 | 1330 | "This notebook demonstrated the use of Data Commons to efficiently acquire data from multiple sources (CDC, BLS, Census) and build a simple linear regression model to predict obesity prevalence in US counties. Data Commons significantly streamlines the data gathering and integration process.\n", |
1335 | 1331 | "\n", |
1336 | 1332 | "The resulting model, using high blood pressure prevalence, unemployment rate, and poverty rate, provides a baseline prediction.\n", |
1337 | 1333 | "\n", |
1338 | | - "**Potential Improvements & Further Exploration:**\n", |
| 1334 | + "**Potential improvements & further exploration:**\n", |
1339 | 1335 | "\n", |
1340 | 1336 | "* **Add more variables:** Incorporate other variables known or hypothesized to correlate with obesity, such as:\n",
1341 | 1337 | " * `Percent_Person_WithHighCholesterol`\n", |
1342 | 1338 | " * `Percent_Person_WithDiabetes`\n", |
1343 | 1339 | " * Educational attainment levels\n", |
1344 | 1340 | " * Access to healthy food outlets\n", |
1345 | 1341 | " * Physical inactivity rates\n", |
1346 | | - "* **Feature Engineering:** Create new features from existing ones.\n", |
1347 | | - "* **Model Selection:** Experiment with different regression models (e.g., Ridge, Lasso, tree-based models).\n", |
1348 | | - "* **Geographic Analysis:** Explore spatial patterns in obesity prevalence and model errors.\n", |
1349 | | - "* **Alternative Data Sources:** Compare model performance using Census unemployment data instead of BLS data.\n", |
| 1342 | + "* **Feature engineering:** Create new features from existing ones.\n", |
| 1343 | + "* **Model selection:** Experiment with different regression models (e.g., Ridge, Lasso, tree-based models).\n", |
| 1344 | + "* **Geographic analysis:** Explore spatial patterns in obesity prevalence and model errors.\n", |
| 1345 | + "* **Alternative data sources:** Compare model performance using Census unemployment data instead of BLS data.\n", |
1350 | 1346 | "Data Commons provides access to a wide range of variables, enabling exploration of correlations with factors like university counts, crime rates (e.g., arson), or environmental factors (e.g., snowfall), potentially leading to more comprehensive models." |
1351 | 1347 | ] |
1352 | 1348 | } |
|