Simple Linear Regression

In this session, we will use the same CASchools dataset from the previous session and implement a simple linear regression model to explore a question of interest:

Research Question

Does expenditure per student affect student performance
in elementary school education?

1 Data Preparation

# load packages and dataset
pkgs <- c("tidyverse", "moments", "data.table", "ggsci", "stargazer")
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, function(pkg) suppressPackageStartupMessages(library(pkg, character.only = TRUE))))
f_name <- "https://raw.githubusercontent.com/my1396/FIN5005-Fall2025/refs/heads/main/data/CASchools_test_score.csv"
cas <- read_csv(f_name,
    col_types = cols(
        county = col_factor(), # read as factor
        grades = col_factor()
    )
)
cas %>% as.data.table()

     district                          school      county grades students
        <num>                          <char>      <fctr> <fctr>    <num>
  1:    75119              Sunol Glen Unified     Alameda  KK-08      195
  2:    61499            Manzanita Elementary       Butte  KK-08      240
  3:    61549     Thermalito Union Elementary       Butte  KK-08     1550
  4:    61457 Golden Feather Union Elementary       Butte  KK-08      243
  5:    61523        Palermo Union Elementary       Butte  KK-08     1335
 ---                                                                     
416:    68957          Las Lomitas Elementary   San Mateo  KK-08      984
417:    69518            Los Altos Elementary Santa Clara  KK-08     3724
418:    72611          Somis Union Elementary     Ventura  KK-08      441
419:    72744               Plumas Elementary        Yuba  KK-08      101
420:    72751            Wheatland Elementary        Yuba  KK-08     1778
     teachers calworks   lunch computer expenditure    income   english  read
        <num>    <num>   <num>    <num>       <num>     <num>     <num> <num>
  1:    10.90   0.5102  2.0408       67    6384.911 22.690001  0.000000 691.6
  2:    11.15  15.4167 47.9167      101    5099.381  9.824000  4.583333 660.5
  3:    82.90  55.0323 76.3226      169    5501.955  8.978000 30.000002 636.3
  4:    14.00  36.4754 77.0492       85    7101.831  8.978000  0.000000 651.9
  5:    71.50  33.1086 78.4270      171    5235.988  9.080333 13.857677 641.8
 ---                                                                         
416:    59.73   0.1016  3.5569      195    7290.339 28.716999  5.995935 700.9
417:   208.48   1.0741  1.5038      721    5741.463 41.734108  4.726101 704.0
418:    20.15   3.5635 37.1938       45    4402.832 23.733000 24.263039 648.3
419:     5.00  11.8812 59.4059       14    4776.336  9.952000  2.970297 667.9
420:    93.40   6.9235 47.5712      313    5993.393 12.502000  5.005624 660.5
      math
     <num>
  1: 690.0
  2: 661.9
  3: 650.9
  4: 643.5
  5: 639.9
 ---      
416: 707.7
417: 709.5
418: 641.7
419: 676.5
420: 651.0

Variable of interest: Test scores

read and math are the average reading and math test scores for each district.
In the last session, we showed that these two variables are highly correlated (r = 0.92).
We therefore construct a single measure, TestScore, as the average of the reading score and the math score.

# compute TestScore and append it to cas
cas <- cas %>%
  mutate(TestScore = (read + math) / 2)

2 Descriptive Statistics

Refer to the previous session for detailed summary statistics and visualizations.

A quick preview

v <- setdiff(names(cas), c("district", "school", "county", "grades"))
stargazer(as.data.frame(cas[v]), type="html", digits=2)


Statistic	N	Mean	St. Dev.	Min	Max

students	420	2,628.79	3,913.10	81	27,176
teachers	420	129.07	187.91	4.85	1,429.00
calworks	420	13.25	11.45	0.00	78.99
lunch	420	44.71	27.12	0.00	100.00
computer	420	303.38	441.34	0	3,324
expenditure	420	5,312.41	633.94	3,926.07	7,711.51
income	420	15.32	7.23	5.34	55.33
english	420	15.77	18.29	0.00	85.54
read	420	654.97	20.11	604.50	704.00
math	420	653.34	18.75	605.40	709.50
TestScore	420	654.16	19.05	605.55	706.75

3 Scatter Plot

Hypothesis: Higher spending per student leads to better test scores.

ggplot(cas, aes(x = expenditure, y = TestScore)) +
    geom_point(alpha = 0.6, color = "#1976d2") +
    geom_smooth(method = "lm", color = "#d32f2f", se = FALSE) +
    labs(
        title = "Scatter Plot of Test Scores vs. Expenditure per Student",
        x = "Expenditure per Student (USD)",
        y = "Average Test Score"
    ) +
    theme_minimal(base_size = 14)

There seems to be a positive relationship between expenditure per student and average test scores. Also note that the points are dispersed, indicating that other factors may also influence test scores. We will further explore this using multiple regression in future sessions.

For now, we will focus on a simple linear regression model with only one independent variable: expenditure per student.

Let’s calculate the correlation coefficient between expenditure and test scores.

with(cas, cor.test(expenditure, TestScore))


    Pearson's product-moment correlation

data:  expenditure and TestScore
t = 3.9841, df = 418, p-value = 7.989e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.09736861 0.28180139
sample estimates:
      cor 
0.1912728

What does the correlation coefficient tell us? Is it statistically significant? Justify your answer.

The reported correlation coefficient is 0.19 between expenditure and TestScore.

This is a positive but weak correlation: higher expenditures are associated with higher test scores, but the relationship is not very strong.
The p-value is 7.989e-05, which is far below the common significance levels (0.05, 0.01, or even 0.001).

This means we reject the null hypothesis that the true correlation is zero. The result is statistically significant.
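The t statistic and p-value reported by cor.test() can be reproduced by hand from the correlation coefficient and the sample size. The sketch below copies r and n from the output above; the variable names are only illustrative.

```r
# Values copied from the cor.test() output above
r <- 0.1912728   # sample correlation between expenditure and TestScore
n <- 420         # number of districts, so df = n - 2 = 418

# t statistic for H0: the true correlation is zero
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

round(t_stat, 4)     # should agree with t = 3.9841 reported above
signif(p_value, 4)   # should agree with the reported p-value
```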

4 Simple Linear Regression

4.1 Model Specification

We will fit a simple linear regression model to examine the relationship between expenditure per student and average test scores.

\[ \text{TestScore}_i = \beta_0 + \beta_1 \times \text{expenditure}_i + \varepsilon_i \tag{1}\]

Equation 1 is the linear regression model with a single independent variable.

Where:

Key Components of the Linear Regression Model

$\text{TestScore}_i$ on the left-hand side (LHS) is the dependent variable (response variable). $i$ indexes school districts.
$\text{expenditure}_i$ on the right-hand side (RHS) is the independent variable (explanatory variable or regressor).
$\beta_0$ and $\beta_1$ are the parameters (coefficients) of the model.
- $\beta_0$ is the intercept, representing the expected value of TestScore when expenditure is zero.
- $\beta_1$ is the slope coefficient, representing the expected change in TestScore for a one-dollar increase in expenditure.
$\varepsilon_i$ is the error term, capturing all other factors that affect TestScore but are not included in the model, e.g., teacher quality, school facilities, and parental involvement.

Equation 1 can be written more generally as:

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]

$\beta_1$ is the slope coefficient, which measures the change in the dependent variable $Y$ associated with a one-unit increase in the independent variable $X.$
$\beta_0$ is the intercept, which is the expected value of $Y$ when $X=0;$ it is the point at which the population regression line intersects the Y axis.

📌 In some econometric applications, the intercept has a meaningful economic interpretation. In other applications, the intercept has no real-world meaning; for example, when $X$ is the expenditure per student, strictly speaking the intercept is the expected value of test scores when there is no expenditure, which might be unrealistic.

When the real-world meaning of the intercept is nonsensical, it is best to think of it simply as the coefficient that determines the level of the regression line.

4.2 Estimating the Coefficients of the Linear Regression Model

Based on the scatter plot and the correlation analysis, we expect a positive relationship between expenditure and test scores. We will use the Ordinary Least Squares (OLS) method to estimate the coefficients of the linear regression model.
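For reference, OLS chooses the intercept and slope that minimize the sum of squared differences between the observed and predicted values of $Y$. In the one-regressor case the solution has a well-known closed form (written here in the general $X$, $Y$ notation, with $r_{XY}$ the sample correlation and $s_X$, $s_Y$ the sample standard deviations):

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = r_{XY} \, \frac{s_Y}{s_X}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]

Plugging in the statistics reported earlier ($r_{XY} \approx 0.1913$, $s_Y = 19.05$, $s_X = 633.94$, $\bar{Y} = 654.16$, $\bar{X} = 5312.41$) gives a slope of roughly $0.1913 \times 19.05 / 633.94 \approx 0.00575$ and an intercept of roughly $654.16 - 0.00575 \times 5312.41 \approx 623.6$.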

To run this regression, we use the lm() function in R.

# Fit the linear regression model
model <- lm(TestScore ~ expenditure, data = cas)
summary(model)


Call:
lm(formula = TestScore ~ expenditure, data = cas)

Residuals:
    Min      1Q  Median      3Q     Max 
-50.146 -14.206   0.689  13.513  50.127 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.236e+02  7.720e+00  80.783  < 2e-16 ***
expenditure 5.749e-03  1.443e-03   3.984 7.99e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.72 on 418 degrees of freedom
Multiple R-squared:  0.03659,   Adjusted R-squared:  0.03428 
F-statistic: 15.87 on 1 and 418 DF,  p-value: 7.989e-05

4.3 Interpretation of the output

A. Model Specification and Fitted Values

The regression estimates the effect of expenditure (per student, in USD) on test scores (the average of the reading and math scores).

From Theory to Practice:

We started with the theoretical model: \[ \text{TestScore}_i = \beta_0 + \beta_1 \times \text{expenditure}_i + \varepsilon_i \]

Using OLS estimation, we obtain the fitted model:

\[ \widehat{\text{TestScore}} = 623.6 + 0.00575 \times \text{expenditure} \]

The hat symbol ($\widehat{\text{TestScore}}$) indicates these are predicted values based on our sample data. Notice that we no longer have the error terms ($\varepsilon_i$) because this equation gives us the predicted values from the regression line.

Understanding Residuals:

Since our model can’t perfectly predict every observation, we need to measure how far off our predictions are from reality:

\[ \text{Residual}_i = \text{TestScore}_i - \widehat{\text{TestScore}}_i \]

Residuals represent the difference between actual and predicted values; essentially, they capture what our model couldn’t explain. The error terms ($\varepsilon_i$) in the theoretical model are unobserved, and the residuals are their sample estimates.
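The definition of a residual can be checked directly in R. The sketch below uses a tiny made-up data frame (the numbers are hypothetical and are not taken from CASchools) so that it runs on its own; the same calls apply to any model fitted with lm().

```r
# Toy data: made-up numbers for illustration only, not from CASchools
toy <- data.frame(
  expenditure = c(4000, 4500, 5000, 5500, 6000, 6500),
  TestScore   = c(640, 652, 649, 661, 655, 668)
)

fit <- lm(TestScore ~ expenditure, data = toy)

# residual = actual value - fitted (predicted) value
manual_resid <- toy$TestScore - fitted(fit)

# identical to what residuals() returns
all.equal(unname(manual_resid), unname(residuals(fit)))

# with an intercept in the model, OLS residuals sum to zero (up to rounding error)
sum(residuals(fit))
```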

B. Coefficients

Intercept (623.6): When expenditure = 0, the predicted test score is about 623.6. While not meaningful in practice (since expenditure cannot realistically be zero), it serves as the baseline of the regression line.
Expenditure (0.00575): For every one-dollar increase in expenditure per student, the average test score is predicted to increase by 0.00575 points. Put differently: a \$1,000 increase in expenditure is associated with a 5.75-point increase in test scores.
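To make the slope interpretation concrete, the sketch below evaluates the fitted line at two spending levels inside the observed range. It uses the rounded coefficients from the fitted equation (623.6 and 0.00575); predict_score is a hypothetical helper name, not a function from the session.

```r
# Rounded coefficients copied from the fitted model above
b0 <- 623.6     # intercept
b1 <- 0.00575   # slope: test-score points per USD of expenditure

# hypothetical helper: predicted average test score at a given spending level
predict_score <- function(expenditure) b0 + b1 * expenditure

predict_score(5000)                         # prediction at 5,000 USD per student
predict_score(6000) - predict_score(5000)   # change for an extra 1,000 USD
```

Because the model is linear, the 1,000 USD difference is the same wherever it is evaluated: 1000 times the slope.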

C. Statistical Significance

For the slope coefficient ($\beta_1$):

The t-value is 3.984 and the p-value is 7.99e-05, which is well below 0.001.
We therefore reject the null hypothesis that $\beta_1 = 0$: the positive association between expenditure and test scores is statistically significant.

D. Model Fit

R-squared = 0.0366 (about 3.7%).
- This means expenditure explains only a small portion of the variation in test scores.
- Many other factors (teacher quality, socio-economic background, school resources, etc.) likely play a larger role.
F-statistic = 15.87 with a p-value of 7.99e-05.
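- This tests the null hypothesis that all slope coefficients are equal to zero (no effect). With a single regressor, it is equivalent to the t-test on expenditure: $F = t^2$ ($3.984^2 \approx 15.87$), and the two p-values coincide.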
- The low p-value indicates we reject the null hypothesis, confirming that the model has some explanatory power.
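In a simple regression, the fit statistics are tied to quantities already reported: R-squared is the squared correlation, and the F-statistic is the squared t-value of the slope. The sketch below recomputes them from values copied out of the output above; the variable names are only illustrative.

```r
# Values copied from the output above
r      <- 0.1912728  # correlation between expenditure and TestScore
t_stat <- 3.984      # t value of the slope
n      <- 420        # number of observations
k      <- 1          # number of regressors

r_squared <- r^2                                           # R-squared
adj_r_sq  <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)   # adjusted R-squared
f_stat    <- t_stat^2                                      # F = t^2 with one regressor
f_p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

round(c(R2 = r_squared, adjR2 = adj_r_sq, F = f_stat), 4)
signif(f_p_value, 3)
```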

E. Residuals

Investigating the residuals requires diagnostic tests that check the assumptions of linear regression (e.g., homoskedasticity and normality of the errors).

We will cover them in future sessions.

Here we provide a brief overview.

The first plot is the “Residuals vs Fitted” plot.
- The residuals appear reasonably balanced around zero, though the spread suggests substantial unexplained variation.
- The spread of residuals seems to increase slightly with fitted values, indicating potential heteroskedasticity (non-constant variance of the errors). Heteroskedasticity does not by itself bias the OLS coefficient estimates, but it makes the usual standard errors, and therefore the t-tests and p-values, unreliable.

The second plot is the normal Q-Q plot, which assesses whether the residuals are normally distributed.
- The points generally follow the reference line, suggesting that the residuals are approximately normally distributed, although the tails are somewhat thin.
- Thin tails mean that there are few extreme residuals: the model’s predictions do not produce large outliers.
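The two diagnostic plots discussed above can be drawn with base R's plot() method for lm objects. The sketch below wraps the call in a hypothetical helper, plot_diagnostics, and demonstrates it on R's built-in cars data so that it runs on its own; it is one possible way to produce such plots, not necessarily how the figures in this session were made.

```r
# Hypothetical helper (not part of the original session): draws the
# "Residuals vs Fitted" and "Normal Q-Q" plots for a fitted lm object.
plot_diagnostics <- function(fit) {
  old_par <- par(mfrow = c(1, 2))   # two panels side by side
  on.exit(par(old_par))             # restore graphics settings on exit
  plot(fit, which = 1:2)            # 1 = Residuals vs Fitted, 2 = Normal Q-Q
  invisible(fit)
}

# Stand-in model on the built-in `cars` data, only so the block is runnable
demo_fit <- lm(dist ~ speed, data = cars)
plot_diagnostics(demo_fit)
```

In this session, the corresponding call would be plot_diagnostics(model).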

From the summary output:

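The residual standard error = 18.72, which is roughly the typical size of a prediction error (in test score points). It is close to the standard deviation of TestScore itself (19.05), consistent with the low R-squared.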
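For reference, the residual standard error in a simple regression is computed from the residuals with $n - 2$ degrees of freedom (here $420 - 2 = 418$, as shown in the summary output):

\[ \text{RSE} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} \hat{\varepsilon}_i^{\,2}}, \qquad \hat{\varepsilon}_i = \text{TestScore}_i - \widehat{\text{TestScore}}_i \]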

✅ Summary Interpretation:

The regression analysis shows a statistically significant positive relationship between school expenditure and test scores. On average, an additional \$1,000 in per-student expenditure is associated with an increase of about 5.75 points in test scores. However, the explanatory power of the model is low (R² ≈ 3.7%), indicating that expenditure alone explains only a small fraction of the variation in test scores. This suggests that while spending matters, many other factors also influence student performance.

To gain a more comprehensive understanding of the factors affecting student achievement, future analyses should employ multiple regression models that incorporate additional explanatory variables such as student demographics, teacher qualifications, and school characteristics. We will cover multiple regression in subsequent sessions.

5 Report by Stargazer

stargazer produces well-formatted regression tables that can be easily included in reports or publications.

stargazer(model, type = "text", digits = 2)


===============================================
                        Dependent variable:    
                    ---------------------------
                             TestScore         
-----------------------------------------------
expenditure                   0.01***          
                              (0.001)          
                                               
Constant                     623.62***         
                              (7.72)           
                                               
-----------------------------------------------
Observations                    420            
R2                             0.04            
Adjusted R2                    0.03            
Residual Std. Error      18.72 (df = 418)      
F Statistic           15.87*** (df = 1; 418)   
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
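With digits = 2, the slope is displayed as 0.01 and its standard error as 0.001, which hides the estimate of 0.00575. One possible remedy is to measure expenditure in thousands of dollars before fitting; passing a larger digits value to stargazer is another. The sketch below illustrates the rescaling on simulated data (generated only for this example, not the CASchools data): the slope becomes 1,000 times larger while the t statistic is unchanged.

```r
set.seed(1)  # simulated data for illustration only; not the CASchools data
n <- 200
sim <- data.frame(expenditure = runif(n, 4000, 7500))
sim$TestScore <- 620 + 0.006 * sim$expenditure + rnorm(n, sd = 18)

fit_usd  <- lm(TestScore ~ expenditure, data = sim)            # slope per USD
fit_kusd <- lm(TestScore ~ I(expenditure / 1000), data = sim)  # slope per 1,000 USD

# the rescaled slope is 1,000 times the original slope
slope_ratio <- unname(coef(fit_kusd)[2] / coef(fit_usd)[2])
slope_ratio

# the t statistic of the slope does not depend on the units of the regressor
t_usd  <- summary(fit_usd)$coefficients[2, "t value"]
t_kusd <- summary(fit_kusd)$coefficients[2, "t value"]
c(t_usd = t_usd, t_kusd = t_kusd)
```

In this session, the analogous call would be lm(TestScore ~ I(expenditure / 1000), data = cas); given the estimate reported above, its slope should be about 5.75 points per \$1,000.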

References

Introduction to Econometrics with R