1  The mechanics of linear regression

In this chapter, you are going to read portions of Chapters 1 to 3 of Wooldridge (2020) along with this study guide. We will be focusing on the algebraic, numerical, or mechanical aspects of the main tool for the course – linear regression.

While reading this chapter along with the related portions of the textbook, ignore the parts where the word “model” appears. The point of this chapter is that we can run regressions without any reference to a model. We could do more if there were a model; that will be the subject of the next chapter.

If you want to read an alternative version of what is written here, I would suggest Section 1.1 of Pua (2024).

1.1 Motivation

Note

Read Section 1-1 first before proceeding.

Tip: Exercises
  1. How was econometrics defined in the textbook? What can we do once we are able to use econometrics?
  2. What makes econometrics different from statistics?
  3. Provide examples of experimental and observational data coming from the Philippines. Be specific as to why the data can be considered experimental or observational.

Section 1-1 discusses the distinction between experimental and observational data. In recent times, there have been attempts to combine both types of data, especially in light of privacy concerns (Mann, Sales, and Gagnon-Bartsch 2025), the biases from passively collected data (Athey, Chetty, and Imbens 2025), and the explosion of big data (Rosenman et al. 2023). A very recent review of the attempts to combine both types of data can be found in Rosenman (2025).

1.2 Data in spreadsheets

Note

Read Section 1-3a first before proceeding. Focus on the dataset called WAGE1 in the textbook and pay attention to how the dataset is structured and how the variables are measured.

The textbook has a collection of “cleaned” datasets. “Cleaned” here means that someone accessed the original sources, processed these different sources together, and put together a dataset meant for immediate use. As a starting point, we will work on these cleaned datasets first.

It is best to have these datasets conveniently available to you so that you can work on any of them at your own time.1 We will use the R package wooldridge (Shea 2024).

Let us reproduce Table 1.1 on page 6. First, install the R package2 wooldridge. Second, we need to load3 the wooldridge package in order to load cleaned datasets from the textbook.

Below, you will see commands wrapped in interactive code cells. You can run the interactive code cells while working in your internet browser4. These can also be copied and pasted into R or RStudio, which you can run on your laptop or desktop.

The interactive code cells containing the commands will take some time to run because a package needs to be downloaded. Notice that the entire wage1 dataset will be displayed.
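If you prefer to work outside the browser, the commands inside those cells look roughly like the following sketch (the install step only needs to be run once):

```r
# Install the wooldridge package (first time only), then load it
install.packages("wooldridge")
library(wooldridge)

# Load the cleaned WAGE1 dataset and display it
data("wage1")
wage1
```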

To reproduce Table 1.1 more closely, we need to show the first 5 rows, the last 2 rows, and the entries of the columns obsno, wage, educ, exper, female, married. The key R commands here are to extract rows and columns of a data frame and c().
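A minimal sketch of this extraction, using c() to collect the column names and a vector of row numbers to keep the first 5 and last 2 rows:

```r
# Columns that appear in Table 1.1 (obsno corresponds to the row names)
cols <- c("wage", "educ", "exper", "female", "married")

# Keep the first 5 and last 2 rows
n <- nrow(wage1)
wage1[c(1:5, (n - 1):n), cols]
```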

Notice that the name obsno does not really show up explicitly.

Another option is to use head(), tail(), and rbind().
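A sketch of the same display built from these functions, reusing the cols vector defined above:

```r
# Stack the first 5 rows on top of the last 2 rows
rbind(head(wage1[, cols], 5), tail(wage1[, cols], 2))
```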

Important

Why show you different ways of doing the same thing?

Note

Look up Computer Exercise C1 in Chapter 1. Give the exercises a quick scan to get a sense of what you are asked to do.

One way to learn R is to practice using the computer exercises. You may also choose to ask for AI assistance to look up which R commands to use for a particular task. After that, remember these commands. For now, we work on items (i), (ii), and (v) of Computer Exercise C1 together.

Tip: Exercises
  1. Write down sentences to communicate your answers to C1 (i).
  2. Do the same for C1 (ii). How would you know whether the average seems high or low? What is your basis?
  3. To obtain the CPI figures for the US, consult this link. What do you notice about the measurement of CPI?
  4. For adjusting CPI, consult this link. Can you reproduce the number 5.572 found in that link? Now try answering C1 items (iii) and (iv).
Tip: More practice

Do Chapter 1 Computer Exercises C2, C3, C4 (except (iv)), C5 (except (iv)), C6 (except (iv)), C7 (except (iv)), C8 (except (iii) and (iv)).

You might have to pick up some R skills at a fasteR rate.

1.3 Running regressions with cross-sectional data

Let us demonstrate the main tool you will be using for this course. We will primarily use the dataset wage1 and see how it is used in Chapters 2 to 7 of the textbook. In this way, you can already get a feel for how to run regressions, even if you do not necessarily know why you are running them.

Note

Read Example 2.4 found in Section 2-2 first before proceeding. It is possible that you might not understand everything now – you will get there soon.

Let us reproduce the finding in (2.27).
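A sketch of the command, which should match (2.27):

```r
# Simple regression of wage on educ
lm(wage ~ educ, data = wage1)
```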

Congratulations! You now know how to run a simple linear regression. Another way to say this is that you have run a (linear) regression of wage on educ. wage is called a regressand and educ is called a regressor5.

Let us try reproducing other regressions found in the book that use wage1.

Note

Read Example 3.2 found in Section 3-2b first before proceeding.

Observe that the wage variable is nonlinearly transformed into natural logarithmic form6. In addition, the regressors exper and tenure now appear on top of educ. Let us reproduce the finding in (3.19).
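A sketch of the corresponding command, which should match (3.19):

```r
# Regression of log(wage) on educ, exper, and tenure
lm(log(wage) ~ educ + exper + tenure, data = wage1)
```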

Note

Focus on (6.12) on page 189.

Observe that the wage variable is not logarithmically transformed and that the experience variable is included as exper and exper\(^2\). The latter is also an example of a nonlinear transformation: squaring.
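A sketch of this regression, assuming (6.12) uses wage as the regressand and exper together with its square as the regressors:

```r
# Quadratic in experience; I() protects the squaring inside the formula
lm(wage ~ exper + I(exper^2), data = wage1)
```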

Note

Read Example 7.1 found in Section 7-2 first before proceeding.

Let us reproduce the findings in (7.4) and (7.5). The difference this time is the presence of the female variable.
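A sketch of the two commands, assuming (7.4) adds female to educ, exper, and tenure, while (7.5) uses female alone:

```r
# (7.4): female dummy alongside educ, exper, and tenure
lm(wage ~ female + educ + exper + tenure, data = wage1)

# (7.5): wage regressed on the female dummy only
lm(wage ~ female, data = wage1)
```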

Congratulations! You have run several regressions and been exposed to multiple linear regression, transformations, and binary/dummy variables. All of them were computed using the same dataset!

Tip: Exercises / more practice

Reproduce the regression results reported in (2.26), (2.28), (2.44), (2.46), (3.15), (6.32), (7.8), and (7.9).

1.4 Running regressions with time series data

Note

Read Section 1-3b first before proceeding. Focus on the dataset called PRMINWGE in the textbook and pay attention to how the dataset is structured and how the variables are measured.

Time series data need to be treated differently in R and in other statistical software. We need to explicitly declare the data frequency.

We will be using the dataset prminwge based on the research by Castillo-Freeman and Freeman (1992). This book chapter is one of those rare research publications where the data is part of the publication itself. We will reproduce the findings in column (6) of Table 6.2. This is an opportunity for you to see what a lagged variable7 and a linear time trend are.

Note

Give a quick read of the paragraph, starting from “As a final test …”, on page 181 of Castillo-Freeman and Freeman (1992). Pay attention to the variables we need for column (6) of Table 6.2. The variables used should be matched with the names from prminwge.

We will be loading the dataset and then giving you a sense of what lagged variables and a linear time trend are. We need the logarithm of manufacturing earnings lagged by one year.
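A sketch of these steps; the annual frequency and the 1950 start year are taken from the textbook's description of PRMINWGE, so treat them as assumptions to verify:

```r
library(wooldridge)
data("prminwge")

# Declare the log of manufacturing earnings as an annual time series
lmfgwage <- ts(log(prminwge$mfgwage), start = 1950, frequency = 1)

# Line it up with its one-year lag (stats::lag with k = -1 shifts the series
# so that the value recorded at time t is the value from t - 1)
cbind(lmfgwage, lmfgwage_lag1 = stats::lag(lmfgwage, -1))
```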

What you saw earlier with cbind() is just for illustration purposes. ts.intersect() is what you should use to combine the log of manufacturing earnings with its one year lag. After that, we attempt to reproduce column (6) of Table 6.2: \[\begin{eqnarray*}\widehat{\log\left(\mathsf{mfgwage}\right)}_t &=& -0.21+0.20\log\left(\mathsf{avgmin}\right)_t-0.00t \\ && +0.09\log\left(\mathsf{prdef}\right)_t+0.05\log\left(\mathsf{prgnp}\right)_t\\ &&+0.72\log\left(\mathsf{mfgwage}\right)_{t-1}\end{eqnarray*}\]

The variable \(t\) is the linear time trend and the variable \(\log\left(\mathsf{mfgwage}\right)_{t-1}\) is a one year lag of the log of manufacturing earnings.
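A sketch of the whole procedure, reusing the series declared above; the variable names come from prminwge, and the construction of the time trend is an illustrative choice:

```r
# Logs of the remaining regressors, declared as annual series
lavgmin <- ts(log(prminwge$avgmin), start = 1950, frequency = 1)
lprdef  <- ts(log(prminwge$prdef), start = 1950, frequency = 1)
lprgnp  <- ts(log(prminwge$prgnp), start = 1950, frequency = 1)
trend   <- ts(seq_len(nrow(prminwge)), start = 1950, frequency = 1)

# Keep only the years where current and lagged values overlap
dat <- ts.intersect(lmfgwage, lmfgwage_1 = stats::lag(lmfgwage, -1),
                    lavgmin, trend, lprdef, lprgnp)

# Attempt to reproduce column (6) of Table 6.2
lm(lmfgwage ~ lavgmin + trend + lprdef + lprgnp + lmfgwage_1,
   data = as.data.frame(dat))
```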

Compare with the published results. What do you notice?

Tip: Exercises

Reproduce the regression results reported in (10.18) and (10.19).

1.5 How were the regressions computed?

1.5.1 Formulas which facilitate interpretation

Note

Focus on the system of equations in (2.14) and (2.15). For now, accept these equations as they are. Next, focus on (2.17) and (2.19). These are the solutions to the system in (2.14) and (2.15), provided that (2.18) holds. Work out how (2.17) and (2.19) are solutions before proceeding. You might need to refer to Appendix A-1 for a refresher when working with summation notation.

Focus on (2.19). We can interpret \(\widehat{\beta}_1\) as a ratio of the sample covariance of \(x\) and \(y\) to the sample variance of \(x\).

Note

It is very likely that you have heard of the sample variance. If you have never heard of the sample covariance, refer to Appendix C equation (C.14). This is the definition of the sample covariance.

You might not notice it yet, but the definition in (C.14) and the numerator in (2.19) have a slight difference – the letters are capitalized in (C.14) but not in (2.19). We will come back to this later.

It is best to make sense of the sample covariance through its sign. We are looking at pairs \(\left(x_1, y_1\right), \left(x_2, y_2\right), \ldots, \left(x_n, y_n\right)\). Here \(n\) is the total number of complete observations or the sample size. The \(x\)’s have an average \(\overline{x}\) and similarly for the \(y\)’s. There are four possibilities:

  1. Some observation \(i\) could have \(x_i>\overline{x}\) and \(y_i>\overline{y}\). Then \(\left(x_i-\overline{x}\right)\left(y_i-\overline{y}\right)>0\).
  2. Some observation \(i\) could have \(x_i<\overline{x}\) and \(y_i<\overline{y}\). Then \(\left(x_i-\overline{x}\right)\left(y_i-\overline{y}\right)>0\).
  3. Some observation \(i\) could have \(x_i>\overline{x}\) and \(y_i<\overline{y}\). Then \(\left(x_i-\overline{x}\right)\left(y_i-\overline{y}\right)<0\).
  4. Some observation \(i\) could have \(x_i<\overline{x}\) and \(y_i>\overline{y}\). Then \(\left(x_i-\overline{x}\right)\left(y_i-\overline{y}\right)<0\).

So some observations contribute a positive amount while other observations contribute a negative amount. If the positive amounts overwhelm the negative amounts, then the sample covariance is positive. If the negative amounts overwhelm the positive amounts, then the sample covariance is negative. The sample covariance would roughly be zero if the total of the positive contributions to the sum is roughly the same as the total of the negative contributions.

Tip: Exercise

What is the unit of measurement of the sample covariance? If it helps, think about the context of Example 2.4.

Recall that we already reproduced equation (2.27) in Example 2.4. Let us try computing \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) if we let \(y=\)wage and \(x=\)educ.
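A sketch of the computation using (2.19) and (2.17) directly:

```r
# Slope: sample covariance of educ and wage divided by sample variance of educ
b1 <- cov(wage1$educ, wage1$wage) / var(wage1$educ)

# Intercept: mean of wage minus slope times mean of educ
b0 <- mean(wage1$wage) - b1 * mean(wage1$educ)
c(intercept = b0, slope = b1)
```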

As a result, the command lm() reports as its coefficients \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\). A system of two equations in two unknowns \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) is being solved simultaneously.

Note

Read Section 2-2. Start from page 26. Stop reading before you reach “Not surprisingly, …”

Where does this system of equations come from?

Note

Read Appendix 2A. If you have forgotten your calculus skills and why you are using them (which you should not have), now is the time to return to your notes in ECOMATH. We will be using ideas from optimization and matrix algebra.

Note that there is a new word here: residuals. Do not worry about it for now. Do make a note of this new word.

1.5.2 A matter of notation

You may have read the phrase “dummy arguments”. What does this phrase mean? It means that we could have written the minimization problem in Appendix 2A as \[\min_{a,b} \sum_{i=1}^n\left(y_i-a-bx_i\right)^2\] or even \[\min_{\alpha_0,\alpha_1} \sum_{i=1}^n\left(y_i-\alpha_0-\alpha_1 x_i\right)^2\] yet the underlying meaning of the optimization problem is not lost. We ultimately are minimizing a function \(Q\left(b_0,b_1\right)\) or \(Q\left(a,b\right)\) or \(Q\left(\alpha_0,\alpha_1\right)\). We stick to the dummy arguments used in the textbook.

Another matter of notation is \[\frac{\partial Q\left(\widehat{\beta}_0,\widehat{\beta}_1\right)}{\partial b_0}=0,\ \ \frac{\partial Q\left(\widehat{\beta}_0,\widehat{\beta}_1\right)}{\partial b_1}=0.\] These tell you that we are taking the first-order partial derivatives of \(Q\left(b_0,b_1\right)\) with respect to \(b_0\) and \(b_1\) and then evaluate at a point \((\widehat{\beta}_0, \widehat{\beta}_1)\): \[\frac{\partial Q\left(b_0,b_1\right)}{\partial b_0}\bigg\vert_{b_0=\widehat{\beta}_0,b_1=\widehat{\beta}_1}=0,\ \ \frac{\partial Q\left(b_0,b_1\right)}{\partial b_1}\bigg\vert_{b_0=\widehat{\beta}_0,b_1=\widehat{\beta}_1}=0.\]

1.5.3 Solution to the least squares problem

The minimization problem found in Appendix 2A is sometimes called an ordinary least squares problem. Sometimes you will hear and see OLS or LS. “Least” here refers to minimization. “Squares” here means the squared terms which form the sum.

“O” stands for ordinary, which refers to the equal contribution of each observation to the sum. We observe pairs \(\left(x_1, y_1\right), \left(x_2, y_2\right), \ldots, \left(x_n, y_n\right)\). Each observation from \(i=1,\ldots, n\) contributes \[\left(y_i-b_0-b_1x_i\right)^2=1\times \left(y_i-b_0-b_1x_i\right)^2\] to the total. This is in contrast to weighted least squares (WLS), which will be discussed next time.

After applying calculus to find the first-order necessary conditions, you will come to the conclusion that an optimal solution \(\left(\widehat{\beta}_0, \widehat{\beta}_1\right)\) solves a system of equations (2.14) and (2.15). In addition, the solution found in (2.17) and (2.19) is the only one.

To determine if you have indeed found a minimizer, you need to check the second-order sufficient conditions. Appendix 2A exploits the quadratic structure of \(Q\left(b_0,b_1\right)\). In particular, you must show that \[Q\left(b_0,b_1\right)=Q\left(\widehat{\beta}_0, \widehat{\beta}_1\right)+\sum_{i=1}^n \left(\left(\widehat{\beta}_0-b_0\right)+\left(\widehat{\beta}_1-b_1\right)x_i\right)^2.\] The first term after the equality is the minimized value of the objective function evaluated at \(\left(\widehat{\beta}_0, \widehat{\beta}_1\right)\). It no longer depends on \(b_0\) and \(b_1\). Therefore, it is unaffected by the choice of \(b_0\) and \(b_1\).

The second term is always greater than or equal to zero. In addition, this second term depends on \(b_0\) and \(b_1\). The only way to minimize \(Q\left(b_0,b_1\right)\) is to choose \(b_0=\widehat{\beta}_0\) and \(b_1=\widehat{\beta}_1\).

Where is the quadratic structure coming from? Focus on the expression for \(Q\left(b_0,b_1\right)\). Observe that the second term is quadratic in \(b_0\) and \(b_1\). We can also graph the three-dimensional surface \(Q\left(b_0,b_1\right)\). Notice the bowl-like graphic.
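A sketch of such a plot for the wage-educ data; the grid ranges are arbitrary choices around the least squares solution:

```r
# Least squares objective as a function of the candidate intercept and slope
Q <- function(b0, b1) sum((wage1$wage - b0 - b1 * wage1$educ)^2)

# Evaluate Q over a grid and draw the bowl-shaped surface
b0_grid <- seq(-5, 5, length.out = 50)
b1_grid <- seq(0, 1, length.out = 50)
Q_vals  <- outer(b0_grid, b1_grid, Vectorize(Q))
persp(b0_grid, b1_grid, Q_vals, theta = 30, phi = 20,
      xlab = "b0", ylab = "b1", zlab = "Q(b0, b1)")
```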

1.5.4 Matrix representation

After some algebraic manipulation, you can write (2.14) and (2.15) in matrix form, specifically:

\[\begin{eqnarray*} \underbrace{\begin{pmatrix}n & \displaystyle\sum_{i=1}^n x_i \\ \displaystyle\sum_{i=1}^n x_i & \displaystyle\sum_{i=1}^n x_i^2\end{pmatrix}}_{\mathbf{A}}\underbrace{\begin{pmatrix}\widehat{\beta}_0 \\ \widehat{\beta}_1\end{pmatrix}}_{\widehat{\boldsymbol\beta}} = \underbrace{\begin{pmatrix}\displaystyle\sum_{i=1}^n y_i \\ \displaystyle\sum_{i=1}^n x_iy_i\end{pmatrix}}_{\mathbf{b}}. \end{eqnarray*}\] Provided that \(\mathbf{A}\) has an inverse, then \(\widehat{\boldsymbol\beta}=\mathbf{A}^{-1}\mathbf{b}\). The matrices \(\mathbf{A}\) and \(\mathbf{b}\) can be written in terms of the matrices containing the data, specifically, let:

\[\mathbf{X}=\begin{pmatrix}1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n\end{pmatrix},\ \ \mathbf{y}=\begin{pmatrix}y_1 \\ y_2 \\ \vdots \\ y_n\end{pmatrix}.\] You can show that \(\mathbf{A}=\mathbf{X}^\mathsf{T}\mathbf{X}\) and \(\mathbf{b}=\mathbf{X}^\mathsf{T}\mathbf{y}\). Therefore, \(\widehat{\boldsymbol\beta}=\left(\mathbf{X}^\mathsf{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathsf{T}\mathbf{y}\).
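A sketch of this computation for the wage-educ data:

```r
# Build X and y, then solve the system (X'X) beta = X'y
X <- cbind(1, wage1$educ)
y <- wage1$wage
solve(t(X) %*% X, t(X) %*% y)
```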

We now have another representation of the optimal solution to the least squares problem. The advantage of this representation is that it facilitates computation and it can easily be extended to more columns in \(\mathbf{X}\). The disadvantage is that we lose the nice interpretation of \(\widehat{\beta}_1\) as the ratio of the sample covariance of \(x\) and \(y\) to the sample variance of \(x\).

Note

Different versions of a formula highlight different aspects of the underlying object being studied. Sometimes, a version of a formula is useful to build intuition. Other times, another version could make computations easier. Knowing all of them is crucial.

1.5.5 Objects produced after least squares

Note

Read Section 2-2. Start from page 27 “The estimates given in …” and stop reading before you reach page 28 “Equation (2.23) is also called …”.

When you apply least squares to data, like applying lm() to data in a spreadsheet, there are objects the procedure will produce:

  1. Regression coefficients (2.17) and (2.19): These are \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\).
  2. The \(i\)th fitted value \(\widehat{y}_i\) (2.20): \[\widehat{y}_i=\widehat{\beta}_0+\widehat{\beta}_1 x_i\] There would be \(n\) fitted values for the dataset.
  3. The \(i\)th residual \(\widehat{u}_i\) (2.21): \[\widehat{u}_i=y_i-\widehat{y}_i=y_i-\widehat{\beta}_0-\widehat{\beta}_1 x_i\] There would be \(n\) residuals for the dataset.
  4. Sum of squared residuals or residual sum of squares (2.22) or (2.35): \[\mathsf{SSR}=\sum_{i=1}^n \widehat{u}_i^2=\sum_{i=1}^n \left(y_i-\widehat{\beta}_0-\widehat{\beta}_1 x_i\right)^2=Q\left(\widehat{\beta}_0, \widehat{\beta}_1\right)\]
  5. Total sum of squares (2.33): \[\mathsf{SST}=\sum_{i=1}^n \left(y_i-\overline{y}\right)^2\] This is nothing but \((n-1)\) times the sample variance of \(y\).
  6. Explained sum of squares (2.34): \[\mathsf{SSE}=\sum_{i=1}^n \left(\widehat{y}_i-\overline{y}\right)^2\]
  7. The OLS regression line (2.23): \[\widehat{y}=\widehat{\beta}_0+\widehat{\beta}_1 x\]

We now illustrate these objects using Example 2.4. First, we look into the regression coefficients, fitted values, and residuals.
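A sketch of how to pull these objects out of lm():

```r
# Fit the regression from Example 2.4
fit <- lm(wage ~ educ, data = wage1)

coef(fit)            # regression coefficients
head(fitted(fit))    # first few fitted values
head(residuals(fit)) # first few residuals
```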

Next, we compute \(\mathsf{SSR}\), \(\mathsf{SST}\), and \(\mathsf{SSE}\):
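A sketch of the computation, reusing the fit object from above:

```r
SSR <- sum(residuals(fit)^2)
SST <- sum((wage1$wage - mean(wage1$wage))^2)
SSE <- sum((fitted(fit) - mean(wage1$wage))^2)
c(SSR = SSR, SST = SST, SSE = SSE)
```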

Finally, we can plot the regression line on a scatterplot of wage against educ. Can you find where the fitted values and residuals are in this plot?
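A sketch of the plot:

```r
# Scatterplot of wage against educ with the OLS regression line overlaid
plot(wage ~ educ, data = wage1)
abline(fit)
```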

1.6 Consequences of using least squares

Note
  1. Read Section 2-2 page 28 starting from “In most cases, …”. Stop reading before you reach “Because these examples …”.
  2. Read Section 2-2a entirely.
  3. Read Section 2-3 entirely.
  4. Read Section 2-4a entirely.

1.6.1 Algebraic properties

When you apply least squares to data, like applying lm() to data in a spreadsheet, the objects being produced have to satisfy numerical or algebraic properties:

  1. We can always write \(y_i=\widehat{y}_i+\widehat{u}_i\) for every \(i=1,\ldots, n\).
  2. The sum of the OLS residuals is zero. As a result, the sample average of the OLS residuals is also zero. We can write these as \[\sum_{i=1}^n \widehat{u}_i = 0,\ \ \overline{\widehat{u}}=\frac{1}{n}\sum_{i=1}^n \widehat{u}_i = 0\]
  3. The sample average of the fitted values is the sample average of the actual values. We can write this as \[\overline{\widehat{y}}=\frac{1}{n}\sum_{i=1}^n \widehat{y_i}=\frac{1}{n}\sum_{i=1}^n y_i=\overline{y}\]
  4. The sample covariance between \(x\) and the residuals is zero. We can write this as \[\frac{1}{n-1}\sum_{i=1}^n \left(x_i-\overline{x}\right)\left(\widehat{u}_i-\overline{\widehat{u}}\right)=0\]
  5. The sample covariance between the fitted values \(\widehat{y}\) and the residuals \(\widehat{u}\) is zero. We can write this as \[\frac{1}{n-1}\sum_{i=1}^n \left(\widehat{y}_i-\overline{\widehat{y}}\right)\left(\widehat{u}_i-\overline{\widehat{u}}\right)=0\]
  6. The point \(\left(\overline{x}, \overline{y}\right)\) is always on the OLS regression line. We can write this as \[\overline{y}=\widehat{\beta}_0+\widehat{\beta}_1\overline{x}=\begin{pmatrix}1 & \overline{x}\end{pmatrix}\begin{pmatrix}\widehat{\beta}_0 \\ \widehat{\beta}_1\end{pmatrix}\]
  7. \(\mathsf{SST}=\mathsf{SSE}+\mathsf{SSR}\)

We can verify these properties hold for Example 2.4. Take note that these properties hold in general.
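A sketch of the verification, reusing fit, SST, SSE, and SSR from the earlier cells; every entry should be zero up to rounding error:

```r
c(sum_of_residuals         = sum(residuals(fit)),
  mean_fitted_minus_mean_y = mean(fitted(fit)) - mean(wage1$wage),
  cov_educ_residuals       = cov(wage1$educ, residuals(fit)),
  cov_fitted_residuals     = cov(fitted(fit), residuals(fit)),
  sst_minus_sse_plus_ssr   = SST - (SSE + SSR))
```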

1.6.2 Interpreting regression coefficients

1.6.2.1 Generic interpretation

The OLS regression line \(\widehat{y}=\widehat{\beta}_0+\widehat{\beta}_1 x\) already provides us a way to interpret the regression coefficients. Think of \(\widehat{y}\) as a function of \(x\). To make this explicit, let \(\widehat{y}(x)=\widehat{\beta}_0+\widehat{\beta}_1x\).

When \(x=0\), we have \(\widehat{y}(0)=\widehat{\beta}_0\). Therefore, \(\widehat{\beta}_0\) is the fitted value of \(y\) when \(x=0\).

When \(x=x_{\mathsf{old}}\), we have \(\widehat{y}(x_{\mathsf{old}})=\widehat{\beta}_0+\widehat{\beta}_1 x_{\mathsf{old}}\). Similarly, when \(x=x_{\mathsf{new}}\neq x_{\mathsf{old}}\), we have \(\widehat{y}(x_{\mathsf{new}})=\widehat{\beta}_0+\widehat{\beta}_1 x_{\mathsf{new}}\). Therefore, \[\widehat{y}(x_{\mathsf{new}})-\widehat{y}(x_{\mathsf{old}})=\widehat{\beta}_1 \left(x_{\mathsf{new}}-x_{\mathsf{old}}\right) \Rightarrow \widehat{\beta}_1=\frac{\widehat{y}(x_{\mathsf{new}})-\widehat{y}(x_{\mathsf{old}})}{x_{\mathsf{new}}-x_{\mathsf{old}}}.\] It is not surprising that \(\widehat{\beta}_1\) can be interpreted as the slope of the OLS regression line.

Let \(\Delta \widehat{y}=\widehat{y}(x_{\mathsf{new}})-\widehat{y}(x_{\mathsf{old}})\) and \(\Delta x=x_{\mathsf{new}}-x_{\mathsf{old}}\). Therefore, \(\Delta\widehat{y}=\widehat{\beta}_1\Delta x\). So, we can compute the change in the fitted value of \(y\) for a given change in \(x\). In your interpretation, you have to provide your own \(\Delta x\).

Important

It should be extremely clear that we cannot compute the change in the actual value of \(y\) for a given change in \(x\).

Note

Revisit the interpretation of the results (2.27) found in Example 2.4. The interpretation of \(\widehat{\beta}_0=-0.90\) is accurate, but silly as the textbook suggests.

The interpretation of \(\widehat{\beta}_1=0.54\) is slightly inaccurate, specifically, “The slope estimate in (2.27) implies that one more year of education increases hourly wage by 54 cents an hour.” How should you change it?

1.6.2.2 Improving interpretation via centering

We could improve the interpretation of the intercept by centering \(x\). Start from the fitted values and apply the usual “add-subtract” trick: \[\begin{eqnarray*}\widehat{y}_i &=&\widehat{\beta}_0+\widehat{\beta}_1 x_i \\ &=& \widehat{\beta}_0+\widehat{\beta}_1 x_i-\widehat{\beta}_1\overline{x}+\widehat{\beta}_1\overline{x} \\ &=& \left(\widehat{\beta}_0+\widehat{\beta}_1 \overline{x}\right) +\widehat{\beta}_1\left(x_i-\overline{x}\right)\\ &=& \overline{y}+ \widehat{\beta}_1\left(x_i-\overline{x}\right) \end{eqnarray*}\]

Let us revisit Example 2.4.
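A sketch of the regression with educ centered at its sample average:

```r
# I() lets us center educ inside the formula
lm(wage ~ I(educ - mean(educ)), data = wage1)
```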

How does centering affect the regression coefficients? The intercept now has a nicer interpretation, while the slope is still the same number as before.

1.6.2.3 Special cases

What happens when your OLS regression line is assumed to be flat, that is, assumed not to depend on \(x\)?

Tip: Exercises
  1. Work on Chapter 2 Problem 12. After working on this problem, obtain the expressions for all the objects produced using least squares.
  2. Try using wage1sub and run a regression of wage on a constant only using lm(wage ~ 1, data = wage1sub), then compare the result with the sample average of the wages.

What happens when your OLS regression line does not have an intercept? This special case is not as useful, but it can serve as practice.

Tip: Exercises
  1. Read Section 2-6 and stop when you reach the end of page 50. Your task is to derive (2.66).
  2. Try running a regression of wage on educ using wage1sub without a constant by executing lm(wage ~ educ - 1, data = wage1sub). Compare the findings with the same regression but with an intercept.

What happens when \(x\) is a binary or dummy variable? This is a very important and very useful special case.

Tip: Exercise

Read Section 2-7 and stop when you reach the end of page 51. Next, start from page 52 “The mechanics of OLS …” and stop when you reach “… two groups.” Familiarize yourself with the meaning of a binary or dummy variable. Pay attention to the meaning of the regression coefficients in (2.73) and (2.74).

  1. Can you identify the dummy variables in the wage1 data we worked on for Table 1.1?
  2. Using wage1, run a regression of wage on female.
  3. Calculate the average wages of males and the average wages of females. How are these quantities related to the regression coefficients?
  4. Work on Chapter 2 Problems 13 and 14.

1.6.2.4 Another way to make sense of \(\widehat{\beta}_1\)

We can write \(\widehat{\beta}_1\) in terms of the sample standard deviation of \(x\) (\(\widehat{\sigma}_x\)), the sample standard deviation of \(y\) (\(\widehat{\sigma}_y\)), and the sample correlation coefficient of \(x\) and \(y\) (\(\widehat{\rho}_{xy}\)). In particular, you can show starting from (2.19) that \[\widehat{\beta}_1=\widehat{\rho}_{xy}\times \frac{\widehat{\sigma}_y}{\widehat{\sigma}_x}.\]
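A sketch of the check for the wage-educ regression of Example 2.4:

```r
# Correlation-based expression for the slope ...
with(wage1, cor(educ, wage) * sd(wage) / sd(educ))

# ... compared with the slope reported by lm()
coef(lm(wage ~ educ, data = wage1))["educ"]
```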

Note

If you have never heard of the sample correlation coefficient, refer to Appendix C equation (C.15). This is the definition of the sample correlation coefficient.

You should wonder why we introduce the sample correlation coefficient if we already have the sample covariance. Figure out the units of measurement of the sample covariance. More importantly, you can write \[\widehat{\rho}_{xy}=\frac{1}{n-1}\sum_{i=1}^n \left(\frac{x_i-\overline{x}}{\widehat{\sigma}_x}\right)\left(\frac{y_i-\overline{y}}{\widehat{\sigma}_y}\right)\]

Can you interpret this expression and figure out why the sample correlation coefficient has to be introduced?

As a result, there are only five numbers we need to calculate the OLS regression line – the sample averages of each variable, the sample standard deviations of each variable, and the sample correlation coefficient between the two variables. In a way, we compress the data (no matter how large it is) into these five numbers when we calculate the OLS regression line.

Finally, we can connect these to an earlier interpretation of \(\widehat{\beta}_1\) by writing

\[\begin{eqnarray*}\widehat{y}_i &=& \overline{y}+ \widehat{\beta}_1\left(x_i-\overline{x}\right) \\ &=& \overline{y} + \widehat{\rho}_{xy}\times \widehat{\sigma}_y \times \left(\frac{x_i-\overline{x}}{\widehat{\sigma}_x}\right) \\ \Rightarrow \frac{\widehat{y}_i-\overline{y}}{\widehat{\sigma}_y}&=& \widehat{\rho}_{xy} \times \left(\frac{x_i-\overline{x}}{\widehat{\sigma}_x}\right) \\ \Rightarrow \frac{\widehat{y}_i-\overline{\widehat{y}}}{\widehat{\sigma}_y}&=& \widehat{\rho}_{xy} \times \left(\frac{x_i-\overline{x}}{\widehat{\sigma}_x}\right)\end{eqnarray*}.\] Therefore, we can interpret regression results in terms of \(z\)-scores or standardized variables.

1.6.2.5 Yet another way to make sense of \(\widehat{\beta}_1\)

Consider once again the OLS regression line \(\widehat{y}=\widehat{\beta}_0+\widehat{\beta}_1 x\). Recall the expression for \(\widehat{\beta}_1\) in (2.19).

That expression can be interpreted in another way.

  1. Run a regression of \(y\) on a constant only. Obtain the residuals.
  2. Run a regression of \(x\) on a constant only. Obtain the residuals.
  3. Run a regression of the residuals from the first step on the residuals from the second step, but without an intercept. The resulting slope is \(\widehat{\beta}_1\).

This finding will be revisited in a later section, but it sets the stage for the Frisch-Waugh-Lovell Theorem or FWL. At the moment, this might feel useless, but that is a surface-level feeling only.

Tip: Exercise
  1. Implement the steps described earlier for Example 2.4. What do you observe?
  2. Show that the steps apply more generally by writing down an argument as to why \(\widehat{\beta}_1\) could be interpreted as described earlier.

1.6.3 Goodness-of-fit

The identity \(\mathsf{SST}=\mathsf{SSE}+\mathsf{SSR}\) provides a way to motivate a measure of goodness-of-fit. It has parallels to \(y_i=\widehat{y}_i+\widehat{u}_i\). Both of these are decompositions in the sense that we are summing up uncorrelated parts. Intuitively, there is no “overlap” in the constituent parts.

Therefore, it makes sense to look at the proportion of \(\mathsf{SST}\) attributed to \(\mathsf{SSE}\) and \(\mathsf{SSR}\). We now have the R-squared of the regression or coefficient of determination defined in (2.38) as \[R^2=\frac{\mathsf{SSE}}{\mathsf{SST}}=1-\frac{\mathsf{SSR}}{\mathsf{SST}}.\]

What meaning does R-squared convey? \(\mathsf{SSE}\) is \((n-1)\) multiplied by the sample variance of the fitted values, while \(\mathsf{SST}\) is \((n-1)\) multiplied by the sample variance of the actual \(y\) values. In addition, \[\mathsf{SSE}=\sum_{i=1}^n\left(\widehat{y}_i-\overline{\widehat{y}}\right)^2=\widehat{\beta}_1^2\sum_{i=1}^n\left(x_i-\overline{x}\right)^2.\] Therefore, R-squared is the proportion of the sample variation in \(y\) that is attributable to \(x\).

We can obtain this measure directly from the output of lm().
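A sketch, reusing the fit object from the earlier cells:

```r
# R-squared as reported by lm()
summary(fit)$r.squared
```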

Tip: Exercises / more practice
  1. Reproduce the findings in Example 2.3. Check the interpretations of the regression coefficients. Ignore the part which talks about the population regression function.
  2. Reproduce the findings in Example 2.5. Check the interpretations of the regression coefficients.
  3. Work on Examples 2.8 and 2.9.
  4. Work on Chapter 2 Problems 3, 4 (except (ii)), and 5.
  5. Work on Chapter 2 Computer Exercises C1, C3, C4, and C10.

1.6.4 Units of measurement

You have encountered a natural logarithmic transformation and a squared transformation earlier. These are nonlinear transformations. We will explore these transformations later.

In contrast, we will be looking at the effect of applying linear transformations to the variables. Specifically, we will explore how changes in the units of measurement affect regression outputs.

Note

Read Section 2-4a entirely.

Tip: Exercises
  1. Reproduce the findings in (2.40) and (2.41). Compare with the findings in Example 2.3.
  2. What happens to the R-squared relative to Example 2.3?
  3. What will be the regression coefficients if we run a regression of salardol on roedec? Do this without actually running the regression. Verify using R afterwards.
  4. Work on Chapter 2 Problem 9 (i) and (ii) and revisit your previous findings in this exercise.

You might wonder why this discussion on units of measurement is useful. One major reason is that we need to know the consequences of changing units of measurement. We don’t want changes in the units of measurement to fundamentally change our interpretation of results. Another major reason is to improve the presentation and communication of results. A discussion of this aspect can be found in Section 2.2.1.1 of Pua (2024).

1.7 Extensions to multiple linear regression

1.7.1 What does the extension look like?

Note

Read Sections 3-2a, 3-2c, 3-2e, 3-2h entirely.

So far, we have focused on simple linear regression, but have dipped our toes into multiple linear regression earlier. Let \(x_1, \ldots, x_k\) represent the regressors. In simple linear regression, we only have \(x=x_1\).

  1. You already saw that the lm() syntax does not change so much.
  2. Instead of an OLS regression line \(\widehat{y}=\widehat{\beta}_0+\widehat{\beta}_1x\), we now have a regression plane like \(\widehat{y}=\widehat{\beta}_0+\widehat{\beta}_1x_1+\widehat{\beta}_2 x_2\) in (3.9) or more generally a regression surface \(\widehat{y}=\widehat{\beta}_0+\widehat{\beta}_1x_1+\widehat{\beta}_2 x_2+\cdots+\widehat{\beta}_k x_k\) in (3.11).
  3. The minimization problem in Appendix 2A is now (3.12).
  4. Instead of a system of two equations in two unknowns (2.14) and (2.15), we now have a system of \(k+1\) equations in \(k+1\) unknowns8 as in (3.13).
  5. The fitted values and residuals would have to be adjusted in a natural way. Compare (3.20) and (3.21) with (2.20) and (2.21).
  6. The \(\mathsf{SST}\), \(\mathsf{SSE}\), and \(\mathsf{SSR}\) are still defined in the same way. Compare (3.24) to (3.26) with (2.33) to (2.35). The definition of R-squared is still the same.
Tip: Exercise

How will the algebraic or numerical properties of the objects from least squares discussed in Section 1.6.1 change in the case of multiple linear regression?

Finally, recall that in Section 1.5.4 we were able to write (2.14) and (2.15) in matrix form. You could do the same for the system of equations in (3.13).

Tip: Exercise
  1. What would be \(\mathbf{A}\), \(\widehat{\boldsymbol\beta}\), and \(\mathbf{b}\) in the case of multiple regression?
  2. How would you extend \(\mathbf{X}\) and \(\mathbf{y}\) to the multiple regression case?
  3. What would be the condition for the existence of a unique solution?

1.7.2 Interpreting regression coefficients

1.7.2.1 Generic interpretation

Note

Read Sections 3-2b, 3-2c, and 3-2d entirely. You might encounter a regressand that has been logarithmically transformed. We have not worked on how to interpret this situation yet.

The OLS regression surface \(\widehat{y}=\widehat{\beta}_0+\widehat{\beta}_1 x_1+\cdots+\widehat{\beta}_k x_k\) already provides us a way to interpret the regression coefficients. Again, think of \(\widehat{y}\) as a function of \(x_1,\ldots,x_k\). To make this explicit, let \(\widehat{y}\left(x_1,\ldots,x_k\right)=\widehat{\beta}_0+\widehat{\beta}_1x_1+\cdots+\widehat{\beta}_k x_k\).

When \(x_1=x_2=\cdots=0\), we have \(\widehat{y}(0,0,\ldots,0)=\widehat{\beta}_0\). Therefore, \(\widehat{\beta}_0\) is the fitted value of \(y\) when \(x_1=x_2=\cdots=0\). Let \(j\) be some index from \(1,\ldots, k\). Let \(x_{j,\mathsf{old}}\) be the value of \(j\)th regressor at an old value and \(x_{j,\mathsf{new}}\) be the value of \(j\)th regressor at a new value.

When \(x_j=x_{j,\mathsf{old}}\) for all \(j\), we have \[\widehat{y}(x_{1,\mathsf{old}}, \ldots, x_{k,\mathsf{old}})=\widehat{\beta}_0+\widehat{\beta}_1 x_{1,\mathsf{old}}+\cdots+\widehat{\beta}_k x_{k,\mathsf{old}}.\] Similarly, when \(x_j=x_{j,\mathsf{new}}\) for all \(j\), we have \[\widehat{y}(x_{1,\mathsf{new}}, \ldots, x_{k,\mathsf{new}})=\widehat{\beta}_0+\widehat{\beta}_1 x_{1,\mathsf{new}}+\cdots+\widehat{\beta}_k x_{k,\mathsf{new}}.\] Therefore, \[\begin{eqnarray*}&&\widehat{y}(x_{1,\mathsf{new}}, \ldots, x_{j,\mathsf{new}} ,\ldots, x_{k,\mathsf{new}})-\widehat{y}(x_{1,\mathsf{old}}, \ldots, x_{j,\mathsf{old}}, \ldots, x_{k,\mathsf{old}}) \\ &=& \widehat{\beta}_1 \underbrace{(x_{1,\mathsf{new}}-x_{1,\mathsf{old}})}_{\Delta x_1}+\cdots+\widehat{\beta}_j\underbrace{(x_{j,\mathsf{new}}-x_{j,\mathsf{old}})}_{\Delta x_j}+\cdots+ \widehat{\beta}_k\underbrace{(x_{k,\mathsf{new}}-x_{k,\mathsf{old}})}_{\Delta x_k}.\end{eqnarray*}\]

As a result, the difference in the fitted value can be traced to how each regressor changes, i.e. \[\Delta \widehat{y}=\widehat{\beta}_1\Delta x_1+\cdots+\widehat{\beta}_j\Delta x_j+\cdots+\widehat{\beta}_k\Delta x_k.\] Compare this with (3.17). If you want to interpret \(\widehat{\beta}_1\) alone, then we have to choose old and new values so that \(\Delta x_2=\cdots=\Delta x_k=0\). As a result, we have \(\Delta\widehat{y}=\widehat{\beta}_1\Delta x_1\).

The textbook refers to \(\Delta x_2=\cdots=\Delta x_k=0\) as controlling for the variables \(x_2,\ldots,x_k\) or holding regressors other than \(x_1\) constant. The textbook also refers to regression coefficients having partial effect or ceteris paribus interpretations. But this language is not very accurate and can be misleading, especially for people not “in the know”. Refer to the fine print in Section 3-2c.

The bottom line is that each regression coefficient is really a comparison of the fitted values for different “groups” with a one-unit difference in only one of their \(x\)’s. When writing or communicating an interpretation, it is important to phrase this correctly so that it will not be misleading.

Tip: Exercises
  1. Work on Examples 3.1 and 3.4. Reproduce the results. Check the interpretations and improve on them if necessary.
  2. Work on Going Further 3.2 on page 74.
  3. Work on Example 3.5. For now, ignore the part about the linear model.
  4. Work on Chapter 3 Problems 1, 2, and 3.
  5. Work on Chapter 3 Computer Exercises C5, C6, and C11 (except (iii), (iv), (v)).
  6. Consider a regression surface \(\widehat{y}=\widehat{\beta}_0+\widehat{\beta}_1 x+\widehat{\beta}_2 x^2\). Suppose we are comparing the fitted values at \(x_{\mathsf{new}}\) and \(x_{\mathsf{old}}\). Can you interpret \(\widehat{\beta}_1\) and \(\widehat{\beta}_2\) separately? Discuss.

1.7.2.2 Frisch-Waugh-Lovell

Note

Read Sections 3-2f and 3-2g entirely, along with Appendix 3A.2.

Every regression coefficient in an OLS regression surface can be interpreted as a regression coefficient of a simple linear regression. We cannot graph or imagine high-dimensional surfaces, but this insight is important because it lets us visualize or interpret a multiple linear regression as a collection of simple linear regressions.

You already have an inkling of a related idea from Section 1.6.2.5. The partial effect interpretation of \(\widehat{\beta}_1\) (for example) involves removing the “effect” of regressors other than \(x_1\) from \(x_1\). The partialling-out involves subtracting fitted values, so it is really difficult to call it “controlling for other regressors”.
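A sketch of this partialling-out for the regression of Example 3.2; the two reported numbers should coincide:

```r
# Residuals of educ after removing the "effect" of exper and tenure
r1 <- residuals(lm(educ ~ exper + tenure, data = wage1))

# Slope from regressing log(wage) on those residuals ...
coef(lm(log(wage1$wage) ~ r1))["r1"]

# ... equals the educ coefficient from the multiple regression
coef(lm(log(wage) ~ educ + exper + tenure, data = wage1))["educ"]
```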

Tip: Exercises

Focus on Appendix 3A.2.

  1. If \(\widehat{x}_{i1}\) is the \(i\)th fitted value from a regression of \(x_1\) on \(x_2, x_3, \ldots, x_k\) and \(\widehat{r}_{i1}\) is the corresponding residual, why can we write \(x_{i1}=\widehat{x}_{i1}+\widehat{r}_{i1}\)?
  2. Derive the expressions found in (3.76), (3.77), (3.78). Make sure to justify why every step is correct.

In addition, there is a relationship (3.23) between the regression coefficients of a multiple regression of \(y\) on \(x_1\) and \(x_2\) and the regression coefficients of a simple regression of \(y\) on \(x_1\). The multiple regression with a longer list of regressors is called a long regression and the simple regression with a shorter list of regressors is called a short regression. Of course, there has to be at least one regressor common to both regressions.

This means that there is no reason to prefer one regression over another at this stage. If we choose one regression over another, then that is additional information beyond what is provided by least squares. Section 1.4 of Pua (2024) has a discussion in a slightly easier setting where we are contrasting a regression of \(y\) on a constant (short regression) and a regression of \(y\) on a constant and a dummy variable (long regression).

Tip: Exercise
  1. Read Appendix 3A.4 starting from page 115 “To show equation (3.79), …” until you reach “This is the relationship we wanted to show.” Make sure to work out the details of the argument and how FWL was used.
  2. Work on Example 3.3. Reproduce the findings and verify the discussion in that example.

1.7.3 Interactions

The product of two or more regressors is sometimes called an interaction term or an interaction. For this chapter, we will cover interactions between two dummy variables only. The interaction term in this case is also a dummy variable. There are two points to be made:

  1. Why would interaction terms be useful?
  2. This is a useful case where the ceteris paribus interpretation of a regression coefficient should not be pushed too far.

Recall our wage1sub dataset. Two of its variables are binary in nature: female and married. What happens if you run a regression of wage on these two dummy variables and an interaction term?
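A sketch of such a regression; the full wage1 data are used here, but the wage1sub subset from the earlier cells works the same way:

```r
# Regression of wage on the two dummies and their interaction
lm(wage ~ female + married + female:married, data = wage1)
```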

Observe that a regression of wage on female, married, and their interaction reproduces the averages for different subgroups produced by the two binary variables. There are four subgroups, namely, single males, single females, married males, married females.
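A sketch of the subgroup averages to compare against the regression coefficients:

```r
# Average wage within each of the four female-by-married cells
aggregate(wage ~ female + married, data = wage1, FUN = mean)
```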

Tip: Exercise
  1. Similar to the discussion on binary variables in Section 1.6.2.3, the intercept represents a sample average of a subgroup. Which is it?
  2. How about the slopes? They should represent differences in the sample averages of different subgroups. Can you figure these out?

1.8 What’s next?

We have seen how linear regression works and how to interpret the results of a linear regression from a descriptive point of view. You have also been exposed to using R to calculate regressions on cleaned datasets.

While reading the textbook, you might have occasionally seen the phrase “regression model”. We will make sense of what this means in the next chapter. Afterwards, we will connect what we did in this chapter to regression models.

References

Athey, Susan, Raj Chetty, and Guido Imbens. 2025. “Using Experiments to Correct for Selection in Observational Studies.” https://arxiv.org/abs/2006.09676.
Castillo-Freeman, Alida, and Richard B. Freeman. 1992. “When the Minimum Wage Really Bites: The Effect of the U.S.-Level Minimum on Puerto Rico.” In Immigration and the Work Force: Economic Consequences for the United States and Source Areas, 177–212. University of Chicago Press. http://www.nber.org/chapters/c6909.
Mann, Charlotte Z., Adam C. Sales, and Johann A. Gagnon-Bartsch. 2025. Journal of Causal Inference 13 (1): 20220081. https://doi.org/10.1515/jci-2022-0081.
Pua, Andrew Adrian Yu. 2024. Econometrics. https://ecometr.neocities.org.
Rosenman, Evan T. R. 2025. “Methods for Combining Observational and Experimental Causal Estimates: A Review.” WIREs Computational Statistics 17 (2): e70027. https://doi.org/10.1002/wics.70027.
Rosenman, Evan T. R., Guillaume Basse, Art B. Owen, and Mike Baiocchi. 2023. “Combining Observational and Experimental Datasets Using Shrinkage Estimators.” Biometrics 79 (4): 2961–73. https://doi.org/10.1111/biom.13827.
Shea, Justin M. 2024. Wooldridge: 115 Data Sets from "Introductory Econometrics: A Modern Approach, 7e" by Jeffrey M. Wooldridge. https://doi.org/10.32614/CRAN.package.wooldridge.
Wooldridge, Jeffrey M. 2020. Introductory Econometrics: A Modern Approach. 7th edition. Cengage Learning.

  1. The fine print: within the confines of the trimester, of course.↩︎

  2. You can install any R package in the same manner. Simply replace wooldridge with the name of R package you want to install.↩︎

  3. You can load other installed packages in a similar manner.↩︎

  4. These interactive code cells require an active connection. Without it, you may have to run the commands in R or RStudio.↩︎

  5. These terms are more neutral than dependent variable and independent variable, respectively. Other terms for the regressand include explained variable, response variable, or predicted variable. Other terms for the regressor include explanatory variable, control variable, predictor variable, or covariate.↩︎

  6. In economics and in R, log() is the function to compute natural logarithms. In mathematics, we use \(\ln(\cdot)\). This may be confusing for first-timers.↩︎

  7. Lagged, not logged.↩︎

  8. Ask yourself how many regression coefficients there are.↩︎