2 Econometric models
In this chapter, you are going to read most of the remaining portions of Chapters 2 and 3, and even more, from Wooldridge (2020). What sets Chapter 2 apart from Chapter 1 is that we are going to fill in the details of the linear regression model.
Observe that we are using the technical phrase “linear regression model” rather than just linear regression. This is an important distinction that tends to get lost in textbooks. We will be focusing on what makes up the linear regression model and how linear regression, as discussed in Chapter 1, connects with that model. In addition, you will be exposed to a menagerie of econometric models beyond the linear regression model.
2.1 Motivation
What do the regression coefficients represent? We have interpreted these coefficients, especially the slopes, in terms of comparisons of fitted values. Nothing in Chapter 1 suggests that we have actually calculated any effect of changes of \(x\) on \(y\). To do this, we need a compelling argument, both mathematical and economic, that lets us translate these regression coefficients beyond what you have seen in Chapter 1.
We have seen in Chapter 1 the descriptive and mechanical aspects of linear regression. Descriptive here means that we are interested in summarizing aspects of the data as they are. We will now be moving on to predictive and causal aspects of linear regression, and beyond.
Read Section 1-2 entirely.
In your reading, you will encounter the meaning of empirical analysis. The biggest question is how linear regression as seen in Chapter 1 can be used in empirical analysis. Depending on the purpose of an empirical analysis, an economic model may or may not be required.
Examples 1.1 and 1.2 show that we usually do not completely know the underlying functional forms in an economic model. For example, you might have linear demand and linear supply equations, but you do not know the coefficients exactly. Another example is when you solve the utility maximization problem subject to a budget constraint. The theory only tells us that, under certain restrictions on economic behavior and the economic environment, the demand for a good should be decreasing in its own price. But there are many functions which could fit this theoretical prediction.
In order to connect the data to the economic model, you have to provide a translation of an economic model to an econometric model. The translation requires us to think about:
- the mapping of observed and unobserved variables to the economic model
- the functional form
- the details of how the variables were obtained and measured
The attention to these three steps sets econometrics apart from statistics, economics, and mathematics. In fact, you will eventually come to the conclusion that statistics, economics, and mathematics are all needed to really use and apply econometrics.
You see the translation in action when (1.1) somehow becomes (1.3). You will notice the sudden appearance of the error term \(u\). The primary motivation is that it represents unobserved factors. Notice that the textbook states “In fact, dealing with this error term or disturbance term is perhaps the most important component of an econometric analysis.”
But what is not very clear, even for me when I was taking econometrics for the first time as a student, is how (1.3) was obtained in the first place or where (1.3) originated. Where does it come from? Can we just add an error term \(u\) and be done with it?
Read Section 2-1 entirely. It is useful to get a preview of what the simple linear regression model looks like, even though you are not expected to understand every detail yet. Pay attention to what you could and could not understand.
Observe that the error term \(u\) has a different notation from the residual \(\widehat{u}\) in Chapter 1. Similarly notice the distinction between \(\beta_0, \beta_1\) and \(\widehat{\beta}_0, \widehat{\beta}_1\). Pay attention to these seemingly minor details.
2.2 Relationships
You will be presented with two major examples where you will be making sense of the word “relationships” using actual data and artificial (or simulated) data.
- How does labor force participation change with educational attainment? When answering this question, we are looking at how labor force participation and educational attainment are “related” to each other.
- How would you decide whether a sequence of coin flips is real or fake? When answering this question, we are looking at patterns measured from sequences and how these patterns “relate” to each other.
In the process of going through these two examples, you will be learning how to create and analyze a custom dataset which will be “cleaned” afterwards and how to construct a Monte Carlo simulation. But don’t lose sight of the main goals of this section.
2.2.1 Labor force participation
This portion is inspired by Chapter 4 of Hendry and Nielsen (2007).
2.2.1.1 Constructing a custom dataset by shopping
You will be collecting data from IPUMS USA, specifically data from the American Community Survey (ACS). The linked website has a lot of supporting documentation and you are encouraged to look into the details.
If you already have an account, you can skip this note. If not, continue reading.
Create an account at IPUMS. Click on the link to IPUMS earlier to learn more about this data repository. To create an account, click on Register at the top part of the page. After that, click on “Apply for Access”. Fill in the Required details only. For Occupation Category, select Undergraduate Student. For Specific Occupation Title, put in Undergraduate Student.
You will need access to IPUMS later and you will be working on some data analysis tasks with IPUMS soon. Don’t delay the creation of the account as it may take time for the access to be approved.
Once approved, you can now “shop” for your data!
- Go to the website of IPUMS USA. Create your custom data set.
- Click on SELECT SAMPLES and make sure that the only box that is checked is ACS 2024. Click on SUBMIT SAMPLE SELECTIONS.
- Now, you will be selecting variables. You have the option to choose from variables measured at the household level or at the person level. You can also search directly. Given the question posed, how would you conduct the search?
- The bottom line is that we will use the variables named EMPSTAT and EDUC (why?). Select them by checking the boxes beside these variable names so that they will be “added to cart”.
- Once finished, click on VIEW CART. You will notice that there are more variables included than those you have selected. Explore these variables.
- Afterwards, click on CREATE DATA EXTRACT.
Congratulations, you have successfully created a data extract from IPUMS! But this is a very large dataset. We have the option to reduce it in size for our purposes.
Under OPTIONS, click on SELECT CASES to reduce the size of the dataset. Check the boxes for EDUC, EMPSTAT, and GQ. Select “Include only those persons meeting case selection criteria”. For each variable listed, you can choose observations which obey pre-specified criteria. So that we are working on the same dataset, choose only those observations which fulfill the following criteria:
- Households under the 1970 definition
- Employment status is employed, not employed, or not in the labor force
- No missing information about educational attainment
Afterwards, click on SUBMIT. Once you are back in your cart, put in a description of your extract and then click on SUBMIT EXTRACT. Some time will have to pass before you actually get the download links.
- Download the dataset. Expect a file with the extension .dat.gz. This is a compressed file similar to zipped files. You have to decompress this file and it will produce a .dat file.
- Open the command files. You have many options and you should choose the R command file. Feel free to explore the others at your own leisure. These commands should be at the beginning of your R script to analyze the IPUMS data extract.
- View the codebook. Open both the Basic and DDI links to see what they contain. These are the different formats of presenting documentation related to the data extract.
- Download the codebook. You must right-click DDI and save link as (or save target as) an XML document. If your browser does not have the option to save an XML document, adjust the type of document to All Files. Make sure the file extension is .xml.
To look for data you have downloaded before, click on MY DATA. Requested data files will stay in the system for a maximum of 72 hours.
It is your turn to shop for a dataset. Modify your data requests so that you can answer the question “How does female labor force participation change with educational attainment?”
Keep the resulting dataset for future exercises.
2.2.1.2 Analysis
Now that you have the dataset, let us proceed with an analysis which could answer the question “How does labor force participation change with educational attainment?”
The IPUMS dataset is loaded differently from what you saw in Chapter 1; we follow the instructions at the IPUMS website.
Here you have to copy and paste code and execute the commands in R/RStudio. Don’t use the browser, and make sure that you are not working on other things, as the dataset has millions of observations!
Do not upload the dataset to the cloud, because the dataset is not yours and it can contain private information.
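Following the R command file supplied by IPUMS, the beginning of your script will look like this (change the name of the xml file to match your own extract):

```r
# Based on the R command file from IPUMS
library(ipumsr)
ddi <- read_ipums_ddi("usa_00017.xml")  # change the xml file name to your situation
data <- read_ipums_micro(ddi)
```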
We only have two variables and both of them are categorical in nature. Along with the codebook, explore what you think the commands below are producing.
```r
table(data$EDUC)
table(data$EMPSTAT)
```

Observe that you have the frequency distributions of each variable separately. In this sense, these tables are not yet in the direction of finding “relationships”.
Observe also that there are quite a lot of categories for each variable. For the sake of reducing the size of the problem a bit, and to illustrate how to redefine variables and process data for further use, we will reduce the number of categories.
We reduce the number of categories in EMPSTAT to two: in the labor force or not in the labor force. We also reduce the number of categories in EDUC to three: elementary or no schooling, high school, or college. Pay attention as to how these variables are recoded.
```r
# EDUCN: 1 = elementary or no schooling, 2 = high school, 3 = college
data$EDUCN <- with(data, 1*(EDUC <= 2) +
  2*(EDUC == 3 | EDUC == 4 | EDUC == 5 | EDUC == 6) +
  3*(EDUC >= 7 & EDUC <= 11))
# EMPSTATN: 1 = in the labor force (employed or unemployed), 0 = not in the labor force
data$EMPSTATN <- with(data, 1*(EMPSTAT <= 2) + 0*(EMPSTAT == 3))
```

In order to explore the “relationship” between labor force participation and educational attainment, we can summarize the information by looking at a joint frequency distribution. In the context of categorical variables, we can construct a cross-tabulation or a crosstab.
```r
xtab.EMP.ED <- with(data, table(EMPSTATN, EDUCN))
xtab.EMP.ED
```

We can add margins to the crosstab.

```r
with(data, addmargins(xtab.EMP.ED))
```

Why are they called margins and what information do they have?
We can also express everything in terms of relative frequencies:
```r
ptab.EMP.ED <- with(data, prop.table(xtab.EMP.ED))
ptab.EMP.ED
```

Let \(X\) be EDUCN and \(Y\) be EMPSTATN. \(Y\) can take values 0 or 1, while \(X\) can take values 1, 2, or 3. The distinct values that \(X\) and \(Y\) can take can be written as \(x_1,\ldots, x_K\) and \(y_1,\ldots, y_J\).
In Chapter 1, we used lowercase letters \(x\) and \(y\) to represent variables. In this chapter, we are moving on from this notation. We are using uppercase letters for the variables and the lowercase letters for the values these variables could take. Therefore, \(x_1\) no longer means the first observation of the variable \(x\). It will now represent one (not necessarily the first) possible value the variable \(X\) could take.
Wooldridge (2020) does not make this distinction too strongly, because you are supposed to be aware of this from the context.
Another important point of note moving forward: both \(X\) and \(Y\) here are categorical in nature. What happens when either one of these variables is continuous?
What are the values of \(J\) and \(K\) in our example?
Each cell of the crosstab is \(\widehat{f}\left(x,y\right)\), defined as the relative frequency of people who reported \(X=x\) and \(Y=y\). Note that \(x\) and \(y\) are dummy arguments. The joint relative frequencies satisfy: \[\sum_{j=1}^J\sum_{k=1}^K \widehat{f}\left(x_k, y_j\right)=\sum_{k=1}^K\sum_{j=1}^J \widehat{f}\left(x_k, y_j\right)=1.\]
Check that this is the case for our example.
We can also compute the marginal relative frequency distributions for both variables. Try to figure out the information in these frequency distributions.
```r
with(data, prop.table(margin.table(xtab.EMP.ED, margin = 1)))  # marginal for EMPSTATN
with(data, prop.table(margin.table(xtab.EMP.ED, margin = 2)))  # marginal for EDUCN
```

These marginal relative frequency distributions are defined as: \[\begin{eqnarray*} \widehat{f}_X\left(x\right) &=& \sum_{j=1}^J\widehat{f}\left(x, y_j\right) \\ \widehat{f}_Y\left(y\right) &=& \sum_{k=1}^K\widehat{f}\left(x_k, y\right) \end{eqnarray*}\]
Check that this is the case for our example.
Our question “How does labor force participation change with educational attainment?” would be hard to answer with the joint and marginal relative frequency distributions. Why? Because the question somehow requires us to consider what would be the labor force participation of workers who have a certain level of educational attainment. For example, we can calculate the labor force participation of workers who have a high school education. But we can do this for all levels of EDUCN. The result is an example of a conditional relative frequency distribution.
```r
ctab.EMP.ED <- with(data, prop.table(xtab.EMP.ED, margin = 2))
ctab.EMP.ED
```

These conditional relative frequencies are defined as: \[\widehat{f}_{Y|X=x} \left(y|x\right) = \frac{\widehat{f}\left(x,y\right)}{\widehat{f}_X\left(x\right)}.\] What do you notice about how these conditional relative frequencies vary with \(x\)? We can even plot a graph:
```r
plot(c(1,2,3), ctab.EMP.ED[2,])
```

- What do you notice about the pattern produced by the graph?
- Why did we not execute plot(c(1,2,3), ctab.EMP.ED[1,])?
We could also have defined conditional relative frequencies of educational attainment given labor force participation, specifically: \[\widehat{f}_{X|Y=y} \left(x|y\right) = \frac{\widehat{f}\left(x,y\right)}{\widehat{f}_Y\left(y\right)}.\]
But given our question for this example and the definitions of \(X\) and \(Y\), it makes more sense to consider \(\widehat{f}_{Y|X=x} \left(y|x\right)\) than \(\widehat{f}_{X|Y=y} \left(x|y\right)\).
These conditional relative frequencies have the following properties, beyond the definition:
- The sum of the conditional relative frequencies for a fixed category is equal to 1, i.e. \[\sum_{j=1}^J \widehat{f}_{Y|X=x} \left(y_j|x\right) = 1.\]
- With weights equal to the marginal relative frequencies of the conditioning variable, the weighted average of the conditional relative frequencies is equal to the marginal relative frequency, i.e. \[\sum_{k=1}^K \widehat{f}_{Y|X=x_k} \left(y|x_k\right) \widehat{f}_X\left(x_k\right)=\widehat{f}_Y\left(y\right).\]
Check that these properties hold for our example.
We can calculate summaries of these relative frequency distributions:
- Sample mean: \[\widehat{\mathbb{E}}\left(Y\right)=\sum_{j=1}^J y_j\widehat{f}_Y\left(y_j\right)\]
- Sample variance: \[\widehat{\mathsf{Var}}\left(Y\right)=\sum_{j=1}^J \left(y_j-\widehat{\mathbb{E}}\left(Y\right)\right)^2\widehat{f}_Y\left(y_j\right)\]
- Sample covariance: \[\widehat{\mathsf{Cov}}\left(X,Y\right)=\sum_{j=1}^J\sum_{k=1}^K \left(x_k-\widehat{\mathbb{E}}\left(X\right)\right) \left(y_j-\widehat{\mathbb{E}}\left(Y\right)\right)\widehat{f}\left(x_k,y_j\right) \]
- Sample conditional mean: \[\widehat{\mathbb{E}}\left(Y|X=x_k\right)=\sum_{j=1}^J y_j\widehat{f}_{Y|X=x_k}\left(y_j|x_k\right)\]
- Sample conditional variance: \[\widehat{\mathsf{Var}}\left(Y|X=x_k\right)=\sum_{j=1}^J \left(y_j-\widehat{\mathbb{E}}\left(Y|X=x_k\right)\right)^2\widehat{f}_{Y|X=x_k}\left(y_j|x_k\right)\]
The summaries you have seen, like the sample mean, do not exactly look like what you are used to. Show that \(\widehat{\mathbb{E}}\left(Y\right)\) is the same as the \(\overline{Y}\) you might have encountered in statistics before.
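As a quick numerical check (a sketch, assuming the data object and the recoded variables from above are in memory), the frequency-weighted sum reproduces the ordinary sample mean:

```r
ptab.Y <- prop.table(table(data$EMPSTATN))  # marginal relative frequencies of Y
sum(as.numeric(names(ptab.Y)) * ptab.Y)     # E-hat(Y) as a frequency-weighted sum
mean(data$EMPSTATN)                         # the familiar sample mean Y-bar
```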
In terms of R implementation, we can compute the conditional means and variances of participation given educational attainment as follows:
```r
with(data, tapply(EMPSTATN, EDUCN, mean))
with(data, tapply(EMPSTATN, EDUCN, var))
```

In addition, there are connections between the conditional and unconditional (or marginal) summaries. First, observe that \[\begin{eqnarray*}\widehat{\mathbb{E}}\left(Y\right) &=& \sum_{j=1}^J y_j \widehat{f}_Y\left(y_j\right)\\ &=& \sum_{j=1}^J\sum_{k=1}^K y_j\widehat{f}\left(x_k,y_j\right) \\ &=& \sum_{j=1}^J\sum_{k=1}^K y_j\widehat{f}_{Y|X=x_k}\left(y_j|x_k\right)\widehat{f}_X\left(x_k\right) \\ &=& \sum_{k=1}^K\sum_{j=1}^J y_j\widehat{f}_{Y|X=x_k}\left(y_j|x_k\right)\widehat{f}_X\left(x_k\right) \\ &=& \sum_{k=1}^K \widehat{\mathbb{E}}\left(Y|X=x_k\right)\widehat{f}_X\left(x_k\right) \\ &=& \widehat{\mathbb{E}}\left[\widehat{\mathbb{E}}\left(Y|X\right)\right] \end{eqnarray*}\]
As a result, the sample (unconditional or marginal) mean is the sample average of the sample conditional means. Second, it can be shown that \[\widehat{\mathsf{Var}}\left(Y\right) = \widehat{\mathbb{E}}\left(\widehat{\mathsf{Var}}\left(Y|X\right)\right)+ \widehat{\mathsf{Var}}\left(\widehat{\mathbb{E}}\left(Y|X\right)\right).\] This equality links the sample variance to the sum of the sample average of the sample conditional variances and the sample variance of the sample conditional means.
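Here is one way to check this variance decomposition numerically (a sketch; note that var() uses an \(n-1\) denominator while the relative-frequency formulas above use \(n\), so the two sides agree only up to this degrees-of-freedom correction, which is negligible at this sample size):

```r
cond.mean <- with(data, tapply(EMPSTATN, EDUCN, mean))  # sample conditional means
cond.var  <- with(data, tapply(EMPSTATN, EDUCN, var))   # sample conditional variances
f.X       <- with(data, prop.table(table(EDUCN)))       # marginal rel. freq. of EDUCN
grand     <- sum(cond.mean * f.X)                       # E-hat[E-hat(Y|X)]
sum(cond.var * f.X) + sum((cond.mean - grand)^2 * f.X)  # right-hand side
var(data$EMPSTATN)                                      # left-hand side
```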
There are other ways to present information contained in these conditional relative frequencies. We can consider the odds for labor force participation at different values of EDUCN, e.g., for people who attained a high school level, the odds for participation are \[\frac{\widehat{f}_{Y|X=2}\left(1|2\right)}{\widehat{f}_{Y|X=2}\left(0|2\right)}.\] There are other odds you can report.
- Can you communicate what information the odds contain?
- We can also plot the odds as a function of the level of EDUCN. Similarly, we can also plot the logarithm of the odds as a function of the level of EDUCN. What information do these plots contain?
- Study the lines of R code below and relate them to this exercise:

```r
# Odds for participation
ctab.EMP.ED[2,]/ctab.EMP.ED[1,]
# Plotting the odds
plot(c(1,2,3), ctab.EMP.ED[2,]/ctab.EMP.ED[1,])
# Plotting the log odds
plot(c(1,2,3), log(ctab.EMP.ED[2,]/ctab.EMP.ED[1,]))
```

2.2.1.3 Population-level concepts
So far, what we have illustrated are some calculations based on the data that we have. Another way to say this is that we used a sample to calculate a variety of relative frequency distributions and summaries related to these distributions.
From your previous statistics course, you must have discussed that relative frequencies tend to stabilize to probabilities when the sample size is large and under some conditions. As a result, it makes sense to think about the corresponding population counterpart of these relative frequencies.
Another way to motivate the ensuing discussion is to imagine participating in the labor force as a coin toss. What we get to observe in the data are like outcomes of coin tosses. As a result, the language of random variables, specifically a binary or Bernoulli random variable, becomes very useful. A coin toss could either be heads or tails. We use the language of probability to represent these uncertain outcomes.
What is different this time is that the outcomes of a coin toss will vary depending on what value \(X\) will take. In effect, you allow for a coin with different probabilities of obtaining heads or tails depending on the level of educational attainment. Admittedly, what is described here is fictional, but then again we are on the path towards constructing a model.
Focus on Appendix B. Wooldridge (2020) uses the phrase “density” even in the discrete case. To avoid confusion, cross that off and replace it with “mass”, especially if working with the discrete case. Leave “density” alone for the continuous case.
Your task is to see the parallels and to make sure you can distinguish the sample and population concepts. For the moment, skip the parts where independence is being discussed.
Read the first part of Section B-1, then move on to Section B-1a.
Read Section B-2a up to the end of page 688. Specifically pay attention to joint distribution, joint probability mass function, and marginal probability mass function.
Read Section B-2b on page 690. Specifically pay attention to conditional distribution and conditional probability mass function.
Read Section B-3a. Skip the parts which involve continuous distributions.
Read Section B-3b up to Example B.5.
Read Section B-3d up to B-3g.
Read Sections B-4a and B-4b on pages 697-698 up to “… as we will see shortly”. Specifically pay attention to the population covariance. Continue with Section B-4c (B.29) for the definition of the population correlation coefficient.
Read Section B-4e on pages 700-702. Specifically pay attention to the conditional expectation and how it can be thought of as a function.
- When working with the conditional expectation, there are many senses of the idea.
- One is that \(\mathbb{E}\left(Y|X=x\right)\) is a number for a given \(x\).
- Another one is to think of \(x\) as a dummy argument. Provided that \(x\) is a feasible and possible value that \(X\) takes, then \(\mathbb{E}\left(Y|X=x\right)\) is a function of \(x\).
- Since \(x\) is just one of the values that \(X\) takes, we are uncertain as to what value \(\mathbb{E}\left(Y|X=x\right)\) will take. We can write this situation as \(\mathbb{E}\left(Y|X\right)\) being a random variable with its own distribution, which will have its own features and summaries.
Read Section B-4g on page 704. Specifically pay attention to conditional variance.
2.2.1.4 A first step towards an econometric model
Based on the analysis earlier of the dataset from IPUMS, we come to the conclusion that labor force participation and educational attainment are “related” in the following senses:
- The conditional relative frequency distribution of EMPSTATN given EDUCN varies with the value taken by EDUCN.
- The conditional relative frequency distribution of EMPSTATN given EDUCN is not the same as the marginal relative frequency distribution of EMPSTATN.
- The sample conditional mean of EMPSTATN given EDUCN varies with the value taken by EDUCN.
- The sample conditional mean of EMPSTATN given EDUCN is not the same as the sample mean of EMPSTATN.
- The sample conditional variance of EMPSTATN given EDUCN varies with the value taken by EDUCN.
- The sample conditional variance of EMPSTATN given EDUCN is not the same as the sample variance of EMPSTATN.
- EMPSTATN and EDUCN have nonzero sample covariance. As a consequence, they have a nonzero sample correlation coefficient.
These observations are a starting point for constructing a model for the relationship between labor force participation and educational attainment. A model can be thought of as a set of statements and assumptions about the joint distribution of random variables at the level of the population.
In contrast, Wooldridge (2020) in his glossary, explicitly defines an econometric model as “an equation relating the dependent variable to a set of explanatory variables and unobserved disturbances, where unknown population parameters determine the ceteris paribus effect of each explanatory variable”. To some extent, this is a good starting definition but what makes this definition unsettling is that it is not obvious where disturbances come from, why these are not observable, how parameters become relevant, and how these parameters represent ceteris paribus effects.
Because of our main question and our exploration of the data, we will be focusing on the conditional distribution of labor force participation given educational attainment. Below you will find different ways of setting up a model for this conditional distribution.
Notice that we are formulating a model for the conditional distribution, or if you wish, the conditional probability mass function, rather than the conditional relative frequency distribution. The latter was used as inspiration for the conditional distribution.
If you have not read the relevant parts of Appendix B, stop and read those first.
- One formulation is to treat \(f_{Y|X=x}\left(y|x\right)=\mathbb{P}\left(Y=y|X=x\right)\) as a general or unspecified function of \(x\). For our example, \(x\) could take on three feasible or possible values (or support points), since \(X\) could be 1, 2, or 3. As a result, we have \[\begin{eqnarray*}\mathbb{P}\left(Y=1|X=1\right) &=& p_{1|1} \\ \mathbb{P}\left(Y=1|X=2\right) &=& p_{1|2} \\ \mathbb{P}\left(Y=1|X=3\right) &=& p_{1|3} \end{eqnarray*} \] with \(p_{1|1}\), \(p_{1|2}\), and \(p_{1|3}\) left unknown. The entire function is the parameter.
- For our example so far, why didn’t we have to say anything about \(\mathbb{P}\left(Y=0|X=x \right)=p_{0|x}\) for \(x=1,2,3\)?
- There are restrictions on the values of \(p_{1|1}\), \(p_{1|2}\), and \(p_{1|3}\). What are these?
- Another formulation could be to work on the conditional expectation of labor force participation given educational attainment and treat this conditional expectation as a general or unspecified function of \(x\). Since \(Y\) is binary, we can conclude that for all support points \(x\), \[\begin{eqnarray*} \mathbb{E}\left(Y|X=x\right) &=& 0\cdot \mathbb{P}\left(Y=0|X=x\right)+1\cdot \mathbb{P}\left(Y=1|X=x\right) \\ &=&\mathbb{P}\left(Y=1|X=x\right).\end{eqnarray*}\] Therefore, in this special case of \(Y\) being a binary 0/1 variable, the conditional expectation has the same information as our first formulation.
We could also have worked on the conditional variance. In the special case of \(Y\) being a 0/1 variable, prove that \[\mathsf{Var}\left(Y|X=x\right)=\mathbb{P}\left(Y=1|X=x \right)\left[1-\mathbb{P}\left(Y=1|X=x \right)\right].\]
- We could also make additional assumptions about the form of \(\mathbb{P}\left(Y=1|X=x \right)\) instead of leaving its values unspecified and almost unrestricted. Suppose we write \[\mathbb{E}\left(Y|X=x\right)=\mathbb{P}\left(Y=1|X=x \right)=\beta_0+\beta_1 x. \tag{2.1}\] \(\beta_0\) and \(\beta_1\) are left unknown and these are the parameters. The first equality is always true given the structure of our example. The second equality is an assumption.
You will explore what is implied by the assumption on the conditional expectation. Read Section B-4f, specifically Properties CE.1, CE.2, and CE.4.
- Review (B.27) of Section B-4b. We will be working on \(\mathbb{E}\left(XY\right)\).
- Apply Property CE.4. In the formula for that property, let the role of \(Y\) be played by \(XY\) and \(X\) be played by \(X\). Write down the immediate consequence of applying CE.4 to \(\mathbb{E}\left(XY\right)\).
- Next, apply Property CE.2. Write down the immediate consequence. Afterwards, substitute the assumption about \(\mathbb{E}\left(Y|X=x\right)\) and simplify by applying Property E.3 on page 693.
- Repeat the previous two items but this time for \(\mathbb{E}\left(Y\right)\).
- Compute \(\mathsf{Cov}\left(X,Y\right)\). What do you notice?
Make a conclusion about the implications of imposing the assumption that \(\mathbb{E}\left(Y|X=x\right)=\beta_0+\beta_1 x\). What meanings do \(\beta_0\) and \(\beta_1\) have?
- We could also make a different assumption about the form of \(\mathbb{P}\left(Y=1|X=x \right)\) instead of stating that it is equal to \(\beta_0+\beta_1x\). Suppose we write \[\mathbb{E}\left(Y|X=x\right)=\mathbb{P}\left(Y=1|X=x \right)=\frac{\exp\left(\beta_0+\beta_1 x\right)}{1+\exp\left(\beta_0+\beta_1 x\right)}. \tag{2.2}\] \(\beta_0\) and \(\beta_1\) are left unknown and are the parameters. The first equality is always true given the structure of our example. The second equality is again an assumption.
You might wonder about the differences between the assumptions regarding \(\mathbb{E}\left(Y|X=x\right)=\mathbb{P}\left(Y=1|X=x \right)\) found in Equation 2.1 and Equation 2.2.
- There are supposed to be restrictions on \(\mathbb{P}\left(Y=1|X=x\right)\) given its meaning. Which of the two models you have seen so far obeys these restrictions? Explain.
- If Equation 2.1 holds, compute \[\mathbb{P}\left(Y=1|X=x+1 \right)-\mathbb{P}\left(Y=1|X=x \right).\] Would the value of \(x\) change the computed value?
- If Equation 2.2 holds, compute both \[\mathbb{P}\left(Y=1|X=x+1 \right)-\mathbb{P}\left(Y=1|X=x \right)\] and \[\log\left[\frac{\mathbb{P}\left(Y=1|X=x \right)}{\mathbb{P}\left(Y=0|X=x \right)}\right].\] How is the latter connected to a previous exercise on odds?
- Given the previous item, provide an interpretation of \(\beta_1\). Is it related to \(\mathsf{Cov}\left(X,Y\right)\)?
2.2.1.5 Final remarks
The model in Equation 2.1 is called a linear probability model. A discussion can also be found in Wooldridge (2020) Section 7-5. The model in Equation 2.2 is called a logit model or a logistic regression model. A discussion can also be found in Wooldridge (2020) Section 17-1. Both of these models will be discussed in some more detail later in this chapter.
It probably should not be very surprising that if you return to our dataset and run a linear regression of labor force participation on educational attainment, then we have a procedure to estimate \(\beta_0\) and \(\beta_1\) in the linear probability model.
- Can you articulate an intuitive reason why you would expect what was described to happen?
- Run the regression just described (a sketch follows this list).
- Take a stab at providing an interpretation to \(\widehat{\beta}_1\).
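A minimal sketch of that regression, using the recoded variables created earlier; lm() fits the linear probability model by least squares:

```r
lm(EMPSTATN ~ EDUCN, data = data)
```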
If you want to estimate \(\beta_0\) and \(\beta_1\) in Equation 2.2, you need a new command called glm(). This command is very popular for estimating the coefficients of generalized linear models.

```r
glm(EMPSTATN ~ EDUCN, family = binomial(link = "logit"), data = data)
```

We have to determine how to interpret the results and how they differ from least squares. We will have a chance to do this later in this chapter.
Finally, we have to determine if all the results can actually be used to answer the main question: “How does labor force participation change with educational attainment?”
2.2.2 Which sequence of coin flips is real?
2.2.2.1 Setup
Suppose you have two sequences of 100 coin flips. The code 1 represents heads while the code 0 represents tails.
Sequence A: 0011100011/0010000100/0010001000/1000000001/0011001010
1100001111/1100110001/0101100100/1000100000/0011111001
Sequence B: 0100010100/1100010100/1110100110/0011110100/0111010001
1000110111/1000100101/1011011100/0110010001/0010000100
One way to determine which of the sequences of coin flips is real is to work out the long-run properties of sequences of coin flips. We also need to make assumptions about what kind of coin flips we have in mind. Altogether, we are creating a model for the data we could observe.
To make things less complex but more concrete, consider flipping a fair coin 4 times. Define two variables:
- \(Y\) is the length of the longest run. A run is a sequence of consecutive flips of the same type.
- \(X\) is the total number of switches from heads to tails or tails to heads for a sequence of coin flips.
Observe that we are trying to measure characteristics of sequences of coin flips. We also make the assumptions that the coin is fair and that the coin flips are independent of each other. We now ask what are the possible sequences of coin flips we can observe and their corresponding \(X\) and \(Y\) measurements.
As you may have noticed, we are uncertain about what sequence of coin flips (hence, \(X\) and \(Y\)) we will get to observe. So we have to list down all possible sequences of coin flips we can observe and then figure out which pairs of \((X,Y)\) we can observe. After that, we have to determine how likely these pairs will happen. In effect, we are constructing the joint distribution of \((X,Y)\).
- List down all the possible sequences of coin flips when you flip a coin 4 times in a row (a checking sketch in R appears after this list). Was the assumption of a fair coin necessary for you to list them down?
- What are the corresponding values of the pairs \((X,Y)\) for each possible sequence of coin flips?
- What happens if you plan to do the previous items when there are 100 flips in a row?
- Return to the 4-flip situation. You are to calculate \(\mathbb{P}\left(X=x, Y=y\right)\) for different values of the pair \((x,y)\).
- Calculate \(\mathbb{E}\left(Y\right)\), \(\mathsf{Var}\left(Y\right)\), \(\mathsf{Cov}\left(X,Y\right)\), \(\mathbb{E}\left(Y|X\right)\), and \(\mathsf{Var}\left(Y|X\right)\).
- Given your calculations, are you able to find \(\beta_0\) and \(\beta_1\) so that \(\mathbb{E}\left(Y|X=x\right)=\beta_0+\beta_1 x\) for all support points \(x\)?
- Given your calculations, are you able to find \(\sigma^2\) so that \(\mathsf{Var}\left(Y|X=x\right)=\sigma^2\) for all support points \(x\)?
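If you want to check your enumeration by computer, here is one possible sketch in R; expand.grid() lists all \(2^4\) sequences and rle() computes the run lengths:

```r
seqs <- expand.grid(rep(list(0:1), 4))  # all 16 sequences of 4 flips
XY <- t(apply(seqs, 1, function(s) {
  runs <- rle(as.numeric(s))$lengths    # lengths of consecutive runs
  c(X = length(runs) - 1,               # X: number of switches
    Y = max(runs))                      # Y: length of the longest run
}))
# With a fair, independent coin, each sequence has probability 1/16,
# so the joint distribution is the relative frequency across all rows.
prop.table(table(XY[, "X"], XY[, "Y"]))
```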
If you have worked on the previous exercise, then you have obtained the joint distribution of \((X,Y)\) at the population level. Imagine how you would do this for 100 coin flips. You would come to the conclusion that there has to be a better way.
2.2.2.2 Monte Carlo simulation
We can conduct a Monte Carlo Simulation to “recover” the joint distribution of \((X,Y)\). “Recover” here means that we are going to rely on the idea that probability is a long-run relative frequency. One possible simulation algorithm which may be humanly possible is
- Flip a coin 4 times. Record the result.
- Compute \(X\) and \(Y\) for the result.
- Repeat Steps 1 and 2 a large number of times.
- Create a relative frequency table for \(X\) and \(Y\).
Step 3 is probably not humanly possible if the number of repetitions reaches, say, 10,000. This is where R becomes quite useful.
You will be running a Monte Carlo Simulation of the situation described. What makes this simulation slightly difficult is how to compute \(X\) and \(Y\) for every possible sequence of 4 fair coin flips.
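One possible implementation is sketched below, under the stated assumptions of a fair coin and independent flips; sample() draws the flips and rle() again does the bookkeeping for runs:

```r
set.seed(42)                              # any seed works; this is for reproducibility
nsim <- 10000                             # number of simulated sequences
sim <- t(replicate(nsim, {
  s    <- sample(0:1, 4, replace = TRUE)  # one sequence of 4 fair coin flips
  runs <- rle(s)$lengths
  c(X = length(runs) - 1, Y = max(runs))
}))
prop.table(table(sim[, "X"], sim[, "Y"])) # simulated joint distribution of (X, Y)
```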
- Compare the joint distribution of \((X,Y)\) you have obtained earlier with that of the simulated counterpart obtained using R.
- What are the similarities and differences? Is the algorithm functioning as intended?
- How do you adjust the code so that you can obtain the simulated counterpart of the joint distribution of \((X,Y)\) if there were 100 coin flips instead?
Let us now produce a scatterplot of the simulated pairs of \((X,Y)\) which were drawn to obey the joint distribution of \((X,Y)\). We have somehow found a “relationship” between \(X\) and \(Y\).
- Run a regression of the simulated \(Y\) on the simulated \(X\).
- Compute the conditional relative frequency distribution of the simulated \(Y\) given the simulated \(X\).
- Compare your findings to the corresponding quantities based on the joint distribution of \((X,Y)\).
What do these regression coefficients represent? How can we use this regression to answer our original question of determining which sequence of coin flips is real?
2.3 Optimal prediction
This section is based on Section 1.3 of Pua (2024) and we provide some answers to the following questions (some of which have been answered partially in this chapter):
- Why would the objective function used in least squares be a good idea?
- Can the meaning of the regression coefficients extend to outside the sample that we have?
- What would be necessary so that we can ably talk about cause and effect?
2.3.1 Setup and core ideas
The setting we consider for the prediction problem is as follows. There is some random variable \(Y\) whose value you want to predict or guess. You may or may not have additional information about another random variable \(X\). In the end, you have to specify a prediction rule or a formula that you can use to make predictions. We write this rule as a function of \(X\), say \(g(X)\).
An example of a prediction rule is an unknown real number \(\beta_0\). In this case, \(g(X)=\beta_0\), so it does not matter what \(X\) is and your prediction is always some constant \(\beta_0\). You make a prediction error or forecast error \(Y-\beta_0\), but this error is also a random variable having a distribution.
Next, you need some criterion to assess whether your prediction is “good”. This criterion is called a loss function. We will focus on a particular loss function called the squared error loss function. It is defined as \(L\left(z\right)=z^{2}\), where \(z\) is just a dummy argument. Losses will typically depend on the errors which were made. As a result, losses will have a distribution too.
Because losses are random, we need to specify a non-random criterion which we could minimize. Traditionally, we target expected losses. This may seem arbitrary but many prediction and forecasting contexts use expected losses. Form an expected squared loss \(\mathbb{E}\left[\left(Y-\beta_0\right)^{2}\right]\). This expected loss is sometimes called mean squared error (MSE).
2.3.1.1 Best constant prediction
You can show that the population mean \(\mathbb{E}\left(Y\right)\) is the unique solution to the following optimization problem: \[\min_{\beta_0}\mathbb{E}\left[\left(Y-\beta_0\right)^{2}\right] \tag{2.3}\]
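One way to see this is to add and subtract \(\mathbb{E}\left(Y\right)\) inside the square and expand; the cross term has zero expectation, leaving \[\mathbb{E}\left[\left(Y-\beta_0\right)^{2}\right]=\mathbb{E}\left[\left(Y-\mathbb{E}\left(Y\right)\right)^{2}\right]+\left(\mathbb{E}\left(Y\right)-\beta_0\right)^{2}.\] The first term does not involve \(\beta_0\), and the second term is zero exactly when \(\beta_0=\mathbb{E}\left(Y\right)\), so the minimizer is unique.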
We also have a neat connection with the definition of the population variance. \(\mathsf{Var}\left(Y\right)\) becomes the smallest expected loss from using \(\mathbb{E}\left(Y\right)\) as your prediction for \(Y\).
Do you see any similarities with what you have seen before? How about the connections and parallels to simple linear regression as a summary? How about the differences?
2.3.1.2 Best linear prediction
What if we want to incorporate additional information from \(X\) in our prediction rule? We have to change the specification of our prediction rule. Earlier, we considered \(g(X)=\beta_0\) as a prediction rule. We now specify \(g(X)=\beta_0+\beta_1 X\) as a prediction rule.
We repeat the process of calculating the MSE but for a new specification of the prediction rule. Under an MSE criterion, the minimization problem is now \[\min_{\beta_{0},\beta_{1}}\mathbb{E}\left[\left(Y-\beta_{0}-\beta_{1}X\right)^{2}\right], \tag{2.4}\] with optimal coefficients \(\beta_{0}^{*}\) and \(\beta_{1}^{*}\) satisfying the first-order conditions: \[\mathbb{E}\left(Y-\beta_{0}^{*}-\beta_{1}^{*}X\right) = 0, \ \ \mathbb{E}\left[X\left(Y-\beta_{0}^{*}-\beta_{1}^{*}X\right)\right] = 0\]
As a result, we have \[\beta_{0}^{*} = \mathbb{E}\left(Y\right)-\beta_{1}^{*}\mathbb{E}\left(X\right),\qquad\beta_{1}^{*}=\dfrac{\mathbb{E}\left(XY\right)-\mathbb{E}\left(Y\right)\mathbb{E}\left(X\right)}{\mathbb{E}\left(X^2\right)-\mathbb{E}\left(X\right)^2}.\]
In the end, we can rewrite as \[\beta_{0}^{*} = \mathbb{E}\left(Y\right)-\beta_{1}^{*}\mathbb{E}\left(X\right),\qquad\beta_{1}^{*}=\dfrac{\mathsf{Cov}\left(X,Y\right)}{\mathsf{Var}\left(X\right)}. \tag{2.5}\]
Take note of how we computed linear regression in lm(). Do you find some similarities and parallels? What are the differences?
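A quick numerical illustration of the parallel (a sketch with simulated, hypothetical data): the sample analogues of Equation 2.5 match the lm() coefficients.

```r
set.seed(1)
x <- rnorm(100)                # hypothetical regressor
y <- 1 + 2 * x + rnorm(100)    # hypothetical outcome
coef(lm(y ~ x))                # least-squares intercept and slope
b1 <- cov(x, y) / var(x)       # sample analogue of beta1*
c(mean(y) - b1 * mean(x), b1)  # sample analogues of beta0* and beta1*
```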
2.3.1.3 Best prediction
What if we want to incorporate additional information from \(X\) in our prediction rule, but this time we want to leave \(g(X)\) unspecified?
Under an MSE criterion, the minimization problem is now \[\min_{g(X)}\mathbb{E}\left[\left(Y-g(X)\right)^{2}\right].\] We now impose a restriction to make some progress, without introducing a lot of new mathematical ideas.
Let us assume that \(X\) is a discrete random variable with support points \(x_1,\ldots, x_K\). We do not need to assume that \(Y\) is also discrete. If we want to find the best function \(g\) which will minimize mean squared error and \(X\) takes on a finite number of support points, we can look at this case by case.
Let \(x_k\) be any one of those support points. For the subpopulation where \(X=x_k\), how would we optimally predict the \(Y\) of any arbitrarily selected unit from that subpopulation? In other words, we are looking for the number \(g\left(x_k\right)\) which minimizes MSE. This is nothing but best constant prediction as in Section 2.3.1.1. We know from Section 2.3.1.1 that the best constant prediction would be the population mean for that subpopulation, i.e., \(\mathbb{E}\left(Y|X=x_k\right)\).
Because the argument in the previous paragraph applies to all support points, the choice of \(g\) which solves the minimization problem has the same core structure across subpopulations: it is \(\mathbb{E}\left(Y|X\right)\).
2.3.1.4 Connections to Chapter 1
By now, you should have noticed that the tools which have been introduced to you in Chapter 1 and this chapter can be theoretically justified for prediction use cases. As a result, findings based on the tools you have learned so far could be expressed in the language of prediction. Of course, the justification requires us to maintain additional assumptions:
- The data that we observe is but one of the many realizations or draws from the joint distribution \((X,Y)\). This aspect becomes more refined as you progress towards more complicated settings.
- The metric to judge whether you have made a good prediction is an MSE criterion. You have to accept this whether you like it or not, especially if you apply the tools.
- A very technical requirement is that these expected values actually exist. You might argue that we are talking about them already, so they must exist, right? In some application areas like climate and finance, there are tools which don’t rely on the existence of these expected values. In addition, their use cases make it hard to justify the use of expected values for prediction.
2.3.2 Which prediction stance should you adopt?
The short answer is that it depends on your research question.
2.4 A menagerie of models
2.4.1 The contrast between linear regression and the linear regression model
2.4.1.1 Constant plus error model
Let us revisit the best constant prediction of some random variable \(Y\). Define \(U\) to be the error produced by the best constant prediction of \(Y\), i.e. \[U=Y-\beta_0^*.\] From the optimality of \(\beta_0^*\), we can conclude that \[\mathbb{E}\left(U\right)=0.\] As a result, we can always write \[Y=\beta_0^*+U\] where \(U\) satisfies \(\mathbb{E}\left(U\right)=0\) if we are focused on prediction.
What I just described is also related to a basic form of a signal plus noise model used in statistics and econometrics. In fact, many textbooks directly assume (or implicitly assume) that \[Y=\beta_0+\mathrm{error}\] with \(\beta_0\) unknown and \(\mathbb{E}\left(\mathrm{error}\right)=0\). The task is then to recover \(\beta_0\).
How is this “signal plus noise” model related to linear regression? The term \(\mathrm{error}\) in the “signal plus noise” model was assumed to have the property that \(\mathbb{E}\left(\mathrm{error}\right)=0\). Effectively, the term \(\mathrm{error}\) is also an error produced by the best constant prediction of \(Y\), because of its zero population mean property. Therefore, the unknown \(\beta_0\) is not just some arbitrary unknown constant. It has to be equal to \(\beta_0^*\).
2.4.1.2 Simple linear regression model
Another version of the simple “signal plus noise” model is as follows. Assume that \[Y=\beta_0+\beta_1X+\mathrm{error}\] with \(\beta_0,\beta_1\) being unknown, \(\mathbb{E}\left(\mathrm{error}\right)=0\), and \(\mathsf{Cov}\left(X,\mathrm{error}\right)=0\). Take note that although I am using the same notation \(\mathrm{error}\) as earlier, they represent different error terms. This version is sometimes called a simple linear regression model. The task is then to recover \(\beta_0,\beta_1\).
How is this simple linear regression model related to linear regression? The term \(\mathrm{error}\) was assumed to have the properties that \(\mathbb{E}\left(\mathrm{error}\right)=0\) and \(\mathsf{Cov}\left(X,\mathrm{error}\right)=0\). Effectively, the term \(\mathrm{error}\) is also an error produced by the best linear prediction of \(Y\) given \(X\). Therefore, the unknowns \(\beta_0,\beta_1\) are actually equal to \(\beta_0^*\) and \(\beta_1^*\), respectively.
2.4.1.3 Why be so pedantic about this?
In my opinion, starting from a “signal plus noise” model or a simple linear regression model, as most textbooks do, may not be the most appropriate. Why?
- It mixes up two things: linear regression as a way to calculate coefficients and a linear regression model as the assumed underlying data generating process.
- It is not clear why a linear regression model would be assumed. In contrast, by being very specific at the beginning about prediction tasks, we do not even need to have a model of how the data were generated. All we need is the presence of a joint distribution of \((X, Y)\) with finite variances.
Therefore, if you find yourself reading a paper or book or any other material where they use linear regression or a linear regression model, you have to figure out what exactly they are doing and what exactly they are assuming.
The most unsatisfying aspect of seeing a linear regression model is the unclear nature of the origins of the assumed form. For example, stating that \(Y=\beta_0+\beta_1 X+\mathrm{error}\), where \(\mathrm{error}\) satisfies \(\mathbb{E}\left(\mathrm{error}\right)=0\) and \(\mathsf{Cov}\left(X,\mathrm{error}\right)=0\), is an assumption. Although we can use linear regressions to recover \(\beta_0\) and \(\beta_1\) under some conditions, an analysis solely relying on this assumption without a justification or without a clear research question is inadequate by default.
2.4.2 Limited dependent variable models
Limited dependent variable models are models whose regressands are discrete in nature or have limitations with respect to their observability.
One example is the class of binary response models. These models are sometimes called binary choice models. The regressand is binary in nature. We actually started studying them earlier in this chapter; see Section 2.2.1.4. We now dig a bit deeper and see how linear regression can be applied. At the same time, you will find out its limitations, which will require us to consider alternative approaches.
Regrettably, other classes of limited dependent variables will not be covered in our course, but they are of substantial practical interest:
- We may have regressands whose values we may not observe because of the economic context or by design. For example, we do not observe the wages of unemployed workers. Refer to Sections 17-2, 17-4, and 17-5 for more examples.
- We may have regressands which measure counts or how many. This produces nonnegative integer values. For example, the number of patents a firm has can be of interest when one is studying innovative activities. Refer to Section 17-3 for more examples.
2.4.2.1 Linear probability models
Read Section 7-5. Stop reading when you reach “To correctly interpret …” on page 240.
If \(Y\) is binary and one writes a linear regression model of the form \[Y=\beta_0+\beta_1X_1+\ldots+\beta_kX_k+U,\] what do we gain? To answer this question, it is important to recall that a linear regression model makes assumptions about \(U\). In particular, we have to assume that \(\mathbb{E}\left(U|X_1,\ldots, X_k\right)=0\). What this implies is that \[\mathbb{E}\left(Y|X_1,\ldots, X_k\right)=\beta_0+\beta_1X_1+\ldots+\beta_kX_k.\] Because \(Y\) is binary, we already know from Section 2.2.1.4, even without assuming a linear regression model, that \[\mathbb{P}\left(Y=1|X_1,\ldots, X_k\right)=\mathbb{E}\left(Y|X_1,\ldots, X_k\right).\] Therefore, we must have \[\mathbb{P}\left(Y=1|X_1,\ldots, X_k\right)=\beta_0+\beta_1X_1+\ldots+\beta_kX_k. \tag{2.6}\] The only difference is that we have \(k\) regressors instead of just one as in Section 2.2.1.4. In this sense, there is nothing fundamentally new.
Section 7-5 of the book does introduce terminology which you should note. I will use the notation \(\mathbf{X}\) to denote the list of regressors \(X_1,\ldots, X_k\). This differs from our prior usage of \(\mathbf{X}\) in Section 1.5.4. You have to infer from the given context. The terms to remember are:
- The response probability or the conditional probability of success \(\mathbb{P}\left(Y=1|\mathbf{X}\right)\)
- The linear probability model, which is just a linear regression model with a binary \(Y\)
The interpretation of the regression coefficients has to change a bit given the setting of a binary \(Y\). In particular, you can show that \[\frac{\partial \mathbb{P}\left(Y=1|\mathbf{X}\right)}{\partial X_j}=\beta_j.\] As a result, we have \[\Delta \mathbb{P}\left(Y=1|\mathbf{X}\right) = \beta_j \Delta X_j.\] When we estimate \(\beta_0,\beta_1,\ldots, \beta_k\) by OLS, we have \[\Delta \widehat{\mathbb{P}}\left(Y=1|\mathbf{X}\right) = \widehat{\beta}_j \Delta X_j.\]
Continue reading Section 7-5 from the point where I asked you to stop. Stop reading when you reach “Predicted probabilities …” on page 242. You will be reading an empirical illustration based on a classic paper on female labor supply by Mroz (1987). For the moment, ignore the standard errors, \(t\)-statistics, or any other mention of the phrase “statistically significant”. What matters for now is the basis of interpretation for a linear probability model and the issues which arise.
Reproduce the results found in (7.29).
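One possible route (a sketch): the Mroz (1987) data are shipped with the wooldridge R package as mroz, with variable names matching the textbook; install that package first if needed.

```r
library(wooldridge)  # provides the textbook datasets, including mroz
data("mroz")
# Linear probability model in (7.29): labor force participation on covariates
lm(inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6,
   data = mroz)
```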
The analysis in the empirical illustration is quite good, especially when looking at how plausible the results are. For example, consider the interpretation of the coefficient of educ. If we compare women with the same level of husband’s earnings, experience, age, number of children less than 6 years old, and number of children between 6 and 18 years old, then a woman with one more year of education is predicted to be about 3.8 percentage points more likely to be in the labor force.
Notice that the interpretation is quite clunky and messy. It might feel boring, but it does not mislead, especially for audiences who do not know anything about binary choice models. It is possible to streamline the interpretation by choosing a difference in education levels of more than one year and by using some shorthand. For example: If we compare women with the same characteristics measured in (7.29), then a woman with 10 more years of education is predicted to be about 38 percentage points more likely to be in the labor force.
Stop and think about the difference between 38% more and 38 percentage points more. This is vital in communication.
Clearly, it does not make sense to push the interpretation further by considering differences in years of education larger than 10. You can definitely calculate descriptive statistics to determine the maximum plausible value and how many observations attain it. Even then, pushing the interpretation towards extreme values may produce nonsense. For example, the illustration discusses the difference in the predicted probability of being in the labor force for women who have 4 more kids less than 6 years old. The predicted probability is reduced by 104.8 percentage points. Clearly, this is nonsense. This is a limitation of a linear probability model, but it can be managed by paying attention to the characteristics of the data you have.
Fitted values and residuals can be computed in the same way as we do with linear regressions. Unfortunately, these fitted values may not make sense because they may exceed one or fall below zero. In addition, it is hard to motivate the use of R-squared because \(Y\) takes on only two values. The better approach is to consider a goodness-of-fit measure called the percent correctly predicted. A discussion can be found on page 242.
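Here is a sketch of how the percent correctly predicted can be computed for the fit of (7.29), assuming the mroz data from above:

```r
lpm <- lm(inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6,
          data = mroz)
pred <- as.numeric(fitted(lpm) >= 0.5)  # classify each fitted value with a 0.5 cutoff
mean(pred == mroz$inlf)                 # share of observations predicted correctly
```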
- Work out and improve every interpretation found in the discussion of (7.29). Verify every presented finding, including supporting findings like “In the sample, just under 20% of the women have at least one young child.”
- Use your favorite AI tool to assist you in reproducing Figure 7.3 in R.
- Do the same for Example 7.12.
- For now, ignore directions which ask about tests, statistical significance, and confidence intervals. Work on Chapter 7 Problem 7(i) and Computer Exercises C8 (ii) to (v), C9, C13 except (iii), and C14.
2.4.2.2 Logit and probit models
Read the first two paragraphs at the beginning of Section 17-1. Proceed to Sections 17-1a and 17-1d. As part of your reading, it is important to contrast what you see with the discussion on linear probability models.
Skip the discussion of the pseudo R-squared.
The main difference of logit and probit models when compared to linear probability models comes from the form of \(\mathbb{P}\left(Y=1|\mathbf{X}\right)\). Before, we had Equation 2.6. Now, we have \[\mathbb{P}\left(Y=1|\mathbf{X}\right)=G\left(\beta_0+\beta_1X_1+\ldots+\beta_k X_k\right), \tag{2.7}\] with \(G\left(\cdot\right)\) obeying the condition \(0<G(z)<1\) for any \(z\). Different choices of the function \(G\) lead to different binary response models, as the sketch after the list below illustrates in R.
- If \(G\) is the logistic function, then we have a logit model.
- If \(G\) is the standard normal cumulative distribution function (cdf), then we have a probit model.
- There are other, less popular choices such as the cloglog, loglog, robit, Pregibon, and cauchit. Refer to some R implementations found in Koenker (2006) and to more theory justifying these other choices found in Koenker and Yoon (2009). The robit model is introduced in Liu (2004).
- It is also possible to leave \(G\) unspecified, but this will be entering into the territory of semiparametric econometrics. Regrettably, this is somewhat outside the scope of our course. If you want to learn more, consider reading Horowitz and Savin (2001).
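In R, these choices of \(G\) correspond to the link argument of glm(); here is a sketch using the IPUMS-based variables from earlier in the chapter:

```r
glm(EMPSTATN ~ EDUCN, family = binomial(link = "logit"),   data = data)  # logit
glm(EMPSTATN ~ EDUCN, family = binomial(link = "probit"),  data = data)  # probit
glm(EMPSTATN ~ EDUCN, family = binomial(link = "cauchit"), data = data)  # cauchit
glm(EMPSTATN ~ EDUCN, family = binomial(link = "cloglog"), data = data)  # cloglog
```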
Unfortunately, the meaning of the \(\beta\)’s in Equation 2.7 is different from that of the \(\beta\)’s in Equation 2.6. They are not directly comparable, simply because we need \(G\) to compute the probability of success. In fact, the differences in response probabilities arising from differences in \(X_j\), holding other regressors fixed, depend on all the \(\beta\)’s, not just \(\beta_j\), i.e., \[\frac{\partial \mathbb{P}\left(Y=1|\mathbf{X}\right)}{\partial X_j}=g\left(\beta_0+\beta_1 X_1+\ldots+\beta_k X_k\right)\beta_j,\] with \(g(\cdot)=G^{\prime}(\cdot)\). Another important thing to note is that the differences in response probabilities depend on the values of \(X_1,\ldots, X_k\). This feature is very different from the linear probability model and is sometimes described as a nonlinearity. It implies that different settings for \(\mathbf{X}\) lead to different values for the differences in response probabilities.
As a result of this feature, there is a wide variety of “effects” reported in the literature (one of them is sketched in code after this list):
- Relative effects
- Effects of changes in a discrete regressor
- Partial effects at the average or marginal effects at the average
- Average partial effects or average marginal effects
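As promised, here is a sketch of one of these summaries: the average partial effect (APE) of EDUCN in the logit model fitted earlier in this chapter, where dlogis() is the logistic density \(g=G^{\prime}\):

```r
fit <- glm(EMPSTATN ~ EDUCN, family = binomial(link = "logit"), data = data)
b   <- coef(fit)
# average of g(b0 + b1*x) * b1 over the sample: the average partial effect
mean(dlogis(b[1] + b[2] * data$EDUCN)) * b[2]
```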
- Work on Chapter 17 Problems 1 and 2.
- For now, ignore directions which ask about tests, statistical significance, and confidence intervals. Work on Chapter 17 Computer Exercises C1, C2, C8, C14, and C15 (i) to (vi).
2.4.2.3 Discrete dependent variable models
If you want a preview of how to deal with discrete dependent variable models without necessarily going into the more advanced topics, work on Section 7-7 and Chapter 7 Computer Exercise C15 completely.