Gov 2002: Problem Set 8

Published

April 6, 2023

This content is from Spring 2022.


Problem Set Instructions

This problem set is due on April 12, 11:59 pm Eastern time. Please upload a PDF of your solutions to Gradescope. We will accept handwritten solutions for problems 1-3, but we strongly advise you to typeset your answers in R Markdown. Problem 4 should be typeset. Please list the names of other students you worked with on this problem set.

Question 1

Let $X$ and $Y$ be random variables with finite variances, and let $W = Y - E(Y \mid X)$ be the CEF error. This is the population version of the sample residual: the difference between the true value of $Y$ and the value of $Y$ predicted by the conditional expectation function (CEF) given $X$.

  1. Compute E(W) and E(W|X).

  2. Compute $\mathrm{Var}(W)$ for the case where $W \mid X \sim N(0, X^2)$ with $X \sim N(0, 1)$.

  3. Now consider a third finite-variance random variable $Z$. Suppose the following CEF is true in the population: $E[Y \mid X, Z] = \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 Z^2 + \beta_4 X Z$. Find the partial effects of $X$ and $Z$ on $E[Y \mid X, Z]$.

Question 2

In this problem we will explore how centering an independent variable (subtracting off its mean) affects the interpretation of coefficients in linear projections.

  1. Suppose that $L[Y \mid 1, X] = \beta_0 + X \beta_1$. Let $Z = X - E[X]$. Find the coefficients of the linear projection $L[Y \mid 1, Z] = \alpha_0 + Z \alpha_1$ in terms of $\beta_0$ and $\beta_1$. Does centering $X$ around its mean affect these parameters?

  2. Now suppose that $L[Y \mid X_1, X_2] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$. Derive an expression for the partial effect of $X_1$, $\partial L[Y \mid X_1, X_2] / \partial X_1$, and the expectation of that partial effect (where the expectation is over the distribution of $X_1$ and $X_2$).

  3. A common trick with interactions is to center one of the variables for easier interpretation. Let $Z_2 = X_2 - \mu_2$, where $\mu_2 = E[X_2]$. Rewrite the linear projection $L[Y \mid X_1, X_2]$ as a function of $Z_2$ instead of $X_2$ and relate the new coefficient on $X_1$ to the linear projection in part 2. That is, write $$L[Y \mid X_1, X_2] = \alpha_0 + \alpha_1 X_1 + \alpha_2 Z_2 + \alpha_3 X_1 Z_2$$ and express the $\alpha$ coefficients in terms of $(\beta_0, \beta_1, \beta_2, \beta_3)$ and $\mu_2$. How does the coefficient on $X_1$ obtained in this part relate to the average of the partial effects in part 2? (Hint: you’ll need to add and subtract certain values to obtain the new expression.)

  4. In a sentence or two, explain the substantive interpretation of $\alpha_1$ and why using $Z_2$ instead of $X_2$ might be useful. (Hint: consider a case where $X_1$ is assignment to some treatment and $X_2$ is birth year.) Does this transformation affect the interpretation of the interaction term?

Question 3

This question highlights the importance of the assumptions we make about the population regression function.

(a)

Suppose the following linear model is true in the population for some outcome variable $Y$: $Y = \mathbf{X}^T \beta + u$.

Show that if this model is true and $E[u \mid \mathbf{X}] = 0$, then $E[Y \mid \mathbf{X}] = \mathbf{X}^T \beta$.

(Note that this is the converse of what we showed in lecture, where we saw that if we assume $E[Y \mid \mathbf{X}] = \mathbf{X}^T \beta$, then the conditional mean zero assumption holds, $E[u \mid \mathbf{X}] = 0$.)

(b)

With regression we don’t typically make many distributional assumptions about $\mathbf{X}$, except for a few crucial ones. In particular, we saw that for the linear projection to be well-defined we needed $Q_{\mathbf{XX}} = E[\mathbf{X}\mathbf{X}^T]$ to be positive definite and thus invertible.

Let $\mathbf{X} = (1, X)^T$, so we are in a bivariate regression setting. Show that if $\mathrm{Var}(X) = 0$, then $Q_{\mathbf{XX}}$ is not positive definite. (Hint: look for linear dependencies in the columns of $Q_{\mathbf{XX}}$.)

Question 4: Regression Analysis of Subprime Loans

This problem will guide you through thinking about the conditional expectation function, how it relates to regression, and how we can connect it back to hypothesis testing.

For this problem, we are going to use the subprime data. Recall that these are data collected by the U.S. government on all home lending transactions in Cape Coral and Fort Myers. They contain information on each loan applicant and give information on whether that applicant received a subprime loan (high.rate) as well as on the amount of the loan (loan.amount). They also contain basic demographic information such as race, gender, and income.

Assume the data represent the “truth” (i.e., an entire population). Also assume that the data in this population are distributed i.i.d. Take a sample of size 250, without replacement, from this population. Set your seed to 02138 before doing so. You will be working with this sample throughout this problem.
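As a rough sketch of the sampling step: the snippet below draws 250 rows without replacement from a data frame. The object `pop` is a hypothetical, simulated stand-in (it is not the subprime data), and the names `idx` and `samp` are only illustrative. Note that R reads the literal `02138` as the number 2138.

```r
# Hypothetical stand-in population, simulated for illustration only;
# in the problem set you would use the actual subprime data instead.
set.seed(1)
pop <- data.frame(
  income = rexp(1000, rate = 1 / 60),
  loan.amount = rexp(1000, rate = 1 / 150)
)

set.seed(02138)  # R parses 02138 as the numeric value 2138

# Draw 250 distinct row indices (sampling without replacement)
idx <- sample(nrow(pop), size = 250, replace = FALSE)
samp <- pop[idx, ]

dim(samp)
```

Sampling row indices and then subsetting keeps every column of a sampled row together.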

(a)

You care about the relationship between the variables income and loan.amount – seems like there should be a relationship between those two, right?

As per usual, you have a friend (you really need to get some new friends) who proposes that you use the following strategy to see if there is a relationship: Create a new variable, income.bin, that takes on four values using the cut() function in R:

  • a value if income falls into the [0, 25] percentile range (which you can find via quantile()),
  • a value if it falls into the (25, 50] range,
  • a value if it falls into the (50, 75] range, and
  • a value if it falls into the (75, 100] range.

Note that the lower bounds are NOT inclusive, except for the first range.
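One possible way to build such bins, shown on a simulated `income` vector (hypothetical data, not the subprime sample): `quantile()` supplies the break points, and `cut()` with its default `right = TRUE` makes each interval closed on the right, while `include.lowest = TRUE` closes the first interval on the left as well.

```r
# Simulated incomes, for illustration only (not the subprime data)
set.seed(1)
income <- rexp(250, rate = 1 / 60)

# Break points at the 0th, 25th, 50th, 75th and 100th percentiles
breaks <- quantile(income, probs = c(0, 0.25, 0.5, 0.75, 1))

# right = TRUE (the default) gives (a, b] intervals;
# include.lowest = TRUE turns the first one into [a, b]
income.bin <- cut(income, breaks = breaks, include.lowest = TRUE)

table(income.bin)
```

Without `include.lowest = TRUE`, the smallest observation would fall outside every interval and be coded as `NA`.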

Run a regression of loan.amount on income.bin, and report the coefficients, standard errors, $R^2$ and sample size in a nicely formatted table (recall section).

(b)

Let’s compare the approach in (a) to using a regression on the original continuous variable. What is an assumption we make in the approach in (a) that we don’t make when we run a regression? We are looking for an assumption related to the fact that you have taken a continuous variable and stratified it into four categories.

(c)

Based on the results of the regression on binned income in (a), do you think the linearity assumption needed for a bivariate regression on the original continuous variable holds in this case? Why or why not?

(d)

In spite of your friend’s opinion, you decide to run a regression of loan.amount on income. Run this regression within your sample, and report the coefficients, standard errors, $R^2$ and sample size in a nicely formatted table (recall section).
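As a sketch of where the requested quantities live in base R: the snippet below fits `lm()` on simulated data (the data frame `samp` and its coefficients are made up for illustration, not the subprime sample) and pulls out the estimates, standard errors, R-squared and sample size.

```r
# Simulated stand-in for the 250-row sample (hypothetical values)
set.seed(1)
samp <- data.frame(income = rexp(250, rate = 1 / 60))
samp$loan.amount <- 50 + 1.5 * samp$income + rnorm(250, sd = 40)

fit <- lm(loan.amount ~ income, data = samp)
s <- summary(fit)

coefs <- s$coefficients[, c("Estimate", "Std. Error")]  # estimates and SEs
r2 <- s$r.squared                                       # R-squared
n <- nobs(fit)                                          # observations used

print(round(coefs, 3))
cat("R2 =", round(r2, 3), " n =", n, "\n")
```

These pieces can then be passed to whichever table-formatting tool was shown in section.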

(e)

Interpret the estimated coefficients. Is the interpretation consistent with the results from (a)?