Gov 2002: Problem Set 6

Published

March 23, 2023

This content is from Spring 2022.

Problem Set Instructions

This problem set is due on March 29, 11:59 pm Eastern time. Please upload a PDF of your solutions to Gradescope. We will accept handwritten solutions, but we strongly advise you to typeset your answers in R Markdown. Please list the names of other students you worked with on this problem set.

Question 1 (20 points)

This problem will use the subprime data to walk you through a very common inference problem: testing whether the difference between two population values is non-zero. To begin this problem, first download subprime.csv and load it into R.

We are going to be interested primarily in the loan.amount variable - the amount that each loan recipient received. Suppose a lawsuit has been filed in U.S. District Court by a group of Fort Myers women who claim that women in the area were loaned less money than men. The defendants – a group of local mortgage lenders – are vigorously denying these claims, and the case is now advancing to trial. Having heard about your expertise in this area, the federal judge hearing the case has brought you in to provide expert testimony. Your task in this problem is to assist the judge in her determination.

(a) Suppose you were only able to interview \(100\) male and \(100\) female loan recipients at random, making them iid. To simulate this in R, set the seed to 02138 and draw \(100\) observations randomly from the male subset of the subprime data and \(100\) observations randomly from the female subset of the data. These \(200\) observations constitute your sample. Calculate (1) the average loan amount (loan.amount) for women in your sample, (2) the average loan amount for men, and (3) and (4) the sample standard deviations for women and men, respectively. Report those results in a nicely formatted table.
(b) Let \(\mu_{m}\) and \(\mu_{w}\) be the population average loan amount for men and women respectively. Let \(\sigma^2_{m}\) and \(\sigma^2_{w}\) be the population variances in loan amount for men and women respectively. Denote the sample average loan amounts for men by \(\bar{X}_{m}\) and for women by \(\bar{X}_{w}\). What is the expected value of the sampling distribution of \(\bar{X}_{m} - \bar{X}_{w}\)? What is the variance of the sampling distribution of \(\bar{X}_{m} - \bar{X}_{w}\)?
(c) Compute and report your sample difference in average loan amount for men and women. Recall that for large samples, the sampling distribution of a mean or difference-in-means is approximately normal. Suppose that we know that the true population \(\sigma^2_{m} = 32381.57\) and \(\sigma^2_{w} = 19097.95\). Using the normal approximation, what is the probability that we would observe a difference-in-means at least as extreme as the one in our sample if the true population difference-in-means \(\mu_{m} - \mu_{w}\) equals 0? Note that by “at least as extreme,” we mean a value that is at least as far from \(0\) as the value we observe, that is, \(P(|\bar{X}_{m} - \bar{X}_{w}| \ge \alpha)\), where \(\alpha\) is the absolute value of our observed difference and \(|\cdot|\) is the absolute value operator.
Hint: In R you can get the probability that a normally distributed random variable takes on a value less than or equal to some value q using the command pnorm(q, mean, sd), where mean is the mean of the normal distribution and sd is the standard deviation.
(d) Comment on your result in (c). Given what we observe in our sample, is it likely that there is no difference in loan amounts for men and women? A common threshold for “rejecting” our assumed hypothesis that \(\mu_{m} - \mu_{w} = 0\) is observing a sample that would occur with probability \(.05\) or less if that hypothesis were true (that is, a very unlikely sample). Would we reject the hypothesis that there is no difference in average loan amounts between men and women?
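
The pnorm() hint can be illustrated with a minimal R sketch. The function name two_sided_tail and the numbers passed to it are hypothetical, chosen only for illustration; they are not values from the subprime data, and the standard deviation of the sampling distribution is taken as a given input rather than derived.

```r
# Two-sided tail probability for a statistic that is approximately
# Normal(0, sd^2). Names and inputs are illustrative assumptions.
two_sided_tail <- function(observed, sd) {
  # pnorm(q, mean, sd) returns P(Z <= q); by symmetry around 0,
  # P(|Z| >= |observed|) is twice the lower tail at -|observed|.
  2 * pnorm(-abs(observed), mean = 0, sd = sd)
}

# Made-up example: an observed value of 3 with a sampling sd of 2
p_example <- two_sided_tail(observed = 3, sd = 2)
print(p_example)
```

Because the normal distribution is symmetric, the sign of the observed value does not matter here; only its distance from zero does.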

Question 2 (20 points)

In this problem, we will explore the implications of the Central Limit Theorem for uncertainty estimation and hypothesis testing. Start by creating two variables, X1 and X2, using the following code:

set.seed(02139)
X1 <- rnorm(100000, 5, 2)
X2 <- rexp(100000, 0.2)

For the purposes of this problem, we will treat these variables (each with 100,000 elements) as the full population. We will take samples from these two datasets to evaluate the coverage probability of 95% confidence intervals for the population mean using different types of data and different sample sizes.

(a) Plot and describe the full distributions of X1 and X2. What is the mean of each random variable?
(b) Now, create a loop to take 100 samples of size 8 from each dataset and record the sample mean and 95% confidence interval bounds for each sample. Plot your results for each dataset, making sure your plots show the simulated confidence intervals that include the population mean in a different color than the confidence intervals that do not include the population mean. What is the coverage probability for each variable (in other words, what proportion of your confidence intervals for the mean include the true population mean)? Compare your results for X1 and X2. Are they similar or different? Why?
(c) Repeat the simulation in (b) for samples of size 8, 20, 50, and 500, and increase your number of simulations to 1000. Report the coverage probability for each of your eight simulations in a table (you do not need to create additional plots). How do your results change? What differences do you see between X1 and X2?
(d) Interpret your findings in parts (b) and (c). How does the central limit theorem explain what you see in the simulations?
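
As a rough illustration of what a coverage simulation looks like, here is a minimal R sketch on a different, hypothetical population (uniform on 0 to 10). The seed, the population, the sample size, the number of simulations, and the use of a normal-approximation interval are all assumptions made for this example; a t-based interval would be another reasonable choice.

```r
set.seed(1)  # arbitrary seed for this illustration only

# Hypothetical population, unrelated to X1 and X2
population <- runif(100000, min = 0, max = 10)
pop_mean <- mean(population)

n_sims <- 200  # number of repeated samples (illustrative)
n_obs  <- 30   # size of each sample (illustrative)
covered <- logical(n_sims)

for (i in seq_len(n_sims)) {
  draw <- sample(population, size = n_obs)
  se <- sd(draw) / sqrt(n_obs)
  # Normal-approximation 95% confidence interval for the mean
  lower <- mean(draw) - qnorm(0.975) * se
  upper <- mean(draw) + qnorm(0.975) * se
  # TRUE when the interval contains the population mean
  covered[i] <- lower <= pop_mean && pop_mean <= upper
}

# Share of intervals that contain the population mean
coverage <- mean(covered)
print(coverage)
```

Storing the lower and upper bounds in vectors instead of overwriting them would make it straightforward to plot each interval and color it by the value of covered.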

Question 3 (20 points)

Probability distributions have moments, which are standard expressions that describe their shape, some in ways you’ve already heard of and others more nuanced (the variance, the skew, the kurtosis, etc.). Describing a population distribution (or empirical sample distribution) in terms of its moments is really useful in social science (e.g., the skew of income in the U.S. population is positive). Specifically, the \(n\)th central moment of a random variable \(X\) is defined as \(E[(X-E[X])^n]\), but it is more common to work with the \(n\)th raw moment, defined as \(E[X^n]\) (dropping the centering by \(E[X]\)).
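
As a brief illustration of how the two kinds of moments relate (standard identities, stated here with the shorthand \(\mu = E[X]\), which is notation introduced only for this example), expanding the square and the cube and applying linearity of expectation gives:

\[ \begin{aligned} E[(X-\mu)^2] &= E[X^2] - \mu^2, \\ E[(X-\mu)^3] &= E[X^3] - 3\mu E[X^2] + 2\mu^3. \end{aligned} \]

In other words, the central moments can be recovered from the raw moments of the same order and below.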

Suppose the random variable \(X\) for your population has the following first four moments: \(E[X]=1/2\), \(E[X^2]=1/2\), \(E[X^3]=3/4\), \(E[X^4]=3/2\). Suppose you took an i.i.d. sample \(\{X_1,\ldots,X_{20}\}\) of size 20 from this distribution. Let \(T=(X^2_1 + \cdots + X^2_{20})/20 = \overline{X^2}\), an estimator of the second moment.

(a) What are \(E[T]\) and \(V(T)\)? Be sure to explain why.
(b) Use the central limit theorem to approximate the probability (in R) that \(T\) is less than or equal to 1.

Question 4 (20 points)

In class we learned that if the variance of a sequence of random variables with finite mean goes to zero as \(n\to\infty\), then the sequence will converge in probability to some value. But this is a sufficient condition, not a necessary one. To see this, consider the sequence of random variables \(X_n\) with probability distribution:

\[ X_n = \begin{cases} 0 & \text{with probability } 1 - 1/n \\ n & \text{with probability } 1/n \\ \end{cases} \]

(a) Find \(\mathbb{E}[X_n]\).
(b) Find \(\text{Var}(X_n)\). Does the variance of the sequence grow or shrink as \(n\) grows?
(c) Use the definition of convergence in probability to show that \(X_n \xrightarrow{p} 0\).
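
A simulation can build intuition for this sequence, although it is not a substitute for the proof. The R sketch below is a minimal illustration; the helper name draw_xn, the seed, the tolerance eps, and the chosen values of n are all assumptions made for this example.

```r
set.seed(1)  # arbitrary seed for this illustration only

# Draw `reps` independent copies of X_n:
# X_n equals n with probability 1/n and 0 otherwise.
draw_xn <- function(n, reps) {
  ifelse(runif(reps) < 1 / n, n, 0)
}

eps  <- 0.5     # illustrative tolerance
reps <- 100000  # number of simulated draws per value of n

for (n in c(10, 100, 1000)) {
  x <- draw_xn(n, reps)
  # Empirical share of draws farther than eps from 0
  cat("n =", n, " share with |X_n| > eps:", mean(abs(x) > eps), "\n")
}
```

Comparing the printed shares with the sample variance of the same draws, var(x), shows the two quantities moving in different ways as n grows.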

Question 5 (20 points)

Suppose that \(X_1, X_2, \ldots, X_n\) are an iid sample from the following distribution:

\[ \begin{aligned} f_X(x) = \frac{1}{2}(1 + \theta x), \quad -1 < x < 1, \ -1 < \theta < 1. \end{aligned} \]

Show that \(3\bar{X}_n\) is a consistent estimator of \(\theta\). Hint: are there finite-sample properties of this estimator that can help establish consistency?
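
To see the estimator's behavior numerically (which does not replace the proof), one option is to simulate from \(f_X\) and watch \(3\bar{X}_n\) as \(n\) grows. The R sketch below is one possible approach using rejection sampling with a uniform proposal on \((-1, 1)\); the helper name draw_fx, the seed, and the value theta = 0.4 are assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed for this illustration only

# Rejection sampler for f(x) = (1 + theta * x) / 2 on (-1, 1).
# Proposal: Uniform(-1, 1), density 1/2. Since f(x) / (1/2) = 1 + theta * x
# is at most 1 + |theta|, a proposal x is accepted with probability
# (1 + theta * x) / (1 + |theta|).
draw_fx <- function(n, theta) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- runif(n, min = -1, max = 1)
    keep <- runif(n) < (1 + theta * x) / (1 + abs(theta))
    out <- c(out, x[keep])
  }
  out[seq_len(n)]
}

theta <- 0.4  # hypothetical true parameter value
for (n in c(100, 10000, 100000)) {
  cat("n =", n, " 3 * sample mean:", 3 * mean(draw_fx(n, theta)), "\n")
}
```

Rejection sampling is used here only because it is short to write; inverting the CDF would be an equally valid way to generate the draws.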