The data set families contains information about 43,886 families living in the city of Cyberville. The city has four regions: the Northern region has 10,149 families, the Eastern region has 10,390 families, the Southern region has 13,457 families, and the Western region has 9,890 families. For each family, the following information is recorded:
1. Family type
1: Husband-wife family
2: Male-head family
3: Female-head family
2. Number of persons in family
3. Number of children in family
4. Family income
5. Region
6. Education level of head of household
31: Less than 1st grade
32: 1st, 2nd, 3rd, or 4th grade
33: 5th or 6th grade
34: 7th or 8th grade
35: 9th grade
36: 10th grade
37: 11th grade
38: 12th grade, no diploma
39: High school graduate, high school diploma, or equivalent
40: Some college but no degree
41: Associate degree in college (occupation/vocation program)
42: Associate degree in college (academic program)
43: Bachelor’s degree (e.g., B.S., B.A., A.B.)
44: Master’s degree (e.g., M.S., M.A., M.B.A.)
45: Professional school degree (e.g., M.D., D.D.S., D.V.M., LL.B., J.D.)
46: Doctoral degree (e.g., Ph.D., Ed.D.)
In these exercises, you will try to learn about the families of Cyberville by using sampling.
a. Take a simple random sample of 500 families. Estimate the following population parameters, calculate the estimated standard errors of these estimates, and form 95% confidence intervals:
i. The proportion of female-headed families
ii. The average number of children per family
iii. The proportion of heads of households who did not receive a high school diploma
iv. The average family income
Repeat the preceding estimates for five different simple random samples of size 500 and compare the results.
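The calculations in part (a) could be organized along the following lines. This is only a sketch: the population below is made up (the field names `type`, `children`, `income`, and `education` and all simulated values are assumptions, not the actual families data), and the standard errors use the usual simple-random-sampling formulas with the finite population correction, s^2/n (1 - n/N) for a mean and p(1 - p)/(n - 1) (1 - n/N) for a proportion.

```python
import math
import random


def srs_mean_ci(values, N, z=1.96):
    """Mean, estimated standard error, and approximate 95% CI from a simple
    random sample `values` drawn without replacement from N units."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((x - mean) ** 2 for x in values) / (n - 1)  # sample variance
    se = math.sqrt(s2 / n * (1 - n / N))  # finite population correction
    return mean, se, (mean - z * se, mean + z * se)


def srs_proportion_ci(indicators, N, z=1.96):
    """Same idea for a proportion; `indicators` holds 0/1 (or bool) values."""
    n = len(indicators)
    p = sum(indicators) / n
    se = math.sqrt(p * (1 - p) / (n - 1) * (1 - n / N))
    return p, se, (p - z * se, p + z * se)


# Hypothetical stand-in population: made-up values, NOT the families data.
rng = random.Random(0)
N = 43886
population = [
    {
        "type": rng.choice([1, 1, 1, 2, 3]),      # 3 = female-head family
        "children": rng.randint(0, 4),
        "income": rng.lognormvariate(10.5, 0.7),
        "education": rng.randint(31, 46),         # codes 31-38: no diploma
    }
    for _ in range(N)
]

sample = rng.sample(population, 500)  # simple random sample, no replacement
results = {
    "female_head": srs_proportion_ci([f["type"] == 3 for f in sample], N),
    "children": srs_mean_ci([f["children"] for f in sample], N),
    "no_diploma": srs_proportion_ci([f["education"] <= 38 for f in sample], N),
    "income": srs_mean_ci([f["income"] for f in sample], N),
}
for name, (est, se, (lo, hi)) in results.items():
    print(f"{name}: estimate={est:.3f}, se={se:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Rerunning the last part with five different seeds gives the five samples asked for; with the real data, the list comprehension building `population` would be replaced by whatever code reads the file.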
b. Take 100 samples of size 400.
i. For each sample, find the average family income.
ii. Find the average and standard deviation of these 100 estimates and make a histogram of the estimates.
iii. Superimpose a plot of a normal density with that mean and standard deviation of the histogram and comment on how well it appears to fit.
iv. Plot the empirical cumulative distribution function (see Section 10.2). On this plot, superimpose the normal cumulative distribution function with mean and standard deviation as earlier. Comment on the fit.
v. Another method for examining a normal approximation is via a normal probability plot (Section 9.9). Make such a plot and comment on what it shows about the approximation.
vi. For each of the 100 samples, find a 95% confidence interval for the population average income. How many of those intervals actually contain the population target?
vii. Take 100 samples of size 100. Compare the averages, standard deviations, and histograms to those obtained for a sample of size 400 and explain how the theory of simple random sampling relates to the comparisons.
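The numerical side of part (b) might be sketched as follows, again on a made-up income population (the lognormal parameters are an assumption chosen only to give a skewed distribution). The helper names `sample_means_and_coverage`, `theoretical_se`, and `max_cdf_gap` are hypothetical. The last one reports the largest vertical distance between the empirical cumulative distribution function of the estimates and the fitted normal cumulative distribution function, as a numerical companion to the plots rather than a replacement for them.

```python
import math
import random
import statistics


def sample_means_and_coverage(population, n, reps, rng, z=1.96):
    """Draw `reps` simple random samples of size n; return the sample means
    and the number of 95% intervals that contain the population mean."""
    N = len(population)
    mu = statistics.fmean(population)
    means, covered = [], 0
    for _ in range(reps):
        s = rng.sample(population, n)
        m = statistics.fmean(s)
        se = statistics.stdev(s) / math.sqrt(n) * math.sqrt(1 - n / N)
        means.append(m)
        covered += (m - z * se) <= mu <= (m + z * se)
    return means, covered


def theoretical_se(population, n):
    """Exact standard error of the sample mean under simple random sampling."""
    N = len(population)
    sigma = statistics.pstdev(population)
    return sigma / math.sqrt(n) * math.sqrt(1 - (n - 1) / (N - 1))


def max_cdf_gap(values):
    """Largest gap between the ECDF of `values` and a normal CDF with the
    same mean and standard deviation."""
    fitted = statistics.NormalDist(statistics.fmean(values),
                                   statistics.stdev(values))
    xs = sorted(values)
    k = len(xs)
    return max(
        max(abs((i + 1) / k - fitted.cdf(x)), abs(i / k - fitted.cdf(x)))
        for i, x in enumerate(xs)
    )


# Hypothetical incomes: made-up values, NOT the families data.
rng = random.Random(1)
incomes = [rng.lognormvariate(10.5, 0.7) for _ in range(43886)]

means_400, covered_400 = sample_means_and_coverage(incomes, 400, 100, rng)
means_100, covered_100 = sample_means_and_coverage(incomes, 100, 100, rng)

for n, means, covered in ((400, means_400, covered_400),
                          (100, means_100, covered_100)):
    print(f"n={n}: mean of estimates={statistics.fmean(means):.1f}, "
          f"sd of estimates={statistics.stdev(means):.1f}, "
          f"theoretical se={theoretical_se(incomes, n):.1f}, "
          f"intervals covering the mean={covered}/100, "
          f"max ECDF-normal gap={max_cdf_gap(means):.3f}")
```

Theory says the standard deviation of the estimates should be roughly halved when the sample size goes from 100 to 400, since the standard error is proportional to 1/sqrt(n) apart from the small finite population correction.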
c. For a simple random sample of 500, compare the incomes of the three family types by comparing histograms and boxplots (see Section 10.6).
d. Take simple random samples of size 400 from each of the four regions.
i. Compare the incomes by region by making parallel boxplots.
ii. Does it appear that some regions have larger families than others?
iii. Are there differences in education level among the four regions?
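Before drawing the parallel boxplots in part (d), it can help to tabulate the numbers a boxplot displays. The sketch below computes a five-number summary for a sample of 400 from each region. The regional income populations are made up (the lognormal parameters are assumptions), and the quartile convention (`method="inclusive"`) is one of several in common use, so plotting software may report slightly different quartiles.

```python
import random
import statistics


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


# Region sizes from the text; the incomes themselves are hypothetical.
region_sizes = {"North": 10149, "East": 10390, "South": 13457, "West": 9890}
made_up_log_means = {"North": 10.2, "East": 10.5, "South": 10.8, "West": 10.4}

rng = random.Random(3)
summaries = {}
for region, size in region_sizes.items():
    incomes = [rng.lognormvariate(made_up_log_means[region], 0.6)
               for _ in range(size)]
    sample = rng.sample(incomes, 400)  # simple random sample within region
    summaries[region] = five_number_summary(sample)

for region, summary in summaries.items():
    print(region, [round(v) for v in summary])
```

The same helper applies to the number of persons per family in part (ii); education level is a coded category, so a table of counts by region is likely more informative there than a summary of the codes.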
e. Formulate a question of your choice and attempt to answer it with a simple random sample of size 400.
f. Does stratification help in estimating the average family income? From a simple random sample of size 400, estimate the average income and also the standard error of your estimate. Form a 95% confidence interval. Next, allocate the 400 observations proportionally to the four regions and estimate the average income from the stratified sample. Estimate the standard error and form a 95% confidence interval. Compare your results to the results of the simple random sample.
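One way to set up the stratified estimate in part (f) is sketched below. With proportional allocation, stratum l receives about n N_l / N observations; the estimate is the weighted average of the stratum sample means with weights W_l = N_l / N, and its estimated variance is the sum over strata of W_l^2 (s_l^2 / n_l)(1 - n_l / N_l). The function names are hypothetical, the rounding rule (largest remainder) is one reasonable choice among several, and the regional incomes are made up for illustration.

```python
import math
import random
import statistics


def proportional_allocation(stratum_sizes, n):
    """Split n across strata in proportion to size, rounding by the
    largest-remainder rule so the allocations sum to exactly n."""
    N = sum(stratum_sizes)
    quotas = [n * Nl / N for Nl in stratum_sizes]
    alloc = [math.floor(q) for q in quotas]
    leftover = n - sum(alloc)
    by_remainder = sorted(range(len(quotas)),
                          key=lambda i: quotas[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


def stratified_estimate(strata_samples, stratum_sizes, z=1.96):
    """Stratified mean, estimated standard error, and approximate 95% CI."""
    N = sum(stratum_sizes)
    est, var = 0.0, 0.0
    for sample, Nl in zip(strata_samples, stratum_sizes):
        nl = len(sample)
        W = Nl / N
        est += W * statistics.fmean(sample)
        var += W ** 2 * statistics.variance(sample) / nl * (1 - nl / Nl)
    se = math.sqrt(var)
    return est, se, (est - z * se, est + z * se)


sizes = [10149, 10390, 13457, 9890]  # North, East, South, West (from the text)

# Hypothetical regional incomes: made-up values, NOT the families data.
rng = random.Random(2)
regions = [[rng.lognormvariate(mu, 0.6) for _ in range(Nl)]
           for mu, Nl in zip([10.2, 10.5, 10.8, 10.4], sizes)]

allocation = proportional_allocation(sizes, 400)
strata_samples = [rng.sample(r, nl) for r, nl in zip(regions, allocation)]
est, se, ci = stratified_estimate(strata_samples, sizes)

# Simple random sample of the same total size, for comparison.
pooled = [x for r in regions for x in r]
srs = rng.sample(pooled, 400)
srs_mean = statistics.fmean(srs)
srs_se = statistics.stdev(srs) / math.sqrt(400) * math.sqrt(1 - 400 / len(pooled))

print("allocation:", allocation)
print(f"stratified: {est:.1f} (se {se:.1f}), 95% CI ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"SRS:        {srs_mean:.1f} (se {srs_se:.1f})")
```

How much stratification helps depends on how much the regional means differ relative to the spread within regions; the made-up populations here differ by construction, so the comparison on the real data may look quite different.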