master fee25091fb24 cached
543 files
5.2 MB
1.4M tokens
1 requests
Download .txt
Showing preview only (5,580K chars total). Download the full file or copy to clipboard to get everything.
Repository: OpenIntroOrg/openintro-statistics
Branch: master
Commit: fee25091fb24
Files: 543
Total size: 5.2 MB

Directory structure:
gitextract_luzkuarl/

├── .gitignore
├── LICENSE.md
├── README.md
├── ch_distributions/
│   ├── TeX/
│   │   ├── binomial_distribution.tex
│   │   ├── ch_distributions.tex
│   │   ├── geometric_distribution.tex
│   │   ├── negative_binomial_distribution.tex
│   │   ├── normal_distribution.tex
│   │   ├── poisson_distribution.tex
│   │   └── review_exercises.tex
│   └── figures/
│       ├── 6895997/
│       │   └── 6895997.R
│       ├── amiIncidencesOver100Days/
│       │   └── amiIncidencesOver100Days.R
│       ├── between59And62/
│       │   └── between59And62.R
│       ├── eoce/
│       │   ├── GRE_intro/
│       │   │   └── gre_intro.R
│       │   ├── area_under_curve_1/
│       │   │   └── area_under_curve_1.R
│       │   ├── area_under_curve_2/
│       │   │   └── area_under_curve_2.R
│       │   ├── college_fem_heights/
│       │   │   └── college_fem_heights.R
│       │   └── stats_scores/
│       │       └── stats_scores.R
│       ├── fcidMHeights/
│       │   ├── fcidMHeights-helpers.R
│       │   └── fcidMHeights.R
│       ├── fourBinomialModelsShowingApproxToNormal/
│       │   └── fourBinomialModelsShowingApproxToNormal.R
│       ├── geometricDist35/
│       │   └── geometricDist35.R
│       ├── geometricDist70/
│       │   └── geometricDist70.R
│       ├── height40Perc/
│       │   └── height40Perc.R
│       ├── height82Perc/
│       │   └── height82Perc.R
│       ├── mikeAndJosePercentiles/
│       │   └── mikeAndJosePercentiles.R
│       ├── nbaNormal/
│       │   ├── nbaNormal-helpers.R
│       │   └── nbaNormal.R
│       ├── normApproxToBinomFail/
│       │   └── normApproxToBinomFail.R
│       ├── normalExamples/
│       │   ├── normalExamples-helpers.R
│       │   └── normalExamples.R
│       ├── normalQuantileExer/
│       │   ├── QQNorm.R
│       │   ├── normalQuantileExer-data.R
│       │   ├── normalQuantileExer.R
│       │   └── normalQuantileExerAdditional.R
│       ├── normalTails/
│       │   └── normalTails.R
│       ├── pokerNormal/
│       │   └── pokerNormal.R
│       ├── satAbove1190/
│       │   └── satAbove1190.R
│       ├── satActNormals/
│       │   └── satActNormals.R
│       ├── satBelow1030/
│       │   └── satBelow1030.R
│       ├── satBelow1300/
│       │   └── satBelow1300.R
│       ├── simpleNormal/
│       │   └── simpleNormal.R
│       ├── smallNormalTails/
│       │   └── smallNormalTails.R
│       ├── standardNormal/
│       │   └── standardNormal.R
│       ├── subtracting2Areas/
│       │   └── subtracting2Areas.R
│       ├── subtractingArea/
│       │   └── subtractingArea.R
│       ├── twoSampleNormals/
│       │   └── twoSampleNormals.R
│       └── twoSampleNormalsStacked/
│           └── twoSampleNormalsStacked.R
├── ch_foundations_for_inf/
│   ├── TeX/
│   │   ├── ch_foundations_for_inf.tex
│   │   ├── confidence_intervals.tex
│   │   ├── hypothesis_testing.tex
│   │   ├── one_sided_tests.tex
│   │   ├── review_exercises.tex
│   │   └── variability_in_estimates.tex
│   └── figures/
│       ├── 95PercentConfidenceInterval/
│       │   └── 95PercentConfidenceInterval.R
│       ├── ARCHIVE/
│       │   └── sampling_10k_prop_56p/
│       │       └── sampling_10k_prop_56p.R
│       ├── arrayOfFigureAreasForChiSquareDistribution/
│       │   ├── chiSquareAreaAbove10WithDF4/
│       │   │   └── chiSquareAreaAbove10WithDF4.R
│       │   ├── chiSquareAreaAbove11Point7WithDF7/
│       │   │   └── chiSquareAreaAbove11Point7WithDF7.R
│       │   ├── chiSquareAreaAbove4Point3WithDF2/
│       │   │   └── chiSquareAreaAbove4WithDF2.R
│       │   ├── chiSquareAreaAbove5Point1WithDF5/
│       │   │   └── chiSquareAreaAbove5Point1WithDF5.R
│       │   ├── chiSquareAreaAbove6Point25WithDF3/
│       │   │   └── chiSquareAreaAbove6Point25WithDF3.R
│       │   └── chiSquareAreaAbove9Point21WithDF3/
│       │       └── chiSquareAreaAbove9Point21WithDF3.R
│       ├── bladesTwoSampleHTPValueQC/
│       │   └── bladesTwoSampleHTPValueQC.R
│       ├── business_one_sided_20_21-p_value/
│       │   └── business_one_sided_20_21-p_value.R
│       ├── chiSquareDistributionWithInceasingDF/
│       │   └── chiSquareDistributionWithInceasingDF.R
│       ├── choosingZForCI/
│       │   └── choosingZForCI.R
│       ├── clt_prop_grid/
│       │   └── clt_prop_grid.R
│       ├── communityCollegeClaimedHousingExpenseDistribution/
│       │   └── communityCollegeClaimedHousingExpenseDistribution.R
│       ├── eoce/
│       │   ├── adult_heights/
│       │   │   └── adult_heights.R
│       │   ├── age_at_first_marriage_intro/
│       │   │   └── age_at_first_marriage_intro.R
│       │   ├── assisted_reproduction_one_sample_randomization/
│       │   │   └── assisted_reproduction_one_sample_randomization.R
│       │   ├── cflbs/
│       │   │   └── cflbs.R
│       │   ├── college_credits/
│       │   │   └── college_credits.R
│       │   ├── egypt_revolution_one_sample_randomization/
│       │   │   └── egypt_revolution_one_sample_randomization.R
│       │   ├── exclusive_relationships/
│       │   │   ├── exclusive_relationships.R
│       │   │   └── survey.csv
│       │   ├── gifted_children_ht/
│       │   │   └── gifted_children_ht.R
│       │   ├── gifted_children_intro/
│       │   │   └── gifted_children_intro.R
│       │   ├── identify_dist_ls_pop/
│       │   │   └── identify_dist_ls_pop.R
│       │   ├── identify_dist_symm_pop/
│       │   │   └── identify_dist_symm_pop.R
│       │   ├── pennies_ages/
│       │   │   ├── penniesAges.Rda
│       │   │   └── pennies_ages.R
│       │   ├── penny_weights/
│       │   │   └── penny_weights.R
│       │   ├── social_experiment_two_sample_randomization/
│       │   │   └── social_experiment_two_sample_randomization.R
│       │   ├── songs_on_ipod/
│       │   │   └── songs_on_ipod.R
│       │   ├── thanksgiving_spending_intro/
│       │   │   └── thanksgiving_spending_intro.R
│       │   └── yawning_two_sample_randomization/
│       │       └── yawning_two_sample_randomization.R
│       ├── geomFitEvaluationForSP500For1990To2011/
│       │   └── geomFitEvaluationForSP500For1990To2011.R
│       ├── geomFitPValueForSP500For1990To2011/
│       │   └── geomFitPValueForSP500For1990To2011.R
│       ├── googleHTForDiffAlgPerformancePValue/
│       │   └── googleHTForDiffAlgPerformancePValue.R
│       ├── helpers.R
│       ├── jurorHTPValueShown/
│       │   └── jurorHTPValueShown.R
│       ├── mammograms/
│       │   └── mammograms.R
│       ├── normal_dist_mean_500_se_016/
│       │   └── normal_dist_mean_500_se_016.R
│       ├── nuclearArmsReduction/
│       │   └── nuclearArmsReduction.R
│       ├── p-hat_from_53_and_59-not-used/
│       │   └── p-hat_from_53_and_59.R
│       ├── p-hat_from_53_and_59_computation/
│       │   ├── NormTailsCalc.R
│       │   └── p-hat_from_53_and_59_computation.R
│       ├── p-hat_from_867_and_907-not-used/
│       │   └── p-hat_from_867_and_907.R
│       ├── p-hat_from_86_and_90/
│       │   └── p-hat_from_86_and_90.R
│       ├── quadcopter/
│       │   └── quadcopter_attribution.txt
│       ├── sampling_100_prop_X/
│       │   └── sampling_100_prop_X.R
│       ├── sampling_10_prop_25p/
│       │   ├── sampling_10_prop_25p - one figure.R
│       │   └── sampling_10_prop_25p.R
│       ├── sampling_10k_prop_887p/
│       │   └── sampling_10k_prop_887p.R
│       ├── sampling_10k_prop_88p/
│       │   └── sampling_10k_prop_88p.R
│       ├── sampling_5k_prop_50p/
│       │   └── sampling_5k_prop_50p.R
│       ├── sampling_X_prop_56p/
│       │   └── sampling_X_prop_56p.R
│       ├── sulphStudyFindPValueUsingNormalApprox/
│       │   └── sulphStudyFindPValueUsingNormalApprox.R
│       └── whyWeWantPValue/
│           └── whyWeWantPValue.R
├── ch_inference_for_means/
│   ├── TeX/
│   │   ├── ch_inference_for_means.tex
│   │   ├── comparing_many_means_with_anova.tex
│   │   ├── difference_of_two_means.tex
│   │   ├── one-sample_means_with_the_t-distribution.tex
│   │   ├── paired_data.tex
│   │   ├── power_calculations_for_a_difference_of_means.tex
│   │   └── review_exercises.tex
│   └── figures/
│       ├── babySmokePlotOfTwoGroupsToExamineSkew/
│       │   └── babySmokePlotOfTwoGroupsToExamineSkew.R
│       ├── cbrRunTimesMenWomen/
│       │   └── cbrRunTimesMenWomen.R
│       ├── classData/
│       │   └── classData.R
│       ├── distOfDiffOfSampleMeansForBWOfBabySmokeData/
│       │   └── distOfDiffOfSampleMeansForBWOfBabySmokeData.R
│       ├── eoce/
│       │   ├── adult_heights/
│       │   │   └── adult_heights.R
│       │   ├── age_at_first_marriage_intro/
│       │   │   └── age_at_first_marriage_intro.R
│       │   ├── anova_exercise_1/
│       │   │   └── anova_exercise_1.R
│       │   ├── chick_wts_anova/
│       │   │   └── chick_wts.R
│       │   ├── chick_wts_linseed_horsebean/
│       │   │   └── chick_wts.R
│       │   ├── child_care_hours/
│       │   │   ├── child_care_hours.R
│       │   │   └── china.csv
│       │   ├── cleveland_sacramento/
│       │   │   └── cleveland_sacramento.R
│       │   ├── college_credits/
│       │   │   └── college_credits.R
│       │   ├── diamonds_1/
│       │   │   └── diamonds.R
│       │   ├── exclusive_relationships/
│       │   │   ├── exclusive_relationships.R
│       │   │   └── survey.csv
│       │   ├── friday_13th_accident/
│       │   │   └── friday_13th_accident.R
│       │   ├── friday_13th_traffic/
│       │   │   └── friday_13th_traffic.R
│       │   ├── fuel_eff_city/
│       │   │   ├── fuel_eff.csv
│       │   │   └── fuel_eff_city.R
│       │   ├── fuel_eff_hway/
│       │   │   ├── fuel_eff.csv
│       │   │   └── fuel_eff_hway.R
│       │   ├── gifted_children/
│       │   │   └── gifted_children.R
│       │   ├── gifted_children_ht/
│       │   │   └── gifted_children_ht.R
│       │   ├── gifted_children_intro/
│       │   │   └── gifted_children_intro.R
│       │   ├── global_warming_v2_1/
│       │   │   └── global_warming_v2_1.R
│       │   ├── gpa_major/
│       │   │   ├── gpa_major.R
│       │   │   └── survey.csv
│       │   ├── hs_beyond_1/
│       │   │   └── hs_beyond.R
│       │   ├── oscar_winners/
│       │   │   └── oscar_winners.R
│       │   ├── prison_isolation_T/
│       │   │   ├── prison_isolation.R
│       │   │   └── prison_isolation.csv
│       │   ├── prius_fuel_efficiency/
│       │   │   └── prius_fuel_efficiency.R
│       │   ├── prius_fuel_efficiency_update/
│       │   │   └── prius_fuel_efficiency.R
│       │   ├── t_distribution/
│       │   │   └── t_distribution.R
│       │   ├── torque_on_rusty_bolt/
│       │   │   ├── torque_on_rusty_bolt (Autosaved).R
│       │   │   └── torque_on_rusty_bolt.R
│       │   └── work_hours_education/
│       │       ├── gss2010.Rda
│       │       └── work_hours_education.R
│       ├── fDist2And423/
│       │   └── fDist2And423.R
│       ├── fDist3And323/
│       │   └── fDist3And323.R
│       ├── mlbANOVA/
│       │   └── mlbANOVA.R
│       ├── outliers_and_ss_condition/
│       │   └── outliers_and_ss_condition.R
│       ├── pValueOfTwoTailAreaOfExamVersionsWhereDFIs26/
│       │   └── pValueOfTwoTailAreaOfExamVersionsWhereDFIs26.R
│       ├── pValueShownForSATHTOfOver100PtGain/
│       │   └── pValueShownForSATHTOfOver100PtGain.R
│       ├── power_best_sample_size/
│       │   └── power_best_sample_size.R
│       ├── power_curve/
│       │   └── power_curve.R
│       ├── power_null_0_0-76/
│       │   └── power_null_0_0-76.R
│       ├── power_null_0_1-7/
│       │   └── power_null_0_1-7.R
│       ├── rissosDolphin/
│       │   └── ReadMe.txt
│       ├── run10SampTimeHistogram/
│       │   └── run10SampTimeHistogram.R
│       ├── satImprovementHTDataHistogram/
│       │   └── satImprovementHTDataHistogram.R
│       ├── stemCellTherapyForHearts/
│       │   └── stemCellTherapyForHearts.R
│       ├── stemCellTherapyForHeartsPValue/
│       │   └── stemCellTherapyForHeartsPValue.R
│       ├── tDistAppendixTwoEx/
│       │   └── tDistAppendixTwoEx.R
│       ├── tDistCompareToNormalDist/
│       │   └── tDistCompareToNormalDist.R
│       ├── tDistConvergeToNormalDist/
│       │   └── tDistConvergeToNormalDist.R
│       ├── tDistDF18LeftTail2Point10/
│       │   └── tDistDF18LeftTail2Point10.R
│       ├── tDistDF20RightTail1Point65/
│       │   └── tDistDF20RightTail1Point65.R
│       ├── textbooksF18/
│       │   ├── diffInTextbookPricesF18.R
│       │   └── textbooksF18HTTails.R
│       ├── textbooksS10/
│       │   ├── diffInTextbookPricesS10.R
│       │   └── textbooksS10HTTails.R
│       ├── textbooks_scatter/
│       │   └── textbooks_scatter.R
│       └── toyANOVA/
│           └── toyANOVA.R
├── ch_inference_for_props/
│   ├── TeX/
│   │   ├── ch_inference_for_props.tex
│   │   ├── difference_of_two_proportions.tex
│   │   ├── inference_for_a_single_proportion.tex
│   │   ├── review_exercises.tex
│   │   ├── testing_for_goodness_of_fit_using_chi-square.tex
│   │   └── testing_for_independence_in_two-way_tables.tex
│   └── figures/
│       ├── arrayOfFigureAreasForChiSquareDistribution/
│       │   ├── chiSquareAreaAbove10WithDF4/
│       │   │   └── chiSquareAreaAbove10WithDF4.R
│       │   ├── chiSquareAreaAbove11Point7WithDF7/
│       │   │   └── chiSquareAreaAbove11Point7WithDF7.R
│       │   ├── chiSquareAreaAbove4Point3WithDF2/
│       │   │   └── chiSquareAreaAbove4WithDF2.R
│       │   ├── chiSquareAreaAbove5Point1WithDF5/
│       │   │   └── chiSquareAreaAbove5Point1WithDF5.R
│       │   ├── chiSquareAreaAbove6Point25WithDF3/
│       │   │   └── chiSquareAreaAbove6Point25WithDF3.R
│       │   └── chiSquareAreaAbove9Point21WithDF3/
│       │       └── chiSquareAreaAbove9Point21WithDF3.R
│       ├── bladesTwoSampleHTPValueQC/
│       │   └── bladesTwoSampleHTPValueQC.R
│       ├── chiSquareDistributionWithInceasingDF/
│       │   └── chiSquareDistributionWithInceasingDF.R
│       ├── eoce/
│       │   ├── assisted_reproduction_one_sample_randomization/
│       │   │   └── assisted_reproduction_one_sample_randomization.R
│       │   ├── egypt_revolution_one_sample_randomization/
│       │   │   └── egypt_revolution_one_sample_randomization.R
│       │   ├── social_experiment_two_sample_randomization/
│       │   │   └── social_experiment_two_sample_randomization.R
│       │   └── yawning_two_sample_randomization/
│       │       └── yawning_two_sample_randomization.R
│       ├── geomFitEvaluationForSP500/
│       │   ├── geomFitEvaluationForSP500.R
│       │   └── sp500_1950_2018.csv
│       ├── geomFitPValueForSP500/
│       │   └── geomFitPValueForSP500.R
│       ├── iPodChiSqTail/
│       │   └── iPodChiSqTail.R
│       ├── jurorHTPValueShown/
│       │   └── jurorHTPValueShown.R
│       ├── mammograms/
│       │   └── mammograms.R
│       ├── paydayCC_norm_pvalue/
│       │   └── paydayCC_norm_pvalue.R
│       └── quadcopter/
│           └── quadcopter_attribution.txt
├── ch_intro_to_data/
│   ├── TeX/
│   │   ├── case_study_using_stents_to_prevent_strokes.tex
│   │   ├── ch_intro_to_data.tex
│   │   ├── data_basics.tex
│   │   ├── experiments.tex
│   │   ├── review_exercises.tex
│   │   └── sampling_principles_and_strategies.tex
│   └── figures/
│       ├── county_fed_spendVsPoverty/
│       │   └── county_fed_spendVsPoverty.R
│       ├── eoce/
│       │   ├── air_quality_durham/
│       │   │   ├── air_quality_durham.R
│       │   │   └── pm25_2011_durham.csv
│       │   ├── airports/
│       │   │   ├── airports.R
│       │   │   └── data/
│       │   │       └── cb_2013_us_state_20m/
│       │   │           ├── cb_2013_us_state_20m.dbf
│       │   │           ├── cb_2013_us_state_20m.prj
│       │   │           ├── cb_2013_us_state_20m.shp
│       │   │           ├── cb_2013_us_state_20m.shp.iso.xml
│       │   │           ├── cb_2013_us_state_20m.shp.xml
│       │   │           ├── cb_2013_us_state_20m.shx
│       │   │           └── state_20m.ea.iso.xml
│       │   ├── antibiotic_use_children/
│       │   │   └── antibiotic_use_children.R
│       │   ├── association_plots/
│       │   │   └── association_plots.R
│       │   ├── cleveland_sacramento/
│       │   │   └── cleveland_sacramento.R
│       │   ├── county_commute_times/
│       │   │   ├── countyMap.R
│       │   │   └── county_commute_times.R
│       │   ├── county_hispanic_pop/
│       │   │   ├── countyMap.R
│       │   │   └── county_hispanic_pop.R
│       │   ├── county_income_education/
│       │   │   └── county_income_education.R
│       │   ├── dream_act_mosaic/
│       │   │   └── dream_act_mosaic.R
│       │   ├── estimate_mean_median_simple/
│       │   │   └── estimate_mean_median_simple.R
│       │   ├── gpa_study_hours/
│       │   │   ├── gpa_study_hours.R
│       │   │   ├── gpa_study_hours.csv
│       │   │   └── gpa_study_hours.rda
│       │   ├── hist_box_match/
│       │   │   └── hist_box_match.R
│       │   ├── hist_vs_box/
│       │   │   └── hist_vs_box.R
│       │   ├── income_coffee_shop/
│       │   │   └── income_coffee_shop.R
│       │   ├── infant_mortality_rel_freq/
│       │   │   ├── factbook.rda
│       │   │   └── infant_mortality.R
│       │   ├── internet_life_expactancy/
│       │   │   ├── factbook.rda
│       │   │   └── internet_life_expactancy.R
│       │   ├── internet_life_expectancy/
│       │   │   ├── factbook.rda
│       │   │   └── internet_life_expectancy.R
│       │   ├── mammal_life_spans/
│       │   │   └── mammal_life_spans.R
│       │   ├── marathon_winners/
│       │   │   └── marathon_winners.R
│       │   ├── office_productivity/
│       │   │   └── office_productivity.R
│       │   ├── oscar_winners/
│       │   │   └── oscar_winners.R
│       │   ├── raise_taxes_mosaic/
│       │   │   └── raise_taxes_mosaic.R
│       │   ├── randomization_avandia/
│       │   │   └── randomization_avandia.R
│       │   ├── randomization_heart_transplants/
│       │   │   ├── inference.RData
│       │   │   └── randomization_heart_transplants.R
│       │   ├── reproducing_bacteria/
│       │   │   └── reproducing_bacteria.R
│       │   ├── seattle_pet_names/
│       │   │   └── seattle_pet_names.R
│       │   ├── stats_scores_box/
│       │   │   └── stats_scores_box.R
│       │   └── unvotes/
│       │       └── unvotes.R
│       ├── expResp/
│       │   └── expResp.R
│       ├── figureShowingBlocking/
│       │   └── figureShowingBlocking.R
│       ├── interest_rate_vs_income/
│       │   └── interest_rate_vs_loan_amount.R
│       ├── interest_rate_vs_loan_amount/
│       │   └── interest_rate_vs_loan_amount.R
│       ├── interest_rate_vs_loan_income_ratio/
│       │   └── interest_rate_vs_loan_income_ratio.R
│       ├── loan_amount_vs_income/
│       │   └── loan_amount_vs_income.R
│       ├── mnWinter/
│       │   └── ReadMe.txt
│       ├── multiunitsVsOwnership/
│       │   └── multiunitsVsOwnership.R
│       ├── popToSample/
│       │   ├── popToSampleGraduates.R
│       │   ├── popToSubSampleGraduates.R
│       │   └── surveySample.R
│       ├── pop_change_v_med_income/
│       │   └── pop_change_v_med_income.R
│       ├── pop_change_v_per_capita_income/
│       │   └── pop_change_v_per_capita_income.R
│       ├── samplingMethodsFigure/
│       │   ├── SamplingMethodsFunctions.R
│       │   ├── samplingMethodsFigure.R
│       │   └── samplingMethodsFigures.R
│       └── variables/
│           ├── sunCausesCancer.R
│           └── variables.R
├── ch_probability/
│   ├── TeX/
│   │   ├── ch_probability.tex
│   │   ├── conditional_probability.tex
│   │   ├── continuous_distributions.tex
│   │   ├── defining_probability.tex
│   │   ├── random_variables.tex
│   │   ├── review_exercises.tex
│   │   └── sampling_from_a_small_population.tex
│   └── figures/
│       ├── BreastCancerTreeDiagram/
│       │   ├── BreastCancerTreeDiagram.R
│       │   └── Mammogram Research.txt
│       ├── bookCostDist/
│       │   └── bookCostDist.R
│       ├── bookWts/
│       │   └── bookWts.R
│       ├── cardsDiamondFaceVenn/
│       │   └── cardsDiamondFaceVenn.R
│       ├── changeInLeonardsStockPortfolioFor36Months/
│       │   └── changeinleonardsstockportfoliofor36months.R
│       ├── complementOfD/
│       │   └── complementOfD.R
│       ├── contBalance/
│       │   └── contBalance.R
│       ├── diceSumDist/
│       │   └── diceSumDist.R
│       ├── dieProp/
│       │   └── dieProp.R
│       ├── disjointSets/
│       │   └── disjointSets.R
│       ├── eoce/
│       │   ├── cat_weights/
│       │   │   └── cat_weights.R
│       │   ├── poverty_language/
│       │   │   ├── poverty_language.R
│       │   │   └── poverty_language.tiff
│       │   ├── swing_voters/
│       │   │   ├── swing_voters.R
│       │   │   └── swing_voters.tiff
│       │   ├── tree_drawing_box_plots/
│       │   │   └── tree_drawing_box_plots.R
│       │   ├── tree_exit_poll/
│       │   │   └── tree_exit_poll.R
│       │   ├── tree_hiv_swaziland/
│       │   │   └── tree_hiv_swaziland.R
│       │   ├── tree_lupus/
│       │   │   └── tree_lupus.R
│       │   ├── tree_thrombosis/
│       │   │   └── tree_thrombosis.R
│       │   └── tree_twins/
│       │       └── tree_twins.R
│       ├── fdicHeightContDist/
│       │   └── fdicHeightContDist.R
│       ├── fdicHeightContDistFilled/
│       │   └── fdicHeightContDistFilled.R
│       ├── fdicHistograms/
│       │   ├── fdicHistograms.R
│       │   └── fdicHistograms.rda
│       ├── indepForRollingTwo1s/
│       │   └── indepForRollingTwo1s.R
│       ├── loans_app_type_home_venn/
│       │   └── loans_app_type_home_venn.R
│       ├── photoClassifyVenn/
│       │   └── photoClassifyVenn.R
│       ├── smallpoxTreeDiagram/
│       │   └── smallpoxTreeDiagram.R
│       ├── testTree/
│       │   └── testTree.R
│       ├── treeDiagramAndPass/
│       │   └── treeDiagramAndPass.R
│       ├── treeDiagramGarage/
│       │   └── treeDiagramGarage.R
│       ├── usHeightsHist180185/
│       │   └── usHeightsHist180185.R
│       └── usHouseholdIncomeDistBar/
│           └── usHouseholdIncomeDistBar.R
├── ch_regr_mult_and_log/
│   ├── TeX/
│   │   ├── ch_regr_mult_and_log.tex
│   │   ├── checking_model_assumptions_using_graphs.tex
│   │   ├── introduction_to_logistic_regression.tex
│   │   ├── introduction_to_multiple_regression.tex
│   │   ├── model_selection.tex
│   │   ├── mult_regr_case_study.tex
│   │   └── review_exercises.tex
│   └── figures/
│       ├── eoce/
│       │   ├── absent_from_school_mlr/
│       │   │   └── absent_from_school_mlr.R
│       │   ├── absent_from_school_model_select_backward/
│       │   │   └── absent_from_school_model_select_backward.R
│       │   ├── absent_from_school_model_select_forward/
│       │   │   └── absent_from_school_model_select_forward.R
│       │   ├── baby_weights_conds/
│       │   │   ├── babies.csv
│       │   │   └── baby_weights_conds.R
│       │   ├── baby_weights_mlr/
│       │   │   ├── babies.csv
│       │   │   └── baby_weights_mlr.R
│       │   ├── baby_weights_model_select_backward/
│       │   │   ├── babies.csv
│       │   │   └── baby_weights_model_select_backward.R
│       │   ├── baby_weights_model_select_forward/
│       │   │   ├── babies.csv
│       │   │   └── baby_weights_model_select_backward.R
│       │   ├── baby_weights_parity/
│       │   │   ├── babies.csv
│       │   │   └── baby_weights_parity.R
│       │   ├── baby_weights_smoke/
│       │   │   ├── babies.csv
│       │   │   └── baby_weights_smoke.R
│       │   ├── challenger_disaster_predict/
│       │   │   ├── challenger_disaster_predict.R
│       │   │   └── orings.rda
│       │   ├── gpa/
│       │   │   ├── gpa.R
│       │   │   └── gpa_survey.csv
│       │   ├── gpa_iq_conds/
│       │   │   ├── gpa_iq.csv
│       │   │   └── gpa_iq_conds.R
│       │   ├── log_regr_ex/
│       │   │   └── log_regr_ex.R
│       │   ├── movie_returns_altogether/
│       │   │   ├── horror_movies_conds.R
│       │   │   └── movie_profit.csv
│       │   ├── movie_returns_by_genre/
│       │   │   ├── horror_movies_conds.R
│       │   │   └── movie_profit.csv
│       │   ├── possum_classification_model_select/
│       │   │   └── possum_classification_model_select.R
│       │   ├── spam_filtering_model_sel/
│       │   │   └── spam_filtering_model_sel.R
│       │   └── spam_filtering_predict/
│       │       └── spam_filtering_predict.R
│       ├── loansDiagnostics/
│       │   └── loans_analysis.R
│       ├── loansSingles/
│       │   ├── intRateVsPastBankrScatter.R
│       │   └── intRateVsVerIncomeScatter.R
│       ├── logisticModel/
│       │   └── logisticModel.R
│       ├── logitTransformationFigureHoriz/
│       │   └── logitTransformationFigureHoriz.R
│       ├── marioKartDiagnostics/
│       │   └── marioKartAnalysis.R
│       └── marioKartSingle/
│           └── marioKartSingle.R
├── ch_regr_simple_linear/
│   ├── TeX/
│   │   ├── ch_regr_simple_linear.tex
│   │   ├── fitting_a_line_by_least_squares_regression.tex
│   │   ├── inference_for_linear_regression.tex
│   │   ├── line_fitting_residuals_and_correlation.tex
│   │   ├── review_exercises.tex
│   │   └── types_of_outliers_in_linear_regression.tex
│   └── figures/
│       ├── brushtail_possum/
│       │   └── ReadMe.txt
│       ├── elmhurstPlots/
│       │   └── elmhurstScatterW2Lines.R
│       ├── eoce/
│       │   ├── beer_blood_alcohol_inf/
│       │   │   ├── beer_blood_alcohol.txt
│       │   │   └── beer_blood_alcohol_inf.R
│       │   ├── body_measurements_hip_weight_corr_units/
│       │   │   └── body_measurements_hip_weight.R
│       │   ├── body_measurements_shoulder_height_corr_units/
│       │   │   └── body_measurements_shoulder_height.R
│       │   ├── body_measurements_weight_height_inf/
│       │   │   └── body_measurements_weight_height_inf.R
│       │   ├── cat_body_heart_reg/
│       │   │   └── cat_body_heart_reg.R
│       │   ├── coast_starlight_corr_units/
│       │   │   ├── coast_starlight.R
│       │   │   └── coast_starlight.txt
│       │   ├── crawling_babies_corr_units/
│       │   │   ├── crawling_babies.R
│       │   │   └── crawling_babies.csv
│       │   ├── exams_grades_correlation/
│       │   │   ├── exam_grades.txt
│       │   │   └── exams_grades_correlation.R
│       │   ├── full_lin_regr_1/
│       │   │   ├── prof_evals_beauty.csv
│       │   │   └── rate_my_prof.R
│       │   ├── full_lin_regr_2/
│       │   │   ├── prof_evals_beauty.csv
│       │   │   └── rate_my_prof.R
│       │   ├── helmet_lunch/
│       │   │   └── helmet_lunch.R
│       │   ├── husbands_wives_age_inf/
│       │   │   ├── husbands_wives.txt
│       │   │   └── husbands_wives_age_inf.R
│       │   ├── husbands_wives_correlation/
│       │   │   ├── husbands_wives.txt
│       │   │   └── husbands_wives_correlation.R
│       │   ├── husbands_wives_height_inf/
│       │   │   ├── husbands_wives.txt
│       │   │   └── husbands_wives_height_inf.R
│       │   ├── husbands_wives_height_inf_2s/
│       │   │   ├── husbands_wives.txt
│       │   │   └── husbands_wives_height_inf_2s.R
│       │   ├── identify_relationships_1/
│       │   │   └── identify_relationships_1.R
│       │   ├── identify_relationships_2/
│       │   │   └── identify_relationships_2.R
│       │   ├── match_corr_1/
│       │   │   └── match_corr_1.R
│       │   ├── match_corr_2/
│       │   │   └── match_corr_2.R
│       │   ├── match_corr_3/
│       │   │   ├── match_corr_2.R
│       │   │   └── match_corr_3.R
│       │   ├── murders_poverty_reg/
│       │   │   ├── murders.csv
│       │   │   └── murders_poverty.R
│       │   ├── outliers_1/
│       │   │   └── outliers_1.R
│       │   ├── outliers_2/
│       │   │   └── outliers_2.R
│       │   ├── rate_my_prof/
│       │   │   ├── prof_evals_beauty.csv
│       │   │   └── rate_my_prof.R
│       │   ├── speed_height_gender/
│       │   │   ├── speed_height_gender.R
│       │   │   └── speed_survey.csv
│       │   ├── starbucks_cals_carbos/
│       │   │   ├── starbucks.csv
│       │   │   └── starbucks_cals_carbos.R
│       │   ├── starbucks_cals_protein/
│       │   │   ├── starbucks.csv
│       │   │   └── starbucks_cals_protein.R
│       │   ├── tourism_spending_reg_conds/
│       │   │   ├── tourism_spending.csv
│       │   │   └── tourism_spending_reg_cond.R
│       │   ├── trees_volume_height_diameter/
│       │   │   └── trees_volume_height_diameter.R
│       │   ├── trends_in_residuals/
│       │   │   └── trends_in_residuals.R
│       │   ├── urban_homeowners_cond/
│       │   │   ├── urban_homeowners_cond.R
│       │   │   └── urban_state_data.csv
│       │   ├── urban_homeowners_outlier/
│       │   │   ├── urban_homeowners_outlier.R
│       │   │   └── urban_state_data.csv
│       │   └── visualize_residuals/
│       │       └── visualize_residuals.R
│       ├── identifyingInfluentialPoints/
│       │   └── identifyingInfluentialPoints.R
│       ├── imperfLinearModel/
│       │   └── imperfLinearModel.R
│       ├── marioKartNewUsed/
│       │   └── marioKartNewUsed.R
│       ├── notGoodAtAllForALinearModel/
│       │   └── notGoodAtAllForALinearModel.R
│       ├── outlierPlots/
│       │   └── outlierPlots.R
│       ├── pValueMidtermUnemp/
│       │   └── pValueMidtermUnemp.R
│       ├── perfLinearModel/
│       │   └── perfLinearModel.R
│       ├── posNegCorPlots/
│       │   ├── CorrelationPlot.R
│       │   ├── corForNonLinearPlots.R
│       │   └── posNegCorPlots.R
│       ├── sampleLinesAndResPlots/
│       │   └── sampleLinesAndResPlots.R
│       ├── scattHeadLTotalL/
│       │   └── scattHeadLTotalL.R
│       ├── scattHeadLTotalLLine/
│       │   └── scattHeadLTotalLLine.R
│       ├── scattHeadLTotalLResidualPlot/
│       │   └── scattHeadLTotalLResidualPlot.R
│       ├── scattHeadLTotalLSex/
│       │   └── scattHeadLTotalLSex.R
│       ├── scattHeadLTotalLTube/
│       │   └── scattHeadLTotalLTube.R
│       ├── unemploymentAndChangeInHouse/
│       │   └── unemploymentAndChangeInHouse.R
│       └── whatCanGoWrongWithLinearModel/
│           ├── makeTubeAdv.R
│           └── whatCanGoWrongWithLinearModel.R
├── ch_summarizing_data/
│   ├── TeX/
│   │   ├── case_study_malaria_vaccine.tex
│   │   ├── ch_summarizing_data.tex
│   │   ├── considering_categorical_data.tex
│   │   ├── examining_numerical_data.tex
│   │   └── review_exercises.tex
│   └── figures/
│       ├── boxPlotLayoutNumVar/
│       │   └── boxPlotLayoutNumVar.R
│       ├── carsPriceVsWeight/
│       │   └── carsPriceVsWeight.R
│       ├── countyIncomeSplitByPopGain/
│       │   └── countyIncomeSplitByPopGain.R
│       ├── countyIntensityMaps/
│       │   ├── countyIntensityMaps.R
│       │   └── countyMap.R
│       ├── county_pop_change_v_pop_transform/
│       │   └── county_pop_change_v_pop_transform.R
│       ├── county_pop_transformed/
│       │   └── county_pop_transformed.R
│       ├── discRandDotPlot/
│       │   └── discRandDotPlot.R
│       ├── email50LinesCharacters/
│       │   └── email50LinesCharacters.R
│       ├── email50LinesCharactersMod/
│       │   └── email50LinesCharactersMod.R
│       ├── email50NumCharDotPlotRobustEx/
│       │   └── email50NumCharDotPlotRobustEx.R
│       ├── email50NumCharHist/
│       │   └── email50NumCharHist.R
│       ├── emailCharactersDotPlot/
│       │   └── emailCharactersDotPlot.R
│       ├── emailNumberBarPlot/
│       │   └── emailNumberBarPlot.R
│       ├── emailNumberPieChart/
│       │   └── emailNumberPieChart.R
│       ├── emailSpamNumberMosaicPlot/
│       │   └── emailSpamNumberMosaicPlot.R
│       ├── emailSpamNumberSegBar/
│       │   └── emailSpamNumberSegBar.R
│       ├── eoce/
│       │   ├── air_quality_durham/
│       │   │   ├── air_quality_durham.R
│       │   │   └── pm25_2011_durham.csv
│       │   ├── antibiotic_use_children/
│       │   │   └── antibiotic_use_children.R
│       │   ├── association_plots/
│       │   │   └── association_plots.R
│       │   ├── cleveland_sacramento/
│       │   │   └── cleveland_sacramento.R
│       │   ├── county_commute_times/
│       │   │   ├── countyMap.R
│       │   │   └── county_commute_times.R
│       │   ├── county_hispanic_pop/
│       │   │   ├── countyMap.R
│       │   │   └── county_hispanic_pop.R
│       │   ├── dream_act_mosaic/
│       │   │   └── dream_act_mosaic.R
│       │   ├── estimate_mean_median_simple/
│       │   │   └── estimate_mean_median_simple.R
│       │   ├── hist_box_match/
│       │   │   └── hist_box_match.R
│       │   ├── hist_vs_box/
│       │   │   └── hist_vs_box.R
│       │   ├── income_coffee_shop/
│       │   │   └── income_coffee_shop.R
│       │   ├── infant_mortality_rel_freq/
│       │   │   ├── factbook.rda
│       │   │   └── infant_mortality.R
│       │   ├── mammal_life_spans/
│       │   │   └── mammal_life_spans.R
│       │   ├── marathon_winners/
│       │   │   └── marathon_winners.R
│       │   ├── office_productivity/
│       │   │   └── office_productivity.R
│       │   ├── oscar_winners/
│       │   │   └── oscar_winners.R
│       │   ├── raise_taxes_mosaic/
│       │   │   └── raise_taxes_mosaic.R
│       │   ├── randomization_avandia/
│       │   │   └── randomization_avandia.R
│       │   ├── randomization_heart_transplants/
│       │   │   ├── inference.RData
│       │   │   └── randomization_heart_transplants.R
│       │   ├── reproducing_bacteria/
│       │   │   └── reproducing_bacteria.R
│       │   └── stats_scores_box/
│       │       └── stats_scores_box.R
│       ├── histMLBSalaries/
│       │   └── histMLBSalaries.R
│       ├── loan50IncomeHist/
│       │   └── loan50IncomeHist.R
│       ├── loan50IntRateHist/
│       │   └── loan50IntRateHist.R
│       ├── loan50LoanAmountHist/
│       │   └── loan50LoanAmountHist.R
│       ├── loan50_amt_vs_income/
│       │   └── loan50_amt_vs_income.R
│       ├── loan50_amt_vs_interest/
│       │   └── loan50_amt_vs_interest.R
│       ├── loan_amount_dot_plot/
│       │   └── loan_amount_dot_plot.R
│       ├── loan_app_type_home_mosaic_plot/
│       │   └── loan_app_type_home_mosaic_plot.R
│       ├── loan_app_type_home_seg_bar/
│       │   └── loan_app_type_home_seg_bar.R
│       ├── loan_homeownership_bar_plot/
│       │   └── loan_homeownership_bar_plot.R
│       ├── loan_homeownership_pie_chart/
│       │   └── loan_homeownership_pie_chart.R
│       ├── loan_int_rate_box_plot_layout/
│       │   └── loan_int_rate_box_plot_layout.R
│       ├── loan_int_rate_dot_plot/
│       │   └── loan_int_rate_dot_plot.R
│       ├── loan_int_rate_robust_ex/
│       │   └── loan_int_rate_robust_ex.R
│       ├── malaria_rand_dot_plot/
│       │   └── malaria_rand_dot_plot.R
│       ├── medianHHIncomePoverty/
│       │   └── medianHHIncomePoverty.R
│       ├── sdAsRuleForEmailNumChar/
│       │   └── sdAsRuleForEmailNumChar.R
│       ├── sdRuleForIncome/
│       │   └── sdRuleForIncome.R
│       ├── sdRuleForIntRate/
│       │   └── sdRuleForIntRate.R
│       ├── sdRuleForLoanAmount/
│       │   └── sdRuleForLoanAmount.R
│       ├── severalDiffDistWithSdOf1/
│       │   └── severalDiffDistWithSdOf1.R
│       ├── singleBiMultiModalPlots/
│       │   └── singleBiMultiModalPlots.R
│       └── total_income_dot_plot/
│           └── total_income_dot_plot.R
├── eoce.bib
├── extraTeX/
│   ├── data/
│   │   └── data.tex
│   ├── eoceSolutions/
│   │   └── eoceSolutions.tex
│   ├── index/
│   │   └── index.tex
│   ├── preamble/
│   │   ├── copyright.tex
│   │   ├── copyright_derivative.tex
│   │   ├── preface.tex
│   │   ├── review_copy.tex
│   │   ├── title.tex
│   │   └── title_derivative.tex
│   ├── style/
│   │   ├── colorsV1.tex
│   │   ├── hardcover.tex
│   │   ├── headers.tex
│   │   ├── headers_simple.tex
│   │   ├── style.tex
│   │   ├── style_appendices.tex
│   │   ├── style_simple.tex
│   │   ├── tablet.tex
│   │   └── video.tex
│   └── tables/
│       ├── TeX/
│       │   ├── chiSquareTable.tex
│       │   ├── tTable.tex
│       │   └── zTable.tex
│       ├── code/
│       │   ├── chiSquareProbTable.R
│       │   └── normalProbTable.R
│       └── figures/
│           ├── chiSquareTail/
│           │   └── chiSquareTail.R
│           ├── normalTails/
│           │   ├── normalTails.R
│           │   └── subtractingArea/
│           │       └── subtractingArea.R
│           └── tTails/
│               └── tTails.R
├── fullminipage.sty
├── main.tex
└── openintro-statistics.Rproj

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
*.log
*.aux
main-blx.bib
main.bbl
main.blg
main.idx
main.ilg
main.ind
main.out
main.pdf
main.run.xml
main.synctex.gz
main.toc
main.upa
*.DS_Store
*gitignore~
*.Rapp.history
*~
Icon[^/]
\#*
*.dropbox
_README
*-deprecated*
*.Rhistory
OS4-201[89]-[01][0-9]-[0-3][0-9] [A-Z].pdf
main.synctex(busy)
.Rproj.user


================================================
FILE: LICENSE.md
================================================

OpenIntro Statistics is available at http://www.openintro.org under a Creative Commons Attribution-ShareAlike 3.0 Unported license (CC BY-SA):

http://creativecommons.org/licenses/by-sa/3.0/

This `LICENSE` file describes guidelines when the textbook's source files are modified and / or shared. The CC BY-SA license guidelines supersede any guidelines put forth here; follow the CC BY-SA license if there is any discrepancy between that license and these guidelines.

You may contact us if you would like to request an alternative licensing option at

https://www.openintro.org/contact

1. Communication obligation. Any derivative work must communicate that it is licensed under a CC BY-SA license.

2. Figure attribution. Some photographs may be owned by other creators who made the images available under a Creative Commons license and were used in this work. If you use a photograph, please check in the textbook whether the figure is a work of another party. If you use any such images, provide appropriate attribution to the original photographer (e.g. see OpenIntro Statistics for what we believe to be appropriate attribution in these instances).

3. Derivative title. No derivative may include "OpenIntro" in the title, unless it is included in text of the form "Derivative of OpenIntro". Additionally, the title may not match any OpenIntro textbook (or be a translated equivalent) and also may not imply a connection (e.g. "[Introductory Statistics with Randomization and Simulation](https://openintro.org/book/isrs/) for Biology" is not permitted). A novel title is required to avoid product confusion or the appearance that your new resource is associated with OpenIntro.

Use of the OpenIntro trademark and logo are strictly prohibited and are not licensed for use. The only appropriate use is when indicating the original resource that has been modified. Example: "This book was built using 'OpenIntro Statistics', and that original book may be found at openintro.org/book/os."

4. Below are other suggested guidelines for attribution.

- The first two pages of any derivative work should be the title page and the copyright page. We encourage contributors to use the following two files provided in the textbook's source: file, extraTeX > preamble > title_derivative.tex, copyright_derivative.tex. We understand that it may be useful to modify them, so consider them an initial template.

- We advise that contributing authors' names be listed in chronological order corresponding to their contribution. We also encourage contributing authors to provide a brief description of their contribution.


================================================
FILE: README.md
================================================
Project Organization
--------------------

- Each chapter's content is in one of the eight chapter folders that start with "ch_". Within each folder, there is a "figures" folder and a "TeX" folder. The TeX folder contains the text files that are used to typeset the chapters in the textbook.
- In many cases, R code is supplied with figures to regenerate the figure. It will often be necessary to install the "openintro" R package that is available from GitHub (https://github.com/OpenIntroOrg) if you would like to regenerate a figure. Other packages may also occasionally be required.
- Exercise figures may be found in [chapter folder] > figures > eoce > [exercise figure folders]. "EOCE" means end-of-chapter exercises.
- The extraTeX folder contains files for the front and back matter of the textbook and also the style files. Note that use of any style files, like all other files here, is under the Creative Commons license cited in the LICENSE file.

- - -

Typesetting the Textbook
------------------------

The textbook may be typeset using the main.tex file. This file pulls in all of the necessary TeX files and figures. For a final typesetting event, typeset in the following order

- LaTeX 3 times.
- MakeIndex once.
- BibTeX once.
- LaTeX once.
- MakeIndex once.
- LaTeX once.

This isn't important for casual browsing, but it is important for a "final" version. The repetitive typesetting is to account for when typesetting changes references slightly, since typesetting the first few times can move content from one page to the next, e.g. as a \ref{...} gets filled in.

- - -

Learning LaTeX
--------------

If you are not familiar with LaTeX but would like to learn how to use it, check out the slides from two LaTeX mini-courses at

https://github.com/OpenIntroOrg/mini-course-materials

PDFs:

[Basics of LaTeX](https://github.com/OpenIntroOrg/mini-course-materials/raw/master/LaTeX_Basics/basicsOfLatex.pdf)

[Math and BibTeX](https://github.com/OpenIntroOrg/mini-course-materials/raw/master/LaTeX_Math_and_BibTeX/bibtexMathInLatex.pdf)

For a more authoritative review, the book "Guide to LaTeX" is an excellent resource.

Also, see the branches of [this repo](https://github.com/statkclee/mini-course-materials) by Kwangchun Lee for Korean translations of these mini-course materials.


================================================
FILE: ch_distributions/TeX/binomial_distribution.tex
================================================
\exercisesheader{}

% 17

\eoce{\qt{Underage drinking, Part I\label{underage_drinking_intro}}
Data collected by the Substance Abuse and Mental Health
Services Administration (SAMHSA) suggests that 69.7\% of
18-20 year olds consumed alcoholic beverages in any given
year.\footfullcite{webpage:alcohol}
\begin{parts}
\item Suppose a random sample of ten 18-20 year olds is taken. Is the use 
of the binomial distribution appropriate for calculating the probability that 
exactly six consumed alcoholic beverages? Explain.
\item Calculate the probability that exactly 6 out of 10 randomly sampled 18-
20 year olds consumed an alcoholic drink.
\item What is the probability that exactly four out of ten 18-20 year 
olds have \textit{not} consumed an alcoholic beverage?
\item What is the probability that at most 2 out of 5 randomly sampled 18-20 
year olds have consumed alcoholic beverages?
\item What is the probability that at least 1 out of 5 randomly sampled 18-20 
year olds have consumed alcoholic beverages?
\end{parts}
}{}

% 18

\eoce{\qt{Chickenpox, Part I\label{chicken_pox_intro}} Boston Children's
Hospital estimates that 90\% of Americans have had chickenpox by 
the time they reach adulthood. \footfullcite{bostonchildrenshospital:chickenpox}
\begin{parts}
\item Suppose we take a random sample of 100 American adults. Is the use of 
the binomial distribution appropriate for calculating the probability that exactly 97 
out of 100 randomly sampled American adults had chickenpox during childhood? Explain.
\item Calculate the probability that exactly 97 out of 100 randomly sampled 
American adults had chickenpox during childhood.
\item What is the probability that exactly 3 out of a new sample of 100 
American adults have \textit{not} had chickenpox in their childhood?
\item What is the probability that at least 1 out of 10 randomly sampled 
American adults have had chickenpox?
\item What is the probability that at most 3 out of 10 randomly sampled 
American adults have \textit{not} had chickenpox?
\end{parts}
}{}

% 19

\eoce{\qt{Underage drinking, Part II\label{underage_drinking_normal_approx}}
We learned in Exercise~\ref{underage_drinking_intro}
that about 70\% of 18-20 year olds consumed alcoholic
beverages in any given year. We now consider a random 
sample of fifty 18-20 year olds.
\begin{parts}
\item How many people would you expect to have consumed alcoholic beverages? 
And with what standard deviation?
\item Would you be surprised if there were 45 or more people who have 
consumed alcoholic beverages?
\item What is the probability that 45 or more people in this sample have 
consumed alcoholic beverages? How does this probability relate to your answer 
to part (b)?
\end{parts}
}{}

% 20

\eoce{\qt{Chickenpox, Part II\label{chicken_pox_normal_approx}} We learned in 
Exercise~\ref{chicken_pox_intro} that about 90\% of American adults had 
chickenpox before adulthood. We now consider a random sample of 120 American 
adults.
\begin{parts}
\item How many people in this sample would you expect to have had chickenpox 
in their childhood? And with what standard deviation?
\item Would you be surprised if there were 105 people who have had chickenpox 
in their childhood?
\item What is the probability that 105 or fewer people in this sample have 
had chickenpox in their childhood? How does this probability relate to your 
answer to part (b)?
\end{parts}
}{}

% 21

\eoce{\qt{Game of dreidel\label{dreidel}} A dreidel is a four-sided spinning top 
with the Hebrew letters \textit{nun}, \textit{gimel}, \textit{hei}, and 
\textit{shin}, one on each side. Each side is equally likely to come up in a 
single spin of the dreidel. Suppose you spin a dreidel three times. Calculate 
the probability of getting

\noindent\begin{minipage}[c]{0.45\textwidth}
\begin{parts}
\item at least one \textit{nun}? 
\item exactly 2 \textit{nun}s? 
\item exactly 1 \textit{hei}? 
\item at most 2 \textit{gimel}s? \vspace{3mm}
\end{parts}
\end{minipage}%
\begin{minipage}[c]{0.25\textwidth}
\ \vspace{2mm}

\Figures[An image of two wooden dreidels.]{0.95}{eoce/dreidel}{dreidel.jpg}\vspace{2mm}
\end{minipage}%
\begin{minipage}[c]{0.28\textwidth}%
{\footnotesize Photo by Staccabees, cropped \\
  (\oiRedirect{textbook-flickr_staccabees_dreidels}{http://flic.kr/p/7gLZTf}) \\
  \oiRedirect{textbook-CC_BY_2}{CC~BY~2.0~license}} \\
\end{minipage}
}{}

\D{\newpage}

% 22

\eoce{\qt{Arachnophobia\label{arachnophobia}}
A Gallup Poll found that 7\% of teenagers (ages 13 to 17)
suffer from arachnophobia and are extremely afraid of spiders.
At a summer camp there are 10 teenagers sleeping in each tent.
Assume that these 10 teenagers are independent of each other.%
\footfullcite{webpage:spiders}
\begin{parts}
\item Calculate the probability that at least one of them suffers from 
arachnophobia.
\item Calculate the probability that exactly 2 of them suffer from 
arachnophobia.
\item Calculate the probability that at most 1 of them suffers from 
arachnophobia. 
\item If the camp counselor wants to make sure no more than 1 teenager in 
each tent is afraid of spiders, does it seem reasonable for him to randomly 
assign teenagers to tents?
\end{parts}
}{}

% 23

\eoce{\qt{Eye color, Part II\label{eye_color_binomial}} 
Exercise~\ref{eye_color_geometric} introduces a husband and wife with brown 
eyes who have 0.75 probability of having children with brown eyes, 0.125 
probability of having children with blue eyes, and 0.125 probability of 
having children with green eyes.
\begin{parts}
\item What is the probability that their first child will have green eyes and 
the second will not?
\item What is the probability that exactly one of their two children will 
have green eyes?
\item If they have six children, what is the probability that exactly two 
will have green eyes?
\item If they have six children, what is the probability that at least one 
will have green eyes?
\item What is the probability that the first green eyed child will be the 
$4^{th}$ child? 
\item Would it be considered unusual if only 2 out of their 6 children had 
brown eyes?
\end{parts}
}{}

% 24

\eoce{\qt{Sickle cell anemia\label{sickle_cell_anemia}} Sickle cell anemia is a 
genetic blood disorder where red blood cells lose their flexibility and 
assume an abnormal, rigid, ``sickle" shape, which results in a risk of 
various complications. If both parents are carriers of the disease, then a 
child has a 25\% chance of having the disease, 50\% chance of being a 
carrier, and 25\% chance of neither having the disease nor being a carrier. 
If two parents who are carriers of the disease have 3 children, what is the 
probability that 
\begin{parts}
\item two will have the disease?
\item none will have the disease?
\item at least one will neither have the disease nor be a carrier?
\item the first child with the disease will be the $3^{rd}$ child?
\end{parts}
}{}

% 25

\eoce{\qt{Exploring permutations\label{explore_combinations}} The formula for the 
number of ways to arrange $n$ objects is $n! = n\times(n-1)\times \cdots 
\times 2 \times 1$. This exercise walks you through the derivation of this 
formula for a couple of special cases.

\indent A small company has five employees: Anna, Ben, Carl, Damian, and 
Eddy. There are five parking spots in a row at the company, none of which are 
assigned, and each day the employees pull into a random parking spot. That 
is, all possible orderings of the cars in the row of spots are equally likely.
\begin{parts}
\item On a given day, what is the probability that the employees park in 
alphabetical order?
\item If the alphabetical order has an equal chance of occurring relative to 
all other possible orderings, how many ways must there be to arrange the five 
cars?
\item Now consider a sample of 8 employees instead. How many possible ways 
are there to order these 8 employees' cars?
\end{parts}
}{}

% 26

\eoce{\qt{Male children\label{male_children}} While it is often assumed that the 
probabilities of having a boy or a girl are the same, the actual probability 
of having a boy is slightly higher at 0.51. Suppose a couple plans to have 3 
kids. 
\begin{parts}
\item Use the binomial model to calculate the probability that two of them 
will be boys.
\item Write out all possible orderings of 3 children, 2 of whom are boys. Use 
these scenarios to calculate the same probability from part (a) but using the 
addition rule for disjoint outcomes. Confirm that your answers from parts (a) 
and (b) match.
\item If we wanted to calculate the probability that a couple who plans to 
have 8 kids will have 3 boys, briefly describe why the approach from part (b) 
would be more tedious than the approach from part (a).
\end{parts}
}{}


================================================
FILE: ch_distributions/TeX/ch_distributions.tex
================================================
\begin{chapterpage}{Distributions of random variables}
  \chaptertitle[30]{Distributions of random \titlebreak{} variables}
  \label{ch_distributions}
  \chaptersection{normalDist}
  %\chaptersection{assessingNormal}
  \chaptersection{geomDist}
  \chaptersection{binomialModel}
  \chaptersection{negativeBinomial}
  \chaptersection{poisson}
\end{chapterpage}
\renewcommand{\chapterfolder}{ch_distributions}


\chapterintro{In this chapter,
  we discuss statistical distributions that frequently
  arise in the context of data analysis or statistical
  inference.
  We start with the normal distribution in the first section,
  which is used frequently in later chapters of this book.
  The remaining sections will occasionally be referenced
  but may be considered optional for the content in this
  book.}

%_________________
\section{Normal distribution}
\label{normalDist}

\index{distribution!normal|(}
\index{normal distribution|(}

Among all the distributions we see in practice,
one is overwhelmingly the most common.
The symmetric, unimodal, bell curve is ubiquitous
throughout statistics.
Indeed it is so common, that people often know it as the
\termsub{normal curve}{normal distribution} or
\term{normal distribution}\index{distribution!normal|textbf}%
,\footnote{It
  is also introduced as the Gaussian distribution after Frederic
  Gauss, the first person to formalize its mathematical
  expression.}
shown in Figure~\ref{simpleNormal}.
Variables such as SAT scores and heights of US adult males
closely follow the normal distribution.

\begin{figure}[h]
  \centering
  \Figure[A bell-shaped curve that is symmetric about its center is shown. This is the normal distribution. From the left, the curve starts low, gradually lifting off the horizontal axis before more steeply rising, before it starts to rise more slowly and flattens at its peak. From the peak, it starts to decrease slowly and then more steeply, before gradually flattening out as it approaches the horizontal axis. This is the bell-shaped normal distribution, and it is the shape of many distributions we will encounter throughout this book. In general, going forward, this bell-shaped distribution shape should be remembered whenever the normal distribution is discussed.]{0.5}{simpleNormal}
  \caption{A normal curve.}
  \label{simpleNormal}
\end{figure}

\begin{onebox}{Normal distribution facts}
  Many variables are nearly normal, but none are exactly normal.
  Thus the normal distribution, while not perfect for any single
  problem, is very useful for a variety of problems.
  We will use it in data exploration and to solve important
  problems in statistics.
\end{onebox}


\subsection{Normal distribution model}

The \term{normal distribution} always describes a symmetric,
unimodal, bell-shaped curve.
However, these curves can look different depending on the
details of the model.
Specifically, the normal distribution model can be adjusted
using two parameters: mean and standard deviation.
As you can probably guess, changing the mean shifts the bell
curve to the left or right, while changing the standard deviation
stretches or constricts the curve.
Figure~\ref{twoSampleNormals} shows the normal distribution
with mean $0$ and standard deviation $1$ in the left panel
and the normal distribution with mean $19$ and standard
deviation $4$ in the right panel.
Figure~\ref{twoSampleNormalsStacked} shows these distributions
on the same axis.

\begin{figure}[h]
  \centering
  \Figure[Two normal distributions are shown. The first has a center of 0 and a standard deviation of 1, where the two tails of the normal distribution curve are essentially indistinguishable from a height of 0 for values less than -3 or larger than positive 3. The second normal distribution is centered at 19 and has a standard deviation of 4, where the height of the distribution is indistinguishable from 0 when it is more than 3 standard deviations from the mean.]{0.7}{twoSampleNormals}
  \caption{Both curves represent the normal distribution.
      However, they differ in their center and spread.}
  \label{twoSampleNormals}
\end{figure}

\begin{figure}[h]
  \centering
  \Figure[Two normal distributions are shown on the same plot. The first has a mean of 0 and a standard deviation of 1. The second has a mean of 19 and a standard deviation of 4. One important property visible in the plot is, because distributions are required to have an area of 1, the normal distribution with a standard deviation of 1 appears much narrower but also much taller than the second distribution that has a standard deviation of 4.]{0.6}{twoSampleNormalsStacked}
  \caption{The normal distributions shown in
      Figure~\ref{twoSampleNormals} but plotted together
      and on the same scale.}
  \label{twoSampleNormalsStacked}
\end{figure}

If a normal distribution has mean $\mu$ and standard deviation
$\sigma$, we may write the distribution as $N(\mu, \sigma)$.
The two distributions in Figure~\ref{twoSampleNormalsStacked}
may be written as
\begin{align*}
N(\mu=0,\sigma=1)
  \quad \text{and} \quad
  N(\mu=19,\sigma=4)
\end{align*}
Because the mean and standard deviation describe a normal
distribution exactly, they are called the distribution's
\termsub{parameters}{parameter}.
The normal distribution with mean $\mu = 0$ and
standard deviation $\sigma = 1$ is called the
\term{standard normal distribution}%
\index{normal distribution!standard|textbf}.

\begin{exercisewrap}
\begin{nexercise}
Write down the short-hand for a normal distribution
with\footnotemark{} \\
%\begin{enumerate}[(a)]
%\setlength{\itemsep}{0mm}
%\item
(a)
    mean~5 and standard deviation~3, \\
%\item
(b)
    mean~-100 and standard deviation~10, and \\
%\item
(c)
    mean~2 and standard deviation~9.
%\end{enumerate}
\end{nexercise}
\end{exercisewrap}
\footnotetext{(a)~$N(\mu=5,\sigma=3)$.
  (b)~$N(\mu=-100, \sigma=10)$.
  (c)~$N(\mu=2, \sigma=9)$.}


\subsection{Standardizing with Z-scores}

\noindent%
We often want to put data onto a standardized scale,
which can make comparisons more reasonable.

\newcommand{\satmean}{1100}
\newcommand{\satsd}{200}
\newcommand{\actmean}{21}
\newcommand{\actsd}{6}
\newcommand{\annsatscore}{1300}
\newcommand{\annsatzscore}{1}
\newcommand{\tomsatscore}{24}
\newcommand{\tomsatzscore}{0.5}

\begin{examplewrap}
\begin{nexample}{Table~\vref{satACTstats} shows the mean
    and standard deviation for total scores on the SAT and ACT.
    The distribution of SAT and ACT scores are both nearly normal.
    Suppose Ann scored \annsatscore{} on her SAT and Tom scored
    \tomsatscore{} on his ACT.
    Who performed better?}
  \label{actSAT}%
  We use the standard deviation as a guide.
  Ann is \annsatzscore{} standard deviation above average
  on the SAT: $\satmean{} + \satsd{} = \annsatscore{}$.
  Tom is \tomsatzscore{} standard deviations above the mean
  on the ACT:
  $\actmean{} + \tomsatzscore{} \times \actsd{} = \tomsatscore{}$.
  In Figure~\ref{satActNormals}, we can see that Ann tends
  to do better with respect to everyone else than Tom did,
  so her score was better.
\end{nexample}
\end{examplewrap}

\begin{figure}[h]
\centering
\begin{tabular}{l r r}
  \hline
  & SAT & ACT \\
  \hline
  Mean \hspace{0.3cm} & \satmean{} & \actmean{} \\
  SD & \satsd{} & \actsd{} \\
  \hline
\end{tabular}
\caption{Mean and standard deviation for the SAT and ACT.}
\label{satACTstats}
\end{figure}

\begin{figure}
  \centering
  \Figure[Ann's and Tom's scores shown against the SAT and ACT distributions, which are each shown as normal distributions. The SAT distribution has a mean of 1100 and a standard deviation of 200, while the ACT distribution has a mean of 21 and standard deviation of 6. Ann's score is 1300 for the SAT, and Tom's score is 24 for the ACT. Based on their positioning in their respective plots, it is evident that Ann has a higher relative value for her SAT distribution than Tom has for his ACT score.]{0.6}{satActNormals}
  \caption{Ann's and Tom's scores shown against
      the SAT and ACT distributions.}
  \label{satActNormals}
\end{figure}

Example~\ref{actSAT} used a standardization technique called
a Z-score, a method most commonly employed for nearly normal
observations but that may be used with any distribution.
The \term{Z-score}\index{Z@$Z$} of an observation is defined
as the number of standard deviations it falls above or below
the mean.
If the observation is one standard deviation above the mean,
its Z-score is~1.
If it is 1.5 standard deviations \emph{below} the mean,
then its Z-score is -1.5.
If $x$ is an observation from a distribution $N(\mu, \sigma)$,
we define the Z-score mathematically as
\begin{align*}
Z = \frac{x - \mu}{\sigma}
\end{align*}
Using $\mu_{SAT} = \satmean{}$, $\sigma_{SAT} = \satsd{}$,
and $x_{_{\text{Ann}}} = \annsatscore{}$, we find Ann's Z-score:
\begin{align*}
Z_{_{\text{Ann}}}
  = \frac{x_{_{\text{Ann}}} - \mu_{_{\text{SAT}}}}
      {\sigma_{_{\text{SAT}}}}
  = \frac{\annsatscore{} - \satmean{}}{\satsd{}}
  = \annsatzscore{}
\end{align*}

\begin{onebox}{The Z-score}
  The Z-score of an observation is the number of standard
  deviations it falls above or below the mean.
  We compute the Z-score for an observation $x$ that follows
  a distribution with mean $\mu$ and standard deviation
  $\sigma$ using
  \begin{align*}
  Z = \frac{x - \mu}{\sigma}
  \end{align*}
\end{onebox}

\begin{exercisewrap}
\begin{nexercise}
Use Tom's ACT score, \tomsatscore{}, along with the ACT mean and
standard deviation to find his Z-score.\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{$Z_{Tom}
  = \frac{x_{\text{Tom}} - \mu_{\text{ACT}}}
      {\sigma_{\text{ACT}}}
  = \frac{\tomsatscore{} - \actmean{}}{\actsd{}}
  = \tomsatzscore{}$}

Observations above the mean always have positive Z-scores,
while those below the mean always have negative Z-scores.
If an observation is equal to the mean,
such as an SAT score of \satmean{}, then the Z-score is $0$.

\begin{exercisewrap}
\begin{nexercise}
Let $X$ represent a random variable from $N(\mu=3, \sigma=2)$,
and suppose we observe $x=5.19$. \\
%\begin{enumerate}[(a)]
%\setlength{\itemsep}{0mm}
%\item
(a)
    Find the Z-score of $x$. \\
%\item
(b)
    Use the Z-score to determine how many standard deviations
    above or below the mean $x$ falls.\footnotemark{}
%\end{enumerate}
\end{nexercise}
\end{exercisewrap}
\footnotetext{(a) Its Z-score is given by
    $Z
      = \frac{x-\mu}{\sigma}
      = \frac{5.19 - 3}{2}
      = 2.19/2
      = 1.095$.
    (b)~The observation $x$ is 1.095 standard deviations
    \emph{above} the mean.
    We know it must be above the mean since $Z$ is positive.}

\begin{exercisewrap}
\begin{nexercise} \label{headLZScore}
Head lengths of brushtail possums follow a normal
distribution with mean 92.6 mm and standard deviation 3.6 mm.
Compute the Z-scores for possums with head lengths of 95.4 mm
and 85.8~mm.\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{For $x_1=95.4$ mm:
    $Z_1
      = \frac{x_1 - \mu}{\sigma}
      = \frac{95.4 - 92.6}{3.6}
      = 0.78$.
    For $x_2=85.8$ mm:
    $Z_2 = \frac{85.8 - 92.6}{3.6} = -1.89$.}

We can use Z-scores to roughly identify which observations
are more unusual than others.
An observation $x_1$ is said to be more unusual than another
observation $x_2$ if the absolute value of its Z-score is larger
than the absolute value of the other observation's Z-score:
$|Z_1| > |Z_2|$.
This technique is especially insightful when a distribution
is symmetric.

%\D{\newpage}

\begin{exercisewrap}
\begin{nexercise}
Which of the observations in Guided Practice~\ref{headLZScore}
is more unusual?\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{Because the \emph{absolute value} of Z-score
  for the second observation is larger than that of the first,
  the second observation has a more unusual head length.}


\subsection{Finding tail areas}

It's very useful in statistics to be able to identify tail areas
of distributions.
For instance, what fraction of people have an SAT score below
Ann's score of 1300?
This is the same as the \term{percentile} Ann is at, which is
the percentage of cases that have lower scores than Ann.
We can visualize such a tail area like the curve and shading
shown in Figure~\ref{satBelow1300}.

\begin{figure}[h]
  \centering
  \Figure[A normal distribution is shown with a mean of 1100 and a standard deviation of 200. The distribution is shaded to the left of the value 1300, meaning the area bound by the horizontal axis, the bell-shaped curve (up to the horizontal value of 1300) and a vertical line at 1300 is shaded.]{0.45}{satBelow1300}
  \caption{The area to the left of $Z$ represents the
      fraction of people who scored lower than Ann.}
  \label{satBelow1300}
\end{figure}

There are many techniques for doing this, and we'll discuss
three of the options.
\begin{enumerate}
\item
    The most common approach in practice is to use
    statistical software.
    For example, in the program \R{}, we could find the area
    shown in Figure~\ref{satBelow1300} using the
    following command, which takes in the Z-score
    and returns the lower tail area: \\
    {\color{white}.....}%
        \texttt{> pnorm(1)} \\
    {\color{white}.....}%
        \texttt{[1] 0.8413447} \\
    According to this calculation,
    the region shaded that is below 1300
    represents the proportion 0.841 (84.1\%) of SAT test
    takers who had Z-scores below $Z = 1$.
    More generally, we can also specify the cutoff explicitly
    if we also note the mean and standard deviation: \\
    {\color{white}.....}%
        \texttt{> pnorm(1300, mean = 1100, sd = 200)} \\
    {\color{white}.....}%
        \texttt{[1] 0.8413447} %\\
    %\Add{More examples for using \R{} are provided
    %  at the end of the section.}


    There are many other software options, such as Python or SAS;
    even spreadsheet programs such as
    Excel and Google Sheets support these calculations.
\item
    A common strategy in classrooms is to use a graphing
    calculator, such as a TI or Casio calculator.
    These calculators require a series of button presses
    that are less concisely described.
    You can find instructions on using these calculators
    for finding tail areas of a normal distribution in the
    OpenIntro video library:
    \begin{center}
    \oiRedirect{textbook-openintro_videos}
        {www.openintro.org/videos}
    \end{center}
\item
    The last option for finding tail areas is to use
    what's called a \term{probability table};
    these are occasionally used in classrooms
    but rarely in practice.
    Appendix~\ref{normalProbabilityTable}
    contains such a table and a guide for how to use it.
\end{enumerate}
We will solve normal distribution problems in this section
by always first finding the Z-score.
The reason is that we will encounter close parallels
called \indexthis{test statistics}{test statistic}
beginning in Chapter~\ref{ch_foundations_for_inf};
these are, in many instances, an equivalent of a Z-score.

%No matter the approach you choose,
%try the Guided Practice exercises in this section
%using your preferred method.


\D{\newpage}

\subsection{Normal probability examples}
\label{normal_probability_examples}

\noindent%
Cumulative SAT scores are approximated well by a normal model,
$N(\mu = \satmean{}, \sigma = \satsd{})$.

\newcommand{\shannonsat}{1190}
\newcommand{\shannonsatz}{0.45}
\begin{examplewrap}
\begin{nexample}{Shannon is a randomly selected SAT taker,
    and nothing is known about Shannon's SAT aptitude.
    What is the probability Shannon scores at least
    \shannonsat{} on her SATs?}
  \label{satAbove1190Exam}%
  First, always draw and label a picture of the normal
  distribution.
  (Drawings need not be exact to be useful.)
  We are interested in the chance she scores above
  \shannonsat{}, so we shade this upper tail:
  \begin{center}
  \Figure[A normal distribution with a mean of 1100 and standard deviation of 200 has the area below the distribution shaded for horizontal values larger than 1190.]{0.4}{satAbove1190}
  \end{center}
  The picture shows the mean and the values at
  2~standard deviations above and below the mean.
  The simplest way to find the shaded area under
  the curve makes use of the Z-score of the cutoff value.
  With $\mu = \satmean{}$, $\sigma = \satsd{}$,
  and the cutoff value $x = \shannonsat{}$,
  the Z-score is computed as
  \begin{align*}
  Z = \frac{x - \mu}{\sigma}
    = \frac{\shannonsat{} - \satmean{}}{\satsd{}}
    = \frac{90}{\satsd{}}
    = \shannonsatz{}
  \end{align*}
  Using statistical software (or another preferred method),
  we can find the area left of $Z = \shannonsatz{}$ as 0.6736.
  %This is Shannon's \term{percentile},
  %which is the fraction of folks who scored below her score
  %of \shannonsat{}.
  To find the area \emph{above} $Z = \shannonsatz{}$,
  we compute one minus the area of the lower tail:
  \begin{center}
  \Figure[A full shaded normal distribution is shown, then a "minus" sign, then a normal distribution with most of its region shaded up to a little above the mean, then an equals sign, and then a normal distribution with an area in the upper tail shaded. Above those images is the text "1.0000 minus 0.6736 equals 0.3264". This visualization is intended to show how we can think of finding an upper tail of the normal distribution as taking the entire area below the distribution (which has a value of 1) and subtracting a portion of the area to the left to get an area to the right.]{0.4}{subtractingArea}
  \end{center}
  The probability Shannon scores at least 1190 on the SAT
  is 0.3264.
\end{nexample}
\end{examplewrap}

\begin{onebox}{Always draw a picture first,
    and find the Z-score second}
  For any normal probability situation,
  \emph{always always always} draw and label the
  normal curve and shade the area of interest first.
  The picture will provide an estimate of the probability.
  After drawing a figure to represent the situation,
  identify the Z-score for the value of interest.
\end{onebox}

\begin{exercisewrap}
\begin{nexercise}
If the probability of Shannon scoring at least \shannonsat{}
is 0.3264, then what is the probability she scores less than
\shannonsat{}?
Draw the normal curve representing this exercise,
shading the lower region instead of the upper one.\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{We found this probability in
  Example~\ref{satAbove1190Exam}: 0.6736. \\
  \Figures[A normal distribution with mean 1100 and standard deviation 200 is shaded from the left up to a vertical line a little above the distribution mean.]{0.35}{subtractingArea}{subtracted}}

\D{\newpage}

\newcommand{\edwardsat}{1030}
\newcommand{\edwardsatz}{-0.35}
\newcommand{\edwardsatlower}{0.3632}
\begin{examplewrap}
\begin{nexample}{Edward earned a \edwardsat{} on his SAT.
    What is his percentile?}
  \label{edwardSatBelow\edwardsat{}}%
  First, a picture is needed.
  Edward's \hiddenterm{percentile} is the proportion of people
  who do not get as high as a \edwardsat{}.
  These are the scores to the left of \edwardsat{}.
\begin{center}
\Figure[A normal distribution with mean 1100 and standard deviation 200 is shaded from the left up to a vertical line a little below the distribution mean. This area is labeled as "40\% (0.40)".]{0.3}{satBelow1030}
\end{center}
Identifying the mean $\mu=\satmean{}$, the standard
deviation $\sigma=\satsd{}$, and the cutoff for the tail
area $x=\edwardsat{}$ makes it easy to compute the Z-score:
\begin{align*}
Z
  = \frac{x - \mu}{\sigma}
  = \frac{\edwardsat{} - \satmean{}}{\satsd{}}
  = \edwardsatz{}
\end{align*}
Using statistical software, we get a tail area of 0.3632.
Edward is at the $36^{th}$ percentile.
\end{nexample}
\end{examplewrap}

\begin{exercisewrap}
\begin{nexercise}
Use the results of Example~\ref{edwardSatBelow\edwardsat{}}
to compute the proportion of SAT takers who did better than
Edward.
Also draw a new picture.\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{If Edward did better than 36\% of SAT takers,
  then about 64\% must have done better than him. \\
  \Figures{0.25}{satBelow1030}{satAbove1030}}

\begin{onebox}{Finding areas to the right}
  Many software programs return the area to the left
  when given a Z-score.
  If you would like the area to the right, first find the
  area to the left and then subtract this amount from~one.
\end{onebox}

\newcommand{\stuartsat}{1500}
\newcommand{\stuarsatz}{2}
\begin{exercisewrap}
\begin{nexercise}
Stuart earned an SAT score of \stuartsat{}.
Draw a picture for each part. \\
(a)~What is his percentile? \\
(b)~What percent of SAT takers did better than
  Stuart?\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{We leave the drawings to you.
  (a) $Z = \frac{\stuartsat{} - \satmean{}}{\satsd{}}
         = \stuarsatz{}
         \to 0.9772$.
  (b) $1 - 0.9772 = 0.0228$.}

Based on a sample of 100 men, the heights of male adults
in the US are nearly normal with mean 70.0''
and standard deviation 3.3''.

\begin{exercisewrap}
\begin{nexercise}
Mike is 5'7'' and Jose is 6'4'', and they both live in the US. \\
(a) What is Mike's height percentile? \\
(b) What is Jose's height percentile? \\
Also draw one picture for each~part.\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{First put the heights into inches:
  67 and 76 inches.
  Figures are shown below. \\
  (a) $Z_{\text{Mike}} = \frac{67 - 70}{3.3} = -0.91\ \to\ 0.1814$.
  (b) $Z_{\text{Jose}} = \frac{76 - 70}{3.3} = 1.82\ \to\ 0.9656$.
  \\
  \Figure[Two plots are shown. The first plot is labeled "Mike" and shows a normal distribution with a mean of 70 and the left tail below 67 is shaded. The second plot is labeled "Jose" and shows a normal distribution with a mean of 70 and a large portion of the normal distribution up to the value 76 shaded.]{0.45}{mikeAndJosePercentiles}}

\D{\newpage}

The last several problems have focused on finding the
percentile (lower tail) or the upper tail for a particular observation.
What if you would like to know the observation corresponding
to a particular percentile?

\begin{examplewrap}
\begin{nexample}{Erik's height is at the $40^{th}$ percentile.
    How tall is he?}\label{normalExam40Perc}
  As always, first draw the picture.\vspace{-4mm}
  \begin{center}
  \Figure{0.3}{height40Perc}\vspace{-1mm}
  \end{center}
  In this case, the lower tail probability is known (0.40),
  which can be shaded on the diagram.
  We want to find the observation that corresponds to this value.
  As a first step in this direction, we determine the Z-score
  associated with the $40^{th}$ percentile.
  Using software, we can obtain the corresponding Z-score
  of about -0.25.

  Knowing $Z_{Erik} = -0.25$ and the population parameters
  $\mu = 70$ and $\sigma = 3.3$ inches, the Z-score formula can be
  set up to determine Erik's unknown height, labeled
  $x_{_{\text{Erik}}}$:
  \begin{align*}
  -0.25
    = Z_{_{\text{Erik}}}
    = \frac{x_{_{\text{Erik}}} - \mu}{\sigma}
    = \frac{x_{_{\text{Erik}}} - 70}{3.3}
  \end{align*}
  Solving for $x_{_{\text{Erik}}}$ yields a height of 69.18 inches.
  That is, Erik is about 5'9''.
\end{nexample}
\end{examplewrap}

\begin{examplewrap}
\begin{nexample}{What is the adult male height at the
    $82^{nd}$ percentile?}
  Again, we draw the figure first.\vspace{-3mm}
  \begin{center}
  \Figure[A normal distribution with mean 70 and standard deviation 3.3 is shaded from the left up to a vertical line a bit above the distribution mean. The shaded area to the left of the vertical line is labeled as "82\% (0.82)" and the upper, unshaded tail is labeled "18\% (0.18)".]{0.28}{height82Perc}\vspace{-1mm}
  \end{center}
  Next, we want to find the Z-score at the $82^{nd}$ percentile,
  which will be a positive value and can be found using software
  as $Z = 0.92$.
  Finally, the height $x$ is found using the Z-score formula
  with the known mean $\mu$, standard deviation $\sigma$,
  and Z-score $Z = 0.92$:
  \begin{align*}
  0.92 = Z = \frac{x-\mu}{\sigma} = \frac{x - 70}{3.3}
  \end{align*}
  This yields 73.04 inches or about 6'1'' as the height
  at the $82^{nd}$ percentile.
\end{nexample}
\end{examplewrap}

\begin{exercisewrap}
\begin{nexercise}
The SAT scores follow $N(\satmean{}, \satsd{})$.\footnotemark{} \\
(a) What is the $95^{th}$ percentile for SAT scores? \\
(b) What is the $97.5^{th}$ percentile for SAT scores?
\end{nexercise}
\end{exercisewrap}
\footnotetext{Short answers:
  (a) $Z_{95} = 1.6449 \to 1429$ SAT score.
  (b) $Z_{97.5} = 1.96 \to 1492$ SAT score.}

\D{\newpage}

\begin{exercisewrap}
\begin{nexercise}\label{more74Less69}
Adult male heights follow $N(70.0$''$, 3.3$''$)$.\footnotemark{} \\
(a)~What is the probability that a randomly selected male
    adult is at least 6'2'' (74 inches)? \\
(b)~What is the probability that a male adult is shorter
    than 5'9'' (69 inches)?
\end{nexercise}
\end{exercisewrap}
\footnotetext{Short answers:
  (a) $Z = 1.21 \to 0.8869$, then subtract this value
      from 1 to get 0.1131.
  (b) $Z = -0.30 \to 0.3821$.}

\begin{examplewrap}
\begin{nexample}{What is the probability that a random adult
    male is between 5'9'' and 6'2''?}
  These heights correspond to 69 inches and 74 inches.
  First, draw the figure.
  The area of interest is no longer an upper or lower
  tail.\vspace{-2mm}
  \begin{center}
  \Figure[A normal distribution is shown with mean 70 and standard deviation 3.3. An area from just below the mean (69) up to a value further into the right tail (74) is shaded.]{0.35}{between59And62}\vspace{-2mm}
  \end{center}
  The total area under the curve is~1.
  If we find the area of the two tails that are not shaded
  (from Guided Practice~\ref{more74Less69}, these areas are
  $0.3821$ and $0.1131$), then we can find the middle
  area:\vspace{-2mm}
  \begin{center}
  \Figure[A plot is shown where we take the full distribution (1.0000), subtract off a lower tail (0.3821) and a small upper tail (0.1131), leaving a normal distribution with just a segment shaded, from just below the mean to a modest amount above the mean, and this last shaded area is labeled 0.5048.]{0.55}{subtracting2Areas}\vspace{-2mm}
  \end{center}
  That is, the probability of being between 5'9'' and 6'2''
  is 0.5048.
\end{nexample}
\end{examplewrap}

\begin{exercisewrap}
\begin{nexercise}
SAT scores follow $N(\satmean{}, \satsd{})$.
What percent of SAT takers get between \satmean{} and
1400?\footnotemark
\end{nexercise}
\end{exercisewrap}
\footnotetext{This is an abbreviated solution.
  (Be sure to draw a figure!)
  First find the percent who get below \satmean{}
  and the percent that get above 1400:
  $Z_{\satmean{}} = 0.00 \to 0.5000$ (area below),
  $Z_{1400} = 1.5 \to 0.0668$ (area above).
  Final answer: $1.0000 - 0.5000 - 0.0668 = 0.4332$.}

\begin{exercisewrap}
\begin{nexercise}
Adult male heights follow $N(70.0$''$, 3.3$''$)$.
What percent of adult males are between 5'5''
and 5'7''?\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{5'5'' is 65 inches ($Z = -1.52$).
  5'7'' is 67 inches ($Z = -0.91$).
  Numerical solution: $1.000 - 0.0643 - 0.8186 = 0.1171$,
  i.e. 11.71\%.}


\D{\newpage}

\subsection{68-95-99.7 rule}

Here, we present a useful rule of thumb for the probability of falling within 1, 2, and 3 standard deviations of the mean in the normal distribution. This will be useful in a wide range of practical settings, especially when trying to make a quick estimate without a calculator or Z-table.

\begin{figure}[ht]
\centering
\Figure[A normal distribution is shown. The central region, from one standard deviation below the mean to one standard deviation above the mean, is shaded blue and is labeled with a value of 68\%. The region further out to two standard deviations below the mean to two standard deviations above the mean is shaded green (besides the portion shaded blue) and is labeled with a value of 95\%. The region further out to three standard deviations below the mean to three standard deviations above the mean is shaded yellow (besides the portions shaded green or blue) and is labeled with a value of 99.7\%. Those percentages -- 68\%, 95\%, and 99.7\% -- represent the portions of the area below a normal distribution within 1, 2, and 3 standard deviations of the mean.]{0.63}{6895997}
\caption{Probabilities for falling within 1, 2, and 3 standard deviations of the mean in a normal distribution.}
\label{6895997}
\end{figure}

\begin{exercisewrap}
\begin{nexercise}
Use software, a calculator, or a probability table
to confirm that about 68\%, 95\%, and 99.7\%
of observations fall within 1, 2, and 3 standard deviations
of the mean in the normal distribution, respectively.
For instance, first find the area that falls between $Z=-1$
and $Z=1$, which should have an area of about 0.68.
Similarly there should be an area of about 0.95 between
$Z=-2$ and $Z=2$.\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{First draw the pictures.
  Using software, we get 0.6827 within 1~standard deviation,
  0.9545 within 2~standard deviations,
  and 0.9973 within 3~standard deviations.}

It is possible for a normal random variable to fall 4,~5,
or~even more standard deviations from the mean.
However, these occurrences are very rare if the data are
nearly normal.
The probability of being further than 4 standard deviations
from the mean is about 1-in-15,000.
For 5 and 6 standard deviations, it is about 1-in-2 million
and 1-in-500 million, respectively.

\begin{exercisewrap}
\begin{nexercise}
SAT scores closely follow the normal model with mean
$\mu = \satmean{}$ and standard deviation
$\sigma = \satsd{}$.\footnotemark{} \\
(a) About what percent of test takers score 700 to 1500? \\
(b) What percent score between \satmean{} and 1500?
\end{nexercise}
\end{exercisewrap}
\footnotetext{(a) 700 and 1500 represent two standard deviations
  below and above the mean, which means about 95\% of test takers
  will score between 700 and 1500.
  (b)~We found that 700 to 1500 represents about 95\% of test
  takers.
  These test takers would be evenly split by the center of
  the distribution, \satmean{},
  so $\frac{95\%}{2} = 47.5\%$ of all test takers
  score between \satmean{} and 1500.}


{\input{ch_distributions/TeX/normal_distribution.tex}}




%%_________________
%\section{Evaluating the normal approximation}
%\label{assessingNormal}
%
%Many processes can be well approximated by the normal distribution.
%We have already seen two good examples:
%SAT scores and the heights of US adult males.
%While using a normal model can be extremely convenient
%and helpful, it is important to remember normality is
%always an approximation.
%Evaluating the appropriateness of the normal assumption
%is a key step in many data analyses.
%
%\index{normal probability plot|(}
%
%Example~\ref{normalExam40Perc} in Section~\ref{normalDist}
%suggested the distribution of heights of US males is well
%approximated by the normal model.
%We are interested in proceeding under the assumption that
%the data are normally distributed, but first we must check
%to see if this is reasonable.
%
%There are two visual methods for checking the assumption of
%normality, which can be implemented and interpreted quickly.
%The first is a simple histogram with the best fitting normal
%curve overlaid on the plot, as shown in the left panel of
%Figure~\ref{fcidMHeights}.
%The sample mean $\bar{x}$ and standard deviation $s$ are used
%as the parameters of the best fitting normal curve.
%The closer this curve fits the histogram, the more reasonable
%the normal model assumption.
%Another common method is examining a
%\term{normal probability plot},\footnote{Also commonly
%  called a \term{quantile-quantile plot}.}
%shown in the right panel of Figure~\ref{fcidMHeights}.
%The closer the points are to a perfect straight line,
%the more confident we can be that the data follow the
%normal model.
%
%\begin{figure}[h]
%  \centering
%  \Figure{0.7}{fcidMHeights}
%  \caption{A sample of 100 male heights.
%      The observations are rounded to the nearest whole inch,
%      explaining why the points appear to jump in increments
%      in the normal probability plot.}
%  \label{fcidMHeights}
%\end{figure}
%
%\begin{examplewrap}
%\begin{nexample}{Three data sets of 40, 100, and 400
%    samples were simulated from a normal distribution,
%    and the histograms and normal probability plots
%    of the data sets are shown in Figure~\ref{normalExamples}.
%    These will provide a benchmark for what to look for
%    in plots of real data.}
%  \label{normalExamplesExample}%
%  The left panels show the histogram (top) and normal
%  probability plot (bottom) for the simulated data set
%  with 40 observations.
%  The data set is too small to really see clear structure
%  in the histogram.
%  The normal probability plot also reflects this,
%  where there are some deviations from the line.
%  We should expect deviations of this amount for
%  such a small data set.
%
%  The middle panels show diagnostic plots for the
%  data set with 100 simulated observations.
%  The histogram shows more normality and the normal
%  probability plot shows a better fit.
%  While there are a few observations that deviate
%  noticeably from the line, they are not particularly
%  extreme.
%
%  The data set with 400 observations has a histogram
%  that greatly resembles the normal distribution,
%  while the normal probability plot is nearly a perfect
%  straight line.
%  Again in the normal probability plot there is one
%  observation (the largest) that deviates slightly from
%  the line.
%  If that observation had deviated 3 times further from
%  the line, it would be of greater importance in a real
%  data set.
%  Apparent outliers can occur in normally distributed
%  data but they are rare.
%
%  Notice the histograms look more normal as the sample
%  size increases, and the normal probability plot becomes
%  straighter and more stable.
%\end{nexample}
%\end{examplewrap}
%
%\begin{figure}
%  \centering
%  \Figure{0.9}{normalExamples}
%  \caption{Histograms and normal probability plots for
%      three simulated normal data sets; $n=40$ (left),
%      $n=100$ (middle), $n=400$ (right).}
%  \label{normalExamples}
%\end{figure}
%
%\begin{examplewrap}
%\begin{nexample}{Are NBA player heights normally distributed?
%    Consider all 494 NBA players presented in
%    Figure~\ref{nbaNormal}.}
%  We first create a histogram and normal probability plot
%  of the NBA player heights.
%  The histogram in the left panel appears to have too few
%  observations at the upper end since the curve is notably
%  above the histogram.
%  The points in the normal probability plot
%  follow a straight line for much of the center of the
%  distribution, and then deviates more at the upper values.
%  We can compare these characteristics to the sample of
%  400 normally distributed observations in
%  Example~\ref{normalExamplesExample} and see that they
%  represent much stronger deviations from the normal model.
%  NBA player heights do not appear to come from a normal
%  distribution.
%\end{nexample}
%\end{examplewrap}
%
%\begin{examplewrap}
%\begin{nexample}{Can we approximate poker winnings by a normal distribution? We consider the poker winnings of an individual over 50 days. A histogram and normal probability plot of these data are shown in Figure~\ref{pokerNormal}.}
%The data are very strongly right skewed\index{skew!example: very strong} in the histogram, which corresponds to the very strong deviations on the upper right component of the normal probability plot. If we compare these results to the sample of 40 normal observations in Example~\ref{normalExamplesExample}, it is apparent that these data show very strong deviations from the normal model.
%\end{nexample}
%\end{examplewrap}
%
%\begin{figure}
%  \centering
%  \Figure{0.8}{nbaNormal}
%  \caption{Histogram and normal probability plot
%      for the NBA heights from the 2008-9 season.}
%  \label{nbaNormal}
%\end{figure}
%
%\begin{figure}
%  \centering
%  \Figure{0.9}{pokerNormal}
%  \caption{A histogram of poker data with the best
%      fitting normal plot and a normal probability plot.}
%  \label{pokerNormal}
%\end{figure}
%
%\begin{exercisewrap}
%\begin{nexercise}\label{normalQuantileExercise}%
%Determine which data sets represented in
%Figure~\ref{normalQuantileExer} plausibly come from
%a nearly normal distribution.
%Are you confident in all of your conclusions?
%There are 100 (top left), 50 (top right), 500 (bottom left),
%and 15 points (bottom right) in the four plots.\footnotemark{}
%\end{nexercise}
%\end{exercisewrap}
%\footnotetext{Answers may vary a little.
%  The top-left plot shows some deviations in the smallest values
%  in the data set;
%  specifically, the left tail of the data set has some outliers
%  we should be wary of.
%  The top-right and bottom-left plots do not show any obvious
%  or extreme deviations from the lines for their respective
%  sample sizes, so a normal model would be reasonable for these
%  data sets.
%  The bottom-right plot has a consistent curvature that suggests
%  it is not from the normal distribution.
%  If we examine just the vertical coordinates of these
%  observations, we see that there is a lot of data between
%  -20 and 0, and then about five observations scattered
%  between 0 and 70.
%  This describes a distribution that has a strong right skew.}
%
%\begin{figure}
%  \centering
%  \Figure{0.7}{normalQuantileExer}
%  \caption{Four normal probability plots for
%      Guided Practice~\ref{normalQuantileExercise}.}
%  \label{normalQuantileExer}
%\end{figure}
%
%\begin{exercisewrap}
%\begin{nexercise}
%\label{normalQuantileExerciseAdditional}%
%Figure~\ref{normalQuantileExerAdditional} shows normal
%probability plots for two distributions that are skewed.
%One distribution is skewed to the low end (left skewed)
%and the other to the high end (right skewed).
%Which is which?\footnotemark{}
%\end{nexercise}
%\end{exercisewrap}
%\footnotetext{Examine where the points fall along the
%  vertical axis.
%  In the first plot, most points are near the low end
%  with fewer observations scattered along the high end;
%  this describes a distribution that is skewed to the
%  high end.
%  The second plot shows the opposite features,
%  and this distribution is skewed to the low end.}
%
%\begin{figure}[h]
%  \centering
%  \Figures{0.8}{normalQuantileExer}{normalQuantileExerAdditional}
%  \caption{Normal probability plots for
%      Guided Practice~\ref{normalQuantileExerciseAdditional}.}
%  \label{normalQuantileExerAdditional}
%\end{figure}
%
%\index{normal probability plot|)}
\index{normal distribution|)}
\index{distribution!normal|)}




%_________________
\section{Geometric distribution}
\label{geomDist}

How long should we expect to flip a coin until it turns up \resp{heads}? Or how many times should we expect to roll a die until we get a \resp{1}? These questions can be answered using the geometric distribution. We first formalize each trial -- such as a single coin flip or die toss -- using the Bernoulli distribution, and then we combine these with our tools from probability (Chapter~\ref{probability}) to construct the geometric distribution.

\subsection{Bernoulli distribution}
\label{bernoulli}

\newcommand{\insureSprob}{0.7}
\newcommand{\insureSperc}{70\%}
\newcommand{\insureFprob}{0.3}
\newcommand{\insureFperc}{30\%}
\newcommand{\insureDistA}{0.7}
\newcommand{\insureDistB}{0.21}
\newcommand{\insureDistC}{0.063}
\newcommand{\insureDistD}{0.019}
\newcommand{\insureDistE}{0.006}
\newcommand{\insureCDistA}{0.7}
\newcommand{\insureCDistB}{0.91}
\newcommand{\insureCDistC}{0.973}
\newcommand{\insureCDistCComplement}{0.027}
\newcommand{\insureCDistD}{0.992}
\newcommand{\insureCDistE}{0.998}
\newcommand{\insureGeomMean}{1.43}

\index{distribution!Bernoulli|(}

Many health insurance plans in the United States have
a deductible, where the insured individual is responsible
for costs up to the deductible, and then the costs above
the deductible are shared between the individual and
insurance company for the remainder of the year.

Suppose a health insurance company found that \insureSperc{} of the
people they insure stay below their deductible in any given year.
Each of these people can be thought of as a \term{trial}.
We label a person a \term{success} if her healthcare costs
do not exceed the deductible.
We label a person a \term{failure} if she does exceed her
deductible in the year.
Because 70\% of the individuals will not hit their deductible,
we denote the \term{probability of a success} as
$p = \insureSprob{}$.
The probability of a failure is sometimes denoted with
$q = 1 - p$, which would be \insureFprob{} for the insurance
example.

When an individual trial only has two possible outcomes, often
labeled as \resp{success} or \resp{failure}, it is called a
\termsub{Bernoulli random variable}{distribution!Bernoulli}.
We chose to label a person who does not hit her deductible
as a ``success'' and all others as ``failures''.
However, we could just as easily have reversed these labels.
The mathematical framework we will build does not depend
on which outcome is labeled a success and which a failure,
as long as we are consistent.

Bernoulli random variables are often denoted as \resp{1}
for a success and \resp{0} for a failure.
In addition to being convenient in entering data,
it is also mathematically handy.
Suppose we observe ten trials:
\begin{center}
\resp{1} \resp{1} \resp{1} \resp{0} \resp{1} \resp{0} \resp{0} \resp{1} \resp{1} \resp{0}
\end{center}
Then the \term{sample proportion}, $\hat{p}$, is the
sample mean of these observations:
\begin{align*}
\hat{p} = \frac{\text{\# of successes}}{\text{\# of trials}}
    = \frac{1+1+1+0+1+0+0+1+1+0}{10} = 0.6
\end{align*}%
This mathematical inquiry of Bernoulli random variables can
be extended even further.
%\Comment{Maybe the next footnote should instead be an EOCE?}
Because \resp{0} and \resp{1} are numerical outcomes,
we can define the {mean} and {standard deviation}
of a Bernoulli random variable.
(See Exercises~\ref{bernoulli_mean_derivation}
and~\ref{bernoulli_sd_derivation}.)

\begin{onebox}{Bernoulli random variable}
%  A Bernoulli random variable has exactly two possible
%  outcomes, often labeled \resp{1} for the ``success''
%  outcome and \resp{0} for the ``failure'' outcome.\vspace{3mm}
  If $X$ is a random variable that takes value 1 with
  probability of success $p$ and 0 with probability $1-p$,
  then $X$ is a Bernoulli random variable with mean
  and standard deviation
  \begin{align*}
  \mu &= p
      &\sigma&= \sqrt{p(1-p)}
  \end{align*}
\end{onebox}

In general, it is useful to think about a Bernoulli random variable as a random process with only two outcomes: a success or failure. Then we build our mathematical framework using the numerical labels \resp{1} and \resp{0} for successes and failures, respectively.

\index{distribution!Bernoulli|)}


\D{\newpage}

\subsection{Geometric distribution}

\index{distribution!geometric|(}

The \termsub{geometric distribution}{distribution!geometric}
is used to describe how
many trials it takes to observe a success.
Let's first look at an example.

\begin{examplewrap}
\begin{nexample}{Suppose we are working at the insurance
    company and need to find a case where the person did
    not exceed her (or his) deductible as a case study.
    If the probability a person will not exceed her
    deductible is \insureSprob{} and we are drawing people
    at random, what are the chances that the first person
    will not have exceeded her deductible, i.e. be a success?
    The second person?
    The third?
    What if we pull $n - 1$ cases before we find
    the first success, i.e. the first success is the
    $n^{th}$ person?
    (If the first success is the fifth person, then we say $n=5$.)}
  \label{waitForDeductible}%
  The probability of stopping after the first person is just
  the chance the first person will not hit her (or his)
  deductible:~\insureSprob{}.
  The probability the second person is the first to not hit
  her deductible is
  \begin{align*}
  &P(\text{second person is the first to not hit deductible}) \\
  &\quad
    = P(\text{the first will, the second won't})
    = (\insureFprob{})(\insureSprob{})
    = \insureDistB{}
  \end{align*}
  Likewise, the probability it will be the third case is
  $(\insureFprob{})(\insureFprob{})(\insureSprob{})
    = \insureDistC$.

  If the first success is on the $n^{th}$ person,
  then there are $n-1$ failures and finally 1 success,
  which corresponds to the probability
  $(\insureFprob{})^{n-1}(\insureSprob{})$.
  This is the same as
  $(1-\insureSprob{})^{n-1}(\insureSprob{})$.
\end{nexample}
\end{examplewrap}

Example~\ref{waitForDeductible} illustrates what is called the
\termsub{geometric distribution}{distribution!geometric},
which describes the waiting
time until a success for
\term{independent and identically distributed (iid)}
Bernoulli random variables.
In this case, the \emph{independence} aspect just means
the individuals in the example don't affect each other,
and \emph{identical} means they each have the same probability
of success.

The geometric distribution from Example~\ref{waitForDeductible} is shown in Figure~\ref{geometricDist70}. In general, the probabilities for a geometric distribution decrease \term{exponentially} fast.

\begin{figure}[h]
  \centering
  \Figure[The probability distribution of "Number of Trials Until a Success for p = 0.7" is shown, which appears as a bar plot. The possible values shown are 1, 2, 3, 4, 5, 6, 7, and 8. The probabilities for these are about 0.7, 0.21, 0.07, 0.02, 0.01, and then the values are nearly indistinguishable for the values of 6, 7, and 8.]{0.8}{geometricDist70}
  \caption{The geometric distribution when the probability
      of success is $p = \insureSprob{}$.}
  \label{geometricDist70}
\end{figure}

While this text will not derive the formulas for the mean (expected) number of trials needed to find the first success or the standard deviation or variance of this distribution, we present general formulas for each.

\begin{onebox}{Geometric Distribution}
  \index{distribution!geometric|textbf}%
  If the probability of a success in one trial is $p$
  and the probability of a failure is $1-p$, then the
  probability of finding the first success in the
  $n^{th}$ trial is given by\vspace{-1.5mm}
  \begin{align*}
  (1-p)^{n-1}p
  \end{align*}
  The mean (i.e. expected value), variance,
  and standard deviation of this wait time are given by
  \begin{align*}
  \mu &= \frac{1}{p}
      &\sigma^2 &=\frac{1-p}{p^2}
      &\sigma &= \sqrt{\frac{1-p}{p^2}}
  \end{align*}
\end{onebox}

It is no accident that we use the symbol $\mu$ for both the mean and expected value. The mean and the expected value are one and the same.

It takes, on average, $1/p$ trials to get a success under the geometric distribution. This mathematical result is consistent with what we would expect intuitively. If the probability of a success is high (e.g. 0.8), then we don't usually wait very long for a success: $1/0.8 = 1.25$ trials on average. If the probability of a success is low (e.g. 0.1), then we would expect to view many trials before we see a success: $1/0.1 = 10$ trials.

\begin{exercisewrap}
\begin{nexercise}
The probability that a particular case would not exceed their
deductible is said to be \insureSprob{}.
If we were to examine cases until we found one where
the person did not hit her deductible, how many cases should
we expect to check?\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{We would expect to see about
    $1 / \insureSprob{} \approx \insureGeomMean{}$
    individuals to find the first success.}

\begin{examplewrap}
\begin{nexample}{What is the chance that we would find
    the first success within the first 3 cases?}
  \label{insureFirstSuccessInLT4}%
  This is the chance that the first ($n=1$), second ($n=2$),
  or third ($n=3$) case is the first success, which are three
  disjoint outcomes.
  Because the individuals in the sample are randomly sampled
  from a large population, they are independent.
  We compute the probability of each case and add the separate
  results:
  \begin{align*}
  &P(n=1, 2, \text{ or }3) \\
    & \quad = P(n=1)+P(n=2)+P(n=3) \\
    & \quad = (\insureFprob{})^{1-1}(\insureSprob{})
        + (\insureFprob{})^{2-1}(\insureSprob{})
        + (\insureFprob{})^{3-1}(\insureSprob{}) \\
    & \quad = \insureCDistC{}
  \end{align*}
  There is a probability of \insureCDistC{} that we would
  find a successful case within 3 cases.
\end{nexample}
\end{examplewrap}

\begin{exercisewrap}
\begin{nexercise}
Determine a more clever way to solve Example~\ref{insureFirstSuccessInLT4}.
Show that you get the same result.\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{First find the probability of the complement:
  $P($no success in first 3~trials$)
      = (\insureFprob{})^3 = \insureCDistCComplement{}$.
  Next, compute one minus this probability:
  $1 - P($no success in 3 trials$)
      = 1 - \insureCDistCComplement{}
      = \insureCDistC{}$.}

\D{\newpage}

\begin{examplewrap}
\begin{nexample}{Suppose a car insurer has determined
    that 88\% of its drivers will not exceed their deductible
    in a given year.
    If someone at the company were to randomly draw
    driver files until they found one that had not exceeded
    their deductible, what is the expected number of drivers
    the insurance employee must check?
    What is the standard deviation of the number of driver files
    that must be drawn?}
  \label{carInsure08DrawOne}%
  In this example, a success is again when someone will not
  exceed the insurance deductible, which has probability
  $p = 0.88$.
  The expected number of people to be checked is
  $1 / p = 1 / 0.88 = 1.14$ and the standard deviation is
  $\sqrt{(1-p)/p^2} = 0.39$.
\end{nexample}
\end{examplewrap}

\begin{exercisewrap}
\begin{nexercise}
Using the results from Example~\ref{carInsure08DrawOne},
$\mu = 1.14$ and $\sigma = 0.39$, would it be appropriate
to use the normal model to find what proportion
of experiments would end in 3 or fewer trials?\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{No. The geometric distribution is always
  right skewed and can never be well-approximated by the
  normal model.}

The independence assumption is crucial to the geometric
distribution's accurate description of a scenario.
Mathematically, we can see that to construct the probability
of the success on the $n^{th}$ trial, we had to use the
Multiplication Rule for Independent Processes.
It is no simple task to generalize the geometric model
for dependent trials.

\index{distribution!geometric|)}


{\input{ch_distributions/TeX/geometric_distribution.tex}}





\section{Binomial distribution}
\label{binomialModel}

\index{distribution!binomial|(}

The \termsub{binomial distribution}{distribution!binomial}
is used to describe
the number of successes in a fixed number of trials.
%,
%and this distribution is occasionally used in statistics,
%especially when doing more careful analysis of samples
%of data where simpler tools are not helpful.
This is different from the geometric distribution,
which described the number of trials we must wait before
we observe a success.


\subsection{The binomial distribution}

%\newcommand{\insureSprob}{0.7}
%\newcommand{\insureSperc}{70\%}
%\newcommand{\insureFprob}{0.3}
%\newcommand{\insureFperc}{30\%}
%\newcommand{\insureDistA}{0.7}
%\newcommand{\insureDistB}{0.21}
%\newcommand{\insureDistC}{0.063}
%\newcommand{\insureDistD}{0.019}
%\newcommand{\insureDistE}{0.006}
%\newcommand{\insureCDistA}{0.7}
%\newcommand{\insureCDistB}{0.91}
%\newcommand{\insureCDistC}{0.973}
%\newcommand{\insureCDistCComplement}{0.027}
%\newcommand{\insureCDistD}{0.992}
%\newcommand{\insureCDistE}{0.998}
%\newcommand{\insureGeomMean}{1.43}
\newcommand{\insureS}{\resp{not}}
\newcommand{\insureF}{\resp{exceed}}
% Doesn't consider binomial coefficient in next calculated value.
\newcommand{\insureBinomCinDSingleScenario}{0.103}
\newcommand{\insureBinomCinD}{0.412}
\newcommand{\insureBinomEinHSingleScenario}{0.00454}
\newcommand{\insureBinomEinH}{0.254}
\newcommand{\insureBinomFourtyExpValue}{28}
\newcommand{\insureBinomFourtySD}{2.9}
\newcommand{\insureBinomFourtyLower}{22}
\newcommand{\insureBinomFourtyUpper}{34}

\noindent%
Let's again imagine ourselves back at the insurance agency
where \insureSperc{} of individuals do not exceed their
deductible.

\begin{examplewrap}
\begin{nexample}{Suppose the insurance agency is considering
    a random sample of four individuals they insure.
    What is the chance exactly one of them will exceed
    the deductible and the other three will not?
    Let's call the four people
    Ariana ($A$),
    Brittany ($B$),
    Carlton ($C$),
    and Damian ($D$)
    for convenience.}
  \label{insureOneOfFourExceedsDeductible}%
  Let's consider a scenario where one person exceeds
  the deductible:
  \begin{align*}
  &P(A=\text{\insureF{}},
      \text{ }B=\text{\insureS{}},
      \text{ }C=\text{\insureS{}},
      \text{ }D=\text{\insureS{}}) \\
    &\quad = P(A=\text{\insureF{}})\ 
        P(B=\text{\insureS{}})\ 
        P(C=\text{\insureS{}})\ 
        P(D=\text{\insureS{}}) \\
    &\quad =  (\insureFprob{})
        (\insureSprob{})
        (\insureSprob{})
        (\insureSprob{}) \\
    &\quad = (\insureSprob{})^3 (\insureFprob{})^1 \\
    &\quad = \insureBinomCinDSingleScenario{}
  \end{align*}
  But there are three other scenarios: Brittany, Carlton,
  or Damian could have been the one to exceed the deductible.
  In each of these cases, the probability is again
  $(\insureSprob{})^3 (\insureFprob{})^1$.
  These four scenarios exhaust all the possible ways that
  exactly one of these four people could have exceeded
  the deductible, so the total probability is
  $4 \times (\insureSprob{})^3 (\insureFprob{})^1
      = \insureBinomCinD{}$.
\end{nexample}
\end{examplewrap}

\begin{exercisewrap}
\begin{nexercise}
Verify that the scenario where Brittany is the only one
to exceed the deductible has probability
$(\insureSprob{})^3 (\insureFprob{})^1$.~\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{
  $P(A=\text{\insureS{}},
      \text{ }B=\text{\insureF{}},
      \text{ }C=\text{\insureS{}},
      \text{ }D=\text{\insureS{}})
    = (\insureSprob{})(\insureFprob{})
        (\insureSprob{})(\insureSprob{})
    = (\insureSprob{})^3 (\insureFprob{})^1$.}


The scenario outlined in Example~\ref{insureOneOfFourExceedsDeductible} is an
example of a binomial distribution scenario.
The \termsub{binomial distribution}{distribution!binomial}
describes the probability of having exactly $k$ successes
in $n$ independent Bernoulli trials with probability
of a success $p$
(in Example~\ref{insureOneOfFourExceedsDeductible},
$n=4$, $k=3$, $p=\insureSprob{}$).
We would like to determine the probabilities associated
with the binomial distribution more generally,
i.e. we want a formula where we can use $n$, $k$, and $p$
to obtain the probability.
To do this, we reexamine each part of
Example~\ref{insureOneOfFourExceedsDeductible}.

There were four individuals who could have been the one
to exceed the deductible, and each of these four scenarios
had the same probability.
Thus, we could identify the final probability as
\begin{align*}
[\text{\# of scenarios}] \times P(\text{single scenario})
\end{align*}
The first component of this equation is the number of ways
to arrange the $k=3$ successes among the $n=4$ trials.
The second component is the probability of any of the four
(equally probable) scenarios.

\D{\newpage}

Consider $P($single scenario$)$ under the general case of
$k$ successes and $n-k$ failures in the $n$ trials.
In any such scenario, we apply the Multiplication Rule
for independent events:
\begin{align*}
p^k (1 - p)^{n - k}
\end{align*}
This is our general formula for $P($single scenario$)$.

Secondly, we introduce a general formula for the number
of ways to choose $k$ successes in $n$ trials,
i.e. arrange $k$ successes and $n - k$ failures:
\begin{align*}
{n\choose k} = \frac{n!}{k! (n - k)!}
\end{align*}
The quantity ${n\choose k}$ is read
\term{n choose k}.\footnote{Other notation for
  $n$ choose $k$ includes $_nC_k$, $C_n^k$, and $C(n,k)$.}
The exclamation point notation (e.g. $k!$) denotes
a \term{factorial} expression.\label{factorial_defined}
\begin{align*}
& 0! = 1 \\
& 1! = 1 \\
& 2! = 2\times1 = 2 \\
& 3! = 3\times2\times1 = 6 \\
& 4! = 4\times3\times2\times1 = 24 \\
& \vdots \\
& n! = n\times(n-1)\times...\times3\times2\times1
\end{align*}
Using the formula, we can compute the number of ways
to choose $k = 3$ successes in $n = 4$ trials:
\begin{align*}
{4 \choose 3} = \frac{4!}{3!(4-3)!}
  = \frac{4!}{3!1!} 
  = \frac{4\times3\times2\times1}{(3\times2\times1) (1)}
  = 4
\end{align*}
This result is exactly what we found by carefully thinking
of each possible scenario in
Example~\ref{insureOneOfFourExceedsDeductible}.

Substituting $n$ choose $k$ for the number of scenarios
and $p^k(1-p)^{n-k}$ for the single scenario probability
yields the general binomial formula.

\begin{onebox}{Binomial distribution}
  Suppose the probability of a single trial being
  a success is $p$.
  Then the probability of observing exactly $k$ successes
  in $n$ independent trials is given by\vspace{-1mm}
  \begin{align*}
  {n\choose k}p^k(1-p)^{n-k} = \frac{n!}{k!(n-k)!}p^k(1-p)^{n-k}
  \end{align*}
  The mean, variance, and standard deviation
  of the number of observed successes are\vspace{-2mm}
  \begin{align*}
  \mu &= np
  &\sigma^2 &= np(1-p)
  &\sigma&= \sqrt{np(1-p)}
  \end{align*}
\end{onebox}

\begin{onebox}{Is it binomial? Four conditions to check.}
  \label{isItBinomialTipBox}%
  (1) The trials are independent. \\
  (2) The number of trials, $n$, is fixed. \\
  (3) Each trial outcome can be classified as a \emph{success}
      or \emph{failure}. \\
  (4) The probability of a success, $p$, is the same for
      each trial.
\end{onebox}

\D{\newpage}

\begin{examplewrap}
\begin{nexample}{What is the probability that 3 of 8 randomly
    selected individuals will have exceeded the insurance
    deductible, i.e. that 5 of 8 will not exceed the deductible?
    Recall that 70\% of individuals will not exceed the
    deductible.}
  We would like to apply the binomial model,
  so we check the conditions.
  The number of trials is fixed ($n = 8$) (condition 2)
  and each trial outcome can be classified as a success
  or failure (condition 3).
  Because the sample is random, the trials are independent
  (condition~1) and the probability of a success is the same
  for each trial (condition~4).

  In the outcome of interest, there are $k = 5$ successes
  in $n = 8$ trials (recall that a success is an individual
  who does \emph{not} exceed the deductible), and the
  probability of a success is $p = \insureSprob{}$.
  So the probability that 5 of 8 will not exceed the
  deductible and 3 will exceed the deductible is given by
  \begin{align*}
  { 8 \choose 5}(\insureSprob{})^5
  (1-\insureSprob{})^{8-5}
    &= \frac{8!}{5!(8-5)!}
        (\insureSprob{})^5(1-\insureSprob{})^{8-5} \\
    &= \frac{8!}{5!3!}
        (\insureSprob{})^5(\insureFprob{})^3
  \end{align*}
  Dealing with the factorial part:
  \begin{align*}
  \frac{8!}{5!3!}
    = \frac{8\times7\times6\times5\times4\times3\times2\times1}
        {(5\times4\times3\times2\times1)(3\times2\times1)}
    = \frac{8\times7\times6}{3\times2\times1}
    = 56
  \end{align*}
  Using $(\insureSprob{})^5(\insureFprob{})^3
    \approx \insureBinomEinHSingleScenario{}$,
  the final probability is about
  $56 \times \insureBinomEinHSingleScenario{}
    \approx \insureBinomEinH{}$.
\end{nexample}
\end{examplewrap}

\begin{onebox}{Computing binomial probabilities}
  The first step in using the binomial model is to check
  that the model is appropriate.
  The second step is to identify $n$, $p$, and $k$.
  As the last stage use software or the formulas
  to determine the probability, then interpret the results.%
  \vspace{3mm}

  If you must do calculations by hand, it's often useful
  to cancel out as many terms as possible in the top and
  bottom of the binomial coefficient.
\end{onebox}

\begin{exercisewrap}
\begin{nexercise}
If we randomly sampled 40 case files from the insurance agency
discussed earlier, how many of the cases would you expect to not
have exceeded the deductible in a given year?
What is the standard deviation of the number that would not
have exceeded the deductible?\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{We are asked to determine the expected number
  (the mean) and the standard deviation, both of which can
  be directly computed from the formulas:
  $\mu = np = 40 \times \insureSprob{}
    = \insureBinomFourtyExpValue$
  and $\sigma = \sqrt{np(1-p)}
    = \sqrt{40\times \insureSprob{}\times \insureFprob{}}
    = \insureBinomFourtySD{}$.
  Because very roughly 95\% of observations fall within
  2~standard deviations of the mean
  (see Section~\ref{variability}), we would probably observe
  at least \insureBinomFourtyLower{}
  but fewer than \insureBinomFourtyUpper{} individuals
  in our sample who would not exceed the deductible.}

\begin{exercisewrap}
\begin{nexercise}
The probability that a random smoker will develop a severe
lung condition in his or her lifetime is about $0.3$.
If you have 4 friends who smoke, are the conditions for the
binomial model satisfied?\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{One possible answer:
  if the friends know each other, then the independence
  assumption is probably not satisfied.
  For example, acquaintances may have similar smoking habits,
  or those friends might make a pact to quit together.}

\D{\newpage}

\begin{exercisewrap}
\begin{nexercise}
\label{noMoreThanOneFriendWSevereLungCondition}%
Suppose these four friends do not know each other
and we can treat them as if they were a random sample
from the population.
Is the binomial model appropriate?
What is the probability that\footnotemark{}
\begin{enumerate}[(a)]
\setlength{\itemsep}{0mm}
\item
    None of them will develop a severe lung condition?
\item
    One will develop a severe lung condition?
\item
    No more than one will develop a severe lung condition?
\end{enumerate}
\end{nexercise}
\end{exercisewrap}
\footnotetext{To check if the binomial model is appropriate,
  we must verify the conditions.
  (i)~Since we are supposing we can treat the friends
  as a random sample, they are independent.
  (ii)~We have a fixed number of trials ($n=4$).
  (iii)~Each outcome is a success or failure.
  (iv)~The probability of a success is the same for each
  trial since the individuals are like a random sample
  ($p=0.3$ if we say a ``success'' is someone getting
  a lung condition, a morbid choice).
  Compute parts~(a) and~(b) using the binomial formula:
  $P(0)
    = {4 \choose 0} (0.3)^0 (0.7)^4
    = 1\times1\times0.7^4
    = 0.2401$,
  $P(1)
    = {4 \choose 1} (0.3)^1(0.7)^{3}
    = 0.4116$.
  Note: $0!=1$.
  Part~(c) can be computed as the sum of parts~(a) and~(b):
  $P(0) + P(1) = 0.2401 + 0.4116 = 0.6517$.
  That is, there is about a 65\% chance that no more than
  one of your four smoking friends will develop a severe
  lung condition.}

\begin{exercisewrap}
\begin{nexercise}
What is the probability that at least 2 of your 4 smoking
friends will develop a severe lung condition in their
lifetimes?\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{The complement (no more than one will develop
  a severe lung condition) was computed in Guided
  Practice~\ref{noMoreThanOneFriendWSevereLungCondition}
  as 0.6517, so we compute one minus this value:~0.3483.}

\begin{exercisewrap}
\begin{nexercise}
Suppose you have 7 friends who are smokers and they can
be treated as a random sample of smokers.\footnotemark{}
\begin{enumerate}[(a)]
\setlength{\itemsep}{0mm}
\item
    How many would you expect to develop a severe lung
    condition, i.e. what is the mean?
\item
    What is the probability that at most 2 of your 7
    friends will develop a severe lung condition?
\end{enumerate}
\end{nexercise}
\end{exercisewrap}
\footnotetext{(a)~$\mu=0.3\times7 = 2.1$.
  (b)~$P($0, 1, or 2 develop severe lung condition$)
      = P(k=0) + P(k=1)+P(k=2) = 0.6471$.}

Next we consider the first term in the binomial probability,
$n$ choose $k$, under some special scenarios.

\begin{exercisewrap}
\begin{nexercise}
Why is it true that ${n \choose 0}=1$ and ${n \choose n}=1$
for any number $n$?\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{Frame these expressions into words.
  How many different ways are there to arrange 0 successes
  and $n$ failures in $n$ trials?
  (1 way.)
  How many different ways are there to arrange $n$ successes
  and 0 failures in $n$ trials?
  (1 way.)}

\begin{exercisewrap}
\begin{nexercise}
How many ways can you arrange one success and $n-1$ failures
in $n$ trials?
How many ways can you arrange $n-1$ successes and one failure
in $n$ trials?\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{One success and $n-1$ failures:
  there are exactly $n$ unique places we can put
  the success, so there are $n$ ways to arrange one
  success and $n-1$ failures.
  A~similar argument is used for the second question.
  Mathematically, we show these results by verifying
  the following two equations:
  \begin{align*}
  {n \choose 1} = n,
    \qquad {n \choose n-1} = n
  \end{align*}}


\newpage


\subsection{Normal approximation to the binomial distribution}
\label{normalApproxBinomialDistSubsection}

\index{distribution!binomial!normal approximation|(}

The binomial formula is cumbersome when the sample size ($n$) is large, particularly when we consider a range of observations. In some cases we may use the normal distribution as an easier and faster way to estimate binomial probabilities.

\newcommand{\smokeprop}{0.15}
\newcommand{\smokeperc}{15\%}
\newcommand{\smokepropcomp}{0.85}
\newcommand{\smokeperccomp}{85\%}
\newcommand{\smokex}{42}
\newcommand{\smokexplusone}{43}
\newcommand{\smoken}{400}
\newcommand{\smokelowertailbinom}{0.0054}
\newcommand{\smokemean}{60}
\newcommand{\smokemeancomp}{340}
\newcommand{\smokesd}{7.14}
\newcommand{\smokez}{-2.52}
\newcommand{\smokelowertailnormal}{0.0059}

\begin{examplewrap}
\begin{nexample}{Approximately \smokeperc{} of the
    US population smokes cigarettes.
    A local government believed their community had
    a lower smoker rate and commissioned a survey of
    400 randomly selected individuals.
    The survey found that only \smokex{} of the
    \smoken{} participants smoke cigarettes.
    If the true proportion of smokers in the community
    was really \smokeperc{}, what is the probability
    of observing \smokex{} or fewer smokers in a sample
    of \smoken{} people?}
  \label{exactBinomSmokerExSetup}%
  We leave the usual verification that the four conditions
  for the binomial model are valid as an exercise.

  The question posed is equivalent to asking,
  what is the probability of observing
  $k=0$, 1, 2, ..., or \smokex{} smokers in a sample of
  $n = \smoken{}$ when $p=\smokeprop{}$?
  We can compute these \smokexplusone{} different
  probabilities and add them together to find the answer:
  \begin{align*}
  &P(k=0\text{ or }k=1\text{ or }\cdots\text{ or } k=\smokex{}) \\
	&\qquad = P(k=0) + P(k=1) + \cdots + P(k=\smokex{}) \\
	&\qquad = \smokelowertailbinom{}
  \end{align*}
  If the true proportion of smokers in the community
  is $p=\smokeprop{}$, then the probability of observing
  \smokex{} or fewer smokers in a sample of $n=\smoken{}$
  is \smokelowertailbinom{}.
\end{nexample}
\end{examplewrap}

The computations in Example~\ref{exactBinomSmokerExSetup}
are tedious and long.
In general, we should avoid such work if an alternative method
exists that is faster, easier, and still accurate.
Recall that calculating probabilities of a range of values
is much easier in the normal model.
We might wonder, is it reasonable to use the normal model
in place of the binomial distribution?
Surprisingly, yes, if certain conditions are met.

\begin{exercisewrap}
\begin{nexercise}
Here we consider the binomial model when the probability
of a success is $p = 0.10$.
Figure~\ref{fourBinomialModelsShowingApproxToNormal}
shows four hollow histograms for simulated samples from
the binomial distribution using four different sample sizes:
$n = 10, 30, 100, 300$.
What happens to the shape of the distributions as the sample
size increases?
What distribution does the last hollow histogram
resemble?\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{The distribution is transformed from a blocky
  and skewed distribution into one that rather resembles the
  normal distribution in last hollow histogram.}

\begin{figure}[h]
  \centering
  \Figure[Four hollow histograms are shown, each in their own plot, based on a probability of p equals 0.10 and sample sizes of n equals 10, 30, 100, and 300. The first plot for n = 10 shows a distribution centered at 1 and is notably right skewed. The second plot for n = 30 shows a distribution centered at about 3, is just a bit right skewed, and is starting to look a little bit like a bell-shaped distribution. The third plot for n = 100 shows a distribution centered at about 10 and that is almost entirely symmetric with just the slightest indication it is right skewed. This third distribution also looks very bell-shaped. The fourth plot for n = 300 shows a distribution centered at about 30 and that is symmetric. This last plot looks very bell-shaped and resembles a normal distribution.]{0.92}{fourBinomialModelsShowingApproxToNormal}
  \caption{Hollow histograms of samples from the binomial
      model when $p = 0.10$.
      The sample sizes for the four plots are
      $n = 10$, 30, 100, and 300, respectively.}
  \label{fourBinomialModelsShowingApproxToNormal}
\end{figure}

\begin{onebox}{Normal approximation of the binomial distribution}
  The binomial distribution with probability of success
  $p$ is nearly normal when the sample size $n$ is
  sufficiently large that $np$ and $n(1-p)$ are both
  at least 10.
  The approximate normal distribution has parameters
  corresponding to the mean and standard deviation of
  the binomial distribution:\vspace{-1.5mm}
  \begin{align*}
  \mu &= np
      &\sigma& = \sqrt{np(1 - p)}
  \end{align*}
\end{onebox}

The normal approximation may be used when computing
the range of many possible successes.
For instance, we may apply the normal distribution to
the setting of Example~\ref{exactBinomSmokerExSetup}.

\D{\newpage}

\begin{examplewrap}
\begin{nexample}{How can we use the normal approximation
    to estimate the probability of observing \smokex{} or
    fewer smokers in a sample of \smoken{}, if the true
    proportion of smokers is $p = \smokeprop{}$?}
  \label{approxNormalForSmokerBinomEx}
  Showing that the binomial model is reasonable was a
  suggested exercise in Example~\ref{exactBinomSmokerExSetup}.
  We also verify that both $np$ and $n(1-p)$ are at least 10:
  \begin{align*}
  np &= \smoken{} \times \smokeprop{} = \smokemean{}
  &n (1 - p) &= \smoken{} \times \smokepropcomp{}
      = \smokemeancomp{}
  \end{align*}
  With these conditions checked, we may use the normal
  approximation in place of the binomial distribution
  using the mean and standard deviation from the binomial
  model:
  \begin{align*}
  \mu &= np = \smokemean{}
  &\sigma &= \sqrt{np(1 - p)} = \smokesd{}
  \end{align*}
  We want to find the probability of observing
  \smokex{} or fewer smokers using this model.
\end{nexample}
\end{examplewrap}

\begin{exercisewrap}
\begin{nexercise}
Use the normal model $N(\mu = \smokemean{}, \sigma = \smokesd{})$
to estimate the probability of observing \smokex{} or fewer
smokers.
Your answer should be approximately equal to the solution
of Example~\ref{exactBinomSmokerExSetup}:%
~\smokelowertailbinom{}.~\footnotemark{}
\end{nexercise}
\end{exercisewrap}
\footnotetext{Compute the Z-score first:
  $Z = \frac{\smokex{} - \smokemean{}}{\smokesd{}} = \smokez{}$.
  The corresponding left tail area is \smokelowertailnormal{}.}



\newpage


\subsection{The normal approximation breaks down on small intervals}

The normal approximation to the binomial distribution tends to perform poorly when estimating the probability of a small range of counts, even when the conditions are met.

\newcommand{\smokeA}{49}
\newcommand{\smokeB}{50}
\newcommand{\smokeC}{51}
\newcommand{\smokeABCBinom}{0.0649}
\newcommand{\smokeABCNormal}{0.0421}
\newcommand{\smokeABCNormalFixed}{0.0633}

Suppose we wanted to compute the probability of observing
\smokeA{}, \smokeB{}, or \smokeC{} smokers in \smoken{}
when $p = \smokeprop{}$.
With such a large sample, we might be tempted to apply
the normal approximation and use the range \smokeA{} to \smokeC{}.
However, we would find that the binomial solution and the normal
approximation notably differ:
\begin{align*}
\text{Binomial:}&\ \smokeABCBinom{}
&\text{Normal:}&\ \smokeABCNormal{}
\end{align*}
We can identify the cause of this discrepancy using
Figure~\ref{normApproxToBinomFail}, which shows the areas
representing the binomial probability (outlined) and normal
approximation (shaded).
Notice that the width of the area under the normal
distribution is 0.5 units too slim on both sides of
the interval.

\begin{figure}[h]
  \centering
  \Figure[A normal distribution centered at 60 with a standard deviation of about 7 is shown. (The determination that the standard deviation is about 7 was based on the normal distribution being very close to 0 a distance of about 20 from the mean, and this happens about 3 standard deviations from the mean.) A region of this distribution is shaded from 49 to 51. Additionally, a red outlined area is boxed out between 48.5 and 51.5 that represents the exact binomial distribution.]{1.0}{normApproxToBinomFail}
  \caption{A normal curve with the area between
      \smokeA{} and \smokeC{} shaded.
      The outlined area represents the exact binomial
      probability.}
  \label{normApproxToBinomFail}
\end{figure}

\begin{onebox}{Improving the normal approximation
    for the binomial distribution}
  The normal approximation to the binomial distribution
  for intervals of values is usually improved if cutoff
  values are modified slightly.
  The cutoff values for the lower end of a shaded region
  should be reduced by 0.5, and the cutoff value for the
  upper end should be increased by 0.5.
\end{onebox}

The tip to add extra area when applying the normal
approximation is most often useful when examining
a range of observations.
In the example above, the revised normal distribution
estimate is \smokeABCNormalFixed{}, much closer to the
exact value of \smokeABCBinom{}.
While it is possible to also apply this correction when
computing a tail area, the benefit of the modification
usually disappears since the total interval is typically
quite wide.

\index{distribution!binomial!normal approximation|)}
\index{distribution!binomial|)}


{\input{ch_distributions/TeX/binomial_distribution.tex}}




%_________________
\section{Negative binomial distribution}
\label{negativeBinomial}

\index{distribution!negative binomial|(}

The geometric distribution describes the probability of observing the first success on the $n^{th}$ trial. The \termsub{negative binomial distribution}{distribution!negative binomial} is more general: it describes the probability of observing the $k^{th}$ success on the $n^{th}$ trial.

\begin{examplewrap}
\begin{nexample}{Each day a high school football coach tells his star kicker, Brian, that he can go home after he successfully kicks four 35 yard field goals. Suppose we say each kick has a probability $p$ of being successful. If $p$ is small -- e.g. close to 0.1 -- would we expect Brian to need many attempts before he successfully kicks his fourth field goal?}
We are waiting for the fourth success ($k=4$). If the probability of a success ($p$) is small, then the number of attempts ($n$) will probably be large. This means that Brian is more likely to need many attempts before he gets $k=4$ successes. To put this another way, the probability of $n$ being small is low.
\end{nexample}
\end{examplewrap}

To identify a negative binomial case, we check 4 conditions. The first three are common to the binomial distribution.

\begin{onebox}{Is it negative binomial? Four conditions to check}
(1) The trials are independent. \\
(2) Each trial outcome can be classified as a success or failure. \\
(3) The probability of a success ($p$) is the same for each trial. \\
(4) The last trial must be a success.
\end{onebox}

\begin{exercisewrap}
\begin{nexercise}
Suppose Brian is very diligent in his attempts and he makes each 35 yard field goal with probability $p=0.8$. Take a guess at how many attempts he would need before making his fourth kick.\footnotemark
\end{nexercise}
\end{exercisewrap}
\footnotetext{One possible answer: since he is likely to make each field goal attempt, it will take him at least 4 attempts but probably not more than 6 or 7.}

\begin{examplewrap}
\begin{nexample}{In yesterday's practice, it took Brian only 6 tries to get his fourth field goal. Write out each of the possible sequences of kicks.} \label{eachSeqOfSixTriesToGetFourSuccesses}
Because it took Brian six tries to get the fourth success, we know the last kick must have been a success. That leaves three successful kicks and two unsuccessful kicks (we label these as failures) that make up the first five attempts. There are ten possible sequences of these first five kicks, which are shown in Figure~\ref{successFailureOrdersForBriansFieldGoals}. If Brian achieved his fourth success ($k=4$) on his sixth attempt ($n=6$), then his order of successes and failures must be one of these ten possible sequences.
\end{nexample}
\end{examplewrap}

\begin{figure}[ht]
\newcommand{\succObs}[1]{{\color{oiB}$\stackrel{#1}{S}$}}
\centering
\begin{tabular}{c|c ccc cl | r}
\multicolumn{8}{c}{\hspace{10mm}Kick Attempt} \\
& & 1 & 2 & 3 & 4 & \multicolumn{2}{l}{5\hfill6} \\
\hline
1&& $F$ & $F$ & \succObs{1} & \succObs{2} & \succObs{3} & \succObs{4} \\
2&& $F$ & \succObs{1} & $F$ & \succObs{2} & \succObs{3} & \succObs{4} \\
3&& $F$ & \succObs{1} & \succObs{2} & $F$ & \succObs{3} & \succObs{4} \\
4&& $F$ & \succObs{1} & \succObs{2} & \succObs{3} & $F$ & \succObs{4} \\
5&& \succObs{1} & $F$ & $F$ & \succObs{2} & \succObs{3} & \succObs{4} \\
6&& \succObs{1} & $F$ & \succObs{2} & $F$ & \succObs{3} & \succObs{4} \\
7&& \succObs{1} & $F$ & \succObs{2} & \succObs{3} & $F$ & \succObs{4} \\
8&& \succObs{1} & \succObs{2} & $F$ & $F$ & \succObs{3} & \succObs{4} \\
9&& \succObs{1} & \succObs{2} & $F$ & \succObs{3} & $F$ & \succObs{4} \\
10&& \succObs{1} & \succObs{2} & \succObs{3} & $F$ & $F$ & \succObs{4} \\
\end{tabular}
\caption{The ten possible sequences when the fourth successful kick is on the sixth attempt.}
\label{successFailureOrdersForBriansFieldGoals}
\end{figure}

\begin{exercisewrap}
\begin{nexercise} \label{probOfEachSeqOfSixTriesToGetFourSuccesses}
Each sequence in Figure~\ref{successFailureOrdersForBriansFieldGoals} has exactly two failures and four successes with the last attempt always being a success. If the probability of a success is $p=0.8$, find the probability of the first sequence.\footnotemark
\end{nexercise}
\end{exercisewrap}
\footnotetext{The first sequence:
  $0.2 \times 0.2 \times 0.8 \times
      0.8 \times 0.8 \times 0.8
    = 0.0164$.}

\D{\newpage}

If the probability Brian kicks a 35 yard field goal is $p=0.8$, what is the probability it takes Brian exactly six tries to get his fourth successful kick? We can write this as
{\small\begin{align*}
&P(\text{it takes Brian six tries to make four field goals}) \\
& \quad = P(\text{Brian makes three of his first five field goals, and he makes the sixth one}) \\
& \quad = P(\text{$1^{st}$ sequence OR $2^{nd}$ sequence OR ... OR $10^{th}$ sequence})
\end{align*}
}where the sequences are from Figure~\ref{successFailureOrdersForBriansFieldGoals}. We can break down this last probability into the sum of ten disjoint possibilities:
{\small\begin{align*}
&P(\text{$1^{st}$ sequence OR $2^{nd}$ sequence OR ... OR $10^{th}$ sequence}) \\
&\quad = P(\text{$1^{st}$ sequence}) + P(\text{$2^{nd}$ sequence}) + \cdots + P(\text{$10^{th}$ sequence})
\end{align*}
}The probability of the first sequence was identified in Guided Practice~\ref{probOfEachSeqOfSixTriesToGetFourSuccesses} as 0.0164, and each of the other sequences have the same probability. Since each of the ten sequences has the same probability, the total probability is ten times that of any individual sequence.

The way to compute this negative binomial probability is similar to how the binomial problems were solved in Section~\ref{binomialModel}. The probability is broken into two pieces:
\begin{align*}
&P(\text{it takes Brian six tries to make four field goals}) \\
&= [\text{Number of possible sequences}] \times P(\text{Single sequence})
\end{align*}
Each part is examined separately, then we multiply to get the final result.

We first identify the probability of a single sequence. One particular case is to first observe all the failures ($n-k$ of them) followed by the $k$ successes:
\begin{align*}
&P(\text{Single sequence}) \\
&= P(\text{$n-k$ failures and then $k$ successes}) \\
&= (1-p)^{n-k} p^{k}
\end{align*}

\D{\newpage}

We must also identify the number of sequences for the general case. Above, ten sequences were identified where the fourth success came on the sixth attempt. These sequences were identified by fixing the last observation as a success and looking for all the ways to arrange the other observations. In other words, how many ways could we arrange $k-1$ successes in $n-1$ trials? This can be found using the $n$ choose $k$ coefficient but for $n-1$ and $k-1$ instead:
\begin{align*}
{n-1 \choose k-1} = \frac{(n-1)!}{(k-1)! \left((n-1) - (k-1)\right)!} = \frac{(n-1)!}{(k-1)! \left(n - k\right)!}
\end{align*}
This is the number of different ways we can order $k-1$ successes and $n-k$ failures in $n-1$ trials. If the factorial notation (the exclamation point) is unfamiliar, see page~\pageref{factorial_defined}.

\begin{onebox}{Negative binomial distribution}
  The negative binomial distribution describes the
  probability of observing the $k^{th}$ success on
  the $n^{th}$ trial, where all trials are independent:
  \begin{align*}
  P(\text{the $k^{th}$ success on the $n^{th}$ trial})
      = {n-1 \choose k-1} p^{k}(1-p)^{n-k}
  \end{align*}
  The value $p$ represents the probability that
  an individual trial is a success.
\end{onebox}

\begin{examplewrap}
\begin{nexample}{Show using the formula for the negative binomial distribution that the probability Brian kicks his fourth successful field goal on the sixth attempt is 0.164.}
The probability of a single success is $p=0.8$, the number of successes is $k=4$, and the number of necessary attempts under this scenario is $n=6$.
\begin{align*}
{n-1 \choose k-1}p^k(1-p)^{n-k}\ 
	=\ \frac{5!}{3!2!} (0.8)^4 (0.2)^2\ 
	=\ 10\times 0.0164\ 
	=\ 0.164
\end{align*}
\end{nexample}
\end{examplewrap}

\begin{exercisewrap}
\begin{nexercise}
The negative binomial distribution requires that each kick attempt by Brian is independent. Do you think it is reasonable to suggest that each of Brian's kick attempts is independent of the others?\footnotemark
\end{nexercise}
\end{exercisewrap}
\footnotetext{Answers may vary. We cannot conclusively say they are or are not independent. However, many statistical reviews of athletic performance suggest such attempts are very nearly independent.}

\begin{exercisewrap}
\begin{nexercise}
Assume Brian's kick attempts are independent. What is the probability that Brian will kick his fourth field goal within 5 attempts?\footnotemark
\end{nexercise}
\end{exercisewrap}
\footnotetext{If his fourth field goal ($k=4$) is within five attempts, it either took him four or five tries ($n=4$ or $n=5$). We have $p=0.8$ from earlier. Use the negative binomial distribution to compute the probability of $n = 4$ tries and $n=5$ tries, then add those probabilities together:
\begin{align*}
& P(n=4\text{ OR }n=5) = P(n=4) + P(n=5) \\
&\quad = {4-1 \choose 4-1} 0.8^4 + {5-1 \choose 4-1} (0.8)^4(1-0.8) = 1\times 0.41 + 4\times 0.082 = 0.41 + 0.33 = 0.74
\end{align*}}

\D{\newpage}

\begin{onebox}{Binomial versus negative binomial}
  In the binomial case, we typically have a fixed number
  of trials and instead consider the number of successes.
  In the negative binomial case, we examine how many trials
  it takes to observe a fixed number of successes and
  require that the last observation be a success.
\end{onebox}

\begin{exercisewrap}
\begin{nexercise}
On 70\% of days, a hospital admits at least one heart attack patient. On 30\% of the days, no heart attack patients are admitted. Identify each case below as a binomial or negative binomial case, and compute the probability.\footnotemark
\begin{enumerate}[(a)]
\setlength{\itemsep}{0mm}
\item What is the probability the hospital will admit
    a heart attack patient on exactly three days this week?

\item What is the probability the second day with a heart
    attack patient will be the fourth day of the week?

\item What is the probability the fifth day of next month
    will be the first day with a heart attack patient?
\end{enumerate}
\end{nexercise}
\end{exercisewrap}
\footnotetext{In each part, $p=0.7$. (a) The number of days is fixed, so this is binomial. The parameters are $k=3$ and $n=7$: 0.097. (b) The last ``success'' (admitting a heart attack patient) is fixed to the last day, so we should apply the negative binomial distribution. The parameters are $k=2$, $n=4$: 0.132. (c) This problem is negative binomial with $k=1$ and $n=5$: 0.006. Note that the negative binomial case when $k=1$ is the same as using the geometric distribution.}

\index{distribution!negative binomial|)}


{\input{ch_distributions/TeX/negative_binomial_distribution.tex}}





%_________________
\section{Poisson distribution}
\label{poisson}

\index{distribution!Poisson|(}

\begin{examplewrap}
\begin{nexample}{There are about 8 million individuals
    in New York City.
    How many individuals might we expect to be hospitalized
    for acute myocardial infarction (AMI), i.e. a heart attack,
    each day?
    According to historical records, the average number is
    about 4.4 individuals.
    However, we would also like to know the approximate
    distribution of counts.
    What would a histogram of the number of AMI occurrences
    each day look like if we recorded the daily counts over
    an entire year?}
  \label{amiIncidencesEachDayOver1YearInNYCExample}%
  A histogram of the number of occurrences of AMI on 365 days
  for NYC is shown in
  Figure~\ref{amiIncidencesOver100Days}.\footnotemark{}
  The sample mean (4.38) is similar to the historical average
  of~4.4.
  The sample standard deviation is about 2, and the histogram
  indicates that about 70\% of the data fall between 2.4 and~6.4.
  The distribution's shape is unimodal and skewed to the right.
\end{nexample}
\end{examplewrap}
\footnotetext{These data are simulated. In practice, we should check for an association between successive days.}

\begin{figure}[h]
  \centering
  \Figure[A histogram is shown for "AMI Events (by Day)". There are 11 non-zero values shown: a frequency of about 15 at a value of 1, a frequency of 50 at 2, 70 at 3, 85 at 4, 55 at 5, 45 at 6, 25 at 7, 20 at 8, 5 at 9, 5 at 10, and a frequency of about 2 at 11.]{0.6}{amiIncidencesOver100Days}
  \caption{A histogram of the number of occurrences
      of AMI on 365 separate days in NYC.}
  \label{amiIncidencesOver100Days}
\end{figure}

The \termsub{Poisson distribution}{distribution!Poisson} is often useful for estimating the number of events in a large population over a unit of time. For instance, consider each of the following events:
\begin{itemize}
\setlength{\itemsep}{0mm}
\item having a heart attack,
\item getting married, and
\item getting struck by lightning.
\end{itemize}
The Poisson distribution helps us describe the number of such events that will occur in a day for a fixed population if the individuals within the population are independent. The Poisson distribution could also be used over another unit of time, such as an hour or a~week.

The histogram in Figure~\ref{amiIncidencesOver100Days} approximates a Poisson distribution with rate equal to 4.4. The \term{rate} for a Poisson distribution is the average number of occurrences in a mostly-fixed population per unit of time. In Example~\ref{amiIncidencesEachDayOver1YearInNYCExample}, the time unit is a day, the population is all New York City residents, and the historical rate is 4.4. The parameter in the Poisson distribution is the rate -- or how many events we expect to observe -- and it is typically denoted by $\lambda$\index{Greek!lambda@lambda ($\lambda$)}
(the Greek letter \emph{lambda})  or $\mu$. Using the rate, we can describe the probability of observing exactly $k$ events in a single unit of time.

\D{\newpage}

\begin{onebox}{Poisson distribution}
  Suppose we are watching for events and the number
  of observed events follows a Poisson distribution
  with rate $\lambda$.
  Then
  \begin{align*}
  P(\text{observe $k$ events})
      = \frac{\lambda^{k} e^{-\lambda}}{k!}
  \end{align*}
  where $k$ may take a value 0, 1, 2, and so on,
  and $k!$ represents $k$-factorial, as described on
  page~\pageref{factorial_defined}.
  The letter $e\approx2.718$ is the base of the natural
  logarithm.
  The mean and standard deviation of this distribution
  are $\lambda$ and $\sqrt{\lambda}$, respectively.
\end{onebox}

We will leave a rigorous set of conditions for the Poisson distribution to a later course. However, we offer a few simple guidelines that can be used for an initial evaluation of whether the Poisson model would be appropriate.

A random variable may follow a Poisson distribution if we are looking for the number of events, the population that generates such events is large, and the events occur independently of each other.

Even when events are not really independent --
for instance, Saturdays and Sundays are especially
popular for weddings --
a Poisson model may sometimes still be reasonable
if we allow it to have a different rate for different
times.
In the wedding example, the rate would be modeled as
higher on weekends than on weekdays.
The idea of modeling rates for a Poisson distribution
against a second variable such as the day of week forms
the foundation of some more advanced methods that fall
in the realm of \termsub{generalized linear models}
    {generalized linear model}.
In Chapters~\ref{linRegrForTwoVar}
and~\ref{multipleAndLogisticRegression},
we will discuss a foundation of linear models.

\index{distribution!Poisson|)}


{\input{ch_distributions/TeX/poisson_distribution.tex}}


================================================
FILE: ch_distributions/TeX/geometric_distribution.tex
================================================
\exercisesheader{}

% 11

\eoce{\qtq{Is it Bernoulli\label{is_it_bernouilli}} Determine if each trial can be 
considered an independent Bernoulli trial for the following situations.
\begin{parts}
\item Cards dealt in a hand of poker.
\item Outcome of each roll of a die.
\end{parts}
}{}

% 12

\eoce{\qt{With and without replacement\label{with_without_replacement}} In the 
following situations assume that half of the specified population is male and 
the other half is female.
\begin{parts}
\item Suppose you're sampling from a room with 10 people. What is the 
probability of sampling two females in a row when sampling with replacement? 
What is the probability when sampling without replacement?
\item Now suppose you're sampling from a stadium with 10,000 people. What is 
the probability of sampling two females in a row when sampling with 
replacement? What is the probability when sampling without replacement?
\item We often treat individuals who are sampled from a large population as 
independent. Using your findings from parts~(a) and~(b), explain whether or 
not this assumption is reasonable.
\end{parts}
}{}

% 13

\eoce{\qt{Eye color, Part I\label{eye_color_geometric}} A husband and wife both 
have brown eyes but carry genes that make it possible for their children to 
have brown eyes (probability 0.75), blue eyes (0.125), or green eyes (0.125).
\begin{parts}
\item What is the probability the first blue-eyed child they have is their 
third child? Assume that the eye colors of the children are independent of 
each other.
\item On average, how many children would such a pair of parents have before 
having a blue-eyed child? What is the standard deviation of the number of 
children they would expect to have until the first blue-eyed child?
\end{parts}
}{}

% 14

\eoce{\qt{Defective rate\label{defective_rate}} A machine that produces a special 
type of transistor (a component of computers) has a 2\% defective rate. The 
production is considered a random process where each transistor is 
independent of the others.
\begin{parts}
\item What is the probability that the $10^{th}$ transistor produced is the 
first with a defect?
\item What is the probability that the machine produces no defective 
transistors in a batch of 100?
\item On average, how many transistors would you expect to be produced before 
the first with a defect? What is the standard deviation?
\item Another machine that also produces transistors has a 5\% defective rate 
where each transistor is produced independent of the others. On average how 
many transistors would you expect to be produced with this machine before the 
first with a defect? What is the standard deviation?
\item Based on your answers to parts (c) and (d), how does increasing the 
probability of an event affect the mean and standard deviation of the wait 
time until success?
\end{parts}
}{}

% 15

\eoce{\qt{Bernoulli, the mean\label{bernoulli_mean_derivation}}
Use the probability rules from
Section~\ref{randomVariablesSection}
to derive the mean of a Bernoulli random variable,
i.e. a random variable $X$ that takes value 1
with probability $p$ and value 0 with probability $1 - p$.
That is, compute the expected value of a generic
Bernoulli random variable.
}{}

% 16

\eoce{\qt{Bernoulli, the standard deviation\label{bernoulli_sd_derivation}}
Use the probability rules from
Section~\ref{randomVariablesSection}
to derive the standard deviation of a Bernoulli random variable,
i.e. a random variable $X$ that takes value 1
with probability $p$ and value 0 with probability $1 - p$.
That is, compute the square root of the variance of a generic
Bernoulli random variable.
}{}


================================================
FILE: ch_distributions/TeX/negative_binomial_distribution.tex
================================================
\exercisesheader{}

% 27

\eoce{\qt{Rolling a die\label{roll_die}} Calculate the 
following probabilities and indicate which probability distribution model is 
appropriate in each case. You roll a fair die 5 times. What is the 
probability of rolling
\begin{parts}
\item the first 6 on the fifth roll?
\item exactly three 6s?
\item the third 6 on the fifth roll?
\end{parts}
}{}

% 28

\eoce{\qt{Playing darts\label{play_darts}} Calculate the following probabilities 
and indicate which probability distribution model is appropriate in each 
case. A very good darts player can hit the bull's eye (red circle in the 
center of the dart board) 65\% of the time. What is the probability that he
\begin{parts}
\item hits the bullseye for the $10^{th}$ time on the $15^{th}$ try?
\item hits the bullseye 10 times in 15 tries?
\item hits the first bullseye on the third try?
\end{parts}
}{}

% 29

\eoce{\qt{Sampling at school\label{sampling_at_school}} For a sociology class 
project you are asked to conduct a survey on 20 students at your school. You 
decide to stand outside of your dorm's cafeteria and conduct the survey on a 
random sample of 20 students leaving the cafeteria after dinner one evening. 
Your dorm is comprised of 45\% males and 55\% females.
\begin{parts}
\item Which probability model is most appropriate for calculating the 
probability that the $4^{th}$ person you survey is the $2^{nd}$ female? 
Explain.
\item Compute the probability from part (a).
\item The three possible scenarios that lead to the $4^{th}$ person you survey 
being the $2^{nd}$ female are
\[ \{M, M, F, F\}, \{M, F, M, F\}, \{F, M, M, F\} \]
One common feature among these scenarios is that the last trial is always 
female. In the first three trials there are 2 males and 1 female. Use the 
binomial coefficient to confirm that there are 3 ways of ordering 2 males and 
1 female. 
\item Use the findings presented in part (c) to explain why the formula for 
the coefficient for the negative binomial is ${n-1 \choose k-1}$ while the 
formula for the binomial coefficient is ${n \choose k}$.
\end{parts}
}{}

% 30

\eoce{\qt{Serving in volleyball\label{serving_volleyball}} A not-so-skilled 
volleyball player has a 15\% chance of making the serve, which involves 
hitting the ball so it passes over the net on a trajectory such that it will 
land in the opposing team's court. Suppose that her serves are independent of 
each other.
\begin{parts}
\item What is the probability that on the $10^{th}$ try she will make her 
$3^{rd}$ successful serve?
\item Suppose she has made two successful serves in nine attempts. What is 
the probability that her $10^{th}$ serve will be successful?
\item Even though parts (a) and (b) discuss the same scenario, the 
probabilities you calculated should be different. Can you explain the reason 
for this discrepancy?
\end{parts}
}{}


================================================
FILE: ch_distributions/TeX/normal_distribution.tex
================================================
\exercisesheader{}

% 1

\eoce{\qt{Area under the curve, Part I\label{area_under_curve_1}} What percent of a 
standard normal distribution $N(\mu=0, \sigma=1)$ is found in each region? 
Be sure to draw a graph. \vspace{-3mm}
\begin{multicols}{4}
\begin{parts}
\item $Z < -1.35$
\item $Z > 1.48$
\item $-0.4 < Z < 1.5$
\item $|Z| > 2$
\end{parts}
\end{multicols}
}{}

% 2

\eoce{\qt{Area under the curve, Part II\label{area_under_curve_2}} What percent of 
a standard normal distribution $N(\mu=0, \sigma=1)$ is found in each region? 
Be sure to draw a graph. \vspace{-3mm}
\begin{multicols}{4}
\begin{parts}
\item $Z > -1.13$
\item $Z < 0.18$
\item $Z > 8$
\item $|Z| < 0.5$
\end{parts}
\end{multicols}
}{}

% 3

\eoce{\qt{GRE scores, Part I\label{GRE_intro}} Sophia who took the Graduate Record 
Examination (GRE) scored 160 on the Verbal Reasoning section and 157 on the 
Quantitative Reasoning section. The mean score for Verbal Reasoning section 
for all test takers was 151 with a standard deviation of 7, and the mean 
score for the Quantitative Reasoning was 153 with a standard deviation of 
7.67. Suppose that both distributions are nearly normal. 
\begin{parts}
\item Write down the short-hand for these two normal distributions.
\item What is  Sophia's Z-score on the Verbal Reasoning section? On the 
Quantitative Reasoning section? Draw a standard normal distribution curve and 
mark these two Z-scores.
\item What do these Z-scores tell you?
\item Relative to others, which section did she do better on?
\item Find her percentile scores for the two exams.
\item What percent of the test takers did better than her on the Verbal 
Reasoning section? On the Quantitative Reasoning section?
\item Explain why simply comparing raw scores from the two sections could lead 
to an incorrect conclusion as to which section a student did better on.
\item If the distributions of the scores on these exams are not nearly 
normal, would your answers to parts (b) - (f) change? Explain your reasoning.
\end{parts}
}{}

% 4

\eoce{\qt{Triathlon times, Part I\label{triathlon_times_intro}} In triathlons, it 
is common for racers to be placed into age and gender groups. Friends Leo and 
Mary both completed the Hermosa Beach Triathlon, where Leo competed in the 
\textit{Men, Ages 30 - 34} group while Mary competed in the \textit{Women, 
Ages 25 - 29} group. Leo completed the race in 1:22:28 (4948 seconds), while 
Mary completed the race in 1:31:53 (5513 seconds). Obviously Leo finished 
faster, but they are curious about how they did within their respective 
groups. Can you help them? Here is some information on the performance of 
their groups:
\begin{itemize}
\setlength{\itemsep}{0mm}
\item The finishing times of the \textit{Men, Ages 30 - 34} group has a mean 
of 4313 seconds with a standard deviation of 583 seconds.
\item The finishing times of the \textit{Women, Ages 25 - 29} group has a 
mean of 5261 seconds with a standard deviation of 807 seconds.
\item The distributions of finishing times for both groups are approximately 
Normal.
\end{itemize}
Remember: a better performance corresponds to a faster finish.
\begin{parts}
\item Write down the short-hand for these two normal distributions.
\item What are the Z-scores for Leo's and Mary's finishing times? What do 
these Z-scores tell you?
\item Did Leo or Mary rank better in their respective groups? Explain your 
reasoning.
\item What percent of the triathletes did Leo finish faster than in his group?
\item What percent of the triathletes did Mary finish faster than in her 
group?
\item If the distributions of finishing times are not nearly normal, would 
your answers to parts (b)~-~(e) change? Explain your reasoning.
\end{parts}
}{}

% 5

\eoce{\qt{GRE scores, Part II\label{GRE_cutoffs}} In Exercise~\ref{GRE_intro} we 
saw two distributions for GRE scores: $N(\mu=151, \sigma=7)$ for the verbal 
part of the exam and $N(\mu=153, \sigma=7.67)$ for the quantitative part. Use 
this information to compute each of the following:
\begin{parts}
\item The score of a student who scored in the $80^{th}$ percentile on the 
Quantitative Reasoning section.
\item The score of a student who scored worse than 70\% of the test takers in 
the Verbal Reasoning section.
\end{parts}
}{}

\D{\newpage}

% 6

\eoce{\qt{Triathlon times, Part II\label{triathlon_times_cutoffs}} In 
Exercise~\ref{triathlon_times_intro} we saw two distributions for triathlon 
times: $N(\mu=4313, \sigma=583)$ for \emph{Men, Ages 30 - 34} and 
$N(\mu=5261, \sigma=807)$ for the \emph{Women, Ages 25 - 29} group. Times are 
listed in seconds. Use this information to compute each of the following:
\begin{parts}
\item The cutoff time for the fastest 5\% of athletes in the men's group, i.e. those 
who took the shortest 5\% of time to finish. 
\item The cutoff time for the slowest 10\% of athletes in the women's group. 
\end{parts}
}{}

% 7

\eoce{\qt{LA weather, Part I\label{la_weather_intro}} The average daily high 
temperature in June in LA is 77\degree F with a standard deviation of 
5\degree F. Suppose that the temperatures in June closely follow a normal 
distribution. 
\begin{parts}
\item What is the probability of observing an 83\degree F temperature or 
higher in LA during a randomly chosen day in June?
\item How cool are the coldest 10\% of the days (days with lowest 
high temperature) during June in LA?
\end{parts}
}{}

% 8

\eoce{\qt{CAPM\label{CAPM}} The Capital Asset Pricing Model (CAPM) is a financial 
model that assumes returns on a portfolio are normally distributed. Suppose a 
portfolio has an average annual return of 14.7\% (i.e. an average gain of 
14.7\%) with a standard deviation of 33\%. A return of 0\% means the value of 
the portfolio doesn't change, a negative return means that the portfolio 
loses money, and a positive return means that the portfolio gains money.
\begin{parts}
\item What percent of years does this portfolio lose money, i.e. have a 
return less than 0\%?
\item What is the cutoff for the highest 15\% of annual returns with this 
portfolio?
\end{parts}
}{}

% 9

\eoce{\qt{LA weather, Part II\label{la_weather_unit_change}} 
Exercise~\ref{la_weather_intro} states that the average daily high temperature in 
June in LA is 77\degree F with a standard deviation of 5\degree F, and it can 
be assumed that the temperatures follow a normal distribution. We use the following 
equation to convert \degree F (Fahrenheit) to \degree C (Celsius):
\[ C = (F - 32) \times \frac{5}{9}. \]
\begin{parts}
\item Write the probability model for the distribution of temperature in 
\degree C in June in LA.
\item What is the probability of observing a 28\degree C (which roughly 
corresponds to 83\degree F) temperature or higher in June in LA? Calculate 
using the \degree C model from part (a).
\item Did you get the same answer or different answers in part (b) of this 
question and part (a) of Exercise~\ref{la_weather_intro}? Are you surprised? Explain.
\item Estimate the IQR of the temperatures (in \degree C) in June in LA.
\end{parts}
}{}

% 10

\eoce{\qt{Find the SD\label{find_sd_cholesterol}}
Cholesterol levels for women aged 20 to 34 follow an
approximately normal distribution with mean 185 milligrams
per deciliter (mg/dl).
Women with cholesterol levels above 220 mg/dl are considered
to have high cholesterol and about 18.5\% of women fall into
this category.
What is the standard deviation of the 
distribution of cholesterol levels for women aged 20 to~34?
}{}


================================================
FILE: ch_distributions/TeX/poisson_distribution.tex
================================================
\exercisesheader{}

% 31

\eoce{\qt{Customers at a coffee shop\label{coffee_shop_customers}} A coffee shop 
serves an average of 75 customers per hour during the morning rush.
\begin{parts}
\item
  Which distribution have we studied that is most appropriate
  for calculating the probability of a given number of customers
  arriving within one hour 
  during this time of day?
\item What are the mean and the standard deviation of the number of customers 
this coffee shop serves in one hour during this time of day?
\item Would it be considered unusually low if only 60 customers showed up to 
this coffee shop in one hour during this time of day?
\item Calculate the probability that this coffee shop serves 70 customers in 
one hour during this time of day.
\end{parts}
}{}

% 32

\eoce{\qt{Stenographer's typos\label{stenographer_typos}} A very skilled 
court stenographer makes one typographical error (typo) per hour on average.
\begin{parts}
\item What probability distribution is most appropriate for calculating the 
probability of a given number of typos this stenographer makes in an hour?
\item What are the mean and the standard deviation of the number of typos 
this stenographer makes?
\item Would it be considered unusual if this stenographer made 4 typos in a 
given hour? 
\item Calculate the probability that this stenographer makes at most 2 typos 
in a given hour.
\end{parts}
}{}

% 33

\eoce{\qtq{How many cars show up\label{cars_in_parking_lot}}
For Monday through Thursday when there isn't a holiday,
the average number of vehicles that visit a particular
retailer between 2pm and 3pm each afternoon is 6.5,
and the number of cars that show up on any given day
follows a Poisson distribution.
\begin{parts}
\item
    What is the probability that exactly
    5 cars will show up next Monday?
\item
    What is the probability that
    0, 1, or 2 cars will show up next Monday
    between 2pm and 3pm?
\item
    There is an average of 11.7 people who visit during
    those same hours from vehicles.
    Is it likely that the number of people visiting
    by car during this hour is also Poisson?
    Explain.
\end{parts}
}{}

% 34

\eoce{\qt{Lost baggage\label{lost_baggage}}
Occasionally an airline will lose a bag.
Suppose a small airline has found it can reasonably
model the number of bags lost each weekday using a
Poisson model with a mean of 2.2 bags.
\begin{parts}
\item
    What is the probability that the airline
    will lose no bags next Monday?
\item
    What is the probability that the airline
    will lose 0, 1, or 2 bags on next Monday?
\item
    Suppose the airline expands over the course
    of the next 3 years, doubling the number of
    flights it makes, and the CEO asks you if
    it's reasonable for them to continue
    using the Poisson model with a mean of~2.2.
    What is an appropriate recommendation?
    Explain.
\end{parts}
}{}


================================================
FILE: ch_distributions/TeX/review_exercises.tex
================================================
\reviewexercisesheader{}

% 35

\eoce{\qt{Roulette winnings\label{roulette_winnings}} In the game of roulette, a 
wheel is spun and you place bets on where it will stop. One popular bet is 
that it will stop on a red slot; such a bet has an 18/38 chance of winning. 
If it stops on red, you double the money you bet. If not, you lose the money 
you bet. Suppose you play 3 times, each time with a \$1 bet. Let Y represent 
the total amount won or lost. Write a probability model for Y.
}{}

% 36

\eoce{\qt{Speeding on the I-5, Part I\label{speeding_i5_intro}} The distribution of 
passenger vehicle speeds traveling on the Interstate 5 Freeway (I-5) in 
California is nearly normal with a mean of 72.6 miles/hour and a standard 
deviation of 4.78 miles/hour.\footfullcite{Johnson+Murray:2010}
\begin{parts}
\item What percent of passenger vehicles travel slower than 80 miles/hour?
\item What percent of passenger vehicles travel between 60 and 80 miles/hour?
\item How fast do the fastest 5\% of passenger vehicles travel?
\item The speed limit on this stretch of the I-5 is 70 miles/hour. 
Approximate what percentage of the passenger vehicles travel above the speed 
limit on this stretch of the I-5.
\end{parts}
}{}

% 37

\eoce{\qt{University admissions\label{university_admissions}} Suppose a university 
announced that it admitted 2,500 students for the following year's freshman 
class. However, the university has dorm room spots for only 1,786 freshman 
students. If there is a 70\% chance that an admitted student will decide to 
accept the offer and attend this university, what is the approximate 
probability that the university will not have enough dormitory room spots for 
the freshman class?
}{}

% 38

\eoce{\qt{Speeding on the I-5, Part II\label{speeding_i5_geometric}} 
Exercise~\ref{speeding_i5_intro} states that the distribution of speeds of 
cars traveling on the Interstate 5 Freeway (I-5) in California is nearly 
normal with a mean of 72.6 miles/hour and a standard deviation of 4.78 
miles/hour. The speed limit on this stretch of the I-5 is 70 miles/hour.
\begin{parts}
\item A highway patrol officer is hidden on the side of the freeway. What is 
the probability that 5~cars pass and none are speeding? Assume that the 
speeds of the cars are independent of each other.
\item On average, how many cars would the highway patrol officer expect to 
watch until the first car that is speeding? What is the standard deviation of 
the number of cars he would expect to watch?
\end{parts}
}{}

% 39

\eoce{\qt{Auto insurance premiums\label{auto_insurance_premiums}} Suppose a 
newspaper article states that the distribution of auto insurance premiums for 
residents of California is approximately normal with a mean of \$1,650. The 
article also states that 25\% of California residents pay more than \$1,800. 
\begin{parts}
\item What is the Z-score that corresponds to the top 25\% (or the $75^{th}$ 
percentile) of the standard normal distribution?
\item What is the mean insurance cost? What is the cutoff for the 75th 
percentile?
\item Identify the standard deviation of insurance premiums in California.
\end{parts}
}{}

% 40

\eoce{\qt{SAT scores\label{sat_scores}}
SAT scores (out of 1600) are distributed 
normally with a mean of 1100 and a standard deviation of 200.
Suppose a school council awards a certificate of excellence
to all students who score at least 1350 on the SAT,
and suppose we pick one of the recognized students at random.
What is the probability this student's score will be
at least 1500?
(The material covered in
Section~\ref{conditionalProbabilitySection}
on conditional probability
would be useful for this question.)
}{}

% 41

\eoce{\qt{Married women} \label{married_women} The American Community Survey 
estimates that 47.1\% of women ages 15 years and over are married.
\footfullcite{marWomenACS}
\begin{parts}
\item We randomly select three women between these ages. What is the 
probability that the third woman selected is the only one who is married?
\item What is the probability that all three randomly selected women are 
married?
\item On average, how many women would you expect to sample before selecting 
a married woman? What is the standard deviation?
\item If the proportion of married women was actually 30\%, how many women 
would you expect to sample before selecting a married woman? What is the 
standard deviation?
\item Based on your answers to parts (c) and (d), how does decreasing the 
probability of an event affect the mean and standard deviation of the wait 
time until success?
\end{parts}
}{}

\D{\newpage}

% 42

\eoce{\qt{Survey response rate\label{survey_response_rate}} Pew Research reported 
that the typical response rate to their surveys is only 9\%. If for a 
particular survey 15,000 households are contacted, what is the probability 
that at least 1,500 will agree to respond? \footfullcite{surveysPew}
}{}

% 43

\eoce{\qt{Overweight baggage\label{overweight_baggage}} Suppose weights of the 
checked baggage of airline passengers follow a nearly normal distribution 
with mean 45 pounds and standard deviation 3.2 pounds. Most airlines charge a 
fee for baggage that weighs in excess of 50 pounds. Determine what percent of 
airline passengers incur this fee.
}{}

% 44

\eoce{\qt{Heights of 10 year olds, Part I\label{heights_10_yrs}}
Heights of 10 year olds, regardless of gender, closely follow
a normal distribution with mean 55 inches and standard deviation
6~inches.
\begin{parts}
\item
    What is the probability that a randomly chosen 10 year old
    is shorter than 48 inches?
\item
    What is the probability that a randomly chosen 10 year old
    is between 60 and 65 inches?
\item
    If the tallest 10\% of the class is considered
    ``very tall'',
    what is the height cutoff for ``very tall''?
\end{parts}
}{}

% 45

\eoce{\qt{Buying books on Ebay\label{buy_boooks_ebay}}
Suppose you're considering buying your expensive chemistry
textbook on Ebay.
Looking at past auctions suggests that the 
prices of this textbook follow an approximately normal
distribution with mean \$89 and standard deviation \$15.
\begin{parts}

\item What is the probability that a randomly selected auction for this book 
closes at more than \$100?

\item Ebay allows you to set your maximum bid price so that if someone 
outbids you on an auction you can automatically outbid them, up to the 
maximum bid price you set. If you are only bidding on one auction, what are 
the advantages and disadvantages of setting a bid price too high or too low? 
What if you are bidding on multiple auctions?

\item If you watched 10 auctions, roughly what percentile might you use for a 
maximum bid cutoff to be somewhat sure that you will win one of these ten 
auctions? Is it possible to find a cutoff point that will ensure that you win 
an auction?

\item If you are willing to track up to ten auctions closely, about what 
price might you use as your maximum bid price if you want to be somewhat sure 
that you will buy one of these ten books?

\end{parts}
}{}

% 46

\eoce{\qt{Heights of 10 year olds, Part II\label{heights_10_yrs_prob}}
Heights of 10 year olds, regardless of gender, closely follow
a normal distribution with mean 55 inches and standard deviation
6~inches.
\begin{parts}
\item
    The height requirement for \textit{Batman the Ride} at
    Six Flags Magic Mountain is 54 inches.
    What percent of 10 year olds cannot go on this ride?
\item
    Suppose there are four 10 year olds.
    What is the chance that at least two of them
    will be able to ride \emph{Batman the Ride}?
\item
    Suppose you work at the park to help them better
    understand their customers' demographics, and
    you are counting people as they enter
    the park.
    What is the chance that the first 10 year old
    you see who can ride \emph{Batman the Ride} is
    the 3rd 10 year old who enters the park?
\item
    What is the chance that the fifth 10 year old
    you see who can ride \emph{Batman the Ride} is
    the 12th 10 year old who enters the park?
\end{parts}
}{}

% 47

\eoce{\qt{Heights of 10 year olds, Part III\label{heights_10_yrs_dist}}
Heights of 10 year olds, regardless of gender, closely follow
a normal distribution with mean 55 inches and standard deviation
6~inches.
\begin{parts}
\item
    What fraction of 10 year olds are taller than
    76 inches?
\item\label{heights_10_yrs_dist_76_inches}
    If there are 2,000 10 year olds entering
    Six Flags Magic Mountain in a single day,
    then compute the expected number of
    10 year olds who are at least 76 inches tall.
    (You may assume the heights of the 10-year olds
    are independent.)
\item
    Using the binomial distribution,
    compute the probability that 0 of the 2,000
    10 year olds will be at least 76 inches tall.
\item
    The number of 10 year olds who enter
    Six Flags Magic Mountain and are
    at least 76 inches tall in a given day
    follows a Poisson distribution with
    mean equal to the value found in
    part~(\ref{heights_10_yrs_dist_76_inches}).
    Use the Poisson distribution to identify
    the probability no 10 year old will enter
    the park who is 76 inches or taller.
\end{parts}
}{}

% 48

\eoce{\qt{Multiple choice quiz\label{mc_quiz}} In a multiple choice quiz there are 
5 questions and 4 choices for each question (a, b, c, d). Robin has not 
studied for the quiz at all, and decides to randomly guess the answers. What 
is the probability that
\begin{parts}
\item the first question she gets right is the $3^{rd}$ question?
\item she gets exactly 3 or exactly 4 questions right?
\item she gets the majority of the questions right?
\end{parts}
}{}


================================================
FILE: ch_distributions/figures/6895997/6895997.R
================================================
# Figure: the 68-95-99.7 rule for the normal distribution.
# Shades the regions within 1, 2, and 3 standard deviations of the mean
# and labels each with the corresponding probability.
library(openintro)
data(COL)

myPDF("6895997.pdf", 5, 2.5,
      mar = c(2, 0, 0, 0))
# Standard normal density on a fine grid.
X <- seq(-4, 4, 0.01)
Y <- dnorm(X)
plot(X, Y,
     type = 'n',
     axes = FALSE,
     xlim = c(-3.2, 3.2),
     ylim = c(0, 0.4))
abline(h = 0, col = COL[6])
# X-axis labeled symbolically in units of mu and sigma.
at <- -3:3
labels <- expression(mu - 3 * sigma,
                     mu - 2 * sigma,
                     mu - sigma,
                     mu,
                     mu + sigma,
                     mu + 2 * sigma,
                     mu + 3 * sigma)
axis(1, at, labels)
# Shade bands from the outside in (i = 3, 2, 1) so the inner bands are
# drawn last, on top; for each i, shade the slice between (i-1) and i
# SDs on both sides of the mean.
for (i in 3:1) {
  these <- (i - 1 <= X & X <= i)
  polygon(c(i - 1, X[these], i),
          c(0, Y[these], 0),
          col = COL[i],
          border = COL[i])
  these <- (-i <= X & X <= -i + 1)
  polygon(c(-i, X[these], -i + 1),
          c(0, Y[these], 0),
          col = COL[i],
          border = COL[i])
}

# _____ Label 99.7 _____ #
# Double-headed arrow (code = 3) spanning +/- 3 SD.
arrows(-3, 0.03,
       3, 0.03,
       code = 3,
       col = '#444444',
       length = 0.15)
text(0, 0.02, '99.7%', pos = 3)

# _____ Label 95 _____ #
arrows(-2, 0.13,
       2, 0.13,
       code = 3,
       col = '#444444',
       length = 0.15)
text(0, 0.12, '95%', pos = 3)

# _____ Label 68 _____ #
arrows(-1, 0.23,
       1, 0.23,
       code = 3,
       col = '#444444',
       length = 0.15)
text(0, 0.22, '68%', pos = 3)

# Redraw the curve and baseline on top of the shaded regions.
lines(X, Y, col = '#888888')
abline(h = 0, col = '#AAAAAA')
dev.off()


================================================
FILE: ch_distributions/figures/amiIncidencesOver100Days/amiIncidencesOver100Days.R
================================================
# Figure: histogram of daily AMI (acute myocardial infarction, per the
# dataset name) event counts.
library(openintro)

x <- ami.occurrences$ami

myPDF("amiIncidencesOver100Days.pdf", 5, 2.4,
       mar = c(3, 3.5, 0.5, 1))
# Width-0.5 bins positioned so each integer count falls at a bin center
# (breaks at -0.25, 0.25, 0.75, ...).
histPlot(x,
         breaks = (0:max(2 * x + 1)) / 2 - 0.25,
         axes = FALSE,
         col = COL[1],
         xlab = "",
         ylab = "")
# Unlabeled minor ticks at every integer; labeled major ticks every 5.
at     <- 0:1000
labels <- rep("", length(at))
axis(1, at = at, labels = labels, tcl = -0.18)
axis(1, at = seq(0, 1000, 5), tcl = -0.35)
axis(2, at = seq(0, 1000, 20))
# Reset label orientation before writing the margin titles.
par(las = 0)
mtext("AMI Events (by Day)", 1, 1.8)
mtext("Frequency", 2, 2.4)
dev.off()


================================================
FILE: ch_distributions/figures/between59And62/between59And62.R
================================================
# Figure: normal curve with the region between 69 and 74 shaded,
# for a N(70, 3.3) distribution.
library(openintro)
data(COL)

mu    <- 70
sigma <- 3.3

myPDF('between59And62.pdf', 2.5, 0.9,
      mar = c(1.4, 0, 0, 0),
      mgp = c(3, 0.45, 0))

# Shade the middle region M = (69, 74) under the curve.
normTail(mu, sigma,
         M = c(69, 74),
         col = COL[1],
         axes = FALSE)

# Ticks at the mean and at +/- 2 SD, labeled in original units.
tick_at <- round(mu + sigma * c(-2, 0, 2), 2)
axis(1, at = tick_at, cex.axis = 0.8)

dev.off()


================================================
FILE: ch_distributions/figures/eoce/GRE_intro/gre_intro.R
================================================
# load packages -----------------------------------------------------
library(openintro)

# set input data ----------------------------------------------------

# Verbal reasoning: distribution parameters, Sophia's score, and its Z-score.
mean_v = 151
sd_v = 7
sophia_v = 160
sophia_v_Z = (sophia_v - mean_v) / sd_v

# Quantitative reasoning: distribution parameters, Sophia's score, and Z-score.
mean_q = 153
sd_q = 7.67
sophia_q = 157
sophia_q_Z = (sophia_q - mean_q) / sd_q

# gre_intro ---------------------------------------------------------
# Standard normal curve with both Z-scores marked, so the two scores
# can be compared on a common scale.

pdf("gre_intro.pdf", height = 3, width = 5)

par(mar = c(0,0,0,0), las = 1, mgp = c(3,1,0))

m = 0
s = 1

X <- m + s*seq(-3.2,3.2,0.01)
Y <- dnorm(X, m, s)

# ylim dips below 0 to leave room for the Z-score labels under the axis.
plot(X, Y, type='l', axes = F, 
     xlim = c(-3.4,3.4), ylim = c(-0.11, 0.4), 
     ylab = "")
lines(X, rep(0,length(X)))

# Dotted vertical line at the mean (Z = 0).
lines(c(0,0), dnorm(0)*c(0.01,0.99), col = COL[6], lty=3)

# Verbal reasoning (VR) marker: dashed line at its Z-score plus labels.
z = sophia_v_Z
text(x = z+0.1, dnorm(z)*1.05, "VR", pos=3, col= COL[1], cex = 1.5)
text(x = z + 0.5, y = -0.03, paste("Z =", round(sophia_v_Z, 2)), 
     col = COL[1], cex = 1.5)
lines(c(z,z), dnorm(z, m, s)*c(0.01,0.99), lty=2, col= COL[1])

# Quantitative reasoning (QR) marker, labeled to the left of its line.
z = sophia_q_Z
text(x = z+0.1, dnorm(z)*1.05, "QR", pos=3, col= COL[4], cex = 1.5)
text(x = z - 0.5, y = -0.03, paste("Z =", round(sophia_q_Z, 2)), 
     col = COL[4], cex = 1.5)
lines(c(z,z), dnorm(z, m, s)*c(0.01,0.99), lty=2, col= COL[4])

dev.off()

# gre_intro_VR ---------------------------------------------------------
# Lower tail below Sophia's verbal score, on the VR score scale.

pdf("gre_intro_VR.pdf", height = 2, width = 4)

par(mar = c(2,0,0,0), las = 1, mgp = c(3,1,0), 
    cex.lab = 1.25, cex.axis = 0.9)

normTail(m = mean_v, s = sd_v, L = sophia_v, col = COL[1])

dev.off()

# gre_intro_QR ---------------------------------------------------------
# Lower tail below Sophia's quantitative score, on the QR score scale.

pdf("gre_intro_QR.pdf", height = 2, width = 4)

par(mar = c(2,0,0,0), las = 1, mgp = c(3,1,0), 
    cex.lab = 1.25, cex.axis = 0.9)

normTail(m = mean_q, s = sd_q, L = sophia_q, col = COL[1])

dev.off()

================================================
FILE: ch_distributions/figures/eoce/area_under_curve_1/area_under_curve_1.R
================================================
# load packages -----------------------------------------------------

library(openintro)

# Each figure below: a standard normal curve with the region of
# interest shaded, axis labeled at the boundary value(s) and the mean.

# (a) Z < -1.35 ------------------------------------------------------

pdf("zltNeg.pdf", height = 3, width = 5)

par(mar = c(5, 0, 0, 0), las = 1, mgp = c(3, 1, 0), mfrow = c(1, 1))

normTail(m = 0, s = 1, L = -1.35, U = NA,
         axes = FALSE, col = COL[1],
         xlab = "(a)", cex.lab = 2)
axis(1, at = c(-3, -1.35, 0, NA, 3),
     labels = c(NA, -1.35, 0, NA, NA), cex.axis = 2)

dev.off()

# (b) Z > 1.48 --------------------------------------------------------

pdf("zgtPos.pdf", height = 3, width = 5)

par(mar = c(5, 0, 0, 0), las = 1, mgp = c(3, 1, 0), mfrow = c(1, 1))

normTail(m = 0, s = 1, L = NA, U = 1.48,
         axes = FALSE, col = COL[1],
         xlab = "(b)", cex.lab = 2)
axis(1, at = c(-3, NA, 0, 1.48, 3),
     labels = c(NA, NA, 0, 1.48, NA), cex.axis = 2)

dev.off()

# (c) -0.4 < Z < 1.5 --------------------------------------------------

pdf("zBet.pdf", height = 3, width = 5)

par(mar = c(5, 0, 0, 0), las = 1, mgp = c(3, 1, 0), mfrow = c(1, 1))

normTail(m = 0, s = 1, L = NA, U = NA, M = c(-0.4, 1.5),
         axes = FALSE, col = COL[1],
         xlab = "(c)", cex.lab = 2)
axis(1, at = c(-3, NA, 0, NA, 3),
     labels = c(NA, NA, 0, NA, NA), cex.axis = 2)

dev.off()

# (d) Z < -2 or Z > 2 --------------------------------------------------

pdf("zgtAbs.pdf", height = 3, width = 5)

par(mar = c(5, 0, 0, 0), las = 1, mgp = c(3, 1, 0), mfrow = c(1, 1))

normTail(m = 0, s = 1, L = -2, U = 2, M = NA,
         axes = FALSE, col = COL[1],
         xlab = "(d)", cex.lab = 2)
axis(1, at = c(-3, -2, 0, 2, 3),
     labels = c(NA, -2, 0, 2, NA), cex.axis = 2)

dev.off()


================================================
FILE: ch_distributions/figures/eoce/area_under_curve_2/area_under_curve_2.R
================================================
# load packages -----------------------------------------------------

library(openintro)

# Each figure below: a standard normal curve with the region of
# interest shaded, axis labeled at the boundary value(s) and the mean.

# (a) Z > -1.13 --------------------------------------------------------

pdf("zgtNeg.pdf", height = 3, width = 5)

par(mar = c(5, 0, 0, 0), las = 1, mgp = c(3, 1, 0), mfrow = c(1, 1))

normTail(m = 0, s = 1, L = NA, U = -1.13, M = NA,
         axes = FALSE, col = COL[1],
         xlab = "(a)", cex.lab = 2)
axis(1, at = c(-3, NA, 0, -1.13, 3),
     labels = c(NA, NA, 0, -1.13, NA), cex.axis = 2)

dev.off()

# (b) Z < 0.18 ---------------------------------------------------------

pdf("zltPos.pdf", height = 3, width = 5)

par(mar = c(5, 0, 0, 0), las = 1, mgp = c(3, 1, 0), mfrow = c(1, 1))

normTail(m = 0, s = 1, L = 0.18, U = NA,
         axes = FALSE, col = COL[1],
         xlab = "(b)", cex.lab = 2)
axis(1, at = c(-3, 0.18, 0, NA, 3),
     labels = c(NA, 0.18, 0, NA, NA), cex.axis = 2)

dev.off()

# (c) Z > 8 (cutoff lies far outside the plotted range) ----------------

pdf("zgt8.pdf", height = 3, width = 5)

par(mar = c(5, 0, 0, 0), las = 1, mgp = c(3, 1, 0), mfrow = c(1, 1))

normTail(m = 0, s = 1, L = NA, U = 8, M = NA,
         axes = FALSE, col = COL[1],
         xlab = "(c)", cex.lab = 2)
axis(1, at = c(-3, NA, 0, 8, 3),
     labels = c(NA, NA, 0, 8, NA), cex.axis = 2)

dev.off()

# (d) -0.5 < Z < 0.5 ---------------------------------------------------

pdf("zgtAbs.pdf", height = 3, width = 5)

par(mar = c(5, 0, 0, 0), las = 1, mgp = c(3, 1, 0), mfrow = c(1, 1))

normTail(m = 0, s = 1, L = NA, U = NA, M = c(-0.5, 0.5),
         axes = FALSE, col = COL[1],
         xlab = "(d)", cex.lab = 2)
axis(1, at = c(-3, NA, 0, NA, 3),
     labels = c(NA, NA, 0, NA, NA), cex.axis = 2)

dev.off()


================================================
FILE: ch_distributions/figures/eoce/college_fem_heights/college_fem_heights.R
================================================
# load packages -----------------------------------------------------
library(openintro)

# create data -------------------------------------------------------

heights <- c(54, 55, 56, 56, 57, 58, 58, 59, 60, 60, 60, 61,
             61, 62, 62, 63, 63, 63, 64, 65, 65, 67, 67, 69, 73)

# format data for including in text ---------------------------------

# Emit each sorted height with its rank above it in LaTeX \stackrel form.
cat(paste0("\\stackrel{", seq_along(heights), "}{", sort(heights), "}"),
    sep = ", ")

# plot histogram of heights -----------------------------------------

pdf("heightsFcoll_hist.pdf", height = 4, width = 6)

par(mar = c(3.7, 2.2, 1, 1), las = 1,
    mgp = c(2.5, 0.7, 0), mfrow = c(1, 1),
    cex.lab = 1.5, cex.axis = 1.5)

histPlot(heights, col = COL[1],
         xlab = "Heights", ylab = "",
         probability = TRUE,
         axes = FALSE, ylim = c(0, 0.085))
axis(1)

# Overlay the normal density with the sample mean and SD.
grid_x <- seq(min(heights) - 5, max(heights) + 5, length = 400)
lines(grid_x, dnorm(grid_x, mean = mean(heights), sd = sd(heights)),
      col = COL[4], lwd = 2)

dev.off()

# normal probability plot of heights --------------------------------

pdf("heightsFcoll_qq.pdf", height = 4, width = 6)

par(mar = c(3.7, 3.7, 1, 1), las = 1,
    mgp = c(2.5, 0.7, 0), mfrow = c(1, 1),
    cex.lab = 1.5, cex.axis = 1.5)

qqnorm(heights, col = COL[1], pch = 19, main = "", axes = FALSE)
axis(1)
axis(2)
qqline(heights, col = COL[1])

dev.off()

================================================
FILE: ch_distributions/figures/eoce/stats_scores/stats_scores.R
================================================
# load packages -----------------------------------------------------
library(openintro)

# create data -------------------------------------------------------

scores <- c(79, 83, 57, 82, 94, 83, 72, 74, 73, 71,
            66, 89, 78, 81, 78, 81, 88, 69, 77, 79)

# format data for including in text ---------------------------------

# Emit each sorted score with its rank above it in LaTeX \stackrel form.
cat(paste0("\\stackrel{", seq_along(scores), "}{", sort(scores), "}"),
    sep = ", ")

# plot histogram of scores  -----------------------------------------

pdf("scores_hist.pdf", height = 4, width = 6)

par(mar = c(3.7, 2.2, 1, 1), las = 1,
    mgp = c(2.5, 0.7, 0), mfrow = c(1, 1),
    cex.lab = 1.5, cex.axis = 1.5)

histPlot(scores, col = COL[1],
         xlab = "Scores", ylab = "",
         probability = TRUE,
         axes = FALSE)
axis(1)

# Overlay the normal density with the sample mean and SD.
grid_x <- seq(min(scores) - 5, max(scores) + 5, length = 400)
lines(grid_x, dnorm(grid_x, mean = mean(scores), sd = sd(scores)),
      col = COL[4], lwd = 2)

dev.off()

# normal probability plot of scores  --------------------------------

pdf("scores_qq.pdf", height = 4, width = 6)

par(mar = c(3.7, 3.7, 1, 1), las = 1,
    mgp = c(2.5, 0.7, 0), mfrow = c(1, 1),
    cex.lab = 1.5, cex.axis = 1.5)

qqnorm(scores, col = COL[1],
       pch = 19, main = "",
       axes = FALSE)
axis(1)
axis(2)
qqline(scores, col = COL[1])

dev.off()

================================================
FILE: ch_distributions/figures/fcidMHeights/fcidMHeights-helpers.R
================================================

# Normal probability (Q-Q) plot of x, with a reference line built from
# the supplied mean and standard deviation rather than sample quantiles.
#
# x   - numeric vector of observations
# M   - mean: intercept of the reference line
# SD  - standard deviation: slope of the reference line
# col - point color
QQNorm <- function(x, M, SD, col) {
  qqnorm(x,
         cex = 0.7,
         main = '',
         axes = FALSE,
         ylab = 'male heights (in.)',
         col = col)
  axis(1)
  axis(2)
  # Line y = M + SD * z: where points fall if x ~ N(M, SD).
  abline(M, SD)
}

# Histogram of obs (drawn from a precomputed hist() result) with a
# N(M, SD) density curve overlaid.
#
# obs  - numeric vector of observations (sets the curve's x-range)
# hold - result of hist(obs, plot = FALSE)
# M    - mean of the overlaid normal curve
# SD   - standard deviation of the overlaid normal curve
# col  - fill color for the histogram bars
NormalHist <- function(obs, hold, M, SD, col) {
  # Compute the curve before setting up the plot so the y-limit can
  # accommodate its peak. The previous version used max(hold$density)
  # alone, which could clip the curve; this matches NormalHist in
  # nbaNormal-helpers.R, which includes the curve in ylim.
  x <- seq(min(obs) - 2, max(obs) + 2, 0.01)
  y <- dnorm(x, M, SD)
  plot(0, 0,
       type = 'n',
       xlab = 'Male heights (inches)',
       ylab = '',
       axes = FALSE,
       main = '',
       xlim = M + c(-3, 3) * SD,
       ylim = c(0, max(hold$density, y)))
  # Draw each histogram bar manually from the stored breaks/densities.
  for (i in 1:length(hold$counts)) {
    rect(hold$breaks[i], 0,
         hold$breaks[i + 1], hold$density[i],
         col = col)
  }
  axis(1)
  lines(x, y, lwd = 1.5)
}

================================================
FILE: ch_distributions/figures/fcidMHeights/fcidMHeights.R
================================================
library(openintro)

# Male heights in inches from the FCID data set.
obs <- male_heights_fcid$height_inch
source("fcidMHeights-helpers.R")

# Precompute the histogram so NormalHist() can draw the bars itself.
hold <- hist(obs, plot = FALSE)

myPDF("fcidMHeights.pdf", 6, 2.7,
      mfrow = c(1, 2),
      mgp = c(2, 0.7, 0),
      mar = c(3, 0.2, 1, 0.8))
# Left panel: histogram with a normal curve using the sample mean and SD.
NormalHist(obs, hold, mean(obs), sd(obs), COL[1])

# Right panel: normal probability plot, drawn inline rather than via the
# QQNorm helper (which uses a different ylab and reference line).
par(mar = c(3,4,1,0))
qqnorm(obs,
       cex = 0.7,
       main = '',
       axes = FALSE,
       ylab = 'Male Heights (inches)',
       col = COL[1])
axis(1)
axis(2)
qqline(obs)
dev.off()


================================================
FILE: ch_distributions/figures/fourBinomialModelsShowingApproxToNormal/fourBinomialModelsShowingApproxToNormal.R
================================================
# Figure: four binomial(n, p = 0.1) pmfs with increasing n (2x2 grid),
# showing the shape approaching a normal curve as n grows.
library(openintro)
data(COL)

k  <- -50:500
p  <- 0.1
n  <- c(10, 30, 100, 300)
# Per-panel x-axis limits (lower/upper) and labeled tick positions.
xl <- c(0, 0, 0, 10) - 1
xu <- c(7, 11, 24, 50) - 1
axis1 <- list()
axis1[[1]] <- seq(0, 6, 2)
axis1[[2]] <- seq(0, 10, 2)
axis1[[3]] <- seq(0, 20, 5)
axis1[[4]] <- seq(10, 50, 10)

myPDF('fourBinomialModelsShowingApproxToNormal.pdf', 5.5, 4.1,
      mfrow = c(2, 2),
      mar = c(3.9, 1, 0.5, 1),
      mgp = c(2.2, 0.6, 0))

for (i in 1:4) {
  # Step plot (type = 's') of the pmf for this panel's n, shifted
  # slightly left so steps land near the integers.
  plot(k - 0.05, dbinom(k, n[i], p),
       type = 's',
       xlim = c(xl[i], xu[i]),
       axes = FALSE,
       xlab = paste("n  = ", n[i]),
       ylab = "",
       col = COL[1],
       lwd = 2)
  axis(1, axis1[[i]])
  abline(h = 0)
  # After the first row (panels 1-2), adjust margins for the second row.
  if (i == 2) {
  	par(mar = c(3.25, 1, 0.9, 1))
  }
}

dev.off()


================================================
FILE: ch_distributions/figures/geometricDist35/geometricDist35.R
================================================
# Figure: geometric distribution for p = 0.35 --
# P(first success on trial k) = (1 - p)^(k - 1) * p.
library(openintro)
data(COL)

prob   <- 0.35
trials <- 1:100
pmf    <- (1 - prob)^(trials - 1) * prob

myPDF('geometricDist35.pdf', 6, 3.1,
      mar = c(2.6, 3.6, 0.5, 0.5),
      mgp = c(2.5, 0.34, 0))

# Empty canvas; bars and axes are added manually below.
plot(trials, pmf,
     xlim = c(0.5, 14.5),
     type = 'n',
     axes = FALSE,
     xlab = '',
     ylab = 'Probability')
mtext('Number of Trials', line = 1.5, side = 1)
axis(1, at = seq(2, 14, 2))
par(mgp = c(2.25, 0.5, 0))
axis(2, seq(0, 0.3, 0.1))

# One bar per trial for the first 14 trials.
for (k in 1:14) {
  rect(k - 0.4, 0,
       k + 0.4, pmf[k],
       col = COL[1])
}
abline(h = 0)

# Ellipsis indicating the distribution continues past trial 14.
text(14.7, 0.003, '...', col = '#444444')

dev.off()


================================================
FILE: ch_distributions/figures/geometricDist70/geometricDist70.R
================================================
# Figure: geometric distribution for p = 0.7 --
# P(first success on trial k) = (1 - p)^(k - 1) * p.
library(openintro)
data(COL)

p <- 0.7
x <- 1:100
y <- (1 - p)^(x - 1) * p

myPDF('geometricDist70.pdf', 6, 3.1,
      mar = c(2.6, 3.6, 0.5, 0.5),
      mgp = c(2.5, 0.34, 0))
plot(x, y,
     xlim = c(0.5, 8.5),
     type = 'n',
     axes = FALSE,
     xlab = '',
     ylab = 'Probability')
mtext(paste('Number of Trials Until a Success for p =', p),
    line = 1.5, side = 1)
axis(1, at = seq(1, 20, 1))
par(mgp = c(2.25, 0.5, 0))
axis(2, seq(0, 0.6, 0.2))
# Unlabeled minor ticks every 0.1.
axis(2, seq(0, 0.7, 0.1), rep("", 8), tcl = -0.15)
# Only trials 1-8 fall inside xlim; bars past that would be clipped
# (the original loop drew 14 bars, a leftover from the p = 0.35 script).
for (i in 1:8) {
  rect(x[i] - 0.4, 0,
       x[i] + 0.4, y[i],
       col = COL[1])
}
abline(h = 0)
# Ellipsis just past the last visible bar. Previously placed at
# x = 14.7 -- outside the plot region for this xlim -- so it was
# clipped and never appeared in the figure.
text(8.7, 0.003, '...', col = '#444444')
dev.off()


================================================
FILE: ch_distributions/figures/height40Perc/height40Perc.R
================================================
# Figure: standard normal curve with the lower 40% shaded; the axis is
# labeled on the height scale (mean 70, SD 3.3) rather than in Z units.
library(openintro)
data(COL)

myPDF('height40Perc.pdf', 2.15, 0.95,
      mar = c(1.31, 0, 0.01, 0),
      mgp = c(3, 0.45, 0))

z <- seq(-4, 4, 0.01)
dens <- dnorm(z)
plot(z, dens,
     type = 'l',
     axes = FALSE,
     xlim = c(-3.1, 3.1))
axis(1,
     at = c(-2, 0, 2),
     labels = round(70 + 3.3 * c(-2, 0, 2), 2),
     cex.axis = 0.8)

# Shade everything left of z = -0.25 (roughly the 40th percentile).
shade <- z <= -0.25
polygon(c(z[shade][1], z[shade], max(z[shade])),
        c(0, dens[shade], 0),
        col = COL[1])

text(-2, 0.24, '  40%\n(0.40)', cex = 0.8, col = COL[1])

# Redraw the curve and baseline over the shading.
lines(z, dens)
abline(h = 0)

dev.off()


================================================
FILE: ch_distributions/figures/height82Perc/height82Perc.R
================================================
# Figure: standard normal curve with the lower 82% shaded and the upper
# 18% indicated by an arrow; the axis is labeled on the height scale
# (mean 70, SD 3.3).
library(openintro)
data(COL)

myPDF('height82Perc.pdf', 2.15, 1,
      mar = c(1.31, 0, 0.01, 0),
      mgp = c(3, 0.45, 0))

z <- seq(-4, 4, 0.01)
dens <- dnorm(z)

plot(z, dens,
     type = 'l',
     axes = FALSE,
     xlim = c(-3.4, 3.4))
axis(1,
     at = c(-2, 0, 2),
     labels = round(70 + 3.3 * c(-2, 0, 2), 2),
     cex.axis = 0.8)

# Shade everything left of z = 0.92 (roughly the 82nd percentile).
shade <- z <= 0.92
polygon(c(z[shade][1], z[shade], max(z[shade])),
        c(0, dens[shade], 0), col = COL[1])

text(-2, 0.23, '  82%\n(0.82)', cex = 0.8, col = COL[1])

# Arrow pointing at the unshaded upper tail.
arrows(2, 0.2, 1.45, 0.07, length = 0.07)
text(2.1, 0.18, '  18%\n(0.18)', cex = 0.8, pos = 3)

# Redraw the curve and baseline over the shading.
lines(z, dens)
abline(h = 0)
dev.off()


================================================
FILE: ch_distributions/figures/mikeAndJosePercentiles/mikeAndJosePercentiles.R
================================================
# Figure: percentile shading for two observations on a N(70, 3.3)
# distribution -- Mike at 67 and Jose at 76.
library(openintro)
data(COL)

myPDF("mikeAndJosePercentiles.pdf", 7, 1.3,
      mar = c(2, 0.2, 0.2, 0.2),
      mgp = c(3, 0.8, 0),
      tcl = -0.4)
# Three-cell layout: a narrow empty spacer cell (0), then the two panels.
layout(matrix(0:2, 1), c(0.5, 2, 2), 1)

# Left panel: lower tail below 67 shaded (Mike's percentile).
normTail(70, 3.3,
         L = 67,
         axes = FALSE,
         col = COL[1])
# The extreme at-values (-100, 1000) extend the axis line across the
# panel; only 67 and 70 land inside the plot and get labels.
axis(1,
     at = c(-100, 67, 70, 1000),
     cex.axis = 1.7)
text(62, 0.07, "Mike", cex = 2)

# Right panel: lower tail below 76 shaded (Jose's percentile).
normTail(70, 3.3,
         L = 76,
         axes = FALSE,
         col = COL[1])
axis(1,
     at = c(-100, 70, 76, 1000),
     cex.axis = 1.7)
text(62, 0.07, "Jose", cex = 2)

dev.off()

================================================
FILE: ch_distributions/figures/nbaNormal/nbaNormal-helpers.R
================================================

# Normal probability (Q-Q) plot of x with a reference line through the
# sample quantiles (qqline). M and SD are part of the shared helper
# signature but are not used by this implementation.
QQNorm <- function(x, M, SD, col) {
  qqnorm(x,
         main = '',
         ylab = 'Observed',
         col = col,
         cex = 0.7,
         axes = FALSE)
  axis(1)
  axis(2)
  qqline(x)
}

# Histogram of obs (drawn from a precomputed hist() result `hold`) with
# a N(M, SD) density curve overlaid; the y-limit covers both the bars
# and the curve's peak.
NormalHist <- function(obs, hold, M, SD, col) {
  xs <- seq(min(obs) - 2, max(obs) + 2, 0.01)
  ys <- dnorm(xs, M, SD)
  plot(0, 0,
       type = 'n',
       xlab = 'Height (inches)',
       ylab = '',
       main = '',
       axes = FALSE,
       xlim = M + c(-3, 3) * SD,
       ylim = c(0, max(hold$density, ys)))
  # Draw each bar manually from the stored breaks and densities.
  for (bin in seq_along(hold$counts)) {
    rect(hold$breaks[bin], 0,
         hold$breaks[bin + 1], hold$density[bin],
         col = col)
  }
  axis(1)
  lines(xs, ys, lwd = 1.5)
}

================================================
FILE: ch_distributions/figures/nbaNormal/nbaNormal.R
================================================
# Figure: histogram with fitted normal curve, plus a normal probability
# plot, for NBA player heights (nba_players_19 dataset).
library(openintro)
# Quick dataset checks; output only appears when run interactively.
dim(nba_players_19)
head(nba_players_19)

source("nbaNormal-helpers.R")

obs <- nba_players_19$height
M  <- mean(obs)
SD <- sd(obs)
# Precompute histogram bins; NormalHist draws the bars itself.
hold <- hist(obs, plot = FALSE)

myPDF("nbaNormal.pdf", 6, 2.5,
      mfrow = c(1, 2),
      mgp = c(2, 0.5, 0),
      mar = c(3, 0.5, 0.5, 2),
      cex.axis = 0.8)
NormalHist(obs, hold, M, SD, COL[1])
# Wider left margin for the QQ plot's y-axis labels.
par(mar = c(3, 4, 0.5, 0.5))
QQNorm(obs, M, SD, COL[1])
dev.off()


================================================
FILE: ch_distributions/figures/normApproxToBinomFail/normApproxToBinomFail.R
================================================
# Figure: the normal approximation to Binomial(400, 0.15) fails for a
# narrow interval -- the smooth shaded normal area between 49 and 51
# noticeably mismatches the outlined exact binomial bars.
library(openintro)
data(COL)

k <- 0:400
p <- 0.15
n <- 400
# Interval endpoints for the figure.
x1 <- 49
x2 <- 51
# Normal approximation parameters: mean np and sd sqrt(np(1 - p)).
m <- n * p
s <- sqrt(n * p * (1 - p))

myPDF('normApproxToBinomFail.pdf', 7.5, 2.6,
      mar = c(1.9, 1, 0.3, 1),
      mgp = c(2.2, 0.6, 0),
      tcl = -0.35)

# Approximating normal density curve.
X <- seq(0, 100, 0.01)
Y <- dnorm(X, m, s)
plot(X, Y,
     type = "l",
     xlim = c(37, 83),
     axes = FALSE,
     xlab = "",
     ylab = "")
# Shade the normal area between x1 and x2; dnorm(-1000, ...) is ~0, so
# the first and last polygon vertices sit on the baseline.
polygon(c(x1, x1, x2, x2),
        dnorm(c(-1000, x1, x2, -1000), m, s),
        col = COL[1])
# Outline the exact binomial bars for the same interval as one stepped
# polygon border (heights from dbinom); the +0.5 / -1.1 / +0.1 offsets
# presumably center and slightly inset the bars -- TODO confirm visually.
polygon(rep(c(x1 - 1.1, x1, x1 + 1, x2 + 0.1), rep(2, 4)) + 0.5,
        dbinom(c(-1000, x1, x1, x1 + 1, x1 + 1, x2, x2, -1000),
            n, p),
        border = COL[4],
        lwd = 2)
axis(1)
# Small minor tick marks at every integer.
axis(1,
     1:200,
     rep("", 200),
     tcl = -0.12)
abline(h = 0)

dev.off()


================================================
FILE: ch_distributions/figures/normalExamples/normalExamples-helpers.R
================================================

# Normal probability plot of `x` in color `col`, with a reference line
# and enlarged axis labels.  `M` and `SD` are unused here; they are
# accepted only so this signature mirrors NormalHist's.
QQNorm <- function(x, M, SD, col) {
  qqnorm(x,
         cex = 0.7,
         main = '',
         axes = FALSE,
         ylab = 'observed',
         col = col)
  axis(1, cex.axis = 1.2)
  axis(2, cex.axis = 1.2)
  qqline(x)
}

# Density histogram of `obs` (bins precomputed in `hold` via
# hist(..., plot = FALSE)) with the N(M, SD) density curve overlaid.
# The x-axis is fixed at (-3, 3), so callers presumably pass data on
# roughly the standard-normal scale -- TODO confirm against callers.
NormalHist <- function(obs, hold, M, SD, col) {
  plot(0, 0,
       type = 'n',
       xlab = '',
       ylab = '',
       axes = FALSE,
       main = '',
       xlim = c(-3, 3),
       ylim = c(0, max(hold$density)))
  # Draw each precomputed histogram bar on the density scale.
  for (i in 1:length(hold$counts)) {
    rect(hold$breaks[i], 0,
         hold$breaks[i + 1], hold$density[i],
         col = col)
  }
  axis(1, cex.axis = 1.2)
  # Fitted normal curve over a grid slightly wider than the data range.
  x <- seq(min(obs) - 2, max(obs) + 2, 0.01)
  y <- dnorm(x, M, SD)
  lines(x, y, lwd = 1.5)
}

================================================
FILE: ch_distributions/figures/normalExamples/normalExamples.R
================================================
# Figure: three simulated normal samples (columns n40/n100/n400 suggest
# sizes 40, 100, 400) as histograms with fitted normal curves (top row)
# and normal probability plots (bottom row).
library(openintro)
data(COL)

obs1 <- simulated_normal$n40
obs2 <- simulated_normal$n100
obs3 <- simulated_normal$n400

# Precompute histogram bins and fitted normal parameters per sample.
hold1 <- hist(obs1, plot=FALSE)
M1    <- mean(obs1)
SD1   <- sd(obs1)

hold2 <- hist(obs2, breaks=10, plot=FALSE)
M2    <- mean(obs2)
SD2   <- sd(obs2)

hold3 <- hist(obs3, breaks=12, plot=FALSE)
M3    <- mean(obs3)
SD3   <- sd(obs3)

source("normalExamples-helpers.R")

myPDF("normalExamples.pdf", 7.3, 4.4,
      mfrow = c(2, 3),
      mgp = c(2, 0.7, 0),
      mar = c(3, 0, 1, 1))
NormalHist(obs1, hold1, M1, SD1, COL[1])
NormalHist(obs2, hold2, M2, SD2, COL[2])
NormalHist(obs3, hold3, M3, SD3, COL[3])

# Wider margins for the QQ-plot axes.
par(mar = c(3,2.85,1,1.8))
QQNorm(obs1, M1, SD1, COL[1])
QQNorm(obs2, M2, SD2, COL[2])
# NOTE(review): hard-coded gold here instead of COL[3] -- presumably for
# point visibility; confirm the departure from the COL palette is intended.
QQNorm(obs3, M3, SD3, "#B09A00")

dev.off()


================================================
FILE: ch_distributions/figures/normalQuantileExer/QQNorm.R
================================================

# Normal probability plot with custom y-axis tick positions and a
# text-drawn x-axis label.
#   obs: observed data
#   at:  y-axis tick positions (defaults to pretty(obs))
#   lwd: line width passed through to the plotted points
QQNorm <- function(obs, at = pretty(obs), lwd = 2) {
  qqnorm(obs,
         cex = 0.9,
         main = '',
         axes = FALSE,
         ylab = 'Observed',
         xlab = "",
         col = COL[1],
         lwd = lwd)
  # x-label drawn via mtext so its distance from the axis is controlled.
  mtext("Theoretical quantiles", 1, 1.8, cex = 0.8)
  axis(1, cex.axis = 1.1)
  axis(2, at = at, cex.axis = 1.1)
}


================================================
FILE: ch_distributions/figures/normalQuantileExer/normalQuantileExer-data.R
================================================


================================================
FILE: ch_distributions/figures/normalQuantileExer/normalQuantileExer.R
================================================
# Figure: a 2 x 2 grid of normal probability plots over four simulated
# distributions, for a QQ-plot reading exercise.
library(openintro)
data(COL)


obs1 <- simulated_dist$d1
obs2 <- simulated_dist$d2
obs3 <- simulated_dist$d3
obs4 <- simulated_dist$d4

source("QQNorm.R")

# Explicit y-axis tick positions are supplied where QQNorm's pretty()
# default is unsuitable.
myPDF("normalQuantileExer.pdf", 6, 5.3,
      mfrow = c(2,2),
      mgp = c(2.4,.55,0),
      mar = c(3.5,3.45,1,1),
      cex.lab = 1.1)
QQNorm(obs1, seq(0, 120, 40), lwd = 1.5)
QQNorm(obs2, lwd = 1.5)
QQNorm(obs3, seq(-3, -1, 1), lwd = 1.5)
QQNorm(obs4, lwd = 1.5)
dev.off()


================================================
FILE: ch_distributions/figures/normalQuantileExer/normalQuantileExerAdditional.R
================================================
# Figure: two additional normal probability plots (simulated
# distributions d5 and d6) for the QQ-plot reading exercise.
library(openintro)
data(COL)

source("QQNorm.R")

obs1 <- simulated_dist$d5
obs2 <- simulated_dist$d6


# Side-by-side panels with explicit y-axis tick positions.
myPDF("normalQuantileExerAdditional.pdf", 7.2, 3.18,
      mfrow = c(1, 2),
      mgp = c(2.4, 0.55, 0),
      mar = c(3.5, 3.45, 1, 1),
      cex.lab = 1.1)

QQNorm(obs1, 0:2, lwd = 2)
QQNorm(obs2, seq(5, 15, 5), lwd = 2)
dev.off()


================================================
FILE: ch_distributions/figures/normalTails/normalTails.R
================================================
# Figure: two standard normal curves side by side, one shaded below a
# negative Z and one shaded up to a positive Z.
library(openintro)
data(COL)

myPDF("normalTails.pdf", 4.3, 1,
      mar = c(0.81, 1, 0.3, 1),
      mgp = c(3, -0.2, 0),
      mfrow = c(1,2))
# Left panel: shade below Z = -0.8 (normTail's third argument is the
# lower-tail cutoff L).
normTail(0, 1,
         -0.8,
         col = COL[1],
         axes = FALSE)
# Far-out tick positions (+/- 5) extend the axis line across the panel;
# only the center label is visible.
at <- c(-5, 0, 5)
labels <- c(-5, 'Negative Z', 5)
cex.axis <- 0.7
tick <- FALSE
axis(1, at, labels, cex.axis = cex.axis, tick = tick)
# Dotted vertical guide at the mean.
lines(c(0, 0),
      dnorm(0) * c(0.01, 0.99),
      col = COL[6],
      lty = 3,
      lwd = 1.5)

# Right panel: shade below Z = 0.8, reusing the axis settings above.
normTail(0, 1,
         0.8,
         col = COL[1],
         axes = FALSE)
labels <- c(-5, 'Positive Z', 5)
axis(1, at, labels, cex.axis = cex.axis, tick = tick)
lines(c(0, 0),
      dnorm(0) * c(0.01, 0.99),
      col = COL[6],
      lty = 3,
      lwd = 1.5)
dev.off()


================================================
FILE: ch_distributions/figures/pokerNormal/pokerNormal.R
================================================
# Figure: 50 observed poker earnings as a histogram with a fitted normal
# curve and as a normal probability plot; the extreme values (e.g. 3712,
# -1000) show the normal model fits poorly.
library(openintro)
data(COL)

obs <- c(-110, -9, -60, 316, -200, -196,
         320, -160, 31, 331, 1731, 21,
         -926, -475, 914, -300, -15, 1,
         -29, 829, 761, 227, -141, -672,
         352, 385, 24, 103, -826, 95,
         115, 39, -9, -1000, -35, -200,
         -200, 235, 70, 307, 135, 60,
         -100, -295, -1000, 361, -95,
         337, 3712, -255)

# Fitted normal parameters and a fine grid for the overlay curve.
M  <- mean(obs)
SD <- sd(obs)
x <- seq(min(obs) - 3000,
         max(obs) + 3000,
         1)
y <- dnorm(x, M, SD)
myPDF("pokerNormal.pdf", 6.5, 2.7,
      mfrow = 1:2,
      mgp = c(2, 0.5, 0),
      mar = c(3, 0.5, 0.5, 2))
# Left panel: density histogram with the fitted normal curve overlaid.
histPlot(obs,
         xlab = 'Poker earnings (US$)',
         ylab = '',
         axes = FALSE,
         main = '',
         xlim = c(-2000, 4000),
         probability = TRUE,
         col = COL[1])
axis(1,
     cex.axis = 0.7,
     mgp = c(2, 0.35, 0))
lines(x, y,
      lwd = 1.5)

# Right panel: normal probability plot with a text-drawn x-axis label.
par(mar = c(3, 4, 0.5, 0.5),
    mgp = c(2.8, 0.5, 0),
    cex.axis = 0.8)
qqnorm(obs,
       cex = 0.8, col = COL[1], lwd = 2,
       main = '',
       axes = FALSE,
       xlab = '',
       ylab = 'Observed')
mtext('Theoretical Quantiles',
      line = 2,
      side = 1)
axis(1)
axis(2)
dev.off()


================================================
FILE: ch_distributions/figures/satAbove1190/satAbove1190.R
================================================
# Figure: SAT scores modeled as N(1100, 200) with the area above 1190
# shaded.
library(openintro)
data(COL)

myPDF("satAbove1190.pdf", 3, 1.4,
      mar = c(1.2, 0, 0, 0),
      mgp = c(3, 0.17, 0))
normTail(1100, 200,
         U = 1190,
         axes = FALSE,
         col = COL[1])
# Ticks at the mean and two SDs either side: 700, 1100, 1500.
axis(1, at = 1100 + 200 * c(-2, 0, 2),
     cex.axis = 0.8)
dev.off()


================================================
FILE: ch_distributions/figures/satActNormals/satActNormals.R
================================================
# Figure: SAT (N(1100, 200), top panel) and ACT (N(21, 6), bottom panel)
# distributions, with Ann's SAT score and Tom's ACT score marked.
library(openintro)
data(COL)

set.seed(1)

pdf('satActNormals.pdf', 6, 3.5)
par(mfrow = c(2, 1),
    las = 1,
    mar = c(2.5, 0, 0.5, 0))

# _____ Curve 1 _____ #
m <- 1100
s <- 200
X <- m + s * seq(-6, 6, 0.01)
Y <- dnorm(X, m, s)
plot(X, Y,
     type = 'l',
     axes = FALSE,
     xlim = m + s * 2.7 * c(-1, 1))
axis(1, at = m + s * (-3:3))
abline(h = 0)
# Faint dashed guide at the mean.
lines(c(m, m),
      dnorm(m, m, s) * c(0.01, 0.99),
      lty = 2,
      col = '#EEEEEE')
# Ann's score: one SD above the mean (1300).
lines(c(m, m) + s,
      dnorm(m + s, m, s) * c(0.01, 1.25),
      lty = 2, col = COL[1])
text(m + s,
     dnorm(m + s, m, s) * 1.25,
     'Ann',
     pos = 3,
     col = COL[1])


# _____ Curve 2 _____ #
par(mar = c(2, 0, 1, 0))
m <- 21
s <- 6
X <- m + s * seq(-6, 6, 0.01)
Y <- dnorm(X, m, s)
plot(X, Y,
     type = 'l',
     axes = FALSE,
     xlim = m + s * 2.7 * c(-1, 1))
axis(1, at = m + s * (-3:3))
abline(h = 0)
lines(c(m, m),
      dnorm(m, m, s) * c(0.01, 0.99),
      lty = 2,
      col = '#EEEEEE')
# Tom's score: 3 points (half an SD) above the mean (24).
lines(c(m, m) + 3,
      dnorm(m + 3, m, s) * c(0.01, 1.2),
      lty = 2,
      col = COL[1])
text(m + 3,
     dnorm(m + 3, m, s) * 1.05,
     'Tom',
     pos = 4,
     col = COL[1])

dev.off()


================================================
FILE: ch_distributions/figures/satBelow1030/satBelow1030.R
================================================
# Figures: SAT scores modeled as N(1100, 200).  The first PDF shades the
# area below 1030; the companion PDF shades its complement, above 1030.
library(openintro)
data(COL)


# Shaded lower tail: P(X < 1030).
myPDF('satBelow1030.pdf', 2.875, 1,
      mar = c(1.5, 0, 0, 0),
      mgp = c(3, 0.45, 0))
normTail(1100, 200,
         L = 1030,
         axes = FALSE,
         col = COL[1])
axis(1, at = c(700, 1100, 1500))
dev.off()


# Shaded upper tail: P(X > 1030).
myPDF('satAbove1030.pdf', 3, 1,
      mar = c(1.5, 4, 0, 0),
      mgp = c(3, 0.45, 0))
normTail(1100, 200,
         U = 1030,
         axes = FALSE,
         col = COL[1])
axis(1, at = c(700, 1100, 1500))
dev.off()


================================================
FILE: ch_distributions/figures/satBelow1300/satBelow1300.R
================================================
# Figure: SAT scores modeled as N(1100, 200) with the area below 1300
# shaded.
library(openintro)
data(COL)

#===> plot <===#
myPDF("satBelow1300.pdf", 2.25, 1,
      mar = c(1.2, 0, 0, 0),
      mgp = c(3, 0.17, 0))
# Axes are not suppressed here, so normTail draws its own axis; pass the
# tick-label size straight through.
normTail(1100, 200,
         L = 1300,
         cex.axis = 0.6,
         col = COL[1])
dev.off()


================================================
FILE: ch_distributions/figures/simpleNormal/simpleNormal.R
================================================
# Figure: a bare normal curve with no axes, used as a minimal
# illustration of the normal distribution's shape.
library(openintro)
data(COL)

myPDF("simpleNormal.pdf", 4.3, 1.5,
      mar = rep(0.1, 4))

z <- seq(-5, 5, 0.01)
dens <- dnorm(z)
plot(z, dens,
     type = 'l',
     axes = FALSE,
     xlim = c(-4, 4),
     lwd = 2,
     col = COL[5])
#axis(1, at = -3:3)
# Baseline drawn just below zero so it sits under the curve's tails.
abline(h = -0.002, col = COL[5])

dev.off()


================================================
FILE: ch_distributions/figures/smallNormalTails/smallNormalTails.R
================================================
# Figure: symmetric standard normal tail areas -- the area below -Z on
# the left panel mirrors the area above +Z on the right panel.
library(openintro)

myPDF("smallNormalTails.pdf", 4.56, 1.2,
      mar = c(1.3, 1, 0.5, 1),
      mgp = c(3, 0.27, 0),
      mfrow = c(1, 2))

X <- seq(-4, 4, 0.01)
Y <- dnorm(X)

# Left panel: lower tail below Z = -0.8.
plot(X, Y,
     type = 'l',
     axes = FALSE,
     xlim = c(-3.4, 3.4))
# Far-out ticks at +/- 5 extend the axis line across the panel.
at = c(-5, -0.8, 0, 5)
labels = c(-5, '-Z', 0, 5)
axis(1, at, labels, cex.axis = 0.7)
# -0.799 rather than -0.8, presumably to keep the grid point at -0.8
# inside the shaded region despite floating-point comparison.
these <- which(X < -0.799)
polygon(c(X[these[1]], X[these], X[rev(these)[1]]),
        c(0, Y[these], 0),
        col = '#CCCCCC')
lines(X, Y)
abline(h = 0)
# Dotted guide at the mean.
lines(c(0, 0), c(0, dnorm(0)),
      col = '#CCCCCC',
      lty = 3)

# Right panel: upper tail above Z = 0.8 (mirror of the left panel).
plot(X, Y,
     type = 'l',
     axes = FALSE,
     xlim = c(-3.4, 3.4))
axis(1,
     at = c(-5, 0.8, 0, 5),
     labels = c(-5, 'Z', 0,5),
     cex.axis = 0.7)
these <- which(X > 0.801)
polygon(c(X[these[1]], X[these],X[rev(these)[1]]),
        c(0, Y[these], 0),
        col = '#CCCCCC')
lines(X, Y)
abline(h = 0)
lines(c(0, 0),
      c(0, dnorm(0)),
      col = '#CCCCCC',
      lty = 3)

dev.off()


================================================
FILE: ch_distributions/figures/standardNormal/standardNormal.R
================================================
# Figure: standard normal curve overlaid on a density histogram of
# 100,000 simulated N(0, 1) draws.
library(openintro)

set.seed(1)
x <- rnorm(1e5)
hold <- hist(x, breaks = 50, plot = FALSE)

# Page size written as pixels / 255 -- presumably matching a raster
# layout target; TODO confirm.
myPDF("standardNormal.pdf", 1250 / 255, 650 / 255,
      mar = c(2, 0, 0.5, 0))

X <- seq(-4, 4, 0.01)
Y <- dnorm(X)

plot(X, Y,
     type = 'l',
     axes = FALSE,
     xlim = c(-3.4, 3.4))
axis(1, at = -3:3)
# Light gray histogram bars drawn beneath the curve.
for(i in 1:length(hold$counts)){
  rect(hold$breaks[i], 0,
       hold$breaks[i+1], hold$density[i],
       border = '#DDDDDD',
       col = '#F4F4F4')
}
# Redraw the curve over the bars and add a baseline.
lines(X, Y)
abline(h = 0)

dev.off()


================================================
FILE: ch_distributions/figures/subtracting2Areas/subtracting2Areas.R
================================================
library(openintro)
data(COL)

# Draw one normal curve shifted right by `offset`, shading the region
# between `shade.start` and `shade.until` (defaults shade everything).
AddShadedPlot <- function(x, y, offset,
                          shade.start = -8,
                          shade.until = 8) {
  lines(x + offset, y)
  lines(x + offset, rep(0, length(x)))
  these <- which(shade.start <= x & x <= shade.until)
  polygon(c(x[these[1]], x[these], x[rev(these)[1]]) + offset,
          c(0, y[these], 0),
          col = COL[1])
  lines(x + offset, y)
}
# Place a label above a curve, at a fixed height, centered at x.
AddText <- function(x, text) {
  text(x, 0.549283, text)
}

# Figure: area arithmetic shown as a chain of four shaded curves:
# 1.0000 - 0.3821 - 0.1131 = 0.5048.
pdf('subtracting2Areas.pdf', 4, 0.7)
par(las = 1,
    mar = rep(0, 4),
    mgp = c(3, 0, 0))
X <- seq(-3.2, 3.2, 0.01)
Y <- dnorm(X)

# Canvas wide enough for four curves at offsets 0, 8, 16, 24.
plot(X, Y,
     type = 'l',
     axes = FALSE,
     xlim = c(-3.4, 24 + 3.4),
     ylim = c(0, 0.622))

# Fully shaded curve labeled "1.0000" (formatting against 0.0001 forces
# four decimal places).
AddShadedPlot(X, Y, 0)
AddText(0, format(c(1, 0.0001), scientific = FALSE)[1])

# Lower tail below Z = -0.3: area 0.3821.
AddShadedPlot(X, Y, 8, -8, -0.3)
AddText(8, format(0.3821, scientific = FALSE)[1])

# Upper tail above Z = 1.21: area 0.1131.
AddShadedPlot(X, Y, 16, 1.21, 8)
AddText(16, format(0.1131, scientific = FALSE)[1])

# Middle region between Z = -0.3 and Z = 1.21: area 0.5048.
AddShadedPlot(X, Y, 24, -0.3, 1.21)
AddText(24, format(0.5048, scientific = FALSE)[1])

# Minus signs between the first three curves: a short bar at label
# height and a longer bar between the curves.
lines(c(3.72, 4.28), rep(0.549283, 2), lwd = 2)
lines(c(3, 8 - 3), c(0.2, 0.2), lwd = 3)
lines(c(8 + 3.72, 8 + 4.28), rep(0.549283, 2), lwd = 2)
lines(c(8 + 3, 2 * 8 - 3), c(0.2, 0.2), lwd = 3)

# Equals sign before the final curve.
text(20, 0.549283,
     ' = ')
segments(rep(19, 2), c(0.17, 0.23), rep(21, 2), lwd = 3)
dev.off()



================================================
FILE: ch_distributions/figures/subtractingArea/subtractingArea.R
================================================
library(openintro)

# Draw one normal curve shifted right by `offset`, shading the region
# between `shade.start` and `shade.until` (defaults shade everything).
AddShadedPlot <- function(x, y, offset,
                          shade.start = -8,
                          shade.until = 8) {
  lines(x + offset, y)
  lines(x + offset, rep(0, length(x)))
  these <- which(shade.start <= x & x <= shade.until)
  polygon(c(x[these[1]], x[these], x[rev(these)[1]]) + offset,
          c(0, y[these], 0),
          col = COL[1])
  lines(x + offset, y)
}
# Place an enlarged label above a curve, at a fixed height, centered at x.
AddText <- function(x, text) {
  text(x, 0.549283, text, cex = 2)
}

# Figure: area arithmetic shown as three shaded curves:
# 1.0000 - 0.6736 = 0.3264.
pdf('subtractingArea.pdf', 6, 1.4)
par(las = 1,
    mar = rep(0, 4),
    mgp = c(3, 0, 0))
X <- seq(-3.2, 3.2, 0.01)
Y <- dnorm(X)

# Canvas wide enough for three curves at offsets 0, 8, 16.
plot(X, Y,
     type = 'l',
     axes = FALSE,
     xlim = c(-3.4, 16 + 3.4),
     ylim = c(0, 0.622))

# Fully shaded curve labeled "1.0000" (formatting against 0.0001 forces
# four decimal places).
AddShadedPlot(X, Y, 0)
AddText(0, format(c(1, 0.0001), scientific = FALSE)[1])

# Lower tail below Z = 0.45: area 0.6736.
AddShadedPlot(X, Y, 8, -8, 0.45)
AddText(8, format(0.6736, scientific = FALSE)[1])

# Upper tail above Z = 0.45: area 0.3264.
AddShadedPlot(X, Y, 16, 0.45, 8)
AddText(16, format(0.3264, scientific = FALSE)[1])

# Minus sign between the first two curves: short bar at label height,
# longer bar between the curves.
lines(c(3.72, 4.28), rep(0.549283, 2), lwd = 2)
lines(c(3, 8 - 3), c(0.2, 0.2), lwd = 3)

# Equals sign before the final curve.
text(12, 0.549283,
     ' = ',
     cex = 2)
segments(c(11, 11), c(0.17, 0.23), c(13, 13), lwd = 3)
dev.off()


# Companion figure: SAT curve N(1100, 200) with the region below 1190
# shaded.
pdf('subtracted.pdf', 3, 0.95)
par(las = 1,
    mar = c(1.5, 3, 0, 0),
    mgp = c(3, 0.55, 0))
normTail(1100, 200, L = 1190, col = COL[1], axes = FALSE)
axis(1, at = c(700, 1100, 1500))
dev.off()


================================================
FILE: ch_distributions/figures/twoSampleNormals/twoSampleNormals.R
================================================
# Figure: two normal curves over histograms of the same simulated
# N(0, 1) sample -- plotted directly on the left, and rescaled onto the
# N(19, 4) scale on the right.
library(openintro)
data(COL)

set.seed(1)
x <- rnorm(100000)
hold <- hist(x,
             breaks = 50,
             plot = FALSE)

myPDF("twoSampleNormals.pdf", 6, 2,
      mfrow = c(1,2), las = 1, mar = c(2.5,1,0.5,1))

# curve 1
X <- seq(-4,4,0.01)
Y <- dnorm(X)
plot(X, Y,
     type = 'l',
     col = COL[1],
     axes = FALSE,
     xlim = c(-3.4, 3.4))
axis(1, at = -3:3)
# Histogram bars behind the curve.  COL[5, 4] / COL[7, 3] presumably
# index transparency variants of the color matrix -- TODO confirm.
for (i in 1:length(hold$counts)) {
  rect(hold$breaks[i], 0,
       hold$breaks[i+1], hold$density[i],
       border = COL[5,4], col = COL[7,3])
}
lines(X, Y, col = COL[1], lwd = 2)
abline(h = 0)

# curve 2
X <- seq(3,35,0.01)
Y <- dnorm(X, 19, 4)
plot(X, Y, type = 'l', col = COL[2], axes = FALSE, xlim = c(5.4,32.6))
axis(1, at = 19+4*(-3:3))

# Reuse the standard-normal histogram: map breaks via x -> 19 + 4x and
# divide densities by 4 so the total area still integrates to one.
for (i in 1:length(hold$counts)) {
  rect(19 + 4 * hold$breaks[i], 0,
       19 + 4 * hold$breaks[i + 1], hold$density[i] / 4,
       border = COL[5, 4], col = COL[7, 3])
}
lines(X, Y, col = COL[2], lwd = 2)
abline(h = 0)

dev.off()


================================================
FILE: ch_distributions/figures/twoSampleNormalsStacked/twoSampleNormalsStacked.R
================================================
# Figure: N(0, 1) and N(19, 4) density curves drawn on one shared axis
# to contrast their locations and spreads.
library(openintro)
data(COL)

myPDF("twoSampleNormalsStacked.pdf", 4.65, 2,
      mar = c(1.7,1,0.1,1))

# curve 1
X <- seq(-4,4,0.01)
Y <- dnorm(X)
plot(X, Y,
     type = 'l',
     col = COL[1],
     axes = FALSE,
     xlim = c(-5, 35))
axis(1, at = seq(-10, 40, 10))
lines(X, Y, col = COL[1], lwd = 3)
abline(h = 0)

# curve 2: added to the same panel.
X <- seq(4, 35, 0.01)
Y <- dnorm(X, 19, 4)
lines(X, Y, col = COL[2], lwd = 3)

dev.off()


================================================
FILE: ch_foundations_for_inf/TeX/ch_foundations_for_inf.tex
================================================
\begin{chapterpage}{Foundations for inference}
  \chaptertitle{Foundations for inference}
  \label{foundationsForInference}
  \label{ch_foundations_for_inf}
  \chaptersection{pointEstimates}
  \chaptersection{confidenceIntervals}
  \chaptersection{hypothesisTesting}
\end{chapterpage}
\renewcommand{\chapterfolder}{ch_foundations_for_inf}

\chapterintro{Statistical inference is primarily
  concerned with understanding and quantifying the
  uncertainty of parameter estimates.
  While the equations and details change
  depending on the setting, the foundations for inference
  are the same throughout all of statistics. \\

  \noindent%
  We start with a familiar topic:
  the idea of using a sample proportion to estimate
  a population proportion.
  Next, we create what's called a
  \emph{\hiddenterm{confidence interval}}, which is a range
  of plausible values where we may find the true population
  value.
  Finally, we introduce the
  \emph{hypothesis testing framework},
  which allows us to formally evaluate claims about the
  population, such as whether a survey provides strong
  evidence that a candidate has the support of a majority
  of the voting population.}



%__________________
\section{Point estimates and sampling variability}
\label{pointEstimates}

\index{data!solar survey|(}

Companies such as Pew Research frequently conduct
polls as a way to understand the state of public opinion
or knowledge on many topics, including politics,
scientific understanding, brand recognition, and more.
%These polls typically reach a sample of 300 to
%10,000 people.
The ultimate goal in taking a poll is generally to use
the responses to estimate the opinion or knowledge of the
broader population.

%These polls are often based on 500 to 5000 people,
%and a polling company such as Pew would use this sample
%to estimate the opinions of the broader population.
%For example, Pew frequently conducts a poll on about
%1000 adults about their feelings about the direction
%of their country.
%In early 2019, they found that 
%Through this and future sections,
%we'll use some new notation and terminology:
%\begin{itemize}
%\item
%    For all inference problems concerning proportions,
%    the population proportion will be written as $p$.
%    When discussing a population summary such as $p$,
%    it is common to refer to the value as a population
%    \term{parameter}.
%    In the solar survey,
%    $p$ represents the proportion of \emph{all}
%    American adults who support solar energy.
%\item Using Pew Research sample, we can estimate that the proportion
%    of American adults who support expanding solar energy is
%    somewhere near \pewsolarpollpercent{}.
%    This is called the \term{sample proportion},
%    and it gets a special label of $\hat{p}$
%    (spoken as \emph{p-hat}).
%\item The size of a sample will generally
%    be denoted by $n$. In the case of this Pew Research poll,
%    the \term{sample size} is $n = \pewsolarpollsize{}$.
%\end{itemize}



%In the United States, those 1000 adults would be used
%to generalize out to a population of about \emph{250 million}
%American adults.
%A~natural question arises:
%\begin{quote}
%\em
%If the poll was based on only a thousand people,
%how reliable is it?
%\end{quote}
%For instance, if we took another poll,
%we wouldn't get the exact same answer,
%so how trustworthy is the result?
%This is the topic of this first inference section,
%where we hope to understand how variable estimates
%are from one sample to the next,
%which will give us an idea of how much trust we should
%(or shouldn't) put into such polls.


\subsection{Point estimates and error}

\index{point estimate|(}

Suppose a poll suggested the US President's approval
rating is 45\%.
We would consider 45\% to be a
\term{point estimate}\index{estimate} of the approval
rating we might see if we collected responses from the
entire population.
%\footnote{When we collect responses from the
%  entire population, it is called a \term{census}.
%  It is often expensive to conduct a census,
%  which is why we often instead take a sample.}
This entire-population response proportion is
generally referred to as the \term{parameter}
of interest.
When the parameter is a proportion,
it is often denoted by $p$,
%We typically estimate the parameter by collecting
%information from a sample of the population;
%we compute the observed proportion in the sample;
%also called a \term{point estimate},
and we often refer to the sample proportion as $\hat{p}$
(pronounced \emph{p-hat}\footnote{Not to be confused with
  \emph{phat}, the slang term used for something cool,
  like this book.}).
Unless we collect responses from every individual in the population,
$p$ remains unknown, and we use $\hat{p}$ as our estimate of~$p$.
The difference we observe from the poll versus
the parameter is called the \term{error} in the estimate.
%There are other considerations that can influence
%the error in a sample's estimate can be influenced
%by other factors, too.
%it is not the complete story.
%For this reason, we will also find it convenient to track
%the \term{sample size}, which is generally referred to using
%the letter $n$.
Generally, the error consists of two aspects:
sampling error and bias.
%Throughout the rest of this section,
%we discuss what a point estimate like
%\pewsolarpollpercent{} represents
%and the sampling uncertainty associated with such an estimate.
%If we take a simple random sample of 1000 American adults
%and ask them for their opinion about solar energy,
%will we tend to get a result close to the
%\pewsolarpollpercent{} value,
%or might we see observations far from the truth?


%
%Suppose that we know that \pewsolarpollpercent{}
%of American adults 
%
%American adults' attitudes towards different forms of energy.
%They found that \pewsolarpollpercent{} of respondents
%favored expanding
%solar energy.
%In this case, Pew Research worked to ensure
%that the sample was representative.
%However, a~natural question remains:
%\begin{quote}
%\em
%If the poll was based on only a thousand people,
%how reliable is it?
%\end{quote}
%If we took another poll, we wouldn't get the exact same answer.
%Maybe we'd get 90\%, or perhaps even 80\%.
%Ultimately, it's unlikely that the actual proportion of
%Americans who support expanding solar energy is
%\emph{exactly}~\pewsolarpollpercent{}, but the data suggest
%the actual support is close to \pewsolarpollpercent{}.
%This type of uncertainty --
%the variability in the estimate from one sample to the next --
%is called the \term{sampling error},
%and it is a major focus throughout the rest of this book.

%\footnote{Another major form
%  of error is \term{bias}, which basically is a systematic
%  tendency to over or under-estimate the true population value.
%  For instance, if we took a political poll and undersampled
%  one of the political parties, the sample would not be
%  representative and would skew in a particular direction.}
%Ultimately, it's unlikely that the actual proportion of Americans
%who support expanding solar energy is \emph{exactly}
%\pewsolarpollpercent{}, but the data suggest the actual
%support is close to \pewsolarpollpercent{}.


%The Pew Research poll is a point estimate
%of the actual proportion
%of American adults who support expanding solar energy.
%This estimate of \pewsolarpollpercent{} is unlikely
%to be perfect,
%and it's quite possible for the population proportion
%to be a little lower or a little higher than the
%sample proportion.
%The difference between a point estimate and
%the parameter is called the estimate's \term{error}.

\termsub{Sampling error}{sampling error},
sometimes called \emph{\hiddenterm{sampling uncertainty}},
describes how much an estimate will tend to vary from
one sample to the next.
For instance, the estimate from one sample might be 1\% too low
while in another it may be 3\% too high.
Much of statistics, including much of this book,
is focused on understanding and quantifying sampling error,
and we will find it useful to consider a sample's size
to help us quantify this error;
the \term{sample size} is often represented by the letter $n$.
%Intuitively, a larger sample would tend to produce a more
%accurate estimate than what we would
%obtain from a smaller sample.
%This is exactly the ref
%estimate from a smaller sample,
%and this is generally true.

\termsub{Bias}{bias} describes a systematic tendency
to over- or under-estimate the true population value.
For~example, if we were taking a student poll asking
about support for a new college stadium, we'd probably
get a biased estimate of the stadium's level of student
support by wording the question as,
\emph{Do you support your school by supporting funding
  for the new stadium?}
We try to minimize bias through thoughtful data
collection procedures, which were discussed in
Chapter~\ref{ch_intro_to_data}
and are the topic of many other books.

%While bias is an incredibly important topic,
%it's forms are so varied that 
%so vast and context-specific that we 

%\begin{onebox}{Sampling error vs bias}
%  \termsub{Sampling error}{sampling error} is uncertainty
%  in a point estimate that happens naturally from one sample
%  to the next.
%  The methods we discuss are useful for understanding,
%  quantifying, and working with sampling errors.
%  \stdvspace{}
%
%  In contrast, another common form of error is \term{bias},
%  which is a systematic tendency to over or under-estimate
%  the true population value.
%  For instance, if we took a political poll but our sample
%  didn't include a roughly representative distribution of
%  the political parties, the sample would likely skew
%  in a particular direction and be biased.
%\end{onebox}




\subsection{Understanding the variability of a point estimate}
\label{simulationForUnderstandingVariabilitySection}

\newcommand{\pewsolarpollsize}{1000}
\newcommand{\pewsolarparprop}{0.88}
\newcommand{\pewsolarparpropcomplement}{0.12}
\newcommand{\pewsolarparpercent}{88\%}
\newcommand{\pewsolarparpercentcomplement}{12\%}
\newcommand{\pewsolarpollprop}{0.887}
\newcommand{\pewsolarpollpropcomplement}{0.113}
\newcommand{\pewsolarpollpercent}{88.7\%}
\newcommand{\pewsolarpollpercentcomplement}{11.3\%}
\newcommand{\pewsolarpollcount}{887}
\newcommand{\pewsolarpollexpcount}{880}
\newcommand{\pewsolarpollcountcomplement}{113}
\newcommand{\pewsolarpollexpcountcomplement}{120}
\newcommand{\pewsolarpollse}{0.010}

Suppose the proportion of American adults who support
the expansion of solar energy is $p = \pewsolarparprop{}$,
which is our parameter of interest.\footnote{We haven't
  actually conducted a census to measure this value perfectly.
  However, a very large sample has suggested the actual
  level of support is about \pewsolarparpercent{}.}
If we were to take a poll of \pewsolarpollsize{} American adults
on this topic, the estimate would not be perfect,
but how close might we expect the sample proportion
from the poll to be to \pewsolarparpercent{}?
We want to understand \emph{how the
sample proportion $\hat{p}$ behaves when the true population
proportion is
\pewsolarparprop{}}.\footnote{\pewsolarparpercent{}
  written as a proportion would be
  \pewsolarparprop{}.
  It is common to switch between proportion and percent.
  However, formulas presented in this book always refer
  to the proportion, not the percent.}
Let's find out!
We can simulate responses we would get from a simple
random sample of 1000 American adults,
which is only possible because we know the actual
support for expanding solar energy is \pewsolarparprop{}.
%
%
%We could
%run the survey again to see how consistent the results
%are, but who has the time and money for that? Instead,
%we can investigate the properties of $\hat{p}$ using simulations.
%
%To simulate the sample, we'll suppose that the population
%proportion is exactly \pewsolarpollpercent{}.
%Now, we know
%the population proportion isn't exactly \pewsolarpollpercent\%,
%but we do expect it to be close, so this simulation will offer
%us some insights about the property of $\hat{p}$.
%If we took a random sample
%from this population, how accurate would the point estimate be?
Here's how we might go about constructing such a simulation:
%simulate it:
\begin{enumerate}
\item There were about 250 million American adults in 2018.
    On 250 million pieces of paper, write ``support''
    on \pewsolarparpercent{} of them and ``not'' on
    the other \pewsolarparpercentcomplement{}.
\item Mix up the pieces of paper and pull out \pewsolarpollsize{}
    pieces to represent our sample of \pewsolarpollsize{}
    American adults.
\item Compute the fraction of the sample that say ``support''.
\end{enumerate}
Any volunteers to conduct this simulation? Probably not. Running
this simulation with 250 million pieces of paper would be
time-consuming and very costly, but we can simulate it
using computer code; we've written a short program in
Figure~\ref{solarPollSimulationCodeR}
in case you are curious what the computer code looks like.
In this simulation, the sample gave a point estimate of
$\hat{p}_1 = 0.894$. We~know the population proportion
for the simulation was $p = \pewsolarparprop{}$, so we know
the estimate had an error of
$0.894 - \pewsolarparprop{} = \text{+0.014}$.

%\setlength\textwidth{\officialtextwidth-10mm}
\begin{figure}[h]
\texttt{\# 1.\ Create a set of 250 million entries,
where \pewsolarparpercent{} of them are "support" \\
\#\ \ \ \ and \pewsolarparpercentcomplement{} are "not". \\
pop\us{}size <- 250000000 \\
possible\_entries <- c(rep("support", \pewsolarparprop{} * pop\us{}size), rep("not", \pewsolarparpropcomplement{} * pop\us{}size))
\\[3mm]
\# 2.\ Sample \pewsolarpollsize{} entries without replacement. \\
sampled\_entries <- sample(possible\_entries, size = \pewsolarpollsize{}) \\[3mm]
\# 3.\ Compute p-hat:~count the number that are "support",
then divide by \\
\#\ \ \ \ the sample size. \\
sum(sampled\_entries == "support") / \pewsolarpollsize{}}
\caption{For those curious, this is code for
    a single $\hat{p}$ simulation using the
    statistical software called \R{}\index{R}.
    Each line that starts with \texttt{\#} is a
    \term{code comment},
    which is used to describe in regular language what the
    code is doing.
    We've provided software labs in \R{} at
    \oiRedirect{os}{openintro.org/book/os}
    for anyone interested in learning more.}
\label{solarPollSimulationCodeR}
\end{figure}
% \setlength\textwidth{\officialtextwidth}

One simulation isn't enough to get a great sense of the
distribution of estimates we might expect in the simulation,
so we should run more simulations.
In a second simulation,
we get $\hat{p}_2 = 0.885$, which has an error of~+0.005.
In another, $\hat{p}_3 = 0.878$ for an error of -0.002.
And in another,
an estimate of $\hat{p}_4 = 0.859$ with an error of -0.021.
With the help of a computer, we've run the simulation 10,000 times
and created a histogram of the results from all 10,000 simulations
in Figure~\ref{sampling_10k_prop_88p}. This
distribution of sample proportions is called a
\term{sampling distribution}.
We can characterize this sampling distribution as follows:
\begin{description}
\setlength{\itemsep}{0mm}
\item[Center.]
    The center of the distribution is
    $\bar{x}_{\hat{p}} = \pewsolarparprop{}0$,
    which is the same as the parameter.
    Notice that the simulation mimicked a simple random sample
    of the population, which is a straightforward sampling
    strategy that helps avoid sampling bias.
%    That~is, we see that the sample proportion is an
%    \termsub{unbiased estimate}{unbiased}
%    of the population proportion.
\item[Spread.]
    The standard deviation of the distribution
    is $s_{\hat{p}} = \pewsolarpollse{}$.
    When we're talking about
    a sampling distribution or the variability of
    a point estimate, we typically use the term
    \termsub{standard error}{standard error (SE)}
    rather than \emph{standard deviation},
    and the notation $SE_{\hat{p}}$ is used for the standard
    error associated with the sample proportion.
\item[Shape.]
    The distribution is symmetric and bell-shaped,
    and it \emph{resembles a normal distribution}.
\end{description}
These findings are encouraging!
When the population
proportion is $p = \pewsolarparprop{}$ and the sample size is
$n = \pewsolarpollsize{}$,
the sample proportion $\hat{p}$ tends to give
a pretty good estimate
of the population proportion.
We also have the interesting observation
that the histogram resembles a normal distribution.

\begin{figure}[h]
   \centering
   \Figure[A histogram is shown for 10,000 sample proportions where each sample is taken from a population where the population proportion is \pewsolarparprop{} and the sample size is $n = \pewsolarpollsize{}$. The distribution is bell-shaped (appears nearly normal), is centered at 0.88 and has a standard deviation of about 0.01.]{0.8}{sampling_10k_prop_88p}
   %\Figure{0.8}{sampling_10k_prop_887p}
   \caption{A histogram of 10,000 sample proportions,
       where each sample is taken from a population
       where the population proportion is
       \pewsolarparprop{} and the sample size
       is $n = \pewsolarpollsize{}$.}
   \label{sampling_10k_prop_88p}
   %\label{sampling_10k_prop_887p}
\end{figure}

\begin{onebox}{Sampling distributions are
    never observed, but we keep them in mind}
  In real-world applications, we never actually observe the
  sampling distribution, yet it is useful to always think of
  a point estimate as coming from such a hypothetical
  distribution.
  \mbox{Understanding} the sampling distribution will help us
  characterize and make sense of the point estimates that we
  do observe.
\end{onebox}

\begin{examplewrap}
\begin{nexample}{If we used a much smaller sample size of $n = 50$,
would you guess that the standard error for $\hat{p}$ would be larger
or smaller than when we used $n = \pewsolarpollsize{}$?}
\label{smallerSampleWhatHappensToPropErrorExercise}
Intuitively, it seems like more data is better
than less data, and generally that is correct! The typical error
when $p = \pewsolarparprop{}$ and $n = 50$ would be larger
than the error we would expect when $n = \pewsolarpollsize{}$.
\end{nexample}
\end{examplewrap}

%\noindent
Example~\ref{smallerSampleWhatHappensToPropErrorExercise}
highlights an important property we will see again and again:
a bigger sample tends to provide a more precise point estimate
than a smaller sample.

\index{point estimate|)}


\subsection{Central Limit Theorem}

The distribution in
Figure~\ref{sampling_10k_prop_88p} looks an awful lot like
a normal distribution. That is no anomaly; it~is the result
of a general principle called the
\index{Central Limit Theorem!proportion|textbf}
\term{Central Limit Theorem}.

\begin{onebox}{Central Limit Theorem and the success-failure condition}
  When observations are independent and the sample size is
  sufficiently large, the sample proportion $\hat{p}$ will tend
  to follow a normal distribution with the following mean and
  standard error:%\footnotemark{}
  \begin{align*}
    \mu_{\hat{p}} &= p
    &SE_{\hat{p}} &= \sqrt{\frac{p (1 - p)}{n}}
  \end{align*}
  In order for the Central Limit Theorem to hold,
  the sample size is typically considered sufficiently large
  when $np \geq 10$ and $n(1-p) \geq 10$, which is called the
  \term{success-failure condition}.
\end{onebox}
%\footnotetext{Some statisticians will say what we
%  have written for $SE_{\hat{p}}$ should be called
%  the \emph{standard deviation of $\hat{p}$}
%  and the standard error is a term for
%  an estimated version (that we'll first encounter
%  in Section~\ref{apply_clt_real_world_setting}).
%  We adhere to simpler terminology in this book
%  that is also accepted,
%  where the listed formula also can be called the
%  \emph{standard error}.}

The Central Limit Theorem is incredibly important, and it provides
a foundation for much of statistics.
As we begin applying
the Central Limit Theorem, be mindful of the two
technical conditions:
the observations must be independent, and the sample size must
be sufficiently large such that $np \geq 10$ and $n(1-p) \geq 10$.

\begin{examplewrap}
\begin{nexample}{Earlier we estimated the mean and standard
error of $\hat{p}$ using simulated data when
$p = \pewsolarparprop{}$ and $n = \pewsolarpollsize{}$.
Confirm that the Central Limit Theorem applies
and the sampling distribution is approximately
normal.}\label{sample_p88_n1000_confirm_normal}
\begin{description}
\item[Independence.] There are $n = \pewsolarpollsize{}$
    observations for each
    sample proportion $\hat{p}$, and each of those observations
    are independent draws. \emph{The most common way for
    observations to be considered independent is if they are from
    a simple random sample.}
    \index{independent}
    \index{independence}
    \index{Central Limit Theorem!independence}
\item[Success-failure condition.] We can confirm the sample size
    is sufficiently large by checking the success-failure condition
    and confirming the two calculated values are greater than~10:
    \begin{align*}
    np &= \pewsolarpollsize{} \times \pewsolarparprop{}
        = \pewsolarpollexpcount{}
        \geq 10
    &n(1-p) &= \pewsolarpollsize{} \times (1 - \pewsolarparprop{})
        = \pewsolarpollexpcountcomplement{}
        \geq 10
    \end{align*}
\end{description}
The independence and success-failure conditions are both
satisfied, so the Central Limit Theorem applies, and it's
reasonable to model $\hat{p}$ using a normal distribution.
\end{nexample}
\end{examplewrap}

\begin{onebox}{How to verify sample observations are independent}
  Subjects in an experiment are considered independent
  if they undergo random assignment to the treatment
  groups.\stdvspace{}

  If the observations are from a simple random sample,
  then they are independent.\stdvspace{}

  If a sample is from a seemingly random process,
  e.g., an occasional error on an assembly line,
  checking independence is more difficult. In~this case,
  use your best judgement.
\end{onebox}

An additional condition that is sometimes added for samples
from a population is that they are no larger than 10\% of
the population.
When the sample exceeds 10\% of the population size,
the methods we discuss tend to overestimate the sampling error
slightly versus what we would get using more advanced
methods.\footnote{For example, we could use what's called the
  \term{finite population correction factor}:
  if the sample is of size $n$ and the population size is $N$,
  then we can multiply the typical standard error formula by
  $\sqrt{\frac{N-n}{N-1}}$
  to obtain a smaller, more precise estimate of the
  actual standard error.
  When $n < 0.1 \times N$, this correction factor is
  relatively small.}
This is very rarely an issue, and when it is an issue,
our methods tend to be conservative, so we consider this
additional check as optional.

\begin{examplewrap}
\begin{nexample}{Compute the theoretical mean and standard error
of $\hat{p}$ when
$p = \pewsolarparprop{}$ and $n = \pewsolarpollsize{}$,
according to the
Central Limit Theorem.}\label{sample_p88_n1000_mean_se}
The mean of the $\hat{p}$'s is simply the population proportion:
$\mu_{\hat{p}} = \pewsolarparprop{}$.

The calculation of the standard error of $\hat{p}$ uses
the following formula:
\begin{align*}
SE_{\hat{p}}
    = \sqrt{\frac{p (1 - p)}{n}}
    = \sqrt{\frac{\pewsolarparprop{} (1 - \pewsolarparprop{})}
        {\pewsolarpollsize{}}}
    = \pewsolarpollse{}
\end{align*}
\end{nexample}
\end{examplewrap}

\begin{examplewrap}
\begin{nexample}{Estimate how frequently the sample proportion
$\hat{p}$ should be within 0.02 (2\%) of the population value,
$p = \pewsolarparprop{}$. Based on
Examples~\ref{sample_p88_n1000_confirm_normal}
and~\ref{sample_p88_n1000_mean_se},
we know that the distribution is approximately
$N(\mu_{\hat{p}} = \pewsolarparprop{}, SE_{\hat{p}} = \pewsolarpollse{})$.}
\label{sampling_10k_prop_887p-prop_from_867_to_907}
After so much practice in Section~\ref{normalDist},
this normal distribution example will hopefully feel familiar!
We would like to understand the fraction of $\hat{p}$'s
between 0.86 and 0.90:
\begin{center}
\Figure[A normal distribution centered at 0.88 with a standard deviation of 0.01 is shown, where the region between 0.86 and 0.90 has been shaded.]{0.35}{p-hat_from_86_and_90}
\end{center}
With $\mu_{\hat{p}} = \pewsolarparprop{}$ and
$SE_{\hat{p}} = \pewsolarpollse{}$,
we can compute the Z-score for both the left and right cutoffs:
\begin{align*}
Z_{0.86}
  &= \frac{0.86 - \pewsolarparprop{}}{\pewsolarpollse{}}
  = -2
&Z_{0.90}
  &= \frac{0.90 - \pewsolarparprop{}}{\pewsolarpollse{}}
  = 2
\end{align*}
We can use either statistical software, a graphing calculator,
or a table to find the areas to the tails, and in any case we
will find that they are each 0.0228. The total tail areas are
$2 \times 0.0228 = 0.0456$, which leaves the shaded area of
0.9544. That is, about 95.44\% of the sampling distribution
in Figure~\ref{sampling_10k_prop_88p} is within $\pm0.02$
of the population proportion, $p = \pewsolarparprop{}$.
\end{nexample}
\end{examplewrap}

\D{\newpage}

\begin{exercisewrap}
\begin{nexercise}
In Example~\ref{smallerSampleWhatHappensToPropErrorExercise}
we discussed how a smaller sample would tend
to produce a less reliable estimate. Explain how this intuition
is reflected in the formula for
$SE_{\hat{p}} = \sqrt{\frac{p (1 - p)}{n}}$.\footnotemark
\end{nexercise}
\end{exercisewrap}
\footnotetext{Since the
  sample size $n$ is in the denominator
  (on the bottom) of the fraction,
  a bigger sample size means the entire
  expression when calculated will tend to be smaller.
  That is, a larger sample size would correspond to
  a smaller standard error.}


\subsection{Applying the Central Limit Theorem to
    a real-world setting}
\label{apply_clt_real_world_setting}

We do not actually know the population proportion
unless we conduct an expensive poll of all individuals
in the population.
Our earlier value of $p = 0.88$ was based on a poll
conducted by Pew Research of \pewsolarpollsize{}
American adults that found
$\hat{p} = \pewsolarpollprop{}$ of them favored
expanding solar energy.
The researchers might have wondered:
does the sample proportion from the poll approximately
follow a normal distribution?
We can check the conditions from the Central Limit Theorem:
\begin{description}
\item[Independence.] The poll is a simple random sample of
    American adults, which means that the observations are
    independent.
\item[Success-failure condition.] To check this condition,
    we need the population proportion, $p$, to check if both
    $np$ and $n(1-p)$ are greater than 10.
    However, we do not actually know $p$, which
    is exactly why the pollsters would take a sample!
    In cases like these, we often use $\hat{p}$
    as our next best way to check the success-failure condition:
    \begin{align*}
    n\hat{p}
        &= \pewsolarpollsize{} \times \pewsolarpollprop{}
        = \pewsolarpollcount{}
    &n (1 - \hat{p})
        &= \pewsolarpollsize{} \times (1 - \pewsolarpollprop{})
        = \pewsolarpollcountcomplement{}
    \end{align*}
    The sample proportion $\hat{p}$ acts as
    a reasonable substitute for $p$ during this check,
    and each value in this case is well above the minimum of 10.
\end{description}

This \term{substitution approximation} of using $\hat{p}$ in
place of $p$ is also useful when computing the standard error
of the sample proportion:
\begin{align*}
SE_{\hat{p}}
    = \sqrt{\frac{p (1 - p)}{n}}
    \approx \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}
    = \sqrt{\frac{\pewsolarpollprop{}
        (1 - \pewsolarpollprop{})}{\pewsolarpollsize{}}}
    = \pewsolarpollse{}
\end{align*}
This substitution technique is sometimes
referred to as the ``\hiddenterm{plug-in principle}''.
In this case, $SE_{\hat{p}}$ didn't change enough to
be detected using only 3 decimal places
versus when we completed the calculation with
\pewsolarparprop{} earlier.
The computed standard error tends to be reasonably stable
even when observing slightly different proportions in one
sample or another.


\D{\newpage}

\subsection{More details regarding the Central Limit Theorem}

\noindent%
We've applied the Central Limit Theorem in numerous examples
so far in this chapter:
\begin{quote}{\em
When observations are independent and the sample size is
sufficiently large, the distribution of $\hat{p}$ resembles
a normal distribution with
\begin{align*}
  \mu_{\hat{p}} &= p
  &SE_{\hat{p}} &= \sqrt{\frac{p (1 - p)}{n}}
\end{align*}
The sample size is considered sufficiently large
when $n p \geq 10$ and $n (1 - p) \geq 10$.
}\end{quote}
In this section, we'll explore the success-failure
condition and seek to better understand the
Central Limit Theorem.

An interesting question to answer is, \emph{what happens when
$np < 10$ or $n(1-p) < 10$?} As we did in
Section~\ref{simulationForUnderstandingVariabilitySection},
we can simulate drawing samples of different sizes where,
say, the true proportion is $p = 0.25$.
Here's a sample of size~10:
\begin{center}
% paste(sample(c("yes", "no"), 10, TRUE, c(.25, .75)), collapse = ", ")
no, no, yes, yes, no, no, no, no, no, no
\end{center}
In this sample, we observe a sample proportion of yeses
of $\hat{p} = \frac{2}{10} = 0.2$. We can simulate many such
proportions to understand the sampling distribution of
$\hat{p}$ when $n = 10$ and $p = 0.25$, which we've plotted
in Figure~\ref{sampling_10_prop_25p}
alongside a normal distribution with the
same mean and variability.
These distributions have a number of important differences.

\begin{figure}[h]
   \centering
   \Figure[There are two plots. The first plot is a histogram of 10,000 simulations of p-hat when the sample size is n equals 10 and the population proportion is p equals 0.25. The possible values are 0.0, 0.1, 0.2, and so on up to 1.0, though the graph only shows values up to 0.8. The distribution is centered at about 0.25, and is slightly right-skewed. The frequencies are about 500 for 0.0, 1900 for 0.1, 2800 for 0.2, 2400 for 0.3, 1500 for 0.4, 500 for 0.5, 100 for 0.6, and the bin heights for the remaining values have bin heights that are not visually distinguishable from zero. The second plot shows a normal distribution centered at 0.25 with a standard deviation of 0.137. The plot has a vertical line located at 0.0, which makes it more visually evident that a portion of the area under the normal distribution -- about 5\% of this area -- represents values below 0.0.]{0.97}{sampling_10_prop_25p}
   \caption{Left: simulations of $\hat{p}$ when the sample size
       is $n = 10$ and the population proportion is $p = 0.25$.
       Right: a normal distribution with the same mean (0.25)
       and standard deviation (0.137).}
   \label{sampling_10_prop_25p}
\end{figure}

\begin{figure}
  \centering
  \Figures[S
Download .txt
gitextract_luzkuarl/

├── .gitignore
├── LICENSE.md
├── README.md
├── ch_distributions/
│   ├── TeX/
│   │   ├── binomial_distribution.tex
│   │   ├── ch_distributions.tex
│   │   ├── geometric_distribution.tex
│   │   ├── negative_binomial_distribution.tex
│   │   ├── normal_distribution.tex
│   │   ├── poisson_distribution.tex
│   │   └── review_exercises.tex
│   └── figures/
│       ├── 6895997/
│       │   └── 6895997.R
│       ├── amiIncidencesOver100Days/
│       │   └── amiIncidencesOver100Days.R
│       ├── between59And62/
│       │   └── between59And62.R
│       ├── eoce/
│       │   ├── GRE_intro/
│       │   │   └── gre_intro.R
│       │   ├── area_under_curve_1/
│       │   │   └── area_under_curve_1.R
│       │   ├── area_under_curve_2/
│       │   │   └── area_under_curve_2.R
│       │   ├── college_fem_heights/
│       │   │   └── college_fem_heights.R
│       │   └── stats_scores/
│       │       └── stats_scores.R
│       ├── fcidMHeights/
│       │   ├── fcidMHeights-helpers.R
│       │   └── fcidMHeights.R
│       ├── fourBinomialModelsShowingApproxToNormal/
│       │   └── fourBinomialModelsShowingApproxToNormal.R
│       ├── geometricDist35/
│       │   └── geometricDist35.R
│       ├── geometricDist70/
│       │   └── geometricDist70.R
│       ├── height40Perc/
│       │   └── height40Perc.R
│       ├── height82Perc/
│       │   └── height82Perc.R
│       ├── mikeAndJosePercentiles/
│       │   └── mikeAndJosePercentiles.R
│       ├── nbaNormal/
│       │   ├── nbaNormal-helpers.R
│       │   └── nbaNormal.R
│       ├── normApproxToBinomFail/
│       │   └── normApproxToBinomFail.R
│       ├── normalExamples/
│       │   ├── normalExamples-helpers.R
│       │   └── normalExamples.R
│       ├── normalQuantileExer/
│       │   ├── QQNorm.R
│       │   ├── normalQuantileExer-data.R
│       │   ├── normalQuantileExer.R
│       │   └── normalQuantileExerAdditional.R
│       ├── normalTails/
│       │   └── normalTails.R
│       ├── pokerNormal/
│       │   └── pokerNormal.R
│       ├── satAbove1190/
│       │   └── satAbove1190.R
│       ├── satActNormals/
│       │   └── satActNormals.R
│       ├── satBelow1030/
│       │   └── satBelow1030.R
│       ├── satBelow1300/
│       │   └── satBelow1300.R
│       ├── simpleNormal/
│       │   └── simpleNormal.R
│       ├── smallNormalTails/
│       │   └── smallNormalTails.R
│       ├── standardNormal/
│       │   └── standardNormal.R
│       ├── subtracting2Areas/
│       │   └── subtracting2Areas.R
│       ├── subtractingArea/
│       │   └── subtractingArea.R
│       ├── twoSampleNormals/
│       │   └── twoSampleNormals.R
│       └── twoSampleNormalsStacked/
│           └── twoSampleNormalsStacked.R
├── ch_foundations_for_inf/
│   ├── TeX/
│   │   ├── ch_foundations_for_inf.tex
│   │   ├── confidence_intervals.tex
│   │   ├── hypothesis_testing.tex
│   │   ├── one_sided_tests.tex
│   │   ├── review_exercises.tex
│   │   └── variability_in_estimates.tex
│   └── figures/
│       ├── 95PercentConfidenceInterval/
│       │   └── 95PercentConfidenceInterval.R
│       ├── ARCHIVE/
│       │   └── sampling_10k_prop_56p/
│       │       └── sampling_10k_prop_56p.R
│       ├── arrayOfFigureAreasForChiSquareDistribution/
│       │   ├── chiSquareAreaAbove10WithDF4/
│       │   │   └── chiSquareAreaAbove10WithDF4.R
│       │   ├── chiSquareAreaAbove11Point7WithDF7/
│       │   │   └── chiSquareAreaAbove11Point7WithDF7.R
│       │   ├── chiSquareAreaAbove4Point3WithDF2/
│       │   │   └── chiSquareAreaAbove4WithDF2.R
│       │   ├── chiSquareAreaAbove5Point1WithDF5/
│       │   │   └── chiSquareAreaAbove5Point1WithDF5.R
│       │   ├── chiSquareAreaAbove6Point25WithDF3/
│       │   │   └── chiSquareAreaAbove6Point25WithDF3.R
│       │   └── chiSquareAreaAbove9Point21WithDF3/
│       │       └── chiSquareAreaAbove9Point21WithDF3.R
│       ├── bladesTwoSampleHTPValueQC/
│       │   └── bladesTwoSampleHTPValueQC.R
│       ├── business_one_sided_20_21-p_value/
│       │   └── business_one_sided_20_21-p_value.R
│       ├── chiSquareDistributionWithInceasingDF/
│       │   └── chiSquareDistributionWithInceasingDF.R
│       ├── choosingZForCI/
│       │   └── choosingZForCI.R
│       ├── clt_prop_grid/
│       │   └── clt_prop_grid.R
│       ├── communityCollegeClaimedHousingExpenseDistribution/
│       │   └── communityCollegeClaimedHousingExpenseDistribution.R
│       ├── eoce/
│       │   ├── adult_heights/
│       │   │   └── adult_heights.R
│       │   ├── age_at_first_marriage_intro/
│       │   │   └── age_at_first_marriage_intro.R
│       │   ├── assisted_reproduction_one_sample_randomization/
│       │   │   └── assisted_reproduction_one_sample_randomization.R
│       │   ├── cflbs/
│       │   │   └── cflbs.R
│       │   ├── college_credits/
│       │   │   └── college_credits.R
│       │   ├── egypt_revolution_one_sample_randomization/
│       │   │   └── egypt_revolution_one_sample_randomization.R
│       │   ├── exclusive_relationships/
│       │   │   ├── exclusive_relationships.R
│       │   │   └── survey.csv
│       │   ├── gifted_children_ht/
│       │   │   └── gifted_children_ht.R
│       │   ├── gifted_children_intro/
│       │   │   └── gifted_children_intro.R
│       │   ├── identify_dist_ls_pop/
│       │   │   └── identify_dist_ls_pop.R
│       │   ├── identify_dist_symm_pop/
│       │   │   └── identify_dist_symm_pop.R
│       │   ├── pennies_ages/
│       │   │   ├── penniesAges.Rda
│       │   │   └── pennies_ages.R
│       │   ├── penny_weights/
│       │   │   └── penny_weights.R
│       │   ├── social_experiment_two_sample_randomization/
│       │   │   └── social_experiment_two_sample_randomization.R
│       │   ├── songs_on_ipod/
│       │   │   └── songs_on_ipod.R
│       │   ├── thanksgiving_spending_intro/
│       │   │   └── thanksgiving_spending_intro.R
│       │   └── yawning_two_sample_randomization/
│       │       └── yawning_two_sample_randomization.R
│       ├── geomFitEvaluationForSP500For1990To2011/
│       │   └── geomFitEvaluationForSP500For1990To2011.R
│       ├── geomFitPValueForSP500For1990To2011/
│       │   └── geomFitPValueForSP500For1990To2011.R
│       ├── googleHTForDiffAlgPerformancePValue/
│       │   └── googleHTForDiffAlgPerformancePValue.R
│       ├── helpers.R
│       ├── jurorHTPValueShown/
│       │   └── jurorHTPValueShown.R
│       ├── mammograms/
│       │   └── mammograms.R
│       ├── normal_dist_mean_500_se_016/
│       │   └── normal_dist_mean_500_se_016.R
│       ├── nuclearArmsReduction/
│       │   └── nuclearArmsReduction.R
│       ├── p-hat_from_53_and_59-not-used/
│       │   └── p-hat_from_53_and_59.R
│       ├── p-hat_from_53_and_59_computation/
│       │   ├── NormTailsCalc.R
│       │   └── p-hat_from_53_and_59_computation.R
│       ├── p-hat_from_867_and_907-not-used/
│       │   └── p-hat_from_867_and_907.R
│       ├── p-hat_from_86_and_90/
│       │   └── p-hat_from_86_and_90.R
│       ├── quadcopter/
│       │   └── quadcopter_attribution.txt
│       ├── sampling_100_prop_X/
│       │   └── sampling_100_prop_X.R
│       ├── sampling_10_prop_25p/
│       │   ├── sampling_10_prop_25p - one figure.R
│       │   └── sampling_10_prop_25p.R
│       ├── sampling_10k_prop_887p/
│       │   └── sampling_10k_prop_887p.R
│       ├── sampling_10k_prop_88p/
│       │   └── sampling_10k_prop_88p.R
│       ├── sampling_5k_prop_50p/
│       │   └── sampling_5k_prop_50p.R
│       ├── sampling_X_prop_56p/
│       │   └── sampling_X_prop_56p.R
│       ├── sulphStudyFindPValueUsingNormalApprox/
│       │   └── sulphStudyFindPValueUsingNormalApprox.R
│       └── whyWeWantPValue/
│           └── whyWeWantPValue.R
├── ch_inference_for_means/
│   ├── TeX/
│   │   ├── ch_inference_for_means.tex
│   │   ├── comparing_many_means_with_anova.tex
│   │   ├── difference_of_two_means.tex
│   │   ├── one-sample_means_with_the_t-distribution.tex
│   │   ├── paired_data.tex
│   │   ├── power_calculations_for_a_difference_of_means.tex
│   │   └── review_exercises.tex
│   └── figures/
│       ├── babySmokePlotOfTwoGroupsToExamineSkew/
│       │   └── babySmokePlotOfTwoGroupsToExamineSkew.R
│       ├── cbrRunTimesMenWomen/
│       │   └── cbrRunTimesMenWomen.R
│       ├── classData/
│       │   └── classData.R
│       ├── distOfDiffOfSampleMeansForBWOfBabySmokeData/
│       │   └── distOfDiffOfSampleMeansForBWOfBabySmokeData.R
│       ├── eoce/
│       │   ├── adult_heights/
│       │   │   └── adult_heights.R
│       │   ├── age_at_first_marriage_intro/
│       │   │   └── age_at_first_marriage_intro.R
│       │   ├── anova_exercise_1/
│       │   │   └── anova_exercise_1.R
│       │   ├── chick_wts_anova/
│       │   │   └── chick_wts.R
│       │   ├── chick_wts_linseed_horsebean/
│       │   │   └── chick_wts.R
│       │   ├── child_care_hours/
│       │   │   ├── child_care_hours.R
│       │   │   └── china.csv
│       │   ├── cleveland_sacramento/
│       │   │   └── cleveland_sacramento.R
│       │   ├── college_credits/
│       │   │   └── college_credits.R
│       │   ├── diamonds_1/
│       │   │   └── diamonds.R
│       │   ├── exclusive_relationships/
│       │   │   ├── exclusive_relationships.R
│       │   │   └── survey.csv
│       │   ├── friday_13th_accident/
│       │   │   └── friday_13th_accident.R
│       │   ├── friday_13th_traffic/
│       │   │   └── friday_13th_traffic.R
│       │   ├── fuel_eff_city/
│       │   │   ├── fuel_eff.csv
│       │   │   └── fuel_eff_city.R
│       │   ├── fuel_eff_hway/
│       │   │   ├── fuel_eff.csv
│       │   │   └── fuel_eff_hway.R
│       │   ├── gifted_children/
│       │   │   └── gifted_children.R
│       │   ├── gifted_children_ht/
│       │   │   └── gifted_children_ht.R
│       │   ├── gifted_children_intro/
│       │   │   └── gifted_children_intro.R
│       │   ├── global_warming_v2_1/
│       │   │   └── global_warming_v2_1.R
│       │   ├── gpa_major/
│       │   │   ├── gpa_major.R
│       │   │   └── survey.csv
│       │   ├── hs_beyond_1/
│       │   │   └── hs_beyond.R
│       │   ├── oscar_winners/
│       │   │   └── oscar_winners.R
│       │   ├── prison_isolation_T/
│       │   │   ├── prison_isolation.R
│       │   │   └── prison_isolation.csv
│       │   ├── prius_fuel_efficiency/
│       │   │   └── prius_fuel_efficiency.R
│       │   ├── prius_fuel_efficiency_update/
│       │   │   └── prius_fuel_efficiency.R
│       │   ├── t_distribution/
│       │   │   └── t_distribution.R
│       │   ├── torque_on_rusty_bolt/
│       │   │   ├── torque_on_rusty_bolt (Autosaved).R
│       │   │   └── torque_on_rusty_bolt.R
│       │   └── work_hours_education/
│       │       ├── gss2010.Rda
│       │       └── work_hours_education.R
│       ├── fDist2And423/
│       │   └── fDist2And423.R
│       ├── fDist3And323/
│       │   └── fDist3And323.R
│       ├── mlbANOVA/
│       │   └── mlbANOVA.R
│       ├── outliers_and_ss_condition/
│       │   └── outliers_and_ss_condition.R
│       ├── pValueOfTwoTailAreaOfExamVersionsWhereDFIs26/
│       │   └── pValueOfTwoTailAreaOfExamVersionsWhereDFIs26.R
│       ├── pValueShownForSATHTOfOver100PtGain/
│       │   └── pValueShownForSATHTOfOver100PtGain.R
│       ├── power_best_sample_size/
│       │   └── power_best_sample_size.R
│       ├── power_curve/
│       │   └── power_curve.R
│       ├── power_null_0_0-76/
│       │   └── power_null_0_0-76.R
│       ├── power_null_0_1-7/
│       │   └── power_null_0_1-7.R
│       ├── rissosDolphin/
│       │   └── ReadMe.txt
│       ├── run10SampTimeHistogram/
│       │   └── run10SampTimeHistogram.R
│       ├── satImprovementHTDataHistogram/
│       │   └── satImprovementHTDataHistogram.R
│       ├── stemCellTherapyForHearts/
│       │   └── stemCellTherapyForHearts.R
│       ├── stemCellTherapyForHeartsPValue/
│       │   └── stemCellTherapyForHeartsPValue.R
│       ├── tDistAppendixTwoEx/
│       │   └── tDistAppendixTwoEx.R
│       ├── tDistCompareToNormalDist/
│       │   └── tDistCompareToNormalDist.R
│       ├── tDistConvergeToNormalDist/
│       │   └── tDistConvergeToNormalDist.R
│       ├── tDistDF18LeftTail2Point10/
│       │   └── tDistDF18LeftTail2Point10.R
│       ├── tDistDF20RightTail1Point65/
│       │   └── tDistDF20RightTail1Point65.R
│       ├── textbooksF18/
│       │   ├── diffInTextbookPricesF18.R
│       │   └── textbooksF18HTTails.R
│       ├── textbooksS10/
│       │   ├── diffInTextbookPricesS10.R
│       │   └── textbooksS10HTTails.R
│       ├── textbooks_scatter/
│       │   └── textbooks_scatter.R
│       └── toyANOVA/
│           └── toyANOVA.R
├── ch_inference_for_props/
│   ├── TeX/
│   │   ├── ch_inference_for_props.tex
│   │   ├── difference_of_two_proportions.tex
│   │   ├── inference_for_a_single_proportion.tex
│   │   ├── review_exercises.tex
│   │   ├── testing_for_goodness_of_fit_using_chi-square.tex
│   │   └── testing_for_independence_in_two-way_tables.tex
│   └── figures/
│       ├── arrayOfFigureAreasForChiSquareDistribution/
│       │   ├── chiSquareAreaAbove10WithDF4/
│       │   │   └── chiSquareAreaAbove10WithDF4.R
│       │   ├── chiSquareAreaAbove11Point7WithDF7/
│       │   │   └── chiSquareAreaAbove11Point7WithDF7.R
│       │   ├── chiSquareAreaAbove4Point3WithDF2/
│       │   │   └── chiSquareAreaAbove4WithDF2.R
│       │   ├── chiSquareAreaAbove5Point1WithDF5/
│       │   │   └── chiSquareAreaAbove5Point1WithDF5.R
│       │   ├── chiSquareAreaAbove6Point25WithDF3/
│       │   │   └── chiSquareAreaAbove6Point25WithDF3.R
│       │   └── chiSquareAreaAbove9Point21WithDF3/
│       │       └── chiSquareAreaAbove9Point21WithDF3.R
│       ├── bladesTwoSampleHTPValueQC/
│       │   └── bladesTwoSampleHTPValueQC.R
│       ├── chiSquareDistributionWithInceasingDF/
│       │   └── chiSquareDistributionWithInceasingDF.R
│       ├── eoce/
│       │   ├── assisted_reproduction_one_sample_randomization/
│       │   │   └── assisted_reproduction_one_sample_randomization.R
│       │   ├── egypt_revolution_one_sample_randomization/
│       │   │   └── egypt_revolution_one_sample_randomization.R
│       │   ├── social_experiment_two_sample_randomization/
│       │   │   └── social_experiment_two_sample_randomization.R
│       │   └── yawning_two_sample_randomization/
│       │       └── yawning_two_sample_randomization.R
│       ├── geomFitEvaluationForSP500/
│       │   ├── geomFitEvaluationForSP500.R
│       │   └── sp500_1950_2018.csv
│       ├── geomFitPValueForSP500/
│       │   └── geomFitPValueForSP500.R
│       ├── iPodChiSqTail/
│       │   └── iPodChiSqTail.R
│       ├── jurorHTPValueShown/
│       │   └── jurorHTPValueShown.R
│       ├── mammograms/
│       │   └── mammograms.R
│       ├── paydayCC_norm_pvalue/
│       │   └── paydayCC_norm_pvalue.R
│       └── quadcopter/
│           └── quadcopter_attribution.txt
├── ch_intro_to_data/
│   ├── TeX/
│   │   ├── case_study_using_stents_to_prevent_strokes.tex
│   │   ├── ch_intro_to_data.tex
│   │   ├── data_basics.tex
│   │   ├── experiments.tex
│   │   ├── review_exercises.tex
│   │   └── sampling_principles_and_strategies.tex
│   └── figures/
│       ├── county_fed_spendVsPoverty/
│       │   └── county_fed_spendVsPoverty.R
│       ├── eoce/
│       │   ├── air_quality_durham/
│       │   │   ├── air_quality_durham.R
│       │   │   └── pm25_2011_durham.csv
│       │   ├── airports/
│       │   │   ├── airports.R
│       │   │   └── data/
│       │   │       └── cb_2013_us_state_20m/
│       │   │           ├── cb_2013_us_state_20m.dbf
│       │   │           ├── cb_2013_us_state_20m.prj
│       │   │           ├── cb_2013_us_state_20m.shp
│       │   │           ├── cb_2013_us_state_20m.shp.iso.xml
│       │   │           ├── cb_2013_us_state_20m.shp.xml
│       │   │           ├── cb_2013_us_state_20m.shx
│       │   │           └── state_20m.ea.iso.xml
│       │   ├── antibiotic_use_children/
│       │   │   └── antibiotic_use_children.R
│       │   ├── association_plots/
│       │   │   └── association_plots.R
│       │   ├── cleveland_sacramento/
│       │   │   └── cleveland_sacramento.R
│       │   ├── county_commute_times/
│       │   │   ├── countyMap.R
│       │   │   └── county_commute_times.R
│       │   ├── county_hispanic_pop/
│       │   │   ├── countyMap.R
│       │   │   └── county_hispanic_pop.R
│       │   ├── county_income_education/
│       │   │   └── county_income_education.R
│       │   ├── dream_act_mosaic/
│       │   │   └── dream_act_mosaic.R
│       │   ├── estimate_mean_median_simple/
│       │   │   └── estimate_mean_median_simple.R
│       │   ├── gpa_study_hours/
│       │   │   ├── gpa_study_hours.R
│       │   │   ├── gpa_study_hours.csv
│       │   │   └── gpa_study_hours.rda
│       │   ├── hist_box_match/
│       │   │   └── hist_box_match.R
│       │   ├── hist_vs_box/
│       │   │   └── hist_vs_box.R
│       │   ├── income_coffee_shop/
│       │   │   └── income_coffee_shop.R
│       │   ├── infant_mortality_rel_freq/
│       │   │   ├── factbook.rda
│       │   │   └── infant_mortality.R
│       │   ├── internet_life_expactancy/
│       │   │   ├── factbook.rda
│       │   │   └── internet_life_expactancy.R
│       │   ├── internet_life_expectancy/
│       │   │   ├── factbook.rda
│       │   │   └── internet_life_expectancy.R
│       │   ├── mammal_life_spans/
│       │   │   └── mammal_life_spans.R
│       │   ├── marathon_winners/
│       │   │   └── marathon_winners.R
│       │   ├── office_productivity/
│       │   │   └── office_productivity.R
│       │   ├── oscar_winners/
│       │   │   └── oscar_winners.R
│       │   ├── raise_taxes_mosaic/
│       │   │   └── raise_taxes_mosaic.R
│       │   ├── randomization_avandia/
│       │   │   └── randomization_avandia.R
│       │   ├── randomization_heart_transplants/
│       │   │   ├── inference.RData
│       │   │   └── randomization_heart_transplants.R
│       │   ├── reproducing_bacteria/
│       │   │   └── reproducing_bacteria.R
│       │   ├── seattle_pet_names/
│       │   │   └── seattle_pet_names.R
│       │   ├── stats_scores_box/
│       │   │   └── stats_scores_box.R
│       │   └── unvotes/
│       │       └── unvotes.R
│       ├── expResp/
│       │   └── expResp.R
│       ├── figureShowingBlocking/
│       │   └── figureShowingBlocking.R
│       ├── interest_rate_vs_income/
│       │   └── interest_rate_vs_loan_amount.R
│       ├── interest_rate_vs_loan_amount/
│       │   └── interest_rate_vs_loan_amount.R
│       ├── interest_rate_vs_loan_income_ratio/
│       │   └── interest_rate_vs_loan_income_ratio.R
│       ├── loan_amount_vs_income/
│       │   └── loan_amount_vs_income.R
│       ├── mnWinter/
│       │   └── ReadMe.txt
│       ├── multiunitsVsOwnership/
│       │   └── multiunitsVsOwnership.R
│       ├── popToSample/
│       │   ├── popToSampleGraduates.R
│       │   ├── popToSubSampleGraduates.R
│       │   └── surveySample.R
│       ├── pop_change_v_med_income/
│       │   └── pop_change_v_med_income.R
│       ├── pop_change_v_per_capita_income/
│       │   └── pop_change_v_per_capita_income.R
│       ├── samplingMethodsFigure/
│       │   ├── SamplingMethodsFunctions.R
│       │   ├── samplingMethodsFigure.R
│       │   └── samplingMethodsFigures.R
│       └── variables/
│           ├── sunCausesCancer.R
│           └── variables.R
├── ch_probability/
│   ├── TeX/
│   │   ├── ch_probability.tex
│   │   ├── conditional_probability.tex
│   │   ├── continuous_distributions.tex
│   │   ├── defining_probability.tex
│   │   ├── random_variables.tex
│   │   ├── review_exercises.tex
│   │   └── sampling_from_a_small_population.tex
│   └── figures/
│       ├── BreastCancerTreeDiagram/
│       │   ├── BreastCancerTreeDiagram.R
│       │   └── Mammogram Research.txt
│       ├── bookCostDist/
│       │   └── bookCostDist.R
│       ├── bookWts/
│       │   └── bookWts.R
│       ├── cardsDiamondFaceVenn/
│       │   └── cardsDiamondFaceVenn.R
│       ├── changeInLeonardsStockPortfolioFor36Months/
│       │   └── changeinleonardsstockportfoliofor36months.R
│       ├── complementOfD/
│       │   └── complementOfD.R
│       ├── contBalance/
│       │   └── contBalance.R
│       ├── diceSumDist/
│       │   └── diceSumDist.R
│       ├── dieProp/
│       │   └── dieProp.R
│       ├── disjointSets/
│       │   └── disjointSets.R
│       ├── eoce/
│       │   ├── cat_weights/
│       │   │   └── cat_weights.R
│       │   ├── poverty_language/
│       │   │   ├── poverty_language.R
│       │   │   └── poverty_language.tiff
│       │   ├── swing_voters/
│       │   │   ├── swing_voters.R
│       │   │   └── swing_voters.tiff
│       │   ├── tree_drawing_box_plots/
│       │   │   └── tree_drawing_box_plots.R
│       │   ├── tree_exit_poll/
│       │   │   └── tree_exit_poll.R
│       │   ├── tree_hiv_swaziland/
│       │   │   └── tree_hiv_swaziland.R
│       │   ├── tree_lupus/
│       │   │   └── tree_lupus.R
│       │   ├── tree_thrombosis/
│       │   │   └── tree_thrombosis.R
│       │   └── tree_twins/
│       │       └── tree_twins.R
│       ├── fdicHeightContDist/
│       │   └── fdicHeightContDist.R
│       ├── fdicHeightContDistFilled/
│       │   └── fdicHeightContDistFilled.R
│       ├── fdicHistograms/
│       │   ├── fdicHistograms.R
│       │   └── fdicHistograms.rda
│       ├── indepForRollingTwo1s/
│       │   └── indepForRollingTwo1s.R
│       ├── loans_app_type_home_venn/
│       │   └── loans_app_type_home_venn.R
│       ├── photoClassifyVenn/
│       │   └── photoClassifyVenn.R
│       ├── smallpoxTreeDiagram/
│       │   └── smallpoxTreeDiagram.R
│       ├── testTree/
│       │   └── testTree.R
│       ├── treeDiagramAndPass/
│       │   └── treeDiagramAndPass.R
│       ├── treeDiagramGarage/
│       │   └── treeDiagramGarage.R
│       ├── usHeightsHist180185/
│       │   └── usHeightsHist180185.R
│       └── usHouseholdIncomeDistBar/
│           └── usHouseholdIncomeDistBar.R
├── ch_regr_mult_and_log/
│   ├── TeX/
│   │   ├── ch_regr_mult_and_log.tex
│   │   ├── checking_model_assumptions_using_graphs.tex
│   │   ├── introduction_to_logistic_regression.tex
│   │   ├── introduction_to_multiple_regression.tex
│   │   ├── model_selection.tex
│   │   ├── mult_regr_case_study.tex
│   │   └── review_exercises.tex
│   └── figures/
│       ├── eoce/
│       │   ├── absent_from_school_mlr/
│       │   │   └── absent_from_school_mlr.R
│       │   ├── absent_from_school_model_select_backward/
│       │   │   └── absent_from_school_model_select_backward.R
│       │   ├── absent_from_school_model_select_forward/
│       │   │   └── absent_from_school_model_select_forward.R
│       │   ├── baby_weights_conds/
│       │   │   ├── babies.csv
│       │   │   └── baby_weights_conds.R
│       │   ├── baby_weights_mlr/
│       │   │   ├── babies.csv
│       │   │   └── baby_weights_mlr.R
│       │   ├── baby_weights_model_select_backward/
│       │   │   ├── babies.csv
│       │   │   └── baby_weights_model_select_backward.R
│       │   ├── baby_weights_model_select_forward/
│       │   │   ├── babies.csv
│       │   │   └── baby_weights_model_select_backward.R
│       │   ├── baby_weights_parity/
│       │   │   ├── babies.csv
│       │   │   └── baby_weights_parity.R
│       │   ├── baby_weights_smoke/
│       │   │   ├── babies.csv
│       │   │   └── baby_weights_smoke.R
│       │   ├── challenger_disaster_predict/
│       │   │   ├── challenger_disaster_predict.R
│       │   │   └── orings.rda
│       │   ├── gpa/
│       │   │   ├── gpa.R
│       │   │   └── gpa_survey.csv
│       │   ├── gpa_iq_conds/
│       │   │   ├── gpa_iq.csv
│       │   │   └── gpa_iq_conds.R
│       │   ├── log_regr_ex/
│       │   │   └── log_regr_ex.R
│       │   ├── movie_returns_altogether/
│       │   │   ├── horror_movies_conds.R
│       │   │   └── movie_profit.csv
│       │   ├── movie_returns_by_genre/
│       │   │   ├── horror_movies_conds.R
│       │   │   └── movie_profit.csv
│       │   ├── possum_classification_model_select/
│       │   │   └── possum_classification_model_select.R
│       │   ├── spam_filtering_model_sel/
│       │   │   └── spam_filtering_model_sel.R
│       │   └── spam_filtering_predict/
│       │       └── spam_filtering_predict.R
│       ├── loansDiagnostics/
│       │   └── loans_analysis.R
│       ├── loansSingles/
│       │   ├── intRateVsPastBankrScatter.R
│       │   └── intRateVsVerIncomeScatter.R
│       ├── logisticModel/
│       │   └── logisticModel.R
│       ├── logitTransformationFigureHoriz/
│       │   └── logitTransformationFigureHoriz.R
│       ├── marioKartDiagnostics/
│       │   └── marioKartAnalysis.R
│       └── marioKartSingle/
│           └── marioKartSingle.R
├── ch_regr_simple_linear/
│   ├── TeX/
│   │   ├── ch_regr_simple_linear.tex
│   │   ├── fitting_a_line_by_least_squares_regression.tex
│   │   ├── inference_for_linear_regression.tex
│   │   ├── line_fitting_residuals_and_correlation.tex
│   │   ├── review_exercises.tex
│   │   └── types_of_outliers_in_linear_regression.tex
│   └── figures/
│       ├── brushtail_possum/
│       │   └── ReadMe.txt
│       ├── elmhurstPlots/
│       │   └── elmhurstScatterW2Lines.R
│       ├── eoce/
│       │   ├── beer_blood_alcohol_inf/
│       │   │   ├── beer_blood_alcohol.txt
│       │   │   └── beer_blood_alcohol_inf.R
│       │   ├── body_measurements_hip_weight_corr_units/
│       │   │   └── body_measurements_hip_weight.R
│       │   ├── body_measurements_shoulder_height_corr_units/
│       │   │   └── body_measurements_shoulder_height.R
│       │   ├── body_measurements_weight_height_inf/
│       │   │   └── body_measurements_weight_height_inf.R
│       │   ├── cat_body_heart_reg/
│       │   │   └── cat_body_heart_reg.R
│       │   ├── coast_starlight_corr_units/
│       │   │   ├── coast_starlight.R
│       │   │   └── coast_starlight.txt
│       │   ├── crawling_babies_corr_units/
│       │   │   ├── crawling_babies.R
│       │   │   └── crawling_babies.csv
│       │   ├── exams_grades_correlation/
│       │   │   ├── exam_grades.txt
│       │   │   └── exams_grades_correlation.R
│       │   ├── full_lin_regr_1/
│       │   │   ├── prof_evals_beauty.csv
│       │   │   └── rate_my_prof.R
│       │   ├── full_lin_regr_2/
│       │   │   ├── prof_evals_beauty.csv
│       │   │   └── rate_my_prof.R
│       │   ├── helmet_lunch/
│       │   │   └── helmet_lunch.R
│       │   ├── husbands_wives_age_inf/
│       │   │   ├── husbands_wives.txt
│       │   │   └── husbands_wives_age_inf.R
│       │   ├── husbands_wives_correlation/
│       │   │   ├── husbands_wives.txt
│       │   │   └── husbands_wives_correlation.R
│       │   ├── husbands_wives_height_inf/
│       │   │   ├── husbands_wives.txt
│       │   │   └── husbands_wives_height_inf.R
│       │   ├── husbands_wives_height_inf_2s/
│       │   │   ├── husbands_wives.txt
│       │   │   └── husbands_wives_height_inf_2s.R
│       │   ├── identify_relationships_1/
│       │   │   └── identify_relationships_1.R
│       │   ├── identify_relationships_2/
│       │   │   └── identify_relationships_2.R
│       │   ├── match_corr_1/
│       │   │   └── match_corr_1.R
│       │   ├── match_corr_2/
│       │   │   └── match_corr_2.R
│       │   ├── match_corr_3/
│       │   │   ├── match_corr_2.R
│       │   │   └── match_corr_3.R
│       │   ├── murders_poverty_reg/
│       │   │   ├── murders.csv
│       │   │   └── murders_poverty.R
│       │   ├── outliers_1/
│       │   │   └── outliers_1.R
│       │   ├── outliers_2/
│       │   │   └── outliers_2.R
│       │   ├── rate_my_prof/
│       │   │   ├── prof_evals_beauty.csv
│       │   │   └── rate_my_prof.R
│       │   ├── speed_height_gender/
│       │   │   ├── speed_height_gender.R
│       │   │   └── speed_survey.csv
│       │   ├── starbucks_cals_carbos/
│       │   │   ├── starbucks.csv
│       │   │   └── starbucks_cals_carbos.R
│       │   ├── starbucks_cals_protein/
│       │   │   ├── starbucks.csv
│       │   │   └── starbucks_cals_protein.R
│       │   ├── tourism_spending_reg_conds/
│       │   │   ├── tourism_spending.csv
│       │   │   └── tourism_spending_reg_cond.R
│       │   ├── trees_volume_height_diameter/
│       │   │   └── trees_volume_height_diameter.R
│       │   ├── trends_in_residuals/
│       │   │   └── trends_in_residuals.R
│       │   ├── urban_homeowners_cond/
│       │   │   ├── urban_homeowners_cond.R
│       │   │   └── urban_state_data.csv
│       │   ├── urban_homeowners_outlier/
│       │   │   ├── urban_homeowners_outlier.R
│       │   │   └── urban_state_data.csv
│       │   └── visualize_residuals/
│       │       └── visualize_residuals.R
│       ├── identifyingInfluentialPoints/
│       │   └── identifyingInfluentialPoints.R
│       ├── imperfLinearModel/
│       │   └── imperfLinearModel.R
│       ├── marioKartNewUsed/
│       │   └── marioKartNewUsed.R
│       ├── notGoodAtAllForALinearModel/
│       │   └── notGoodAtAllForALinearModel.R
│       ├── outlierPlots/
│       │   └── outlierPlots.R
│       ├── pValueMidtermUnemp/
│       │   └── pValueMidtermUnemp.R
│       ├── perfLinearModel/
│       │   └── perfLinearModel.R
│       ├── posNegCorPlots/
│       │   ├── CorrelationPlot.R
│       │   ├── corForNonLinearPlots.R
│       │   └── posNegCorPlots.R
│       ├── sampleLinesAndResPlots/
│       │   └── sampleLinesAndResPlots.R
│       ├── scattHeadLTotalL/
│       │   └── scattHeadLTotalL.R
│       ├── scattHeadLTotalLLine/
│       │   └── scattHeadLTotalLLine.R
│       ├── scattHeadLTotalLResidualPlot/
│       │   └── scattHeadLTotalLResidualPlot.R
│       ├── scattHeadLTotalLSex/
│       │   └── scattHeadLTotalLSex.R
│       ├── scattHeadLTotalLTube/
│       │   └── scattHeadLTotalLTube.R
│       ├── unemploymentAndChangeInHouse/
│       │   └── unemploymentAndChangeInHouse.R
│       └── whatCanGoWrongWithLinearModel/
│           ├── makeTubeAdv.R
│           └── whatCanGoWrongWithLinearModel.R
├── ch_summarizing_data/
│   ├── TeX/
│   │   ├── case_study_malaria_vaccine.tex
│   │   ├── ch_summarizing_data.tex
│   │   ├── considering_categorical_data.tex
│   │   ├── examining_numerical_data.tex
│   │   └── review_exercises.tex
│   └── figures/
│       ├── boxPlotLayoutNumVar/
│       │   └── boxPlotLayoutNumVar.R
│       ├── carsPriceVsWeight/
│       │   └── carsPriceVsWeight.R
│       ├── countyIncomeSplitByPopGain/
│       │   └── countyIncomeSplitByPopGain.R
│       ├── countyIntensityMaps/
│       │   ├── countyIntensityMaps.R
│       │   └── countyMap.R
│       ├── county_pop_change_v_pop_transform/
│       │   └── county_pop_change_v_pop_transform.R
│       ├── county_pop_transformed/
│       │   └── county_pop_transformed.R
│       ├── discRandDotPlot/
│       │   └── discRandDotPlot.R
│       ├── email50LinesCharacters/
│       │   └── email50LinesCharacters.R
│       ├── email50LinesCharactersMod/
│       │   └── email50LinesCharactersMod.R
│       ├── email50NumCharDotPlotRobustEx/
│       │   └── email50NumCharDotPlotRobustEx.R
│       ├── email50NumCharHist/
│       │   └── email50NumCharHist.R
│       ├── emailCharactersDotPlot/
│       │   └── emailCharactersDotPlot.R
│       ├── emailNumberBarPlot/
│       │   └── emailNumberBarPlot.R
│       ├── emailNumberPieChart/
│       │   └── emailNumberPieChart.R
│       ├── emailSpamNumberMosaicPlot/
│       │   └── emailSpamNumberMosaicPlot.R
│       ├── emailSpamNumberSegBar/
│       │   └── emailSpamNumberSegBar.R
│       ├── eoce/
│       │   ├── air_quality_durham/
│       │   │   ├── air_quality_durham.R
│       │   │   └── pm25_2011_durham.csv
│       │   ├── antibiotic_use_children/
│       │   │   └── antibiotic_use_children.R
│       │   ├── association_plots/
│       │   │   └── association_plots.R
│       │   ├── cleveland_sacramento/
│       │   │   └── cleveland_sacramento.R
│       │   ├── county_commute_times/
│       │   │   ├── countyMap.R
│       │   │   └── county_commute_times.R
│       │   ├── county_hispanic_pop/
│       │   │   ├── countyMap.R
│       │   │   └── county_hispanic_pop.R
│       │   ├── dream_act_mosaic/
│       │   │   └── dream_act_mosaic.R
│       │   ├── estimate_mean_median_simple/
│       │   │   └── estimate_mean_median_simple.R
│       │   ├── hist_box_match/
│       │   │   └── hist_box_match.R
│       │   ├── hist_vs_box/
│       │   │   └── hist_vs_box.R
│       │   ├── income_coffee_shop/
│       │   │   └── income_coffee_shop.R
│       │   ├── infant_mortality_rel_freq/
│       │   │   ├── factbook.rda
│       │   │   └── infant_mortality.R
│       │   ├── mammal_life_spans/
│       │   │   └── mammal_life_spans.R
│       │   ├── marathon_winners/
│       │   │   └── marathon_winners.R
│       │   ├── office_productivity/
│       │   │   └── office_productivity.R
│       │   ├── oscar_winners/
│       │   │   └── oscar_winners.R
│       │   ├── raise_taxes_mosaic/
│       │   │   └── raise_taxes_mosaic.R
│       │   ├── randomization_avandia/
│       │   │   └── randomization_avandia.R
│       │   ├── randomization_heart_transplants/
│       │   │   ├── inference.RData
│       │   │   └── randomization_heart_transplants.R
│       │   ├── reproducing_bacteria/
│       │   │   └── reproducing_bacteria.R
│       │   └── stats_scores_box/
│       │       └── stats_scores_box.R
│       ├── histMLBSalaries/
│       │   └── histMLBSalaries.R
│       ├── loan50IncomeHist/
│       │   └── loan50IncomeHist.R
│       ├── loan50IntRateHist/
│       │   └── loan50IntRateHist.R
│       ├── loan50LoanAmountHist/
│       │   └── loan50LoanAmountHist.R
│       ├── loan50_amt_vs_income/
│       │   └── loan50_amt_vs_income.R
│       ├── loan50_amt_vs_interest/
│       │   └── loan50_amt_vs_interest.R
│       ├── loan_amount_dot_plot/
│       │   └── loan_amount_dot_plot.R
│       ├── loan_app_type_home_mosaic_plot/
│       │   └── loan_app_type_home_mosaic_plot.R
│       ├── loan_app_type_home_seg_bar/
│       │   └── loan_app_type_home_seg_bar.R
│       ├── loan_homeownership_bar_plot/
│       │   └── loan_homeownership_bar_plot.R
│       ├── loan_homeownership_pie_chart/
│       │   └── loan_homeownership_pie_chart.R
│       ├── loan_int_rate_box_plot_layout/
│       │   └── loan_int_rate_box_plot_layout.R
│       ├── loan_int_rate_dot_plot/
│       │   └── loan_int_rate_dot_plot.R
│       ├── loan_int_rate_robust_ex/
│       │   └── loan_int_rate_robust_ex.R
│       ├── malaria_rand_dot_plot/
│       │   └── malaria_rand_dot_plot.R
│       ├── medianHHIncomePoverty/
│       │   └── medianHHIncomePoverty.R
│       ├── sdAsRuleForEmailNumChar/
│       │   └── sdAsRuleForEmailNumChar.R
│       ├── sdRuleForIncome/
│       │   └── sdRuleForIncome.R
│       ├── sdRuleForIntRate/
│       │   └── sdRuleForIntRate.R
│       ├── sdRuleForLoanAmount/
│       │   └── sdRuleForLoanAmount.R
│       ├── severalDiffDistWithSdOf1/
│       │   └── severalDiffDistWithSdOf1.R
│       ├── singleBiMultiModalPlots/
│       │   └── singleBiMultiModalPlots.R
│       └── total_income_dot_plot/
│           └── total_income_dot_plot.R
├── eoce.bib
├── extraTeX/
│   ├── data/
│   │   └── data.tex
│   ├── eoceSolutions/
│   │   └── eoceSolutions.tex
│   ├── index/
│   │   └── index.tex
│   ├── preamble/
│   │   ├── copyright.tex
│   │   ├── copyright_derivative.tex
│   │   ├── preface.tex
│   │   ├── review_copy.tex
│   │   ├── title.tex
│   │   └── title_derivative.tex
│   ├── style/
│   │   ├── colorsV1.tex
│   │   ├── hardcover.tex
│   │   ├── headers.tex
│   │   ├── headers_simple.tex
│   │   ├── style.tex
│   │   ├── style_appendices.tex
│   │   ├── style_simple.tex
│   │   ├── tablet.tex
│   │   └── video.tex
│   └── tables/
│       ├── TeX/
│       │   ├── chiSquareTable.tex
│       │   ├── tTable.tex
│       │   └── zTable.tex
│       ├── code/
│       │   ├── chiSquareProbTable.R
│       │   └── normalProbTable.R
│       └── figures/
│           ├── chiSquareTail/
│           │   └── chiSquareTail.R
│           ├── normalTails/
│           │   ├── normalTails.R
│           │   └── subtractingArea/
│           │       └── subtractingArea.R
│           └── tTails/
│               └── tTails.R
├── fullminipage.sty
├── main.tex
└── openintro-statistics.Rproj
Condensed preview — 543 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (5,739K chars).
[
  {
    "path": ".gitignore",
    "chars": 306,
    "preview": "*.log\n*.aux\nmain-blx.bib\nmain.bbl\nmain.blg\nmain.idx\nmain.ilg\nmain.ind\nmain.out\nmain.pdf\nmain.run.xml\nmain.synctex.gz\nmai"
  },
  {
    "path": "LICENSE.md",
    "chars": 2612,
    "preview": "\nOpenIntro Statistics is available at http://www.openintro.org under a Creative Commons Attribution-ShareAlike 3.0 Unpor"
  },
  {
    "path": "README.md",
    "chars": 2310,
    "preview": "Project Organization\n--------------------\n\n- Each chapter's content is in one of the eight chapter folders that start wi"
  },
  {
    "path": "ch_distributions/TeX/binomial_distribution.tex",
    "chars": 8716,
    "preview": "\\exercisesheader{}\n\n% 17\n\n\\eoce{\\qt{Underage drinking, Part I\\label{underage_drinking_intro}}\nData collected by the Subs"
  },
  {
    "path": "ch_distributions/TeX/ch_distributions.tex",
    "chars": 91188,
    "preview": "\\begin{chapterpage}{Distributions of random variables}\n  \\chaptertitle[30]{Distributions of random \\titlebreak{} variabl"
  },
  {
    "path": "ch_distributions/TeX/geometric_distribution.tex",
    "chars": 3658,
    "preview": "\\exercisesheader{}\n\n% 11\n\n\\eoce{\\qtq{Is it Bernoulli\\label{is_it_bernouilli}} Determine if each trial can be \nconsidered"
  },
  {
    "path": "ch_distributions/TeX/negative_binomial_distribution.tex",
    "chars": 2860,
    "preview": "\\exercisesheader{}\n\n% 27\n\n\\eoce{\\qt{Rolling a die\\label{roll_die}} Calculate the \nfollowing probabilities and indicate w"
  },
  {
    "path": "ch_distributions/TeX/normal_distribution.tex",
    "chars": 7466,
    "preview": "\\exercisesheader{}\n\n% 1\n\n\\eoce{\\qt{Area under the curve, Part I\\label{area_under_curve_1}} What percent of a \nstandard n"
  },
  {
    "path": "ch_distributions/TeX/poisson_distribution.tex",
    "chars": 2890,
    "preview": "\\exercisesheader{}\n\n% 31\n\n\\eoce{\\qt{Customers at a coffee shop\\label{coffee_shop_customers}} A coffee shop \nserves an av"
  },
  {
    "path": "ch_distributions/TeX/review_exercises.tex",
    "chars": 9680,
    "preview": "\\reviewexercisesheader{}\n\n% 35\n\n\\eoce{\\qt{Roulette winnings\\label{roulette_winnings}} In the game of roulette, a \nwheel "
  },
  {
    "path": "ch_distributions/figures/6895997/6895997.R",
    "chars": 1362,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF(\"6895997.pdf\", 5, 2.5,\n      mar = c(2, 0, 0, 0))\nX <- seq(-4, 4, 0.01)\nY <- dnorm(X"
  },
  {
    "path": "ch_distributions/figures/amiIncidencesOver100Days/amiIncidencesOver100Days.R",
    "chars": 526,
    "preview": "library(openintro)\n\nx <- ami.occurrences$ami\n\nmyPDF(\"amiIncidencesOver100Days.pdf\", 5, 2.4,\n       mar = c(3, 3.5, 0.5, "
  },
  {
    "path": "ch_distributions/figures/between59And62/between59And62.R",
    "chars": 297,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('between59And62.pdf', 2.5, 0.9,\n      mar = c(1.4, 0, 0, 0),\n      mgp = c(3, 0.45, "
  },
  {
    "path": "ch_distributions/figures/eoce/GRE_intro/gre_intro.R",
    "chars": 1784,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# set input data -------------"
  },
  {
    "path": "ch_distributions/figures/eoce/area_under_curve_1/area_under_curve_1.R",
    "chars": 1787,
    "preview": "# load packages -----------------------------------------------------\n\nlibrary(openintro)\n\n# Z < -1.35 -----------------"
  },
  {
    "path": "ch_distributions/figures/eoce/area_under_curve_2/area_under_curve_2.R",
    "chars": 1807,
    "preview": "# load packages -----------------------------------------------------\n\nlibrary(openintro)\n\n# Z > -1.13 -----------------"
  },
  {
    "path": "ch_distributions/figures/eoce/college_fem_heights/college_fem_heights.R",
    "chars": 1298,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# create data ----------------"
  },
  {
    "path": "ch_distributions/figures/eoce/stats_scores/stats_scores.R",
    "chars": 1314,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# create data ----------------"
  },
  {
    "path": "ch_distributions/figures/fcidMHeights/fcidMHeights-helpers.R",
    "chars": 696,
    "preview": "\nQQNorm <- function(x, M, SD, col) {\n  qqnorm(x,\n         cex = 0.7,\n         main = '',\n         axes = FALSE,\n        "
  },
  {
    "path": "ch_distributions/figures/fcidMHeights/fcidMHeights.R",
    "chars": 478,
    "preview": "library(openintro)\n\nobs <- male_heights_fcid$height_inch\nsource(\"fcidMHeights-helpers.R\")\n\nhold <- hist(obs, plot = FALS"
  },
  {
    "path": "ch_distributions/figures/fourBinomialModelsShowingApproxToNormal/fourBinomialModelsShowingApproxToNormal.R",
    "chars": 728,
    "preview": "library(openintro)\ndata(COL)\n\nk  <- -50:500\np  <- 0.1\nn  <- c(10, 30, 100, 300)\nxl <- c(0, 0, 0, 10) - 1\nxu <- c(7, 11, "
  },
  {
    "path": "ch_distributions/figures/geometricDist35/geometricDist35.R",
    "chars": 576,
    "preview": "library(openintro)\ndata(COL)\n\np <- 0.35\nx <- 1:100\ny <- (1 - p)^(x - 1) * p\nmyPDF('geometricDist35.pdf', 6, 3.1,\n      m"
  },
  {
    "path": "ch_distributions/figures/geometricDist70/geometricDist70.R",
    "chars": 663,
    "preview": "library(openintro)\ndata(COL)\n\np <- 0.7\nx <- 1:100\ny <- (1 - p)^(x - 1) * p\nmyPDF('geometricDist70.pdf', 6, 3.1,\n      ma"
  },
  {
    "path": "ch_distributions/figures/height40Perc/height40Perc.R",
    "chars": 561,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('height40Perc.pdf', 2.15, 0.95,\n      mar = c(1.31, 0, 0.01, 0),\n      mgp = c(3, 0."
  },
  {
    "path": "ch_distributions/figures/height82Perc/height82Perc.R",
    "chars": 645,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('height82Perc.pdf', 2.15, 1,\n      mar = c(1.31, 0, 0.01, 0),\n      mgp = c(3, 0.45,"
  },
  {
    "path": "ch_distributions/figures/mikeAndJosePercentiles/mikeAndJosePercentiles.R",
    "chars": 553,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF(\"mikeAndJosePercentiles.pdf\", 7, 1.3,\n      mar = c(2, 0.2, 0.2, 0.2),\n      mgp = c"
  },
  {
    "path": "ch_distributions/figures/nbaNormal/nbaNormal-helpers.R",
    "chars": 679,
    "preview": "\nQQNorm <- function(x, M, SD, col) {\n  qqnorm(x,\n         cex = 0.7,\n         main = '',\n         axes = FALSE,\n        "
  },
  {
    "path": "ch_distributions/figures/nbaNormal/nbaNormal.R",
    "chars": 420,
    "preview": "library(openintro)\ndim(nba_players_19)\nhead(nba_players_19)\n\nsource(\"nbaNormal-helpers.R\")\n\nobs <- nba_players_19$height"
  },
  {
    "path": "ch_distributions/figures/normApproxToBinomFail/normApproxToBinomFail.R",
    "chars": 761,
    "preview": "library(openintro)\ndata(COL)\n\nk <- 0:400\np <- 0.15\nn <- 400\nx1 <- 49\nx2 <- 51\nm <- n * p\ns <- sqrt(n * p * (1 - p))\n\nmyP"
  },
  {
    "path": "ch_distributions/figures/normalExamples/normalExamples-helpers.R",
    "chars": 700,
    "preview": "\nQQNorm <- function(x, M, SD, col) {\n  qqnorm(x,\n         cex = 0.7,\n         main = '',\n         axes = FALSE,\n        "
  },
  {
    "path": "ch_distributions/figures/normalExamples/normalExamples.R",
    "chars": 760,
    "preview": "library(openintro)\ndata(COL)\n\nobs1 <- simulated_normal$n40\nobs2 <- simulated_normal$n100\nobs3 <- simulated_normal$n400\n\n"
  },
  {
    "path": "ch_distributions/figures/normalQuantileExer/QQNorm.R",
    "chars": 363,
    "preview": "\nQQNorm <- function(obs, at  =  pretty(obs), lwd = 2) {\n  qqnorm(obs,\n         cex = 0.9,\n         main = '',\n         a"
  },
  {
    "path": "ch_distributions/figures/normalQuantileExer/normalQuantileExer-data.R",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "ch_distributions/figures/normalQuantileExer/normalQuantileExer.R",
    "chars": 433,
    "preview": "library(openintro)\ndata(COL)\n\n\nobs1 <- simulated_dist$d1\nobs2 <- simulated_dist$d2\nobs3 <- simulated_dist$d3\nobs4 <- sim"
  },
  {
    "path": "ch_distributions/figures/normalQuantileExer/normalQuantileExerAdditional.R",
    "chars": 337,
    "preview": "library(openintro)\ndata(COL)\n\nsource(\"QQNorm.R\")\n\nobs1 <- simulated_dist$d5\nobs2 <- simulated_dist$d6\n\n\nmyPDF(\"normalQua"
  },
  {
    "path": "ch_distributions/figures/normalTails/normalTails.R",
    "chars": 726,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF(\"normalTails.pdf\", 4.3, 1,\n      mar = c(0.81, 1, 0.3, 1),\n      mgp = c(3, -0.2, 0)"
  },
  {
    "path": "ch_distributions/figures/pokerNormal/pokerNormal.R",
    "chars": 1184,
    "preview": "library(openintro)\ndata(COL)\n\nobs <- c(-110, -9, -60, 316, -200, -196,\n         320, -160, 31, 331, 1731, 21,\n         -"
  },
  {
    "path": "ch_distributions/figures/satAbove1190/satAbove1190.R",
    "chars": 269,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF(\"satAbove1190.pdf\", 3, 1.4,\n      mar = c(1.2, 0, 0, 0),\n      mgp = c(3, 0.17, 0))\n"
  },
  {
    "path": "ch_distributions/figures/satActNormals/satActNormals.R",
    "chars": 1157,
    "preview": "library(openintro)\ndata(COL)\n\nset.seed(1)\n\npdf('satActNormals.pdf', 6, 3.5)\npar(mfrow = c(2, 1),\n    las = 1,\n    mar = "
  },
  {
    "path": "ch_distributions/figures/satBelow1030/satBelow1030.R",
    "chars": 456,
    "preview": "library(openintro)\ndata(COL)\n\n\nmyPDF('satBelow1030.pdf', 2.875, 1,\n      mar = c(1.5, 0, 0, 0),\n      mgp = c(3, 0.45, 0"
  },
  {
    "path": "ch_distributions/figures/satBelow1300/satBelow1300.R",
    "chars": 235,
    "preview": "library(openintro)\ndata(COL)\n\n#===> plot <===#\nmyPDF(\"satBelow1300.pdf\", 2.25, 1,\n      mar = c(1.2, 0, 0, 0),\n      mgp"
  },
  {
    "path": "ch_distributions/figures/simpleNormal/simpleNormal.R",
    "chars": 296,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF(\"simpleNormal.pdf\", 4.3, 1.5,\n      mar = 0.1 * rep(1, 4))\n\nX <- seq(-5,5,0.01)\nY <-"
  },
  {
    "path": "ch_distributions/figures/smallNormalTails/smallNormalTails.R",
    "chars": 968,
    "preview": "library(openintro)\n\nmyPDF(\"smallNormalTails.pdf\", 4.56, 1.2,\n      mar = c(1.3, 1, 0.5, 1),\n      mgp = c(3, 0.27, 0),\n "
  },
  {
    "path": "ch_distributions/figures/standardNormal/standardNormal.R",
    "chars": 493,
    "preview": "library(openintro)\n\nset.seed(1)\nx <- rnorm(1e5)\nhold <- hist(x, breaks = 50, plot = FALSE)\n\nmyPDF(\"standardNormal.pdf\", "
  },
  {
    "path": "ch_distributions/figures/subtracting2Areas/subtracting2Areas.R",
    "chars": 1345,
    "preview": "library(openintro)\ndata(COL)\n\nAddShadedPlot <- function(x, y, offset,\n                          shade.start = -8,\n      "
  },
  {
    "path": "ch_distributions/figures/subtractingArea/subtractingArea.R",
    "chars": 1359,
    "preview": "library(openintro)\n\nAddShadedPlot <- function(x, y, offset,\n                          shade.start = -8,\n                "
  },
  {
    "path": "ch_distributions/figures/twoSampleNormals/twoSampleNormals.R",
    "chars": 955,
    "preview": "library(openintro)\ndata(COL)\n\nset.seed(1)\nx <- rnorm(100000)\nhold <- hist(x,\n             breaks = 50,\n             plot"
  },
  {
    "path": "ch_distributions/figures/twoSampleNormalsStacked/twoSampleNormalsStacked.R",
    "chars": 418,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF(\"twoSampleNormalsStacked.pdf\", 4.65, 2,\n      mar = c(1.7,1,0.1,1))\n\n# curve 1\nX <- "
  },
  {
    "path": "ch_foundations_for_inf/TeX/ch_foundations_for_inf.tex",
    "chars": 118796,
    "preview": "\\begin{chapterpage}{Foundations for inference}\n  \\chaptertitle{Foundations for inference}\n  \\label{foundationsForInferen"
  },
  {
    "path": "ch_foundations_for_inf/TeX/confidence_intervals.tex",
    "chars": 6823,
    "preview": "\\exercisesheader{}\n\n% 7\n\n\\eoce{\\qt{Chronic illness, Part I\\label{chronic_illness_intro}} \nIn 2013, the Pew Research Foun"
  },
  {
    "path": "ch_foundations_for_inf/TeX/hypothesis_testing.tex",
    "chars": 8296,
    "preview": "\\exercisesheader{}\n\n% 15\n\n\\eoce{\\qt{Identify hypotheses, Part I\\label{\n}}\nWrite the null and alternative hypotheses in w"
  },
  {
    "path": "ch_foundations_for_inf/TeX/one_sided_tests.tex",
    "chars": 9226,
    "preview": "\\subsection{One-sided hypothesis tests (special topic)}\n\n\\Comment{This section needs a lot of work. Maybe it shouldn't\n "
  },
  {
    "path": "ch_foundations_for_inf/TeX/review_exercises.tex",
    "chars": 6773,
    "preview": "\\reviewexercisesheader{}\n\n% 27\n\n\\eoce{\\qt{Relaxing after work\\label{relax_after_work}} The General Social Survey asked t"
  },
  {
    "path": "ch_foundations_for_inf/TeX/variability_in_estimates.tex",
    "chars": 6712,
    "preview": "\\exercisesheader{}\n\n% 1\n\n\\eoce{\\qt{Identify the parameter, Part I\\label{identify_parameter_1}} For each of the following"
  },
  {
    "path": "ch_foundations_for_inf/figures/95PercentConfidenceInterval/95PercentConfidenceInterval.R",
    "chars": 1068,
    "preview": "library(openintro)\ndata(COL)\ndata(run10)\nset.seed(52)\n\n# This still references run10, but the actual range of values\n# i"
  },
  {
    "path": "ch_foundations_for_inf/figures/ARCHIVE/sampling_10k_prop_56p/sampling_10k_prop_56p.R",
    "chars": 674,
    "preview": "set.seed(1)\nlibrary(openintro)\ndata(COL)\n\nn.sim <- 10000\nsamp.size <- 1000\n\nsamples <- matrix(sample(0:1, n.sim * samp.s"
  },
  {
    "path": "ch_foundations_for_inf/figures/arrayOfFigureAreasForChiSquareDistribution/chiSquareAreaAbove10WithDF4/chiSquareAreaAbove10WithDF4.R",
    "chars": 229,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('chiSquareAreaAbove10WithDF4.pdf', 5, 3,\n      mar = c(2, 1, 1, 1),\n      mgp = c(2."
  },
  {
    "path": "ch_foundations_for_inf/figures/arrayOfFigureAreasForChiSquareDistribution/chiSquareAreaAbove11Point7WithDF7/chiSquareAreaAbove11Point7WithDF7.R",
    "chars": 237,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('chiSquareAreaAbove11Point7WithDF7.pdf', 5, 3,\n      mar = c(2, 1, 1, 1),\n      mgp "
  },
  {
    "path": "ch_foundations_for_inf/figures/arrayOfFigureAreasForChiSquareDistribution/chiSquareAreaAbove4Point3WithDF2/chiSquareAreaAbove4WithDF2.R",
    "chars": 235,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('chiSquareAreaAbove4Point3WithDF2.pdf', 5, 3,\n      mar = c(2, 1, 1, 1),\n      mgp ="
  },
  {
    "path": "ch_foundations_for_inf/figures/arrayOfFigureAreasForChiSquareDistribution/chiSquareAreaAbove5Point1WithDF5/chiSquareAreaAbove5Point1WithDF5.R",
    "chars": 235,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('chiSquareAreaAbove5Point1WithDF5.pdf', 5, 3,\n      mar = c(2, 1, 1, 1),\n      mgp ="
  },
  {
    "path": "ch_foundations_for_inf/figures/arrayOfFigureAreasForChiSquareDistribution/chiSquareAreaAbove6Point25WithDF3/chiSquareAreaAbove6Point25WithDF3.R",
    "chars": 237,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('chiSquareAreaAbove6Point25WithDF3.pdf', 5, 3,\n      mar = c(2, 1, 1, 1),\n      mgp "
  },
  {
    "path": "ch_foundations_for_inf/figures/arrayOfFigureAreasForChiSquareDistribution/chiSquareAreaAbove9Point21WithDF3/chiSquareAreaAbove9Point21WithDF3.R",
    "chars": 237,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('chiSquareAreaAbove9Point21WithDF3.pdf', 5, 3,\n      mar = c(2, 1, 1, 1),\n      mgp "
  },
  {
    "path": "ch_foundations_for_inf/figures/bladesTwoSampleHTPValueQC/bladesTwoSampleHTPValueQC.R",
    "chars": 509,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('bladesTwoSampleHTPValueQC.pdf', 3.04, 1.56,\n      mar = c(2.4, 0, 0.5, 0),\n      mg"
  },
  {
    "path": "ch_foundations_for_inf/figures/business_one_sided_20_21-p_value/business_one_sided_20_21-p_value.R",
    "chars": 323,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('business_one_sided_20_21-p_value.pdf', 2.15, 0.95,\n      mar = c(1.31, 0, 0.01, 0),"
  },
  {
    "path": "ch_foundations_for_inf/figures/chiSquareDistributionWithInceasingDF/chiSquareDistributionWithInceasingDF.R",
    "chars": 741,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('chiSquareDistributionWithInceasingDF.pdf', 6.5, 3,\n      mar = c(2, 0.5, 0.25, 0.5)"
  },
  {
    "path": "ch_foundations_for_inf/figures/choosingZForCI/choosingZForCI.R",
    "chars": 945,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('choosingZForCI.pdf', 7.56, 3.84,\n      mar=c(3.3, 1, 0.5, 1),\n      mgp=c(2.1, 0.6,"
  },
  {
    "path": "ch_foundations_for_inf/figures/clt_prop_grid/clt_prop_grid.R",
    "chars": 1650,
    "preview": "library(openintro)\ndata(COL)\n\nprops <- c(0, 0.1, 0.2, 0.50, 0.8, 0.9)\nsamp.size.1 <- c(0, 10, 25)\nsamp.size.2 <- c(50, 1"
  },
  {
    "path": "ch_foundations_for_inf/figures/communityCollegeClaimedHousingExpenseDistribution/communityCollegeClaimedHousingExpenseDistribution.R",
    "chars": 506,
    "preview": "library(openintro)\ndata(COL)\n\nx <- student.housing$price\nt.test(x, mu = 650)\nmean(x)\nsd(x)\nlength(x)\n\nmyPDF('communityCo"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/adult_heights/adult_heights.R",
    "chars": 436,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/age_at_first_marriage_intro/age_at_first_marriage_intro.R",
    "chars": 470,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/assisted_reproduction_one_sample_randomization/assisted_reproduction_one_sample_randomization.R",
    "chars": 1525,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# set sample size and number o"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/cflbs/cflbs.R",
    "chars": 961,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# inputs ---------------------"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/college_credits/college_credits.R",
    "chars": 453,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/egypt_revolution_one_sample_randomization/egypt_revolution_one_sample_randomization.R",
    "chars": 1501,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# set sample size and number o"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/exclusive_relationships/exclusive_relationships.R",
    "chars": 662,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\nlibrary(dplyr)\n\n# load data ---"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/exclusive_relationships/survey.csv",
    "chars": 468,
    "preview": "\"excl_relation\"\n2\n4\n1\n4\nNA\n2\n2\n2\n1\n4\n2\n4\n2\n7\nNA\n1\nNA\n1\n9\nNA\n4\n1\n2\n4\n2\n1\n5\n1\n9\n1\n2\n1\n4\n4\n1\n8\nNA\n1\n6\n4\n1\n1\n2\n2\n4\n2\n5\n4\n1\n1"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/gifted_children_ht/gifted_children_ht.R",
    "chars": 516,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/gifted_children_intro/gifted_children_intro.R",
    "chars": 550,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/identify_dist_ls_pop/identify_dist_ls_pop.R",
    "chars": 1841,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# create data ----------------"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/identify_dist_symm_pop/identify_dist_symm_pop.R",
    "chars": 1967,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# create data ----------------"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/pennies_ages/pennies_ages.R",
    "chars": 1939,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/penny_weights/penny_weights.R",
    "chars": 984,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# input ----------------------"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/social_experiment_two_sample_randomization/social_experiment_two_sample_randomization.R",
    "chars": 1933,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# set number of simulations --"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/songs_on_ipod/songs_on_ipod.R",
    "chars": 462,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/thanksgiving_spending_intro/thanksgiving_spending_intro.R",
    "chars": 480,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_foundations_for_inf/figures/eoce/yawning_two_sample_randomization/yawning_two_sample_randomization.R",
    "chars": 1722,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# set number of simulations --"
  },
  {
    "path": "ch_foundations_for_inf/figures/geomFitEvaluationForSP500For1990To2011/geomFitEvaluationForSP500For1990To2011.R",
    "chars": 1134,
    "preview": "library(openintro)\ndata(COL)\nlibrary(stockPortfolio)\ngr <- getReturns(\"^GSPC\",\n                 freq = \"d\",\n            "
  },
  {
    "path": "ch_foundations_for_inf/figures/geomFitPValueForSP500For1990To2011/geomFitPValueForSP500For1990To2011.R",
    "chars": 426,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('geomFitPValueForSP500For1990To2011.pdf', 6.6, 2.387,\n      mar = c(2, 1, 1, 1),\n   "
  },
  {
    "path": "ch_foundations_for_inf/figures/googleHTForDiffAlgPerformancePValue/googleHTForDiffAlgPerformancePValue.R",
    "chars": 234,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('googleHTForDiffAlgPerformancePValue.pdf', 5, 2.25,\n    mar = c(2, 1, 1, 1), mgp = c"
  },
  {
    "path": "ch_foundations_for_inf/figures/helpers.R",
    "chars": 1208,
    "preview": "\nRunSimulation <- function(p, n.sim, samp.size, xlim, xlab, show = \"n\") {\n\n  samples <- matrix(sample(0:1, n.sim * samp."
  },
  {
    "path": "ch_foundations_for_inf/figures/jurorHTPValueShown/jurorHTPValueShown.R",
    "chars": 232,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('jurorHTPValueShown.pdf', 4.4, 1.87,\n      mar = c(1.5, 1, 0.2, 1),\n      mgp = c(2."
  },
  {
    "path": "ch_foundations_for_inf/figures/mammograms/mammograms.R",
    "chars": 396,
    "preview": "require(openintro)\ndata(COL)\n\nfn <- 'mammogramPValue.pdf'\nmyPDF(fn, 4, 1.2,\n      mar = c(1.5, 0, 0.1, 0),\n      mgp = c"
  },
  {
    "path": "ch_foundations_for_inf/figures/normal_dist_mean_500_se_016/normal_dist_mean_500_se_016.R",
    "chars": 1447,
    "preview": "require(openintro)\ndata(COL)\n\nfn1 <- 'normal_dist_mean_500_se_016.pdf'\nfn2 <- 'normal_dist_mean_500_se_016_with_upper.pd"
  },
  {
    "path": "ch_foundations_for_inf/figures/nuclearArmsReduction/nuclearArmsReduction.R",
    "chars": 736,
    "preview": "require(openintro)\ndata(COL)\n\nfn <- 'nuclearArmsReductionPValue.pdf'\nmyPDF(fn, 3.5, 1,\n      mar = c(1.55, 0, 0.1, 0),\n "
  },
  {
    "path": "ch_foundations_for_inf/figures/p-hat_from_53_and_59-not-used/p-hat_from_53_and_59.R",
    "chars": 322,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('p-hat_from_53_and_59.pdf', 2.15, 0.95,\n      mar = c(1.31, 0, 0.01, 0),\n      mgp ="
  },
  {
    "path": "ch_foundations_for_inf/figures/p-hat_from_53_and_59_computation/NormTailsCalc.R",
    "chars": 1475,
    "preview": "NormTailsCalc <- function(z1, z2, file.name) {\n\n  if (!missing(file.name)) {\n    pdf(paste0(file.name, '.pdf'), 4, 0.7)\n"
  },
  {
    "path": "ch_foundations_for_inf/figures/p-hat_from_53_and_59_computation/p-hat_from_53_and_59_computation.R",
    "chars": 1360,
    "preview": "library(openintro)\ndata(COL)\n\nAddShadedPlot <- function(x, y, offset,\n                          shade.start = -8,\n      "
  },
  {
    "path": "ch_foundations_for_inf/figures/p-hat_from_867_and_907-not-used/p-hat_from_867_and_907.R",
    "chars": 330,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('p-hat_from_867_and_907.pdf', 2.15, 0.95,\n      mar = c(1.31, 0, 0.01, 0),\n      mgp"
  },
  {
    "path": "ch_foundations_for_inf/figures/p-hat_from_86_and_90/p-hat_from_86_and_90.R",
    "chars": 322,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('p-hat_from_86_and_90.pdf', 2.15, 0.95,\n      mar = c(1.31, 0, 0.01, 0),\n      mgp ="
  },
  {
    "path": "ch_foundations_for_inf/figures/quadcopter/quadcopter_attribution.txt",
    "chars": 96,
    "preview": "https://secure.flickr.com/photos/sebilden/14642916088\n\nPhotographer: David J\nLicense: CC BY 2.0\n"
  },
  {
    "path": "ch_foundations_for_inf/figures/sampling_100_prop_X/sampling_100_prop_X.R",
    "chars": 1173,
    "preview": "set.seed(4)\nlibrary(openintro)\ndata(COL)\nsource(\"../helpers.R\")\n\np <- c(0.03, 0.20, 0.50, 0.80, 0.97)\n# Must sub p's act"
  },
  {
    "path": "ch_foundations_for_inf/figures/sampling_10_prop_25p/sampling_10_prop_25p - one figure.R",
    "chars": 862,
    "preview": "set.seed(3)\nlibrary(openintro)\n\nn.sim <- 10000\nsamp.size <- 10  # 2541\nprop <- 0.25\nwidth <- 0.025\n\nsamples <- matrix(sa"
  },
  {
    "path": "ch_foundations_for_inf/figures/sampling_10_prop_25p/sampling_10_prop_25p.R",
    "chars": 1054,
    "preview": "set.seed(3)\nlibrary(openintro)\ndata(COL)\n\nn.sim <- 10000\nsamp.size <- 10  # 2541\nprop <- 0.25\nwidth <- 0.025\n\nsamples <-"
  },
  {
    "path": "ch_foundations_for_inf/figures/sampling_10k_prop_887p/sampling_10k_prop_887p.R",
    "chars": 739,
    "preview": "set.seed(4)\nlibrary(openintro)\ndata(COL)\n\nn.sim <- 10000\nsamp.size <- 1000  # 2541\nprop <- 0.887\n\nsamples <- matrix(samp"
  },
  {
    "path": "ch_foundations_for_inf/figures/sampling_10k_prop_88p/sampling_10k_prop_88p.R",
    "chars": 737,
    "preview": "set.seed(4)\nlibrary(openintro)\ndata(COL)\n\nn.sim <- 10000\nsamp.size <- 1000  # 2541\nprop <- 0.88\n\nsamples <- matrix(sampl"
  },
  {
    "path": "ch_foundations_for_inf/figures/sampling_5k_prop_50p/sampling_5k_prop_50p.R",
    "chars": 751,
    "preview": "set.seed(3)\nlibrary(openintro)\ndata(COL)\n\nn.sim <- 5000\nsamp.size <- 1000\nprop <- 0.5\n\nsamples <- matrix(sample(0:1, n.s"
  },
  {
    "path": "ch_foundations_for_inf/figures/sampling_X_prop_56p/sampling_X_prop_56p.R",
    "chars": 510,
    "preview": "set.seed(4)\nlibrary(openintro)\ndata(COL)\nsource(\"../helpers.R\")\n\np <- 0.56\n# Must sub p's actual value into expression()"
  },
  {
    "path": "ch_foundations_for_inf/figures/sulphStudyFindPValueUsingNormalApprox/sulphStudyFindPValueUsingNormalApprox.R",
    "chars": 625,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('sulphStudyFindPValueUsingNormalApprox.pdf', 6.7, 2.4,\n      mar = c(2, 0, 0.5, 0),\n"
  },
  {
    "path": "ch_foundations_for_inf/figures/whyWeWantPValue/whyWeWantPValue.R",
    "chars": 1115,
    "preview": "library(openintro)\ndata(COL)\n\nBuildWhyWeWantPValuePlot <- function(\n    file.name = 'whyWeWantPValue.pdf',\n    expressio"
  },
  {
    "path": "ch_inference_for_means/TeX/ch_inference_for_means.tex",
    "chars": 141389,
    "preview": "\\begin{chapterpage}{Inference for numerical data}\n  \\chaptertitle{Inference for numerical data}\n  \\label{inferenceForNum"
  },
  {
    "path": "ch_inference_for_means/TeX/comparing_many_means_with_anova.tex",
    "chars": 17937,
    "preview": "\\exercisesheader{}\n\n% 35\n\n\\eoce{\\qt{Fill in the blank\\label{fitb_anova}} When doing an ANOVA, you observe \nlarge differe"
  },
  {
    "path": "ch_inference_for_means/TeX/difference_of_two_means.tex",
    "chars": 15611,
    "preview": "\\exercisesheader{}\n\n% 23\n\n\\eoce{\\qt{Friday the 13$^{\\text{th}}$, Part I\\label{friday_13th_traffic}} In the \nearly 1990's"
  },
  {
    "path": "ch_inference_for_means/TeX/one-sample_means_with_the_t-distribution.tex",
    "chars": 10225,
    "preview": "\\exercisesheader{}\n\n% 1\n\n\\eoce{\\qt{Identify the critical $t$\\label{identify_critical_t}} An independent random \nsample i"
  },
  {
    "path": "ch_inference_for_means/TeX/paired_data.tex",
    "chars": 8958,
    "preview": "\\exercisesheader{}\n\n% 15\n\n\\eoce{\\qt{Air quality\\label{air_quality_shortened}}\nAir quality measurements were collected in"
  },
  {
    "path": "ch_inference_for_means/TeX/power_calculations_for_a_difference_of_means.tex",
    "chars": 1548,
    "preview": "\\exercisesheader{}\n\n% 33\n\n\\eoce{\\qt{Increasing corn yield\\label{increase_corn_yield}} A large farm wants to \ntry out a n"
  },
  {
    "path": "ch_inference_for_means/TeX/review_exercises.tex",
    "chars": 14946,
    "preview": "\\reviewexercisesheader{}\n\n% 47\n\n\\eoce{\\qt{Gaming and distracted eating, Part I\\label{gaming_distracted_eating_intake}}\nA"
  },
  {
    "path": "ch_inference_for_means/figures/babySmokePlotOfTwoGroupsToExamineSkew/babySmokePlotOfTwoGroupsToExamineSkew.R",
    "chars": 666,
    "preview": "library(openintro)\ndata(COL)\ndata(births)\nd <- births\n\n\nmyPDF('babySmokePlotOfTwoGroupsToExamineSkew.pdf', 2 * 4.5, 2.3,"
  },
  {
    "path": "ch_inference_for_means/figures/cbrRunTimesMenWomen/cbrRunTimesMenWomen.R",
    "chars": 679,
    "preview": "library(openintro)\ndata(COL)\ndata(run10Samp)\n\nset.seed(1)\nm <- run10Samp$time[run10Samp$gender=='M']\nmean(m); sd(m)\nf <-"
  },
  {
    "path": "ch_inference_for_means/figures/classData/classData.R",
    "chars": 686,
    "preview": "library(openintro)\ndata(COL)\nlibrary(xtable)\ndata(classData)\n\nmyPDF(\"classDataSBSBoxPlot.pdf\", 5.5, 2.7,\n      mgp = c(2"
  },
  {
    "path": "ch_inference_for_means/figures/distOfDiffOfSampleMeansForBWOfBabySmokeData/distOfDiffOfSampleMeansForBWOfBabySmokeData.R",
    "chars": 446,
    "preview": "library(openintro)\ndata(COL)\ndata(births)\nd <- births\n\nmyPDF('distOfDiffOfSampleMeansForBWOfBabySmokeData.pdf', 3.5, 1.2"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/adult_heights/adult_heights.R",
    "chars": 436,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/age_at_first_marriage_intro/age_at_first_marriage_intro.R",
    "chars": 470,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/anova_exercise_1/anova_exercise_1.R",
    "chars": 434,
    "preview": "library(openintro)\n\nd <- penetrating_oil\n\nmyPDF(\"torque_on_rusty_bolt_dot_plot.pdf\", 7, 3.2,\n    mar = c(3.5, 6.5, 0.1, "
  },
  {
    "path": "ch_inference_for_means/figures/eoce/chick_wts_anova/chick_wts.R",
    "chars": 805,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\nlibrary(dplyr)\n\n# load data ---"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/chick_wts_linseed_horsebean/chick_wts.R",
    "chars": 784,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\nlibrary(dplyr)\n\n# load data ---"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/child_care_hours/child_care_hours.R",
    "chars": 1601,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\nlibrary(xtable)\n\n# load data --"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/child_care_hours/china.csv",
    "chars": 70863,
    "preview": "gender,edu,child_care\r1,1,-99\r1,5,-99\r2,2,-99\r1,2,-99\r2,3,-99\r2,NA,-99\r2,2,-99\r2,2,-99\r2,2,-99\r2,NA,-99\r2,NA,-99\r2,2,-99"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/cleveland_sacramento/cleveland_sacramento.R",
    "chars": 1335,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# take a sample --------------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/college_credits/college_credits.R",
    "chars": 522,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/diamonds_1/diamonds.R",
    "chars": 1272,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\nlibrary(ggplot2)\n\n# load data -"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/exclusive_relationships/exclusive_relationships.R",
    "chars": 672,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\nlibrary(dplyr)\n\n# load data ---"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/exclusive_relationships/survey.csv",
    "chars": 468,
    "preview": "\"excl_relation\"\n2\n4\n1\n4\nNA\n2\n2\n2\n1\n4\n2\n4\n2\n7\nNA\n1\nNA\n1\n9\nNA\n4\n1\n2\n4\n2\n1\n5\n1\n9\n1\n2\n1\n4\n4\n1\n8\nNA\n1\n6\n4\n1\n1\n2\n2\n4\n2\n5\n4\n1\n1"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/friday_13th_accident/friday_13th_accident.R",
    "chars": 811,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# subset for accidents -------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/friday_13th_traffic/friday_13th_traffic.R",
    "chars": 891,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/fuel_eff_city/fuel_eff.csv",
    "chars": 238855,
    "preview": "model_yr,mfr_name,division,carline,mfr_code,model_type_index,engine_displacement,no_cylinders,transmission_speed,city_mp"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/fuel_eff_city/fuel_eff_city.R",
    "chars": 1086,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/fuel_eff_hway/fuel_eff.csv",
    "chars": 238855,
    "preview": "model_yr,mfr_name,division,carline,mfr_code,model_type_index,engine_displacement,no_cylinders,transmission_speed,city_mp"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/fuel_eff_hway/fuel_eff_hway.R",
    "chars": 1086,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/gifted_children/gifted_children.R",
    "chars": 974,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/gifted_children_ht/gifted_children_ht.R",
    "chars": 516,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/gifted_children_intro/gifted_children_intro.R",
    "chars": 550,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/global_warming_v2_1/global_warming_v2_1.R",
    "chars": 281,
    "preview": "library(openintro)\nd <- climate70$dx90_2018 - climate70$dx90_1948\nmean(d)\nsd(d)\nlength(d)\nt.test(d)\n\n\nmyPDF(\"global_warm"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/gpa_major/gpa_major.R",
    "chars": 1270,
    "preview": "library(openintro)\nlibrary(xtable)\n\nsurvey <- read.csv(\"survey.csv\")\n\n# subset for meaningful gpa ----------------------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/gpa_major/survey.csv",
    "chars": 5008,
    "preview": "\"gpa\",\"major\"\n4,\"social sciences\"\n3.8,\"social sciences\"\n3.93,\"social sciences\"\n3.4,\"natural sciences\"\nNA,\"natural scienc"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/hs_beyond_1/hs_beyond.R",
    "chars": 1346,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/oscar_winners/oscar_winners.R",
    "chars": 837,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/prison_isolation_T/prison_isolation.R",
    "chars": 1019,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# load data ------------------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/prison_isolation_T/prison_isolation.csv",
    "chars": 304,
    "preview": "PreTrt1,PostTrt1,PreTrt2,PostTrt2,PreTrt3,PostTrt3\r67,74,88,79,86,90\r86,50,79,81,53,53\r64,64,67,83,81,102\r69,76,83,74,69"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/prius_fuel_efficiency/prius_fuel_efficiency.R",
    "chars": 761,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# create data ----------------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/prius_fuel_efficiency_update/prius_fuel_efficiency.R",
    "chars": 761,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# create data ----------------"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/t_distribution/t_distribution.R",
    "chars": 485,
    "preview": "# plot --------------------------------------------------------------\npdf('t_distribution.pdf', 4.3, 2.3)\n\npar(mar=c(2, "
  },
  {
    "path": "ch_inference_for_means/figures/eoce/torque_on_rusty_bolt/torque_on_rusty_bolt (Autosaved).R",
    "chars": 1012,
    "preview": "library(openintro)\nlibrary(xtable)\n\nd <- penetrating_oil\n\nmyPDF(\"torque_on_rusty_bolt_dot_plot.pdf\", 7, 3.2,\n    mar = c"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/torque_on_rusty_bolt/torque_on_rusty_bolt.R",
    "chars": 1069,
    "preview": "library(openintro)\nlibrary(xtable)\n\nd <- penetrating_oil\n\nmyPDF(\"torque_on_rusty_bolt_dot_plot.pdf\", 7, 3.2,\n    mar = c"
  },
  {
    "path": "ch_inference_for_means/figures/eoce/work_hours_education/work_hours_education.R",
    "chars": 1219,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\nlibrary(xtable)\n\n# load data --"
  },
  {
    "path": "ch_inference_for_means/figures/fDist2And423/fDist2And423.R",
    "chars": 807,
    "preview": "rm(list = ls())\nlibrary(openintro)\n\nX <- seq(0, 8, len = 300)\nY <- df(X, 2.00001, 423)\n\nmyPDF(\"fDist2And423.pdf\", 5, 2.4"
  },
  {
    "path": "ch_inference_for_means/figures/fDist3And323/fDist3And323.R",
    "chars": 665,
    "preview": "rm(list = ls())\nlibrary(openintro)\n\nX <- seq(0, 6, len = 300)\nY <- df(X, 3, 323)\n\nmyPDF(\"fDist3And323.pdf\", 5, 2.4,\n    "
  },
  {
    "path": "ch_inference_for_means/figures/mlbANOVA/mlbANOVA.R",
    "chars": 3162,
    "preview": "rm(list = ls())\nlibrary(xtable)\nlibrary(openintro)\nlibrary(dplyr)\nd   <- subset(mlb_players_18, AB >= 100)\nd   <- subset"
  },
  {
    "path": "ch_inference_for_means/figures/outliers_and_ss_condition/outliers_and_ss_condition.R",
    "chars": 692,
    "preview": "library(openintro)\nset.seed(2)\n\nd1 <- rnorm(15, 3, 2)\nd2 <- c(exp(rnorm(49, 0, 0.7)), 22)\n\nmyPDF('outliers_and_ss_condit"
  },
  {
    "path": "ch_inference_for_means/figures/pValueOfTwoTailAreaOfExamVersionsWhereDFIs26/pValueOfTwoTailAreaOfExamVersionsWhereDFIs26.R",
    "chars": 388,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('pValueOfTwoTailAreaOfExamVersionsWhereDFIs26.pdf',\n      4.8, 1.7,\n      mar = c(1."
  },
  {
    "path": "ch_inference_for_means/figures/pValueShownForSATHTOfOver100PtGain/pValueShownForSATHTOfOver100PtGain.R",
    "chars": 363,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('pValueShownForSATHTOfOver100PtGain.pdf', 4, 2,\n      mar = c(1.5, 1, 0.2, 1),\n     "
  },
  {
    "path": "ch_inference_for_means/figures/power_best_sample_size/power_best_sample_size.R",
    "chars": 1583,
    "preview": "library(openintro)\ndata(COL)\n\nBuildNull <- function() {\n  normTail(0, 1.07, L = -1000, U = 1000,\n           df = 50, lwd"
  },
  {
    "path": "ch_inference_for_means/figures/power_curve/power_curve.R",
    "chars": 531,
    "preview": "library(openintro)\ndata(COL)\n\nn <- c(10:500, seq(510, 2000, 10), seq(2100, 10000, 100))\nse <- sapply(n, function(x) sqrt"
  },
  {
    "path": "ch_inference_for_means/figures/power_null_0_0-76/power_null_0_0-76.R",
    "chars": 1214,
    "preview": "library(openintro)\ndata(COL)\n\nBuildNull <- function() {\n  normTail(0, 0.8, L = -1000, U = 1000,\n           df = 50, lwd "
  },
  {
    "path": "ch_inference_for_means/figures/power_null_0_1-7/power_null_0_1-7.R",
    "chars": 2730,
    "preview": "library(openintro)\ndata(COL)\n\nBuildNull <- function(xlim = c(-10, 10)) {\n  normTail(0, 1.70, L = -1000, U = 1000,\n      "
  },
  {
    "path": "ch_inference_for_means/figures/rissosDolphin/ReadMe.txt",
    "chars": 119,
    "preview": "\nPhoto by Mike Baird (http://www.bairdphotos.com/). Image was licensed under Creative Commons Attribution 2.0 Generic.\n"
  },
  {
    "path": "ch_inference_for_means/figures/run10SampTimeHistogram/run10SampTimeHistogram.R",
    "chars": 746,
    "preview": "library(openintro)\ndata(COL)\n\n\ndata(run10Samp)\nd <- run10Samp\n\nmyPDF(\"run10SampTimeHistogram.pdf\", 5, 2.8,\n      mar = c"
  },
  {
    "path": "ch_inference_for_means/figures/satImprovementHTDataHistogram/satImprovementHTDataHistogram.R",
    "chars": 379,
    "preview": "library(openintro)\ndata(COL)\n\nset.seed(2)\nx <- round(rnorm(30, 120, 70))\nt.test(x - 100)\nmean(x)\nsd(x)\n\nmyPDF('satImprov"
  },
  {
    "path": "ch_inference_for_means/figures/stemCellTherapyForHearts/stemCellTherapyForHearts.R",
    "chars": 1197,
    "preview": "library(openintro)\ndata(COL)\ndata(stem.cells)\nd <- stem.cells\n\nchange <- d$after - d$before\nt.test(change ~ d[,1])\n\nmyPD"
  },
  {
    "path": "ch_inference_for_means/figures/stemCellTherapyForHeartsPValue/stemCellTherapyForHeartsPValue.R",
    "chars": 487,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('stemCellTherapyForHeartsPValue.pdf', 3.9, 2.3,\n      mar = c(1.75, 1, 1, 1),\n      "
  },
  {
    "path": "ch_inference_for_means/figures/tDistAppendixTwoEx/tDistAppendixTwoEx.R",
    "chars": 417,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('tDistAppendixTwoEx.pdf', 6.8, 1.9,\n      mar = c(1.6, 1, 0.05, 1),\n      mgp = c(5,"
  },
  {
    "path": "ch_inference_for_means/figures/tDistCompareToNormalDist/tDistCompareToNormalDist.R",
    "chars": 851,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('tDistCompareToNormalDist.pdf', 5, 2.3,\n      mar = c(2, 1, 1, 1),\n      mgp = c(5, "
  },
  {
    "path": "ch_inference_for_means/figures/tDistConvergeToNormalDist/tDistConvergeToNormalDist.R",
    "chars": 797,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('tDistConvergeToNormalDist.pdf', 5.94, 2.53,\n      mar = c(2, 1, 1, 1),\n      mgp = "
  },
  {
    "path": "ch_inference_for_means/figures/tDistDF18LeftTail2Point10/tDistDF18LeftTail2Point10.R",
    "chars": 263,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('tDistDF18LeftTail2Point10.pdf', 4, 1.8,\n      mar = c(1.6, 1, 0.1, 1),\n      mgp = "
  },
  {
    "path": "ch_inference_for_means/figures/tDistDF20RightTail1Point65/tDistDF20RightTail1Point65.R",
    "chars": 425,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('tDistDF20RightTail1Point65.pdf', 6.8, 1.9,\n      mar = c(1.6, 1, 0.05, 1),\n      mg"
  },
  {
    "path": "ch_inference_for_means/figures/textbooksF18/diffInTextbookPricesF18.R",
    "chars": 574,
    "preview": "library(openintro)\ndata(textbooks)\ndata(COL)\n\nd <- as.numeric(na.omit(ucla_textbooks_f18$bookstore_new -\n    ucla_textbo"
  },
  {
    "path": "ch_inference_for_means/figures/textbooksF18/textbooksF18HTTails.R",
    "chars": 627,
    "preview": "library(openintro)\ndata(textbooks)\ndata(COL)\nd <- as.numeric(na.omit(ucla_textbooks_f18$bookstore_new -\n    ucla_textboo"
  },
  {
    "path": "ch_inference_for_means/figures/textbooksS10/diffInTextbookPricesS10.R",
    "chars": 486,
    "preview": "library(openintro)\ndata(textbooks)\ndata(COL)\n\nd <- textbooks\n\nmyPDF('diffInTextbookPricesS10.pdf', 6, 3,\n      mar = c(3"
  },
  {
    "path": "ch_inference_for_means/figures/textbooksS10/textbooksS10HTTails.R",
    "chars": 747,
    "preview": "library(openintro)\ndata(textbooks)\ndata(COL)\nd <- textbooks\n\nmyPDF('textbooksS10HTTails.pdf', 5, 1.6,\n      mar = c(1.7,"
  },
  {
    "path": "ch_inference_for_means/figures/textbooks_scatter/textbooks_scatter.R",
    "chars": 1203,
    "preview": "library(openintro)\nlibrary(xtable)\nlibrary(dplyr)\n\nd <- select(ucla_textbooks_f18,\n    subject, course_num, bookstore_ne"
  },
  {
    "path": "ch_inference_for_means/figures/toyANOVA/toyANOVA.R",
    "chars": 1234,
    "preview": "library(xtable)\nlibrary(openintro)\n\nby(toy_anova$outcome, toy_anova$group, mean)\n\n\nmyPDF(\"toyANOVA.pdf\",\n      mar = c(1"
  },
  {
    "path": "ch_inference_for_props/TeX/ch_inference_for_props.tex",
    "chars": 103385,
    "preview": "\\begin{chapterpage}{Inference for categorical data}\n  \\chaptertitle{Inference for categorical data}\n  \\label{inferenceFo"
  },
  {
    "path": "ch_inference_for_props/TeX/difference_of_two_proportions.tex",
    "chars": 16784,
    "preview": "\\exercisesheader{}\n\n% 17\n\n\\eoce{\\qt{Social experiment, Part I\\label{social_experiment_conditions}} A ``social \nexperimen"
  },
  {
    "path": "ch_inference_for_props/TeX/inference_for_a_single_proportion.tex",
    "chars": 13165,
    "preview": "\\exercisesheader{}\n\n% 1\n\n\\eoce{\\qt{Vegetarian college students\\label{veg_coll_students_CLT}} Suppose that 8\\% \nof colleg"
  },
  {
    "path": "ch_inference_for_props/TeX/review_exercises.tex",
    "chars": 13065,
    "preview": "\\reviewexercisesheader{}\n\n% 39\n\n\\eoce{\\qt{Active learning\\label{active_learning_HT_concept}} A teacher wanting to \nincre"
  },
  {
    "path": "ch_inference_for_props/TeX/testing_for_goodness_of_fit_using_chi-square.tex",
    "chars": 4151,
    "preview": "\\exercisesheader{}\n\n% 31\n\n\\eoce{\\qt{True or false, Part I\\label{tf_chisq_1}} Determine if the statements below \nare true"
  },
  {
    "path": "ch_inference_for_props/TeX/testing_for_independence_in_two-way_tables.tex",
    "chars": 4558,
    "preview": "\\exercisesheader{}\n\n% 35\n\n\\eoce{\\qt{Quitters\\label{quitters_chisq_independence}} Does being part of a \nsupport group aff"
  },
  {
    "path": "ch_inference_for_props/figures/arrayOfFigureAreasForChiSquareDistribution/chiSquareAreaAbove10WithDF4/chiSquareAreaAbove10WithDF4.R",
    "chars": 229,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('chiSquareAreaAbove10WithDF4.pdf', 5, 3,\n      mar = c(2, 1, 1, 1),\n      mgp = c(2."
  },
  {
    "path": "ch_inference_for_props/figures/arrayOfFigureAreasForChiSquareDistribution/chiSquareAreaAbove11Point7WithDF7/chiSquareAreaAbove11Point7WithDF7.R",
    "chars": 237,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('chiSquareAreaAbove11Point7WithDF7.pdf', 5, 3,\n      mar = c(2, 1, 1, 1),\n      mgp "
  },
  {
    "path": "ch_inference_for_props/figures/arrayOfFigureAreasForChiSquareDistribution/chiSquareAreaAbove4Point3WithDF2/chiSquareAreaAbove4WithDF2.R",
    "chars": 235,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('chiSquareAreaAbove4Point3WithDF2.pdf', 5, 3,\n      mar = c(2, 1, 1, 1),\n      mgp ="
  },
  {
    "path": "ch_inference_for_props/figures/arrayOfFigureAreasForChiSquareDistribution/chiSquareAreaAbove5Point1WithDF5/chiSquareAreaAbove5Point1WithDF5.R",
    "chars": 235,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('chiSquareAreaAbove5Point1WithDF5.pdf', 5, 3,\n      mar = c(2, 1, 1, 1),\n      mgp ="
  },
  {
    "path": "ch_inference_for_props/figures/arrayOfFigureAreasForChiSquareDistribution/chiSquareAreaAbove6Point25WithDF3/chiSquareAreaAbove6Point25WithDF3.R",
    "chars": 237,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('chiSquareAreaAbove6Point25WithDF3.pdf', 5, 3,\n      mar = c(2, 1, 1, 1),\n      mgp "
  },
  {
    "path": "ch_inference_for_props/figures/arrayOfFigureAreasForChiSquareDistribution/chiSquareAreaAbove9Point21WithDF3/chiSquareAreaAbove9Point21WithDF3.R",
    "chars": 237,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('chiSquareAreaAbove9Point21WithDF3.pdf', 5, 3,\n      mar = c(2, 1, 1, 1),\n      mgp "
  },
  {
    "path": "ch_inference_for_props/figures/bladesTwoSampleHTPValueQC/bladesTwoSampleHTPValueQC.R",
    "chars": 519,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('bladesTwoSampleHTPValueQC.pdf', 3.04, 1.56,\n      mar = c(2.4, 0, 0.5, 0),\n      mg"
  },
  {
    "path": "ch_inference_for_props/figures/chiSquareDistributionWithInceasingDF/chiSquareDistributionWithInceasingDF.R",
    "chars": 741,
    "preview": "library(openintro)\ndata(COL)\n\nmyPDF('chiSquareDistributionWithInceasingDF.pdf', 6.5, 3,\n      mar = c(2, 0.5, 0.25, 0.5)"
  },
  {
    "path": "ch_inference_for_props/figures/eoce/assisted_reproduction_one_sample_randomization/assisted_reproduction_one_sample_randomization.R",
    "chars": 1525,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# set sample size and number o"
  },
  {
    "path": "ch_inference_for_props/figures/eoce/egypt_revolution_one_sample_randomization/egypt_revolution_one_sample_randomization.R",
    "chars": 1501,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# set sample size and number o"
  },
  {
    "path": "ch_inference_for_props/figures/eoce/social_experiment_two_sample_randomization/social_experiment_two_sample_randomization.R",
    "chars": 1933,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# set number of simulations --"
  },
  {
    "path": "ch_inference_for_props/figures/eoce/yawning_two_sample_randomization/yawning_two_sample_randomization.R",
    "chars": 1722,
    "preview": "# load packages -----------------------------------------------------\nlibrary(openintro)\n\n# set number of simulations --"
  },
  {
    "path": "ch_inference_for_props/figures/geomFitEvaluationForSP500/geomFitEvaluationForSP500.R",
    "chars": 1291,
    "preview": "library(openintro)\nd <- sp500_1950_2018  # read.csv(\"sp500_1950_2018.csv\")\nd <- subset(d, \"2009-01-01\" <= as.Date(Date) "
  },
  {
    "path": "ch_inference_for_props/figures/geomFitEvaluationForSP500/sp500_1950_2018.csv",
    "chars": 1301875,
    "preview": "Date,Open,High,Low,Close,Adj Close,Volume\n1950-01-03,16.660000,16.660000,16.660000,16.660000,16.660000,1260000\n1950-01-0"
  }
]

// ... and 343 more files (download for full content)

About this extraction

This page contains the full source code of the OpenIntroOrg/openintro-statistics GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 543 files (5.2 MB), approximately 1.4M tokens. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Copied to clipboard!