Statistics Notes
Author: kmo1326 • February 13, 2018 • Course Note
Chapter 1: Stats Start Here
Definitions:
- Context: Who, what, where, when, how.
- Case: An individual about whom, or which, we have data.
- Respondents: People who answer, or respond to, a survey.
- Subjects/Participants: A human experimental unit.
- Experimental Units: An individual in a study for which, or for whom, data values are recorded. Humans are typically not referred to with this label.
- Records: Information about an individual in a database.
- Sample: A subset of a population that we use to gather information in hopes of learning about the population as a whole.
Types of Variables:
- Categorical/Nominal: A variable that states which group, category, or name (nominal) each case belongs to.
- Quantitative: A variable that contains numerical values with measured units.
- Identifiers: A variable that contains a unique, one-of-a-kind identifier for each record, e.g. a student ID number. Not useful for statistics, but very useful for sorting and identifying records, especially in big data environments.
- Ordinal: A variable used to rank a sample of individuals by some characteristic. Typically, but not necessarily, ranked on a scale. The gaps between different points on the scale may not be equal.
Chapter 2: Displaying and Describing Categorical Data
Principles:
- Three Tools of Data Analysis: 1. Make a picture 2. Make a picture 3. Make a picture
- The Area Principle: In a statistical display, each data value should be represented by the same amount of area.
- Simpson’s Paradox: When averages (or percentages) are taken within different groups, they can appear to contradict the overall average, because the groups contribute unequal numbers of cases.
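Simpson's Paradox is easier to see with numbers. The sketch below uses made-up on-time landing counts for two hypothetical pilots (the names `pilot_a` and `pilot_b` and all counts are assumptions for illustration, not data from the course). One pilot has the better rate in each group, yet the other has the better overall rate, because their flights are split very differently between day and night.

```python
# Hypothetical counts (made up for illustration): (on_time, total) per group.
flights = {
    "pilot_a": {"day": (90, 100), "night": (10, 20)},
    "pilot_b": {"day": (19, 20), "night": (75, 100)},
}

def rate(on_time, total):
    """Proportion of flights that were on time."""
    return on_time / total

def overall_rate(groups):
    """Pool the counts across groups before dividing."""
    on_time = sum(k for k, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return on_time / total

for pilot, groups in flights.items():
    by_group = {g: round(rate(*counts), 2) for g, counts in groups.items()}
    print(pilot, by_group, "overall:", round(overall_rate(groups), 2))

# pilot_b is better within each group (0.95 vs 0.90 by day, 0.75 vs 0.50 by night),
# but pilot_a is better overall (100/120 vs 94/120).
```

The reversal comes from the unequal group sizes: pilot_a flies mostly by day, when on-time rates are high for everyone, while pilot_b flies mostly at night.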
Definitions:
- Distribution: The distribution of a variable gives:
- The possible values of the variable
- The relative frequency of each value
- Marginal Distribution: In a contingency table, the distribution of either variable alone is called the marginal distribution. The counts or percentages are the totals found in the margins (last row or column) of the table.
- Row variable total divided by overall total.
- Conditional Distribution: The distribution of a variable when the Who is restricted to a smaller group of individuals. Pg. 22
- Looks at the subsets within a single variable and compares those subsets to the variable's total value. Pg. 23
- Restrict attention to a single row or column. Add up the cells in that row and work out each cell's percentage of the row total. This shows the distribution of one variable for just the cases in that row, and so how it is affected by the other variable.
- Independence: Variables are said to be independent if the conditional distribution of one variable is the same for each category of the other.
- Categorical Data Condition: Always check that the data are counts or percentages of individuals in categories before making pie or bar charts.
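The marginal distribution, conditional distribution, and independence definitions above can be sketched on a small contingency table. The counts below are made up for illustration, and `looks_independent` with its tolerance `tol` is a hypothetical helper, not a formal statistical test.

```python
# Hypothetical counts: rows are outcome, columns are ticket class.
table = {
    "alive": {"first": 40, "second": 30, "third": 30},
    "dead":  {"first": 10, "second": 30, "third": 60},
}

grand_total = sum(sum(cells.values()) for cells in table.values())

# Marginal distribution of the row variable: row total / overall total.
marginal_rows = {
    row: sum(cells.values()) / grand_total for row, cells in table.items()
}

# Conditional distribution of class within each row: cell / row total.
conditional_by_row = {
    row: {col: n / sum(cells.values()) for col, n in cells.items()}
    for row, cells in table.items()
}

def looks_independent(cond, tol=0.01):
    """Rough check: are the conditional distributions the same in every row?"""
    rows = list(cond.values())
    return all(abs(rows[0][c] - row[c]) <= tol for row in rows for c in rows[0])

print(marginal_rows)
print(conditional_by_row)
print(looks_independent(conditional_by_row))
```

Here the class breakdown differs sharply between the two rows, so the variables do not look independent in this made-up table.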
Types of Tables and Charts for Categorical Data:
- Frequency Table: Lists categories in a categorical variable and gives the count of observations for each category.
- Relative Frequency Table: Displays the percentages, rather than counts, of the values in each category.
- Bar Chart: Shows a bar whose area represents the count (or percentage) of observations for each category of a categorical variable.
- Bin: Refers to each bar on the chart (it looks like a bin). When counting the values within a numerical range, count from the lower bound up to, but not including, the upper bound. E.g., for the bin range 1-7, count all values from 1 up to, but not including, 7.
- Relative Frequency Bar Chart: Replaces count with percentages.
- Segmented Bar Chart: Works the same way as a pie chart, but the percentages are stacked within a single bar instead of a circle. Treats each bar as the “whole” and divides it proportionally into segments corresponding to the percentage in each group.
- Pie Chart: Shows how a “whole” divides into categories by showing a wedge of a circle whose area corresponds to the proportion in each category.
- Contingency Table: Displays how one variable is contingent upon the value of the other. Displays counts, and sometimes percentages, of individuals falling into named categories on two or more variables. The table categorizes the individuals on all variables at once to reveal possible patterns in one variable that may be contingent on the category of the other.
- Takes two categorical variables and places each on an axis. For instance, survival of passengers and class of passengers. Survival is placed on the rows and broken into subcategories: Alive, Dead, Total. Class is placed on the columns and broken into subcategories: First, Second, Third, Crew, Total.
- Each cell gives the count for a combination of values within the two variables.
- Margins of the table, both on the right and bottom, display totals.
- Marginal Distribution: The frequency distribution of one of the variables, as displayed in the margins of a contingency table.
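As a minimal sketch of how these tables are built from raw records, the example below tallies a frequency table, a relative frequency table, and a contingency table with its margins. The records are made up for illustration and loosely follow the survival-by-class example above; `collections.Counter` is just one convenient way to do the counting.

```python
from collections import Counter

# Hypothetical records (made up): one (class, outcome) pair per individual.
records = [
    ("first", "alive"), ("first", "alive"), ("first", "dead"),
    ("third", "dead"), ("third", "dead"), ("third", "alive"),
    ("crew", "dead"), ("crew", "dead"),
]

# Frequency table: count of observations in each category of one variable.
freq = Counter(cls for cls, _ in records)

# Relative frequency table: proportions instead of counts.
n = len(records)
rel_freq = {cls: count / n for cls, count in freq.items()}

# Contingency table: count for each combination of the two variables.
contingency = Counter(records)

# Margins: totals for each variable on its own.
row_totals = Counter(outcome for _, outcome in records)  # by outcome
col_totals = freq                                        # by class

print(dict(freq))
print(rel_freq)
print(contingency[("first", "alive")], dict(row_totals))
```

Dividing `row_totals` or `col_totals` by `n` gives the marginal distribution of that variable as proportions.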
Chapter 3: Displaying and Summarizing Quantitative Data
Quantitative Data Condition: Always check whether data is quantitative. A good way to be sure is to report the measurement units.
Quantitative Variables - Make and Interpret displays of the distribution:
- Histogram: Uses adjacent bars to show the distribution of quantitative variables. Each bar represents the frequency (or relative frequency) of values falling in each bin.
- Stem-and-leaf: A vertical chart made by placing the tens place on the left (stem) and dividing it from the ones place (leaf) with a vertical line. The tens count up in increments, and each entry's ones digit is written horizontally next to its corresponding tens digit. An example of a single stem block: 5 | 0444. This translates to four entries: 50, 54, 54, 54.
- Dotplot: A simple display that places a dot along an axis for each case in the data. Each record that shares the same value as a previous entry is stacked adjacent to it.
- Distribution: For quantitative variables, slices up all the possible values of the variable into equal-width bins and gives the number of values (or counts) falling into each bin.
- Gap: A region where there are no values.
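The stem-and-leaf layout and the equal-width bins behind a histogram can both be sketched in a few lines. The data below are hypothetical two-digit values (the first four match the 5 | 0444 example above), and `stem_and_leaf` and `bin_counts` are illustrative helper names, not functions from any library.

```python
from collections import defaultdict

# Hypothetical two-digit data; there are no values in the 70s, which leaves a gap.
values = [50, 54, 54, 54, 61, 67, 88]

def stem_and_leaf(data):
    """Return one 'stem | leaves' line per tens digit, including empty stems."""
    leaves = defaultdict(list)
    for v in sorted(data):
        leaves[v // 10].append(str(v % 10))
    lo, hi = min(leaves), max(leaves)
    return [f"{stem} | {''.join(leaves[stem])}" for stem in range(lo, hi + 1)]

def bin_counts(data, start, width, n_bins):
    """Count values in equal-width bins [start, start + width), and so on."""
    counts = [0] * n_bins
    for v in data:
        idx = (v - start) // width
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts

for line in stem_and_leaf(values):
    print(line)

print(bin_counts(values, start=50, width=10, n_bins=4))  # bins 50s, 60s, 70s, 80s
```

The empty stem for 7 and the zero count in the third bin both show the same gap in the data.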
Shape and how to describe it:
- Symmetric: Has roughly the same shape reflected around the center. Describe using mean.
- Skewed: Distribution extends farther on one side than another. Use median to describe.
- Mean and median are equal if the distribution is symmetric in shape.
- Mean > median: skewed right. The greater portion of the data is on the left.
- Mean < median: skewed left. The greater portion of the data is on the right.
- Unimodal: Distribution has a single major hump or mode. Describe using mean.
- Bimodal & Multimodal: A bimodal distribution has two major humps or modes; a multimodal distribution has more than two. Whether to describe it with the mean or the median depends on whether it is skewed or symmetric.
- Outliers: Values that lie far from the rest of the data. Use median to describe.
- Unusual Feature: Report anything out of the ordinary with the distribution, such as gaps.
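The relationship between shape, mean, and median described above can be checked on small made-up data sets. This is a minimal sketch using Python's standard `statistics` module; the three lists are hypothetical values chosen only to show the pattern.

```python
from statistics import mean, median

# Hypothetical data sets, made up for illustration.
symmetric = [2, 3, 4, 5, 6]
right_skewed = [2, 3, 4, 5, 46]    # one large value stretches the right tail
left_skewed = [1, 40, 41, 42, 43]  # one small value stretches the left tail

for name, data in [
    ("symmetric", symmetric),
    ("right_skewed", right_skewed),
    ("left_skewed", left_skewed),
]:
    print(name, "mean:", mean(data), "median:", median(data))
```

In the skewed lists the single extreme value pulls the mean toward the tail while the median stays near the bulk of the data, which is why the median is the safer summary when a distribution is skewed or has outliers.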
Summarize Center of Distribution with mean and median - When best to use each:
...