Chapter 10 - Correlation and Regression

10-1 Correlation

  • correlation exists between 2 variables when the values of 1 variable are somehow associated with the values of the other variable

  • linear correlation exists between two variables when there is a correlation and the plotted points of paired data result in a pattern that can be approximated by a straight line

  • The closer r is to 1 or -1 (that is, the closer |r| is to 1), the stronger the linear correlation

  • The linear correlation coefficient r measures the strength of the linear correlation between the paired quantitative x values and y values in a sample

  • Properties of r:

    • The value of r is always between -1 and 1 inclusive

    • If all values of either variable are converted to a different scale, the value of r does not change

    • The value of r is not affected by the choice of x or y

    • r measures the strength of a linear relationship

    • r is very sensitive to outliers in the sense that a single outlier could dramatically affect its value

  • A p-value less than or equal to alpha supports the claim of a linear correlation

  • A p-value greater than alpha does not support the claim of a linear correlation

  • The value of r squared is the proportion of the variation in y that is explained by the linear relationship between x and y

  • CORRELATION DOES NOT IMPLY CAUSALITY

  • Common errors involving correlation: assuming that correlation implies causality, using data based on averages, ignoring the possibility of a nonlinear relationship

  • lurking variable is one that affects the variables being studied but is not included in the study

  • If conducting a formal hypothesis test, use H0: rho = 0 and Ha: rho not equal to 0, where rho is the linear correlation coefficient for the population (see the sketch below).
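
  • A minimal computational sketch of the ideas above, assuming Python with NumPy and SciPy available; the paired data are hypothetical, made up only to illustrate finding r, r squared, and the p-value for H0: rho = 0:

    import numpy as np
    from scipy import stats

    # Hypothetical paired sample data (x, y)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

    # Linear correlation coefficient r and two-sided p-value for H0: rho = 0
    r, p_value = stats.pearsonr(x, y)
    alpha = 0.05

    print(f"r = {r:.4f}, r^2 = {r**2:.4f}, p-value = {p_value:.4f}")
    if p_value <= alpha:
        print("p-value <= alpha: supports the claim of a linear correlation")
    else:
        print("p-value > alpha: does not support the claim of a linear correlation")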

10-2 Regression

  • The best-fitting line is called the regression line, and its equation is called the regression equation.

  • The regression equation: y hat = b0 + b1x

  • The regression equation algebraically describes the regression line. x is called the explanatory/predictor/independent variable, y is called the response/dependent variable, and y hat is the predicted value of y

  • The marginal change in a variable is the amount that it changes when the other variable changes by exactly 1 unit; for the regression line, the slope b1 is the marginal change in y hat when x increases by 1 unit

  • An outlier is a point lying far away from the other data points. Paired sample data may include one or more influential points, which are points that strongly affect the graph of the regression line

  • For a pair of sample x and y values, the residual is the difference between the observed sample value of y and the y value that is predicted by using the regression equation. That is, residual = observed y - predicted y = y - y hat

  • A straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible (illustrated in the sketch after this list)

  • residual plot is a scatterplot of the (x, y) values after each of the y-coordinate values has been replaced by the residual value y - y hat
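
  • A minimal sketch of fitting the regression equation and computing residuals, assuming Python with NumPy and SciPy and the same kind of hypothetical paired data as above:

    import numpy as np
    from scipy import stats

    # Hypothetical paired sample data (x, y)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

    # Least-squares slope b1 and intercept b0 for y hat = b0 + b1*x
    result = stats.linregress(x, y)
    b1, b0 = result.slope, result.intercept

    y_hat = b0 + b1 * x            # predicted y values
    residuals = y - y_hat          # observed y - predicted y

    print(f"y hat = {b0:.3f} + {b1:.3f}x")
    # The least-squares property: this sum of squared residuals is as small as possible
    print("sum of squared residuals:", np.sum(residuals ** 2))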

10-3 Prediction Intervals and Variation

  • prediction interval is a range of values used to estimate a variable

  • confidence interval is a range of values used to estimate a population parameter

  • The total deviation of (x, y) is the vertical distance y - y bar, which is the distance between the point (x, y) and the horizontal line passing through the sample mean y bar.

  • The explained deviation is the vertical distance y hat - y bar

  • The unexplained deviation is the vertical distance y - y hat

  • Total variation = explained variation + unexplained variation

  • The coefficient of determination is the proportion of the variation in y that is explained by the regression line. r^2 = explained variation / total variation (verified numerically in the sketch below)
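
  • A minimal sketch of the variation breakdown, assuming Python with NumPy and hypothetical paired data; it checks that total variation = explained variation + unexplained variation and that r^2 = explained variation / total variation:

    import numpy as np

    # Hypothetical paired sample data (x, y)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

    b1, b0 = np.polyfit(x, y, 1)   # least-squares slope and intercept
    y_hat = b0 + b1 * x
    y_bar = y.mean()

    total_var       = np.sum((y - y_bar) ** 2)      # sum of squared total deviations
    explained_var   = np.sum((y_hat - y_bar) ** 2)  # sum of squared explained deviations
    unexplained_var = np.sum((y - y_hat) ** 2)      # sum of squared unexplained deviations

    print("total variation:", total_var)
    print("explained + unexplained:", explained_var + unexplained_var)
    print("coefficient of determination r^2 =", explained_var / total_var)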

10-4 Multiple Regression

  • multiple regression equation expresses a linear relationship between a response variable y and two or more predictor variables (x1, x2, ... xk)

  • The adjusted coefficient of determination is the multiple coefficient of determination R^2 modified to account for the number of variables and the sample size

  • dummy variable is a variable having only the values 0 and 1, used to represent the two categories of a qualitative variable (one appears in the sketch below)
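
  • A minimal sketch of a multiple regression with one quantitative predictor and one dummy variable, assuming Python with NumPy and hypothetical data; it also computes the adjusted coefficient of determination, adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the sample size and k is the number of predictor variables:

    import numpy as np

    # Hypothetical data: x1 is quantitative, x2 is a 0/1 dummy variable
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    x2 = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
    y  = np.array([2.0, 4.5, 5.9, 8.6, 9.8, 12.4, 13.9, 16.7])

    # Design matrix with an intercept column; solve for b0, b1, b2 by least squares
    X = np.column_stack([np.ones_like(x1), x1, x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ b

    n, k = len(y), 2
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print("b0, b1, b2 =", b)
    print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")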

10-5 Nonlinear Regression

  • Not all relationships are linear; nonlinear functions such as logarithmic, power, quadratic, and exponential models can also fit sample data

  • To identify a good mathematical model: look for a pattern in the graph, compare values of R^2, and think about whether the model makes sense (see the comparison sketch below)
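
  • A minimal sketch of comparing candidate models by R^2, assuming Python with NumPy and a hypothetical data set that follows a roughly quadratic pattern:

    import numpy as np

    def r_squared(y, y_hat):
        # Proportion of the variation in y explained by the fitted model
        return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

    # Hypothetical data with a roughly quadratic pattern
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
    y = np.array([1.2, 4.1, 9.3, 15.8, 25.2, 35.9, 49.1])

    linear    = np.poly1d(np.polyfit(x, y, 1))   # linear model
    quadratic = np.poly1d(np.polyfit(x, y, 2))   # quadratic model

    print("linear    R^2 =", round(r_squared(y, linear(x)), 4))
    print("quadratic R^2 =", round(r_squared(y, quadratic(x)), 4))
    # Prefer the model whose R^2 is larger and whose shape matches the pattern in the graph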
