Chapter 10 - Correlation and Regression

10-1 Correlation

  • correlation exists between 2 variables when the values of 1 variable are somehow associated with the values of the other variable

  • linear correlation exists between two variables when there is a correlation and the plotted points of paired data result in a pattern that can be approximated by a straight line

  • The closer r is to 1 or -1 (that is, the closer |r| is to 1), the stronger the linear correlation

  • The linear correlation coefficient r measures the strength of the linear correlation between the paired quantitative x values and y values in a sample

  • Properties of r:

    • The value of r is always between -1 and 1 inclusive

    • If all values of either variable are converted to a different scale, the value of r does not change

    • The value of r is not affected by the choice of x or y

    • r measures the strength of a linear relationship

    • r is very sensitive to outliers in the sense that a single outlier could dramatically affect its value

  • A p-value less than or equal to alpha supports the claim of a linear correlation

  • A p-value greater than alpha does not support the claim of a linear correlation

  • The value of r squared is the proportion of the variation in y that is explained by the linear relationship between x and y

  • CORRELATION DOES NOT IMPLY CAUSALITY

  • Common errors involving correlation: assuming that correlation implies causality, using data based on averages, ignoring the possibility of a nonlinear relationship

  • lurking variable is one that affects the variables being studied but is not included in the study

  • If conducting a formal hypothesis test, use H0: rho = 0 and Ha: rho not equal to 0, where rho is the linear correlation coefficient for the population (see the sketch below).
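
  • A minimal computational sketch of the ideas above, assuming Python with NumPy and SciPy available; the paired data are hypothetical, made up only to illustrate finding r, r squared, and the p-value for H0: rho = 0:

    import numpy as np
    from scipy import stats

    # Hypothetical paired sample data (x, y)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

    # Linear correlation coefficient r and two-sided p-value for H0: rho = 0
    r, p_value = stats.pearsonr(x, y)
    alpha = 0.05

    print(f"r = {r:.4f}, r^2 = {r**2:.4f}, p-value = {p_value:.4f}")
    if p_value <= alpha:
        print("p-value <= alpha: supports the claim of a linear correlation")
    else:
        print("p-value > alpha: does not support the claim of a linear correlation")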

10-2 Regression

  • The best-fitting line is called the regression line, and its equation is called the regression equation.

  • The regression equation: y hat = b0 + b1x

  • The regression equation algebraically describes the regression line. x is called the explanatory/predictor/independent variable, y is called the response/dependent variable, and y hat is the predicted value of y

  • The marginal change in a variable is the amount that it changes when the other variable changes by exactly 1 unit; for the regression line, the slope b1 is the marginal change in y hat when x increases by 1 unit

  • An outlier is a point lying far away from the other data points. Paired sample data may include one or more influential points, which are points that strongly affect the graph of the regression line

  • For a pair of sample x and y values, the residual is the difference between the observed sample value of y and the y value that is predicted by using the regression equation. That is, residual = observed y - predicted y = y - y hat

  • A straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible (illustrated in the sketch after this list)

  • residual plot is a scatterplot of the (x, y) values after each of the y-coordinate values has been replaced by the residual value y - y hat
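
  • A minimal sketch of fitting the regression equation and computing residuals, assuming Python with NumPy and SciPy and the same kind of hypothetical paired data as above:

    import numpy as np
    from scipy import stats

    # Hypothetical paired sample data (x, y)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

    # Least-squares slope b1 and intercept b0 for y hat = b0 + b1*x
    result = stats.linregress(x, y)
    b1, b0 = result.slope, result.intercept

    y_hat = b0 + b1 * x            # predicted y values
    residuals = y - y_hat          # observed y - predicted y

    print(f"y hat = {b0:.3f} + {b1:.3f}x")
    # The least-squares property: this sum of squared residuals is as small as possible
    print("sum of squared residuals:", np.sum(residuals ** 2))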

10-3 Prediction Intervals and Variation

  • prediction interval is a range of values used to estimate a variable

  • confidence interval is a range of values used to estimate a population parameter

  • The total deviation of (x, y) is the vertical distance y - y bar, which is the distance between the point (x, y) and the horizontal line passing through the sample mean y bar.

  • The explained deviation is the vertical distance y hat - y bar

  • The unexplained deviation is the vertical distance y - y hat

  • Total variation = explained variation + unexplained variation

  • The coefficient of determination is the proportion of the variation in y that is explained by the regression line. r^2 = explained variation / total variation (verified numerically in the sketch below)
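
  • A minimal sketch of the variation breakdown, assuming Python with NumPy and hypothetical paired data; it checks that total variation = explained variation + unexplained variation and that r^2 = explained variation / total variation:

    import numpy as np

    # Hypothetical paired sample data (x, y)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

    b1, b0 = np.polyfit(x, y, 1)   # least-squares slope and intercept
    y_hat = b0 + b1 * x
    y_bar = y.mean()

    total_var       = np.sum((y - y_bar) ** 2)      # sum of squared total deviations
    explained_var   = np.sum((y_hat - y_bar) ** 2)  # sum of squared explained deviations
    unexplained_var = np.sum((y - y_hat) ** 2)      # sum of squared unexplained deviations

    print("total variation:", total_var)
    print("explained + unexplained:", explained_var + unexplained_var)
    print("coefficient of determination r^2 =", explained_var / total_var)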

10-4 Multiple Regression

  • multiple regression equation expresses a linear relationship between a response variable y and two or more predictor variables (x1, x2, ... xk)

  • The adjusted coefficient of determination is the multiple coefficient of determination R^2 modified to account for the number of variables and the sample size

  • dummy variable is a variable having only the values 0 and 1, used to represent the two categories of a qualitative variable (one appears in the sketch below)
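
  • A minimal sketch of a multiple regression with one quantitative predictor and one dummy variable, assuming Python with NumPy and hypothetical data; it also computes the adjusted coefficient of determination, adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the sample size and k is the number of predictor variables:

    import numpy as np

    # Hypothetical data: x1 is quantitative, x2 is a 0/1 dummy variable
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    x2 = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
    y  = np.array([2.0, 4.5, 5.9, 8.6, 9.8, 12.4, 13.9, 16.7])

    # Design matrix with an intercept column; solve for b0, b1, b2 by least squares
    X = np.column_stack([np.ones_like(x1), x1, x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ b

    n, k = len(y), 2
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print("b0, b1, b2 =", b)
    print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")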

10-5 Nonlinear Regression

  • Not all relationships are linear; nonlinear functions such as logarithmic, power, quadratic, and exponential models can also fit sample data

  • To identify a good mathematical model: look for a pattern in the graph, compare values of R^2, and think about whether the model makes sense (see the comparison sketch below)
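
  • A minimal sketch of comparing candidate models by R^2, assuming Python with NumPy and a hypothetical data set that follows a roughly quadratic pattern:

    import numpy as np

    def r_squared(y, y_hat):
        # Proportion of the variation in y explained by the fitted model
        return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

    # Hypothetical data with a roughly quadratic pattern
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
    y = np.array([1.2, 4.1, 9.3, 15.8, 25.2, 35.9, 49.1])

    linear    = np.poly1d(np.polyfit(x, y, 1))   # linear model
    quadratic = np.poly1d(np.polyfit(x, y, 2))   # quadratic model

    print("linear    R^2 =", round(r_squared(y, linear(x)), 4))
    print("quadratic R^2 =", round(r_squared(y, quadratic(x)), 4))
    # Prefer the model whose R^2 is larger and whose shape matches the pattern in the graph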
