# Correlation and Regression Analysis

## Correlation Analysis

Correlation analysis is used to identify how closely two variables are related to each other. A numeric value ranging from -1 to +1 indicates whether the correlation between the two variables is positive or negative, as well as the strength of the relationship. The closer the correlation is to -1 or +1, the stronger the relationship. A correlation of zero indicates that no linear relationship exists.

One of the more common uses of correlation in marketing research is customer satisfaction studies.

Example:  A multi-level marketing company selling home décor products and accessories conducted a survey of its in-home demonstrators regarding their level of satisfaction with several key factors related to their experience with the company.

Q: Using the following 1 to 7 scale, where 1 means “totally disagree” and 7 means “totally agree,” please indicate your level of agreement with the following statements:

• The call center answers my questions
• The demonstrator support website is easy to use
• The training I have received is effective
• I am supported by my upline
• I am satisfied with the variety of products available for me to offer
• Products are shipped in a timely manner
• I feel like [company name] cares about me personally
• [Company name] is responsive to problems

A question related to overall satisfaction is also included using the same scale.

Q: Using the following 1 to 7 scale, where 1 means “totally dissatisfied” and 7 means “totally satisfied,” please rate your overall satisfaction with [company name].

Running a correlation analysis between overall satisfaction and each of the other variables will allow us to measure the strength of the relation each variable has to overall satisfaction, and thus identify which ones appear to have the greatest impact/influence on overall satisfaction.
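As a rough illustration of this step, the sketch below computes the Pearson correlation between overall satisfaction and each driver, then ranks the drivers by strength. The variable names and all of the ratings are invented for the example; they are not the survey's actual data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Cross-products and squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)

# Hypothetical 1 to 7 ratings from eight respondents (illustration only)
drivers = {
    "call_center":    [5, 6, 4, 7, 3, 5, 6, 2],
    "website":        [4, 5, 5, 6, 3, 4, 6, 3],
    "cares_about_me": [6, 7, 3, 7, 2, 5, 6, 1],
}
overall = [6, 7, 4, 7, 2, 5, 6, 2]

correlations = {name: pearson_r(scores, overall) for name, scores in drivers.items()}
# Strongest relationship first, judged by absolute value
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)
for name, r in ranked:
    print(f"{name}: r = {r:.2f}")
```

Ranking by absolute value keeps a strong negative correlation from being treated as a weak one.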

If we were to plot the results of the correlation between two variables it may look something like this:

As you can see there is a degree of linearity between the two.  In this case the correlation is .63, indicating a moderately strong relationship.

A simple correlation analysis can be run in an Excel spreadsheet. However, a more advanced statistical analysis program like SPSS will provide more complete output, including a “Sig.” value that indicates statistical significance.
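The significance of a Pearson correlation is commonly assessed with a t statistic that has n - 2 degrees of freedom. The sketch below computes that statistic for the correlation of .63 mentioned above. The sample size of 100 is an assumption made for illustration, since the number of respondents is not given here.

```python
from math import sqrt

def correlation_t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation differs from zero.

    r is the sample correlation and n the number of paired observations.
    The statistic has n - 2 degrees of freedom.
    """
    if n < 3:
        raise ValueError("at least 3 observations are required")
    if abs(r) >= 1:
        raise ValueError("r must be strictly between -1 and 1")
    return r * sqrt((n - 2) / (1 - r ** 2))

r = 0.63
n = 100  # assumed sample size, not taken from the survey
t = correlation_t_statistic(r, n)
print(f"t = {t:.2f} with {n - 2} degrees of freedom")
```

With 98 degrees of freedom, the two-tailed critical value at the .05 level is roughly 1.98, so a t statistic well above that would correspond to a “Sig.” value below .05.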

It is important to realize that correlation does not establish cause and effect. Correlations are often used erroneously to claim a causal relationship. A correlation may suggest a causal linkage between the variables, but it does not prove one. Correlations that are statistically real but occur between variables with no causal connection are known as spurious correlations.

Partial Correlation: A partial correlation coefficient measures the association between two variables after controlling for the effects of one or more additional variables.  For example, we may want to measure the correlation between sales of a product and the amount spent on advertising that product while controlling for price.
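When only one variable is being controlled for, the partial correlation can be computed directly from the three pairwise correlations. The sketch below applies that first-order formula to the sales, advertising, and price example. The three input correlations are invented for illustration.

```python
from math import sqrt

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    denominator = sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        raise ValueError("undefined when x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denominator

# Hypothetical pairwise correlations (illustration only)
r_sales_advertising = 0.60
r_sales_price = -0.50
r_advertising_price = -0.40

r_partial = partial_correlation(r_sales_advertising, r_sales_price, r_advertising_price)
print(f"sales vs. advertising, controlling for price: {r_partial:.3f}")
```

If the control variable is uncorrelated with both of the other variables, the partial correlation equals the ordinary correlation.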

For ordinal (ranked) data, there are two main types of rank correlation: Spearman’s rho and Kendall’s tau. Spearman’s is preferable when the number of categories is large relative to the number of cases, and Kendall’s when a large number of cases fall into a relatively small number of categories.
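To make the two measures concrete, here is a minimal pure-Python sketch. Spearman's rho is computed as the Pearson correlation of the ranks (with tied values sharing an average rank), and Kendall's tau is computed in its tau-b form, which adjusts for ties. The choice of tau-b and the sample data are assumptions made for this illustration.

```python
from itertools import combinations
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall's tau-b: based on concordant and discordant pairs, adjusted for ties."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from every count
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt((untied + ties_x_only) * (untied + ties_y_only))

# Made-up ordinal data: the two rankings agree except for the last two items
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 5, 4]
rho = spearman_rho(x, y)
tau = kendall_tau_b(x, y)
print(f"rho = {rho:.2f}, tau = {tau:.2f}")
```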

## Multiple Regression Analysis

Multiple regression is used to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable. For example, a university may want to know what factors contribute most to the successful graduation of students. A successful graduation would be the dependent variable, and factors such as the choice of major, the level of the student’s social involvement, marital status, financial situation, employment, age, and GPA would be the independent or predictor variables. The goal would be to learn which of these factors contribute to the likelihood of a student graduating.

In another example, we may want to understand how variations in sales (the dependent variable) are explained by variations in advertising expenditures, the item price, and changes in packaging (the independent variables).
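A minimal sketch of how such a model can be fitted by ordinary least squares is shown below, using only the Python standard library. The data are invented so that sales follow an exact linear rule in advertising and price. Packaging changes are left out to keep the example short, and a statistics package would normally be used in practice.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def fit_ols(rows, y):
    """Return [constant, b1, ..., bk] from the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(row) for row in rows]  # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data: (advertising spend, price) per period, illustration only.
# Sales were generated as 10 + 2 * advertising - 3 * price.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
sales = [6, 11, 4, 9, 5]

coefficients = fit_ols(predictors, sales)
constant, b_advertising, b_price = coefficients
print(f"constant={constant:.2f}, advertising={b_advertising:.2f}, price={b_price:.2f}")
```

Because the made-up sales figures contain no noise, the fit recovers the generating rule. With real data the coefficients would be estimates, each with its own standard error.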

Multiple regression allows us to answer the question “what are the best predictors of…”. While correlation analysis looks at the strength of the relationship between two variables, regression analysis looks at the combined effect all of the predictor variables have on the dependent variable. It estimates the magnitude, relative importance, and statistical significance of the contribution of each predictor variable to the dependent variable. Correlation describes a relationship, while regression analysis predicts a value. Regression analysis can be used to forecast sales, profitability, market share, buying patterns, and the impact of marketing programs.

Example: We will continue with the previous example from the correlation analysis about the impact of variables on demonstrator satisfaction. In the SPSS output, the first thing we see is a Model Summary. The R value in this table is called the multiple correlation coefficient because it looks at the association of all the variables together. It ranges from 0 to 1. The closer it is to 1, the stronger the influence all the independent (predictor) variables together have on the dependent variable. When expressed as R squared (also called the coefficient of determination), it gives the percent of the variance in the dependent variable that can be predicted by the combination of the independent variables. In this example, 49% of the variance of the dependent variable is explained by the independent variables. The remainder of the variation (51%) is unaccounted for. A small R squared value indicates that the model is not a good fit. Adjusted R squared takes into consideration the number of observations and the number of predictor variables. It is a fairer measure than R squared when comparing models, because R squared never decreases when a variable is added, whereas adjusted R squared penalizes the addition of irrelevant variables.
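The relationship between R, R squared, and adjusted R squared can be sketched as follows. The R value of .697 is the one reported for this example, and the eight predictors correspond to the eight survey statements. The sample size of 200 is an assumption made purely for illustration.

```python
def adjusted_r_squared(r2, n_observations, n_predictors):
    """Adjusted R squared for a model with an intercept."""
    degrees_of_freedom = n_observations - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n_observations - 1) / degrees_of_freedom

multiple_r = 0.697            # from the Model Summary in the example
r2 = multiple_r ** 2          # share of variance explained
unexplained = 1 - r2          # share of variance left unaccounted for

# Assumed sample size; the eight predictors are the eight survey statements
adj_r2 = adjusted_r_squared(r2, n_observations=200, n_predictors=8)

print(f"R squared = {r2:.3f}, unexplained = {unexplained:.3f}, adjusted = {adj_r2:.3f}")
```

Adjusted R squared is always at or below R squared, and the gap widens as predictors are added without a matching gain in explained variance.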

In this case, an R value of .697 (an R squared of about .49) indicates that these variables do a moderately good job of predicting overall satisfaction.

The ANOVA table tests for significance. The F test is used to test the null hypothesis that there is no association between the predictor variables and the dependent variable. The F value should be large and the “Sig.” value should be below .05 for the model to be considered significant. In this case the “Sig.” value is reported as 0 (in practice, a value too small to display rather than exactly zero), indicating that the model as a whole is significantly related to the dependent variable.
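The overall F statistic can be written in terms of R squared, the number of predictors k, and the number of observations n. The sketch below uses the R value of .697 from this example together with an assumed sample size of 200 and eight predictors. The sample size is hypothetical.

```python
def f_statistic(r2, n_observations, n_predictors):
    """Overall F statistic for a regression with an intercept.

    F = (R^2 / k) / ((1 - R^2) / (n - k - 1)),
    with k and n - k - 1 degrees of freedom.
    """
    residual_df = n_observations - n_predictors - 1
    if residual_df <= 0:
        raise ValueError("need more observations than predictors plus one")
    if not 0 <= r2 < 1:
        raise ValueError("r2 must be in [0, 1)")
    return (r2 / n_predictors) / ((1 - r2) / residual_df)

r2 = 0.697 ** 2
f_value = f_statistic(r2, n_observations=200, n_predictors=8)  # n is assumed
print(f"F = {f_value:.1f} with 8 and 191 degrees of freedom")
```

A larger R squared or a larger sample both push F upward, which is why a modest R squared can still be highly significant in a big survey.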

The coefficients measure how much each of the individual variables contributes to overall satisfaction. The B value indicates the change in the value of the dependent variable for each one-unit increase in the predictor, with the other predictors held constant. However, this is an unstandardized coefficient. The Beta is a standardized measure: the larger its absolute value, the stronger the predictor. This is the number that should be used for comparing predictors. The t value tests whether each coefficient differs from zero, and it is the basis for that variable’s “Sig.” value. As stated earlier, “Sig.” should be less than .05 for 95% confidence that the variable contributes to explaining the dependent variable. The “constant” is the predicted value of the dependent variable if all the independent variables were zero.
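The link between B and Beta is a rescaling by standard deviations: Beta equals B multiplied by the standard deviation of the predictor and divided by the standard deviation of the dependent variable. A small sketch follows; the B value and both standard deviations are hypothetical numbers chosen for illustration.

```python
def standardized_beta(b, sd_predictor, sd_dependent):
    """Convert an unstandardized coefficient B into a standardized Beta."""
    if sd_predictor < 0 or sd_dependent <= 0:
        raise ValueError("standard deviations must be positive")
    return b * sd_predictor / sd_dependent

# Hypothetical values (illustration only)
b_value = 0.063          # change in the dependent variable per one-unit change
sd_predictor = 1.5       # spread of the predictor ratings
sd_dependent = 1.2       # spread of the overall satisfaction ratings

beta = standardized_beta(b_value, sd_predictor, sd_dependent)
print(f"Beta = {beta:.4f}")
```

Because Beta is expressed in standard deviation units, predictors measured on different scales can be compared directly.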

Regression analysis is similar to correlation analysis in that it looks at the individual contribution of each variable. However, in regression the contribution of each variable is calculated in combination with the other variables, that is, after accounting for their effects.

Examining the results, we see that the degree to which demonstrators feel that the company cares about them personally is the best predictor of overall satisfaction, followed by how satisfied demonstrators are with the products. With regression analysis we are able to measure the degree to which each variable contributes to overall satisfaction, such as .063 for the call center and .079 for the website. This is the average change we can expect in the overall satisfaction score given a one-unit change in each independent variable. So if call center satisfaction averaged 3 on our 1 to 7 scale, and we improved that to an average of 4, we would expect the predicted overall satisfaction score to rise by .063, with the other variables held constant. The constant itself (.235) does not change.
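The arithmetic behind a one-unit change can be sketched with the regression equation itself. The sketch below uses only the constant (.235) and the two B values quoted in the text. The remaining six coefficients are not given here and are left out, so the predicted scores are incomplete and only the difference between them is meaningful.

```python
CONSTANT = 0.235
# Only the two B values quoted in the text; the other predictors are omitted
COEFFICIENTS = {"call_center": 0.063, "website": 0.079}

def predict(ratings):
    """Constant plus the sum of coefficient * rating for each predictor."""
    return CONSTANT + sum(COEFFICIENTS[name] * value for name, value in ratings.items())

# Hypothetical average ratings before and after a call center improvement
before = predict({"call_center": 3, "website": 4})
after = predict({"call_center": 4, "website": 4})
change = after - before

print(f"change in predicted overall satisfaction: {change:.3f}")
```

The change equals the call center coefficient because the website rating, and every other predictor, is held fixed.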

The real value of regression analysis is found in cases where the data are collected in natural units rather than on a common rating scale like our seven-point scale, such as measuring the effect that an employee’s education level (in years), beginning salary (in dollars), and months since hire have on the employee’s current salary (the dependent variable). The coefficient output would show the actual dollar increase in salary, such as \$1,000, for each additional year of education.

In the next example, the percent of students who scored at or above proficiency on a math assessment test (the dependent variable) is analyzed in relation to four predictor variables:

• If the student is classified as economically disadvantaged.
• If the student to teacher ratio in the class is acceptable.
• If the student’s parents indicated that they are heavily involved in the student’s education.
• If the student scored proficient or above on an art test.

The model is moderately strong, with a total of 56% of the variance in the percent of students scoring proficient or above on the assessment test being explained by the model.

The Parental Involvement Factor is the strongest predictor of the students’ test scores; a one-point increase in the mean rating on this factor is associated with a 13.5 point increase in the percent of students scoring at or above proficiency on the test.

The percent of students who indicated some sort of financial hardship negatively predicts student test performance, and a higher student to teacher ratio positively predicts student test performance.

Stepwise Regression: When there is a large number of independent (predictor) variables and the belief is that not all of them are significant, stepwise regression can be used. In stepwise regression, a smaller subset of variables that accounts for most of the variation in the dependent variable is selected. The independent variables are entered into or removed from the equation one at a time.
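The one-variable-at-a-time idea can be sketched with a simplified forward selection. Note that this is a stand-in, not the procedure a statistics package uses: packages typically decide entry and removal with F tests, while this sketch enters a variable only if it raises R squared by at least an assumed threshold (`min_gain`). The data are invented, with the third column deliberately unrelated to the outcome.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        M[col], M[pivot] = M[pivot], M[col]
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def r_squared(rows, y, columns):
    """R squared of a least-squares fit of y on the chosen predictor columns."""
    X = [[1.0] + [row[c] for c in columns] for row in rows]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
    coefs = solve_linear_system(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coefs, r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1 - ss_res / ss_tot

def forward_selection(rows, y, min_gain=0.01):
    """Add predictors one at a time while R squared improves by at least min_gain."""
    selected, remaining, best_r2 = [], list(range(len(rows[0]))), 0.0
    while remaining:
        scores = {c: r_squared(rows, y, selected + [c]) for c in remaining}
        candidate = max(scores, key=scores.get)
        if scores[candidate] - best_r2 < min_gain:
            break  # no remaining variable adds enough explanatory power
        selected.append(candidate)
        remaining.remove(candidate)
        best_r2 = scores[candidate]
    return selected, best_r2

# Hypothetical data: y = 10 + 2 * col0 - 3 * col1; col2 plays no role
rows = [(1, 2, 3), (2, 1, 1), (3, 4, 2), (4, 3, 5), (5, 5, 4), (6, 2, 1)]
y = [6, 11, 4, 9, 5, 16]

selected, best_r2 = forward_selection(rows, y)
print(f"selected columns: {selected}, R squared: {best_r2:.3f}")
```

A backward variant would start with every predictor in the model and drop the weakest one at each step instead.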