Statistical Rules in Simple Regression Analysis - Paper Example

Published: 2021-07-12
Carnegie Mellon University
Type of paper: Course work

Simple regression analysis is an inferential statistical technique that uses one variable, the independent variable, to predict the value of another variable, the dependent variable. Brase and Brase (2017) state that one must adhere to several rules when conducting simple regression analysis. The independent variable must be either quantitative or categorical; if it is categorical, it must have exactly two categories. A quantitative independent variable has to be measured at least at the interval level. In addition, the independent variable has to be unbounded, which means there should be no constraint on the extent to which it can vary. There must also be some variation in the value of the independent variable, which means that this variable must have a non-zero variance. The point of simple regression analysis is to find how the independent variable explains the variation in the dependent variable; if the independent variable has zero variance, it cannot explain any of that variation, and regressing the dependent variable on it is pointless.
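The non-zero variance rule can be seen directly in the least-squares formula for the slope, which divides by the sum of squared deviations of the independent variable. The following is a minimal sketch, not taken from the essay's analysis; the function name `fit_simple_regression` and the data are hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squared deviations of the independent variable.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        # Zero variance in x: the slope would require division by zero.
        raise ValueError("independent variable has zero variance")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data lying exactly on the line y = 1 + 2x.
intercept, slope = fit_simple_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)
```

Calling the same function with a constant independent variable, for example `[5, 5, 5]`, raises the error, which mirrors the point that such a regression cannot be estimated.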

The independent variable should not correlate highly with the error term. A high correlation between the two means that factors captured by the error term also affect the dependent variable, so the conclusions drawn from such a regression model might not be accurate. To keep this correlation low, it is important to select the independent variable based on sound conceptual and theoretical reasons that justify its hypothesized relationship with the dependent variable. At every level of the independent variable, the variance of the residuals is not supposed to change; in other words, the residuals at each level of the independent variable should have equal variance. This homogeneity of variance is particularly important when the data for the independent variable come from several groups. For data belonging to several entities, the variance for one entity should not differ from the variance for the other entities.
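One informal way to look at the equal-variance rule is to compare the spread of the residuals at low and high values of the independent variable. The sketch below is a rough illustration under stated assumptions, not a formal test and not part of the essay's SPSS analysis; the helper names and the data are hypothetical.

```python
from statistics import mean, pvariance


def residuals(x, y):
    """Residuals from the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def variance_ratio(x, y):
    """Residual variance in the upper half of x divided by the lower half."""
    ordered = [r for _, r in sorted(zip(x, residuals(x, y)))]
    half = len(ordered) // 2
    return pvariance(ordered[-half:]) / pvariance(ordered[:half])


x = [1, 2, 3, 4, 5, 6, 7, 8]
# Hypothetical outcomes: noise of constant size versus noise that grows with x.
y_equal = [2 * xi + e for xi, e in zip(x, [1, -1, 1, -1, 1, -1, 1, -1])]
y_growing = [2 * xi + e for xi, e in zip(x, [1, -1, 1, -1, 4, -4, 4, -4])]

print(variance_ratio(x, y_equal))    # close to 1: similar spread in both halves
print(variance_ratio(x, y_growing))  # well above 1: spread grows with x
```

A ratio far from 1 suggests that the residual variance changes across levels of the independent variable; how far is too far is a judgment that this sketch does not settle.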

The error term in simple regression analysis is supposed to be independent; this means that if one were to pick any two observations at random, there should be no correlation between their residuals. In a simple regression model, the errors should also have a normal distribution. With normally distributed errors, the residuals are supposed to be random variables that average zero; this implies that the differences between the observed and predicted values should usually be zero or close to it. If predicted values differ from observed values by a margin much greater than zero, it should happen only occasionally. Simple regression analysis, like many other techniques of inferential statistics, rests on the assumption that the sampling distribution is normal; this assumption is problematic because we cannot access the sampling distribution. The best we can do is look at the shape of the sample data and evaluate whether it is normal. While we cannot access the sampling distribution, we know from the central limit theorem that if a sample of data has a normal distribution, the sampling distribution will also be normal. Therefore, anyone analyzing data tends to look at the sample data to see whether it has a normal distribution.
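The independence of the errors is commonly examined with the Durbin-Watson statistic, which SPSS reports in the Model Summary table discussed later in this paper. The statistic divides the sum of squared differences between successive residuals by the sum of squared residuals; values near 2 suggest no first-order correlation between adjacent residuals. The following is a minimal sketch with hypothetical residuals, assuming the observations are in a meaningful order.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of ordered residuals."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals: neighbours alike (positive autocorrelation)
# versus neighbours alternating in sign (negative autocorrelation).
print(durbin_watson([1, 1, -1, -1]))  # below 2
print(durbin_watson([1, -1, 1, -1]))  # above 2
```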

In addition, the central limit theorem implies that, in big samples, the kind of distribution that the data set has hardly matters, because the sampling distribution tends towards normality anyway. In samples with at least 30 elements, the sampling distribution tends towards normality even if the population from which the sample is drawn does not have a normal distribution. Therefore, as the sample size in simple regression analysis grows beyond 30, one can become more confident that the sampling distribution is normal. Some elements of regression analysis are also robust to violations of the requirement that errors have a normal distribution. Robustness refers to the situation where a statistical model remains accurate even when some of its assumptions do not hold. An example of a robust test is the F-test in ANOVA, a test that forms one of the components of regression analysis. In simple regression analysis, the F-test is used to evaluate whether the variation in the dependent variable that the regression model explains is greater than the variation that the model fails to explain.
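The behaviour that the central limit theorem describes can be illustrated with a small simulation. The sketch below draws samples of 30 from a strongly skewed population (an exponential distribution, chosen here only as an assumption for illustration) and compares the skewness of the raw values with the skewness of the sample means. The seed, the number of replications, and the helper `skewness` are all hypothetical choices.

```python
import random
from statistics import mean


def skewness(values):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Raw draws from a skewed population with mean 1.
raw = [rng.expovariate(1.0) for _ in range(2000)]

# Means of 2000 samples, each containing 30 draws from the same population.
sample_means = [mean([rng.expovariate(1.0) for _ in range(30)])
                for _ in range(2000)]

print("skewness of raw values:  ", skewness(raw))
print("skewness of sample means:", skewness(sample_means))
```

The sample means should be far less skewed than the raw values, which is the sense in which the sampling distribution "tends towards normality" at a sample size of 30.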

Considering the robustness of the F-test, it is not prudent to transform data when the requirement of normally distributed errors is not met. In fact, transforming data to bring it in line with the rules of simple regression analysis creates more complications. When one transforms the data, the hypothesis being tested changes. For instance, if one applies a log transformation in regression analysis, testing the significance of the regression coefficients ends up comparing geometric means when the real intention was to compare arithmetic means. Another problem with transforming data during simple regression analysis is that one ends up addressing a construct that is different from the one that formed the basis of the original data collection and measurement, which means conclusions from such data are inaccurate. There is also the risk of severe consequences from using an inappropriate data transformation, so one is better off analyzing untransformed data even if the data violate some requirements. Note, however, that the requirement that the error terms have a normal distribution does not mean that the independent variable should also have a normal distribution. All the values of the dependent variable should be independent; that is, no value should be influenced by one or more other values. Finally, the relationship should be linear: for each increase in the value of the independent variable, the mean value of the dependent variable should fall on a straight line.
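The remark about log transformations and geometric means can be checked with a few lines of arithmetic: averaging the logarithms of a set of values and then back-transforming gives the geometric mean, not the arithmetic mean. The values below are hypothetical and chosen so the two means are easy to tell apart.

```python
from math import exp, log

values = [1.0, 10.0, 100.0]  # hypothetical measurements

arithmetic_mean = sum(values) / len(values)

# Mean on the log scale, then transformed back to the original scale.
mean_of_logs = sum(log(v) for v in values) / len(values)
back_transformed = exp(mean_of_logs)

# Geometric mean computed directly: the n-th root of the product.
product = 1.0
for v in values:
    product *= v
geometric_mean = product ** (1 / len(values))

print(arithmetic_mean)   # 37.0
print(back_transformed)  # approximately 10, the geometric mean
print(geometric_mean)    # approximately 10
```

An analysis carried out on log-transformed values therefore describes the second quantity, which is why the hypothesis under test is no longer the one about arithmetic means.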

Variable Scale and Selection of Dependent/Independent Variables

All the variables have been measured on the ratio scale. Looking at the measurement scale of the variables, the intervals on that scale represent equal differences in the property under measurement, and the ratios of the scores are meaningful. The dependent variable is salary while the independent variable is the experience in months; the justification for this choice of variables is that more months of experience translate to a greater skill level, which, in turn, makes one earn a higher salary than a worker with a lower level of skill does.

Necessary Settings in SPSS

The first setting is the main dialog box, which one accesses by selecting Analyze and then choosing Regression from the drop-down list that appears. After selecting Regression, the next step is to select Linear, which opens the main dialog box. The dialog box contains spaces labeled Dependent and Independent; one selects the dependent and independent variables from the list on the left-hand side and moves each into its respective space.

Interpretation of SPSS Results

The table below summarizes the simple linear regression model.

Table 1: Model Summary

Model Summary (b)

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .097 (a)   .009       .007                17,012.353                   1.656

a. Predictors: (Constant), Experience in months
b. Dependent Variable: Salary

The preceding table summarizes the results of the simple linear regression analysis. From the table, R, the coefficient of correlation between salary and experience in months, is 0.097. Therefore, there seems to be a weak correlation between a person's salary and the period (in months) over which they have been working. It is important to note that, this being a simple regression, the interpretation of R is different from what it would be in multiple regression. In multiple regression, there is more than one independent variable, which means it is not prudent to use R to link the dependent variable to a single independent variable. In that case, R is interpreted as the correlation between the predicted values of the dependent variable and the observed values of the dependent variable.
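The two readings of R coincide in simple regression: the correlation between the observed and predicted values of the dependent variable equals the absolute value of the correlation between the dependent and independent variables. The sketch below illustrates this with hypothetical data, not the salary data analyzed in this paper; the helper `pearson_r` is written out by hand for the illustration.

```python
from math import sqrt
from statistics import mean


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    return cov / sqrt(sum((ai - a_bar) ** 2 for ai in a)
                      * sum((bi - b_bar) ** 2 for bi in b))


# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]

# Least-squares fit of y on x.
x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
predicted = [intercept + slope * xi for xi in x]

r_xy = pearson_r(x, y)                # correlation between x and y
multiple_r = pearson_r(y, predicted)  # correlation between observed and predicted y

print(abs(r_xy), multiple_r)  # the two values agree
```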

From the preceding table, it is also clear that R Square is 0.009. Here, R Square represents the proportion of variation in the dependent variable that the independent variable can account for. An R Square of 0.009 means that about 0.9% of the variation in salary is explained by the variation in experience in months. Experience in months therefore accounts for a low proportion of the variation in salary, which is consistent with the earlier observation of a weak correlation between salary and experience in months. The following table shows the test of the significance of the simple regression model.

Table 2: Test of Model Significance


Model           Sum of Squares   df    Mean Square   F       Sig.
1  Regression   1.310E9          1     1.310E9       4.527   .034 (a)
   Residual     1.366E11         472   2.894E8
   Total        1.379E11         473

a. Predictors: (Constant), Experience in months
b. Dependent Variable: Salary

In the preceding table, the F test of significance is used to evaluate whether the simple regression model is a significant predictor of salary; in this test, the null hypothesis is that the regression model does not predict the dependent variable. In testing the null hypothesis, we compare the variation in the dependent variable that the simple regression model explains to the variation that the model cannot account for. Comparing the two sources of variation yields the F test statistic, a statistic that follows the F probability distribution. Because the F test statistic follows this distribution, it is possible to estimate the probability of observing a test statistic of a certain value, provided one knows the requisite degrees of freedom. With that probability in hand, one can set a threshold probability at which to reject, or fail to reject, the null hypothesis.
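The figures in Table 2 can be reproduced from the sums of squares and degrees of freedom alone. The sketch below recomputes the mean squares, the F statistic, and R Square from the rounded values printed in the table, so small rounding differences are expected. The p-value line is an approximation that is labeled as such: with one predictor, F on (1, 472) degrees of freedom is the square of a t statistic on 472 degrees of freedom, and with that many degrees of freedom the t distribution is close to the normal distribution.

```python
from math import sqrt
from statistics import NormalDist

# Rounded values as printed in Table 2.
ss_regression, df_regression = 1.310e9, 1
ss_residual, df_residual = 1.366e11, 472

# Mean square = sum of squares divided by its degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of explained to unexplained variation per degree of freedom.
f_statistic = ms_regression / ms_residual

# R Square is the explained share of the total sum of squares.
r_squared = ss_regression / (ss_regression + ss_residual)

# Assumption: large-sample normal approximation to the t distribution,
# used here only because it needs no third-party library; it is not exact.
approx_p = 2 * (1 - NormalDist().cdf(sqrt(f_statistic)))

print(f"F = {f_statistic:.3f}")
print(f"R Square = {r_squared:.4f}")
print(f"approximate p = {approx_p:.3f}")
```

The recomputed values should land close to the F of 4.527, the R Square of .009, and the significance of .034 reported by SPSS, which is below the conventional 0.05 threshold.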

In the preceding table, the regression sum of squares represents the improvement that results from predicting salary using a person's experience instead of using the average salary. To calculate the regression sum of squares, one first computes the difference between the mean salary and the predicted salary at each point on the regression line; one then squares these differences and sums them to get the regression sum of squares. Regression models are not perfect predictors of the dependent variable, and there will almost certainly be differences between the observed values of the dependent variable and the values that the regression model has predicted; these differences are the residuals. Residuals can be positive or negative, so they cancel out when added together, which complicates any attempt to use their total as a measure of error.

To address the complications that negative residuals can cause, one squares all the residuals and sums the squared residuals in order to get the residual sum of squares. The residual sum of squares is a numerical measure of the variation that a regression model cannot explain. Once there is a numerical measure of the variation that a regression model cannot explain and a corresponding measure of the variation the model can explain, the next step is to compare the two measures.
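The relationship between the sums of squares can be illustrated with a small hypothetical data set (not the salary data): the total sum of squares around the mean splits into the part the regression line explains and the part left in the residuals. All names and values below are assumptions for the illustration.

```python
from statistics import mean

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]

# Least-squares fit of y on x.
x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
predicted = [intercept + slope * xi for xi in x]

# Regression sum of squares: predicted values versus the mean of y.
ss_regression = sum((p - y_bar) ** 2 for p in predicted)
# Residual sum of squares: observed values versus predicted values.
ss_residual = sum((yi - p) ** 2 for yi, p in zip(y, predicted))
# Total sum of squares: observed values versus the mean of y.
ss_total = sum((yi - y_bar) ** 2 for yi in y)

# The raw residuals cancel out, which is why they are squared before summing.
print(sum(yi - p for yi, p in zip(y, predicted)))  # approximately 0
print(ss_regression, ss_residual, ss_total)        # about 10, 4.8 and 14.8
```

In this example the regression and residual sums of squares add up to the total, just as the Regression and Residual rows add up to the Total row in Table 2.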

However, before comparing the regression sum of squares to the residual sum of squares, it is important to address one problem: these sums of squares depend on the values that one has computed, which creates difficulties in using the F ratio across various contexts...
