Scatter Diagrams and Regression Lines

A. Constructing a Scatter Diagram

Constructing a scatter diagram is a fairly straightforward process. First decide which variable is going to be your x-value and which variable is going to be your y-value.

Find the minimum and maximum of your x-values and set up a uniform number line on your horizontal axis so that the values extend from the minimum x-value to the maximum x-value but not much farther.

Next find the minimum and maximum of your y-values and set up a uniform number line on the vertical axis so that the values extend from the minimum y-value to the maximum y-value but not much farther.

Once the axes are set up, you just act like each pair of x- and y-values is an ordered pair, and you plot these ordered pairs on the coordinate axes you just created.

For example, consider Table 1 on page 178 of Sullivan, which is reproduced below. Usually you would pick the first row or column as your x-values and the second row or column as your y-values. Once you make that decision, your axes should be fairly similar to those shown in the figure below.

 Club-Head Speed (mph)   100  102  103  101  105  100   99  105
 Distance (yards)        257  264  274  266  277  263  258  275

The lowest club-head speed is 99 and the highest is 105, so the x-axis shown extends from 99 to 105. The shortest distance is 257 and the longest is 277, but since multiples of 5 were used to mark the scale, the vertical scale extends a little below 257 to 255 and a little above 277 to 280.
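The choice of axis limits described above can be sketched in a few lines of Python. This is only an illustration: the helper name axis_limits and the step sizes (1 for speed, 5 for distance) are assumptions chosen to match the figure described here, not part of the textbook.

```python
import math

def axis_limits(values, step):
    """Widen the data range outward to the nearest multiples of `step`.

    `step` is the spacing of the tick marks (an assumption of this sketch).
    """
    low = math.floor(min(values) / step) * step
    high = math.ceil(max(values) / step) * step
    return low, high

# Table 1 data
speeds = [100, 102, 103, 101, 105, 100, 99, 105]
distances = [257, 264, 274, 266, 277, 263, 258, 275]

x_limits = axis_limits(speeds, 1)      # (99, 105)
y_limits = axis_limits(distances, 5)   # (255, 280)
print(x_limits, y_limits)
```

Rounding outward keeps every data point inside the axes while adding at most one tick interval of empty space on each side.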

Each speed-distance pair in Table 1 is represented by a point on the scatter diagram. To plot the point for the first pair, find 100 on the x-axis and then move up to a height representing 257 on the y-axis. Since 257 is between 255 and 260, but closer to 255, that is where the point should be. In a similar fashion, points are plotted for the other pairs in Table 1. Once a point has been plotted for each pair and a title has been added to the chart, the scatter diagram is complete.

B. Using the TI 83/84 Calculator to Find Equations of Regression Lines and Coefficients of Determination and Correlation

To find coefficients of determination and correlation, you must first make a change in the settings on your calculator. Press "2ND" and "0" (zero). This will bring up a list of all the procedures in the calculator in alphabetical order. Use the down-arrow key to put the triangle cursor next to "DiagnosticOn". Make sure you choose "DiagnosticOn" and NOT "DiagnosticOff". Then press "Enter" twice. On your calculator screen, you should see:

DiagnosticOn

Done

You should only need to do this procedure again if (1) your calculator is reset to the manufacturer’s default settings, or (2) you start using a different calculator.

To find regression lines and coefficients of determination and correlation, you need to be working with two variables with each value of the first variable paired with one value of the second. For this demonstration, we will use the data from Table 1 again:

 Club-Head Speed (mph)   100  102  103  101  105  100   99  105
 Distance (yards)        257  264  274  266  277  263  258  275

Press "STAT" and "ENTER". Enter the numbers for the first variable under "L1". Enter the numbers for the second variable under "L2". When you are finished, the data entry screen should look like the following:

 L1      L2
 103     274
 101     266
 105     277
 100     263
 99      258
 105     275
 ------  ------

The first two rows appear to be missing, but if you press the up-arrow a few times, both rows will reappear.

Each number in the first column (L1) must sit beside the same number in the second column (L2) that it was paired with in the original table. The only difference is layout: in the original table the pairs run vertically, with each first-row value directly above its second-row partner. If the pairing is changed, your results will most likely be wrong.

Now press "STAT". Move the cursor to "CALC" and select "LinReg(ax+b)" using the down-arrow button. Pressing "ENTER" twice will bring up the following display:

 LinReg
  y = ax + b
  a = 3.166101695
  b = -55.79661017
  r² = .8811498758
  r = .9386958377

If your data is in lists other than L1 and L2, you can choose those lists by pressing "ENTER" just one time after selecting "LinReg(ax+b)", typing in the correct lists, and pressing "ENTER" again. If your data is in L4 and L5, for example, you would press "2ND", "4", "," (comma), "2ND", "5" and then "ENTER" to compute the regression line for those lists.

For this example, the equation of the regression line is y = 3.166x - 55.797. The coefficient of determination, 0.881, says that about 88.1% of the variation in the data is determined by the regression line. The correlation coefficient, 0.939, indicates a strong positive correlation. See section D, Coefficients of Determination and Correlation, below to find out how to interpret these two coefficients.
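If you want to check the calculator's output by hand, the same quantities can be computed with the standard least-squares formulas. The sketch below uses only plain Python; the function name linear_regression is an assumption of this example, and the calculator's internal method may differ in detail.

```python
import math

def linear_regression(xs, ys):
    """Return (a, b, r_squared, r) for the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                      # slope
    b = mean_y - a * mean_x            # intercept
    r = sxy / math.sqrt(sxx * syy)     # correlation coefficient
    return a, b, r * r, r

# Table 1 data, entered in the same paired order as L1 and L2
speeds = [100, 102, 103, 101, 105, 100, 99, 105]
distances = [257, 264, 274, 266, 277, 263, 258, 275]

a, b, r_squared, r = linear_regression(speeds, distances)
print(f"a = {a:.9f}, b = {b:.8f}, r^2 = {r_squared:.10f}, r = {r:.10f}")
```

The printed values should agree with the LinReg display above to the digits the calculator shows. Because the lists are paired by position, reordering one list without the other changes the result.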

To plot the regression line on the scatter diagram, you need to find two points on the regression line. Since you have the equation of the regression line, all you need are some x-values to plug into this equation. The minimum and maximum x-values are good to use, but you could use any numbers that are close to these x-values. In the above example, 99 is the minimum x-value and 105 is the maximum. Plugging these into the equation gives:

 y = 3.166 × 99 - 55.797 = 257.6, or x = 99 and y = 257.6 for the first point, and
 y = 3.166 × 105 - 55.797 = 276.6, or x = 105 and y = 276.6 for the second point.

Plotting these two points on the scatter diagram and drawing a line through them gives a graph of the regression line. When the regression line is plotted correctly, about half of the data points will be above the line and the other half will be below the line. If your line is below or above much more than half of the data points, then you have done something wrong. Usually this indicates that you recorded a wrong number somewhere or you have switched the x-values and y-values at some step in the process.
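The two endpoint calculations, and the check that the points fall on both sides of the line, can be sketched as follows. The rounded coefficients come from the regression equation above; the helper name predict is an assumption of this example.

```python
def predict(a, b, x):
    """Height of the line y = a*x + b at a given x."""
    return a * x + b

# Rounded coefficients of the regression line from this example
a, b = 3.166, -55.797

speeds = [100, 102, 103, 101, 105, 100, 99, 105]
distances = [257, 264, 274, 266, 277, 263, 258, 275]

# Two points on the line, at the minimum and maximum x-values
endpoints = [(x, round(predict(a, b, x), 1)) for x in (min(speeds), max(speeds))]

# Count how many data points lie strictly above and strictly below the line
above = sum(1 for x, y in zip(speeds, distances) if y > predict(a, b, x))
below = sum(1 for x, y in zip(speeds, distances) if y < predict(a, b, x))

print(endpoints)      # [(99, 257.6), (105, 276.6)]
print(above, below)   # 5 3
```

For this data the strict count is five above and three below, but two of the points above the line are within about 0.4 yards of it, so on a plot they appear to touch the line. A split such as eight above and none below would signal a mistake.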

C. Good and Bad Scatter Diagrams

Figure 1: GOOD Scatter Diagram

This is a GOOD scatter diagram. It has a title and both axes are labeled. Both scales extend only as far as the data values and not much farther. Notice that the regression line goes through the middle of the points. Three points are above the regression line and three points are below it, while two points just touch the regression line.

This is a BAD scatter diagram. Notice that there are no data values to the left of 90 on the horizontal axis, and yet the horizontal scale goes all the way down to zero. As a result, most of the left side of the chart is empty and all the data values are squeezed against the right side of the chart.

This is a BAD scatter diagram. Notice that there are no data values below 250 on the vertical axis, and yet the vertical scale goes all the way down to zero. As a result, most of the bottom part of the chart is empty and all the data values are pushed up against the top of the graph.

This is a BAD scatter diagram. While both scales are restricted, they still go a lot farther than they need to. As a result, the data is forced into a very small area of the chart and there is a lot of blank space around it. There is no need for the horizontal axis to go below 95 or above 110. There is no need for the vertical axis to go below 250 or above 280. The smaller the range of data on each axis is, the more the chart becomes focused on the data. See how the data points take up most of the graph in Figure 1 above.

This is a BAD scatter diagram. The vertical axis has been extended to show the line, but since the line is nowhere near the data, this is not the regression line. Usually when this happens, it means the x and y variables have been switched somewhere in the process of finding the regression line. Always set up the scatter diagram first. Then if the regression line is nowhere near the data, that means you made a mistake in computing the regression line. One thing to try if this happens is switching x and y values.

D. Coefficients of Determination and Correlation

The coefficient of determination, r², tells what percent of the variation in data values is explained by the regression line. If this percent is less than 100%, then the difference between 100% and the coefficient of determination tells what percent of the variation is determined by something other than the regression line.

Examples

a) If r² = 0.82, then 82% of the variation is determined by the regression line, and 18% of the variation is determined by some other factor or factors.

b) If r² = 0.47, then 47% of the variation is determined by the regression line, and 53% of the variation is determined by some other factor or factors.
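The arithmetic in these examples is just a split of 100% into two parts. A minimal sketch, where the function name explained_split and the rounding to one decimal place are assumptions of this example:

```python
def explained_split(r_squared):
    """Return (percent determined by the line, percent due to other factors)."""
    if not 0 <= r_squared <= 1:
        raise ValueError("r squared must be between 0 and 1")
    explained = round(100 * r_squared, 1)
    other = round(100 - explained, 1)
    return explained, other

print(explained_split(0.82))   # (82.0, 18.0)
print(explained_split(0.47))   # (47.0, 53.0)
```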

The correlation coefficient, r, tells how close the scatter diagram points are to lying on a line. If the correlation coefficient is positive, the line slopes upward. If the correlation coefficient is negative, the line slopes downward. All values of the correlation coefficient are between -1 and 1, inclusive.

The correlation scale* below provides a way to categorize the values of correlation coefficients. According to this scale, a correlation coefficient of 0.2 would indicate a weak positive correlation, while a coefficient of -0.9 would indicate a strong negative correlation. A correlation coefficient of 1.0 indicates a perfect positive correlation.

* This scale has been revised and expanded from the correlation scale presented in Jay Devore and Nicholas Farnum, Applied Statistics for Engineers and Scientists, 2nd edition, Brooks/Cole 2005, p. 109.
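A scale of this kind can be written as a small lookup. The cutoffs in the sketch below (0.1, 0.5 and 0.8) are hypothetical values chosen only so that the function agrees with the examples given in this handout; they are not taken from the correlation scale itself, so replace them with the boundaries shown on the scale before relying on the output.

```python
def describe_correlation(r, none_below=0.1, weak_below=0.5, moderate_below=0.8):
    """Describe r in words. The three cutoffs are ASSUMED, not from the scale."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1, inclusive")
    size = abs(r)
    if size < none_below:
        return "no correlation"
    if size == 1:
        strength = "perfect"
    elif size < weak_below:
        strength = "weak"
    elif size < moderate_below:
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"

print(describe_correlation(0.2))    # weak positive correlation
print(describe_correlation(-0.9))   # strong negative correlation
print(describe_correlation(1.0))    # perfect positive correlation
```

Keeping the cutoffs as parameters makes it easy to swap in a different scale.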

Exercise: Interpret the values of r² and r given below.

 1) r² = 0.452
 2) r² = 0.913
 3) r² = 0.721
 4) r² = 0.264
 5) r = 0.431
 6) r = -0.083
 7) r = 0.972
 8) r = -1.0
 9) r = 0.681
 10) r = -0.753
 11) r = 0.047
 12) r = -0.994

The following scatter diagrams are provided to give you some idea of how correlation coefficients and coefficients of determination relate to how points are clustered around a regression line.

The regression line has the equation: Distance = 3.17 × Speed - 55.80. The correlation coefficient is 0.939, which signifies a strong positive correlation. The coefficient of determination is 0.881, indicating that 88.1% of the variation in the data is determined by the regression line.

The regression line has the equation: Life Exp. = 0.0261 × Gestation + 7.87. The correlation coefficient is 0.726, which signifies a moderate positive correlation. The coefficient of determination is 0.527, indicating that 52.7% of the variation in the data is determined by the regression line.

The regression line has the equation: IQ = 0.0172 × (MRI Count) + 119.22. The correlation coefficient is 0.357, which signifies a weak positive correlation. The coefficient of determination is 0.128, indicating that 12.8% of the variation in the data is determined by the regression line.

The regression line has the equation: TECO = 0.0170 × GE + 0.0427. The correlation coefficient is 0.017, which shows no correlation between the annual rates of return for the two stocks. The coefficient of determination is 0.0003, indicating that almost none of the variation in the data is determined by the regression line.

The regression line has the equation: TECO = -0.112 × Cisco + 0.0888. The correlation coefficient is -0.235, which signifies a weak negative correlation. The coefficient of determination is 0.055, indicating that 5.5% of the variation in the data is determined by the regression line.

The regression line has the equation: Percentage = -0.111 × ERA + 0.977. The correlation coefficient is -0.660, which signifies a moderate negative correlation. The coefficient of determination is 0.436, indicating that 43.6% of the variation in the data is determined by the regression line.

The regression line has the equation: MPG = -0.00617 × Weight + 41.46. The correlation coefficient is -0.892, which signifies a strong negative correlation. The coefficient of determination is 0.796, indicating that 79.6% of the variation in the data is determined by the regression line.

E. Outliers and Influential Observations in a Scatter Diagram

If there is a regression line on a scatter diagram, you can identify outliers. An outlier for a scatter diagram is the point or points that are farthest from the regression line. Distance from a point to the regression line is the length of the line segment that is perpendicular to the regression line and extends from the point to the regression line. (See the figure below.) Note that outliers for a scatter plot are very different from outliers for a boxplot. There is usually at least one outlier and usually only one outlier on a scatter diagram. If one point of a scatter diagram is farther from the regression line than some other point, then the scatter diagram has at least one outlier. If two or more points are the same farthest distance from the regression line (not a common occurrence), then each of these points is an outlier. If all points of the scatter diagram are the same distance from the regression line (which very rarely happens), then there is no outlier. (See the GeoGebra applet Scatter Diagram Outliers.)
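The definition above can be applied numerically. For a line y = ax + b, the perpendicular distance from a point (x, y) to the line is |ax - y + b| / sqrt(a² + 1). The sketch below applies this to the club-head speed data; the function names and the tolerance used to detect ties are assumptions of this example.

```python
import math

def perpendicular_distance(a, b, x, y):
    """Perpendicular distance from the point (x, y) to the line y = a*x + b."""
    return abs(a * x - y + b) / math.sqrt(a * a + 1)

def find_outliers(a, b, points, tol=1e-9):
    """Return the point(s) farthest from the line, or [] if all are equally far."""
    distances = [perpendicular_distance(a, b, x, y) for x, y in points]
    farthest = max(distances)
    if all(abs(d - farthest) <= tol for d in distances):
        return []  # every point is the same distance away: no outlier
    return [p for p, d in zip(points, distances) if abs(d - farthest) <= tol]

# Regression line and data from the club-head speed example
a, b = 3.166101695, -55.79661017
points = [(100, 257), (102, 264), (103, 274), (101, 266),
          (105, 277), (100, 263), (99, 258), (105, 275)]

print(find_outliers(a, b, points))   # [(100, 257)]
```

For this data the single outlier is the point (100, 257), which lies about 3.8 yards below the line measured vertically.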

An influential observation (inf. obs.) is a point on a scatter diagram that has a large horizontal gap containing no points between it and a vast majority of the other points. As shown in the graph below, there can be more than one influential observation. If there is no large horizontal gap between data points in a scatter diagram, there are no influential observations. In many cases, a scatter diagram will have no influential observations; but influential observations should be identified if they occur. When an influential observation is moved up or down and the regression line is recomputed, the new line will be much closer to the new location of the influential observation. If a non-influential observation is relocated, the recomputed regression line will be in almost the same position as the original regression line. Thus the influential observation "influences" the location of the regression line. (See the Influential Observations GeoGebra applet.)
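The effect described above can be tried out numerically. The data in this sketch is made up purely for illustration (it is not from the textbook): five points with x-values from 1 to 5, plus one point at x = 20 that is separated from the rest by a large horizontal gap. Each point in turn is moved up by 10 units and the slope of the regression line is recomputed.

```python
def slope(xs, ys):
    """Least-squares slope of the regression line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: the last point sits far to the right of the others
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 3, 5, 4, 6, 21]

def slope_shift(index, shift):
    """Change in slope when the point at `index` is moved vertically by `shift`."""
    moved = list(ys)
    moved[index] += shift
    return slope(xs, moved) - slope(xs, ys)

far_shift = slope_shift(5, 10)    # move the isolated point at x = 20
near_shift = slope_shift(2, 10)   # move an ordinary point at x = 3

print(f"slope change, isolated point: {far_shift:.3f}")
print(f"slope change, ordinary point: {near_shift:.3f}")
```

Moving the isolated point changes the slope several times more than moving an ordinary point by the same amount, which is the sense in which it "influences" the line.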

Exercises

Identify influential observations and outliers in the scatter diagrams shown below.