Scatter Diagrams and Regression Lines
A. Constructing a Scatter Diagram
Constructing a scatter diagram is a fairly straightforward process. First decide which variable is going to be your x-value and which variable is going to be your y-value.
Find the minimum and maximum of your x-values and set up a uniform number line on your horizontal axis so that the values extend from the minimum x-value to the maximum x-value but not much farther.
Next find the minimum and maximum of your y-values and set up a uniform number line on the vertical axis so that the values extend from the minimum y-value to the maximum y-value but not much farther.
Once the axes are set up, treat each pair of x- and y-values as an ordered pair, and plot these ordered pairs on the coordinate axes you just created.
For example, consider Table 1 on page 178 of Sullivan, which is reproduced below. Usually you would pick the first row or column as your x-values and the second row or column as your y-values. Once you make that decision, your axes should be fairly similar to those shown in the figure below.
Club-Head Speed (mph)   100   102   103   101   105   100    99   105
Distance (yards)        257   264   274   266   277   263   258   275
The lowest club-head speed is 99 and the highest is 105, and the x-axis shown extends from 99 to 105. The shortest distance is 257 and the longest is 277, but since multiples of 5 were used to mark the scale, the vertical scale extends a little below 257 to 255, and a little above 277 to 280.
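The axis ranges described above can be worked out mechanically. The following is a minimal sketch, assuming you round the minimum down and the maximum up to the nearest multiple of the tick spacing; the function name axis_limits and the step values are illustrative choices, not part of the textbook procedure.

```python
import math

# Data from Table 1
speeds = [100, 102, 103, 101, 105, 100, 99, 105]
distances = [257, 264, 274, 266, 277, 263, 258, 275]

def axis_limits(values, step):
    """Widen the data range to the nearest multiples of the tick spacing `step`."""
    low = math.floor(min(values) / step) * step
    high = math.ceil(max(values) / step) * step
    return low, high

print(axis_limits(speeds, 1))     # horizontal axis marked in steps of 1
print(axis_limits(distances, 5))  # vertical axis marked in multiples of 5
```

With a step of 1 the horizontal axis runs from 99 to 105, and with a step of 5 the vertical axis runs from 255 to 280, matching the scales described in the text.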
Each speed-distance pair in Table 1 will be represented by a point on the scatter diagram. To plot the point represented by the first pair in the table, you find 100 on the x-axis and then move up to a height representing 257 on the y-axis. Since 257 is between 255 and 260, but closer to 255, that is where the point should be. In a similar fashion, points would be plotted for the other pairs in Table 1. Once a point has been plotted for each pair and a title has been added to the chart, the scatter diagram is complete.
B. Using the TI-83/84 Calculator to Find Equations of Regression Lines and Coefficients of Determination and Correlation
To find coefficients of determination and correlation, you must first make a change in the settings on your calculator. Press "2ND" and "0" (zero). This will bring up a list of all the procedures in the calculator in alphabetical order. Use the down-arrow key to put the triangle cursor next to "DiagnosticOn". Make sure you choose "DiagnosticOn" and NOT "DiagnosticOff". Then press "ENTER" twice. On your calculator screen, you should see:
DiagnosticOn
Done
You should only need to do this procedure again if (1) your calculator
is reset to the manufacturer’s default settings, or (2) you start using a
different calculator.
To find regression lines and coefficients of determination and correlation, you need to be working with two variables with each value of the first variable paired with one value of the second. For this demonstration, we will use the data from Table 1 again:
Club-Head Speed (mph)   100   102   103   101   105   100    99   105
Distance (yards)        257   264   274   266   277   263   258   275
Press "STAT" and "ENTER". Enter the numbers for the first variable under "L1". Enter the numbers for the second variable under "L2". When you are finished, the data entry screen should look like the following:
L1     L2
103    274
101    266
105    277
100    263
99     258
105    275
The first two rows appear to be missing, but if you press the up-arrow a few times, both rows will reappear.
The numbers in L1 must stay paired with the same numbers in L2 that they were paired with in the original table of data, where each number in the first row was paired with the number directly below it in the second row. If the pairing is changed, your results will most likely be wrong.
Now press "STAT". Move the cursor to "CALC" and select "LinReg(ax+b)" using the down-arrow button. Pressing "ENTER" twice will bring up the following display:
LinReg
 y = ax + b
 a = 3.166101695
 b = -55.79661017
 r^2 = .8811498758
 r = .9386958377
If your data is in lists other than L1 and L2, you can choose those lists by pressing "ENTER" just one time after selecting "LinReg(ax+b)", typing in the correct lists, and pressing "ENTER" again. If your data is in L4 and L5, for example, you would press "2ND", "4", "," (comma), "2ND", "5" and then "ENTER" to compute the regression line for those lists.
For this example, the equation of the regression line is y = 3.166x - 55.797. The coefficient of determination, 0.881, says that about 88.1% of the variation in the data is determined by the regression line. The correlation coefficient, 0.939, indicates a strong positive correlation. See Section D, Coefficients of Determination and Correlation, below to find out how to interpret these coefficients.
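If you want to check the calculator output without a TI-83/84, the same quantities can be computed from the usual least-squares formulas. The sketch below is one way to do this in Python; the function name linear_regression is an illustrative choice, and this is not a description of how the calculator works internally.

```python
import math

# Data from Table 1
speeds = [100, 102, 103, 101, 105, 100, 99, 105]
distances = [257, 264, 274, 266, 277, 263, 258, 275]

def linear_regression(xs, ys):
    """Return slope a, intercept b, correlation r, and r^2 for y = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-products about the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r, r * r

a, b, r, r2 = linear_regression(speeds, distances)
print(f"a = {a:.9f}")
print(f"b = {b:.8f}")
print(f"r^2 = {r2:.10f}")
print(f"r = {r:.10f}")
```

The printed values should agree with the calculator display above to the digits shown, including the negative sign on b.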
To plot the regression line on the scatter diagram, you need to find two points on the regression line. Since you have the equation of the regression line, all you need are some x-values to plug into this equation. The minimum and maximum x-values are good to use, but you could use any numbers that are close to these x-values. In the above example, 99 is the minimum x-value and 105 is the maximum. Plugging these into the equation gives:
y = 3.166(99) - 55.797 = 257.6,
or x = 99 and y = 257.6 for the first point, and
y = 3.166(105) - 55.797 = 276.6,
or x = 105 and y = 276.6 for the second point.
Plotting these two points on the scatter diagram and drawing a line through them gives a graph of the regression line. When the regression line is plotted correctly, about half of the data points will be above the line and the other half will be below the line. If your line is below or above much more than half of the data points, then you have done something wrong. Usually this indicates that you recorded a wrong number somewhere or you have switched the x-values and y-values at some step in the process.
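The two endpoint calculations above can be reproduced with a short function. This is a minimal sketch using the rounded coefficients 3.166 and -55.797 from this example; the function name predict is an illustrative choice.

```python
def predict(x, a=3.166, b=-55.797):
    """Height of the regression line y = ax + b at a given x-value."""
    return a * x + b

# Evaluate the line at the minimum and maximum x-values
for x in (99, 105):
    print(x, round(predict(x), 1))
```

Any two x-values near the ends of the data would work equally well; using points that are far apart simply makes the line easier to draw accurately.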
C. Good and Bad Scatter Diagrams
Figure 1: GOOD Scatter Diagram
This is a GOOD scatter diagram. It has a title and both axes are labeled. Both scales extend only as far as the data values and not much farther. Notice that the regression line goes through the middle of the points. Three points are above the regression line and three points are below it, while two points just touch the regression line.
Figure 2: BAD Scatter Diagram
This is a BAD scatter diagram. Notice that there are no data values to the left of 90 on the horizontal axis, and yet the horizontal scale goes all the way down to zero. As a result, most of the left side of the chart is empty and all the data values are squeezed against the right side of the chart.
Figure 3: BAD Scatter Diagram
This is a BAD scatter diagram. Notice that there are no data values below 250 on the vertical axis, and yet the vertical scale goes all the way down to zero. As a result, most of the bottom part of the chart is empty and all the data values are pushed up against the top of the graph.
Figure 4: BAD Scatter Diagram
This is a BAD scatter diagram. While both scales are restricted, they still go a lot farther than they need to. As a result, the data is forced into a very small area of the chart and there is a lot of blank space around it. There is no need for the horizontal axis to go below 95 or above 110. There is no need for the vertical axis to go below 250 or above 280. The smaller the range of data on each axis is, the more the chart becomes focused on the data. See how the data points take up most of the graph in Figure 1 above.
Figure 5: BAD Scatter Diagram
This is a BAD scatter diagram. The vertical axis has been extended to show the line, but since the line is nowhere near the data, this is not the regression line. Usually when this happens, it means the x and y variables have been switched somewhere in the process of finding the regression line. Always set up the scatter diagram first. Then if the regression line is nowhere near the data, you know you made a mistake in computing the regression line. One thing to try if this happens is switching the x- and y-values.
D. Coefficients of Determination and Correlation
The coefficient of determination, r^2, tells what percent of the variation in data values is explained by the regression line. If this percent is less than 100%, then the difference between 100% and the coefficient of determination tells what percent of the variation is determined by something other than the regression line.
Examples
a) If r^2 = 0.82, then 82% of the variation is determined by the regression line, and 18% of the variation is determined by some other factor or factors.
b) If r^2 = 0.47, then 47% of the variation is determined by the regression line, and 53% of the variation is determined by some other factor or factors.
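The arithmetic in these two examples is just a conversion to percent and a subtraction from 100%. A minimal sketch, with an illustrative function name:

```python
def split_variation(r2):
    """Return (percent determined by the regression line, percent due to other factors)."""
    explained = round(r2 * 100, 1)
    other = round(100 - explained, 1)
    return explained, other

print(split_variation(0.82))  # example a)
print(split_variation(0.47))  # example b)
```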
The correlation coefficient, r, tells how close the scatter diagram points are to being on a line. If the correlation coefficient is positive, the line slopes upward. If the correlation coefficient is negative, the line slopes downward. All values of the correlation coefficient are between -1 and 1, inclusive.
The correlation scale* below provides a way to categorize the values of correlation coefficients.
According to this scale, a correlation coefficient of 0.2 would indicate a weak positive correlation, while a coefficient of -0.9 would indicate a strong negative correlation. A correlation coefficient of 1.0 indicates a perfect positive correlation.
* This scale has been revised and expanded from the correlation scale presented in Jay Devore and Nicholas Farnum, Applied Statistics for Engineers and Scientists, 2nd edition, Brooks/Cole 2005, p. 109.
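A scale like this can be expressed as a small lookup function. The sketch below is only an illustration: the cutoffs 0.1, 0.5, and 0.8 are assumptions chosen to agree with the worked examples in this handout, and they are not taken from the scale itself, so adjust them to match the scale you are actually using.

```python
def describe_correlation(r, negligible=0.1, moderate=0.5, strong=0.8):
    """Categorize r using ASSUMED cutoffs (not the handout's actual scale)."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1, inclusive")
    size = abs(r)
    if size < negligible:
        return "no correlation"
    if size == 1:
        strength = "perfect"
    elif size >= strong:
        strength = "strong"
    elif size > moderate:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"

for r in (0.2, -0.9, 1.0):
    print(r, "->", describe_correlation(r))
```

The sign of r gives the direction and the absolute value of r gives the strength, which is why the function looks at abs(r) before choosing a category.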
Exercise: Interpret the values of r^2 and r given below.
1) r^2 = 0.452
2) r^2 = 0.913
3) r^2 = 0.721
4) r^2 = 0.264
5) r = 0.431 
6) r = 0.083 
7) r = 0.972 
8) r = 1.0 
9) r = 0.681 
10) r = 0.753 
11) r = 0.047 
12) r = 0.994 
The following scatter diagrams are provided to give you some idea of how correlation coefficients and coefficients of determination relate to how points are clustered around a regression line.
The regression line has the equation: Distance = 3.17 x Speed - 55.80. The correlation coefficient is 0.939, which signifies a strong positive correlation. The coefficient of determination is 0.881, indicating that 88.1% of the variation in the data is determined by the regression line.
The regression line has the equation: Life Exp. = 0.0261 x Gestation + 7.87. The correlation coefficient is 0.726, which signifies a moderate positive correlation. The coefficient of determination is 0.527, indicating that 52.7% of the variation in the data is determined by the regression line.
The regression line has the equation: IQ = 0.0172 x (MRI Count) + 119.22. The correlation coefficient is 0.357, which signifies a weak positive correlation. The coefficient of determination is 0.128, indicating that 12.8% of the variation in the data is determined by the regression line.
The regression line has the equation: TECO = 0.0170 x GE + 0.0427 . The correlation coefficient is 0.017, which shows no correlation between the annual rates of return for the two stocks. The coefficient of determination is 0.0003, indicating that almost none of the variation in the data is determined by the regression line.
The regression line has the equation: TECO = -0.112 x Cisco + 0.0888. The correlation coefficient is -0.235, which signifies a weak negative correlation. The coefficient of determination is 0.055, indicating that 5.5% of the variation in the data is determined by the regression line.
The regression line has the equation: Percentage = -0.111 x ERA + 0.977. The correlation coefficient is -0.660, which signifies a moderate negative correlation. The coefficient of determination is 0.436, indicating that 43.6% of the variation in the data is determined by the regression line.
The regression line has the equation: MPG = -0.00617 x Weight + 41.46. The correlation coefficient is -0.892, which signifies a strong negative correlation. The coefficient of determination is 0.796, indicating that 79.6% of the variation in the data is determined by the regression line.
E. Outliers and Influential Observations in a Scatter Diagram
If there is a regression line on a scatter diagram, you can identify outliers. An outlier for a scatter diagram is the point or points that are farthest from the regression line. Distance from a point to the regression line is the length of the line segment that is perpendicular to the regression line and extends from the point to the regression line. (See the figure below.) Note that outliers for a scatter plot are very different from outliers for a boxplot.
There is usually at least one outlier and usually only one outlier on a scatter diagram. If one point of a scatter diagram is farther from the regression line than some other point, then the scatter diagram has at least one outlier. If two or more points are the same farthest distance from the regression line (not a common occurrence), then each of these points is an outlier. If all points of the scatter diagram are the same distance from the regression line (which very rarely happens), then there is no outlier. (See the GeoGebra applet Scatter Diagram Outliers.)
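The definition above can be applied numerically. For a line y = ax + b, the perpendicular distance from a point (x0, y0) is |a*x0 - y0 + b| / sqrt(a^2 + 1). The sketch below applies this to the Table 1 data and the regression line from Section B; the function names and the small tolerance used to detect ties are illustrative assumptions.

```python
import math

# Data from Table 1 and the regression coefficients found in Section B
speeds = [100, 102, 103, 101, 105, 100, 99, 105]
distances = [257, 264, 274, 266, 277, 263, 258, 275]
a, b = 3.166101695, -55.79661017

def perpendicular_distance(x, y, a, b):
    """Perpendicular distance from the point (x, y) to the line y = ax + b."""
    return abs(a * x - y + b) / math.sqrt(a * a + 1)

def find_outliers(points, a, b, tol=1e-9):
    """Return the point(s) farthest from the line, or [] if all are equally far."""
    dists = [perpendicular_distance(x, y, a, b) for x, y in points]
    farthest = max(dists)
    if farthest - min(dists) <= tol:
        return []  # every point is the same distance from the line
    return [p for p, d in zip(points, dists) if farthest - d <= tol]

points = list(zip(speeds, distances))
print(find_outliers(points, a, b))
```

For this data set the function returns a single point, (100, 257), which lies farther from the regression line than any other point.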
An influential observation (inf. obs.) is a point on a scatter diagram that has a large horizontal gap containing no points between it and a vast majority of the other points. As shown in the graph below, there can be more than one influential observation. If there is no large horizontal gap between data points in a scatter diagram, there are no influential observations. In many cases, a scatter diagram will have no influential observations; but influential observations should be identified if they occur.
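One way to look for a large horizontal gap is to sort the x-values and measure the space between neighbors. The sketch below only reports the largest gap; deciding whether that gap is "large" is a judgment the text leaves to the reader, and the function name is an illustrative choice.

```python
# x-values (club-head speeds) from Table 1
speeds = [100, 102, 103, 101, 105, 100, 99, 105]

def largest_horizontal_gap(xs):
    """Return (gap, left, right) for the widest space between neighboring x-values."""
    ordered = sorted(xs)
    return max((right - left, left, right)
               for left, right in zip(ordered, ordered[1:]))

print(largest_horizontal_gap(speeds))
```

For the Table 1 data the widest gap is only 2 mph, between 103 and 105, so no point is separated from the rest by a large horizontal gap.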
When an influential observation is moved up or down and the regression line is recomputed, the new line will be much closer to the new location of the influential observation. If a non-influential observation is relocated, the recomputed regression line will be in almost the same position as the original regression line. Thus the influential observation "influences" the location of the regression line. (See the Influential Observations GeoGebra applet.)
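This effect can be seen numerically. The sketch below uses a small made-up data set (it is hypothetical and not from the textbook) in which the point at x = 20 is separated from the rest by a large horizontal gap. Each of two points is moved up by 10 units in turn, the line is refit, and the vertical distance from the moved point to the new line is measured.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def miss_after_move(xs, ys, index, shift):
    """Move one point vertically, refit, and return how far the new line misses it."""
    moved = list(ys)
    moved[index] += shift
    a, b = fit_line(xs, moved)
    return abs(moved[index] - (a * xs[index] + b))

# Hypothetical data: all points start on the line y = x + 1
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 3, 4, 5, 6, 21]

miss_influential = miss_after_move(xs, ys, index=5, shift=10)  # point at x = 20
miss_ordinary = miss_after_move(xs, ys, index=2, shift=10)     # point at x = 3
print(round(miss_influential, 2), round(miss_ordinary, 2))
```

After the point at x = 20 is moved, the new line passes within a fraction of a unit of it, while the line refit after moving the point at x = 3 still misses that point by about 8 units. The isolated point pulls the line toward itself far more strongly than a point in the middle of the data does.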
Exercises: Identify influential observations and outliers in the scatter diagrams shown below.