Outliers and Influential Observations on a Scatter Plot

 

    If a scatter plot includes a regression line, you can identify outliers. An outlier on a scatter plot is the point, or points, farthest from the regression line. Most scatter plots have at least one outlier, and usually only one. Note that an outlier on a scatter plot is defined very differently from an outlier on a boxplot.

    The distance from a point to the regression line is the length of the line segment that runs from the point to the regression line and is perpendicular to it. If one point of a scatter plot is farther from the regression line than some other point, then the scatter plot has at least one outlier. If several points share the same greatest distance from the regression line, then all of these points are outliers. If every point of the scatter plot is the same distance from the regression line, then there is no outlier.
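The rule above can be sketched in a few lines of Python. This is only an illustration: it assumes the regression line is the ordinary least-squares line, and the function names (fit_line, perpendicular_distance, find_outliers), the tolerance tol, and the sample data are made up for the example.

```python
import math

def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) pairs.
    Assumes the x values are not all identical."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def perpendicular_distance(point, slope, intercept):
    """Length of the perpendicular segment from the point to y = slope*x + intercept."""
    x, y = point
    return abs(slope * x - y + intercept) / math.sqrt(slope ** 2 + 1)

def find_outliers(points, tol=1e-9):
    """Return the point(s) farthest from the regression line.
    Returns an empty list when every point is the same distance from the line."""
    slope, intercept = fit_line(points)
    distances = [perpendicular_distance(p, slope, intercept) for p in points]
    farthest = max(distances)
    if farthest - min(distances) <= tol:
        return []  # all points equally far from the line: no outlier
    # Keep every point tied (within tol) for the greatest distance.
    return [p for p, d in zip(points, distances) if farthest - d <= tol]

# Hypothetical data: four points follow one trend, (3, 6) sits well above it.
sample = [(1, 1), (2, 2), (3, 6), (4, 4), (5, 5)]
print(find_outliers(sample))
```

The tolerance is there only because floating-point distances that should be equal can differ in the last few digits.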

 

 

    An influential observation (inf. obs.) is a point on a scatter plot that is separated from the vast majority of the other points by a large horizontal gap containing no points. As shown in the graph below, there can be more than one influential observation. If there is no large horizontal gap between data points in a scatter plot, there are no influential observations. Many scatter plots have no influential observations, but any that do occur should be identified.
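The definition above leaves "large" and "vast majority" to judgment when reading a graph. A minimal Python sketch of the same idea needs explicit cutoffs, so the two thresholds here (gap_fraction and max_share) are assumptions chosen for illustration, as are the function name and the sample data.

```python
def find_influential(points, gap_fraction=0.5, max_share=0.25):
    """Return points cut off from most of the data by a wide empty horizontal gap.

    Assumptions (not from the lesson): a gap counts as "large" when it spans
    at least gap_fraction of the full x-range, and the separated group must
    hold at most max_share of all points.
    """
    xs = sorted({x for x, _ in points})
    if len(xs) < 2:
        return []
    x_range = xs[-1] - xs[0]
    # Widest empty stretch between neighbouring x values, and where it starts.
    widest, split = max((b - a, a) for a, b in zip(xs, xs[1:]))
    if widest < gap_fraction * x_range:
        return []  # no large horizontal gap
    left = [p for p in points if p[0] <= split]
    right = [p for p in points if p[0] > split]
    smaller = left if len(left) < len(right) else right
    if len(smaller) > max_share * len(points):
        return []  # neither side is a small group set apart from the majority
    return smaller

# Hypothetical data: six points clustered on the left, one far to the right.
sample = [(1, 2), (2, 3), (3, 3), (4, 5), (5, 5), (6, 6), (15, 9)]
print(find_influential(sample))
```

This sketch looks only at the single widest gap, so it would miss a plot that has separated points on both sides of the main cluster.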

 

 

 

When an influential observation is moved up or down and the regression line is recomputed, the new line will pass much closer to the new location of the influential observation. If a non-influential observation is relocated, the recomputed regression line will be in almost the same position as the original regression line. Thus, the influential observation has "influence" on the location of the regression line.
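This effect can be checked numerically. The sketch below assumes an ordinary least-squares regression line and uses made-up data. The helper names (fit_line, fitted_value, line_shift_at) are hypothetical. It moves one point up by 5 units, refits the line, and reports how far the line itself moved at that point's x value.

```python
def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fitted_value(points, x):
    """Height of the regression line for these points at horizontal position x."""
    slope, intercept = fit_line(points)
    return slope * x + intercept

def line_shift_at(points, index, dy):
    """Move points[index] vertically by dy, refit, and return how far the
    regression line moved at that point's x value."""
    x, y = points[index]
    moved = list(points)
    moved[index] = (x, y + dy)
    return abs(fitted_value(moved, x) - fitted_value(points, x))

# Hypothetical data: (15, 9) is separated from the rest by a wide horizontal gap.
sample = [(1, 2), (2, 3), (3, 3), (4, 5), (5, 5), (6, 6), (15, 9)]

shift_influential = line_shift_at(sample, 6, 5)  # move (15, 9) up 5 units
shift_ordinary = line_shift_at(sample, 2, 5)     # move (3, 3) up 5 units
print(shift_influential, shift_ordinary)
```

With this data the line follows the isolated point most of the way to its new position, while moving a point inside the cluster shifts the line by less than one unit.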

 

Exercises

Identify influential observations and outliers in the scatter plots shown below.