1 Anscombe’s Quartet


Sometimes data can be tricky. Real-world data does not always conform to the neat, ordered examples found in textbooks. A special data set, ‘Anscombe’s quartet,’ serves as a powerful teaching tool here: it illustrates the difficulties inherent in some data and the limitations of straightforward descriptive statistics. Below we explore Anscombe’s quartet using both R and Python.

2 Francis Anscombe


Francis J. Anscombe

Francis Anscombe was a renowned British statistician. Born in 1918, he studied mathematics at Cambridge University and went on to work on statistical aspects of problems in agriculture, industry, and physics. He later joined Yale University, where he played a seminal role in establishing its statistics department. Anscombe was a major proponent of understanding the conditions necessary for a statistical model to be applicable to a data set. He passed away in 2001. Notably, he was brother-in-law to John Tukey, another well-known statistician.

3 The Data Set


To illustrate these potential difficulties, Anscombe introduced a remarkable data set, now called Anscombe’s quartet, in a celebrated 1973 paper. It contains four pairs of \(x\) (predictor) and \(y\) (response) variables. Curiously, all four data sets have virtually identical means and variances for both the \(x\)’s and the \(y\)’s, and within each data set the \(x\)’s and \(y\)’s have virtually identical correlation. A linear regression of \(y\) on \(x\) produces a virtually identical fitted line for each data set and returns a virtually identical \(R^2\). This may lead an observer to conclude that all four data sets are equally suitable for regression.
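
Concretely, each of the four data sets shares (to the precision shown in the output below) the same summary values, which the computations that follow confirm:

\[
\bar{x} = 9, \qquad \bar{y} = 7.50, \qquad s_x^2 = 11, \qquad s_y^2 \approx 4.12, \qquad r_{xy} \approx 0.82,
\]

\[
\hat{y} \approx 3.00 + 0.50\,x, \qquad R^2 \approx 0.67.
\]

For simple linear regression \(R^2\) is just the square of the correlation coefficient, so identical correlations already force identical \(R^2\) values (\(0.82^2 \approx 0.67\)).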


We compute these basic statistics and fit the regression for each data set below, using both R and Python. In R the data set ships with the base distribution; in Python it is available via the seaborn package.

Descriptive Statistics in R

data("anscombe") # load dataset, available in base

writeLines("STATS IN R\n")

writeLines("\nMean of Xs\n")
colMeans(anscombe[,1:4]) # means of the x's

writeLines("\nMean of Ys\n")
dis = as.numeric(format(colMeans(anscombe[,5:8]), digits = 2))  # means of the y's
names(dis) = names(anscombe[,5:8])
dis

writeLines("\nVariances of Xs\n")
dis = as.numeric(format(sapply(anscombe[,1:4],var), digits = 2)) # variance of the x's
names(dis) = names(anscombe[,1:4])
dis

writeLines("\nVariances of Ys\n")
dis = as.numeric(format(sapply(anscombe[,5:8],var), digits = 2)) # variances of the y's
names(dis) = names(anscombe[,5:8])
dis

writeLines("\nCorrelation of X1 and Y1\n")
dis = as.numeric(format(cor(anscombe[,c(1,5)]), digits = 2)) # correlation of data set 1
dis = matrix(dis, nrow = 2)
colnames(dis) = names(anscombe[,c(1,5)])
rownames(dis) = names(anscombe[,c(1,5)])
dis

writeLines("\nCorrelation of X2 and Y2\n")
dis = as.numeric(format(cor(anscombe[,c(2,6)]), digits = 2)) # correlation of data set 2
dis = matrix(dis, nrow = 2)
colnames(dis) = names(anscombe[,c(2,6)])
rownames(dis) = names(anscombe[,c(2,6)])
dis

writeLines("\nCorrelation of X3 and Y3\n")
dis = as.numeric(format(cor(anscombe[,c(3,7)]), digits = 2)) # correlation of data set 3
dis = matrix(dis, nrow = 2)
colnames(dis) = names(anscombe[,c(3,7)])
rownames(dis) = names(anscombe[,c(3,7)])
dis

writeLines("\nCorrelation of X4 and Y4\n")
dis = as.numeric(format(cor(anscombe[,c(4,8)]), digits = 2)) # correlation of data set 4
dis = matrix(dis, nrow = 2)
colnames(dis) = names(anscombe[,c(4,8)])
rownames(dis) = names(anscombe[,c(4,8)])
dis
## STATS IN R
## 
## 
## Mean of Xs
## 
## x1 x2 x3 x4 
##  9  9  9  9 
## 
## Mean of Ys
## 
##  y1  y2  y3  y4 
## 7.5 7.5 7.5 7.5 
## 
## Variances of Xs
## 
## x1 x2 x3 x4 
## 11 11 11 11 
## 
## Variances of Ys
## 
##  y1  y2  y3  y4 
## 4.1 4.1 4.1 4.1 
## 
## Correlation of X1 and Y1
## 
##      x1   y1
## x1 1.00 0.82
## y1 0.82 1.00
## 
## Correlation of X2 and Y2
## 
##      x2   y2
## x2 1.00 0.82
## y2 0.82 1.00
## 
## Correlation of X3 and Y3
## 
##      x3   y3
## x3 1.00 0.82
## y3 0.82 1.00
## 
## Correlation of X4 and Y4
## 
##      x4   y4
## x4 1.00 0.82
## y4 0.82 1.00


Descriptive Statistics in Python

import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import statsmodels.formula.api as smf

anscombe = sb.load_dataset(name='anscombe')

print("STATS IN PYTHON\n")

print("\nMeans of Xs and Ys\n")
with pd.option_context('display.float_format', '{:,.2f}'.format):
  print(anscombe.groupby('dataset').mean())   # means of x and y for each data set

print("\nVariances of Xs and Ys\n")
with pd.option_context('display.float_format', '{:,.2f}'.format):
  print(anscombe.groupby('dataset').var())    # variances of x and y for each data set

print("\nCorrelations of Xs and Ys\n")
with pd.option_context('display.float_format', '{:,.2f}'.format):
  print(anscombe.groupby('dataset').corr())   # correlation of x and y within each data set
## STATS IN PYTHON
## 
## 
## Means of Xs and Ys
## 
##            x    y
## dataset          
## I       9.00 7.50
## II      9.00 7.50
## III     9.00 7.50
## IV      9.00 7.50
## 
## Variances of Xs and Ys
## 
##             x    y
## dataset           
## I       11.00 4.13
## II      11.00 4.13
## III     11.00 4.12
## IV      11.00 4.12
## 
## Correlations of Xs and Ys
## 
##              x    y
## dataset            
## I       x 1.00 0.82
##         y 0.82 1.00
## II      x 1.00 0.82
##         y 0.82 1.00
## III     x 1.00 0.82
##         y 0.82 1.00
## IV      x 1.00 0.82
##         y 0.82 1.00


Linear Regression in R

reg1 = lm(y1~x1,data=anscombe)
reg2 = lm(y2~x2,data=anscombe)
reg3 = lm(y3~x3,data=anscombe)
reg4 = lm(y4~x4,data=anscombe)

writeLines("REGRESSION MODELS IN R\n")

writeLines("\nData Set 1 Regression\n")
summary(reg1)

writeLines("\nData Set 2 Regression\n")
summary(reg2)

writeLines("\nData Set 3 Regression\n")
summary(reg3)

writeLines("\nData Set 4 Regression\n")
summary(reg4)
## REGRESSION MODELS IN R
## 
## 
## Data Set 1 Regression
## 
## 
## Call:
## lm(formula = y1 ~ x1, data = anscombe)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
## 
## 
## Data Set 2 Regression
## 
## 
## Call:
## lm(formula = y2 ~ x2, data = anscombe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9009 -0.7609  0.1291  0.9491  1.2691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125   2.667  0.02576 * 
## x2             0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179
## 
## 
## Data Set 3 Regression
## 
## 
## Call:
## lm(formula = y3 ~ x3, data = anscombe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1586 -0.6146 -0.2303  0.1540  3.2411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0025     1.1245   2.670  0.02562 * 
## x3            0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
## 
## 
## Data Set 4 Regression
## 
## 
## Call:
## lm(formula = y4 ~ x4, data = anscombe)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0017     1.1239   2.671  0.02559 * 
## x4            0.4999     0.1178   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165

Linear Regression in Python (sklearn)

# fit_intercept defaults to True in LinearRegression
X1 = anscombe.loc[anscombe.dataset=='I',['x']]
X2 = anscombe.loc[anscombe.dataset=='II',['x']]
X3 = anscombe.loc[anscombe.dataset=='III',['x']]
X4 = anscombe.loc[anscombe.dataset=='IV',['x']]

Y1 = anscombe.loc[anscombe.dataset=='I',['y']]
Y2 = anscombe.loc[anscombe.dataset=='II',['y']]
Y3 = anscombe.loc[anscombe.dataset=='III',['y']]
Y4 = anscombe.loc[anscombe.dataset=='IV',['y']]

reg1 = LinearRegression().fit(X1,Y1)
reg2 = LinearRegression().fit(X2,Y2)
reg3 = LinearRegression().fit(X3,Y3)
reg4 = LinearRegression().fit(X4,Y4)

print("REGRESSION MODELS IN SKLEARN\n")

print("\nData Set I Regression\n")
print(
  "slope:",
  '%.2f' % reg1.coef_[0][0],
  "intercept:",
  '%.2f' % reg1.intercept_[0],
  "R^2:",
  '%.2f ' % reg1.score(X1,Y1),
  sep = " "
)

print("\nData Set II Regression\n")
print(
  "slope:",
  '%.2f' % reg2.coef_[0][0],
  "intercept:",
  '%.2f' % reg2.intercept_[0],
  "R^2:",
  '%.2f ' % reg2.score(X2,Y2),
  sep = " "
)

print("\nData Set III Regression\n")
print(
  "slope:",
  '%.2f' % reg3.coef_[0][0],
  "intercept:",
  '%.2f' % reg3.intercept_[0],
  "R^2:",
  '%.2f ' % reg3.score(X3,Y3),
  sep = " "
)

print("\nData Set IV Regression\n")
print(
  "slope:",
  '%.2f' % reg4.coef_[0][0],
  "intercept:",
  '%.2f' % reg4.intercept_[0],
  "R^2:",
  '%.2f ' % reg4.score(X4,Y4),
  sep = " "
)
## REGRESSION MODELS IN SKLEARN
## 
## 
## Data Set I Regression
## 
## slope: 0.50 intercept: 3.00 R^2: 0.67 
## 
## Data Set II Regression
## 
## slope: 0.50 intercept: 3.00 R^2: 0.67 
## 
## Data Set III Regression
## 
## slope: 0.50 intercept: 3.00 R^2: 0.67 
## 
## Data Set IV Regression
## 
## slope: 0.50 intercept: 3.00 R^2: 0.67


Linear Regression in Python (statsmodels)

x1 = sm.add_constant(X1)
x2 = sm.add_constant(X2)
x3 = sm.add_constant(X3)
x4 = sm.add_constant(X4)

M1 = sm.OLS(Y1,x1)
M2 = sm.OLS(Y2,x2)
M3 = sm.OLS(Y3,x3)
M4 = sm.OLS(Y4,x4)

R1 = M1.fit()
R2 = M2.fit()
R3 = M3.fit()
R4 = M4.fit()

print("REGRESSION MODELS IN STATSMODELS\n")

print("\nData Set I Regression\n")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(R1.summary())

print("\nData Set II Regression\n")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(R2.summary())

print("\nData Set III Regression\n")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(R3.summary())


print("\nData Set IV Regression\n")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(R4.summary())
## REGRESSION MODELS IN STATSMODELS
## 
## 
## Data Set I Regression
## 
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                      y   R-squared:                       0.667
## Model:                            OLS   Adj. R-squared:                  0.629
## Method:                 Least Squares   F-statistic:                     17.99
## Date:                Fri, 26 May 2023   Prob (F-statistic):            0.00217
## Time:                        07:43:57   Log-Likelihood:                -16.841
## No. Observations:                  11   AIC:                             37.68
## Df Residuals:                       9   BIC:                             38.48
## Df Model:                           1                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          3.0001      1.125      2.667      0.026       0.456       5.544
## x              0.5001      0.118      4.241      0.002       0.233       0.767
## ==============================================================================
## Omnibus:                        0.082   Durbin-Watson:                   3.212
## Prob(Omnibus):                  0.960   Jarque-Bera (JB):                0.289
## Skew:                          -0.122   Prob(JB):                        0.865
## Kurtosis:                       2.244   Cond. No.                         29.1
## ==============================================================================
## 
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## 
## Data Set II Regression
## 
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                      y   R-squared:                       0.666
## Model:                            OLS   Adj. R-squared:                  0.629
## Method:                 Least Squares   F-statistic:                     17.97
## Date:                Fri, 26 May 2023   Prob (F-statistic):            0.00218
## Time:                        07:43:57   Log-Likelihood:                -16.846
## No. Observations:                  11   AIC:                             37.69
## Df Residuals:                       9   BIC:                             38.49
## Df Model:                           1                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          3.0009      1.125      2.667      0.026       0.455       5.547
## x              0.5000      0.118      4.239      0.002       0.233       0.767
## ==============================================================================
## Omnibus:                        1.594   Durbin-Watson:                   2.188
## Prob(Omnibus):                  0.451   Jarque-Bera (JB):                1.108
## Skew:                          -0.567   Prob(JB):                        0.575
## Kurtosis:                       1.936   Cond. No.                         29.1
## ==============================================================================
## 
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## 
## Data Set III Regression
## 
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                      y   R-squared:                       0.666
## Model:                            OLS   Adj. R-squared:                  0.629
## Method:                 Least Squares   F-statistic:                     17.97
## Date:                Fri, 26 May 2023   Prob (F-statistic):            0.00218
## Time:                        07:43:57   Log-Likelihood:                -16.838
## No. Observations:                  11   AIC:                             37.68
## Df Residuals:                       9   BIC:                             38.47
## Df Model:                           1                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          3.0025      1.124      2.670      0.026       0.459       5.546
## x              0.4997      0.118      4.239      0.002       0.233       0.766
## ==============================================================================
## Omnibus:                       19.540   Durbin-Watson:                   2.144
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):               13.478
## Skew:                           2.041   Prob(JB):                      0.00118
## Kurtosis:                       6.571   Cond. No.                         29.1
## ==============================================================================
## 
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## 
## Data Set IV Regression
## 
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                      y   R-squared:                       0.667
## Model:                            OLS   Adj. R-squared:                  0.630
## Method:                 Least Squares   F-statistic:                     18.00
## Date:                Fri, 26 May 2023   Prob (F-statistic):            0.00216
## Time:                        07:43:57   Log-Likelihood:                -16.833
## No. Observations:                  11   AIC:                             37.67
## Df Residuals:                       9   BIC:                             38.46
## Df Model:                           1                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          3.0017      1.124      2.671      0.026       0.459       5.544
## x              0.4999      0.118      4.243      0.002       0.233       0.766
## ==============================================================================
## Omnibus:                        0.555   Durbin-Watson:                   1.662
## Prob(Omnibus):                  0.758   Jarque-Bera (JB):                0.524
## Skew:                           0.010   Prob(JB):                        0.769
## Kurtosis:                       1.931   Cond. No.                         29.1
## ==============================================================================
## 
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


With all these statistics lining up, one might think the four data sets are practically identical, or at least interchangeable. Yet only one is actually appropriate for linear regression. That fact is invisible in the descriptive statistics and the fitted regressions, but a simple examination of side-by-side graphs reveals it immediately. Only the first data set has a scatter pattern appropriate for linear regression. The second exhibits a quadratic relationship. The third contains an outlier which, when removed, leaves a perfect linear relationship; a quick numerical check of this claim appears below. The fourth data set is inappropriate for linear regression altogether.
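
Before plotting, the claim about the third data set can be checked numerically. The following is a minimal sketch in Python; it assumes the anscombe data frame loaded from seaborn and the LinearRegression import from the sklearn section above are still in scope, and the y < 10 filter is simply a convenient way to isolate the single outlying point.

Data Set III Outlier Check in Python

# Drop the single outlying point from data set III and refit the line
d3 = anscombe[anscombe.dataset == 'III']
d3_trim = d3[d3.y < 10]                      # keeps the ten remaining (collinear) points
X3t, Y3t = d3_trim[['x']], d3_trim[['y']]
reg3t = LinearRegression().fit(X3t, Y3t)
print(
  "slope:",
  '%.3f' % reg3t.coef_[0][0],
  "intercept:",
  '%.3f' % reg3t.intercept_[0],
  "R^2:",
  '%.4f' % reg3t.score(X3t, Y3t),
  sep = " "
)

The trimmed fit returns an \(R^2\) of essentially 1, confirming that the remaining ten points lie on a line.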

Regression Plots in R

par(
  mfrow=c(2,2),
  bg = "#71797E"
)

plot(0, 0, type="n", ann=FALSE, axes=FALSE)

rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col="#C0C0C0", border=NA)
par(new=TRUE)

plot(
  anscombe$x1, anscombe$y1,
  main = "Data Set 1",
  xlab = "X",
  ylab = "Y",
  pch = 21,
  col = "darkgreen",
  bg = "green"
)
abline(reg1, lwd = 1.5)

plot(
  anscombe$x2, anscombe$y2,
  main = "Data Set 2",
  xlab = "X",
  ylab = "Y",
  pch = 21,
  col = "darkgreen",
  bg = "green"
)

rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col="#C0C0C0", border=NA)

points(
  anscombe$x2, anscombe$y2,
  pch = 21,
  col = "darkgreen",
  bg = "green"
)
abline(reg2, lwd = 1.5)

plot(
  anscombe$x3, anscombe$y3,
  main = "Data Set 3",
  xlab = "X",
  ylab = "Y",
  pch = 21,
  col = "darkgreen",
  bg = "green"
)

rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col="#C0C0C0", border=NA)

points(
  anscombe$x3, anscombe$y3,
  pch = 21,
  col = "darkgreen",
  bg = "green"
)
abline(reg3, lwd = 1.5)

plot(
  anscombe$x4, anscombe$y4,
  main = "Data Set 4",
  xlab = "X",
  ylab = "Y",
  pch = 21,
  col = "darkgreen",
  bg = "green"
)

rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col="#C0C0C0", border=NA)

points(
  anscombe$x4, anscombe$y4,
  pch = 21,
  col = "darkgreen",
  bg = "green"
)
abline(reg4, lwd = 1.5)

mtext("Anscombe's Quartet",                   # Add main title
      side = 3,
      line = - 2,
      outer = TRUE)


Regression Plots in Python

Xp = np.array([anscombe.x.min(),anscombe.x.max()]).reshape((2,1))

fig, axs = plt.subplots(2,2)
fig.set_facecolor('#71797E')
fig.tight_layout(h_pad=2)
fig.subplots_adjust(top=0.9)
fig.suptitle('Anscombe\'s Quartet')

axs[0,0].scatter(X1,Y1,color='k')
axs[0,0].plot(Xp,reg1.predict(Xp),color='g')
axs[0,0].set_title('Dataset I')
axs[0,0].set_facecolor('#C0C0C0')

axs[0,1].scatter(X2,Y2,color='k')
axs[0,1].plot(Xp,reg2.predict(Xp),color='g')
axs[0,1].set_title('Dataset II')
axs[0,1].set_facecolor('#C0C0C0')

axs[1,0].scatter(X3,Y3,color='k')
axs[1,0].plot(Xp,reg3.predict(Xp),color='g')
axs[1,0].set_title('Dataset III')
axs[1,0].set_facecolor('#C0C0C0')

axs[1,1].scatter(X4,Y4,color='k')
axs[1,1].plot(Xp,reg4.predict(Xp),color='g')
axs[1,1].set_title('Dataset IV')
axs[1,1].set_facecolor('#C0C0C0')

plt.show()


Another way to see this is to plot the residuals from the regression fits. Only the plot for the first data set shows the random scatter expected when linear regression is appropriate. The other three data sets all exhibit patterns indicating that a linear regression is ill-suited.

Residual Plots in R

par(
  mfrow=c(2,2),
  bg = "#71797E"
)

plot(0, 0, type="n", ann=FALSE, axes=FALSE)

rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col="#C0C0C0", border=NA)
par(new=TRUE)


plot(
  anscombe$x1,resid(reg1),
  main = "Data Set 1",
  xlab = "X",
  ylab = "Residual",
  pch = 21,
  col = "darkred",
  bg = "red"
)
abline(h = 0, lwd = 1.5)

plot(
  anscombe$x2,resid(reg2),
  main = "Data Set 2",
  xlab = "X",
  ylab = "Residual",
  pch = 21,
  col = "darkred",
  bg = "red"
)

rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col="#C0C0C0", border=NA)

points(
  anscombe$x2,resid(reg2),
  pch = 21,
  col = "darkred",
  bg = "red"
)
abline(h = 0, lwd = 1.5)

plot(
  anscombe$x3,resid(reg3),
  main = "Data Set 3",
  xlab = "X",
  ylab = "Residual",
  pch = 21,
  col = "darkred",
  bg = "red"
)

rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col="#C0C0C0", border=NA)

points(
  anscombe$x3,resid(reg3),
  pch = 21,
  col = "darkred",
  bg = "red"
)
abline(h = 0, lwd = 1.5)

plot(
  anscombe$x4,resid(reg4),
  main = "Data Set 4",
  xlab = "X",
  ylab = "Residual",
  pch = 21,
  col = "darkred",
  bg = "red"
)

rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col="#C0C0C0", border=NA)

points(
  anscombe$x4,resid(reg4),
  pch = 21,
  col = "darkred",
  bg = "red"
)
abline(h = 0, lwd = 1.5)

mtext("Residual Plots",                   # Add main title
      side = 3,
      line = - 2,
      outer = TRUE)
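
For comparison, the same residual plots can be drawn in Python. The sketch below assumes the X1–X4 and Y1–Y4 data frames and the fitted sklearn models reg1–reg4 from above are still in scope; the loop over panels is just one convenient way to arrange the four plots.

Residual Plots in Python

fig, axs = plt.subplots(2,2)
fig.set_facecolor('#71797E')
fig.tight_layout(h_pad=2)
fig.subplots_adjust(top=0.9)
fig.suptitle('Residual Plots')

panels = [
  (axs[0,0], X1, Y1, reg1, 'Dataset I'),
  (axs[0,1], X2, Y2, reg2, 'Dataset II'),
  (axs[1,0], X3, Y3, reg3, 'Dataset III'),
  (axs[1,1], X4, Y4, reg4, 'Dataset IV'),
]

for ax, X, Y, reg, title in panels:
    resid = Y.values - reg.predict(X)        # observed minus fitted values
    ax.scatter(X, resid, color='r')
    ax.axhline(0, color='k')                 # reference line at zero residual
    ax.set_title(title)
    ax.set_facecolor('#C0C0C0')

plt.show()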


In particular, fitting the second data set with a polynomial regression of order two, that is, a quadratic curve, recovers an essentially perfect relationship between the \(x\)’s and \(y\)’s. This suggests that Anscombe likely used a quadratic equation to generate this data set.

Quadratic Regression and Plot in R

reg2.1 = lm(y2 ~ x2 + I(x2^2), data = anscombe)

xnew = seq(min(anscombe$x2), max(anscombe$x2), by = .1)
xnew = data.frame(x2 = xnew)

par(bg = "#71797E")

plot(0, 0, type="n", ann=FALSE, axes=FALSE)

rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col="#C0C0C0", border=NA)
par(new=TRUE)

plot(
  anscombe$x2, anscombe$y2,
  main = "Data set 2 Quadratic Regression",
  xlab = "X",
  ylab = "Y",
  pch = 21,
  col = "purple",
  bg = 'violet'
)

lines(xnew$x2,predict.lm(reg2.1,xnew), lwd = 1.5)


Quadratic Regression (statsmodels.formula) and Plots in Python

quad = pd.concat([X2,Y2],axis=1)

form1 = 'y ~ x'
form2 = 'y ~ x + I(x**2)'

mod1 = smf.ols(formula=form1, data=quad)
mod2 = smf.ols(formula=form2, data=quad)

res1 = mod1.fit()
res2 = mod2.fit()

fig, axs = plt.subplots(1,2)

fig.set_facecolor('#71797E')

fig.suptitle('Dataset II Linear vs Quadratic')

axs[0].scatter(X2.x,Y2,color='k')
axs[0].plot(X2.x,res1.predict(X2.x),color='g')
axs[0].set_title('Dataset II Linear')
axs[0].set_facecolor('#C0C0C0')

axs[1].scatter(X2.x,Y2,color='k')
axs[1].plot(X2.x.sort_values(),res2.predict(X2.x.sort_values()),color='g')
axs[1].set_title('Dataset II Quadratic')
axs[1].set_facecolor('#C0C0C0')

plt.show()


Data can be tricky, and real-world data can exhibit unexpected patterns. Anscombe’s quartet demonstrates the importance of leaving no stone unturned when analyzing data; in particular, it underscores the value of graphing data and inspecting it visually.