Sometimes data can be tricky. Real-world data does not always
conform to the tidy, well-ordered examples found in textbooks. A
special data set, ‘Anscombe’s quartet,’ serves as a powerful teaching
tool here: it illustrates the difficulties inherent in some data and the
limitations of straightforward descriptive statistics. Below we explore
Anscombe’s quartet using both R and Python.
Francis J. Anscombe
Francis Anscombe was a renowned British statistician. Born in 1918,
he studied mathematics at Cambridge University and went on to work on
statistical aspects of problems in agriculture, industry, and physics.
He later joined Yale University, where he played a seminal role in
establishing its statistics department. Anscombe was a major proponent
of understanding the conditions necessary for a statistical model
to be applicable to a data set. He passed away in 2001. Notably,
he was a brother-in-law of John Tukey, another well-known statistician.
To illustrate potential difficulties, Anscombe introduced a
remarkable data set, now called Anscombe’s quartet, in a celebrated 1973
paper. It contains four pairs of \(x\) (predictor) and \(y\) (response) variables. Quite curiously,
all four data sets have virtually identical means and variances
for the \(x\)’s and the \(y\)’s, and in every data set
the \(x\)’s and \(y\)’s have virtually identical correlation.
A linear regression of \(y\) on \(x\) produces a virtually
identical line for each data set and returns a virtually identical \(R^2\). This may lead an observer to
conclude that all four data sets are equally suitable for
regression.
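Concretely, to the two significant digits reported in the computations below, every one of the four data sets shares the same summary statistics and the same fitted line:
\[
\bar{x} = 9, \qquad s_x^2 = 11, \qquad \bar{y} \approx 7.50, \qquad s_y^2 \approx 4.12, \qquad r_{xy} \approx 0.82,
\]
\[
\hat{y} \approx 3.00 + 0.50\,x, \qquad R^2 \approx 0.67.
\]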
We compute these basic statistics and fit the regression for each data set
below using both R and Python. Anscombe’s quartet is available in both
languages: in R the data set ships with the base distribution;
in Python it is available via the seaborn package.
Descriptive Statistics in R
data("anscombe") # load dataset, available in base
writeLines("STATS IN R\n")
writeLines("\nMean of Xs\n")
colMeans(anscombe[,1:4]) # means of the x's
writeLines("\nMean of Ys\n")
= as.numeric(format(colMeans(anscombe[,5:8]), digits = 2)) # means of the y's
dis names(dis) = names(anscombe[,5:8])
dis
writeLines("\nVariances of Xs\n")
= as.numeric(format(sapply(anscombe[,1:4],var), digits = 2)) # variance of the x's
dis names(dis) = names(anscombe[,1:4])
dis
writeLines("\nVariances of Ys\n")
= as.numeric(format(sapply(anscombe[,5:8],var), digits = 2)) # variance of the x's
dis names(dis) = names(anscombe[,5:8])
dis
writeLines("\nCorrelation of X1 and Y1\n")
= as.numeric(format(cor(anscombe[,c(1,5)]), digits = 2)) # correlation of data set 1
dis = matrix(dis, nrow = 2)
dis colnames(dis) = names(anscombe[,c(1,5)])
rownames(dis) = names(anscombe[,c(1,5)])
dis
writeLines("\nCorrelation of X2 and Y2\n")
= as.numeric(format(cor(anscombe[,c(2,6)]), digits = 2)) # correlation of data set 2
dis = matrix(dis, nrow = 2)
dis colnames(dis) = names(anscombe[,c(2,6)])
rownames(dis) = names(anscombe[,c(2,6)])
dis
writeLines("\nCorrelation of X3 and Y3\n")
= as.numeric(format(cor(anscombe[,c(3,7)]), digits = 2)) # correlation of data set 3
dis = matrix(dis, nrow = 2)
dis colnames(dis) = names(anscombe[,c(3,7)])
rownames(dis) = names(anscombe[,c(3,7)])
dis
writeLines("\nCorrelation of X4 and Y4\n")
= as.numeric(format(cor(anscombe[,c(4,8)]), digits = 2)) # correlation of data set 4
dis = matrix(dis, nrow = 2)
dis colnames(dis) = names(anscombe[,c(4,8)])
rownames(dis) = names(anscombe[,c(4,8)])
dis
## STATS IN R
##
##
## Mean of Xs
##
## x1 x2 x3 x4
## 9 9 9 9
##
## Mean of Ys
##
## y1 y2 y3 y4
## 7.5 7.5 7.5 7.5
##
## Variances of Xs
##
## x1 x2 x3 x4
## 11 11 11 11
##
## Variances of Ys
##
## y1 y2 y3 y4
## 4.1 4.1 4.1 4.1
##
## Correlation of X1 and Y1
##
## x1 y1
## x1 1.00 0.82
## y1 0.82 1.00
##
## Correlation of X2 and Y2
##
## x2 y2
## x2 1.00 0.82
## y2 0.82 1.00
##
## Correlation of X3 and Y3
##
## x3 y3
## x3 1.00 0.82
## y3 0.82 1.00
##
## Correlation of X4 and Y4
##
## x4 y4
## x4 1.00 0.82
## y4 0.82 1.00
Descriptive Statistics in Python
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import statsmodels.formula.api as smf

anscombe = sb.load_dataset(name='anscombe')  # load dataset shipped with seaborn

print("STATS IN PYTHON\n")

print("\nMeans of Xs and Ys\n")
with pd.option_context('display.float_format', '{:,.2f}'.format):
    print(anscombe.groupby('dataset').mean())

print("\nVariances of Xs and Ys\n")
with pd.option_context('display.float_format', '{:,.2f}'.format):
    print(anscombe.groupby('dataset').var())

print("\nCorrelations of Xs and Ys\n")
with pd.option_context('display.float_format', '{:,.2f}'.format):
    print(anscombe.groupby('dataset').corr())
## STATS IN PYTHON
##
##
## Means of Xs and Ys
##
## x y
## dataset
## I 9.00 7.50
## II 9.00 7.50
## III 9.00 7.50
## IV 9.00 7.50
##
## Variances of Xs and Ys
##
## x y
## dataset
## I 11.00 4.13
## II 11.00 4.13
## III 11.00 4.12
## IV 11.00 4.12
##
## Correlations of Xs and Ys
##
## x y
## dataset
## I x 1.00 0.82
## y 0.82 1.00
## II x 1.00 0.82
## y 0.82 1.00
## III x 1.00 0.82
## y 0.82 1.00
## IV x 1.00 0.82
## y 0.82 1.00
Linear Regression in R
reg1 = lm(y1 ~ x1, data = anscombe)
reg2 = lm(y2 ~ x2, data = anscombe)
reg3 = lm(y3 ~ x3, data = anscombe)
reg4 = lm(y4 ~ x4, data = anscombe)

writeLines("REGRESSION MODELS IN R\n")

writeLines("\nData Set 1 Regression\n")
summary(reg1)

writeLines("\nData Set 2 Regression\n")
summary(reg2)

writeLines("\nData Set 3 Regression\n")
summary(reg3)

writeLines("\nData Set 4 Regression\n")
summary(reg4)
## REGRESSION MODELS IN R
##
##
## Data Set 1 Regression
##
##
## Call:
## lm(formula = y1 ~ x1, data = anscombe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## x1 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
##
##
## Data Set 2 Regression
##
##
## Call:
## lm(formula = y2 ~ x2, data = anscombe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9009 -0.7609 0.1291 0.9491 1.2691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.667 0.02576 *
## x2 0.500 0.118 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
##
##
## Data Set 3 Regression
##
##
## Call:
## lm(formula = y3 ~ x3, data = anscombe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1586 -0.6146 -0.2303 0.1540 3.2411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0025 1.1245 2.670 0.02562 *
## x3 0.4997 0.1179 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
##
##
## Data Set 4 Regression
##
##
## Call:
## lm(formula = y4 ~ x4, data = anscombe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017 1.1239 2.671 0.02559 *
## x4 0.4999 0.1178 4.243 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
Linear Regression in Python (sklearn)
# fit_intercept defaults to True in LinearRegression
X1 = anscombe.loc[anscombe.dataset == 'I', ['x']]
X2 = anscombe.loc[anscombe.dataset == 'II', ['x']]
X3 = anscombe.loc[anscombe.dataset == 'III', ['x']]
X4 = anscombe.loc[anscombe.dataset == 'IV', ['x']]

Y1 = anscombe.loc[anscombe.dataset == 'I', ['y']]
Y2 = anscombe.loc[anscombe.dataset == 'II', ['y']]
Y3 = anscombe.loc[anscombe.dataset == 'III', ['y']]
Y4 = anscombe.loc[anscombe.dataset == 'IV', ['y']]

reg1 = LinearRegression().fit(X1, Y1)
reg2 = LinearRegression().fit(X2, Y2)
reg3 = LinearRegression().fit(X3, Y3)
reg4 = LinearRegression().fit(X4, Y4)

print("REGRESSION MODELS IN SKLEARN\n")

print("\nData Set I Regression\n")
print(
    "slope:",
    '%.2f' % reg1.coef_[0][0],
    "intercept:",
    '%.2f' % reg1.intercept_[0],
    "R^2:",
    '%.2f' % reg1.score(X1, Y1),
    sep=" "
)

print("\nData Set II Regression\n")
print(
    "slope:",
    '%.2f' % reg2.coef_[0][0],
    "intercept:",
    '%.2f' % reg2.intercept_[0],
    "R^2:",
    '%.2f' % reg2.score(X2, Y2),
    sep=" "
)

print("\nData Set III Regression\n")
print(
    "slope:",
    '%.2f' % reg3.coef_[0][0],
    "intercept:",
    '%.2f' % reg3.intercept_[0],
    "R^2:",
    '%.2f' % reg3.score(X3, Y3),
    sep=" "
)

print("\nData Set IV Regression\n")
print(
    "slope:",
    '%.2f' % reg4.coef_[0][0],
    "intercept:",
    '%.2f' % reg4.intercept_[0],
    "R^2:",
    '%.2f' % reg4.score(X4, Y4),
    sep=" "
)
## REGRESSION MODELS IN SKLEARN
##
##
## Data Set I Regression
##
## slope: 0.50 intercept: 3.00 R^2: 0.67
##
## Data Set II Regression
##
## slope: 0.50 intercept: 3.00 R^2: 0.67
##
## Data Set III Regression
##
## slope: 0.50 intercept: 3.00 R^2: 0.67
##
## Data Set IV Regression
##
## slope: 0.50 intercept: 3.00 R^2: 0.67
Linear Regression in Python (statsmodels)
x1 = sm.add_constant(X1)
x2 = sm.add_constant(X2)
x3 = sm.add_constant(X3)
x4 = sm.add_constant(X4)

M1 = sm.OLS(Y1, x1)
M2 = sm.OLS(Y2, x2)
M3 = sm.OLS(Y3, x3)
M4 = sm.OLS(Y4, x4)

R1 = M1.fit()
R2 = M2.fit()
R3 = M3.fit()
R4 = M4.fit()

print("REGRESSION MODELS IN STATSMODELS\n")

print("\nData Set I Regression\n")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(R1.summary())

print("\nData Set II Regression\n")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(R2.summary())

print("\nData Set III Regression\n")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(R3.summary())

print("\nData Set IV Regression\n")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(R4.summary())
## REGRESSION MODELS IN STATSMODELS
##
##
## Data Set I Regression
##
## OLS Regression Results
## ==============================================================================
## Dep. Variable: y R-squared: 0.667
## Model: OLS Adj. R-squared: 0.629
## Method: Least Squares F-statistic: 17.99
## Date: Fri, 26 May 2023 Prob (F-statistic): 0.00217
## Time: 07:43:57 Log-Likelihood: -16.841
## No. Observations: 11 AIC: 37.68
## Df Residuals: 9 BIC: 38.48
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 3.0001 1.125 2.667 0.026 0.456 5.544
## x 0.5001 0.118 4.241 0.002 0.233 0.767
## ==============================================================================
## Omnibus: 0.082 Durbin-Watson: 3.212
## Prob(Omnibus): 0.960 Jarque-Bera (JB): 0.289
## Skew: -0.122 Prob(JB): 0.865
## Kurtosis: 2.244 Cond. No. 29.1
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
##
## Data Set II Regression
##
## OLS Regression Results
## ==============================================================================
## Dep. Variable: y R-squared: 0.666
## Model: OLS Adj. R-squared: 0.629
## Method: Least Squares F-statistic: 17.97
## Date: Fri, 26 May 2023 Prob (F-statistic): 0.00218
## Time: 07:43:57 Log-Likelihood: -16.846
## No. Observations: 11 AIC: 37.69
## Df Residuals: 9 BIC: 38.49
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 3.0009 1.125 2.667 0.026 0.455 5.547
## x 0.5000 0.118 4.239 0.002 0.233 0.767
## ==============================================================================
## Omnibus: 1.594 Durbin-Watson: 2.188
## Prob(Omnibus): 0.451 Jarque-Bera (JB): 1.108
## Skew: -0.567 Prob(JB): 0.575
## Kurtosis: 1.936 Cond. No. 29.1
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
##
## Data Set III Regression
##
## OLS Regression Results
## ==============================================================================
## Dep. Variable: y R-squared: 0.666
## Model: OLS Adj. R-squared: 0.629
## Method: Least Squares F-statistic: 17.97
## Date: Fri, 26 May 2023 Prob (F-statistic): 0.00218
## Time: 07:43:57 Log-Likelihood: -16.838
## No. Observations: 11 AIC: 37.68
## Df Residuals: 9 BIC: 38.47
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 3.0025 1.124 2.670 0.026 0.459 5.546
## x 0.4997 0.118 4.239 0.002 0.233 0.766
## ==============================================================================
## Omnibus: 19.540 Durbin-Watson: 2.144
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 13.478
## Skew: 2.041 Prob(JB): 0.00118
## Kurtosis: 6.571 Cond. No. 29.1
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
##
## Data Set IV Regression
##
## OLS Regression Results
## ==============================================================================
## Dep. Variable: y R-squared: 0.667
## Model: OLS Adj. R-squared: 0.630
## Method: Least Squares F-statistic: 18.00
## Date: Fri, 26 May 2023 Prob (F-statistic): 0.00216
## Time: 07:43:57 Log-Likelihood: -16.833
## No. Observations: 11 AIC: 37.67
## Df Residuals: 9 BIC: 38.46
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 3.0017 1.124 2.671 0.026 0.459 5.544
## x 0.4999 0.118 4.243 0.002 0.233 0.766
## ==============================================================================
## Omnibus: 0.555 Durbin-Watson: 1.662
## Prob(Omnibus): 0.758 Jarque-Bera (JB): 0.524
## Skew: 0.010 Prob(JB): 0.769
## Kurtosis: 1.931 Cond. No. 29.1
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
With all of these statistics lining up, one might think that the four
data sets are practically identical, or at least interchangeable. Yet
only one is actually appropriate for linear regression. This is
invisible in the descriptive statistics and in the basic regression
fits, but a simple examination of side-by-side graphs reveals it
immediately. Only the first data set has a scatter pattern appropriate
for linear regression. The second exhibits a quadratic relationship.
The third contains an outlier which, when removed, leaves an
essentially perfect linear relationship. The fourth is inappropriate
for linear regression altogether.
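As a quick supplementary check of the last two claims (a minimal sketch, not part of the original analysis), we can drop the largest-residual point from data set 3, refit, and tabulate the \(x\) values of data set 4:
out = which.max(abs(resid(reg3)))             # index of the outlier in data set 3
reg3.1 = lm(y3 ~ x3, data = anscombe[-out, ]) # refit without the outlier
summary(reg3.1)$r.squared                     # very nearly 1: the remaining points are almost perfectly collinear
table(anscombe$x4)                            # all but one x4 value equal 8, so a single point determines the slope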
Regression Plots in R
par(
  mfrow = c(2, 2),
  bg = "#71797E"
)

# draw an empty plot, add a grey panel with rect(), then overplot the data
plot(0, 0, type = "n", ann = FALSE, axes = FALSE)
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col = "#C0C0C0", border = NA)
par(new = TRUE)
plot(
  anscombe$x1, anscombe$y1,
  main = "Data Set 1",
  xlab = "X",
  ylab = "Y",
  pch = 21,
  col = "darkgreen",
  bg = "green"
)
abline(reg1, lwd = 1.5)

plot(
  anscombe$x2, anscombe$y2,
  main = "Data Set 2",
  xlab = "X",
  ylab = "Y",
  pch = 21,
  col = "darkgreen",
  bg = "green"
)
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col = "#C0C0C0", border = NA)
points(
  anscombe$x2, anscombe$y2,
  pch = 21,
  col = "darkgreen",
  bg = "green"
)
abline(reg2, lwd = 1.5)

plot(
  anscombe$x3, anscombe$y3,
  main = "Data Set 3",
  xlab = "X",
  ylab = "Y",
  pch = 21,
  col = "darkgreen",
  bg = "green"
)
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col = "#C0C0C0", border = NA)
points(
  anscombe$x3, anscombe$y3,
  pch = 21,
  col = "darkgreen",
  bg = "green"
)
abline(reg3, lwd = 1.5)

plot(
  anscombe$x4, anscombe$y4,
  main = "Data Set 4",
  xlab = "X",
  ylab = "Y",
  pch = 21,
  col = "darkgreen",
  bg = "green"
)
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col = "#C0C0C0", border = NA)
points(
  anscombe$x4, anscombe$y4,
  pch = 21,
  col = "darkgreen",
  bg = "green"
)
abline(reg4, lwd = 1.5)

mtext("Anscombe's Quartet", # Add main title
      side = 3,
      line = -2,
      outer = TRUE)
Regression Plots in Python
Xp = np.array([anscombe.x.min(), anscombe.x.max()]).reshape((2, 1))

fig, axs = plt.subplots(2, 2)
fig.set_facecolor('#71797E')
fig.tight_layout(h_pad=2)
fig.subplots_adjust(top=0.9)
fig.suptitle('Anscombe\'s Quartet')

axs[0, 0].scatter(X1, Y1, color='k')
axs[0, 0].plot(Xp, reg1.predict(Xp), color='g')
axs[0, 0].set_title('Dataset I')
axs[0, 0].set_facecolor('#C0C0C0')

axs[0, 1].scatter(X2, Y2, color='k')
axs[0, 1].plot(Xp, reg2.predict(Xp), color='g')
axs[0, 1].set_title('Dataset II')
axs[0, 1].set_facecolor('#C0C0C0')

axs[1, 0].scatter(X3, Y3, color='k')
axs[1, 0].plot(Xp, reg3.predict(Xp), color='g')
axs[1, 0].set_title('Dataset III')
axs[1, 0].set_facecolor('#C0C0C0')

axs[1, 1].scatter(X4, Y4, color='k')
axs[1, 1].plot(Xp, reg4.predict(Xp), color='g')
axs[1, 1].set_title('Dataset IV')
axs[1, 1].set_facecolor('#C0C0C0')

plt.show()
Another way of seeing this is to plot the residuals from the
regression fits. Only the plot for the first data set shows the
random scatter expected when linear regression is appropriate. The
other data sets all exhibit patterns indicating that a linear
regression is ill-suited.
Residual Plots in R
par(
  mfrow = c(2, 2),
  bg = "#71797E"
)

plot(0, 0, type = "n", ann = FALSE, axes = FALSE)
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col = "#C0C0C0", border = NA)
par(new = TRUE)
plot(
  anscombe$x1, resid(reg1),
  main = "Data Set 1",
  xlab = "X",
  ylab = "Residual",
  pch = 21,
  col = "darkred",
  bg = "red"
)
abline(h = 0, lwd = 1.5)

plot(
  anscombe$x2, resid(reg2),
  main = "Data Set 2",
  xlab = "X",
  ylab = "Residual",
  pch = 21,
  col = "darkred",
  bg = "red"
)
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col = "#C0C0C0", border = NA)
points(
  anscombe$x2, resid(reg2),
  pch = 21,
  col = "darkred",
  bg = "red"
)
abline(h = 0, lwd = 1.5)

plot(
  anscombe$x3, resid(reg3),
  main = "Data Set 3",
  xlab = "X",
  ylab = "Residual",
  pch = 21,
  col = "darkred",
  bg = "red"
)
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col = "#C0C0C0", border = NA)
points(
  anscombe$x3, resid(reg3),
  pch = 21,
  col = "darkred",
  bg = "red"
)
abline(h = 0, lwd = 1.5)

plot(
  anscombe$x4, resid(reg4),
  main = "Data Set 4",
  xlab = "X",
  ylab = "Residual",
  pch = 21,
  col = "darkred",
  bg = "red"
)
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col = "#C0C0C0", border = NA)
points(
  anscombe$x4, resid(reg4),
  pch = 21,
  col = "darkred",
  bg = "red"
)
abline(h = 0, lwd = 1.5)

mtext("Residual Plots", # Add main title
      side = 3,
      line = -2,
      outer = TRUE)
In particular, fitting the second data set with a polynomial
regression of order two, that is, a quadratic curve, yields an
essentially perfect relationship between the \(x\)’s and \(y\)’s. This indicates that Anscombe likely
used a quadratic equation to create this data set.
Quadratic Regression and Plot in R
reg2.1 = lm(y2 ~ x2 + I(x2^2), data = anscombe)

xnew = seq(min(anscombe$x2), max(anscombe$x2), by = .1)
xnew = data.frame(x2 = xnew)

par(bg = "#71797E")
plot(0, 0, type = "n", ann = FALSE, axes = FALSE)
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col = "#C0C0C0", border = NA)
par(new = TRUE)
plot(
  anscombe$x2, anscombe$y2,
  main = "Data Set 2 Quadratic Regression",
  xlab = "X",
  ylab = "Y",
  pch = 21,
  col = "purple",
  bg = "violet"
)
lines(xnew$x2, predict.lm(reg2.1, xnew), lwd = 1.5)
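As a quick supplementary check (again, a sketch that is not part of the original analysis), comparing the \(R^2\) of the linear and quadratic fits for data set 2 makes the difference explicit:
summary(reg2)$r.squared   # roughly 0.67 for the linear fit
summary(reg2.1)$r.squared # essentially 1 for the quadratic fit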
Quadratic Regression (statsmodels.formula) and Plots in Python
quad = pd.concat([X2, Y2], axis=1)

form1 = 'y ~ x'
form2 = 'y ~ x + I(x**2)'

mod1 = smf.ols(formula=form1, data=quad)
mod2 = smf.ols(formula=form2, data=quad)

res1 = mod1.fit()
res2 = mod2.fit()

fig, axs = plt.subplots(1, 2)
fig.set_facecolor('#71797E')
fig.suptitle('Dataset II Linear vs Quadratic')

axs[0].scatter(X2.x, Y2, color='k')
axs[0].plot(X2.x, res1.predict(X2.x), color='g')
axs[0].set_title('Dataset II Linear')
axs[0].set_facecolor('#C0C0C0')

axs[1].scatter(X2.x, Y2, color='k')
axs[1].plot(X2.x.sort_values(), res2.predict(X2.x.sort_values()), color='g')
axs[1].set_title('Dataset II Quadratic')
axs[1].set_facecolor('#C0C0C0')

plt.show()
Data can be tricky, and real-world data can exhibit unexpected
patterns. Anscombe’s quartet demonstrates the importance of leaving no
stone unturned when analyzing data. In particular, it underscores the
importance of graphing data and inspecting it visually.