In this example we look at how to apply a linear regression
model to a highly non-linear data set. Working through several
iterations and relying on analysis of patterns in the data, we are able
to converge on model with a strong fit and good explanatory powers.
The data set may be found on the UC Irvine machine learning
repository:
The data set was created by Karl Ulrich, an MIT professor, and
donated to the UCI data repository by Ross Quinlan. It contains two
discrete numerical predictors dealing with gain settings:
and two categorical variables dealing mechanical components:
The output variable is class, a measure of some sort of reaction time. From Ulrich’s recollection, “the output value is almost certainly a rise time, or the time required for the system to respond to a step change in a position set point.” Hence, presumably lower class values correspond to faster/better performance.
We begin by loading the data set. Then we begin exploring by
examining column names, checking for missing values, etc.
head(servo)
str(servo)
sum(is.na(servo))
## motor screw pgain vgain class
## 1 E E 5 4 0.2812509
## 2 B D 6 5 0.5062525
## 3 D D 4 3 0.3562515
## 4 B A 3 2 5.5000330
## 5 D B 6 5 0.3562515
## 6 E C 4 3 0.8062546
## 'data.frame': 167 obs. of 5 variables:
## $ motor: chr "E" "B" "D" "B" ...
## $ screw: chr "E" "D" "D" "A" ...
## $ pgain: int 5 6 4 3 6 4 3 3 6 4 ...
## $ vgain: int 4 5 3 2 5 3 2 2 5 1 ...
## $ class: num 0.281 0.506 0.356 5.5 0.356 ...
## [1] 0
# servo |> group_by(motor) |> summarize(count=n())
# servo |> group_by(screw) |> summarize(count=n())
# servo |> group_by(pgain) |> summarize(count=n())
# servo |> group_by(vgain) |> summarize(count=n())
|> ggplot(aes(x = class))+
servo geom_histogram(
aes(y = ..density..),
col = 'darkgray',
fill = 'darkred',
bins = 20
+
)ggtitle("Histogram of Class")+
theme(plot.background = element_rect(fill = "darkgray"))
|> ggplot(aes(x = motor))+
servo geom_bar(
aes(y = ..count..),
col = 'darkgray',
fill = 'darkgreen'
+
)ggtitle("Counts of Motor Type")+
theme(plot.background = element_rect(fill = "darkgray"))
|> ggplot(aes(x = screw))+
servo geom_bar(
aes(y = ..count..),
col = 'darkgray',
fill = 'darkgreen'
+
)ggtitle("Counts of Screw Type")+
theme(plot.background = element_rect(fill = "darkgray"))
|> ggplot(aes(x = factor(pgain)))+
servo geom_bar(
aes(y = ..count..),
col = 'darkgray',
fill = 'darkgreen'
+
)ggtitle("Counts of pgain level")+
theme(plot.background = element_rect(fill = "darkgray"))
|> ggplot(aes(x = factor(vgain)))+
servo geom_bar(
aes(y = ..count..),
col = 'darkgray',
fill = 'darkgreen'
+
)ggtitle("Counts of vgain level")+
theme(plot.background = element_rect(fill = "darkgray"))
While there is some variation by category of screw/motor and
level of pgain/vgain, we see all levels/categories have at least 20
observations. This is good for fitting individual predictor effects.
Next we count the number of observations across every possible
combination of the predictor variables. This gives us a sense of the
combinations of predictors present in the data for evaluating possible
interaction effects.
|> group_by(motor, screw, pgain, vgain) |> summarize(count=n()) servo
## # A tibble: 167 x 5
## # Groups: motor, screw, pgain [93]
## motor screw pgain vgain count
## <chr> <chr> <int> <int> <int>
## 1 A A 3 1 1
## 2 A A 3 2 1
## 3 A A 4 1 1
## 4 A A 4 2 1
## 5 A A 4 3 1
## 6 A A 5 3 1
## 7 A A 5 4 1
## 8 A A 6 5 1
## 9 A B 3 1 1
## 10 A B 3 2 1
## # ... with 157 more rows
In the resulting output only 10 observations are displayed, out
of a total of 167. Now if every combination of categories/levels were
present in the data, we would expect 5x5x4x5=500 observations. Hence
running a blind regression including all possible combinations is not a
good idea here. We will do it anyway for instructional purposes and to
see the outcome.
We begin the first regression by regressing class
on the full set of interactions between all predictor variables. There
are too many predictors here, given the size of the data set, to be able
to fit a parsimonious model. Nonetheless, this serves as a starting
point.
= lm(class ~ factor(motor)*factor(screw)*pgain*vgain,data = servo)
reg0 summary(reg0)
##
## Call:
## lm(formula = class ~ factor(motor) * factor(screw) * pgain *
## vgain, data = servo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5917 -0.2360 0.0000 0.2217 2.0431
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 20.446836 2.771706 7.377
## factor(motor)B -0.704880 3.919784 -0.180
## factor(motor)C -11.376378 3.420310 -3.326
## factor(motor)D -10.071290 4.218559 -2.387
## factor(motor)E 11.394076 4.218559 2.701
## factor(screw)B -0.044329 4.218559 -0.011
## factor(screw)C -1.507741 4.218559 -0.357
## factor(screw)D -0.686520 4.218559 -0.163
## factor(screw)E -1.755959 4.218559 -0.416
## pgain -4.942712 0.744565 -6.638
## vgain -2.757403 1.017472 -2.710
## factor(motor)B:factor(screw)B -0.115237 5.965943 -0.019
## factor(motor)C:factor(screw)B 9.653595 5.650336 1.708
## factor(motor)D:factor(screw)B -5.193946 6.166362 -0.842
## factor(motor)E:factor(screw)B -27.093137 6.166362 -4.394
## factor(motor)B:factor(screw)C 0.583523 5.965943 0.098
## factor(motor)C:factor(screw)C 4.723777 5.650336 0.836
## factor(motor)D:factor(screw)C -1.545853 4.790977 -0.323
## factor(motor)E:factor(screw)C -25.547544 6.166362 -4.143
## factor(motor)B:factor(screw)D -0.889781 5.965943 -0.149
## factor(motor)C:factor(screw)D -1.834408 5.650336 -0.325
## factor(motor)D:factor(screw)D -0.248130 4.790977 -0.052
## factor(motor)E:factor(screw)D -27.456094 6.457955 -4.252
## factor(motor)B:factor(screw)E 0.694367 5.965943 0.116
## factor(motor)C:factor(screw)E 2.248261 5.650336 0.398
## factor(motor)D:factor(screw)E -1.436751 2.163660 -0.664
## factor(motor)E:factor(screw)E -27.640442 6.457955 -4.280
## factor(motor)B:pgain 0.163145 1.052974 0.155
## factor(motor)C:pgain 3.332524 0.842701 3.955
## factor(motor)D:pgain 2.495256 1.150894 2.168
## factor(motor)E:pgain -2.999212 1.150894 -2.606
## factor(screw)B:pgain -0.135591 1.150894 -0.118
## factor(screw)C:pgain 0.057937 1.150894 0.050
## factor(screw)D:pgain -0.185314 1.150894 -0.161
## factor(screw)E:pgain 0.086511 1.150894 0.075
## factor(motor)B:vgain 0.098498 1.438923 0.068
## factor(motor)C:vgain 1.311632 1.429776 0.917
## factor(motor)D:vgain 1.138405 1.468437 0.775
## factor(motor)E:vgain -2.084235 1.468437 -1.419
## factor(screw)B:vgain -0.250556 1.468437 -0.171
## factor(screw)C:vgain 0.468394 1.468437 0.319
## factor(screw)D:vgain 0.363169 1.468437 0.247
## factor(screw)E:vgain 0.561242 1.468437 0.382
## pgain:vgain 0.801230 0.206289 3.884
## factor(motor)B:factor(screw)B:pgain 0.034295 1.627610 0.021
## factor(motor)C:factor(screw)B:pgain -3.032635 1.500168 -2.022
## factor(motor)D:factor(screw)B:pgain 1.390248 1.692607 0.821
## factor(motor)E:factor(screw)B:pgain 7.118616 1.692607 4.206
## factor(motor)B:factor(screw)C:pgain -0.136823 1.627610 -0.084
## factor(motor)C:factor(screw)C:pgain -1.820263 1.500168 -1.213
## factor(motor)D:factor(screw)C:pgain 0.115537 1.845392 0.063
## factor(motor)E:factor(screw)C:pgain 6.628206 1.692607 3.916
## factor(motor)B:factor(screw)D:pgain 0.282844 1.627610 0.174
## factor(motor)C:factor(screw)D:pgain 0.060353 1.500168 0.040
## factor(motor)D:factor(screw)D:pgain -0.547526 1.845392 -0.297
## factor(motor)E:factor(screw)D:pgain 7.086690 1.897715 3.734
## factor(motor)B:factor(screw)E:pgain -0.214982 1.627610 -0.132
## factor(motor)C:factor(screw)E:pgain -1.129062 1.500168 -0.753
## factor(motor)D:factor(screw)E:pgain NA NA NA
## factor(motor)E:factor(screw)E:pgain 7.194708 1.897715 3.791
## factor(motor)B:factor(screw)B:vgain 0.024850 2.076683 0.012
## factor(motor)C:factor(screw)B:vgain -0.749704 2.070356 -0.362
## factor(motor)D:factor(screw)B:vgain 1.265075 2.097241 0.603
## factor(motor)E:factor(screw)B:vgain 4.612245 2.097241 2.199
## factor(motor)B:factor(screw)C:vgain -0.153361 2.076683 -0.074
## factor(motor)C:factor(screw)C:vgain 0.393133 2.070356 0.190
## factor(motor)D:factor(screw)C:vgain 0.432859 1.260390 0.343
## factor(motor)E:factor(screw)C:vgain 4.106601 2.097241 1.958
## factor(motor)B:factor(screw)D:vgain -0.056767 2.076683 -0.027
## factor(motor)C:factor(screw)D:vgain 0.671127 2.070356 0.324
## factor(motor)D:factor(screw)D:vgain 0.819264 1.260390 0.650
## factor(motor)E:factor(screw)D:vgain 4.641183 2.113351 2.196
## factor(motor)B:factor(screw)E:vgain 0.050707 2.076683 0.024
## factor(motor)C:factor(screw)E:vgain 0.861588 2.070356 0.416
## factor(motor)D:factor(screw)E:vgain 0.377512 1.260390 0.300
## factor(motor)E:factor(screw)E:vgain 4.545147 2.113351 2.151
## factor(motor)B:pgain:vgain -0.025779 0.291736 -0.088
## factor(motor)C:pgain:vgain -0.516465 0.277091 -1.864
## factor(motor)D:pgain:vgain -0.372368 0.304728 -1.222
## factor(motor)E:pgain:vgain 0.558232 0.304728 1.832
## factor(screw)B:pgain:vgain 0.059194 0.304728 0.194
## factor(screw)C:pgain:vgain -0.056282 0.304728 -0.185
## factor(screw)D:pgain:vgain -0.016676 0.304728 -0.055
## factor(screw)E:pgain:vgain -0.068781 0.304728 -0.226
## factor(motor)B:factor(screw)B:pgain:vgain -0.008320 0.430950 -0.019
## factor(motor)C:factor(screw)B:pgain:vgain 0.416892 0.421173 0.990
## factor(motor)D:factor(screw)B:pgain:vgain -0.308634 0.439848 -0.702
## factor(motor)E:factor(screw)B:pgain:vgain -1.291648 0.439848 -2.937
## factor(motor)B:factor(screw)C:pgain:vgain 0.033143 0.430950 0.077
## factor(motor)C:factor(screw)C:pgain:vgain 0.145554 0.421173 0.346
## factor(motor)D:factor(screw)C:pgain:vgain NA NA NA
## factor(motor)E:factor(screw)C:pgain:vgain -1.154697 0.439848 -2.625
## factor(motor)B:factor(screw)D:pgain:vgain -0.016055 0.430950 -0.037
## factor(motor)C:factor(screw)D:pgain:vgain -0.055313 0.421173 -0.131
## factor(motor)D:factor(screw)D:pgain:vgain NA NA NA
## factor(motor)E:factor(screw)D:pgain:vgain -1.276390 0.444710 -2.870
## factor(motor)B:factor(screw)E:pgain:vgain 0.009257 0.430950 0.021
## factor(motor)C:factor(screw)E:pgain:vgain 0.007621 0.421173 0.018
## factor(motor)D:factor(screw)E:pgain:vgain NA NA NA
## factor(motor)E:factor(screw)E:pgain:vgain -1.277234 0.444710 -2.872
## Pr(>|t|)
## (Intercept) 2.38e-10 ***
## factor(motor)B 0.857801
## factor(motor)C 0.001397 **
## factor(motor)D 0.019634 *
## factor(motor)E 0.008640 **
## factor(screw)B 0.991645
## factor(screw)C 0.721848
## factor(screw)D 0.871187
## factor(screw)E 0.678485
## pgain 5.37e-09 ***
## vgain 0.008428 **
## factor(motor)B:factor(screw)B 0.984643
## factor(motor)C:factor(screw)B 0.091912 .
## factor(motor)D:factor(screw)B 0.402447
## factor(motor)E:factor(screw)B 3.82e-05 ***
## factor(motor)B:factor(screw)C 0.922360
## factor(motor)C:factor(screw)C 0.405950
## factor(motor)D:factor(screw)C 0.747902
## factor(motor)E:factor(screw)C 9.33e-05 ***
## factor(motor)B:factor(screw)D 0.881863
## factor(motor)C:factor(screw)D 0.746397
## factor(motor)D:factor(screw)D 0.958841
## factor(motor)E:factor(screw)D 6.36e-05 ***
## factor(motor)B:factor(screw)E 0.907673
## factor(motor)C:factor(screw)E 0.691899
## factor(motor)D:factor(screw)E 0.508817
## factor(motor)E:factor(screw)E 5.75e-05 ***
## factor(motor)B:pgain 0.877310
## factor(motor)C:pgain 0.000179 ***
## factor(motor)D:pgain 0.033504 *
## factor(motor)E:pgain 0.011155 *
## factor(screw)B:pgain 0.906548
## factor(screw)C:pgain 0.959992
## factor(screw)D:pgain 0.872537
## factor(screw)E:pgain 0.940292
## factor(motor)B:vgain 0.945618
## factor(motor)C:vgain 0.362055
## factor(motor)D:vgain 0.440767
## factor(motor)E:vgain 0.160171
## factor(screw)B:vgain 0.865002
## factor(screw)C:vgain 0.750682
## factor(screw)D:vgain 0.805377
## factor(screw)E:vgain 0.703452
## pgain:vgain 0.000228 ***
## factor(motor)B:factor(screw)B:pgain 0.983248
## factor(motor)C:factor(screw)B:pgain 0.046995 *
## factor(motor)D:factor(screw)B:pgain 0.414188
## factor(motor)E:factor(screw)B:pgain 7.48e-05 ***
## factor(motor)B:factor(screw)C:pgain 0.933242
## factor(motor)C:factor(screw)C:pgain 0.229009
## factor(motor)D:factor(screw)C:pgain 0.950254
## factor(motor)E:factor(screw)C:pgain 0.000205 ***
## factor(motor)B:factor(screw)D:pgain 0.862534
## factor(motor)C:factor(screw)D:pgain 0.968022
## factor(motor)D:factor(screw)D:pgain 0.767563
## factor(motor)E:factor(screw)D:pgain 0.000377 ***
## factor(motor)B:factor(screw)E:pgain 0.895291
## factor(motor)C:factor(screw)E:pgain 0.454164
## factor(motor)D:factor(screw)E:pgain NA
## factor(motor)E:factor(screw)E:pgain 0.000312 ***
## factor(motor)B:factor(screw)B:vgain 0.990486
## factor(motor)C:factor(screw)B:vgain 0.718343
## factor(motor)D:factor(screw)B:vgain 0.548292
## factor(motor)E:factor(screw)B:vgain 0.031122 *
## factor(motor)B:factor(screw)C:vgain 0.941338
## factor(motor)C:factor(screw)C:vgain 0.849940
## factor(motor)D:factor(screw)C:vgain 0.732288
## factor(motor)E:factor(screw)C:vgain 0.054149 .
## factor(motor)B:factor(screw)D:vgain 0.978269
## factor(motor)C:factor(screw)D:vgain 0.746770
## factor(motor)D:factor(screw)D:vgain 0.517784
## factor(motor)E:factor(screw)D:vgain 0.031351 *
## factor(motor)B:factor(screw)E:vgain 0.980588
## factor(motor)C:factor(screw)E:vgain 0.678552
## factor(motor)D:factor(screw)E:vgain 0.765419
## factor(motor)E:factor(screw)E:vgain 0.034907 *
## factor(motor)B:pgain:vgain 0.929836
## factor(motor)C:pgain:vgain 0.066473 .
## factor(motor)D:pgain:vgain 0.225761
## factor(motor)E:pgain:vgain 0.071160 .
## factor(screw)B:pgain:vgain 0.846533
## factor(screw)C:pgain:vgain 0.853994
## factor(screw)D:pgain:vgain 0.956511
## factor(screw)E:pgain:vgain 0.822073
## factor(motor)B:factor(screw)B:pgain:vgain 0.984651
## factor(motor)C:factor(screw)B:pgain:vgain 0.325616
## factor(motor)D:factor(screw)B:pgain:vgain 0.485171
## factor(motor)E:factor(screw)B:pgain:vgain 0.004470 **
## factor(motor)B:factor(screw)C:pgain:vgain 0.938914
## factor(motor)C:factor(screw)C:pgain:vgain 0.730672
## factor(motor)D:factor(screw)C:pgain:vgain NA
## factor(motor)E:factor(screw)C:pgain:vgain 0.010597 *
## factor(motor)B:factor(screw)D:pgain:vgain 0.970387
## factor(motor)C:factor(screw)D:pgain:vgain 0.895885
## factor(motor)D:factor(screw)D:pgain:vgain NA
## factor(motor)E:factor(screw)D:pgain:vgain 0.005402 **
## factor(motor)B:factor(screw)E:pgain:vgain 0.982922
## factor(motor)C:factor(screw)E:pgain:vgain 0.985614
## factor(motor)D:factor(screw)E:pgain:vgain NA
## factor(motor)E:factor(screw)E:pgain:vgain 0.005373 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6647 on 71 degrees of freedom
## Multiple R-squared: 0.9223, Adjusted R-squared: 0.8184
## F-statistic: 8.873 on 95 and 71 DF, p-value: < 2.2e-16
Expectedly, the full regression with all interactions is a mess.
There are too many variables for a coherent model with a dataset of this
size. In fact, 4 of the interaction variables have too few datapoints
for a coefficient to be fit. However we see that there are subsets of
variables that are highly statistically significant. The \(R^2\) value is also decently high, but the
adjusted \(R^2\) is significantly less.
Adjusted \(R^2\) penalizes for adding
predictors with no additional predictive value, hence this points to the
inclusion of too many nuisance variables. While this is not a good model
for explaining or predicting our dataset, it shows us that there may be
a subset of predictors producing a decent regression.
One of the issues in visualizing the data is pgain and vgain
appear to be discrete variables. Hence, many values overlap and will be
plot on top of each other. The geom_jitter function allows us to avoid
this problem. It adds a small amount of noise to the data for more
effective visualization. For example all points with a pgain of 3 are
adjusted to fall somewhere between 2.5 and 3.5. This allows us to
visualize each point without them falling on top of each other.
|> ggplot(
servo aes(
x = pgain,
y = vgain,
color = class,
shape = screw
)+
) geom_point() +
geom_jitter()+
scale_color_gradient2(
low = "darkblue",
mid = "turquoise",
high = "darkred",
midpoint = 3
+
)annotate(geom="rect", alpha = .08, xmin = 2.5, xmax = 3.5, ymin = .5, ymax = 2.5,
fill = "red", color = "black")+
theme(legend.position = "bottom",plot.background = element_rect(fill = "darkgray"))+
ggtitle("Servo Class by vgain/pgain/screw")
|> ggplot(
servo aes(
x = pgain,
y = vgain,
color = class,
shape = motor
)+
) geom_point() +
geom_jitter()+
scale_color_gradient2(
low = "darkblue",
mid = "turquoise",
high = "darkred",
midpoint = 3
+
)annotate(geom="rect", alpha = .08, xmin = 2.5, xmax = 3.5, ymin = .5, ymax = 2.5,
fill = "red", color = "black")+
theme(legend.position = "bottom",plot.background = element_rect(fill = "darkgray"))+
ggtitle("Servo Class by vgain/pgain/motor")
We see that higher class values (i.e. class values > 2)
generally occur within a box of values in the plot. This box corresponds
to points with a pgain of 3 and a vgain or 2 or 1. On the basis of this
we create a new variable: pvclass.
$pvclass = 0
servo$pvclass[servo$pgain==3 & servo$vgain<2.5] = 1
servo
= lm(class ~ factor(pvclass) * factor(motor) * factor(screw), data = servo)
reg1
summary(reg1)
##
## Call:
## lm(formula = class ~ factor(pvclass) * factor(motor) * factor(screw),
## data = servo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4000 -0.1638 -0.0525 0.2000 1.4000
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 0.856254 0.149687 5.720
## factor(pvclass)1 4.643779 0.299374 15.512
## factor(motor)B -0.025000 0.211689 -0.118
## factor(motor)C -0.177500 0.189341 -0.937
## factor(motor)D -0.275001 0.222022 -1.239
## factor(motor)E -0.005002 0.222022 -0.023
## factor(screw)B -0.132500 0.222022 -0.597
## factor(screw)C -0.267500 0.222022 -1.205
## factor(screw)D -0.290002 0.222022 -1.306
## factor(screw)E -0.282503 0.222022 -1.272
## factor(pvclass)1:factor(motor)B -0.175010 0.423379 -0.413
## factor(pvclass)1:factor(motor)C -0.622537 0.412659 -1.509
## factor(pvclass)1:factor(motor)D -2.625051 0.428638 -6.124
## factor(pvclass)1:factor(motor)E 1.505072 0.428638 3.511
## factor(pvclass)1:factor(screw)B -0.767542 0.428638 -1.791
## factor(pvclass)1:factor(screw)C -0.932553 0.428638 -2.176
## factor(pvclass)1:factor(screw)D -0.810049 0.428638 -1.890
## factor(pvclass)1:factor(screw)E -1.017555 0.428638 -2.374
## factor(motor)B:factor(screw)B -0.020000 0.313986 -0.064
## factor(motor)C:factor(screw)B 0.035000 0.299374 0.117
## factor(motor)D:factor(screw)B 0.027500 0.321043 0.086
## factor(motor)E:factor(screw)B 0.005000 0.321043 0.016
## factor(motor)B:factor(screw)C -0.027501 0.313986 -0.088
## factor(motor)C:factor(screw)C 0.094999 0.299374 0.317
## factor(motor)D:factor(screw)C 0.230001 0.458932 0.501
## factor(motor)E:factor(screw)C -0.182499 0.321043 -0.568
## factor(motor)B:factor(screw)D -0.004999 0.313986 -0.016
## factor(motor)C:factor(screw)D 0.087501 0.299374 0.292
## factor(motor)D:factor(screw)D 0.065001 0.458932 0.142
## factor(motor)E:factor(screw)D -0.064372 0.331346 -0.194
## factor(motor)B:factor(screw)E -0.027500 0.313986 -0.088
## factor(motor)C:factor(screw)E 0.155001 0.299374 0.518
## factor(motor)D:factor(screw)E -0.599922 0.518531 -1.157
## factor(motor)E:factor(screw)E -0.090621 0.331346 -0.273
## factor(pvclass)1:factor(motor)B:factor(screw)B 0.020000 0.606186 0.033
## factor(pvclass)1:factor(motor)C:factor(screw)B 0.365022 0.598748 0.610
## factor(pvclass)1:factor(motor)D:factor(screw)B -0.227444 0.609871 -0.373
## factor(pvclass)1:factor(motor)E:factor(screw)B -4.405068 0.609871 -7.223
## factor(pvclass)1:factor(motor)B:factor(screw)C 0.127505 0.606186 0.210
## factor(pvclass)1:factor(motor)C:factor(screw)C -0.494958 0.598748 -0.827
## factor(pvclass)1:factor(motor)D:factor(screw)C -0.529929 0.692454 -0.765
## factor(pvclass)1:factor(motor)E:factor(screw)C -4.317553 0.609871 -7.079
## factor(pvclass)1:factor(motor)B:factor(screw)D -0.195007 0.606186 -0.322
## factor(pvclass)1:factor(motor)C:factor(screw)D -1.987453 0.598748 -3.319
## factor(pvclass)1:factor(motor)D:factor(screw)D -0.464933 0.692454 -0.671
## factor(pvclass)1:factor(motor)E:factor(screw)D -4.735681 0.615357 -7.696
## factor(pvclass)1:factor(motor)B:factor(screw)E 0.227516 0.606186 0.375
## factor(pvclass)1:factor(motor)C:factor(screw)E -0.854956 0.598748 -1.428
## factor(pvclass)1:factor(motor)D:factor(screw)E NA NA NA
## factor(pvclass)1:factor(motor)E:factor(screw)E -4.709423 0.615357 -7.653
## Pr(>|t|)
## (Intercept) 8.19e-08 ***
## factor(pvclass)1 < 2e-16 ***
## factor(motor)B 0.906192
## factor(motor)C 0.350435
## factor(motor)D 0.217944
## factor(motor)E 0.982063
## factor(screw)B 0.551792
## factor(screw)C 0.230676
## factor(screw)D 0.194029
## factor(screw)E 0.205729
## factor(pvclass)1:factor(motor)B 0.680090
## factor(pvclass)1:factor(motor)C 0.134073
## factor(pvclass)1:factor(motor)D 1.23e-08 ***
## factor(pvclass)1:factor(motor)E 0.000633 ***
## factor(pvclass)1:factor(screw)B 0.075913 .
## factor(pvclass)1:factor(screw)C 0.031578 *
## factor(pvclass)1:factor(screw)D 0.061236 .
## factor(pvclass)1:factor(screw)E 0.019214 *
## factor(motor)B:factor(screw)B 0.949319
## factor(motor)C:factor(screw)B 0.907131
## factor(motor)D:factor(screw)B 0.931884
## factor(motor)E:factor(screw)B 0.987601
## factor(motor)B:factor(screw)C 0.930354
## factor(motor)C:factor(screw)C 0.751556
## factor(motor)D:factor(screw)C 0.617188
## factor(motor)E:factor(screw)C 0.570805
## factor(motor)B:factor(screw)D 0.987323
## factor(motor)C:factor(screw)D 0.770586
## factor(motor)D:factor(screw)D 0.887609
## factor(motor)E:factor(screw)D 0.846296
## factor(motor)B:factor(screw)E 0.930356
## factor(motor)C:factor(screw)E 0.605601
## factor(motor)D:factor(screw)E 0.249623
## factor(motor)E:factor(screw)E 0.784952
## factor(pvclass)1:factor(motor)B:factor(screw)B 0.973735
## factor(pvclass)1:factor(motor)C:factor(screw)B 0.543271
## factor(pvclass)1:factor(motor)D:factor(screw)B 0.709864
## factor(pvclass)1:factor(motor)E:factor(screw)B 5.40e-11 ***
## factor(pvclass)1:factor(motor)B:factor(screw)C 0.833765
## factor(pvclass)1:factor(motor)C:factor(screw)C 0.410101
## factor(pvclass)1:factor(motor)D:factor(screw)C 0.445626
## factor(pvclass)1:factor(motor)E:factor(screw)C 1.12e-10 ***
## factor(pvclass)1:factor(motor)B:factor(screw)D 0.748253
## factor(pvclass)1:factor(motor)C:factor(screw)D 0.001201 **
## factor(pvclass)1:factor(motor)D:factor(screw)D 0.503260
## factor(pvclass)1:factor(motor)E:factor(screw)D 4.72e-12 ***
## factor(pvclass)1:factor(motor)B:factor(screw)E 0.708094
## factor(pvclass)1:factor(motor)C:factor(screw)E 0.155960
## factor(pvclass)1:factor(motor)D:factor(screw)E NA
## factor(pvclass)1:factor(motor)E:factor(screw)E 5.90e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3667 on 118 degrees of freedom
## Multiple R-squared: 0.9607, Adjusted R-squared: 0.9447
## F-statistic: 60.12 on 48 and 118 DF, p-value: < 2.2e-16
While this regression is also a mess, it is cleaner than the
last iteration. By collapsing pgain and vgain into 1 categorical
variable, it is easier to see which variables appear to be significantly
impact class outcomes. The \(R^2\)
value is improved and adjusted \(R^2\)
is much improved as we’ve reduced on variable in the set of
interactions.
To build our final regression model, we use the previous messy
model to identify the most relevant predictors. On the basis of this we
introduce new dummy variables. These encode the significant categories
(to prediction) as binary variables. These new variables are then used
to build a reduced model that focuses on the most important predictive
components.
$motorD = 0
servo$motorD[servo$motor=='D'] = 1
servo
$motorE = 0
servo$motorE[servo$motor=='E'] = 1
servo
$screwA = 0
servo$screwA[servo$screw=='A'] = 1
servo
$motorC = 0
servo$motorC[servo$motor=='C'] = 1
servo
$motorEscrewA = 0
servo$motorEscrewA[servo$motor=='E' & servo$screw=='A'] = 1
servo
$motorCscrewD = 0
servo$motorCscrewD[servo$motor=='C' & servo$screw=='D'] = 1
servo
$motorCscrewE = 0
servo$motorCscrewE[servo$motor=='C' & servo$screw=='E'] = 1
servo
$motorCscrewC = 0
servo$motorCscrewC[servo$motor=='C' & servo$screw=='C'] = 1
servo
= lm(class ~ pvclass+screwA+pvclass:(motorD+motorE+motorC+screwA+motorEscrewA+motorCscrewD+motorCscrewE+motorCscrewC),data=servo)
reg2
summary(reg2)
##
## Call:
## lm(formula = class ~ pvclass + screwA + pvclass:(motorD + motorE +
## motorC + screwA + motorEscrewA + motorCscrewD + motorCscrewE +
## motorCscrewC), data = servo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39999 -0.19677 -0.04676 0.21085 1.39999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.55302 0.03817 14.488 < 2e-16 ***
## pvclass 3.73421 0.09221 40.496 < 2e-16 ***
## screwA 0.19933 0.07299 2.731 0.007040 **
## pvclass:motorD -3.10999 0.13629 -22.819 < 2e-16 ***
## pvclass:motorE -3.03723 0.15009 -20.237 < 2e-16 ***
## pvclass:motorC -0.39415 0.19767 -1.994 0.047896 *
## pvclass:screwA 0.91449 0.16333 5.599 9.49e-08 ***
## pvclass:motorEscrewA 4.63629 0.31425 14.754 < 2e-16 ***
## pvclass:motorCscrewD -2.19308 0.31339 -6.998 7.23e-11 ***
## pvclass:motorCscrewE -1.19309 0.31339 -3.807 0.000202 ***
## pvclass:motorCscrewC -0.79309 0.31339 -2.531 0.012376 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3519 on 156 degrees of freedom
## Multiple R-squared: 0.9522, Adjusted R-squared: 0.9491
## F-statistic: 310.5 on 10 and 156 DF, p-value: < 2.2e-16
This model tells us the following:
As a final assessment we look at the residual plots
library(car)
par(
mfrow=c(1,2),
bg ="darkgray"
)hist(
resid(reg2),
main = "Histogram of Residuals",
col = "green",
border = "darkgreen",
freq = FALSE
)qqPlot(
resid(reg2),
main = "qq Plot",
ylab = "sample quantiles"
)
c(127,130),]
servo[c(49,124),] servo[
## [1] 49 124
## motor screw pgain vgain class pvclass motorD motorE screwA motorC
## 127 C C 3 1 1.899990 1 0 0 0 1
## 130 C C 3 2 4.299977 1 0 0 0 1
## motorEscrewA motorCscrewD motorCscrewE motorCscrewC
## 127 0 0 0 1
## 130 0 0 0 1
## motor screw pgain vgain class pvclass motorD motorE screwA motorC
## 49 C E 3 2 4.099967 1 0 0 0 1
## 124 C E 3 1 1.299997 1 0 0 0 1
## motorEscrewA motorCscrewD motorCscrewE motorCscrewC
## 49 0 0 1 0
## 124 0 0 1 0
The histogram and residual plots look decent overall, with a
couple outlier points on either side. A little further exploration
yields the indices of these points (49,124,127,130). Examining we see
the issue. The first two points below to motor
‘C’ and screw
‘C’ with pvclass
of 1 while the last two points belong to motor
‘C’ and screw
‘E’ with pvclass
of 1. As there are only two points in the dataset with these
combinations of features, the model does not have much data to fit here.
Furthermore the two points present have significantly different class
values, as seen from the output above.
This artifact could be avoided by noting that both points differ
on the exact vgain
value. We could incorporate this into the model. But then we would only
have 1 data point per case, which would create a strong possibility for
overfitting. In general nonlinear prediction paradigms, such as decision
tree models, may have somewhat stronger predictive power. However with
such few cases in certain categories in this dataset, these models run a
higher risk of overfitting. The most satisfactory solution here, if
feasible, is to collect more data.
Despite having a high non-linear dataset with many possible
categories of predictors and many possible interaction effects, we have
been able to construct a linear model on a subset of predictors that is
parsimonious and has strong explanatory power.