Practical Machine Learning For Everyday Life
Mar 19, 2017
In this very practical R tutorial, we will see if we can use our machine learning skills to study something we enjoy in everyday life: wine.
We will use wine quality data from the UCI Machine Learning Repository. These two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. To start, read the data into R
.
wine1.url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
wine1 <- read.csv(wine1.url, header=TRUE, sep=';')
wine2.url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine2 <- read.csv(wine2.url, header=TRUE, sep=';')
We will use only a subset of the data for demonstrative purposes. Each row is a different wine. Columns are features, including physicochemical measurements such as fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol, as well as a quality score between 0 and 10 and whether the wine is a red or white.
wine <- rbind(cbind(wine1[1:100,], type='white'), cbind(wine2[1:100,], type='red'))
wine$type <- as.factor(wine$type)
head(wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.0 0.27 0.36 20.7 0.045
## 2 6.3 0.30 0.34 1.6 0.049
## 3 8.1 0.28 0.40 6.9 0.050
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality type
## 1 6 white
## 2 6 white
## 3 6 white
## 4 6 white
## 5 6 white
## 6 6 white
Binary classification
First, let’s see if we can train a binary classifier to differentiate between white and red wines. The caret
package in R supports a huge number of models.
library(caret)
library(pROC)
head(names(getModelInfo()), n=30)
## [1] "ada" "AdaBag" "AdaBoost.M1" "adaboost"
## [5] "amdai" "ANFIS" "avNNet" "awnb"
## [9] "awtan" "bag" "bagEarth" "bagEarthGCV"
## [13] "bagFDA" "bagFDAGCV" "bam" "bartMachine"
## [17] "bayesglm" "bdk" "binda" "blackboost"
## [21] "blasso" "blassoAveraged" "Boruta" "bridge"
## [25] "brnn" "BstLm" "bstSm" "bstTree"
## [29] "C5.0" "C5.0Cost"
What ever model you choose, make sure it supports the type of modeling you want. We will use a gbm
, which supports both classification and regression.
getModelInfo()$gbm$type
## [1] "Regression" "Classification"
First, we will try a binary classification problem. Can we train our gbm
classifier to accurately distinguish red and white wines?
trait <- 'type'
features <- wine[, setdiff(colnames(wine), trait)]
head(features)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.0 0.27 0.36 20.7 0.045
## 2 6.3 0.30 0.34 1.6 0.049
## 3 8.1 0.28 0.40 6.9 0.050
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
class <- wine[, trait]
head(class)
## [1] white white white white white white
## Levels: white red
ctrl <- trainControl(
method="repeatedcv", ## 10 fold cross validation
repeats=5, ## 5 repetitions of cross validation
summaryFunction=twoClassSummary, ## two classes only
classProbs=TRUE,
savePred=TRUE
)
model <- train(
x=features,
y=class,
method="gbm",
trControl=ctrl,
verbose=FALSE
)
Based on the ROC curve, it looks like we did pretty well!
plot.roc(model$pred$obs, model$pred$red)
## double check on new data
predict(model, newdata=wine1[100:110, 1:(ncol(wine1)-1)])
## [1] white white white white white white white white white white white
## Levels: white red
predict(model, newdata=wine2[100:110, 1:(ncol(wine1)-1)])
## [1] red red red red red red red red red red red
## Levels: white red
What features did our classifier find important and useful?
importance <- varImp(model, scale=FALSE)
print(importance)
## gbm variable importance
##
## Overall
## chlorides 81.02693
## volatile.acidity 20.41412
## total.sulfur.dioxide 15.50752
## density 9.51392
## sulphates 5.20135
## free.sulfur.dioxide 4.06884
## residual.sugar 2.94265
## pH 1.73052
## fixed.acidity 0.25914
## citric.acid 0.15809
## alcohol 0.01027
## quality 0.00000
Our gbm
classifier found that chlorides
was very useful in distinguishing between red and white wines. Indeed, reds seem to have higher chloride levels than whites.
library(ggplot2)
ggplot(data=wine, aes(x=type, y=chlorides)) + geom_boxplot()
Regression
Now, let’s see if we can train a regression model to predict the wine’s quality given its various physicochemical measurements. We will use the red wine data only here since wine type is known to be a confounder for quality in this particular dataset.
trait <- 'quality'
features <- wine2[, setdiff(colnames(wine2), trait)]
head(features)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
class <- wine2[, trait]
head(class)
## [1] 5 5 5 6 5 5
ctrl <- trainControl(
method="repeatedcv", ## 10 fold cross validation
repeats=5, ## 5 repetitions of cross validation
savePred=TRUE
)
model <- train(
x=features,
y=class,
method="gbm",
trControl=ctrl,
verbose=FALSE
)
If we plot our predictions against the real quality score, we see a general positive correlation, but definitely less than perfect.
plot(model$pred$obs, model$pred$pred)
## double check by predicting on white wine data
p <- predict(model, newdata=wine1[, setdiff(colnames(wine2), trait)])
plot(wine1$quality, p)
What features did our model find important and useful?
importance <- varImp(model, scale=FALSE)
importance
## gbm variable importance
##
## Overall
## alcohol 727.07
## volatile.acidity 377.59
## sulphates 346.75
## total.sulfur.dioxide 198.10
## pH 125.44
## chlorides 118.79
## citric.acid 113.84
## fixed.acidity 96.45
## density 91.38
## residual.sugar 86.11
## free.sulfur.dioxide 72.66
Alcohol?! Indeed, we see a general correlation between the quality of the wine and its alcohol content. Perhaps unsurprising haha ;)
plot(wine2$quality, wine2$alcohol)
- Older
- Newer
RECENT POSTS
- Aligning 10X Visium spatial transcriptomics datasets using STalign with Reticulate in R on 05 November 2023
- Aligning single-cell spatial transcriptomics datasets simulated with non-linear disortions on 20 August 2023
- 10x Visium spatial transcriptomics data analysis with STdeconvolve in R on 29 May 2023
- Impact of normalizing spatial transcriptomics data in dimensionality reduction and clustering versus deconvolution analysis with STdeconvolve on 04 May 2023
- Aligning Spatial Transcriptomics Data With Stalign on 16 April 2023
- 3D animation of the brain in R on 08 November 2022
- Ethical Challenges in Biomedical Engineering - Data Collection, Analysis, and Interpretation on 15 October 2022
- I use R to (try to) figure out the cost of medical procedures by analyzing insurance data from the Transparency in Coverage Final Rule on 12 September 2022
- Annotating STdeconvolve Cell-Types with ASCT+B Tables on 30 August 2022
- Deconvolution vs Clustering Analysis: An exploration via simulation on 11 July 2022
TAGS
RELATED POSTS
- Aligning 10X Visium spatial transcriptomics datasets using STalign with Reticulate in R
- 10x Visium spatial transcriptomics data analysis with STdeconvolve in R
- Impact of normalizing spatial transcriptomics data in dimensionality reduction and clustering versus deconvolution analysis with STdeconvolve
- Aligning Spatial Transcriptomics Data With Stalign