Abstract
‘Practical Machine Learning’ course project. The course is kindly provided by Johns Hopkins University on Coursera. The project requires using machine learning techniques to analyse Human Activity Recognition (HAR) data and predict the ‘quality’ of the activity (the classe column) performed by users wearing body sensors.
Methodology
We use caret to train a random forest model. We register one worker for each available processor, relying on detectCores() from the parallel library.
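A minimal sketch of this worker registration (assuming the doParallel backend, which the report does not name explicitly):

library(parallel)
library(doParallel)
# One worker per available core; caret picks these up automatically
cluster <- makeCluster(detectCores())
registerDoParallel(cluster)
# ... training happens here ...
stopCluster(cluster)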
Unlike the rpart simple-tree study, here we primarily focus on the features available in the testing dataset in order to prune irrelevant columns. A quick exploratory analysis shows that many columns in the testing dataset are completely empty, so we remove the same columns from the training data. We also drop features we consider not relevant, such as the user name, window numbers, etc.
PreProcess
# Read both datasets, treating '#DIV/0!' and empty strings as NA
testing <- read.csv('../pml-testing.csv', na.strings = c('#DIV/0!', '', 'NA'), stringsAsFactors = FALSE)
training <- read.csv('../pml-training.csv', na.strings = c('#DIV/0!', '', 'NA'), stringsAsFactors = FALSE)

# A column is useless if it is NA in all 20 rows of the testing dataset
uselessColumn <- function(column) {
  return(sum(is.na(column)) == 20)
}
res <- apply(testing, 2, uselessColumn)
testing <- testing[, !res]
training <- training[, !res]

# Timestamps, user names and window counters are not relevant predictors
testing <- testing[, -c(1:7)]
training <- training[, -c(1:7)]
We also remove highly correlated (> 0.9) features from both datasets, in order to reduce the computational burden, which is generally relevant for random forest algorithms.
# Remove correlated predictors (column 53 is classe, the outcome)
findCorrelated <- function(dataset) {
  M <- cor(dataset[, -53])
  corr_columns <- findCorrelation(M, cutoff = 0.9)  # findCorrelation is from caret
  return(corr_columns)
}
badColumns <- findCorrelated(training)
training <- training[, -badColumns]
testing <- testing[, -badColumns]
Set seed
set.seed(8888)
We then split the training data, allocating 25% of it as a validation set to assess the out-of-sample error.
inTrain = createDataPartition(training$classe, p=0.75, list=FALSE)
training_final <- training[inTrain, ]
validation <- training[-inTrain, ]
Train
ptm <- proc.time()
forestFit <- train(classe ~ ., data = training_final, method='rf')
proc.time() - ptm
## user system elapsed
## 6953.774 26.434 1153.052
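By default, train() estimates performance with 25 bootstrap resamples, which accounts for most of the elapsed time above. A hedged alternative (not what was run here) is to pass an explicit trainControl with k-fold cross-validation, which typically cuts the runtime while still providing an out-of-sample accuracy estimate:

# Sketch only: 5-fold cross-validation instead of the default bootstrap;
# allowParallel lets caret use the workers registered earlier
ctrl <- trainControl(method = 'cv', number = 5, allowParallel = TRUE)
forestFitCV <- train(classe ~ ., data = training_final, method = 'rf', trControl = ctrl)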
Out of Sample prediction
validation_preds <- predict(forestFit, validation)
confusionMatrix(validation_preds, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 2 0 0 0
## B 0 946 4 1 0
## C 0 1 847 8 0
## D 0 0 4 794 4
## E 0 0 0 1 897
##
## Overall Statistics
##
## Accuracy : 0.9949
## 95% CI : (0.9925, 0.9967)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9936
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9968 0.9906 0.9876 0.9956
## Specificity 0.9994 0.9987 0.9978 0.9980 0.9998
## Pos Pred Value 0.9986 0.9947 0.9895 0.9900 0.9989
## Neg Pred Value 1.0000 0.9992 0.9980 0.9976 0.9990
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2845 0.1929 0.1727 0.1619 0.1829
## Detection Prevalence 0.2849 0.1939 0.1746 0.1635 0.1831
## Balanced Accuracy 0.9997 0.9978 0.9942 0.9928 0.9977
It is quite remarkable to see a nearly diagonal matrix, i.e. a model that predicts almost all of the activities correctly. It is therefore unnecessary to refine the model further; the proof of the pudding is in the eating, so we submit the predictions to the Coursera web interface.
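The expected out-of-sample error is simply one minus the validation accuracy, roughly 0.5% here; it can be extracted from the confusionMatrix object directly:

cm <- confusionMatrix(validation_preds, validation$classe)
oos_error <- 1 - cm$overall['Accuracy']  # about 0.005 on the validation set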
Predict for course exam
We then use the model to predict on the testing dataset, for which the classe column is not provided (the course exam).
ptm <- proc.time()
preds <- predict(forestFit, testing)
proc.time() - ptm
## user system elapsed
## 0.021 0.001 0.021
print(preds)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
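For the course exam, each prediction has to be uploaded as a separate text file. A minimal sketch of the usual submission helper (the exact script is not part of this report, and the problem_id_ file names are an assumption):

# Sketch: write one text file per prediction for the Coursera submission page
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0('problem_id_', i, '.txt'),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(as.character(preds))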
Results
This fairly basic procedure correctly predicts 100% of the cases in the testing dataset, as hinted by the confusion matrix computed on the validation set (0.9949 accuracy).
Misc
The most important predictors can be inspected with varImp:
plot(varImp(forestFit), top = 10)
Retro
For completeness, we also perform a prediction using the rpart algorithm, as in the previous simple-tree study, using the same training dataset defined above.
ptm <- proc.time()
treeFit <- train(classe ~ ., data = training_final, method='rpart')
proc.time() - ptm
## user system elapsed
## 1053.590 5.721 5.915
We can see that the accuracy of the rpart model is much lower, even though it uses the same training data (the computation is much cheaper, though).
validation_preds_rpart <- predict(treeFit, validation)
confusionMatrix(validation_preds_rpart, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1250 385 393 304 212
## B 39 322 31 160 194
## C 82 208 340 100 199
## D 24 31 91 184 28
## E 0 3 0 56 268
##
## Overall Statistics
##
## Accuracy : 0.4821
## 95% CI : (0.468, 0.4961)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3236
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8961 0.33930 0.39766 0.22886 0.29745
## Specificity 0.6312 0.89279 0.85453 0.95756 0.98526
## Pos Pred Value 0.4914 0.43164 0.36598 0.51397 0.81957
## Neg Pred Value 0.9386 0.84921 0.87044 0.86362 0.86170
## Prevalence 0.2845 0.19352 0.17435 0.16395 0.18373
## Detection Rate 0.2549 0.06566 0.06933 0.03752 0.05465
## Detection Prevalence 0.5188 0.15212 0.18944 0.07300 0.06668
## Balanced Accuracy 0.7636 0.61605 0.62610 0.59321 0.64135
The tree obtained is, however, more detailed than the one from the previous study:
fancyRpartPlot(treeFit$finalModel)
Finally, the agreement with the random forest predictions on the testing dataset is rather poor, consistent with the out-of-sample error estimated above.
preds_rpart <- predict(treeFit, testing)
print(preds_rpart)
## [1] D A C A A C D A A A C C C A C A A A A C
## Levels: A B C D E
print(sum(preds == preds_rpart) / length(preds))
## [1] 0.45