Abstract
‘Practical Machine Learning’ course project. The course is kindly provided by Johns Hopkins University on Coursera. The project requires using machine learning techniques to analyse Human Activity Recognition (HAR) data and predict the ‘quality’ of the activity (the classe column) performed by users wearing body sensors.
Methodology
We use caret to train a random forest model. We register one worker for each available processor, relying on detectCores() from the parallel library.
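A minimal sketch of this worker registration (assuming the doParallel backend, which the report does not name explicitly):

library(parallel)
library(doParallel)
# One worker per available core; caret picks these up automatically
cluster <- makeCluster(detectCores())
registerDoParallel(cluster)
# ... training happens here ...
stopCluster(cluster)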
Unlike the rpart simple-tree study, here we primarily focus on the features available in the testing dataset in order to prune irrelevant columns. A quick exploratory analysis shows that many columns in the testing dataset are completely empty, so we remove the same columns from the training data. We also drop features we consider not relevant, such as the user name, window numbers, etc.
PreProcess
# Read both datasets, treating '#DIV/0!' and empty strings as NA
testing <- read.csv('../pml-testing.csv', na.strings = c('#DIV/0!', '', 'NA'), stringsAsFactors = FALSE)
training <- read.csv('../pml-training.csv', na.strings = c('#DIV/0!', '', 'NA'), stringsAsFactors = FALSE)

# A column is useless if it is NA in all 20 rows of the testing dataset
uselessColumn <- function(column) {
  return(sum(is.na(column)) == 20)
}
res <- apply(testing, 2, uselessColumn)
testing <- testing[, !res]
training <- training[, !res]

# Timestamps, user names and window counters are not relevant predictors
testing <- testing[, -c(1:7)]
training <- training[, -c(1:7)]
We also remove highly correlated (> 0.9) features from both datasets, in order to reduce the computational burden, which is generally relevant for random forest algorithms.
# Remove correlated predictors (column 53 is classe, the outcome)
findCorrelated <- function(dataset) {
  M <- cor(dataset[, -53])
  corr_columns <- findCorrelation(M, cutoff = 0.9)  # findCorrelation is from caret
  return(corr_columns)
}
badColumns <- findCorrelated(training)
training <- training[, -badColumns]
testing <- testing[, -badColumns]
Set seed
set.seed(8888)
We then split the training data, allocating 25% of it as a validation set to assess the out-of-sample error.
inTrain = createDataPartition(training$classe, p=0.75, list=FALSE)
training_final <- training[inTrain, ]
validation <- training[-inTrain, ]
Train
ptm <- proc.time()
forestFit <- train(classe ~ ., data = training_final, method='rf')
proc.time() - ptm
## user system elapsed
## 6953.774 26.434 1153.052
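By default, train() estimates performance with 25 bootstrap resamples, which accounts for most of the elapsed time above. A hedged alternative (not what was run here) is to pass an explicit trainControl with k-fold cross-validation, which typically cuts the runtime while still providing an out-of-sample accuracy estimate:

# Sketch only: 5-fold cross-validation instead of the default bootstrap;
# allowParallel lets caret use the workers registered earlier
ctrl <- trainControl(method = 'cv', number = 5, allowParallel = TRUE)
forestFitCV <- train(classe ~ ., data = training_final, method = 'rf', trControl = ctrl)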
Out of Sample prediction
validation_preds <- predict(forestFit, validation)
confusionMatrix(validation_preds, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 2 0 0 0
## B 0 946 4 1 0
## C 0 1 847 8 0
## D 0 0 4 794 4
## E 0 0 0 1 897
##
## Overall Statistics
##
## Accuracy : 0.9949
## 95% CI : (0.9925, 0.9967)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9936
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9968 0.9906 0.9876 0.9956
## Specificity 0.9994 0.9987 0.9978 0.9980 0.9998
## Pos Pred Value 0.9986 0.9947 0.9895 0.9900 0.9989
## Neg Pred Value 1.0000 0.9992 0.9980 0.9976 0.9990
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2845 0.1929 0.1727 0.1619 0.1829
## Detection Prevalence 0.2849 0.1939 0.1746 0.1635 0.1831
## Balanced Accuracy 0.9997 0.9978 0.9942 0.9928 0.9977
It is quite remarkable to see a nearly diagonal matrix, i.e. a model that predicts almost all of the activities correctly. It is therefore unnecessary to refine the model further; the proof of the pudding is in the eating, so we submit the predictions to the Coursera web interface.
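The expected out-of-sample error is simply one minus the validation accuracy, roughly 0.5% here; it can be extracted from the confusionMatrix object directly:

cm <- confusionMatrix(validation_preds, validation$classe)
oos_error <- 1 - cm$overall['Accuracy']  # about 0.005 on the validation set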
Predict for course exam
We then use the model to predict on the testing dataset, for which the classe column is not provided (the course exam).
ptm <- proc.time()
preds <- predict(forestFit, testing)
proc.time() - ptm
## user system elapsed
## 0.021 0.001 0.021
print(preds)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
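For the course exam, each prediction has to be uploaded as a separate text file. A minimal sketch of the usual submission helper (the exact script is not part of this report, and the problem_id_ file names are an assumption):

# Sketch: write one text file per prediction for the Coursera submission page
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0('problem_id_', i, '.txt'),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(as.character(preds))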
Results
This fairly basic procedure correctly predicts 100% of the cases in the testing dataset, as hinted by the confusion matrix computed on the validation set (0.9949 accuracy).
Misc
The most important predictors can be inspected with varImp:
plot(varImp(forestFit), top = 10)
Retro
For completeness, we also perform a prediction using the rpart algorithm, as in the previous simple-tree study, using the same training dataset defined above.
ptm <- proc.time()
treeFit <- train(classe ~ ., data = training_final, method='rpart')
proc.time() - ptm
## user system elapsed
## 1053.590 5.721 5.915
We can see that the accuracy of the rpart model is much lower, even though it uses the same training data (the computation is much cheaper, though).
validation_preds_rpart <- predict(treeFit, validation)
confusionMatrix(validation_preds_rpart, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1250 385 393 304 212
## B 39 322 31 160 194
## C 82 208 340 100 199
## D 24 31 91 184 28
## E 0 3 0 56 268
##
## Overall Statistics
##
## Accuracy : 0.4821
## 95% CI : (0.468, 0.4961)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3236
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8961 0.33930 0.39766 0.22886 0.29745
## Specificity 0.6312 0.89279 0.85453 0.95756 0.98526
## Pos Pred Value 0.4914 0.43164 0.36598 0.51397 0.81957
## Neg Pred Value 0.9386 0.84921 0.87044 0.86362 0.86170
## Prevalence 0.2845 0.19352 0.17435 0.16395 0.18373
## Detection Rate 0.2549 0.06566 0.06933 0.03752 0.05465
## Detection Prevalence 0.5188 0.15212 0.18944 0.07300 0.06668
## Balanced Accuracy 0.7636 0.61605 0.62610 0.59321 0.64135
The tree obtained is, however, more detailed than the one from the previous study:
fancyRpartPlot(treeFit$finalModel)
Finally, the agreement with the random forest predictions on the testing dataset is rather poor, consistent with the out-of-sample error estimated above.
preds_rpart <- predict(treeFit, testing)
print(preds_rpart)
## [1] D A C A A C D A A A C C C A C A A A A C
## Levels: A B C D E
print(sum(preds == preds_rpart) / length(preds))
## [1] 0.45