Practical machine learning: HAR data massaging and simple tree example

Abstract

This is the course project for ‘Practical Machine Learning’, a course kindly provided by Johns Hopkins University and Coursera. The project requires the use of machine learning techniques to analyze Human Activity Recognition (HAR) data and predict the ‘quality’ of the activity (the classe column) performed by a user wearing sensors.

An important and useful goal of the project is nonetheless to practice a mix of techniques that can be used to build a predictive model, such as k-fold cross-validation, feature selection, and the identification of correlated and zero-variance predictors.

In this case, we use decision trees to predict the ‘classe’ feature and we report the out-of-sample prediction accuracy, which is not particularly good (0.47). To improve the accuracy, we therefore resort to a different algorithm, random forest, tailoring the model to the features present in the testing set.

Methodology

Load raw training data

Before loading the data, a very quick inspection of the .csv training file shows which strings should be treated as NA values and reveals a set of columns that cannot be used as predictors (dates, names of users). We then load the .csv training data and remove these irrelevant columns.

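# dplyr provides select() and filter(); caret provides the modelling helpers used later
library(dplyr)
library(caret)
# It is important to set the NA strings correctly
dataset <- read.csv('../pml-training.csv', na.strings = c('#DIV/0!', '', 'NA'), stringsAsFactors = FALSE)
# Row index, user names and timestamps are not relevant as predictors
dataset <- select(dataset, -c(1:5))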
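
The effect of the na.strings argument can be seen on a small inline sample. The sketch below is illustrative only: the CSV text and the column names roll and kurtosis are made up and are not taken from the training file.

```r
# Hypothetical four-row CSV containing the same NA markers as the training file
csv_text <- "roll,kurtosis\n1.5,#DIV/0!\n2.5,\n3.5,NA\n4.5,0.7"

# Without custom NA strings, '#DIV/0!' survives and forces a character column
naive <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With the NA strings declared, the column is parsed as numeric with three NAs
clean <- read.csv(text = csv_text, na.strings = c('#DIV/0!', '', 'NA'),
                  stringsAsFactors = FALSE)

class(naive$kurtosis)
class(clean$kurtosis)
sum(is.na(clean$kurtosis))
```

Declaring the markers up front avoids having to convert character columns back to numbers later.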

Preprocess

str(dataset) and summary(dataset) reveal that the data frame is not dense, i.e. NA values are scattered heavily throughout the training set. This is because a single observation is split across several rows of the dataset, so that the summary statistics of the user session are only present in the ‘new_window’ rows. The following code demonstrates this by computing, for each row, the fraction of fields that are not NA.

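# Drop the variance (var_) summary columns
dataset <- select(dataset, -grep('var_', colnames(dataset)))

# TRUE where a single field holds a value, FALSE where it is NA
row_data_density <- function(field) {
  res <- !is.na(field)
  return(res)
}

# Fraction of non-NA fields in one row
data_density <- function(row) {
  sum(sapply(row, row_data_density), na.rm = TRUE) / length(row)
}

# Split the rows by window type and compare their densities
new_window <- filter(dataset, new_window == 'yes')
stale_wind <- filter(dataset, new_window != 'yes')
summary(apply(new_window, 1, data_density))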

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7986  0.9281  0.9568  0.9379  0.9568  0.9568

summary(apply(stale_wind, 1, data_density))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3957  0.3957  0.3957  0.3957  0.3957  0.3957
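
The same per-row density can be computed without helper functions. The following is a minimal sketch on a made-up data frame (the column names avg_roll and max_roll are hypothetical), relying on the fact that is.na() applied to a data frame returns a logical matrix.

```r
# Hypothetical toy frame: one 'new window' row with summaries, two rows without
toy <- data.frame(
  new_window = c('yes', 'no', 'no'),
  roll_belt  = c(1.4, 1.5, 1.6),
  avg_roll   = c(1.5, NA, NA),
  max_roll   = c(1.6, NA, NA),
  stringsAsFactors = FALSE
)

# Fraction of non-NA fields per row, computed in one vectorised step
density <- rowMeans(!is.na(toy))

# Average density for each group of rows
tapply(density, toy$new_window, mean)
```

On the real dataset this would presumably be faster than apply(), since it avoids calling an R function once per row.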

As a consequence, we keep only the ‘new_window’ rows, dropping all the rows that lack the summary predictors. We also remove the window-related features.

# Use dense info as dataset
dataset <- new_window

# Remove window features
dataset <- dataset[,-c(1:2)]

We then leverage the caret package to remove predictors with low variance, i.e. with a low probability of being useful.

# Near-zero-variance predictors
nzv <- nearZeroVar(dataset)
# Remove them (guard: an empty index vector would otherwise drop every column)
if (length(nzv) > 0) dataset <- dataset[, -nzv]
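
To make the criterion more concrete, here is a simplified base-R approximation of a near-zero-variance check. It is not caret's implementation: the function name is_near_zero_var is hypothetical, and the two thresholds are meant to mirror what are, to my understanding, the nearZeroVar defaults (freqCut = 95/5, uniqueCut = 10).

```r
# Simplified near-zero-variance check (an approximation, not caret's code)
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)   # at most one distinct value
  # Ratio between the most common and the second most common value
  freq_ratio <- as.numeric(counts[1]) / as.numeric(counts[2])
  # Distinct values as a percentage of the number of observations
  percent_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && percent_unique < unique_cut
}

# Made-up columns: constant, almost constant, and fully informative
toy <- data.frame(
  constant    = rep(1, 100),
  almost      = c(rep(0, 99), 1),
  informative = 1:100
)
flagged <- sapply(toy, is_near_zero_var)
flagged
```

The first two columns are flagged, while the third is kept because all its values are distinct.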

Finally, we calculate the pairwise correlation of the features, in order to purge the dataset of highly correlated predictors (absolute correlation above 0.9). As an example, some features are the standard deviation and the variance of the same data.

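# Remove correlated predictors (correlations are computed on every column except the outcome)
predictors <- dataset[, names(dataset) != 'classe']
M <- cor(predictors, use='pairwise')
corr_columns <- findCorrelation(M, cutoff=0.9)
# Drop by name, so the result does not depend on the position of 'classe'
dataset <- dataset[, !(names(dataset) %in% colnames(M)[corr_columns])]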
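
The standard deviation versus variance case can be reproduced on synthetic data. The sketch below uses a deliberately naive filter that drops the later column of any pair with absolute correlation above the cutoff. This is a simplification for illustration: as far as I know, caret's findCorrelation instead chooses which member of a pair to drop based on mean absolute correlations. All names and values are made up.

```r
set.seed(1)
s <- runif(200)   # hypothetical per-window standard deviations
toy <- data.frame(stddev = s, variance = s^2, other = rnorm(200))

# Absolute correlations, keeping only the upper triangle (each pair once)
M <- abs(cor(toy))
M[lower.tri(M, diag = TRUE)] <- 0

# A column is dropped if it is highly correlated with any earlier column
to_drop <- colnames(M)[apply(M, 2, function(col) any(col > 0.9))]
reduced <- toy[, !(names(toy) %in% to_drop), drop = FALSE]

to_drop
names(reduced)
```

Because the variance is a monotone function of the standard deviation, the two columns carry nearly the same information and one of them can be removed.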

Set seed

set.seed(8888)

Build model and predict from fold

To estimate the out-of-sample accuracy, we create 20 folds to be used for cross-validation.

# Create k-folds (20 folds, so that the held-out sets have a size similar to the given testing set)
num_folds <- 20
kfolds <- createFolds(y=dataset$classe, k=num_folds, list=TRUE, returnTrain=TRUE)
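
For readers unfamiliar with the fold structure, the following base-R sketch builds k-fold indices by hand. It is only an approximation of the idea: unlike createFolds, which to my knowledge samples within the levels of a factor outcome, this version ignores the outcome entirely. The sizes n and k are arbitrary.

```r
set.seed(8888)
n <- 23   # hypothetical number of rows
k <- 5    # hypothetical number of folds

# Shuffle the row indices and deal them into k held-out groups
held_out <- split(sample(n), rep(seq_len(k), length.out = n))

# Mirror returnTrain = TRUE: each fold stores the rows used for training
train_idx <- lapply(held_out, function(idx) setdiff(seq_len(n), idx))

# Size of each held-out set
sapply(held_out, length)
```

Every row is held out exactly once, and the training indices of a fold are simply all the remaining rows.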

We then train the model, predict on the held-out rows and calculate the model's accuracy for each fold. We use simple trees (rpart package).

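getAccuracyFromFold <- function(i){
  # kfolds holds the training indices (returnTrain = TRUE); the remaining rows are held out
  training <- dataset[ kfolds[[i]], ]
  testing  <- dataset[ -kfolds[[i]], ]
  treeFit <- train(classe ~ ., data = training, method='rpart')
  preds <- predict(treeFit, testing, na.action = 'na.pass')
  # confusionMatrix() expects the predictions first, then the reference classes
  cMatrix <- confusionMatrix(preds, factor(testing$classe, levels = levels(preds)))
  return(cMatrix$overall['Accuracy'])
}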
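
The accuracy extracted from the confusion matrix is simply the share of observations on its diagonal. A minimal sketch with made-up labels, using only base R:

```r
classes <- c('A', 'B', 'C')
truth <- factor(c('A', 'A', 'B', 'B', 'C', 'C'), levels = classes)
preds <- factor(c('A', 'B', 'B', 'B', 'C', 'A'), levels = classes)

# Rows are predictions, columns are the true classes
cm <- table(Prediction = preds, Reference = truth)

# Accuracy = correctly classified observations / all observations
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

Here two of the six observations are misclassified, so the accuracy is 4/6.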

# Train and predict on every fold, timing the whole run
ptm <- proc.time()
accs <- sapply(seq_len(num_folds), getAccuracyFromFold)
proc.time() - ptm

##    user  system elapsed 
##  28.562   0.012  28.584

Expected out of sample accuracy

The estimated out-of-sample accuracy is the average (reported with its standard deviation) of the accuracies obtained on the different folds. The mean accuracy is 0.47 (standard deviation 0.06), which is surely not a great result. However, achieving high accuracy was not the goal of this study; random forests reach 99% prediction accuracy.

paste(round(mean(accs),2), round(sd(accs),2))

## [1] "0.47 0.06"

Example figure

We plot the tree obtained by training the model on the whole pre-processed training sample with the rpart algorithm.

# fancyRpartPlot() comes from the rattle package
library(rattle)
treeFit <- train(classe ~ ., data = dataset, method='rpart')
fancyRpartPlot(treeFit$finalModel)

Giovanni Giupponi