Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this project, we will use data from accelerometers on the belt, forearm, arm, and dumbbell of six participants to predict the manner in which they did the exercise.

All analysis was performed on a Lenovo IdeaPad Y500 with an i7 processor and 8GB RAM using RStudio on Ubuntu 15.10.

This is an R Markdown project. To see the source code, please visit my GitHub repository.

Data Preprocessing

We will begin by loading the necessary packages for our analysis and registering the number of cores to be used (for parallelized random forest processing).

library(caret)
library(ggplot2)
library(doMC)
registerDoMC(cores = 8)
set.seed(1248) # For reproducibility

Now we will load the two datasets. The first is used for all training and cross-validation, while the second consists of the 20 test instances used for submission. The subTest dataset is used to evaluate our model on the Practical Machine Learning webpage.

allData <- read.csv("./pml-training.csv", na.strings=c("","NA"))
subTest <- read.csv("./pml-testing.csv")
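
To illustrate what the na.strings argument changes, here is a minimal sketch on a made-up CSV (the column names x and note are hypothetical and are not taken from the dataset). It compares how many missing values are detected with and without treating empty strings as NA.

```r
# Toy CSV (made-up values): the text column "note" has one blank field
# and one literal "NA".
csvText <- "x,note,classe
1,,A
2,ok,B
3,NA,A"

# Same settings as the allData call: blanks and "NA" both become NA.
withBlank <- read.csv(text = csvText, na.strings = c("", "NA"))

# Default settings: only the literal "NA" becomes NA in a text column.
defaultRead <- read.csv(text = csvText)

print(sum(is.na(withBlank$note)))
print(sum(is.na(defaultRead$note)))
```

With the explicit na.strings setting, blank text fields are counted as missing, which is what makes the NA-fraction filter later in the analysis meaningful for such columns.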

Cleaning the Data Set

First, we will create a cleaned dataset allData.cleaned, which removes several non-predictors (participant name, row id, time stamps, etc.) from our dataset.

allData.cleaned <- subset(allData, select = -c(user_name,raw_timestamp_part_1,raw_timestamp_part_2,cvtd_timestamp,num_window,new_window,X))

NA Removal

Now that we have removed several non-predictor variables, we are left with a dataset containing many NA entries. To address this, we will keep only the columns in which fewer than 25% of the entries are NA.

nObs <- nrow(allData.cleaned)
allData.cleaned <- allData.cleaned[,colSums(is.na(allData.cleaned))/nObs < 0.25]
dim(allData.cleaned)
## [1] 19622    53
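
The column filter above can be sketched on a small made-up data frame (the column names full, mostlyNA, and fewNA are hypothetical). This only illustrates the thresholding logic; it is not part of the original analysis.

```r
# Toy data frame (made-up values) with different amounts of missing data.
toy <- data.frame(
  full     = c(1, 2, 3, 4, 5),        # 0% NA
  mostlyNA = c(NA, NA, NA, NA, 5),    # 80% NA
  fewNA    = c(1, NA, 3, 4, 5),       # 20% NA
  classe   = c("A", "B", "A", "B", "A")
)

# Fraction of NA per column; equivalent to colSums(is.na(toy)) / nrow(toy).
naFraction <- colMeans(is.na(toy))

# Keep only the columns whose NA fraction is below the 25% threshold.
kept <- toy[, naFraction < 0.25]

print(round(naFraction, 2))
print(names(kept))
```

Here only mostlyNA is dropped, while fewNA survives because its NA fraction (0.2) is under the threshold.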

Data Typing

All remaining entries should be treated as numeric. This loop transforms all columns (except the classe column) to numeric vectors.

for (i in names(allData.cleaned[,1:(ncol(allData.cleaned)-1)]))
{
    allData.cleaned[[i]] <- as.numeric(allData.cleaned[[i]])
}
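
As a possible alternative to the loop, the same conversion can be sketched without assuming that classe is the last column. The toy data frame and the name predictorCols are made up for illustration.

```r
# Toy data frame (made-up values) with an integer and a numeric predictor.
toy <- data.frame(
  roll   = c(1L, 2L, 3L),
  pitch  = c(0.5, 1.5, 2.5),
  classe = factor(c("A", "B", "A"))
)

# Select the predictors by name rather than by position.
predictorCols <- setdiff(names(toy), "classe")

# Convert every predictor column to numeric in one step.
toy[predictorCols] <- lapply(toy[predictorCols], as.numeric)

print(sapply(toy, class))
```

One caveat worth keeping in mind: calling as.numeric on a factor returns its internal level codes rather than the printed values, so it is worth checking the column classes before converting if any predictor might have been read as a factor.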

As a preliminary check that the remaining variables show some interesting variance, Appendix 1 provides a histogram of each (scaled) column. Each variable appears to be non-uniformly distributed, which suggests that these variables might make good predictors.

Data Partitioning

Finally, we partition the data into training (70%) and testing (30%) sets, both separate from the submission test set.

inTrain <- createDataPartition(y=allData.cleaned$classe,p=.7,list=FALSE)
training <- allData.cleaned[inTrain,]
testing <- allData.cleaned[-inTrain,]
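
The idea behind a class-stratified 70/30 split can be sketched in base R. This is a simplified stand-in written for illustration, not caret's actual implementation of createDataPartition, and the class sizes are made up.

```r
set.seed(1)  # arbitrary seed for this toy example

# Made-up class labels: 50 of A, 30 of B, 20 of C.
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class separately,
# so that class proportions are preserved in both subsets.
inTrainToy <- unlist(lapply(split(seq_along(classe), classe), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

trainClasse <- classe[inTrainToy]
testClasse  <- classe[-inTrainToy]

print(table(trainClasse))
print(table(testClasse))
```

Because the sampling is done per class, the training subset always contains 35, 21, and 14 rows of A, B, and C, and the testing subset contains the remaining 15, 9, and 6, whatever the seed.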

Building our Model

Now that the data has been sufficiently preprocessed, we can set out to build our model. We will use a random forest model to classify each instance (row) in the dataset into one of five classes (A through E), each representing a different manner of performing the exercise. Additionally, we will use k-fold cross-validation with k=5.

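# 5-fold cross-validation
myControl <- trainControl(method = "cv", number = 5)
# Random forest with 250 trees; ntree is passed through to the underlying randomForest call
modFit <- train(classe ~ ., method = "rf", data = training, verbose = FALSE, trControl = myControl, ntree = 250)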
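
To make the 5-fold scheme concrete, here is a minimal base R sketch of how rows are divided into folds. The row count n is hypothetical, and caret's own fold construction may differ (for example, by balancing classes across folds), so this only illustrates the general idea.

```r
set.seed(1)  # arbitrary seed for this toy example
n <- 20      # hypothetical number of training rows
k <- 5       # number of folds, as in the trainControl call

# Randomly assign each row to one of k equally sized folds.
foldId <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  heldOut <- which(foldId == fold)   # rows used to score the model
  fitRows <- which(foldId != fold)   # rows used to fit the model
  cat("Fold", fold, ": fit on", length(fitRows),
      "rows, evaluate on", length(heldOut), "rows\n")
}
```

Each row is held out exactly once, so every observation contributes to the resampled accuracy estimate without ever being scored by a model that was fit on it.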

Making Predictions

Now that we have our model, we can use it to classify the data in our testing set. Here we do so and print out the resulting confusion matrix.

preds <- predict(modFit, testing)
confusionMatrix(preds,testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1670    9    0    0    0
##          B    3 1123    7    3    0
##          C    0    7 1017   14    3
##          D    0    0    2  947    2
##          E    1    0    0    0 1077
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9913          
##                  95% CI : (0.9886, 0.9935)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.989           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9976   0.9860   0.9912   0.9824   0.9954
## Specificity            0.9979   0.9973   0.9951   0.9992   0.9998
## Pos Pred Value         0.9946   0.9886   0.9769   0.9958   0.9991
## Neg Pred Value         0.9990   0.9966   0.9981   0.9966   0.9990
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2838   0.1908   0.1728   0.1609   0.1830
## Detection Prevalence   0.2853   0.1930   0.1769   0.1616   0.1832
## Balanced Accuracy      0.9977   0.9916   0.9931   0.9908   0.9976
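
As a quick check on the statistics above, the overall accuracy and the per-class sensitivity can be recomputed directly from the counts in the confusion matrix. The variable names cm, accuracy, and sensitivity are introduced here for illustration only.

```r
# Confusion matrix copied from the output above
# (rows = predicted class, columns = reference class).
cm <- matrix(c(1670,    9,    0,   0,    0,
                  3, 1123,    7,   3,    0,
                  0,    7, 1017,  14,    3,
                  0,    0,    2, 947,    2,
                  1,    0,    0,   0, 1077),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

accuracy    <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
sensitivity <- diag(cm) / colSums(cm)    # per class: correct / actual members

print(round(accuracy, 4))
print(round(sensitivity, 4))
```

The accuracy works out to 5834 / 5885, which rounds to the 0.9913 reported by confusionMatrix, and the sensitivities match the Sensitivity row of the per-class table.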

As the confusion matrix shows, the model performed extremely well, correctly classifying 99.13% of the testing data, which corresponds to an estimated out-of-sample error of about 0.87%. Finally, we apply the model to the submission test set subTest, after restricting subTest to the same predictor columns that remain in allData.cleaned.

myPredictors <- names(allData.cleaned)
myPredictors <- myPredictors[myPredictors!= "classe"]
subTest <- subTest[,myPredictors]
predsSub <- predict(modFit, subTest)
predsSub
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Here we can see our predictions on the submission test set. Submitting these values, we found that the classifier labeled all 20 instances correctly (100% accuracy on the submission test set).

Appendix 1: Exploratory Variable Analysis

Here we plot a histogram of each scaled variable in the cleaned dataset. Visual inspection of these histograms suggests that each variable has enough spread that it may prove to be a useful predictor.

library(reshape2) # Assumed source of melt(); it is not attached by the packages loaded earlier
d <- melt(data.frame(scale(allData.cleaned[,1:(ncol(allData.cleaned)-1)])))
ggplot(d,aes(x = value)) + 
  facet_wrap(~variable,scales = "free_x", ncol = 4) + 
  geom_histogram() +
  ggtitle("Scaled Histograms of All Predictors")