Convolutional Neural Networks (ConvNets or CNNs) are a category of Neural Networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects and traffic signs, apart from powering vision in robots and self-driving cars.
In Figure 1 above, a ConvNet is able to recognize scenes and the system is able to suggest relevant captions (“a soccer player is kicking a soccer ball”) while Figure 2 shows an example of ConvNets being used for recognizing everyday objects, humans and animals. Lately, ConvNets have been effective in several Natural Language Processing tasks (such as sentence classification) as well.
ConvNets, therefore, are an important tool for most machine learning practitioners today. However, understanding ConvNets and learning to use them for the first time can sometimes be an intimidating experience. The primary purpose of this blog post is to develop an understanding of how Convolutional Neural Networks work on images.
If you are new to neural networks in general, I would recommend reading this short tutorial on Multi Layer Perceptrons to get an idea about how they work, before proceeding. Multi Layer Perceptrons are referred to as “Fully Connected Layers” in this post.
LeNet was one of the very first convolutional neural networks which helped propel the field of Deep Learning. This pioneering work by Yann LeCun was named LeNet5 after many previous successful iterations since the year 1988 [3]. At that time the LeNet architecture was used mainly for character recognition tasks such as reading zip codes, digits, etc.
Below, we will develop an intuition of how the LeNet architecture learns to recognize images. Several new architectures proposed in recent years improve on LeNet, but they all use its main concepts and are relatively easy to understand if you have a clear understanding of the former.
The Convolutional Neural Network in Figure 3 is similar in architecture to the original LeNet and classifies an input image into four categories: dog, cat, boat or bird (the original LeNet was used mainly for character recognition tasks). As evident from the figure above, on receiving a boat image as input, the network correctly assigns the highest probability for boat (0.94) among all four categories. The sum of all probabilities in the output layer should be one (explained later in this post).
There are four main operations in the ConvNet shown in Figure 3 above:
1. Convolution
2. Non Linearity (ReLU)
3. Pooling or Sub Sampling
4. Classification (Fully Connected Layer)
These operations are the basic building blocks of every Convolutional Neural Network, so understanding how these work is an important step to developing a sound understanding of ConvNets. We will try to understand the intuition behind each of these operations below.
Essentially, every image can be represented as a matrix of pixel values.
Channel is a conventional term used to refer to a certain component of an image. An image from a standard digital camera will have three channels – red, green and blue – you can imagine those as three 2d-matrices stacked over each other (one for each color), each having pixel values in the range 0 to 255.
A grayscale image, on the other hand, has just one channel. For the purpose of this post, we will only consider grayscale images, so we will have a single 2d matrix representing an image. The value of each pixel in the matrix will range from 0 to 255 – zero indicating black and 255 indicating white.
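For instance, a tiny 4 x 4 grayscale image is just a matrix like the following in R (the pixel values here are arbitrary, chosen only for illustration):

    img <- matrix(c(  0,  64, 128, 255,
                     32,  96, 160, 224,
                      0,   0, 255, 255,
                    128, 128,  64,   0),
                  nrow = 4, byrow = TRUE)  # 0 = black, 255 = white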
ConvNets derive their name from the “convolution” operator. The primary purpose of Convolution in case of a ConvNet is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. We will not go into the mathematical details of Convolution here, but will try to understand how it works over images.
As we discussed above, every image can be considered as a matrix of pixel values. Consider a 5 x 5 image whose pixel values are only 0 and 1 (note that for a grayscale image, pixel values range from 0 to 255, the green matrix below is a special case where pixel values are only 0 and 1):
Also, consider another 3 x 3 matrix as shown below:
Then, the Convolution of the 5 x 5 image and the 3 x 3 matrix can be computed as shown in the animation in Figure 5 below:
Take a moment to understand how the computation above is being done. We slide the orange matrix over our original image (green) by 1 pixel (also called ‘stride’) and for every position, we compute element wise multiplication (between the two matrices) and add the multiplication outputs to get the final integer which forms a single element of the output matrix (pink). Note that the 3×3 matrix “sees” only a part of the input image in each stride.
In CNN terminology, the 3×3 matrix is called a ‘filter‘ or ‘kernel’ or ‘feature detector’ and the matrix formed by sliding the filter over the image and computing the dot product is called the ‘Convolved Feature’ or ‘Activation Map’ or the ‘Feature Map‘. It is important to note that filters act as feature detectors from the original input image.
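Here is a minimal R sketch of this sliding-window computation (stride 1, no padding). The 5 x 5 binary image and 3 x 3 filter below are example values chosen for illustration, not taken from the figure:

    convolve2d <- function(image, kernel) {
      k <- nrow(kernel)
      out_dim <- nrow(image) - k + 1
      out <- matrix(0, out_dim, out_dim)
      for (i in 1:out_dim) {
        for (j in 1:out_dim) {
          # element-wise multiplication of the patch and the kernel, then sum
          patch <- image[i:(i + k - 1), j:(j + k - 1)]
          out[i, j] <- sum(patch * kernel)
        }
      }
      out
    }

    image  <- matrix(c(1,1,1,0,0, 0,1,1,1,0, 0,0,1,1,1, 0,0,1,1,0, 0,1,1,0,0),
                     nrow = 5, byrow = TRUE)   # 5 x 5 binary image
    kernel <- matrix(c(1,0,1, 0,1,0, 1,0,1), nrow = 3, byrow = TRUE)  # 3 x 3 filter
    convolve2d(image, kernel)                  # 3 x 3 Feature Map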
It is evident from the animation above that different values of the filter matrix will produce different Feature Maps for the same input image. As an example, consider the following input image:
In the table below, we can see the effects of convolution of the above image with different filters. As shown, we can perform operations such as Edge Detection, Sharpen and Blur just by changing the numeric values of our filter matrix before the convolution operation [8] – this means that different filters can detect different features from an image, for example edges, curves etc. More such examples are available in Section 8.2.4 here.
Another good way to understand the Convolution operation is by looking at the animation in Figure 6 below:
A filter (with red outline) slides over the input image (convolution operation) to produce a feature map. The convolution of another filter (with the green outline), over the same image gives a different feature map as shown. It is important to note that the Convolution operation captures the local dependencies in the original image. Also notice how these two different filters generate different feature maps from the same original image. Remember that the image and the two filters above are just numeric matrices as we have discussed above.
In practice, a CNN learns the values of these filters on its own during the training process (although we still need to specify parameters such as the number of filters, filter size, architecture of the network etc. before the training process). The more filters we have, the more image features get extracted and the better our network becomes at recognizing patterns in unseen images.
The size of the Feature Map (Convolved Feature) is controlled by three parameters [4] that we need to decide before the convolution step is performed (a quick size calculation follows this list):
- Depth: the number of filters we use for the convolution operation; each filter produces its own feature map, so depth equals the number of feature maps.
- Stride: the number of pixels by which we slide our filter matrix over the input matrix. A larger stride produces a smaller feature map.
- Zero-padding: padding the input matrix with zeros around the border, which lets us apply the filter to the bordering elements of the input and control the size of the feature maps.
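For a square input of width W, filter size F, stride S and zero-padding P, the feature map width is (W − F + 2P)/S + 1 [4]. A quick check in R, applied to the example above:

    feature_map_size <- function(W, F, S, P = 0) (W - F + 2 * P) / S + 1
    feature_map_size(5, 3, 1)         # the 5 x 5 image and 3 x 3 filter above -> 3
    feature_map_size(5, 3, 1, P = 1)  # 1-pixel zero-padding preserves the size -> 5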
An additional operation called ReLU has been used after every Convolution operation in Figure 3 above. ReLU stands for Rectified Linear Unit and is a non-linear operation. Its output is given by: Output = max(0, Input).
ReLU is an element wise operation (applied per pixel) and replaces all negative pixel values in the feature map by zero. The purpose of ReLU is to introduce non-linearity in our ConvNet, since most of the real-world data we would want our ConvNet to learn would be non-linear (Convolution is a linear operation – element wise matrix multiplication and addition, so we account for non-linearity by introducing a non-linear function like ReLU).
The ReLU operation can be understood clearly from Figure 9 below. It shows the ReLU operation applied to one of the feature maps obtained in Figure 6 above. The output feature map here is also referred to as the ‘Rectified’ feature map.
Other non linear functions such as tanh or sigmoid can also be used instead of ReLU, but ReLU has been found to perform better in most situations.
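In R, ReLU and the alternative non-linearities mentioned above are one-liners, applied element-wise to a feature map (the values below are hypothetical):

    relu    <- function(x) pmax(0, x)          # replaces negative values with zero
    sigmoid <- function(x) 1 / (1 + exp(-x))
    fmap <- matrix(c(-2, 3, -1, 5), nrow = 2)  # hypothetical feature map
    relu(fmap)     # rectified feature map: negatives become 0
    tanh(fmap)     # squashes values to (-1, 1)
    sigmoid(fmap)  # squashes values to (0, 1)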
Spatial Pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information. Spatial Pooling can be of different types: Max, Average, Sum etc.
In case of Max Pooling, we define a spatial neighborhood (for example, a 2×2 window) and take the largest element from the rectified feature map within that window. Instead of taking the largest element we could also take the average (Average Pooling) or sum of all elements in that window. In practice, Max Pooling has been shown to work better.
Figure 10 shows an example of Max Pooling operation on a Rectified Feature map (obtained after convolution + ReLU operation) by using a 2×2 window.
We slide our 2 x 2 window by 2 cells (also called ‘stride’) and take the maximum value in each region. As shown in Figure 10, this reduces the dimensionality of our feature map.
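Here is a minimal R sketch of this windowed maximum, using a hypothetical 4 x 4 rectified feature map:

    max_pool <- function(fmap, size = 2, stride = 2) {
      rows <- seq(1, nrow(fmap) - size + 1, by = stride)
      cols <- seq(1, ncol(fmap) - size + 1, by = stride)
      out <- matrix(0, length(rows), length(cols))
      for (i in seq_along(rows)) {
        for (j in seq_along(cols)) {
          # take the largest element within each window
          out[i, j] <- max(fmap[rows[i]:(rows[i] + size - 1),
                                cols[j]:(cols[j] + size - 1)])
        }
      }
      out
    }
    rectified <- matrix(c(1,1,2,4, 5,6,7,8, 3,2,1,0, 1,2,3,4), nrow = 4, byrow = TRUE)
    max_pool(rectified)  # 2 x 2 output: 6 8 / 3 4

Replacing max() with mean() in the inner loop gives Average Pooling.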
In the network shown in Figure 11, pooling operation is applied separately to each feature map (notice that, due to this, we get three output maps from three input maps).
Figure 12 shows the effect of Pooling on the Rectified Feature Map we received after the ReLU operation in Figure 9 above.
The function of Pooling is to progressively reduce the spatial size of the input representation [4]. In particular, pooling:
- makes the input representations (feature dimension) smaller and more manageable
- reduces the number of parameters and computations in the network, thereby controlling overfitting [4]
- makes the network invariant to small transformations, distortions and translations in the input image (a small distortion in the input will not change the output of Pooling much, since we take the maximum / average value in a local neighborhood)
- helps us arrive at an almost scale invariant representation of our image
So far we have seen how Convolution, ReLU and Pooling work. It is important to understand that these layers are the basic building blocks of any CNN. As shown in Figure 13, we have two sets of Convolution, ReLU & Pooling layers – the 2nd Convolution layer performs convolution on the output of the first Pooling Layer using six filters to produce a total of six feature maps. ReLU is then applied individually on all of these six feature maps. We then perform Max Pooling operation separately on each of the six rectified feature maps.
Together these layers extract the useful features from the images, introduce non-linearity in our network and reduce feature dimension while aiming to make the features somewhat equivariant to scale and translation [18].
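Using the sketch functions defined in the snippets above (convolve2d, max_pool, and pmax for ReLU), one Convolution -> ReLU -> Pooling stage is simply a composition:

    feature_map <- convolve2d(image, kernel)  # Convolution
    rectified   <- pmax(0, feature_map)       # ReLU
    pooled      <- max_pool(rectified)        # Pooling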
The output of the 2nd Pooling Layer acts as an input to the Fully Connected Layer, which we will discuss in the next section.
The Fully Connected layer is a traditional Multi Layer Perceptron that uses a softmax activation function in the output layer (other classifiers like SVM can also be used, but we will stick to softmax in this post). The term “Fully Connected” implies that every neuron in the previous layer is connected to every neuron in the next layer. I recommend reading this post if you are unfamiliar with Multi Layer Perceptrons.
The output from the convolutional and pooling layers represents high-level features of the input image. The purpose of the Fully Connected layer is to use these features for classifying the input image into various classes based on the training dataset. For example, the image classification task we set out to perform has four possible outputs as shown in Figure 14 below (note that Figure 14 does not show connections between the nodes in the fully connected layer).
Apart from classification, adding a fully-connected layer is also a (usually) cheap way of learning non-linear combinations of these features. Most of the features from convolutional and pooling layers may be good for the classification task, but combinations of those features might be even better [11].
The sum of output probabilities from the Fully Connected Layer is 1. This is ensured by using the Softmax as the activation function in the output layer of the Fully Connected Layer. The Softmax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one.
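A minimal R sketch of Softmax, using hypothetical scores for the four classes from Figure 3:

    softmax <- function(z) {
      z <- z - max(z)        # shift for numerical stability (does not change the result)
      exp(z) / sum(exp(z))
    }
    scores <- c(dog = 1.2, cat = 0.4, boat = 4.8, bird = 0.1)  # hypothetical scores
    probs  <- softmax(scores)
    round(probs, 2)  # each value lies between 0 and 1
    sum(probs)       # and they sum to 1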
As discussed above, the Convolution + Pooling layers act as Feature Extractors from the input image while the Fully Connected layer acts as a classifier.
Note that in Figure 15 below, since the input image is a boat, the target probability is 1 for the Boat class and 0 for the other three classes, i.e. the target vector is [0, 0, 1, 0] (corresponding to Dog, Cat, Boat, Bird).
The overall training process of the Convolution Network may be summarized as below:
- Step 1: We initialize all filters and parameters / weights with random values.
- Step 2: The network takes a training image as input, goes through the forward propagation step (convolution, ReLU and pooling operations along with forward propagation in the Fully Connected layer) and finds the output probabilities for each class.
- Step 3: We calculate the total error at the output layer (summed over all four classes): Total Error = ½ Σ (target probability − output probability)²
- Step 4: We use Backpropagation to calculate the gradients of the error with respect to all weights in the network, and use gradient descent to update all filter values / weights and parameter values to minimize the output error.
- Step 5: We repeat steps 2-4 with all images in the training set.
The above steps train the ConvNet – this essentially means that all the weights and parameters of the ConvNet have now been optimized to correctly classify images from the training set.
When a new (unseen) image is input into the ConvNet, the network would go through the forward propagation step and output a probability for each class (for a new image, the output probabilities are calculated using the weights which have been optimized to correctly classify all the previous training examples). If our training set is large enough, the network will (hopefully) generalize well to new images and classify them into correct categories.
Note 1: The steps above have been oversimplified and mathematical details have been avoided to provide intuition into the training process. See [4] and [12] for a mathematical formulation and thorough understanding.
Note 2: In the example above we used two sets of alternating Convolution and Pooling layers. Please note, however, that these operations can be repeated any number of times in a single ConvNet. In fact, some of the best performing ConvNets today have tens of Convolution and Pooling layers! Also, it is not necessary to have a Pooling layer after every Convolutional Layer. As can be seen in Figure 16 below, we can have multiple Convolution + ReLU operations in succession before having a Pooling operation. Also notice how each layer of the ConvNet is visualized in Figure 16 below.
In general, the more convolution steps we have, the more complicated features our network will be able to learn to recognize. For example, in Image Classification a ConvNet may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers [14]. This is demonstrated in Figure 17 below – these features were learnt using a Convolutional Deep Belief Network and the figure is included here just for demonstrating the idea (this is only an example: real life convolution filters may detect objects that have no meaning to humans).
Adam Harley created amazing visualizations of a Convolutional Neural Network trained on the MNIST Database of handwritten digits [13]. I highly recommend playing around with it to understand details of how a CNN works.
We will see below how the network works for an input ‘8’. Note that the visualization in Figure 18 does not show the ReLU operation separately.
The input image contains 1024 pixels (32 x 32 image) and the first Convolution layer (Convolution Layer 1) is formed by convolution of six unique 5 × 5 (stride 1) filters with the input image. As seen, using six different filters produces a feature map of depth six.
Convolutional Layer 1 is followed by Pooling Layer 1 that does 2 × 2 max pooling (with stride 2) separately over the six feature maps in Convolution Layer 1. You can move your mouse pointer over any pixel in the Pooling Layer and observe the 2 x 2 grid it forms in the previous Convolution Layer (demonstrated in Figure 19). You’ll notice that the pixel having the maximum value (the brightest one) in the 2 x 2 grid makes it to the Pooling layer.
Pooling Layer 1 is followed by sixteen 5 × 5 (stride 1) convolutional filters that perform the convolution operation. This is followed by Pooling Layer 2 that does 2 × 2 max pooling (with stride 2). These two layers use the same concepts as described above.
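As a quick check, the layer sizes described above follow from the (W − F)/S + 1 size formula used in the convolution section (no padding):

    (32 - 5) / 1 + 1   # Convolution Layer 1: 28 x 28 (x 6 feature maps)
    (28 - 2) / 2 + 1   # Pooling Layer 1:     14 x 14 (x 6)
    (14 - 5) / 1 + 1   # Convolution Layer 2: 10 x 10 (x 16 feature maps)
    (10 - 2) / 2 + 1   # Pooling Layer 2:      5 x  5 (x 16)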
We then have three fully-connected (FC) layers. There are:
- 120 neurons in the first FC layer
- 100 neurons in the second FC layer
- 10 neurons in the third FC layer, corresponding to the 10 digits (this is also the Output layer)
Notice how in Figure 20, each of the 10 nodes in the output layer is connected to all 100 nodes in the 2nd Fully Connected layer (hence the name Fully Connected).
Also, note how the only bright node in the Output Layer corresponds to ‘8’ – this means that the network correctly classifies our handwritten digit (brighter node denotes that the output from it is higher, i.e. 8 has the highest probability among all other digits).
The 3d version of the same visualization is available here.
Convolutional Neural Networks have been around since the early 1990s. We discussed LeNet above, which was one of the very first convolutional neural networks. Other influential architectures include AlexNet (2012), ZF Net (2013), GoogLeNet (2014), VGGNet (2014), ResNet (2015) and DenseNet (2016) [3] [4].
In this post, I have tried to explain the main concepts behind Convolutional Neural Networks in simple terms. There are several details I have oversimplified / skipped, but hopefully this post gave you some intuition around how they work.
This post was originally inspired from Understanding Convolutional Neural Networks for NLP by Denny Britz (which I would recommend reading) and a number of explanations here are based on that post. For a more thorough understanding of some of these concepts, I would encourage you to go through the notes from Stanford’s course on ConvNets as well as other excellent resources mentioned under References below. If you face any issues understanding any of the above concepts or have questions / suggestions, feel free to leave a comment below.
All images and animations used in this post belong to their respective authors as listed in References section below.
The basic unit of computation in a neural network is the neuron, often called a node or unit. It receives input from some other nodes, or from an external source and computes an output. Each input has an associated weight (w), which is assigned on the basis of its relative importance to other inputs. The node applies a function f (defined below) to the weighted sum of its inputs as shown in Figure 1 below:
The above network takes numerical inputs X1 and X2 and has weights w1 and w2 associated with those inputs. Additionally, there is another input 1 with weight b (called the Bias) associated with it. We will learn more details about the role of the bias later.
The output Y from the neuron is computed as shown in the Figure 1. The function f is non-linear and is called the Activation Function. The purpose of the activation function is to introduce non-linearity into the output of a neuron. This is important because most real world data is non linear and we want neurons to learn these non linear representations.
Every activation function (or non-linearity) takes a single number and performs a certain fixed mathematical operation on it [2]. There are several activation functions you may encounter in practice:
- Sigmoid: takes a real-valued input and squashes it to the range [0, 1]: σ(x) = 1 / (1 + exp(−x))
- tanh: takes a real-valued input and squashes it to the range [−1, 1]: tanh(x) = 2σ(2x) − 1
- ReLU (Rectified Linear Unit): takes a real-valued input and thresholds it at zero (replaces negative values with zero): f(x) = max(0, x)
The below figures [2] show each of the above activation functions.
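Putting this together, the neuron of Figure 1 can be sketched in R with hypothetical inputs and weights:

    sigmoid <- function(x) 1 / (1 + exp(-x))
    X <- c(0.5, -1.0)    # inputs X1, X2 (hypothetical)
    w <- c(0.8, 0.2)     # weights w1, w2 (hypothetical)
    b <- 0.3             # bias weight
    z <- sum(w * X) + b  # weighted sum of inputs plus the bias
    sigmoid(z)           # output Y with sigmoid activation
    tanh(z)              # ... with tanh
    max(0, z)            # ... with ReLU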
Importance of Bias: The main function of Bias is to provide every node with a trainable constant value (in addition to the normal inputs that the node receives). See this link to learn more about the role of bias in a neuron.
The feedforward neural network was the first and simplest type of artificial neural network devised [3]. It contains multiple neurons (nodes) arranged in layers. Nodes from adjacent layers have connections or edges between them. All these connections have weights associated with them.
An example of a feedforward neural network is shown in Figure 3.
A feedforward neural network can consist of three types of nodes:
- Input Nodes: they provide information from the outside world to the network and are together referred to as the “Input Layer”. No computation is performed in any of the Input nodes; they just pass the information on to the hidden nodes.
- Hidden Nodes: they have no direct connection with the outside world (hence the name “hidden”). They perform computations and transfer information from the input nodes to the output nodes, and are together referred to as the “Hidden Layer”.
- Output Nodes: they are together referred to as the “Output Layer” and are responsible for computations and for transferring information from the network to the outside world.
In a feedforward network, the information moves in only one direction – forward – from the input nodes, through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in the network [3] (this property of feed forward networks is different from Recurrent Neural Networks in which the connections between the nodes form a cycle).
Two examples of feedforward networks are given below:
Single Layer Perceptron – This is the simplest feedforward neural network [4] and does not contain any hidden layer. You can learn more about Single Layer Perceptrons in [4], [5], [6], [7].
Multi Layer Perceptron – A Multi Layer Perceptron has one or more hidden layers. We will only discuss Multi Layer Perceptrons below since they are more useful than Single Layer Perceptrons for practical applications today.
A Multi Layer Perceptron (MLP) contains one or more hidden layers (apart from one input and one output layer). While a single layer perceptron can only learn linear functions, a multi layer perceptron can also learn non-linear functions.
Figure 4 shows a multi layer perceptron with a single hidden layer. Note that all connections have weights associated with them, but only three weights (w0, w1, w2) are shown in the figure.
Input Layer: The Input layer has three nodes. The Bias node has a value of 1. The other two nodes take X1 and X2 as external inputs (which are numerical values depending upon the input dataset). As discussed above, no computation is performed in the Input layer, so the outputs from nodes in the Input layer are 1, X1 and X2 respectively, which are fed into the Hidden Layer.
Hidden Layer: The Hidden layer also has three nodes with the Bias node having an output of 1. The output of the other two nodes in the Hidden layer depends on the outputs from the Input layer (1, X1, X2) as well as the weights associated with the connections (edges). Figure 4 shows the output calculation for one of the hidden nodes (highlighted). Similarly, the output from other hidden node can be calculated. Remember that f refers to the activation function. These outputs are then fed to the nodes in the Output layer.
Output Layer: The Output layer has two nodes which take inputs from the Hidden layer and perform similar computations as shown for the highlighted hidden node. The values calculated (Y1 and Y2) as a result of these computations act as outputs of the Multi Layer Perceptron.
Given a set of features X = (x1, x2, …) and a target y, a Multi Layer Perceptron can learn the relationship between the features and the target, for either classification or regression.
Let's take an example to understand Multi Layer Perceptrons better. Suppose we have the following student-marks dataset:
The two input columns show the number of hours the student has studied and the mid term marks obtained by the student. The Final Result column can take two values, 1 or 0, indicating whether the student passed the final term. For example, we can see that if the student studied 35 hours and obtained 67 marks in the mid term, he / she ended up passing the final term.
Now, suppose, we want to predict whether a student studying 25 hours and having 70 marks in the mid term will pass the final term.
This is a binary classification problem where a multi layer perceptron can learn from the given examples (training data) and make an informed prediction given a new data point. We will see below how a multi layer perceptron learns such relationships.
The process by which a Multi Layer Perceptron learns is called the Backpropagation algorithm. I would recommend reading this Quora answer by Hemanth Kumar (quoted below) which explains Backpropagation clearly.
Backward Propagation of Errors, often abbreviated as BackProp is one of the several ways in which an artificial neural network (ANN) can be trained. It is a supervised training scheme, which means, it learns from labeled training data (there is a supervisor, to guide its learning).
To put in simple terms, BackProp is like “learning from mistakes“. The supervisor corrects the ANN whenever it makes mistakes.
An ANN consists of nodes in different layers; input layer, intermediate hidden layer(s) and the output layer. The connections between nodes of adjacent layers have “weights” associated with them. The goal of learning is to assign correct weights for these edges. Given an input vector, these weights determine what the output vector is.
In supervised learning, the training set is labeled. This means, for some given inputs, we know the desired/expected output (label).
BackProp Algorithm:
Initially all the edge weights are randomly assigned. For every input in the training dataset, the ANN is activated and its output is observed. This output is compared with the desired output that we already know, and the error is “propagated” back to the previous layer. This error is noted and the weights are “adjusted” accordingly. This process is repeated until the output error is below a predetermined threshold.

Once the above algorithm terminates, we have a “learned” ANN which, we consider, is ready to work with “new” inputs. This ANN is said to have learned from several examples (labeled data) and from its mistakes (error propagation).
Now that we have an idea of how Backpropagation works, let's come back to our student-marks dataset shown above.
The Multi Layer Perceptron shown in Figure 5 (adapted from Sebastian Raschka’s excellent visual explanation of the backpropagation algorithm) has two nodes in the input layer (apart from the Bias node) which take the inputs ‘Hours Studied’ and ‘Mid Term Marks’. It also has a hidden layer with two nodes (apart from the Bias node). The output layer has two nodes as well – the upper node outputs the probability of ‘Pass’ while the lower node outputs the probability of ‘Fail’.
In classification tasks, we generally use a Softmax function as the Activation Function in the Output layer of the Multi Layer Perceptron to ensure that the outputs are probabilities and they add up to 1. The Softmax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one. So, in this case,
Probability (Pass) + Probability (Fail) = 1
Step 1: Forward Propagation
All weights in the network are randomly assigned. Let's consider the hidden layer node marked V in Figure 5 below. Assume the weights of the connections from the inputs to that node are w1, w2 and w3 (as shown).
The network then takes the first training example as input (we know that for inputs 35 and 67, the probability of Pass is 1).
Then the output V from the node in consideration can be calculated as below (f is an activation function such as sigmoid):

V = f(1*w1 + 35*w2 + 67*w3)

The output from the other node in the hidden layer is calculated similarly. The outputs of the two nodes in the hidden layer act as inputs to the two nodes in the output layer, which enables us to calculate the output probabilities from the two nodes in the output layer.
Suppose the output probabilities from the two nodes in the output layer are 0.4 and 0.6 respectively (since the weights are randomly assigned, outputs will also be random). We can see that the calculated probabilities (0.4 and 0.6) are very far from the desired probabilities (1 and 0 respectively), hence the network in Figure 5 is said to have an ‘Incorrect Output’.
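For concreteness, here is a small R sketch of this forward pass. The network shape follows Figure 5 (bias plus two inputs, two hidden nodes plus a bias, two output nodes with Softmax); since the weights are random, the resulting probabilities will differ from 0.4 and 0.6:

    sigmoid <- function(x) 1 / (1 + exp(-x))
    softmax <- function(z) { z <- z - max(z); exp(z) / sum(exp(z)) }
    set.seed(42)
    inputs   <- c(1, 35, 67)                 # bias input, Hours Studied, Mid Term Marks
    W_hidden <- matrix(runif(6, -1, 1), 2)   # random weights into the two hidden nodes
    hidden   <- sigmoid(W_hidden %*% inputs) # hidden outputs (V is hidden[1])
    W_out    <- matrix(runif(6, -1, 1), 2)   # random weights into the two output nodes
    output   <- softmax(W_out %*% c(1, hidden))  # probabilities of Pass and Fail
    output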
Step 2: Back Propagation and Weight Updation
We calculate the total error at the output nodes and propagate these errors back through the network using Backpropagation to calculate the gradients. Then we use an optimization method such as Gradient Descent to ‘adjust’ all weights in the network with an aim of reducing the error at the output layer. This is shown in the Figure 6 below (ignore the mathematical equations in the figure for now).
Suppose that the new weights associated with the node in consideration are w4, w5 and w6 (after Backpropagation and adjusting weights).
If we now input the same example to the network again, the network should perform better than before since the weights have now been adjusted to minimize the error in prediction. As shown in Figure 7, the errors at the output nodes now reduce to [0.2, -0.2] as compared to [0.6, -0.4] earlier. This means that our network has learnt to correctly classify our first training example.
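A minimal sketch of this feedback loop in R, for a single sigmoid output node with hypothetical initial weights; on each pass the weights move against the error gradient, so the prediction error shrinks, mirroring Figures 5-7:

    sigmoid <- function(x) 1 / (1 + exp(-x))
    x <- c(1, 35, 67) / 100       # bias input plus (scaled) features
    target <- 1                   # desired probability of Pass for this example
    w <- c(0.1, -0.2, 0.05)       # initial weights (hypothetical "random" values)
    lr <- 0.5                     # learning rate for gradient descent
    for (pass in 1:5) {
      y <- sigmoid(sum(w * x))            # forward propagation
      error <- target - y                 # compare with the desired output
      grad <- -error * y * (1 - y) * x    # gradient of squared error w.r.t. weights
      w <- w - lr * grad                  # adjust the weights to reduce the error
      cat("pass", pass, "error", round(error, 3), "\n")
    }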
We repeat this process with all other training examples in our dataset. Then, our network is said to have learnt those examples.
If we now want to predict whether a student studying 25 hours and having 70 marks in the mid term will pass the final term, we go through the forward propagation step and find the output probabilities for Pass and Fail.
I have avoided mathematical equations and explanation of concepts such as ‘Gradient Descent’ here and have rather tried to develop an intuition for the algorithm. For a more mathematically involved discussion of the Backpropagation algorithm, refer to this link.
Adam Harley has created a 3d visualization of a Multi Layer Perceptron which has already been trained (using Backpropagation) on the MNIST Database of handwritten digits.
The network takes 784 numeric pixel values as inputs from a 28 x 28 image of a handwritten digit (it has 784 nodes in the Input Layer corresponding to pixels). The network has 300 nodes in the first hidden layer, 100 nodes in the second hidden layer, and 10 nodes in the output layer (corresponding to the 10 digits) [15].
Although the network described here is much larger (uses more hidden layers and nodes) compared to the one we discussed in the previous section, all computations in the forward propagation step and backpropagation step are done in the same way (at each node) as discussed before.
Figure 8 shows the network when the input is the digit ‘5’.
A node which has a higher output value than others is represented by a brighter color. In the Input layer, the bright nodes are those which receive higher numerical pixel values as input. Notice how in the output layer, the only bright node corresponds to the digit 5 (it has an output probability of 1, which is higher than the other nine nodes which have an output probability of 0). This indicates that the MLP has correctly classified the input digit. I highly recommend playing around with this visualization and observing connections between nodes of different layers.
I have skipped important details of some of the concepts discussed in this post to facilitate understanding. I would recommend going through Part1, Part2, Part3 and Case Study from Stanford’s Neural Network tutorial for a thorough understanding of Multi Layer Perceptrons.
Let me know in the comments below if you have any questions or suggestions!
The functions currently included in the package are mentioned below:
There are two ways of installing xda:

1. Using devtools: the devtools package needs to be installed first (to install devtools, please follow the instructions here). Then, use the following commands to install xda:

    library(devtools)
    install_github("ujjwalkarn/xda")

2. Using githubinstall: install and load the githubinstall package, then install xda:

    install.packages("githubinstall")
    library(githubinstall)
    githubinstall("xda")
Read more about githubinstall here.
Update: See usage instructions and latest updates to the package here. The package is constantly under development and more functionalities will be added soon. Will also add this to CRAN in the coming days. Pull requests to add more functions are welcome!
See the full list on GitHub.
# Topics Covered
# 1. Reading data and Summary Statistics
# 2. Determining the Optimal Number of Clusters
# 3. Running Clustering Algorithm and Visualisations

## 1. Reading data and Summary Statistics
# change the working directory
setwd("C:\\Users\\ujjwal.karn\\Desktop\\Classification & Clustering")
mydata <- read.csv("data/kmeans_data.csv")
head(mydata)
str(mydata)
summary(mydata)
plot(mydata[c("Sepal.Length", "Sepal.Width")], main = "Raw Data")

# standardising the data
mydata <- scale(mydata)

## 2. Determining the Optimal Number of Clusters
# http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters/
# within-group sum of squares for k = 1 (all points in one cluster)
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
# wss for k = 1 to 25; look for an "elbow" in the plot
for (i in 1:25) { wss[i] <- sum(kmeans(mydata, centers = i)$withinss) }
plot(1:25, wss, type = "b", xlab = "No. of Clusters", ylab = "wss")
wss

## 3. Running Clustering Algorithm
# trying with 4 clusters
clus4 <- kmeans(mydata, centers = 4, nstart = 30)
clus4  # check between_SS / total_SS
# get cluster means
aggregate(mydata, by = list(clus4$cluster), FUN = mean)
# append cluster assignment
mydata <- data.frame(mydata, clus4$cluster)
# summary
groups <- data.frame(clus4$cluster)
table(groups)
plot(mydata[c("Sepal.Length", "Sepal.Width")], col = clus4$cluster)
# one colour per centre (the original used col = 1:3 for 4 centres)
points(clus4$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:4, pch = 8, cex = 2)

# trying with 3 clusters
clus3 <- kmeans(mydata, centers = 3, nstart = 20)
clus3
# get cluster means
aggregate(mydata, by = list(clus3$cluster), FUN = mean)
# append cluster assignment
mydata <- data.frame(mydata, clus3$cluster)
# summary
groups <- data.frame(clus3$cluster)
table(groups)
plot(mydata[c("Sepal.Length", "Sepal.Width")], col = clus3$cluster)
points(clus3$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex = 2)
# Topics Covered
# 1. Reading data and Summary Statistics
# 2. Outlier Detection
# 3. Missing Value Treatment
# 4. Correlation and VIF
# 5. Feature Selection Using IV
# 6. Creating Training and Validation Sets
# 7. Running the Logistic Model on Training Set
# 8. Evaluating Performance on Validation Set
#    a. ROC and AUC
#    b. Confusion Matrix
#    c. KS Statistic
# 9. Scoring the test data

## 1. Reading data and Summary Statistics
# change the working directory
setwd("C:\\Desktop\\Classification & Clustering")
train_data <- read.csv("data/train.csv")
test_data  <- read.csv("data/test.csv")

# Summary Statistics
library(Hmisc)
head(train_data); str(train_data); summary(train_data); describe(train_data)
head(test_data);  str(test_data);  summary(test_data);  describe(test_data)

# 2-way contingency tables
xtabs(~admit + prestige, data = train_data)

## 2. Outlier Detection
sapply(train_data[, 1:3], function(x)
  quantile(x, c(.01, .05, .25, .5, .75, .90, .95, .99, 1), na.rm = TRUE))
# gpa of 6.5 seems to be an outlier; cap gpa at 4
train_data$gpa[train_data$gpa > 4] <- 4

## 3. Missing Value Imputation
sapply(train_data, function(x) sum(is.na(x)))
train_data$gre[is.na(train_data$gre)] <- mean(train_data$gre, na.rm = TRUE)
# impute every column's missing values with the column mean
# (the original sapply call assigned inside the function and had no effect)
train_data2 <- train_data
train_data2[] <- lapply(train_data2, function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x })

## 4. Correlation and VIF
cor(train_data[, 1:3])
library(usdm)
vif(train_data[, 1:3])

## 5. Feature Selection Using Information Value
library(plyr)
library(sqldf)
library(rpart)
source("C:\\xyz.R")
# source the helper scripts (these provide iv.mult below)
file.sources <- list.files("others", full.names = TRUE)
sapply(file.sources, source, .GlobalEnv)
data <- train_data
data$admit <- factor(data$admit, levels = c("1", "0"))
levels(data$admit)
str(data)
iv.mult(data, y = "admit", vars = c("gre", "gpa", "prestige"), summary = "TRUE")

## 6. Create training and validation sets
set.seed(123)
smp_size <- floor(0.7 * nrow(train_data))
train_ind <- sample(seq_len(nrow(train_data)), size = smp_size)
training   <- train_data[train_ind, ]
validation <- train_data[-train_ind, ]

## 7. Running the Logistic Model on the Training set
?lm; ?describe; ?glm
# model formula: admit ~ gre + gpa + prestige
mylogit  <- glm(admit ~ gre + gpa + prestige, data = training, family = "binomial")
mylogit2 <- glm(admit ~ gpa + prestige, data = training, family = "binomial")
summary(mylogit2)  # see how prestige has been used as a dummy variable
confint(mylogit, level = .90)

# Calculating Concordance; see
# http://shashiasrblog.blogspot.in/2014/02/binary-logistic-regression-fast.html
fastConc <- function(model) {
  # get all actual observations and their fitted values into a frame
  fitted <- data.frame(cbind(model$y, model$fitted.values))
  colnames(fitted) <- c("respvar", "score")
  ones  <- fitted[fitted[, 1] == 1, ]  # subset only ones
  zeros <- fitted[fitted[, 1] == 0, ]  # subset only zeros
  # initialise all the values
  pairs_tested <- nrow(ones) * nrow(zeros)
  conc <- 0
  disc <- 0
  # compare every (one, zero) pair of fitted scores
  for (i in 1:nrow(ones)) {
    conc <- conc + sum(ones[i, "score"] > zeros[, "score"])
    disc <- disc + sum(ones[i, "score"] < zeros[, "score"])
  }
  # calculate concordance, discordance and ties
  concordance <- conc / pairs_tested
  discordance <- disc / pairs_tested
  ties_perc <- 1 - concordance - discordance
  list("Concordance" = concordance, "Discordance" = discordance,
       "Tied" = ties_perc, "Pairs" = pairs_tested)
}
fastConc(mylogit)

## 8. Check Performance on the Validation Set
val  <- predict(mylogit, validation, type = "response")
mydf <- cbind(validation, val)
mydf$response <- as.factor(ifelse(mydf$val > 0.5, 1, 0))
library(ROCR)
logit_scores <- prediction(predictions = mydf$val, labels = mydf$admit)

# a. Plot ROC curve
logit_perf <- performance(logit_scores, "tpr", "fpr")
plot(logit_perf, col = "darkblue", lwd = 2, xaxs = "i", yaxs = "i", tck = NA,
     main = "ROC Curve")
box()
abline(0, 1, lty = 300, col = "green")
grid(col = "aquamarine")

# Area under the curve
logit_auc <- performance(logit_scores, "auc")
as.numeric(logit_auc@y.values)  # AUC value

# b. Confusion matrix
library(caret)
confusionMatrix(mydf$response, mydf$admit)

# c. KS statistic
logit_ks <- max(logit_perf@y.values[[1]] - logit_perf@x.values[[1]])
logit_ks

# Lift chart
lift.obj <- performance(logit_scores, measure = "lift", x.measure = "rpp")
plot(lift.obj, main = "Lift Chart", xlab = "% Populations", ylab = "Lift", col = "blue")
abline(1, 0, col = "grey")

# Gains table
# install.packages("gains")
library(gains)
gains.cross <- gains(actual = mydf$admit, predicted = mydf$val, groups = 10)
print(gains.cross)

## 9. Scoring the Test Data using the model we just created
pred  <- predict(mylogit, test_data, type = "response")
final <- cbind(test_data, pred)
write.csv(final, "final_probs.csv")

## Reference material
# http://www.ats.ucla.edu/stat/r/dae/logit.htm
# http://www.unc.edu/courses/2010fall/ecol/563/001/notes/lecture21%20Rcode.html
# Caret Package: http://topepo.github.io/caret/
# http://www.r-bloggers.com/gini-index-and-lorenz-curve-with-r/