Recap 11/20/2019: Special Topic: Random Forest Using Python

For this week’s special topic, we demonstrated how to build a random forest model in Python with real-world income data, and walked through an example of predicting a person’s income range from given information.

Covered in Special Topic:

  • Exploratory analysis in Python
  • Lambda function in Python
  • Python DataFrame
  • One-Hot Encoding vs Label Encoding
  • Training and Testing
  • Model Training
  • Model Prediction
  • Case Study of predicting one individual’s income range
  • Confusion Matrix
  • Differences between accuracy, precision, and recall

Special Topic Notebook

Recap 11/13/2019: Predictive Analytics Introduction Workshop

For this week’s workshop, we covered the basics of predictive analytics by introducing the concepts of exploratory analysis, linear regression, decision trees, and random forest. Students learned how to explore a dataset in R and build a linear model to make predictions. Please see below for a summary of the workshop and a link to the slides!

Covered in Workshop:

  • Went over basic R syntax & dplyr package
  • Real-world examples of predictive analytics
  • Importance & methods of cleaning data
  • Introduction of linear regression
  • Evaluation & interpretation of linear regression
  • Introduction of classification
    • Decision Trees
    • Random Forest
  • Book and class recommendations for studying data science

Workshop Slide

Machine Learning Introduction

Written by:

Daisy Du – AACUW Digital Marketing Lead, Nick Stoner – AACUW President

What Is Machine Learning?

The easiest way to understand machine learning is to take the name literally. Machine learning is a process in which computer systems use algorithms to learn from a training data set, recognize patterns within the data, and then apply that knowledge to data they have never seen before. It’s the same process we humans use to learn every day: observing, learning, and applying.

As we discussed above, machine learning algorithms need training data to learn. The training data also determines which type of learning algorithm we should use. There are two main types of machine learning algorithms: supervised learning and unsupervised learning.

Machine Learning Types

Supervised learning algorithms learn from a training data set that has both the input and desired output variables. In other words, we know exactly what our outcome is.

For example, when we use historical data to train a model that predicts house prices, our training data contains both input variables, such as the year the house was built, its square footage, and the number of bathrooms, and the corresponding output variable, the past sale price, from which the model learns the patterns. The house price is our target variable, the value to be predicted once we have new input data.

Unsupervised learning algorithms are the exact opposite. They learn from a training data set that has only input variables and no output variable. When we train the model, we don’t know what our outcome will be.

One family of unsupervised learning algorithms is clustering, which sorts the data points into groups. The groups we end up with are the only output of the model.

One example of clustering is how we categorize animals. Before the animals are categorized, we study their characteristics, such as whether they have horns, fins, or wings. We then cluster them into groups based on the traits they share. For example, crows and parrots both fall into the bird cluster because they have wings and can fly.

One thing to note here is: there is no underlying meaning of the clusters. The only information we get from clusters is that the data points are similar to each other within the clusters. We don’t know what each cluster means. But we can later assign meanings to the clusters based on our observation of the data points grouped within each cluster.
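
The animal example can be sketched with a small k-means routine, one common clustering algorithm. This is only a minimal illustration in plain Python: the animals, their 0/1 trait values, the choice of three clusters, and the helper names (`squared_distance`, `k_means`) are all assumptions made for the example, not material from the workshop.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length feature lists."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=10):
    """Return one cluster label per point using a basic k-means loop."""
    # Deterministic start: take the first point, then repeatedly add the
    # point that is farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(farthest)

    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Hypothetical traits: [has_wings, has_fins, has_horns]
names = ["crow", "parrot", "salmon", "shark", "goat", "deer"]
features = [
    [1, 0, 0], [1, 0, 0],
    [0, 1, 0], [0, 1, 0],
    [0, 0, 1], [0, 0, 1],
]

labels = k_means(features, k=3)

# Group animal names by cluster label. The labels are just numbers;
# the algorithm never learns the word "bird".
groups = {}
for name, label in zip(names, labels):
    groups.setdefault(label, []).append(name)
print(groups)
```

The output only says which animals ended up together. Calling one group "birds" is a meaning we assign afterwards by looking at its members.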

Regression

One of the most straightforward and common machine learning algorithms is the regression model. We all saw the basic idea as kids in primary school: it involves two variables and one line.

Consider the function y = 5x + 3. Once we have the value of x, we can compute the value of y. The end product of a regression analysis looks exactly like this: a function! And this is how we make predictions. Here y is our output variable.

In the real world, the function will be much more complex than this one, but it follows the same idea: fit the best line through the points. How do we know it’s the best line? It is the line for which the total distance between the real data points and the line is smallest (most commonly measured as the sum of squared vertical distances), which means the estimation error is smallest. That is how we get the specific numbers 5 and 3 above: we run an algorithm on the data and the function y = ax + b to find the values of a and b that make the estimation error smallest.
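
As a rough sketch of how a and b can be found, the snippet below uses the standard closed-form least-squares solution for a single input variable. The data points are hypothetical and were generated to lie exactly on y = 5x + 3, so the fit should recover a close to 5 and b close to 3. The function name `fit_line` is made up for this example.

```python
def fit_line(xs, ys):
    """Return (a, b) that minimize the sum of squared errors for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data lying exactly on y = 5x + 3.
xs = [0, 1, 2, 3, 4]
ys = [5 * x + 3 for x in xs]

a, b = fit_line(xs, ys)
prediction = a * 10 + b  # predict y for a new input x = 10
print(a, b, prediction)
```

With noisy real data the points would not sit exactly on a line, and the same formulas would return the line with the smallest total squared error.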

There, you just learned the most important idea behind the regression model.

Random Forest

A random forest algorithm is most commonly used for supervised classification (although random forest can be used for regression as well—see random forest regressor). Before using random forest, we must understand its fundamental structure: decision trees.

A decision tree in its simplest form describes a decision point and defines possible outcomes depending on what choice is made. Let’s look at the following simple example, which is useful for those of us living in the Seattle, Washington area.

Picture1.png

In the above tree, we see the root node, “wear a raincoat?”, then a split node at “rain?”, and finally leaf nodes such as “dry & needed raincoat”. This terminology is important for our random forest explanation. A decision tree can be pruned to reduce overfitting, which often improves its accuracy on new data, but pruning too aggressively can leave the tree too simple to capture real patterns (underfitting).
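
To make the root, split, and leaf terminology concrete, here is a small sketch of the raincoat tree as a nested dictionary. Only the root question, the “rain?” split, and the leaf “dry & needed raincoat” come from the example above. The other leaf labels and the helper `walk` are assumptions added for illustration.

```python
# Inner nodes are dicts with a question and a branch per answer.
# Leaves are plain strings.
tree = {
    "question": "wear a raincoat?",          # root node
    "yes": {
        "question": "rain?",                 # split node
        "yes": "dry & needed raincoat",      # leaf node (from the example)
        "no": "dry & did not need raincoat", # leaf node (assumed label)
    },
    "no": {
        "question": "rain?",                 # split node
        "yes": "wet & needed raincoat",      # leaf node (assumed label)
        "no": "dry & did not need raincoat", # leaf node (assumed label)
    },
}


def walk(node, answers):
    """Follow yes/no answers from the root until a leaf (a string) is reached."""
    while isinstance(node, dict):
        node = node[answers[node["question"]]]
    return node


outcome = walk(tree, {"wear a raincoat?": "yes", "rain?": "yes"})
print(outcome)
```

A tree learned from data has the same shape, except that the questions and their order are chosen by the training algorithm instead of being written by hand.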

A random forest is made up of many decision trees (see “bagging”), and those trees are often much more complex than the example above. There are many powerful uses for random forest. Some real-world applications are determining which videos or articles to recommend to a user based on their past viewing history and engagement, deciding where to go on vacation based on preference and location, and determining a patient’s illness based on medical history and current symptoms.

How does random forest work?

To create a random forest, you must define which features you want to analyze and which feature you want to predict from a training dataset. The algorithm then builds many decision trees, each one different because of two sources of randomness: every tree is trained on a random sample of the training rows drawn with replacement (“bagging”), and at each split the tree may only choose from a random subset of the features we selected. To make a prediction, every tree produces its own outcome and the forest combines them: for classification, the trees “vote” and the most common outcome wins; for regression, the trees’ outputs are averaged. That combined result is the final prediction of the random forest.
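
The sketch below illustrates bagging, random feature selection, and voting in plain Python. It is deliberately simplified: each “tree” is a one-split stump rather than a full decision tree, and the income-style rows, the feature order, and all function names are hypothetical. A real project would more likely use a library implementation.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, feature_indices):
    """Find the best single (feature, threshold) split among the given features."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in feature_indices:
        values = sorted({row[f] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label, right_label = majority(left), majority(right)
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no split possible: always predict the majority label
        label = majority(labels)
        return (feature_indices[0], float("inf"), label, label)
    return best[1:]


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bagging: sample row indices with replacement.
        sample = [rng.randrange(len(rows)) for _ in rows]
        # Random feature subset available to this stump's single split.
        features = rng.sample(range(len(rows[0])), n_features)
        forest.append(train_stump([rows[i] for i in sample],
                                  [labels[i] for i in sample], features))
    return forest


def predict_forest(forest, row):
    """Classification: every stump votes and the most common label wins."""
    return majority([predict_stump(stump, row) for stump in forest])


# Hypothetical rows: [age, years_of_education, hours_per_week]
rows = [[22, 10, 20], [25, 12, 30], [28, 11, 25], [30, 12, 35],
        [40, 16, 45], [45, 18, 50], [50, 16, 55], [55, 20, 60]]
labels = ["<=50K"] * 4 + [">50K"] * 4

forest = train_forest(rows, labels)
high = predict_forest(forest, [52, 18, 58])
low = predict_forest(forest, [24, 11, 22])
print(high, low)
```

For a regression forest, the voting step in `predict_forest` would be replaced by an average of the trees’ numeric outputs.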

From there, you run the random forest on your testing data, which the model did not see during training, to measure its accuracy and compare it with the accuracy on the training set.
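
Accuracy is only one way to score the test results. The sketch below counts the four cells of a confusion matrix and derives accuracy, precision, and recall from them. The actual and predicted labels are made up for illustration, with “>50K” treated as the positive class, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(actual, predicted, positive):
    """Return (tp, fp, fn, tn) for one positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, really positive
        elif p == positive:
            fp += 1  # predicted positive, really negative
        elif a == positive:
            fn += 1  # predicted negative, really positive
        else:
            tn += 1  # predicted negative, really negative
    return tp, fp, fn, tn


# Hypothetical test-set labels and model predictions.
actual = [">50K"] * 4 + ["<=50K"] * 6
predicted = [">50K", ">50K", "<=50K", "<=50K",
             ">50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K"]

tp, fp, fn, tn = confusion_counts(actual, predicted, positive=">50K")

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # of predicted ">50K", how many really are
recall = tp / (tp + fn)                     # of real ">50K", how many were found
print(tp, fp, fn, tn, accuracy, precision, recall)
```

In this made-up case the model is right 70% of the time overall but finds only half of the people who actually earn more than 50K, which is why accuracy alone can be misleading.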

The benefits of random forest are that the algorithm offers strong predictive performance with relatively little tuning. Random forest is versatile in that it can be used for both classification and regression. Some implementations can also handle null data values, which reduces the amount of data cleaning, although many libraries still require missing values to be filled in first. In addition, random forest is relatively robust to outliers, and because the forest averages many low-bias trees, it keeps their low bias while reducing variance.

A drawback of random forests is that it is hard to inspect how the model reaches its predictions, which is why they are often referred to as “black box” models. In addition, if you’re working with large datasets, the random forest algorithm can be quite computationally intensive.

Overall, random forest is an excellent choice for both classification and regression prediction, but you must keep in mind the obscurity of how the results are derived, and always exercise skepticism.

Machine Learning Applications

The applications of machine learning can be as broad as our imagination, from future prediction and sentiment analysis to computer vision and artificial intelligence. Machine learning is shaping the future of the tech industry, and more and more companies make use of their user data to stand a chance in the current marketplace.

Machine learning tells you which users will unsubscribe next month, which emails you receive are more likely to be spam, which group of customers is writing negative reviews for your restaurant, and which movie you should recommend to different audiences. It is changing how we interact with each other, how companies target and sell their products to us, and how machines detect human emotions.

Machine learning is integrated into every aspect of our lives. You will see it once you know it.

Reading List

  1. A visual introduction to machine learning

http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

  2. Introduction to Machine Learning

https://towardsdatascience.com/introduction-to-machine-learning-for-beginners-eed6024fdb08

Reference

https://theanlim.rbind.io/post/clustering-k-means-k-means-and-gganimate/

https://towardsdatascience.com/understanding-random-forest-58381e0602d2