Machine Learning Introduction

Daisy Du – AACUW Digital Marketing Lead

Nick Stoner  – AACUW President

What Is Machine Learning?

The easiest way to understand machine learning is to take the name literally. Machine learning is a process in which computer systems use algorithms to learn from a training data set, recognize patterns within the data, and then apply that knowledge to data they have never seen before. It’s much like the process we humans use to learn every day: observing, learning and applying.

As noted above, machine learning algorithms need training data to learn, and the training data determines which type of learning algorithm we should use. There are two main types of machine learning algorithms: supervised learning and unsupervised learning.

Machine Learning Types

Supervised learning algorithms learn from a training data set that has both the input variables and the desired output variable. In other words, for every training example we already know the correct outcome.

For example, when we use historical data to train a model that predicts house prices, our training data contains input variables such as the year the house was built, its square footage and its number of bathrooms, together with the corresponding output variable: the price the house sold for in the past. The model learns the patterns that connect the two. The house price is our target variable, the value we want to predict once we have new input data.

Unsupervised learning algorithms work the other way around. They learn from a training data set that has only input variables and no output variable, so when we train the model we don’t know in advance what the outcome will be.

One common unsupervised learning technique is clustering, which sorts data points into groups. The groups we end up with are the model’s only output.

One example of clustering is how we categorize animals. Before the animals are categorized, we study their characteristics, such as whether they have horns, fins or wings. We then cluster them into groups based on the traits they share. For example, crows and parrots both land in the “birds” cluster because they have wings and can fly.

One thing to note here: the clusters carry no built-in meaning. The only information we get from a cluster is that the data points inside it are similar to each other. We don’t know what each cluster means, but we can assign meanings later based on what we observe about the data points grouped within it.
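To make the animal example concrete, here is a minimal sketch of one popular clustering algorithm, k-means. The post doesn’t say which clustering algorithm to use, so k-means is our assumption here, and the animals, the three yes/no traits (wings, fins, horns) and the function name kmeans are all made up for illustration.

```python
def kmeans(points, centers, iterations=10):
    """Tiny k-means: assign each point to its nearest center, then move the centers."""
    labels = []
    for _ in range(iterations):
        # Step 1: label every point with the index of its closest center.
        labels = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            labels.append(dists.index(min(dists)))
        # Step 2: move each center to the average of the points assigned to it.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # keep a center that attracted no points
        centers = new_centers
    return labels

# Hypothetical traits: (has_wings, has_fins, has_horns)
names = ["crow", "parrot", "salmon", "shark", "goat", "ox"]
points = [(1, 0, 0), (1, 0, 0), (0, 1, 0), (0, 1, 0), (0, 0, 1), (0, 0, 1)]

# Start from three of the animals as initial centers (an arbitrary choice).
labels = kmeans(points, centers=[points[0], points[2], points[4]])
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

Notice that the output is just cluster 0, 1 and 2. Nothing in the algorithm knows the word “birds”; that name is something we add afterwards by looking at which animals ended up together.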

Regression

One of the most straightforward and common machine learning algorithms is the regression model. Most of us first met its simplest form back in school: two variables and one line.

Consider the function y = 5x + 3. Once we have the value of x, we can get the value of y. The end product of a regression analysis looks exactly like this. A function! And this is how we make predictions: y is our output variable.

In the real world, the function will be much more complex than this one, but it follows the same idea: fit the best line through the points. How do we know it’s the best line? It is the line for which the total distance between the real data points and the line is smallest (in practice, usually the sum of the squared vertical distances), which means the estimation error is smallest. That is how we arrive at specific numbers like 5 and 3 above: we run an algorithm on the data with the general function y = ax + b and find the values of a and b that make the estimation error as small as possible.
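Here is a minimal sketch of how an algorithm can find a and b. It assumes ordinary least squares (minimizing the sum of squared vertical errors), which is the usual choice for a simple line; the function name fit_line and the sample data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the line y = a*x + b with the smallest sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the best line always passes through the point of averages.
    b = mean_y - a * mean_x
    return a, b

# Made-up training data that lies exactly on y = 5x + 3.
xs = [0, 1, 2, 3, 4]
ys = [5 * x + 3 for x in xs]

a, b = fit_line(xs, ys)
print("a =", a, "b =", b)

# Use the fitted function to predict y for an x we have not seen.
prediction = a * 10 + b
print("predicted y at x = 10:", prediction)
```

Because the sample points sit exactly on a line, the fit recovers a = 5 and b = 3. With noisy real-world data the points would scatter around the line and the same formulas would return the best compromise.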

There, you just learned the most important idea behind the regression model.

Random Forest

A random forest algorithm is most commonly used for supervised classification (although it can be used for regression as well; see random forest regressor). Before using random forest, we must understand its fundamental building block: the decision tree.

A decision tree in its simplest form describes a decision point and defines the possible outcomes depending on which choice is made. Let’s look at the following simple example, which is useful for those of us living in the Seattle, Washington area.

[Figure: decision tree for the “wear a raincoat?” example]

In the above tree, we see the root node, “wear a raincoat?”, then a split node at “rain?”, and finally leaf nodes such as “dry & needed raincoat”. This terminology is important for our random forest explanation. A decision tree can be pruned (trimmed back to fewer branches) to reduce overfitting, but pruning too aggressively can cost accuracy because the tree becomes too simple.

A random forest is made up of many decision trees (see “bagging”), and those trees are often much more complex than the example above. There are many powerful uses for random forest. Some real-world applications are determining which videos or articles to recommend to a user based on their viewing history and engagement, deciding where to go on vacation based on preference and location, and determining a patient’s illness based on medical history and current symptoms.
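A decision tree is really just a set of nested questions, so it can be written as nested if statements. The sketch below is our guess at the raincoat tree: only the “dry & needed raincoat” leaf is named in the text, so the other three leaf labels and the function name raincoat_outcome are hypothetical.

```python
def raincoat_outcome(wear_raincoat: bool, rain: bool) -> str:
    """Walk the tree from the root node to a leaf node."""
    # Root node: "wear a raincoat?"
    if wear_raincoat:
        # Split node: "rain?"
        if rain:
            return "dry & needed raincoat"  # leaf named in the post
        return "dry & didn't need raincoat"  # hypothetical leaf
    # Split node: "rain?" on the no-raincoat branch
    if rain:
        return "wet"  # hypothetical leaf
    return "dry"  # hypothetical leaf

for wear in (True, False):
    for rain in (True, False):
        print(f"raincoat={wear}, rain={rain} -> {raincoat_outcome(wear, rain)}")
```

Each path from the root to a leaf answers every question along the way, which is exactly what a trained tree does with a new data point.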

How does random forest work?

To create a random forest, you must define which features you want to analyze (the inputs) and which feature you want to predict (the target) from a training dataset. The algorithm then builds many decision trees. Each tree is trained on a random sample of the training rows, and at each split it considers only a random subset of the features, so every tree grows from its own root node (like “wear a raincoat?” above) and turns out slightly different. Each tree then makes its own prediction. For classification, the trees “vote” and the outcome with the most votes becomes the forest’s final prediction; for regression, the trees’ predictions are averaged.
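The random sampling and voting can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not how production libraries do it: each “tree” here is a one-split stump that looks at a single randomly chosen feature, whereas real random forests grow deeper trees and pick a random feature subset at every split. The weather data, feature meanings and function names are all made up.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature):
    """Fit a one-split tree: pick the threshold on `feature` with the fewest errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
        right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        # If nothing falls on the right side, reuse the left label.
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(lab != left_label for lab in left)
        errors += sum(lab != right_label for lab in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label

def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    forest = []
    for _ in range(n_trees):
        # Bagging: sample the training rows with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # Randomness in the features: each stump sees one random feature.
        feature = rng.randrange(len(rows[0]))
        forest.append(train_stump(sample_rows, sample_labels, feature))
    return forest

def predict(forest, row):
    """Classification: every tree votes and the most common answer wins."""
    votes = [
        left_label if row[feature] <= threshold else right_label
        for feature, threshold, left_label, right_label in forest
    ]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical training data: (humidity %, cloud cover %)
rows = [(90, 95), (85, 80), (80, 90), (95, 85), (30, 20), (40, 10), (35, 30), (20, 25)]
labels = ["raincoat"] * 4 + ["no raincoat"] * 4

forest = train_forest(rows, labels)
wet_day = predict(forest, (88, 92))
dry_day = predict(forest, (15, 5))
print("humid, cloudy day:", wet_day)
print("dry, clear day:", dry_day)
```

An individual stump can be wrong when its random sample happens to be unrepresentative, but it is unlikely that most of the stumps are wrong in the same way. That is the intuition behind letting the whole forest vote.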

From there, you run the random forest on your testing data, which the model did not see during training, to measure how accurate its predictions are.

The main benefit of random forest is strong predictive performance with relatively little tuning. Random forest is versatile in that it can be used for both classification and regression, and some implementations can handle missing values, so you may not have to clean your data as much. In addition, random forest is fairly robust to outliers, and because the final prediction combines many trees, the model has lower variance than a single decision tree.

The main drawback of random forests is that you can’t easily dive into the model they create, which is why they are often referred to as “black box” models. In addition, if you’re working with large datasets, the random forest algorithm can be quite computationally intensive.
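Measuring accuracy on testing data boils down to counting how often the model’s prediction matches the known answer. In this minimal sketch, humidity_rule is a hypothetical stand-in for a trained model, and the test rows and labels are made up.

```python
def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction equals the true label."""
    correct = sum(model(row) == label for row, label in zip(rows, labels))
    return correct / len(labels)

def humidity_rule(row):
    # Hypothetical stand-in for a trained model: predict from humidity alone.
    return "raincoat" if row[0] > 60 else "no raincoat"

# Held-out test data the "model" has never seen: (humidity %, cloud cover %)
test_rows = [(92, 88), (25, 15), (70, 40), (55, 90)]
test_labels = ["raincoat", "no raincoat", "no raincoat", "raincoat"]

score = accuracy(humidity_rule, test_rows, test_labels)
print("test accuracy:", score)
```

Here the rule gets two of the four test days right, for an accuracy of 0.5. A score that is much lower on the testing data than on the training data is a common sign of overfitting.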

Overall, random forest is an excellent choice for both classification and regression prediction, but you must keep in mind the obscurity of how the results are derived, and always exercise skepticism.

Machine Learning Applications

The applications of machine learning can be as broad as our imagination, from forecasting and sentiment analysis to computer vision and artificial intelligence. Machine learning is shaping the future of the tech industry, and more and more companies are putting their user data to work to stay competitive in the current marketplace.

Machine learning can tell you which users are likely to unsubscribe next month, which emails you receive are likely to be spam, which group of customers is writing negative reviews of your restaurant, and which movies you should recommend to different audiences. It is changing how we interact with each other, how companies target and sell their products to us and how machines detect human emotions.

Machine learning is woven into nearly every aspect of our lives. You will see it, once you know it.

Reading List

  1. A visual introduction to machine learning

http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

  2. Introduction to Machine Learning

https://towardsdatascience.com/introduction-to-machine-learning-for-beginners-eed6024fdb08

Reference

https://theanlim.rbind.io/post/clustering-k-means-k-means-and-gganimate/

https://towardsdatascience.com/understanding-random-forest-58381e0602d2

The Power of Community

Funny Boy Thomas, VP of AACUW

Hello reader,

My name is Thomas Khoo. I started Applied Analytics with Nick not only for his reasons stated before but also because of the power data has to bridge things together. After reading the articles posted on our Facebook page, it is clear that data plays an underlying role in many things. And the best thing: anyone can learn it.

Resources for learning visualization software, like Tableau and Power BI, and programming languages, such as Python and SQL, are available on our website (aacuw.org). Under the Resources tab you’ll find links to material that can teach you how to use these languages and programs. Applied Analytics gives individuals the opportunity to equip themselves with the data analytics and machine learning skills employers are seeking.

You may ask at this point, what does that have to do with community? The community aspects are derived from the accessibility and availability. There are no financial barriers to consider, no discrimination or bias, no ambiguity about what you know. You have the skills they want. Period. Not many disciplines bear all three.

On top of the benefit to individuals, many kinds of analytics require a specific understanding of the data’s nature. Data takes on any form. On one side, data can predict how a protein will develop without anyone having to wait months; on the other, data can predict when a woman is pregnant. Data gives you incredible power. Understanding data’s cohesive nature will create synergy that will elevate everyone to higher levels and brighter skies.

Join us in our endeavor to lift the community!

Our Blog’s Purpose

“It is a capital mistake to theorize before one has data.”

Sherlock Holmes

Fictional character quote aside, data is utterly important, both then and now. Those with the data and the knowledge to interpret it have a significant advantage over those without. Today’s top companies, top executives, and top job candidates understand that up-to-date data skills are required to excel today and to survive in the future.

This blog is an extension of the content curated and created by the Applied Analytics Club at UW (AACUW). AACUW was founded in the fall of 2018 to offer the University of Washington community broad exposure to data science and analytics. The purpose of this blog is to extend our mission further into the online space to make data education even more accessible. We will post original content including step-by-step tutorials, club updates, and opinion posts discussing data through university students’ lenses. We will also re-post significant content from our other online outlets such as our YouTube page, LinkedIn page, Instagram page, and Facebook page (all content including event details is posted on FB… we recommend you follow/like us).

Finally, we encourage data-passionate individuals to contribute by becoming writers on this blog! Contact uwappliedanalytics@gmail.com if you are interested in contributing; we welcome students and professionals alike.

Let’s build an online data community where students and professionals can collaborate and share ideas together!


Nick Stoner, President of AACUW