Recap 2/26/2020: Accounting & Finance Analytics Special Topic

This week, we covered a wide range of finance and accounting topics including:

  • Foundational Accounting Principles
  • Vertical Statement/“Common-size” Analysis
  • Fraud Detection
  • Importing and Analyzing Stock Data in RStudio
  • Single Stock & Portfolio Return Calculations in R
  • Technical vs. Fundamental Analysis
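As a rough illustration of two of the calculations listed above (common-size analysis and return calculations), here is a minimal Python sketch. The workshop itself used R, all numbers are made up, and the function names are hypothetical rather than taken from the workshop files.

```python
def common_size(line_items, base):
    """Vertical analysis: express each line item as a fraction of a base figure."""
    return {name: value / base for name, value in line_items.items()}


def simple_returns(prices):
    """Period-over-period simple returns: (P_t - P_{t-1}) / P_{t-1}."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


def portfolio_return(weights, asset_returns):
    """Weighted average of single-period asset returns (weights must sum to 1)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * r for w, r in zip(weights, asset_returns))


# Made-up income statement, with revenue as the base figure.
income_statement = {"revenue": 1000.0, "cost_of_goods_sold": 600.0, "net_income": 100.0}
shares = common_size(income_statement, income_statement["revenue"])

# Made-up closing prices for a single stock.
prices = [100.0, 102.0, 99.96, 104.958]
daily = simple_returns(prices)

# Made-up single-period returns for a two-stock portfolio weighted 60/40.
port = portfolio_return([0.6, 0.4], [0.05, -0.02])
```

Here `shares` shows each line item as a share of revenue, `daily` holds one return per price change, and `port` is the weighted portfolio return.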

We could spend hours covering each of these topics in depth. Please email us if you’d be interested in that.

For those who couldn’t make it, below are the slides and files we went over. The R markdown file has descriptions for each code block so you can follow along.

Slides, R Markdown File, Financial Statement Analysis (DAL)

Content Based Filtering in Recommendation Systems Recap

Content-based filtering is one of the most common recommendation approaches. It recommends items based on what a user currently likes or uses. For example, if a user likes the movie Frozen, content-based filtering will find movies similar to Frozen according to characteristics such as category, producers, actors, and length.

“The steps in recommending products or contents to the user in content based filtering are as follows:

  • Identify the factors which describe and differentiate the products and the factors which might influence whether a user would buy the product or not,
  • Represent all the products in terms of those factors, descriptors or attributes,
  • Create a tuple or number vector for each product that represents the strength of each factors for the product,
  • Start to look at the users and their histories to create a user profile based on their history. It will have the same number of factors and their strength would indicate how much influenced the user is towards that factor,
  • Recommend the user those products that are nearest to them in terms of those factors.”
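The steps quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the factor names, the item titles other than Frozen, and every factor strength are made up, and cosine similarity is one common choice for "nearest", not necessarily the one the article uses.

```python
import math

# Hypothetical factor vectors: [animation, musical, action] strengths per item.
items = {
    "Frozen": [1.0, 1.0, 0.0],
    "Animated Musical B": [1.0, 1.0, 0.2],
    "Action Film C": [0.0, 0.0, 1.0],
    "Animated Comedy D": [1.0, 0.2, 0.3],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profile(liked_titles):
    """User profile: the average factor vector of the items the user liked."""
    vectors = [items[title] for title in liked_titles]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def recommend(liked_titles, top_n=2):
    """Rank the items the user has not liked yet by similarity to the profile."""
    profile = build_profile(liked_titles)
    candidates = [title for title in items if title not in liked_titles]
    ranked = sorted(
        candidates,
        key=lambda title: cosine_similarity(profile, items[title]),
        reverse=True,
    )
    return ranked[:top_n]


recommendations = recommend(["Frozen"])
```

With these made-up vectors, the items that share the most factors with Frozen rank first and the action film ranks last.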

For more, please refer to the original article below:

Article Link: https://medium.com/@rabinpoudyal1995/content-based-filtering-in-recommendation-systems-8397a52025f0

Recap: Guest Speakers from Symetra

This week we welcomed Chris, Denise, Jake, and Jennie from Symetra to come speak about their company and their data analytics division! We learned about some of the projects they work on, the challenges they face, and the best things about working at Symetra.

Symetra is looking for interns this summer! You can learn more and apply here:

Symetra Internships

Recap 1/15/2020: Basics of Data Science Workshop

For this week’s workshop, we went over the data science realm in general and talked about various tools leveraged by data scientists and analysts. Please see below for a summary of the workshop and link to the slides!

Covered in Workshop:

  • Debuted a sleek new look for our slides and logo!
  • Discussed our club’s purpose
  • Overview of various aggregated resources
  • The data science profession and its toolset
  • Overview of data visualization
  • Shared student projects (links on slides)
  • Discussed machine setup steps
  • Went over basic R syntax & dplyr package

Workshop Slides

Project Recap: BCG GAMMA

The BCG GAMMA Case Project was our biggest project as a club to date. 30 participants (six teams of five) gathered at BCG’s Seattle office on October 30th for the case kickoff. The case focused on a real-life challenge BCG consulted on in the past, and teams were tasked with building random forest models to determine factors affecting customer churn. Over the three weeks following the kickoff, teams worked to develop models, primarily using R and Python. On November 21st, the teams presented their recommendations to BCG consultants.

The case was a textbook example of fulfilling our club’s mission to bridge the gap between the traditionally technical and nontechnical disciplines. We accomplished this by balancing each team with both business/econ students and info/CS/data science students, and by pairing graduate and undergraduate students. The outcome? Everyone learned something new about their fellow Huskies!

We couldn’t have been more impressed by our peers at the University of Washington, who presented some amazing case projects to BCG GAMMA last week! Thank you to Allen Chen, Spencer Barnes, and Annie Lai for judging the presentations. It was a blast to participate! A major shout out to Nam Pho for the initial connection and for mentoring club leadership throughout the process! A wonderful learning experience for all involved.

Recap 11/20/2019: Special Topic: Random Forest Using Python

For this week’s special topic, we demonstrated how to build a random forest model in Python using real-world income data, and walked through an example of predicting a person’s income range from given information.

Covered in Special Topic:

  • Exploratory analysis in Python
  • Lambda functions in Python
  • Python DataFrame
  • One-Hot Encoding vs Label Encoding
  • Training and Testing
  • Model Training
  • Model Prediction
  • Case Study of predicting one individual’s income range
  • Confusion Matrix
  • Differences between accuracy, precision, and recall
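To make the last two items concrete, here is a small sketch of how accuracy, precision, and recall fall out of a confusion matrix. The labels and predictions are made up for illustration and are not the workshop's data; the income-range labels are assumptions.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one class treated as 'positive'."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Made-up income ranges for eight people, and a model's made-up predictions.
actual = [">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K"]
predicted = [">50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K"]

tp, fp, fn, tn = confusion_counts(actual, predicted, positive=">50K")

accuracy = (tp + tn) / len(actual)  # share of all predictions that were right
precision = tp / (tp + fp)          # of those predicted >50K, how many really were
recall = tp / (tp + fn)             # of those really >50K, how many were found
```

Accuracy looks at every prediction, while precision and recall focus only on the positive class, which is why they can tell different stories when the classes are unbalanced.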

Special Topic Notebook

Recap 11/13/2019: Predictive Analytics Introduction Workshop

For this week’s workshop, we covered the basics of predictive analytics by introducing the concepts of exploratory analysis, linear regression, decision trees, and random forests. Students learned how to explore a dataset in R and build a linear model to make predictions. Please see below for a summary of the workshop and link to the slides!

Covered in Workshop:

  • Went over basic R syntax & dplyr package
  • Real-world examples of predictive analytics
  • Importance & methods of cleaning data
  • Introduction of linear regression
  • Evaluation & interpretation of linear regression
  • Introduction of classification
    • Decision Trees
    • Random Forest
  • Book and class recommendations for studying data science

Workshop Slides

Machine Learning Introduction

Written by:

Daisy Du – AACUW Digital Marketing Lead, Nick Stoner – AACUW President

What Is Machine Learning?

The easiest way to understand it is to take the name literally. Machine learning is a process in which computer systems use algorithms to learn from a training data set, recognize patterns within the data, and then apply this knowledge to data they have never seen before. It’s the same process we humans use to learn every day: observing, learning, and applying.

As discussed above, machine learning algorithms need training data to learn, and the training data determines which type of learning algorithm we should use. There are two main types of machine learning algorithms: supervised learning and unsupervised learning.

Machine Learning Types

Supervised learning algorithms learn from a training data set that has both the input and desired output variables. In other words, we know exactly what our outcome is.

For example, when we use historical data to train a model to predict house prices, our training data contains both input variables, such as the year the house was built, its square footage, and the number of bathrooms, and the corresponding output variable, the past sale price, from which the model learns the patterns. The house price here is our target variable, to be predicted once we have new input data.

Unsupervised learning algorithms are the exact opposite. They learn from a training data set that doesn’t have the output variables, only the input variables. When we train the model, we don’t know what our outcome will be. There is no output variable in the training data set.

One type of unsupervised learning is clustering, which sorts the data points into groups. The groups we end up with are the only output of the model.

One example of clustering is how we categorize animals. Before the animals are categorized, we study their characteristics, such as whether they have horns, fins, or wings. We then cluster them into groups based on the traits they share. For example, crows and parrots both fall into the bird cluster because they have wings and can fly.

One thing to note here is: there is no underlying meaning of the clusters. The only information we get from clusters is that the data points are similar to each other within the clusters. We don’t know what each cluster means. But we can later assign meanings to the clusters based on our observation of the data points grouped within each cluster.
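As a sketch of how a clustering algorithm arrives at such groups, here is a small pure-Python version of k-means, one common clustering algorithm (an assumption, since the text above does not name a specific one). Each point stands for an animal described by two made-up numeric traits, and the starting centroids are simply the first two points.

```python
def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, centroids, max_iterations=20):
    """Alternate between assigning points to centroids and re-centering them."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its previous centroid
        if updated == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


# Made-up trait scores for six animals; no labels are given to the algorithm.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, centroids=points[:2])
```

The returned labels are just cluster numbers. As noted above, it is up to us to look at what landed in each cluster and decide what the group means.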

Regression

One of the most straightforward and common machine learning algorithms is the regression model. We all saw the idea behind regression as kids in primary school. It involves two variables and one line.

Consider the function y = 5x + 3. Once we have the value of x, we can compute the value of y. The end product of regression analysis looks exactly like this: a function! And this is how we make predictions. Here, y is our output variable.

In the real world, the function will be much more complex than this one, but it follows the same idea: fit the best line through the points. How do we know it’s the best line? It is the line for which the total distance from the real data points to the line (typically measured as the sum of squared vertical distances) is the smallest, which means the estimation error is the smallest. That’s how we get the specific numbers 5 and 3 above: by running an algorithm on the data and the function y = ax + b to find the values of a and b that make the estimation error smallest.
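Here is a minimal sketch of that fitting step, assuming "best" means ordinary least squares (the smallest sum of squared vertical distances). The data points are generated from y = 5x + 3 so that the fit recovers a = 5 and b = 3; real data would be noisy and the fitted values only approximate.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a*x + b for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the fitted line always passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


# Training data that lies exactly on y = 5x + 3.
xs = [0, 1, 2, 3, 4]
ys = [5 * x + 3 for x in xs]

a, b = fit_line(xs, ys)

# Prediction for a new input the model has not seen.
prediction = a * 10 + b
```

Once a and b are known, predicting is just plugging a new x into the function.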

There, you just learned the most important idea behind the regression model.

Random Forest

A random forest algorithm is most commonly used for supervised classification (although random forest can be used for regression as well—see random forest regressor). Before using random forest, we must understand its fundamental structure: decision trees.

A decision tree in its simplest form describes a decision point and defines possible outcomes depending on what choice is made. Let’s look at the following simple example, which is useful for those of us living in the Seattle, Washington area.

[Figure: decision tree for deciding whether to wear a raincoat]

In the above tree, we see the root node, “wear a raincoat?”, then a split node at “rain?”, and finally leaf nodes such as “dry & needed raincoat”. This terminology is important for our random forest explanation. A decision tree can be pruned to reduce overfitting, which often improves its accuracy on new data, but at the cost of possibly underfitting the model if too much is removed.
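A decision tree like this one can also be read as nested if/else checks, where each check is a node and each returned value is a leaf. The sketch below is a hypothetical reconstruction: only the “dry & needed raincoat” leaf label comes from the text, and the other leaf labels are assumptions.

```python
def raincoat_outcome(wear_raincoat, rain):
    """Walk the tree from the root node to a leaf and return the leaf's label."""
    if wear_raincoat:  # root node: "wear a raincoat?"
        if rain:       # split node: "rain?"
            return "dry & needed raincoat"
        return "dry & did not need raincoat"  # hypothetical leaf label
    if rain:           # the same split on the other branch
        return "wet"                          # hypothetical leaf label
    return "dry & no raincoat needed"         # hypothetical leaf label


outcome = raincoat_outcome(wear_raincoat=True, rain=True)
```

A learned decision tree has the same shape, except that the algorithm chooses which question to ask at each node from the training data.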

A random forest is made up of many decision trees (see “bagging”) and is often much more complex than the tree example above. There are many powerful uses for random forest. Some real-world applications are determining which videos or articles to recommend to a user based on their past viewing history and engagement, deciding where to go on vacation based on preference and location, and determining a patient’s illness based on medical history and current symptoms.

How does random forest work?

To create a random forest, you first define which features you want to use as inputs and which feature you want to predict from a training dataset. The algorithm then builds many decision trees, each starting from its own root node (like “Wear a raincoat” above). Each tree is trained on a random sample of the training rows, drawn with replacement, and each split considers only a random subset of the selected features, so the trees differ from one another. To make a prediction, every tree produces its own outcome. For classification, the trees “vote” and the most common outcome becomes the forest’s final prediction. For regression, the trees’ outcomes are averaged.
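Two of the ingredients described above, random sampling of the training rows and combining the trees' outcomes, can be sketched with the standard library alone. This is not a full random forest: the individual tree outputs below are made-up values standing in for trained trees, and the function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: some repeat, some are left out."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Classification: the class predicted by the most trees wins."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Regression: the forest's prediction is the mean of the trees' predictions."""
    return sum(predictions) / len(predictions)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-in for ten training rows
sample = bootstrap_sample(rows, rng)  # the rows one tree would be trained on

# Made-up outputs from five classification trees for a single new observation.
tree_votes = ["raincoat", "no raincoat", "raincoat", "raincoat", "no raincoat"]
forest_class = majority_vote(tree_votes)

# Made-up outputs from three regression trees for a single new observation.
tree_estimates = [310.0, 295.0, 305.0]
forest_value = average(tree_estimates)
```

Because every tree sees a slightly different sample, their individual mistakes tend to differ, and combining them smooths those mistakes out.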

From there, you run the random forest on your testing data, which the model did not see during training, to measure how accurately it predicts new data.

The benefits of random forest are that it offers strong predictive performance with limited need for tuning. Random forest is versatile in that it can be used for both classification and regression. Some implementations can handle null data values, so you may not have to clean your data as much, although others require missing values to be filled in first. In addition, random forest is fairly robust to outliers, and because the trees’ predictions are averaged, the model has lower variance than a single decision tree.

Drawbacks to random forests are that they are hard to interpret: you can’t easily dive into the model the algorithm creates, which is why they are often referred to as “black box” models. In addition, if you’re working with large datasets, the random forest algorithm can be quite computationally intensive.

Overall, random forest is an excellent choice for both classification and regression prediction, but you must keep in mind the obscurity of how the results are derived, and always exercise skepticism.

Machine Learning Applications

The applications of machine learning can be as broad as our imagination, from forecasting and sentiment analysis to computer vision and artificial intelligence. Machine learning is shaping the future of the tech industry, and more and more companies make use of their user data to stay competitive in the current marketplace.

Machine learning tells you which users will unsubscribe next month, which emails you receive are likely to be spam, which groups of customers are writing negative reviews for your restaurant, and which movies you should recommend to different audiences. It is changing how we interact with each other, how companies target and sell their products to us, and how machines detect human emotions.

Machine learning’s applications are integrated into every aspect of our lives. You will see it once you know it.

Reading List

  1. A visual introduction to machine learning

http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

  2. Introduction to Machine Learning

https://towardsdatascience.com/introduction-to-machine-learning-for-beginners-eed6024fdb08

Reference

https://theanlim.rbind.io/post/clustering-k-means-k-means-and-gganimate/

https://towardsdatascience.com/understanding-random-forest-58381e0602d2

Recap 11/6/2019: Tableau Workshop

Tableau is a popular request from our members, so we try to hold at least one workshop using Tableau per quarter. This quarter, we introduced the concepts of Tableau using crash data. We collaborated with UW Actuarial Club for this workshop! Please see below for a summary of this workshop and download links for the slides.

Covered in Workshop:

  • Installing Tableau for Free
  • Loading Data
  • Creating & Manipulating Basic Visualizations
  • Publishing Dashboards
  • Map Demo
  • Demo by UW Actuarial Club

Workshop Slides

Recap 10/23/2019: SQL Workshop

In this fall quarter workshop, we introduced SQL and some of its main syntax, as well as how to download MySQL. About 30 members attended, but for those who missed it, here is a brief overview of what we discussed. For your reference, we have also included links to MySQL and the complete slides from the workshop.

Covered in Workshop:

  • Database Management System Overview
    • Hierarchical vs. relational
  • Syntax (Insert, Update, Select, etc.)
  • Joining Techniques
  • Keys, Composite Keys, Foreign Keys
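For anyone who wants to try the syntax listed above without installing anything, here is a small sketch that runs SQL from Python. It uses SQLite (through Python's built-in sqlite3 module) as a stand-in for MySQL, so minor syntax details may differ from what was shown in the workshop. The table names, column names, and rows are all made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

conn.executescript("""
CREATE TABLE members (
    member_id INTEGER PRIMARY KEY,        -- key: uniquely identifies a member
    name      TEXT NOT NULL
);
CREATE TABLE attendance (
    member_id INTEGER NOT NULL,
    workshop  TEXT NOT NULL,
    PRIMARY KEY (member_id, workshop),    -- composite key: one row per member per workshop
    FOREIGN KEY (member_id) REFERENCES members (member_id)  -- foreign key
);
""")

# INSERT: add rows to both tables.
conn.executemany(
    "INSERT INTO members (member_id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO attendance (member_id, workshop) VALUES (?, ?)",
    [(1, "SQL"), (1, "Tableau"), (2, "SQL")],
)

# UPDATE: change an existing row.
conn.execute("UPDATE members SET name = 'Grace H.' WHERE member_id = 2")

# SELECT with an INNER JOIN: match each attendance row to its member.
rows = conn.execute("""
    SELECT m.name, a.workshop
    FROM members AS m
    INNER JOIN attendance AS a ON a.member_id = m.member_id
    WHERE a.workshop = 'SQL'
    ORDER BY m.name
""").fetchall()
```

The join works because `attendance.member_id` refers back to the key of `members`, which is the relationship the foreign key declares.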

MySQL Download, Workshop Slides