Datalab 00: Regression Algorithms

We start with the Yelp Regression mini-project from Codecademy to put the knowledge and skills we gained yesterday to the test. Then we apply that knowledge and those skills to the research challenge/question we set ourselves for the creative brief. It's all quite straightforward really!

Learning Objectives:

  1. Create an appropriate multiple regression model for a given research question.

Table of contents:

  1. Stand-up: 0.5 hours
  2. Q&A: 0.5 hours
  3. Mini-Project: Yelp Regression: 3.5 hours
  4. Oosterhout Dataset: Multiple Regression: 3 hours
  5. Day-reflection: 0.5 hours

Questions or issues?

If you have any questions or issues regarding the course material after the Q&A, please ask your peers first, and ask us if you can't figure it out together!

Good luck!

0) Stand-up

We start by hosting a stand-up. Form groups of ~5 and run one another through the following points:

  • What progress have you made since the last datalab?
  • What progress do you anticipate making today?
  • What impediments are you facing or expecting?
  • With what could you use help or support?

Open your worklog and plan your day, informed by the stand-up and today's schedule.

1) Q & A

We start by briefly reflecting on what we learned about supervised vs. unsupervised learning and about regression algorithms. Do you have any questions? Now is the time to ask them!

2) Yelp Regression Project

Now that we have been introduced to multiple linear regression, it's time to apply these fundamentals in a workshop. Open the Basics of Machine Learning course on Codecademy and complete the module Yelp Regression Project, specifically:

  • Info: Yelp Rating Predictor Cumulative Project
  • Article: Yelp Dataset Terms of Use

3) Oosterhout Dataset: Multiple Regression

Document your code

Write your argumentation down in in-line comments, and for every line of code, write an in-line comment explaining exactly what that line does. Figure 1 below is a good demonstration of documented code.

Figure 1.


The example script I provided in Data Science 1 is also a good example of how to document your code, although that one was written in R.
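As a rough illustration of the commenting style described above, here is a minimal Python sketch. It is not the template script or the figure: the function and the numbers are made up for this example, and the point is only to show argumentation at the top and one explanatory comment per line of code.

```python
# Goal: measure how far a model's predictions are from the observed values.
# Argumentation: mean absolute error is used because it is expressed in the
# same unit as the outcome variable, which makes it easy to interpret.


def mean_absolute_error(observed, predicted):
    # Pair each observed value with its prediction and take the absolute difference
    errors = [abs(o - p) for o, p in zip(observed, predicted)]
    # Average the absolute differences to get one summary number
    return sum(errors) / len(errors)


# Hypothetical observed outcome values (made up for illustration)
observed = [3.0, 5.0, 4.0]
# Hypothetical model predictions for the same three cases
predicted = [2.5, 5.0, 5.0]
# Compute the error: (0.5 + 0.0 + 1.0) / 3
mae = mean_absolute_error(observed, predicted)
# Print the result so it can be checked by eye
print(mae)
```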

  1. Download the template Python script here and fill it in according to the guidelines described in it. To download the script, open the raw file, right-click and 'save as' into a location of your choice. Please fill in your student number and name. In this class there are no students called 'FirstName' or 'LastName', nor any with the student number 'StudentNumber'. I checked.
  2. Load in the youthcare dataset you created in Business Intelligence. Then save your file to your GitHub repository.
  3. Open your research design and use in-line comments to formulate a multiple regression analysis based on your research question (or when not answerable using multiple regression: related to your research question). Start by listing the variables which you think could predict the outcome variable you're interested in and motivate why you think they might predict your outcome variable.
  4. Create your fully fitted model (the model containing all the variables you wrote down in step 3) under the Python code you just wrote.
  5. Test, re-fit and validate your model. Create a new model on a new line for every re-fit. Keep track of any predictor variables you exclude from the full model when re-fitting. Use in-line comments to motivate why you are excluding variables or including new ones.
  6. Continue until 16:00, or stop when you feel you can no longer improve the model. Then save your file to your GitHub repository.
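
Steps 4 and 5 above could look roughly like the sketch below. It is only an illustration on synthetic data: the variable names (age, household_size, shoe_size) and the helper fit_and_score are hypothetical and not taken from the youthcare dataset or the template script. It also uses NumPy's least-squares solver to stay self-contained; a regression library such as scikit-learn would serve the same purpose.

```python
import numpy as np

# Synthetic data, made up for illustration only
rng = np.random.default_rng(seed=0)
n = 200
age = rng.normal(15, 2, n)              # hypothetical predictor
household_size = rng.normal(3, 1, n)    # hypothetical predictor
shoe_size = rng.normal(40, 2, n)        # hypothetical predictor, unrelated to the outcome
outcome = 2.0 + 0.5 * age - 1.0 * household_size + rng.normal(0, 0.5, n)

# Hold out 20% of the rows so the model is validated on unseen data
order = rng.permutation(n)
train, test = order[:160], order[160:]


def fit_and_score(X, y, train, test):
    """Fit ordinary least squares on the train rows, return (coefficients, test R^2)."""
    # Prepend a column of ones so the first coefficient is the intercept
    A = np.column_stack([np.ones(len(X)), X])
    # Solve the least-squares problem on the training rows only
    coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
    # Predict the held-out rows with the fitted coefficients
    predicted = A[test] @ coef
    # R^2 = 1 - residual sum of squares / total sum of squares
    ss_res = np.sum((y[test] - predicted) ** 2)
    ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
    return coef, 1 - ss_res / ss_tot


# Full model: every predictor listed in the research design
X_full = np.column_stack([age, household_size, shoe_size])
coef_full, r2_full = fit_and_score(X_full, outcome, train, test)

# Re-fit: shoe_size is excluded because there is no theoretical reason
# for it to predict the outcome; the exclusion is documented right here
X_reduced = np.column_stack([age, household_size])
coef_reduced, r2_reduced = fit_and_score(X_reduced, outcome, train, test)

print("full model    R^2 on test rows:", round(r2_full, 3))
print("reduced model R^2 on test rows:", round(r2_reduced, 3))
```

Keeping each re-fit as its own named model, with a comment stating why a predictor was dropped, makes it easy to compare the validation scores side by side and to retrace the decisions later.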

4) Day-reflection

At 16:30, there's a meeting you're encouraged to take part in to ask questions, discuss your progress, and reflect on today's activities.

Up Next!

Next week, we will cover classification algorithms!

Resources