Datalab 02: Tree-Based Analyses

We start with applying decision trees to the Oosterhout dataset. Please use the same research question you developed for your classification datalab.

Table of contents:

Stand-up: 0.5 hours
Q&A: 0.5 hours
Oosterhout Dataset: Tree-based Analysis: 3.5 hours
Random forests: 3 hours
Day-reflection: 0.5 hours

Questions or issues?

If you have any questions or issues regarding the course material after the Q&A, please ask your peers first. If you can't figure it out together, ask us!

Good luck!

0) Stand-up

We start by hosting a stand-up. Form groups of about 5 and walk one another through the following points:

What progress have you made since the last datalab?
What progress do you expect to make today?
What impediments are you facing or expecting?
What could you use help or support with?

Open your worklog and plan your day, informed by the stand-up and today's schedule.

1) Q&A

We start by briefly reflecting on what we learned about classification algorithms, overfitting and the bias-variance trade-off. Do you have any questions? Now is the time to ask them!

2) Oosterhout Dataset: Classification using decision trees

Document your code

Write your argumentation down in in-line comments, and for every line of code, write an in-line comment explaining exactly what that line does. Figure 1 below is a good demonstration of documented code.

The example script I provided in Data Science 1 is also a good example of how to document your code, although that one was written in R.
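As a small additional illustration of this commenting style in Python, here is a minimal sketch. It is not the figure or the R script mentioned above; the values and variable names are made up purely for demonstration.

```python
# Goal: compute the share of positive cases in an outcome column.
# Motivation: the class balance tells us whether plain accuracy is a fair
# metric for the classification models we build later.

# Hypothetical outcome values (1 = event occurred, 0 = no event); not real data.
outcomes = [1, 0, 0, 1, 0, 0, 0, 1]

# Count the total number of records in the outcome column.
n_records = len(outcomes)

# Count the positive cases; summing works because positives are coded as 1.
n_positive = sum(outcomes)

# Divide positives by the total to get the share of positive cases.
share_positive = n_positive / n_records

# Print the share, rounded to three decimals, so it can be noted in the worklog.
print(f"Share of positive cases: {share_positive:.3f}")
```

Note how the comments at the top carry the argumentation (why this is computed), while the comments above each line describe what that line does.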

Open your Python file (MachineLearning_OosterhoutModels_…) used for the final delivery of your model.
Load in the youthcare dataset you created in Business Intelligence if you haven't done so already. Load in any other data you might need. Then save your file to your GitHub repository.
Open your research design and use in-line comments to formulate a classification analysis using decision trees based on your research question (or, if it cannot be answered with this type of analysis, perform an analysis related to your research question). Start by listing the variables that you think could predict the outcome variable you're interested in, and motivate why you think they might predict it.
Create your full model (the model containing all the variables you listed in the previous step) below the Python code you just wrote.
Test, re-fit and validate your model. Create a new model on a new line for every re-fit. Keep track of any predictor variables you exclude from the full model when re-fitting. Use in-line comments to motivate why you are excluding variables or including new ones.
Continue until 16:00, or stop when you feel you can no longer improve the model. Then save your file to your GitHub repository.
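The fit, test, re-fit and validate cycle described above could look roughly like the sketch below. It assumes scikit-learn and pandas are installed, and it uses a synthetic dataset as a stand-in because the columns of your youthcare dataset are specific to your own project. The names predictor_0 to predictor_5 are hypothetical, and dropping the two least important predictors is only an illustration. In your own analysis, base such choices on your research design and motivate them in comments.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): replace with your own predictors and outcome.
X_arr, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"predictor_{i}" for i in range(6)]  # hypothetical column names
X = pd.DataFrame(X_arr, columns=feature_names)

# Hold out a test set that is only used for the final validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Model 1: full model with all listed predictors and an unrestricted tree.
full_model = DecisionTreeClassifier(random_state=42)
full_cv = cross_val_score(full_model, X_train, y_train, cv=5).mean()
full_model.fit(X_train, y_train)

# Inspect which predictors the full tree relied on, sorted from least to most.
importances = pd.Series(full_model.feature_importances_, index=feature_names)
importances = importances.sort_values()

# Model 2: re-fit without the two least important predictors and with a
# depth limit to reduce overfitting (illustrative choices, not a rule).
dropped = importances.index[:2].tolist()
kept = [name for name in feature_names if name not in dropped]
refit_model = DecisionTreeClassifier(max_depth=4, random_state=42)
refit_cv = cross_val_score(refit_model, X_train[kept], y_train, cv=5).mean()
refit_model.fit(X_train[kept], y_train)

# Validate the chosen model once on the held-out test set.
test_acc = refit_model.score(X_test[kept], y_test)

print(f"Full model, 5-fold CV accuracy:   {full_cv:.3f}")
print(f"Re-fit model, 5-fold CV accuracy: {refit_cv:.3f}")
print(f"Re-fit model, test accuracy:      {test_acc:.3f}")
print(f"Excluded predictors: {dropped}")
```

Comparing models on cross-validated training data and touching the test set only once keeps the final accuracy estimate honest. Whether the re-fit actually improves on the full model depends on the data, so check the printed scores rather than assuming it does.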

3) Random Forests

When you have completed your analyses on the Oosterhout data, please open the Basics of Machine Learning course on Codecademy and complete the module Random Forests.
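To connect the module to the decision trees used earlier today, the sketch below fits a single tree and a random forest on the same data. It assumes scikit-learn is installed and uses synthetic data as a stand-in. The settings (200 trees, the random seeds) are arbitrary illustrative choices and are not taken from the Codecademy module.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): replace with your own dataset.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A single, unrestricted decision tree as the baseline.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A random forest: many trees, each trained on a bootstrap sample of the rows
# and considering a random subset of features at every split.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Compare accuracy on the same held-out test set.
tree_acc = tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)

print(f"Single tree test accuracy:   {tree_acc:.3f}")
print(f"Random forest test accuracy: {forest_acc:.3f}")
```

Averaging over many decorrelated trees usually lowers variance compared with a single deep tree, but the size of the difference depends on the dataset, so compare the two scores on your own data.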

4) Day-Reflection

At 16:00, there's a meeting you're encouraged to take part in to ask questions, discuss our progress and reflect on today's activities.

Resources

Codecademy