Explaining Predictions Global

Recap We’ve covered various approaches in explaining model predictions globally. Today we will learn about another model specific post hoc analysis. We will learn to understand the workings of gradient boosting predictions. Like past posts, the Clevaland heart dataset as well as tidymodels principle will be used. Refer to the first post of this series for more details. Gradient Boosting Besides random forest introduced in a past post, another tree-based ensemble model is gradient boosting.

Recap This is a continuation on the explanation of machine learning model predictions. Specifically, random forest models. We can depend on the random forest package itself to explain predictions based on impurity importance or permutation importance. Today, we will explore external packages which aid in explaining random forest predictions. External packages There are external a few packages which offer to calculate variable importance for random forest models apart from the conventional measurements found within the random forest package.

Intro Recap There are 2 approaches to explaining models Use simple interpretable models. This approach was covered in the previous posts where we looked at logistic regression and decision trees as examples of white box models. Conduct post-hoc interpretation on models. There are two are two types of post-hoc analysis which can be done, model specific and model agonistic. Direction of post In the next few posts, we will look at model specific post-hoc analysis which involves ranking the variables according to importance to the model.

Introduction This is a follow up post of using simple models to explain machine learning predictions. In the last post, we introduced logistic regression and in today’s entry we will learn about decision tree. We will continue to use the Cleveland heart dataset and use tidymodels principles where possible. The details of the Cleveland heart dataset was also described in the last post. #library library(tidyverse) library(tidymodels) #import heart<-read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", col_names = F) # Renaming var colnames(heart)<- c("age", "sex", "rest_cp", "rest_bp", "chol", "fast_bloodsugar","rest_ecg","ex_maxHR","ex_cp", "ex_STdepression_dur", "ex_STpeak","coloured_vessels", "thalassemia","heart_disease") #elaborating cat var ##simple ifelse conversion heart<-heart %>% mutate(sex= ifelse(sex=="1", "male", "female"),fast_bloodsugar= ifelse(fast_bloodsugar=="1", ">120", "<120"), ex_cp=ifelse(ex_cp=="1", "yes", "no"), heart_disease=ifelse(heart_disease=="0", "no", "yes")) ## complex ifelse conversion using `case_when` heart<-heart %>% mutate( rest_cp=case_when(rest_cp== "1" ~ "typical",rest_cp=="2" ~ "atypical", rest_cp== "3" ~ "non-CP pain",rest_cp== "4" ~ "asymptomatic"), rest_ecg=case_when(rest_ecg=="0" ~ "normal",rest_ecg=="1" ~ "ST-T abnorm",rest_ecg=="2" ~ "LV hyperthrophy"), ex_STpeak=case_when(ex_STpeak=="1" ~ "up/norm", ex_STpeak== "2" ~ "flat",ex_STpeak== "3" ~ "down"), thalassemia=case_when(thalassemia=="3.

Introduction The rise of machine learning In this current 4th industrial revolution, data science has penetrated all industries and healthcare is no exception. There has been an exponential use of machine learning in clinical research in the past decade and it is expected to continue to grow at an even faster rate in the following decade. Many machine learning techniques are considered as black box algorithms as the intrinsic workings of the models are too complex in justifying the reasons for the predictions.

Explaining Predictions Global

Explaining Predictions: Boosted Trees Post-hoc Analysis (Xgboost)

Explaining Predictions: Random Forest Post-hoc Analysis (randomForestExplainer package)

Explaining Predictions: Random Forest Post-hoc Analysis (permutation & impurity variable importance)

Explaining Predictions: Interpretable models (decision tree)

Explaining Predictions: Interpretable models (logistic regression)