Predicting pneumonia outcomes: Feature Engineering
Intro
This post is a supplementary material for an assignment. The assignment is part of the Augmented Machine Learning unit for a Specialised Diploma in Data Science for Business. The aim of the assignment is to use DataRobot
for predictive modelling. Exploratory data analysis and feature engineering will be done here in R
before the data is imported into DataRobot
. The aim of this project is to classify if patients with Community Acquired Pneumonia (CAP) became better after seeing a doctor or became worse despite seeing a doctor.
In the 2 previous posts (part1, part2), a dataset on Community Acquired Pneumonia was explored and cleaned up. In this post, some feature engineering will be done.
library(tidyverse)
#https://github.com/notast/pneumonia-outcomes/blob/master/pneumonia_EDA2.Rmd
load("CAP_EDA2.RData")
theme_set(theme_light())
plt_common<- function(datafr, var, glued_title, s_title){
datafr %>% mutate(var_reordered =fct_reorder({{var}}, rel))%>%
ggplot(aes(var_reordered, rel, fill=rel)) + geom_col() + coord_flip() + scale_fill_viridis_c(option = "cividis", alpha = .5) +
scale_y_continuous(labels=scales::percent_format()) + labs(title= glue::glue(glued_title), subtitle = s_title, x="")}
Antibiotics used
The goal is to have a frequency breakdown on the types of antibiotics used and lump the infrequent types of antibiotics used.
Patients who were not prescribed antibiotics will be separated for this analysis
df_abx<-df %>% filter(Abx_no!=0)
df_abxNo<-df %>% filter(Abx_no==0)
Currently, the type of antibiotics used is displayed as binary variables in a wide format.
df %>% select(Abx_AmoxicillinSulbactam:Abx_OtherYN) %>% head()
## # A tibble: 6 x 17
## Abx_Amoxicillin~ Abx_Amoxicillin~ Abx_Amoxicillin~ Abx_Ampicillin
## <chr> <chr> <chr> <chr>
## 1 No No No No
## 2 Yes No No No
## 3 Unavailable Unavailable Unavailable Unavailable
## 4 No No No No
## 5 No No No No
## 6 No No No No
## # ... with 13 more variables: Abx_AmpicillinSulbactam <chr>,
## # Abx_Azithromycin <chr>, Abx_Ceftriaxone <chr>, Abx_Cefotaxime <chr>,
## # Abx_ClarithromycinOral <chr>, Abx_Cefepime <chr>,
## # Abx_ClarithromycinIV <chr>, Abx_Doxycycline <chr>, Abx_Levofloxacin <chr>,
## # Abx_Moxifloxacin <chr>, Abx_Piperacillin <chr>, Abx_Trimethoprim <chr>,
## # Abx_OtherYN <chr>
Additionally, the details of other types of antibiotics used is stored in a different column, Abx_OtherDetail
.
df %>% select(Abx_OtherYN, Abx_OtherDetail) %>% tail(10)
## # A tibble: 10 x 2
## Abx_OtherYN Abx_OtherDetail
## <chr> <chr>
## 1 No <NA>
## 2 Yes Cefuroxime
## 3 Yes Clindamycin
## 4 No <NA>
## 5 No <NA>
## 6 No <NA>
## 7 No <NA>
## 8 No <NA>
## 9 No <NA>
## 10 No <NA>
The dataframe will need to be pivoted to a long dataframe. The values of Abx_OtherDeail
will be used during wrangling to have a complete set on types of antibiotics used in the study.
long_df<-df %>% pivot_longer(cols = Abx_AmoxicillinSulbactam:Abx_OtherYN, "Abx_type", values_to= "Used") %>%
#abx prescribed
filter(Used=="Yes") %>%
# integrate details from Abx_OtherDetail to expand type of abx used for "Abx_OtherYN"
mutate(Abx_type=case_when(
Abx_type=="Abx_OtherYN" & !is.na(Abx_OtherDetail) ~ Abx_OtherDetail,
Abx_type=="Abx_OtherYN" & is.na(Abx_OtherDetail) ~ "Abx_unknown",
T~Abx_type
),
Abx_type= str_remove(Abx_type, "Abx_"),
Abx_type= str_glue("Abx_{Abx_type}")) %>%
# `Abx_OtherDetail` is now redundant
select(-Abx_OtherDetail)
50 antibiotics were used in the study and most of them were rarely used.
# plot
(long_df %>%
# rel frequ
count(Abx_type) %>% mutate(rel=prop.table(n)) %>%
plt_common(Abx_type, "{n_distinct(long_df %>% pull(Abx_type))} antibiotics were prescribed", ""))
# tabuluar
(long_df %>% filter(Used=="Yes") %>% count(Abx_type) %>% mutate(rel=prop.table(n)) %>%
arrange(-rel) %>% select(Abx_type, rel))
## # A tibble: 50 x 2
## Abx_type rel
## <glue> <dbl>
## 1 Abx_AmpicillinSulbactam 0.279
## 2 Abx_AmoxicillinSulbactamOral 0.237
## 3 Abx_ClarithromycinOral 0.128
## 4 Abx_ClarithromycinIV 0.0962
## 5 Abx_Levofloxacin 0.0501
## 6 Abx_Ceftriaxone 0.0466
## 7 Abx_AmoxicillinSulbactam 0.0330
## 8 Abx_Azithromycin 0.0276
## 9 Abx_Oseltamivir 0.0187
## 10 Abx_Piperacillin 0.0165
## # ... with 40 more rows
Only the top 9 antibiotics will be kept and the less frequent antibiotics will be lumped together.
long_df<- long_df %>% mutate(Abx_type= fct_lump_n(Abx_type, n=9, other_level = "Abx_LowFreq"))
long_df%>% filter(Used=="Yes") %>% count(Abx_type) %>% mutate(rel=prop.table(n)) %>%
plt_common(Abx_type, "Top 9 antibiotics", "Low frequency antibiotics were outside the top 9 and were lumped together")
Class of empirical antibiotics used
Similarly, the goal is to lump the infrequent empirical antibiotics class. 30 empirical antibiotics class were used in the study, most of them were rarely used.
(long_df %>% count(Abx_ClassUpdated) %>% mutate(rel=prop.table(n)) %>%
plt_common(Abx_ClassUpdated, "{n_distinct(long_df %>% pull(Abx_ClassUpdated))} empirical antibiotics class used", ""))
Only the top 4 classes will be kept and the less frequent classes will be lumped together. There is a small proportion of NA
antibiotics class which will be imputed. An alternative was to review the antibiotics prescribed and use clinical knowledge to deduce the empirical antibiotic prescribed and therefore providing information of the class of empirical antibiotics used.
# lump abx_class
long_df<- long_df %>% mutate(Abx_ClassUpdated= fct_lump_n(Abx_ClassUpdated , n= 4, other_level = "Low freq"))
# re-plot
long_df %>% count(Abx_ClassUpdated) %>% mutate(rel=prop.table(n)) %>%
plt_common(Abx_ClassUpdated, "Top 4 empirical antibiotic class", "Low frequency class were outside the top 4 and were lumped together")
Lastly, the long dataframe will be spread into the original wide format and the patients who were not prescribed antibiotics are added back into the dataframe
df2<-long_df %>% pivot_wider(names_from = Abx_type, values_from=Used,
# may have multiple `yes` for Abx_LowFreq due to lumping. need to summarise with `unique` to remove duplicate `yes` OR use `length` to to count the number of low freq abx used
values_fn={Abx_LowFreq= length},
values_fill=0)
# add pt w/o abx
df2<-bind_rows(df_abxNo %>% select(-starts_with("Abx")),
df2) %>%
mutate(Abx_ClassUpdated=replace_na(as.character(Abx_ClassUpdated), "No Abx"),
across(.cols= (Abx_Duration:Abx_ClarithromycinIV), .fns = ~replace_na(.x, 0)),
# num binary -> y/n binary for abx
across(.cols=(c(Abx_Ceftriaxone, Abx_AmoxicillinSulbactam:Abx_ClarithromycinIV)), .fns=~if_else(.x==1, "Yes", "No")))
Feature Enginnering
Additional features will be created based on the available variables and clinical knowledge.
Reasearch Site
The research site Pt_site
will be expanded to the actual cities and countries for better interpretation. The weather and climate for each city will be a new feature as there is research to suggestion a relationship between climate and CAP.
# Get the breakdown of Pt_site
eda_c<- function(datafr,x){
datafr %>% select(starts_with(x, ignore.case = F)) %>% map(~ table(.x, useNA = "always"))
}
(eda_c(df2, "Pt_Site"))
## $Pt_Site
## .x
## Location A Location B Location C <NA>
## 241 1042 829 0
# map the site
df2<- df2 %>% mutate(Pt_Site= case_when(
Pt_Site=="Location A"~ "Concepción_PY",
Pt_Site=="Location B"~"GeneralRoca_AR",
Pt_Site=="Location C"~ "Rivera_UY"
)) %>% # include weather
mutate(Pt_climate=case_when(
Pt_Site == "Concepción_PY" | Pt_Site=="Rivera_UY" ~"subtropical",
Pt_Site== "GeneralRoca_AR"~"cold windy"
)) %>% relocate(Pt_climate, .after=Pt_Site)
(df2 %>% select(Pt_Site, Pt_climate) %>% sample_n(10))
## # A tibble: 10 x 2
## Pt_Site Pt_climate
## <chr> <chr>
## 1 Concepción_PY subtropical
## 2 Rivera_UY subtropical
## 3 Rivera_UY subtropical
## 4 GeneralRoca_AR cold windy
## 5 GeneralRoca_AR cold windy
## 6 Rivera_UY subtropical
## 7 Rivera_UY subtropical
## 8 GeneralRoca_AR cold windy
## 9 GeneralRoca_AR cold windy
## 10 GeneralRoca_AR cold windy
Comorbidities
Based on the patient’s past medical history Hx_
, the number of comorbidities can be calculated. calculated. Patients with CAP and comorbidities have been shown to have poorer outcomes. However, there are missing values in Hx
which need to be imputed first.
plt_na<-function(dfr, X, t, st){
dfr %>% summarise(across(starts_with(X), ~mean(is.na(.)))) %>% pivot_longer(cols = everything(), names_to= "Variables" , values_to="pct_na") %>% mutate(Variables= fct_reorder(Variables, pct_na)) %>% ggplot(aes(x=Variables, y=pct_na, fill= pct_na))+ geom_col() + coord_flip() + scale_y_continuous(labels=scales::percent_format()) + scale_fill_viridis_c(option = "plasma") + labs(title = t, subtitle = st)}
plt_na(df2, "Hx", "before imputation", NULL)
library(recipes)
set.seed(69)
df3<-recipe(Outcome ~., data= df2) %>% update_role(Pt_CaseNumber, new_role = "id variable") %>% step_bagimpute(starts_with("Hx")) %>% prep() %>% juice()
plt_na(df3,"Hx","post imputation", "imputed with decision trees")
The comorbidities can now be calculated
df3<-df3 %>% mutate(
# collapse various heart disease to `yes`
Hx_heart_type= if_else(Hx_heart_type =="None", "None", "Yes"),
# convert y/n to binary numbers. need numeric for calculation
across(.cols=starts_with("Hx"), .fns = ~if_else(.x=="Yes", 1,0))) %>%
# calculate comorbidities
rowwise() %>%
mutate(Hx_comorbidities= sum(c_across(starts_with("Hx")))) %>%
# in order to preserve categorical nature of `Hx_` variables need to extract comorbidities and join back to df
select(Pt_CaseNumber, Hx_comorbidities) %>% left_join(df3, by="Pt_CaseNumber")
Mean arterial pressure
Conventionally, blood pressure readings include both systolic PE_BP_S
and diastolic blood pressure PE_BP_D
.
plt_na(df3, "PE_BP", "There are no missing blood pressure values", "")
Both of these values can be used to calculate another means to measure blood pressure, Mean arterial pressure PE_BP_MAP
. After calculating, PE_BP_MAP
, PE_BP_S
and PE_BP_D
can be removed.
#MAP
df3<- df3 %>% mutate(PE_BP_MAP= 1/3*PE_BP_S + 2/3*PE_BP_D, .keep="unused")
Done!
The orginial dataset had 2302 rows and 176 columns, after EDA the dataset has 2112 rows and 78 columns. After feature engineering, the dataset has 2112 rows and 71 variables.
(dim(df3))
Before importing the data into DataRobot
to be used for modelling, 10% of the data was randomly carved out to be treated as unseen data to determine how the selected model’s performance will perform in the real world.
# 10% as unseen
library(rsample)
set.seed(69)
s_clean<-df3 %>% initial_split(prop = 9/10, strata = Outcome)
s_cleanDR<- training(s_clean)
s_cleanUnseen<-testing(s_clean)
write_excel_csv(s_cleanDR, file.path("CleanDR.csv"))
write_excel_csv(s_cleanUnseen, file.path("CleanUnseen.csv"))