Predicting pneumonia outcomes: EDA part 1

August 05, 2020

This post is a supplementary material for an assignment. The assignment is part of the Augmented Machine Learning unit for a Specialised Diploma in Data Science for Business. The aim of the assignment is to use DataRobot for predictive modelling. Exploratory data analysis and feature engineering will be done here in R before the data is imported into DataRobot.

Intro

The dataset is from a prospective population-based surveillance study. For reproducibility, use this spreadsheet as I have appended extra sheets to the orginal spreadsheet used by the researchers. Metadata to sheet 2 (metadata) and a description of each category to sheet 3 (Category). The observational study was conducted over 3 different South America cities across 3 different countries over a 3-year period to investigate the incidence rate of Community Acquired Pneumonia (CAP). The dataset has a wealth of variables which can be used for predictive modelling, there is no known predictive analysis published using this dataset. The aim of this project is to classify if patients with CAP became better after seeing a doctor or became worse despite seeing a doctor.

library(tidyverse)
theme_set(theme_light())

raw<- readxl::read_excel("Incidence rate of community-acquired pneumonia in adults a population-based prospective active surveillance study in three cities in South America.xls")

## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in EL1372 / R1372C142: got '2014-09-02'

## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in EM1372 / R1372C143: got '2014-09-08'

The dataset consists of 2302 rows and 176 columns.

dim(raw)

## [1] 2302  176

The original column names were the description of the variables (e.g. Received flu shot in the last 12 months). Based on these descriptions/column names, the columns can be classified into 13 categories (see table below). Most of the categories ae clinically related variables.

categories13<- readxl::read_excel("Incidence rate of community-acquired pneumonia in adults a population-based prospective active surveillance study in three cities in South America.xls", sheet=3)

categories13 %>%  DT::datatable(rownames = F, options = list(searchHighlight = TRUE, paging= T))

Data dictionary

The column names are renamed to shorter column names (e.g. Received flu shot in the last 12 months-> flu) with prefixes to identify which of the above 13 categories they belong to (e.g. flu-> V_flu. Prefix V_ stands for vaccine against the flu.).

metadata<- readxl::read_excel("Incidence rate of community-acquired pneumonia in adults a population-based prospective active surveillance study in three cities in South America.xls", sheet=2)

# metadata
metadata %>%  DT::datatable(rownames = F, options = list(searchHighlight = TRUE, paging= T))

# rename col names 
colnames(raw)<- metadata$`New column name` %>% t()

EDA blueprint

EDA will be iterated for each of the 13 categories as there are too many columns to do the EDA at once. Also, there may be some association among the variables for each category. EDA includes exploring (i) the types of variables for each category (ii) the number of missing values (iii) the number of outliers (iv) and data cleaning if needed. Customized functions were created to facilitate EDA:

dtype provides the number of columns beginning with the prefix (e.g. dtype(dataframe, “Pt”) will list all the columns related to patient (pt). The types of variables are also provided.
eda_c breaks down the labels for columns beginning with the prefix. Used mostly for categorical variables.
eda_n_NAplt plots the percentage of NA/missing values for each column beginning with the prefix. Used mostly for numeric variables.
eda_n_NAcutoff provides a vector of variable names with acceptable NA values. Used mostly for numeric variables. The ball park maximum amount of missing values is 20% though higher proportion of missing values may be included after inspecting the plot generated by eda_n_NAplt
eda_n_outlier plots boxplots for numeric variables beginning with the prefix. Variables with large number of outliers can be isolated for further investigation.

dtype<- function(datafr, x){
datafr %>% select(starts_with(x, ignore.case = F)) %>% str()
}

eda_c<- function(datafr,x){
  datafr %>% select(starts_with(x, ignore.case = F)) %>%  map(~ table(.x, useNA = "always"))
}

eda_n_NAplt<- function (datafr, x){
  datafr %>% select(starts_with(x, ignore.case = F)) %>% summarise(across(starts_with(x), ~mean(is.na(.)))) %>% pivot_longer(cols = everything(), names_to= "Variables" , values_to="pct_na") %>% mutate(Variables= fct_reorder(Variables, pct_na)) %>% ggplot(aes(x=Variables, y=pct_na, fill= pct_na))+ geom_col() + coord_flip() + scale_y_continuous(labels=scales::percent_format()) + scale_fill_viridis_c(option = "plasma")}

eda_n_NAcutoff<- function(datafr, x, low, high){
  datafr%>% select(starts_with(x, ignore.case = F)) %>% summarise(across(starts_with(x), ~mean(is.na(.)))) %>% pivot_longer(cols = everything(), names_to="Variables", values_to="pct_na") %>% filter((pct_na>low & pct_na<high)) %>% pull(Variables)}

eda_n_outlier<-function(datafr, x_selected){
# nested df with plots
  plt<-datafr %>% select(all_of(x_selected)) %>% pivot_longer(cols=everything(),names_to="Variables", values_to="values") %>% nest(-Variables) %>% mutate(plot= map2(.x= data, .y= Variables, 
~ggplot(data=.x, aes(x= values)) + geom_boxplot() + labs(title = .y)
)) 
# print the plots
  for (i in 1:length(x_selected)){
    p<-plt[[3]][[i]]
    print(p)}
  }

Outcome

The outcome will be Other_Outcome. As the prediction is whether the patient was better or worse after seeking medical treatment, a binary classification is warranted here. However the Other_Outcome has 4 values, namely cure, improvement, unfavourable and death.

eda_c(raw, "Other_Outcome")

## $Other_Outcome
## .x
##        Cure       death Improvement unfavorable        <NA> 
##         799         277        1179          26          21

cure and improvement will be collapsed as better and unfavourable and death will be collapsed as worse. 6 times as many patients became better after seeking medical help. While encouraging from the doctor’s and patient’s perspective, it results in an imbalanced dataset for prediction. The imbalanced dataset will be addressed much later.
After removing 21 NA outcomes, there are 2281 observations remaining.

# collapse 4 categories into 2 
raw$Other_Outcome<-fct_collapse(raw$Other_Outcome, better=c("Cure", "Improvement")) 
raw$Other_Outcome<-fct_collapse(raw$Other_Outcome, worse=c("unfavorable", "death"))

# remove na 
df<-raw %>%  filter(!is.na(Other_Outcome))  
eda_c(df, "Other_Outcome")

## $Other_Outcome
## .x
## better  worse   <NA> 
##   1978    303      0

Discard the noise

There are column names with the prefix rm_ in front of the category prefix (e.g. rm_Other_). These columns are removed for numerous reasons.

dtype(df,"rm")

## tibble [2,281 x 36] (S3: tbl_df/tbl/data.frame)
##  $ rm_R_CXR            : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...
##  $ rm_R_CT             : chr [1:2281] "No" "No" "No" "No" ...
##  $ rm_R_CT_date        : chr [1:2281] NA NA NA NA ...
##  $ rm_SS               : logi [1:2281] NA NA NA NA NA NA ...
##  $ rm_SS_infilterate   : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...
##  $ rm_SS_WBC           : chr [1:2281] "Yes" "Yes" "No" "No" ...
##  $ rm_HCAP             : chr [1:2281] "No" "No" "No" "No" ...
##  $ rm_PE               : logi [1:2281] NA NA NA NA NA NA ...
##  $ rm_PE_O2            : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...
##  $ rm_Lab              : logi [1:2281] NA NA NA NA NA NA ...
##  $ rm_Lab_RBC          : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...
##  $ rm_Lab_Hb           : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...
##  $ rm_Lab_WBC          : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...
##  $ rm_Lab_NeuImu       : chr [1:2281] "No" "No" "No" "No" ...
##  $ rm_Lab_NeuImuDate   : chr [1:2281] NA NA NA NA ...
##  $ rm_Lab_Neu          : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...
##  $ rm_Lab_plt          : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...
##  $ rm_Lab_Na           : chr [1:2281] "No" "No" "No" "No" ...
##  $ rm_Lab_NaDate       : chr [1:2281] NA NA NA NA ...
##  $ rm_Lab_urea         : chr [1:2281] "Yes" "No" "Yes" "Yes" ...
##  $ rm_Lab_Cr           : chr [1:2281] "Yes" "No" "Yes" "Yes" ...
##  $ rm_Lab_Bicarb       : chr [1:2281] "No" "No" "No" "No" ...
##  $ rm_Lab_BicarbDate   : chr [1:2281] NA NA NA NA ...
##  $ rm_Lab_Sugar        : chr [1:2281] "Yes" "No" "Yes" "Yes" ...
##  $ rm_Lab_Alb          : chr [1:2281] "No" "No" "No" "No" ...
##  $ rm_Lab_AlbDate      : chr [1:2281] NA NA NA NA ...
##  $ rm_Lab_lactate      : chr [1:2281] "No" "No" "No" "No" ...
##  $ rm_Lab_lactateDate  : chr [1:2281] NA NA NA NA ...
##  $ rm_Lab_CRP          : chr [1:2281] "Yes" "Yes" "No" "Yes" ...
##  $ rm_Lab_ABG          : chr [1:2281] "No" "No" "No" "No" ...
##  $ rm_Lab_ABGDate      : chr [1:2281] NA NA NA NA ...
##  $ rm_Abx              : logi [1:2281] NA NA NA NA NA NA ...
##  $ rm_Care_ICUdate     : chr [1:2281] NA NA NA NA ...
##  $ rm_Other_phone      : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...
##  $ rm_Other_1yearstatus: chr [1:2281] "dead after 1 year" NA "dead after 1 year" "dead after 1 year" ...
##  $ rm_Abx_AbxDuration  : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...

For instance, 1 year status contains information about the patient’s status one year post CAP which is a data leakage as it reveals the patient’s outcome from the CAP.

df %>% select(rm_Other_1yearstatus) %>% tail()

## # A tibble: 6 x 1
##   rm_Other_1yearstatus
##   <chr>               
## 1 alive               
## 2 dead after 1 year   
## 3 alive               
## 4 alive               
## 5 alive               
## 6 alive

Other columns contained duplicated information. For lab results, there is a column, which indicates if the specific biochemical was tested (eg rm_Lab_urea), and another column of the result (eg Lab_urea). If the biochemical was not tested, the column will indicate No test being done and the result column will be blank. Keeping only the results column will suffice. The reasons for removing specific rm_ columns is described in the above data dictionary.

df %>% select(rm_Lab_urea, Lab_urea) %>% head()

## # A tibble: 6 x 2
##   rm_Lab_urea Lab_urea
##   <chr>          <dbl>
## 1 Yes             60  
## 2 No              NA  
## 3 Yes             99  
## 4 Yes             56  
## 5 Yes            143  
## 6 Yes             56.3

The dataframe of 176 columns ends up with 140 columns after discarding rm_ columns.

df<-df %>% select(-starts_with("rm"))

ncol(df)

## [1] 140

1 `Other_` related category

After removing redundant Other_ variables, only Other_Outcome remains. Rename it to just Outcome for readability.

dtype(df, "Other")

## tibble [2,281 x 1] (S3: tbl_df/tbl/data.frame)
##  $ Other_Outcome: Factor w/ 2 levels "better","worse": 1 1 2 1 2 1 1 1 1 2 ...

df<-df %>% rename(Outcome=Other_Outcome)

2 `Pt_` Patient related category

dtype(df, "Pt")

## tibble [2,281 x 5] (S3: tbl_df/tbl/data.frame)
##  $ Pt_Site      : chr [1:2281] "Location A" "Location A" "Location A" "Location A" ...
##  $ Pt_CaseNumber: num [1:2281] 1 2 3 4 5 6 7 9 10 11 ...
##  $ Pt_Age       : num [1:2281] 95 79 89 93 81 83 88 65 27 37 ...
##  $ Pt_incorrect : logi [1:2281] NA NA NA NA NA NA ...
##  $ Pt_correct   : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...

Appropriate patients

Pt_incorrect and Pt_correct are columns to indicate if the patients enrolled met the criteria for the study. All the subjects met the criteria for the study. Pt_incorrect and Pt_correct can be dropped.

(df %>% count(Pt_incorrect))

## # A tibble: 1 x 2
##   Pt_incorrect     n
##   <lgl>        <int>
## 1 NA            2281

(df %>% count(Pt_correct))

## # A tibble: 1 x 2
##   Pt_correct     n
##   <chr>      <int>
## 1 Yes         2281

df<-df %>% select(- c(Pt_correct, Pt_incorrect))

Case_number

Case numbers, Pt_CaseNumber are not distinct to the entire dataset. they are only distinct to the research site. Assign new unique case number for entire dataset.

# are case number distinct
(df %>% pull(Pt_CaseNumber) %>%  n_distinct())

## [1] 1231

# explore why case numbers are not distinct
(df %>% filter(Pt_CaseNumber==1))

## # A tibble: 3 x 138
##   Pt_Site Pt_CaseNumber Pt_Age R_CXR_infiltrate R_CXR_cavitation R_CXR_effusion
##   <chr>           <dbl>  <dbl> <chr>            <chr>            <chr>         
## 1 Locati~             1     95 Yes              No               No            
## 2 Locati~             1     26 Yes              No               No            
## 3 Locati~             1     32 Yes              No               No            
## # ... with 132 more variables: R_CXR_effusionSite <chr>, R_CT_inflitrate <chr>,
## #   R_CT_cavitation <chr>, R_CT_effusion <chr>, R_CT_effusionSite <chr>,
## #   SS_cough <chr>, SS_phlegm <chr>, SS_lungSounds <chr>, SS_temp <chr>,
## #   SS_breathing <chr>, SS_daysOfRespSymp <dbl>, Hx_mass <chr>, Hx_heart <chr>,
## #   Hx_stroke <chr>, Hx_kidney <chr>, Hx_liver <chr>, Hx_brainMental <chr>,
## #   Hx_diabetes <chr>, Hx_pastCAP <chr>, Hx_asp <chr>, Hx_alcohol <chr>,
## #   Hx_immune <chr>, Hx_COPD <chr>, Social_drugs <chr>,
## #   Social_overcrowded <chr>, Hx_heart_type <chr>, Social_smoke <chr>,
## #   Social_smoke_duration <chr>, Hx_HIV <chr>, Hx_HIV_CD4 <dbl>,
## #   Hx_HIV_viralLoad <chr>, Hx_HIV_Medicine <chr>, HCAP_hospStay <chr>,
## #   HCAP_IVAbx <chr>, HCAP_Chemo <chr>, HCAP_diaylsis <chr>, HCAP_injury <chr>,
## #   PE_AMS <chr>, PE_HR <dbl>, PE_RR <dbl>, PE_BP_S <dbl>, PE_BP_D <dbl>,
## #   PE_temp <dbl>, PE_O2 <chr>, Lab_RBC <dbl>, Lab_Hb <dbl>, Lab_WBC <dbl>,
## #   Lab_NeuImu <dbl>, Lab_Neu <dbl>, Lab_plt <dbl>, Lab_Na <chr>,
## #   Lab_urea <dbl>, Lab_Cr <dbl>, Lab_Bicarb <dbl>, Lab_Sugar <dbl>,
## #   Lab_Alb <dbl>, Lab_lactate <dbl>, Lab_lactateHigh <chr>, Lab_CRP <dbl>,
## #   Lab_CRPHigh <chr>, Lab_pH <dbl>, Lab_CO2 <dbl>, Lab_O2 <dbl>,
## #   Lab_FiO2 <dbl>, CS_Resp <chr>, CS_Blood <chr>, CS_Urine <chr>,
## #   CS_screen <chr>, CS_agent <chr>, CS_Organism1 <chr>,
## #   CS_Organism1Blood <chr>, CS_Organism1Sputum <chr>,
## #   CS_Organism1Tracheal <chr>, CS_Organism1BAL <chr>, CS_Organism1Urine <chr>,
## #   CS_Organism1Sero <chr>, CS_Organism1Other <chr>,
## #   CS_Organism1Comments <chr>, CS_Organism2 <chr>, CS_Orgainsim2Blood <chr>,
## #   CS_Organism2Sputum <chr>, CS_Organism2Tracheal <chr>,
## #   CS_Organism2BAL <chr>, CS_Organism2Urine <chr>, CS_OrganismSero <chr>,
## #   CS_OrganismOther <chr>, CS_OrganismComments <chr>,
## #   Abx_AmoxicillinSulbactam <chr>, Abx_AmoxicillinSulbactamOral <chr>,
## #   Abx_AmoxicillinSulbactamNonoral <chr>,
## #   Abx_AmoxicillinSulbactamNonoralStart <chr>,
## #   Abx_AmoxicillinSulbactamNonoralEnd <chr>, Abx_Ampicillin <chr>,
## #   Abx_AmpicillinStart <chr>, Abx_AmpicillinEnd <chr>,
## #   Abx_AmpicillinSulbactam <chr>, Abx_Azithromycin <chr>,
## #   Abx_Ceftriaxone <chr>, Abx_Cefotaxime <chr>, Abx_ClarithromycinOral <chr>,
## #   ...

(df %>% filter(Pt_CaseNumber==11))

## # A tibble: 3 x 138
##   Pt_Site Pt_CaseNumber Pt_Age R_CXR_infiltrate R_CXR_cavitation R_CXR_effusion
##   <chr>           <dbl>  <dbl> <chr>            <chr>            <chr>         
## 1 Locati~            11     37 Yes              No               Yes           
## 2 Locati~            11     37 Yes              No               No            
## 3 Locati~            11     76 Yes              Unavailable      Unavailable   
## # ... with 132 more variables: R_CXR_effusionSite <chr>, R_CT_inflitrate <chr>,
## #   R_CT_cavitation <chr>, R_CT_effusion <chr>, R_CT_effusionSite <chr>,
## #   SS_cough <chr>, SS_phlegm <chr>, SS_lungSounds <chr>, SS_temp <chr>,
## #   SS_breathing <chr>, SS_daysOfRespSymp <dbl>, Hx_mass <chr>, Hx_heart <chr>,
## #   Hx_stroke <chr>, Hx_kidney <chr>, Hx_liver <chr>, Hx_brainMental <chr>,
## #   Hx_diabetes <chr>, Hx_pastCAP <chr>, Hx_asp <chr>, Hx_alcohol <chr>,
## #   Hx_immune <chr>, Hx_COPD <chr>, Social_drugs <chr>,
## #   Social_overcrowded <chr>, Hx_heart_type <chr>, Social_smoke <chr>,
## #   Social_smoke_duration <chr>, Hx_HIV <chr>, Hx_HIV_CD4 <dbl>,
## #   Hx_HIV_viralLoad <chr>, Hx_HIV_Medicine <chr>, HCAP_hospStay <chr>,
## #   HCAP_IVAbx <chr>, HCAP_Chemo <chr>, HCAP_diaylsis <chr>, HCAP_injury <chr>,
## #   PE_AMS <chr>, PE_HR <dbl>, PE_RR <dbl>, PE_BP_S <dbl>, PE_BP_D <dbl>,
## #   PE_temp <dbl>, PE_O2 <chr>, Lab_RBC <dbl>, Lab_Hb <dbl>, Lab_WBC <dbl>,
## #   Lab_NeuImu <dbl>, Lab_Neu <dbl>, Lab_plt <dbl>, Lab_Na <chr>,
## #   Lab_urea <dbl>, Lab_Cr <dbl>, Lab_Bicarb <dbl>, Lab_Sugar <dbl>,
## #   Lab_Alb <dbl>, Lab_lactate <dbl>, Lab_lactateHigh <chr>, Lab_CRP <dbl>,
## #   Lab_CRPHigh <chr>, Lab_pH <dbl>, Lab_CO2 <dbl>, Lab_O2 <dbl>,
## #   Lab_FiO2 <dbl>, CS_Resp <chr>, CS_Blood <chr>, CS_Urine <chr>,
## #   CS_screen <chr>, CS_agent <chr>, CS_Organism1 <chr>,
## #   CS_Organism1Blood <chr>, CS_Organism1Sputum <chr>,
## #   CS_Organism1Tracheal <chr>, CS_Organism1BAL <chr>, CS_Organism1Urine <chr>,
## #   CS_Organism1Sero <chr>, CS_Organism1Other <chr>,
## #   CS_Organism1Comments <chr>, CS_Organism2 <chr>, CS_Orgainsim2Blood <chr>,
## #   CS_Organism2Sputum <chr>, CS_Organism2Tracheal <chr>,
## #   CS_Organism2BAL <chr>, CS_Organism2Urine <chr>, CS_OrganismSero <chr>,
## #   CS_OrganismOther <chr>, CS_OrganismComments <chr>,
## #   Abx_AmoxicillinSulbactam <chr>, Abx_AmoxicillinSulbactamOral <chr>,
## #   Abx_AmoxicillinSulbactamNonoral <chr>,
## #   Abx_AmoxicillinSulbactamNonoralStart <chr>,
## #   Abx_AmoxicillinSulbactamNonoralEnd <chr>, Abx_Ampicillin <chr>,
## #   Abx_AmpicillinStart <chr>, Abx_AmpicillinEnd <chr>,
## #   Abx_AmpicillinSulbactam <chr>, Abx_Azithromycin <chr>,
## #   Abx_Ceftriaxone <chr>, Abx_Cefotaxime <chr>, Abx_ClarithromycinOral <chr>,
## #   ...

#assign unique case numbers 
df<-df %>% mutate(Pt_CaseNumber=1:nrow(df))
(df %>% pull(Pt_CaseNumber) %>%  n_distinct())

## [1] 2281

Age

There is no missing age and mostly elderly >60 years across all three sites developed CAP.

# any m/s age 
(eda_n_NAplt(df, "Pt_Age"))

# distribution of age 
(ggplot(df, aes(Pt_Age)) + geom_histogram(aes(fill=Pt_Site),alpha=.5,bins=round(sqrt(nrow(df)))/3)) +labs(title = "Distribution of Age") + facet_wrap(.~Pt_Site) +theme(legend.position="none") + geom_vline(xintercept = 60, size=1)

3 `R_` Radiology related category

dtype(df, "R")

## tibble [2,281 x 8] (S3: tbl_df/tbl/data.frame)
##  $ R_CXR_infiltrate  : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...
##  $ R_CXR_cavitation  : chr [1:2281] "No" "No" "Yes" "No" ...
##  $ R_CXR_effusion    : chr [1:2281] "No" "No" "No" "No" ...
##  $ R_CXR_effusionSite: chr [1:2281] NA NA NA NA ...
##  $ R_CT_inflitrate   : chr [1:2281] "Unavailable" "Unavailable" "Unavailable" "Unavailable" ...
##  $ R_CT_cavitation   : chr [1:2281] "Unavailable" "Unavailable" "Unavailable" "Unavailable" ...
##  $ R_CT_effusion     : chr [1:2281] "Unavailable" "Unavailable" "Unavailable" "Unavailable" ...
##  $ R_CT_effusionSite : chr [1:2281] NA NA NA NA ...

Effusion and effusion site

R_CXR_effusion and R_CT_effusion indicates if effusion was seen on the radiological imaging while R_CXR_effusionSite and R_CT_effusionSite records the location of the effusion site. These columns contain different facets of the same information; the values of these columns can be integrated into single columns.

On chest x-ray ( `R_CXR_effusion`, `R_CXR_effusionSite`)

2080 effusion sites on chest x-rays , R_CXR_effusionSite, were recorded as NA not because they are truly missing but because these x-rays have no effusion sites in the begin with. They will be relabelled as nil effusion sites. 43 effusions sites were recorded as NA and whether effusion sites were found the x-rays are unknown. These observations will retain their NA values. 9 effusion sites were recorded as NA but effusions were detected on x-rays. They will be relabelled as effusion but ? site.

After relabelling R_CXR_effusionSite using information from R_CXR_effusion, the latter column will be removed.

(table(df$R_CXR_effusion, df$R_CXR_effusionSite, useNA = "always"))

##              
##               Bilateral Left Right <NA>
##   No                  0    0     0 2080
##   Unavailable         0    0     0   43
##   Yes                27   52    70    9
##   <NA>                0    0     0    0

# relabel effusion sites
df<-df %>% mutate(R_CXR_effusionSite=case_when(
   R_CXR_effusion=='No' & is.na(R_CXR_effusionSite)~ "Nil", 
    R_CXR_effusion=='Unavailable' & is.na(R_CXR_effusionSite) ~ "Unavailable",
    R_CXR_effusion=="Yes" & is.na(R_CXR_effusionSite) ~"Effusion but ?site", 
    T~as.character(R_CXR_effusionSite)
  ))

(table(df$R_CXR_effusion, df$R_CXR_effusionSite, useNA = "always"))

##              
##               Bilateral Effusion but ?site Left  Nil Right Unavailable <NA>
##   No                  0                  0    0 2080     0           0    0
##   Unavailable         0                  0    0    0     0          43    0
##   Yes                27                  9   52    0    70           0    0
##   <NA>                0                  0    0    0     0           0    0

# remove R_CXR_effusion 
df<-df %>% select(- R_CXR_effusion)

On CT chest (`R_CT_effusion`, `R_CT_effusionSite`)

Repeat the same for effusion related variables on CT chest.

(table(df$R_CT_effusion, df$R_CT_effusionSite, useNA = "always"))

##              
##               Bilateral Left Right <NA>
##   No                  0    0     0   45
##   Unavailable         0    0     0 2205
##   Yes                15    6    10    0
##   <NA>                0    0     0    0

# relabel effusion site
df<-df %>% mutate(R_CT_effusionSite=case_when(
    R_CT_effusion=='No' & is.na(R_CT_effusionSite)~ "Nil", 
    R_CT_effusion=='Unavailable' & is.na(R_CT_effusionSite) ~ "Unavailable",
    T~as.character(R_CT_effusionSite)
  )) 

(table(df$R_CT_effusion, df$R_CT_effusionSite, useNA = "always"))

##              
##               Bilateral Left  Nil Right Unavailable <NA>
##   No                  0    0   45     0           0    0
##   Unavailable         0    0    0     0        2205    0
##   Yes                15    6    0    10           0    0
##   <NA>                0    0    0     0           0    0

#remove CT_effusion 
df<-df %>% select(-R_CT_effusion)

4 `SS_` Category related to signs and symptoms of CAP

99 days of respiratory symptoms SS_daysOfRespSymp are outliers and likely represents missing values Relabel 99 as NA. The usage of 99 occurs frequently for numeric variables in this dataset, more examples to follow in later sections.

(dtype(df, "SS"))

## tibble [2,281 x 6] (S3: tbl_df/tbl/data.frame)
##  $ SS_cough         : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...
##  $ SS_phlegm        : chr [1:2281] "No" "Yes" "Yes" "Yes" ...
##  $ SS_lungSounds    : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...
##  $ SS_temp          : chr [1:2281] "Yes" "Yes" "No" "Yes" ...
##  $ SS_breathing     : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...
##  $ SS_daysOfRespSymp: num [1:2281] 1 2 5 3 3 4 3 4 NA NA ...

## NULL

(eda_c(df, "SS"))

## $SS_cough
## .x
##   No  Yes <NA> 
##  145 2136    0 
## 
## $SS_phlegm
## .x
##   No  Yes <NA> 
##  536 1745    0 
## 
## $SS_lungSounds
## .x
##   No  Yes <NA> 
##  267 2014    0 
## 
## $SS_temp
## .x
##   No  Yes <NA> 
##  798 1483    0 
## 
## $SS_breathing
## .x
##   No  Yes <NA> 
##  532 1749    0 
## 
## $SS_daysOfRespSymp
## .x
##    0    1    2    3    4    5    6    7    8    9   10   12   13   14   15   20 
##    2  203  414  397  224  267   69  357   43    1   69    2    1    3   55   12 
##   21   22   23   30   99 <NA> 
##    1    1    1    6    4  149

df<-df %>% mutate(SS_daysOfRespSymp=na_if(SS_daysOfRespSymp, 99))

5 `Hx_` medical history category

(dtype(df, "Hx"))

## tibble [2,281 x 17] (S3: tbl_df/tbl/data.frame)
##  $ Hx_mass         : chr [1:2281] "No" "No" "No" "No" ...
##  $ Hx_heart        : chr [1:2281] "No" "No" "Yes" "Yes" ...
##  $ Hx_stroke       : chr [1:2281] "No" "No" "No" "No" ...
##  $ Hx_kidney       : chr [1:2281] "No" "No" "No" "No" ...
##  $ Hx_liver        : chr [1:2281] "No" "No" "No" "No" ...
##  $ Hx_brainMental  : chr [1:2281] "No" "No" "No" "No" ...
##  $ Hx_diabetes     : chr [1:2281] "No" "No" "No" "No" ...
##  $ Hx_pastCAP      : chr [1:2281] "No" "No" "No" "No" ...
##  $ Hx_asp          : chr [1:2281] "Yes" "No" "No" "No" ...
##  $ Hx_alcohol      : chr [1:2281] "No" "No" "No" "No" ...
##  $ Hx_immune       : chr [1:2281] "No" "No" "No" "No" ...
##  $ Hx_COPD         : chr [1:2281] "No" "Yes" "No" "No" ...
##  $ Hx_heart_type   : chr [1:2281] NA NA NA NA ...
##  $ Hx_HIV          : chr [1:2281] "No" "No" "No" "No" ...
##  $ Hx_HIV_CD4      : num [1:2281] NA NA NA NA NA NA NA NA NA NA ...
##  $ Hx_HIV_viralLoad: chr [1:2281] NA NA NA NA ...
##  $ Hx_HIV_Medicine : chr [1:2281] "Unavailable" "Unavailable" "Unavailable" "Unavailable" ...

## NULL

(eda_c(df,"Hx"))

## $Hx_mass
## .x
##        No Uncertain       Yes      <NA> 
##      2152         9       117         3 
## 
## $Hx_heart
## .x
##        No Uncertain       Yes      <NA> 
##      1273        19       983         6 
## 
## $Hx_stroke
## .x
##        No Uncertain       Yes      <NA> 
##      2105        11       161         4 
## 
## $Hx_kidney
## .x
##        No Uncertain       Yes      <NA> 
##      2110         7       156         8 
## 
## $Hx_liver
## .x
##        No Uncertain       Yes      <NA> 
##      2214         4        58         5 
## 
## $Hx_brainMental
## .x
##        No Uncertain       Yes      <NA> 
##      1907        28       341         5 
## 
## $Hx_diabetes
## .x
##        No Uncertain       Yes      <NA> 
##      1907        13       358         3 
## 
## $Hx_pastCAP
## .x
##        No Uncertain       Yes      <NA> 
##      1985         5       287         4 
## 
## $Hx_asp
## .x
##        No Uncertain       Yes      <NA> 
##      2202        15        59         5 
## 
## $Hx_alcohol
## .x
##        No Uncertain       Yes      <NA> 
##      2117        18       135        11 
## 
## $Hx_immune
## .x
##        No Uncertain       Yes      <NA> 
##      2130         4       139         8 
## 
## $Hx_COPD
## .x
##        No Uncertain       Yes      <NA> 
##      1879        50       343         9 
## 
## $Hx_heart_type
## .x
##              Arrhythmia                     CHF            Hypertension 
##                      23                      43                     220 
## Isquemic cardiomyopathy                   Other                    <NA> 
##                      20                       1                    1974 
## 
## $Hx_HIV
## .x
##          No Unavailable         Yes        <NA> 
##        2134         105          42           0 
## 
## $Hx_HIV_CD4
## .x
##   14   41   74   99  127  175  245  291  324  349  485  493  844 1104 <NA> 
##    1    1    1    3    1    1    1    1    1    1    1    1    1    1 2265 
## 
## $Hx_HIV_viralLoad
## .x
## < Detection limit > Detection limit             Value              <NA> 
##                 5                 3                 4              2269 
## 
## $Hx_HIV_Medicine
## .x
##          No Unavailable         Yes        <NA> 
##         596        1671          14           0

HIV details

Remove Hx_HIV_CD4 & Hx_HIV_viralLoad as there are too many NA. Remove Hx_HIV_Medicine as there are too many Unavailable.

df<-df %>% select(-contains("Hx_HIV_"))

Heart disease

Hx_heart indicates whether the patient has heart disease or not. Hx_heart_type provides details on the type of heart disease. Both the variables can be integrated into a variable.
1273 observations for heart disease details, Hx_heart_type were labelled as NA not because they are truly missing but because these patients have No heart disease. These values will be labelled as None. 19 observations for heart disease details were labelled as NA when the heart disease status is Uncertain. These values will be labelled as Query heart disease. 676 observations for heart disease details were labelled as NA but were classified to have a heart disease. As these patients have heart diseases but no details can be obtained, the NA values will be treated as Other types of heart disease.
After using Hx_heart to expand Hx_heart_type, Hx_heart will be dropped.

(table(df$Hx_heart, df$Hx_heart_type, useNA="always"))

##            
##             Arrhythmia  CHF Hypertension Isquemic cardiomyopathy Other <NA>
##   No                 0    0            0                       0     0 1273
##   Uncertain          0    0            0                       0     0   19
##   Yes               23   43          220                      20     1  676
##   <NA>               0    0            0                       0     0    6

# relabel
df<- df %>% mutate(Hx_heart_type=case_when(
  Hx_heart== "No" & is.na(Hx_heart_type) ~ "None",
  Hx_heart=="Uncertain" & is.na(Hx_heart_type) ~ "Query heart disease", 
  Hx_heart=="Yes" & is.na(Hx_heart_type) ~ "Other",
  T~ Hx_heart_type)) 

(table(df$Hx_heart, df$Hx_heart_type, useNA="always"))

##            
##             Arrhythmia  CHF Hypertension Isquemic cardiomyopathy None Other
##   No                 0    0            0                       0 1273     0
##   Uncertain          0    0            0                       0    0     0
##   Yes               23   43          220                      20    0   677
##   <NA>               0    0            0                       0    0     0
##            
##             Query heart disease <NA>
##   No                          0    0
##   Uncertain                  19    0
##   Yes                         0    0
##   <NA>                        0    6

# remove `hx_heart`
df <- df %>% select (-Hx_heart)

6 `Social_` social history category

(dtype(df, "Social"))

## tibble [2,281 x 4] (S3: tbl_df/tbl/data.frame)
##  $ Social_drugs         : chr [1:2281] "No" "No" "No" "No" ...
##  $ Social_overcrowded   : chr [1:2281] "No" "No" "No" "No" ...
##  $ Social_smoke         : chr [1:2281] "No" "Yes" "No" "No" ...
##  $ Social_smoke_duration: chr [1:2281] NA "Previous to the last 5 years" NA NA ...

## NULL

(eda_c(df, "Social"))

## $Social_drugs
## .x
##        No Uncertain       Yes      <NA> 
##      2263         3        10         5 
## 
## $Social_overcrowded
## .x
##        No Uncertain       Yes      <NA> 
##      2193         7        51        30 
## 
## $Social_smoke
## .x
##          No Unavailable         Yes        <NA> 
##        1276         175         830           0 
## 
## $Social_smoke_duration
## .x
##                      current          In the last 5 years 
##                          394                          200 
## Previous to the last 5 years                         <NA> 
##                          236                         1451

smoking

Social_smoke indicates if the patient smokes or not. Social_smoke_duration records how long the patients was smoking. These variables contain different facets of the same information; the values can be integrated into a single column.
1276 observations for smoking duration, Social_smoke_duration were labelled as NA but these values were not truly missing. These patients did not smoke. The values will be relabelled as non-smoker. 175 observations for smoking duration were labelled as NA but the information if they smoked was Unavailable. These values will be relabelled as Unavailable. All the patients who smoked had the duration of their smoking habit recorded. The bins of smoking duration were relabelled to terms that are more intuitive. current was relabelled as still smokes, In the last 5 years was relabelled as smoked in the last 5y, Previous to the last 5 years was relabelled as smoked >5y ago.
After using Social_smoke to expand Social_smoke_duration, Social_smoke is dropped.

(table(df$Social_smoke, df$Social_smoke_duration, useNA = "always"))

##              
##               current In the last 5 years Previous to the last 5 years <NA>
##   No                0                   0                            0 1276
##   Unavailable       0                   0                            0  175
##   Yes             394                 200                          236    0
##   <NA>              0                   0                            0    0

# relabel 
df<-df %>% mutate(Social_smoke=case_when(
  Social_smoke=="Yes" & Social_smoke_duration=="current"~ "still smokes", 
  Social_smoke=="Yes" & Social_smoke_duration=="In the last 5 years"~ "smoked in last 5y", 
  Social_smoke=="Yes" & Social_smoke_duration=="Previous to the last 5 years"~ "smoked >5y ago", 
  T~as.character(Social_smoke)
)) 

(df %>% count(Social_smoke))

## # A tibble: 5 x 2
##   Social_smoke          n
##   <chr>             <int>
## 1 No                 1276
## 2 smoked >5y ago      236
## 3 smoked in last 5y   200
## 4 still smokes        394
## 5 Unavailable         175

# remove 
df <- df %>% select(-Social_smoke_duration)

7 `HCAP_` healthcare associated pneumonia category

No data cleaning is needed for this category.

(dtype(df,"HCAP"))

## tibble [2,281 x 5] (S3: tbl_df/tbl/data.frame)
##  $ HCAP_hospStay: chr [1:2281] "Yes" "No" "No" "No" ...
##  $ HCAP_IVAbx   : chr [1:2281] "No" "No" "No" "No" ...
##  $ HCAP_Chemo   : chr [1:2281] "No" "No" "No" "No" ...
##  $ HCAP_diaylsis: chr [1:2281] "No" "No" "No" "No" ...
##  $ HCAP_injury  : chr [1:2281] "No" "No" "No" "No" ...

## NULL

(eda_c(df, "HCAP"))

## $HCAP_hospStay
## .x
##        No Uncertain       Yes      <NA> 
##      2039         3       236         3 
## 
## $HCAP_IVAbx
## .x
##        No Uncertain       Yes      <NA> 
##      2074         1       203         3 
## 
## $HCAP_Chemo
## .x
##        No Uncertain       Yes      <NA> 
##      2240         3        33         5 
## 
## $HCAP_diaylsis
## .x
##        No Uncertain       Yes      <NA> 
##      2234         1        41         5 
## 
## $HCAP_injury
## .x
##        No Uncertain       Yes      <NA> 
##      2206         3        63         9

8 `PE_` observations during physical examination category

(dtype(df, "PE"))

## tibble [2,281 x 7] (S3: tbl_df/tbl/data.frame)
##  $ PE_AMS : chr [1:2281] "No" "No" "No" "Unavailable" ...
##  $ PE_HR  : num [1:2281] 88 92 100 95 95 110 85 85 110 85 ...
##  $ PE_RR  : num [1:2281] 26 24 48 30 30 28 26 25 26 28 ...
##  $ PE_BP_S: num [1:2281] 100 110 140 140 120 140 100 120 120 100 ...
##  $ PE_BP_D: num [1:2281] 50 60 80 80 60 90 60 60 80 60 ...
##  $ PE_temp: num [1:2281] 38 38 36 37 36 39 37 37 40 37 ...
##  $ PE_O2  : chr [1:2281] NA NA NA NA ...

## NULL

# explore categorical 
(eda_c(df, "PE_AMS"))

## $PE_AMS
## .x
##          No Unavailable         Yes        <NA> 
##        1832          29         420           0

Oxygen levels, PE_O2 are calculated in the form of percentage. In this case, there is a mixture of pure numbers and numbers ending with % resulting in the variable to be treated as a character variable. % will be omitted and the variable will be converted to a numeric variable.

(eda_c(df, "PE_O2"))

## $PE_O2
## .x
##                100                 37                 55                 58 
##                  3                  1                  1                  1 
##                 60                 63                 65                 67 
##                  3                  1                  3                  1 
##                  7                 70                 73                 74 
##                  1                  9                  3                  6 
##                 75                 76                 77                 78 
##                  3                  2                  3                 10 
##                 79                 80 80.099999999999994                 82 
##                  4                 23                  1                 18 
##                 83                 84                 85               85 % 
##                 12                 17                 23                  2 
##                 86               86,7 86.599999999999994                 87 
##                 21                  1                  1                 19 
##                 88                88%               88,9 88.900000000000006 
##                 55                  1                  1                  1 
##                 89                 90                 91 91.299999999999997 
##                 50                116                 62                  1 
##                 92                92% 92.400000000000006                 93 
##                124                  1                  1                107 
##               93 % 93.400000000000006 93.599999999999994                 94 
##                  1                  1                  1                153 
##               94 % 94.299999999999997               94.5                 95 
##                  1                  1                  1                153 
##               95.5 95.609999999999999 95.900000000000006                 96 
##                  1                  1                  1                191 
##               96 %               96.5 96.700000000000003                 97 
##                  1                  1                  1                142 
##               97.5                 98                 99               <NA> 
##                  1                 97                 35                784

# clean  up `PE_O2` 
df<-df %>% mutate(PE_O2= as.numeric(str_replace_all(PE_O2,pattern="%", replacement = "")))

## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion

Missing `PE_` values

Now, all the PE_ variables are in numeric form and the proportion of NA can be appropriately calculated. Oxygen levels PE_O2 has the highest proportion of missing values, >30% values are missing. PE_O2 will be dropped.

(eda_n_NAplt(df,"PE"))

df<-df %>% select(-PE_O2)

Outlier `PE_` values

The following variables have unrealistic outliers:

Temperature, PE_temp. Outliers >50’C will be explored
Breathing rate, PE_RR. Outliers of >50 breaths per minute will be explored
Diastolic blood pressure, PE_BP_D. There is only one observation with a diastolic blood pressure >300, this observation will be removed.

pe_selected<-eda_n_NAcutoff(df, "PE", 0, 0.3)

(eda_n_outlier(df,pe_selected))

## Warning: All elements of `...` must be named.
## Did you want `data = c(values)`?

## Warning: Removed 176 rows containing non-finite values (stat_boxplot).

## Warning: Removed 201 rows containing non-finite values (stat_boxplot).

## Warning: Removed 167 rows containing non-finite values (stat_boxplot).

## Warning: Removed 168 rows containing non-finite values (stat_boxplot).

## Warning: Removed 167 rows containing non-finite values (stat_boxplot).

## NULL

df<-df %>% filter(PE_BP_D<300)

Further investigation of outliers

The 90-ish values stand out for both PE_temp and PE_RR. A frequency count will be done.

(df %>% filter(PE_temp>50) %>% ggplot(aes(PE_temp)) + geom_histogram(bins=round(sqrt(nrow(df)))))

(df %>% filter(PE_RR>50) %>% ggplot(aes(PE_RR)) + geom_histogram(bins=round(sqrt(nrow(df)))))

99 is the most common value. Similar to previous numeric variables where 99 is an outlier, it will be converted to NA. For PE_temp, the values 363 and 384, are likely missing a decimal point (It is more likely your body’s temperature is 36.3’C instead of 363’C) .

# PE_temp
(df %>% filter(PE_temp>50) %>% group_by(PE_temp) %>% summarise(n(), .groups="drop"))

## # A tibble: 6 x 2
##   PE_temp `n()`
##     <dbl> <int>
## 1      67     1
## 2      80     1
## 3      92     1
## 4      99    98
## 5     363     1
## 6     384     1

# PE_RR
(df %>% filter(PE_RR>50) %>% group_by(PE_RR) %>% summarise(n(), .groups="drop"))

## # A tibble: 10 x 2
##    PE_RR `n()`
##    <dbl> <int>
##  1    70     2
##  2    80     1
##  3    84     1
##  4    90     1
##  5    99   138
##  6   100     1
##  7   114     1
##  8   120     1
##  9   126     1
## 10   130     1

The rest of the outliers will take a plausible maximum value based on the 90th-95th percentile.

# 90-ish percentile 
(quantile(df$PE_temp, probs = seq(0,1,.05), na.rm = T))

##   0%   5%  10%  15%  20%  25%  30%  35%  40%  45%  50%  55%  60%  65%  70%  75% 
##   32   36   36   36   36   36   36   37   37   37   37   37   37   38   38   38 
##  80%  85%  90%  95% 100% 
##   38   38   39   40  384

(quantile(df$PE_RR, probs = seq(0,1,.05), na.rm = T))

##   0%   5%  10%  15%  20%  25%  30%  35%  40%  45%  50%  55%  60%  65%  70%  75% 
##    7   16   16   18   18   20   20   20   21   22   23   24   24   24   26   27 
##  80%  85%  90%  95% 100% 
##   28   30   35   99  130

# clean up PE_temp and PE_RR
df<-df %>% mutate(PE_temp=na_if(PE_temp, 99), PE_RR=na_if(PE_RR, 99),
          PE_temp=if_else(PE_temp==363, 36.3, PE_temp), PE_temp=if_else(PE_temp==384, 38.4, PE_temp),
          PE_temp=if_else(PE_temp>50, 40, PE_temp), PE_RR=if_else(PE_RR>50, 35, PE_RR))

To be continued….

Before this post becomes too long, EDA for the remaining categories

lab findings Lab_
cultures CS_
antibiotics Abx_
continiuum of care state Care_
vaccine V_

will be done in the next post.

save(df, file = "CAP_EDA1.RData")

Intro

Data dictionary

EDA blueprint

Outcome

Discard the noise

1 Other_ related category

2 Pt_ Patient related category

Appropriate patients

Case_number

Age

3 R_ Radiology related category

Effusion and effusion site

On chest x-ray ( R_CXR_effusion, R_CXR_effusionSite)

On CT chest (R_CT_effusion, R_CT_effusionSite)

4 SS_ Category related to signs and symptoms of CAP

5 Hx_ medical history category