Using AssocBin to map complex data

Chris Salahub

The AssocBin vignette gives some guidance on how to control the core algorithm at the heart of the AssocBin package for applications to a single pair of variables, but the functionality of AssocBin does not stop there. The package also includes higher-level functionality which can apply recursive binning automatically to a complex data set. This vignette outlines how to run, interpret, and customize the inDep function that provides this high-level functionality.

library(AssocBin)

Let’s take a look at the heart disease data from the UCI machine learning data repository (https://archive.ics.uci.edu/dataset/45/heart+disease):

data(heart)
summary(heart)
##       age            sex               cp         trestbps          chol      
##  Min.   :28.00   female:194   atypical  :174   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:47.00   male  :726   non-angina:204   1st Qu.:120.0   1st Qu.:175.0  
##  Median :54.00                none      :496   Median :130.0   Median :223.0  
##  Mean   :53.51                typical   : 46   Mean   :132.1   Mean   :199.1  
##  3rd Qu.:60.00                                 3rd Qu.:140.0   3rd Qu.:268.0  
##  Max.   :77.00                                 Max.   :200.0   Max.   :603.0  
##                                                NA's   :59      NA's   :30     
##     fbs                 restecg       thalach        exang        
##  Mode :logical   hypertrophy:188   Min.   : 60.0   Mode :logical  
##  FALSE:692       normal     :551   1st Qu.:120.0   FALSE:528      
##  TRUE :138       stt        :179   Median :140.0   TRUE :337      
##  NA's :90        NA's       :  2   Mean   :137.5   NA's :55       
##                                    3rd Qu.:157.0                  
##                                    Max.   :202.0                  
##                                    NA's   :55                     
##     oldpeak         slope        ca              thal     num    
##  Min.   :-2.6000   down: 63   0   :181   normal    :196   0:411  
##  1st Qu.: 0.0000   flat:345   1   : 67   fixed     : 46   1:265  
##  Median : 0.5000   up  :203   2   : 41   reversable:192   2:109  
##  Mean   : 0.8788   NA's:309   3   : 20   NA's      :486   3:107  
##  3rd Qu.: 1.5000              NA's:611                    4: 28  
##  Max.   : 6.2000                                                 
##  NA's   :62                                                      
##     study          
##  Length:920        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

This example data has some missing values, and the variables slope, ca, and thal are particularly troublesome. Dropping these mostly-missing variables and keeping only the complete cases:

heartClean <- heart
heartClean$thal <- NULL
heartClean$ca <- NULL
heartClean$slope <- NULL
heartClean <- na.omit(heartClean)
str(heartClean)
## 'data.frame':    740 obs. of  12 variables:
##  $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 1 1 2 2 ...
##  $ cp      : Factor w/ 4 levels "atypical","non-angina",..: 4 3 3 2 1 1 3 3 3 3 ...
##  $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
##  $ restecg : Factor w/ 3 levels "hypertrophy",..: 1 1 1 2 1 2 1 2 1 1 ...
##  $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ num     : Factor w/ 5 levels "0","1","2","3",..: 1 3 2 1 1 1 4 1 3 2 ...
##  $ study   : chr  "cleveland" "cleveland" "cleveland" "cleveland" ...
##  - attr(*, "na.action")= 'omit' Named int [1:180] 306 331 335 338 348 369 376 379 385 390 ...
##   ..- attr(*, "names")= chr [1:180] "306" "331" "335" "338" ...

This leaves 740 observations of 12 variables to be binned. Running the binning process is straightforward. First, stop criteria are defined; next, the function inDep is called. This function performs pairwise binning of every variable pair in the data and returns an inDep object, which contains the binnings alongside summary information that allows us to assess the relationship within each pair.

## stop splitting a bin once it is 6 levels deep, is empty, or has an
## expected count of at most 10 observations under independence
stopCrits <- makeCriteria(depth >= 6, n < 1, expn <= 10)
assocs <- inDep(heartClean, stopCriteria = stopCrits)

The resulting object can be explored using plot and summary methods.

summary(assocs)
## All 66 pairs in heartClean recursively binned with type distribution: 
## 
##   factor:factor  factor:numeric numeric:numeric 
##              21              35              10 
## 35 pairs are significant at 5% and 33 pairs are significant at 1%
plot(assocs) # by default this displays the 5 strongest relationships

## other pairs can be displayed by specifying their indices; the binnings
## are returned in increasing order of p-value, so the indices chosen give
## the rank of the strength of a particular pair's relationship
plot(assocs, which = 6:10) # the next 5 strongest

plot(assocs, which = 62:66) # or the 5 weakest relationships

In the first two columns, points are jittered within categorical variable levels to reduce overplotting. The different categories are separated by dashed lines for visual clarity. Space is added in the first column for additional separation of groups, but this is condensed in the second column to reflect the appearance of the point configuration under ranks with random tie-breaking.
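To get a feel for this placement, here is a minimal sketch (with made-up data, not taken from the package) of how ranks with random tie-breaking spread the observations of a categorical variable along an axis:

## made-up factor data to illustrate ranks with random tie-breaking
x <- factor(c("a", "a", "a", "b", "b", "c"))
## ties within a level are broken at random, so each level's points are
## spread over a contiguous block of ranks in a random order
rank(as.numeric(x), ties.method = "random")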

The shadings in the final column of each of these plots highlight the cells with individually significant Pearson residuals under the null hypothesis. Asymptotically, the Pearson residuals are standard normal, and so values above the 0.975 quantile or below the 0.025 quantile of the standard normal distribution are shaded. Cells below the lower quantile (with significantly fewer observations than expected) are shaded blue, while those above the upper quantile (with significantly more observations than expected) are shaded red.
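As a rough sketch of this shading rule, suppose obs and expd hold the observed and expected counts for the cells of a single binning (the values here are invented for illustration):

obs <- c(25, 3, 12, 8)   # hypothetical observed cell counts
expd <- rep(12, 4)       # hypothetical expected counts under the null
pearson <- (obs - expd)/sqrt(expd) # approximately standard normal
pearson > qnorm(0.975)   # cells shaded red: significantly overfull
pearson < qnorm(0.025)   # cells shaded blue: significantly underfull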

The \(p\)-values reported are computed assuming the \(\chi^2\) statistic for each pair is \(\chi^2_{df}\)-distributed with \(df\) given by the number of cells less the number of constraints in estimation. By default, binning is therefore done at random to ensure this approximation is accurate. When other splitting methods are used, these default \(p\)-values are no longer accurate. Such splitting methods may still be desirable for their better plotting qualities, however, and so we can customize the bins by changing the splitting logic.
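In other words, the reported \(p\)-value for a pair amounts to an upper-tail \(\chi^2\) probability. The statistic, cell count, and constraint count below are hypothetical values chosen purely to illustrate the computation:

chiStat <- 35  # hypothetical chi-squared statistic for one pair
nCells <- 20   # hypothetical number of cells in the final binning
nConstr <- 4   # hypothetical number of constraints from estimation
pchisq(chiStat, df = nCells - nConstr, lower.tail = FALSE)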

## splitting functions that maximize the chi-squared score of each split
## rather than splitting at random
maxCatCon <- function(bn) uniMaxScoreSplit(bn, chiScores) # factor/numeric pairs
maxConCon <- function(bn) maxScoreSplit(bn, chiScores)    # numeric/numeric pairs
maxPairs <- inDep(data = heartClean, stopCriteria = stopCrits, 
                  catCon = maxCatCon, conCon = maxConCon)
summary(maxPairs)
## All 66 pairs in heartClean recursively binned with type distribution: 
## 
##   factor:factor  factor:numeric numeric:numeric 
##              21              35              10 
## 54 pairs are significant at 5% and 50 pairs are significant at 1%

The drastic change in the significance levels in the summary demonstrates how maximized \(\chi^2\) splitting inflates the apparent significance of pairs compared to the more accurate random splitting.

plot(maxPairs)

plot(maxPairs, which = 6:10)

Despite this change, we might expect both methods to detect the same strong relationships. We can check by comparing each pair's rank under random and maximized binning in a single plot.

## assocs$pairs is already sorted by p-value, so matching it against
## itself simply gives the ranks 1, 2, ... under random binning
randOrd <- match(assocs$pairs, assocs$pairs)
## locate each pair in the ordering produced by maximized binning
maxOrd <- match(assocs$pairs, maxPairs$pairs)
plot(randOrd, maxOrd, xlab = "Random rank", ylab = "Max chi rank", 
     main = "Rankings of pair significance between methods")

Indeed, both methods agree on the most important relationships in the data but tend to disagree about the more weakly related variables. Note that this plot was produced by matching the pairs ranked by \(p\)-value, and these \(p\)-values are only accurate for random binning.

Noting the strong impact of the study variable on these results, we might repeat this analysis using the Cleveland data alone.

heartCleve <- heartClean[heartClean$study == "cleveland", ]
heartCleve$study <- NULL
cleveAssoc <- inDep(heartCleve, stopCriteria = stopCrits, catCon = maxCatCon,
                    conCon = maxConCon)
summary(cleveAssoc)
## All 55 pairs in heartCleve recursively binned with type distribution: 
## 
##   factor:factor  factor:numeric numeric:numeric 
##              15              30              10 
## 28 pairs are significant at 5% and 21 pairs are significant at 1%
plot(cleveAssoc)

plot(cleveAssoc, which = 6:10)

plot(cleveAssoc, which = 11:15)