Last time I said I was probably done with data visualization. Saw an example using lmplot(). Seemed to have real value in the right situations. So, thought I might try to see if I could find a right situation. So, going to try visualizing the data in one or more toy datasets.

Many datasets contain multiple quantitative variables, and the goal of an analysis is often to relate those variables to each other.

Visualizing regression models

Keep in mind that Seaborn’s regplot() and lmplot() are used to visualize a linear relationship between two features via regression. regplot() is in fact a subset of lmplot() and their parameters are specified differently. More importantly lmplot() can be used to see how the relationship between two variables change as a function of a third variable. Potentially a source of considerable insight — one imagines.

Oh, yes. Just so you know, I have pretty much switched to mostly using Seaborn for visualizations. Though there will likely be times I use Matplotlib or perhaps the visualizations provided by pandas.

Diabetes Dataset

I was going to use the Scikit-Learn diabetes dataset (sklearn.datasets.load_diabetes). But it has been “normalized”. I was able to find what I think is the source data for the scikit version — diabetes data from the LARS paper. See: Boos-Stefanski Variable Selection Home.

Data Set Characteristics

  • Number of Instances: 442

  • Number of Attributes: First 10 columns are numeric predictive values

  • Target: Column 11, Y, is a quantitative measure of disease progression one year after baseline

  • Attribute Information:

    • AGE age in years
    • SEX
    • BMI body mass index
    • BP average blood pressure
    • S1 tc, T-Cells (a type of white blood cells)
    • S2 ldl, low-density lipoproteins
    • S3 hdl, high-density lipoproteins
    • S4 tch, thyroid stimulating hormone
    • S5 ltg, lamotrigine
    • S6 glu, blood sugar level

Let’s Set Things Up

Ok, let’s set up the notebook, load the dataset and have a quick look at it’s basic characteristics.

In [2]:
# let's set things up
from IPython.core.interactiveshell import InteractiveShell
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_diabetes

InteractiveShell.ast_node_interactivity = "all" %matplotlib inline plt.style.use('default') sns.set() pd.options.display.float_format = '{:,.2f}'.format

In [3]:
# original data source: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt
db_loc = "./data/diabetes.tab.txt"
db_data = pd.read_csv(db_loc, sep='\t', header=(0))
In [4]:
print(type(db_data))
db_data.head()
db_data.describe()
Out[4]:
AGESEXBMIBPS1S2S3S4S5S6Y
059232.10101.0015793.2038.004.004.8687151
148121.6087.00183103.2070.003.003.896975
272230.5093.0015693.6041.004.004.6785141
324125.3084.00198131.4040.005.004.8989206
450123.00101.00192125.4052.004.004.2980135
Out[4]:
AGESEXBMIBPS1S2S3S4S5S6Y
count442.00442.00442.00442.00442.00442.00442.00442.00442.00442.00442.00
mean48.521.4726.3894.65189.14115.4449.794.074.6491.26152.13
std13.110.504.4213.8334.6130.4112.931.290.5211.5077.09
min19.001.0018.0062.0097.0041.6022.002.003.2658.0025.00
25%38.251.0023.2084.00164.2596.0540.253.004.2883.2587.00
50%50.001.0025.7093.00186.00113.0048.004.004.6291.00140.50
75%59.002.0029.27105.00209.75134.5057.755.005.0098.00211.50
max79.002.0042.20133.00301.00242.4099.009.096.11124.00346.00
In [5]:
db_data.dtypes
db_data.info()
Out[5]:
AGE      int64
SEX      int64
BMI    float64
BP     float64
S1       int64
S2     float64
S3     float64
S4     float64
S5     float64
S6       int64
Y        int64

dtype: object

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):

Column Non-Null Count Dtype


0 AGE 442 non-null int64
1 SEX 442 non-null int64
2 BMI 442 non-null float64 3 BP 442 non-null float64 4 S1 442 non-null int64
5 S2 442 non-null float64 6 S3 442 non-null float64 7 S4 442 non-null float64 8 S5 442 non-null float64 9 S6 442 non-null int64
10 Y 442 non-null int64

dtypes: float64(6), int64(5) memory usage: 38.1 KB

Some Basic Visualizations

We’ll start with a look at the distribution of the disease progression field (‘Y’ in our dataset).

In [6]:
# let's have a look at the target data
g = sns.displot(data=db_data, x='Y', kde=True, rug=True)
#g.set_title('Disease Progression Distribution');
plot of the distribution of the disease progression field

Now, we’ll split that up by the sex field.

In [7]:
# and lets look at that by sex, not that I know which is which
sns.displot(data=db_data, x="Y", kde=True, col="SEX");
plot of the distribution of the disease progression field by sex

Similar shapes for both sexes. But there seems to be less variablility in the numbers at each level of progression for sex = 2. The distribution for sex 1 looks to be much more skewed right.

I’d like to look at the influence of BMI on the progression value. But, the above approach won’t work given the continuous, rather than categorical, nature of the BMI feature. So, let’s look at distribution of BMI by sex first.

In [8]:
sns.displot(data=db_data, x="BMI", kde=True, col="SEX");
plot of the distribution of the BMI field by sex

Kind of funny. The BMI distribution for sex 1 looks similar to the progression distribution for sex 2. And vice versa. Though the distribution of BMI for sex 2 looks to be much less skewed that the distribution for outcome for sex 1.

Just so you know, I am guessing ‘sex 2’ is ‘male’.

Now let’s look at the distribtuion of blood pressures by sex.

In [9]:
# how about sex and blood pressure
sns.displot(data=db_data, x="BP", kde=True, col="SEX");
plot of the distribution of the blood pressure field by sex

And, back we go. The distribution of blood pressure by sex is much more similar to the distribution of disease progression than to BMI. I was actually expecting BP and BMI to be related and therefore look similar. But would seem to indicate that perhaps blood pressure is more influential in disease progression than is BMI.

But, that’s purely conjecture on my part. But one would think that similar looking distributions would indicate some meaningful relationship. But I have to admit that none of my reading so far has implied any such thing.

Now, we’ll plot the basic distributions for all the dataset attributes. I.E. ignoring the target which we looked at above.

In [10]:
fig=plt.figure(figsize=(20,20))
for i,col in enumerate(db_data.drop(['Y'],axis=1)):
    ax=fig.add_subplot(5,2,i+1);
    sns.histplot(db_data[col]);
distribution plots for all dataset attributes

Once again, not really sure what those plots tell us.

But, there appears to be some mild skew in a few of the features. With S4 (tch/tsh) showing the greatest skew and a very weird looking plot. Likely due to how the data was determined/collected/recorded. Could be why the scikit-learn data was mean centered and scaled by the standard deviation times the number of samples (i.e. the sum of squares of each column totals 1).

Only appears to be one categorical feature (Sex).

And, BMI, BP, S2 (ldl), S3 (hdl), S4 (tch/tsh) and S5 (ltg) have distributions that are the most similar to that of disease progression (the target value).

After a wee side step, we’ll take a look at the box plots for the continuous attributes.

Wee Aside — S4

Apparently, The American Board of Internal Medicine uses an adult TSH reference range of 0.5–4.0 mIU/L. How many values are outside this range?

In [23]:
# count by sex
c_sx = db_data.SEX.count()
c_s1 = db_data[db_data['SEX'] == 1.0].SEX.count()
print(f"sex 1: {c_s1}, sex 2: {c_sx - c_s1}\n")

s4 = db_data.S4.unique() # sorted(s4) print(f's4 unique: {len(s4)}') # The American Board of Internal Medicine uses an adult TSH reference range of 0.5–4.0 mIU/L. # how many values are outside this range s4_low = db_data[db_data['S4'] < 0.5] s4_high = db_data[db_data['S4'] > 4.0] c_s4_h_all = len(s4_high) print(f's4_low: {len(s4_low)}, s4 ok: {c_sx - len(s4_low) - len(s4_high)}, s4_high: {c_s4_h_all}\n')

# high by sex #s4_h_1 = db_data[db_data['S4'] > 4.0 and db_data['SEX'] == 1.0] s4_h_1 = db_data.query("S4 > 4.0 and SEX == 1.0") c_s4_h_1 = len(s4_h_1) print(f"s4 high sex 1: {c_s4_h_1}; s4 high sex 2: {c_s4_h_all - c_s4_h_1}")

sex 1: 235, sex 2: 207

s4 unique: 66 s4_low: 0, s4 ok: 286, s4_high: 156

s4 high sex 1: 56; s4 high sex 2: 100

Perhaps one could look at turning this into a categorical value? low, ok, high or -1, 0, 1

Box Plot

Now onto box plots. A quote from the Seaborn site to start.

I have split the attributes into groups so that the ranges on the box plots are similar for each group. When I did them all on one plot, some of the boxplots were unreadable.

I am adding the means (small green triangle) to the plots.

In [24]:
boxp = pd.DataFrame(data=db_data, columns=['AGE', 'BMI', 'S3'])
sns.boxplot(x="variable", y="value", data=pd.melt(boxp), showmeans=True);
box plots for age, BMI, S3

Okay for the next two groupings let’s play with the colour scheme and change the orientation.

In [25]:
boxp2 = pd.DataFrame(data=db_data, columns=['BP', 'S1', 'S2', 'S6'])
sns.boxplot(x="value", y="variable", data=pd.melt(boxp2), showmeans=True, orient="h", palette="Set3");
box plots for BP, S1, S2, S6
In [26]:
boxp3 = pd.DataFrame(data=db_data, columns=['S4', 'S5'])
sns.boxplot(x="value", y="variable", data=pd.melt(boxp3), showmeans=True, palette="Set2");
box plots for S4, S5

I kind of like the horizontal display. But, either view does the job.

Clearly some outliers in the data for a few of the features. But on the other hand, there doesn’t appear to be a significant amount of skewness in most of the feature distributions.

Now, let’s look at a similar visual: the violin plot.

Violin Plot

For comparison purposes, I will show the boxplots next to the violin plots for each group. For the last row, I also used a scatterplot to show the observations for the attributes included in that grouping.

In [27]:
# display boxplot and violin plot side by side for comparison
#fig, axs = plt.subplots(3, 2, figsize=(16,16), sharey='row')
fig, axs = plt.subplots(3, 2, figsize=(16,16))
#axs[0,1].ylim(0, 120)
sns.violinplot(x="variable", y="value", data=pd.melt(boxp), showmeans=True, ax=axs[0,0]);
# yticks = axs[0, 0].get_yticks()
# print(yticks)
# axs[0, 1].set_yticks(yticks)
sns.boxplot(x="variable", y="value", data=pd.melt(boxp), showmeans=True, ax=axs[0,1]);
axs[0, 1].sharey(axs[0, 0])
# spent a bunch of time trying to get y-axis on top row to have same ylim and to show tick values
#   turned out to be simpler than I thought, also made x-axis on lower rows match up
#   don't know if this could be done someway easier in Seaborn, don't think so
#plt.setp(axs[0,1], xticklabels=[])
# axs[0, 0].tick_params(axis='both', which='both', labelsize=7, labelbottom=True)
# axs[0, 1].tick_params(axis='both', which='both', left=True, labelsize=7)
# for tick in axs[0, 1].get_yticklabels():
#     tick.set_visible(True)

sns.violinplot(x="value", y="variable", data=pd.melt(boxp2), showmeans=True, orient="h", palette="Set3", ax=axs[1,0]); l1, r1 = axs[1, 0].get_xlim() axs[1, 1].set_xlim(left=l1, right=r1) sns.boxplot(x="value", y="variable", data=pd.melt(boxp2), showmeans=True, orient="h", palette="Set3", ax=axs[1,1]);

sns.violinplot(x="value", y="variable", data=pd.melt(boxp3), showmeans=True, palette="Set2", ax=axs[2,0]); l2, r2 = axs[2, 0].get_xlim() axs[2, 1].set_xlim(left=l2, right=r2) sns.boxplot(x="value", y="variable", data=pd.melt(boxp3), showmeans=True, palette="Set2", ax=axs[2,1]);

# just for fun let's add a plot of the actual data to the last row sns.stripplot(x="value", y="variable", data=pd.melt(boxp3), marker="o", alpha=0.3, color="black", ax=axs[2,0]); sns.stripplot(x="value", y="variable", data=pd.melt(boxp3), marker="o", alpha=0.3, color="black", ax=axs[2,1]);

violin and box plots for the continuous attributes

I really at this point have no idea how to use/interpret the violin plots. And, perhaps, only slightly more about the value of the boxplots.

However, while doing some research, I came across the following quote.

Best Use Case for Boxplots

So, let’s have look at some possible comparisons. For example BMI and/or BP by sex; as we did with histograms above. I am using the notched plot type.

In [28]:
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
cnt_x1 = db_data[db_data['SEX'] == 1].SEX.count()
cnt_x2 = db_data[db_data['SEX'] == 2].SEX.count()
print(f"sex 1: {cnt_x1}; sex 2: {cnt_x2}")
#sns.boxplot(x='BMI', y='Y', data=boxp3, notch=True, hue="SEX")
#sns.boxplot(x="value", y="variable", data=pd.melt(boxp2), showmeans=True, orient="h", palette="Set3", ax=axs[1,1]);
sns.boxplot(x="SEX", y="BMI", data=db_data, palette="Set3", notch=True, ax=axs[0, 0]);
sns.boxplot(x="SEX", y="BP", data=db_data, palette="Set3", notch=True, ax=axs[0, 1]);
sns.boxplot(x="SEX", y="AGE", data=db_data, palette="Set3", notch=True, ax=axs[1,0]);
sns.boxplot(x="SEX", y="S4", data=db_data, palette="Set3", notch=True, ax=axs[1,1]);
axs[1, 1].set_ylabel("TSH (tch)");
sex 1: 235; sex 2: 207
box plots for age, BMI, BP and S4 by sex

Ok! That is possibly more interesting.

The median BMI for both sexes is similar. And, as the notches overlap, we definitely can not say that the true medians are different. But sex 1 appears to have more variablility in BMI than sex 2. With the sex 2 distribution perhaps a bit more skewed.

With respect to blood pressure, sex 2 has a higher median pressure. And, as the notches do not overlap, we can conclude, with 95% confidence, that the true medians do differ. The variability between the sexes seems similar. Perhaps a bit more skew for sex 1.

Sex 2 has a higher median age. And the true median age for the sexes would appear to be different. Variability once again similar.

TSH certainly looks stranger. Would appear to very definitely be some skewness in the distribution for both sexes. Again variability similar. But the mean TSH for sex 2 definitely higher than for sex 1. Looks like almost 75% of sex 2 have a TSH higher tham 75% of sex 1.

Not sure this would help with any attempt to model the data, but—

Done for Now!

This post has taken me quite some time. Lot’s of hunting the web and reading as much as I could easily understand. Slightly better grasp of the value of boxplots. Lots more work to be done on understanding violin plots. And, never did get around to looking at any form of regression plots. Will tackle the latter in the next post.

And, if I manage to get a better understanding of violin plots before finishing the next post, will add whatever I had managed to learn.

If you wish to play with the above, feel free to download my notebook covering the contents of this post (well and a bit of what I was planning to work on).

Note: I had been saving the notebook with all the cells executed. All the images were saved in the notebook file as data:image/png;base64. The result was a notebook of ~975 KB. When I cleared all the output cells and saved the file it was now 14 KB. If you download it to play, you will need to execute the cells to see any results. But, the download will be relatively small. Sorry, I didn’t notice this before. The notebooks for other posts are quite large. I may try to correct that down the road.

Resources