Last time I said I was probably done with data visualization. Saw an example using lmplot(). Seemed to have real value in the right situations. So, thought I might try to see if I could find a right situation. So, going to try visualizing the data in one or more toy datasets.

Many datasets contain multiple quantitative variables, and the goal of an analysis is often to relate those variables to each other.

Visualizing regression models

Keep in mind that Seaborn’s regplot() and lmplot() are used to visualize a linear relationship between two features via regression. regplot() is in fact a subset of lmplot() and their parameters are specified differently. More importantly lmplot() can be used to see how the relationship between two variables change as a function of a third variable. Potentially a source of considerable insight — one imagines.

Note: as I start this, I have no idea if it will actually be completed and posted. I do hope so, but…

As I have previously mentioned, I have been working on the Kaggle 30 Days of Machine Learning invitational competition (at the time I am writing this which is not the publish data). It was affecting my efforts to produce blog posts. I like to have several posts written and in the queue for publishing. Sadly, I am currently down to none, just the one for this coming Monday, September 6, 2021.

And, unfortunately, I did not get as much out of the competition, and related training, as I had hoped to learn. The concept was good. But, I personally needed a little more guidance/enlightenment during the competition. I don't mind trial and error, but just couldn't get motivated to do so for the competition. Possibly because I am not really competitive and was taking part hoping to learn more about machine learning.

Oh well…

Oh, yes. Just so you know, I have pretty much switched to mostly using Seaborn for visualizations. Though there will likely be times I use Matplotlib or perhaps the visualizations provided by pandas.

Diabetes Dataset

I was going to use the Scikit-Learn diabetes dataset (sklearn.datasets.load_diabetes). But it has been “normalized”. I was able to find what I think is the source data for the scikit version — diabetes data from the LARS paper. See: Boos-Stefanski Variable Selection Home.

Data Set Characteristics

Number of Instances: 442
Number of Attributes: First 10 columns are numeric predictive values
Target: Column 11, Y, is a quantitative measure of disease progression one year after baseline
Attribute Information:
- AGE age in years
- SEX
- BMI body mass index
- BP average blood pressure
- S1 tc, T-Cells (a type of white blood cells)
- S2 ldl, low-density lipoproteins
- S3 hdl, high-density lipoproteins
- S4 tch, thyroid stimulating hormone
- S5 ltg, lamotrigine
- S6 glu, blood sugar level

Let’s Set Things Up

Ok, let’s set up the notebook, load the dataset and have a quick look at it’s basic characteristics.

In [2]:

# let's set things up
from IPython.core.interactiveshell import InteractiveShell
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_diabetes
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
plt.style.use('default')
sns.set()
pd.options.display.float_format = '{:,.2f}'.format

In [3]:

# original data source: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt
db_loc = "./data/diabetes.tab.txt"
db_data = pd.read_csv(db_loc, sep='\t', header=(0))

In [4]:

print(type(db_data))
db_data.head()
db_data.describe()

Out[4]:

	AGE	SEX	BMI	BP	S1	S2	S3	S4	S5	S6	Y
0	59	2	32.10	101.00	157	93.20	38.00	4.00	4.86	87	151
1	48	1	21.60	87.00	183	103.20	70.00	3.00	3.89	69	75
2	72	2	30.50	93.00	156	93.60	41.00	4.00	4.67	85	141
3	24	1	25.30	84.00	198	131.40	40.00	5.00	4.89	89	206
4	50	1	23.00	101.00	192	125.40	52.00	4.00	4.29	80	135

Out[4]:

	AGE	SEX	BMI	BP	S1	S2	S3	S4	S5	S6	Y
count	442.00	442.00	442.00	442.00	442.00	442.00	442.00	442.00	442.00	442.00	442.00
mean	48.52	1.47	26.38	94.65	189.14	115.44	49.79	4.07	4.64	91.26	152.13
std	13.11	0.50	4.42	13.83	34.61	30.41	12.93	1.29	0.52	11.50	77.09
min	19.00	1.00	18.00	62.00	97.00	41.60	22.00	2.00	3.26	58.00	25.00
25%	38.25	1.00	23.20	84.00	164.25	96.05	40.25	3.00	4.28	83.25	87.00
50%	50.00	1.00	25.70	93.00	186.00	113.00	48.00	4.00	4.62	91.00	140.50
75%	59.00	2.00	29.27	105.00	209.75	134.50	57.75	5.00	5.00	98.00	211.50
max	79.00	2.00	42.20	133.00	301.00	242.40	99.00	9.09	6.11	124.00	346.00

In [5]:

db_data.dtypes
db_data.info()

Out[5]:

AGE int64 SEX int64 BMI float64 BP float64 S1 int64 S2 float64 S3 float64 S4 float64 S5 float64 S6 int64 Y int64

dtype: object

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
Column  Non-Null Count  Dtype

0   AGE     442 non-null    int64

1   SEX     442 non-null    int64

2   BMI     442 non-null    float64
3   BP      442 non-null    float64
4   S1      442 non-null    int64

5   S2      442 non-null    float64
6   S3      442 non-null    float64
7   S4      442 non-null    float64
8   S5      442 non-null    float64
9   S6      442 non-null    int64

10  Y       442 non-null    int64
dtypes: float64(6), int64(5)
memory usage: 38.1 KB

Some Basic Visualizations

We’ll start with a look at the distribution of the disease progression field (‘Y’ in our dataset).

In [6]:

# let's have a look at the target data
g = sns.displot(data=db_data, x='Y', kde=True, rug=True)
#g.set_title('Disease Progression Distribution');

plot of the distribution of the disease progression field

Now, we’ll split that up by the sex field.

In [7]:

# and lets look at that by sex, not that I know which is which
sns.displot(data=db_data, x="Y", kde=True, col="SEX");

plot of the distribution of the disease progression field by sex

Similar shapes for both sexes. But there seems to be less variablility in the numbers at each level of progression for sex = 2. The distribution for sex 1 looks to be much more skewed right.

I’d like to look at the influence of BMI on the progression value. But, the above approach won’t work given the continuous, rather than categorical, nature of the BMI feature. So, let’s look at distribution of BMI by sex first.

In [8]:

sns.displot(data=db_data, x="BMI", kde=True, col="SEX");

plot of the distribution of the BMI field by sex

Kind of funny. The BMI distribution for sex 1 looks similar to the progression distribution for sex 2. And vice versa. Though the distribution of BMI for sex 2 looks to be much less skewed that the distribution for outcome for sex 1.

Just so you know, I am guessing ‘sex 2’ is ‘male’.

Now let’s look at the distribtuion of blood pressures by sex.

In [9]:

# how about sex and blood pressure
sns.displot(data=db_data, x="BP", kde=True, col="SEX");

plot of the distribution of the blood pressure field by sex

And, back we go. The distribution of blood pressure by sex is much more similar to the distribution of disease progression than to BMI. I was actually expecting BP and BMI to be related and therefore look similar. But would seem to indicate that perhaps blood pressure is more influential in disease progression than is BMI.

But, that’s purely conjecture on my part. But one would think that similar looking distributions would indicate some meaningful relationship. But I have to admit that none of my reading so far has implied any such thing.

Now, we’ll plot the basic distributions for all the dataset attributes. I.E. ignoring the target which we looked at above.

In [10]:

fig=plt.figure(figsize=(20,20))
for i,col in enumerate(db_data.drop(['Y'],axis=1)):
    ax=fig.add_subplot(5,2,i+1);
    sns.histplot(db_data[col]);

distribution plots for all dataset attributes

Once again, not really sure what those plots tell us.

But, there appears to be some mild skew in a few of the features. With S4 (tch/tsh) showing the greatest skew and a very weird looking plot. Likely due to how the data was determined/collected/recorded. Could be why the scikit-learn data was mean centered and scaled by the standard deviation times the number of samples (i.e. the sum of squares of each column totals 1).

Only appears to be one categorical feature (Sex).

And, BMI, BP, S2 (ldl), S3 (hdl), S4 (tch/tsh) and S5 (ltg) have distributions that are the most similar to that of disease progression (the target value).

After a wee side step, we’ll take a look at the box plots for the continuous attributes.

Wee Aside — S4

Apparently, The American Board of Internal Medicine uses an adult TSH reference range of 0.5–4.0 mIU/L. How many values are outside this range?

In [23]:

# count by sex
c_sx = db_data.SEX.count()
c_s1 = db_data[db_data['SEX'] == 1.0].SEX.count()
print(f"sex 1: {c_s1}, sex 2: {c_sx - c_s1}\n")
s4 = db_data.S4.unique()
# sorted(s4)
print(f's4 unique: {len(s4)}')
# The American Board of Internal Medicine uses an adult TSH reference range of 0.5–4.0 mIU/L.
# how many values are outside this range
s4_low = db_data[db_data['S4'] < 0.5]
s4_high = db_data[db_data['S4'] > 4.0]
c_s4_h_all = len(s4_high)
print(f's4_low: {len(s4_low)}, s4 ok: {c_sx - len(s4_low) - len(s4_high)}, s4_high: {c_s4_h_all}\n')
# high by sex
#s4_h_1 = db_data[db_data['S4'] > 4.0 and db_data['SEX'] == 1.0]
s4_h_1 = db_data.query("S4 > 4.0 and SEX == 1.0")
c_s4_h_1 = len(s4_h_1)
print(f"s4 high sex 1: {c_s4_h_1}; s4 high sex 2: {c_s4_h_all - c_s4_h_1}")

sex 1: 235, sex 2: 207

s4 unique: 66 s4_low: 0, s4 ok: 286, s4_high: 156

s4 high sex 1: 56; s4 high sex 2: 100

Perhaps one could look at turning this into a categorical value? low, ok, high or -1, 0, 1

Box Plot

Now onto box plots. A quote from the Seaborn site to start.

I have split the attributes into groups so that the ranges on the box plots are similar for each group. When I did them all on one plot, some of the boxplots were unreadable.

I am adding the means (small green triangle) to the plots.

In [24]:

boxp = pd.DataFrame(data=db_data, columns=['AGE', 'BMI', 'S3'])
sns.boxplot(x="variable", y="value", data=pd.melt(boxp), showmeans=True);

Okay for the next two groupings let’s play with the colour scheme and change the orientation.

In [25]:

boxp2 = pd.DataFrame(data=db_data, columns=['BP', 'S1', 'S2', 'S6'])
sns.boxplot(x="value", y="variable", data=pd.melt(boxp2), showmeans=True, orient="h", palette="Set3");

In [26]:

boxp3 = pd.DataFrame(data=db_data, columns=['S4', 'S5'])
sns.boxplot(x="value", y="variable", data=pd.melt(boxp3), showmeans=True, palette="Set2");

I kind of like the horizontal display. But, either view does the job.

Clearly some outliers in the data for a few of the features. But on the other hand, there doesn’t appear to be a significant amount of skewness in most of the feature distributions.

Now, let’s look at a similar visual: the violin plot.

Violin Plot

For comparison purposes, I will show the boxplots next to the violin plots for each group. For the last row, I also used a scatterplot to show the observations for the attributes included in that grouping.

In [27]:

# display boxplot and violin plot side by side for comparison
#fig, axs = plt.subplots(3, 2, figsize=(16,16), sharey='row')
fig, axs = plt.subplots(3, 2, figsize=(16,16))
#axs[0,1].ylim(0, 120)
sns.violinplot(x="variable", y="value", data=pd.melt(boxp), showmeans=True, ax=axs[0,0]);
# yticks = axs[0, 0].get_yticks()
# print(yticks)
# axs[0, 1].set_yticks(yticks)
sns.boxplot(x="variable", y="value", data=pd.melt(boxp), showmeans=True, ax=axs[0,1]);
axs[0, 1].sharey(axs[0, 0])
# spent a bunch of time trying to get y-axis on top row to have same ylim and to show tick values
#   turned out to be simpler than I thought, also made x-axis on lower rows match up
#   don't know if this could be done someway easier in Seaborn, don't think so
#plt.setp(axs[0,1], xticklabels=[])
# axs[0, 0].tick_params(axis='both', which='both', labelsize=7, labelbottom=True)
# axs[0, 1].tick_params(axis='both', which='both', left=True, labelsize=7)
# for tick in axs[0, 1].get_yticklabels():
#     tick.set_visible(True)
sns.violinplot(x="value", y="variable", data=pd.melt(boxp2), showmeans=True, orient="h", palette="Set3", ax=axs[1,0]);
l1, r1 = axs[1, 0].get_xlim()
axs[1, 1].set_xlim(left=l1, right=r1)
sns.boxplot(x="value", y="variable", data=pd.melt(boxp2), showmeans=True, orient="h", palette="Set3", ax=axs[1,1]);
sns.violinplot(x="value", y="variable", data=pd.melt(boxp3), showmeans=True, palette="Set2", ax=axs[2,0]);
l2, r2 = axs[2, 0].get_xlim()
axs[2, 1].set_xlim(left=l2, right=r2)
sns.boxplot(x="value", y="variable", data=pd.melt(boxp3), showmeans=True, palette="Set2", ax=axs[2,1]);
# just for fun let's add a plot of the actual data to the last row
sns.stripplot(x="value", y="variable", data=pd.melt(boxp3), marker="o", alpha=0.3, color="black", ax=axs[2,0]);
sns.stripplot(x="value", y="variable", data=pd.melt(boxp3), marker="o", alpha=0.3, color="black", ax=axs[2,1]);

violin and box plots for the continuous attributes

I really at this point have no idea how to use/interpret the violin plots. And, perhaps, only slightly more about the value of the boxplots.

However, while doing some research, I came across the following quote.

Best Use Case for Boxplots

So, let’s have look at some possible comparisons. For example BMI and/or BP by sex; as we did with histograms above. I am using the notched plot type.

In [28]:

fig, axs = plt.subplots(2, 2, figsize=(10, 10))
cnt_x1 = db_data[db_data['SEX'] == 1].SEX.count()
cnt_x2 = db_data[db_data['SEX'] == 2].SEX.count()
print(f"sex 1: {cnt_x1}; sex 2: {cnt_x2}")
#sns.boxplot(x='BMI', y='Y', data=boxp3, notch=True, hue="SEX")
#sns.boxplot(x="value", y="variable", data=pd.melt(boxp2), showmeans=True, orient="h", palette="Set3", ax=axs[1,1]);
sns.boxplot(x="SEX", y="BMI", data=db_data, palette="Set3", notch=True, ax=axs[0, 0]);
sns.boxplot(x="SEX", y="BP", data=db_data, palette="Set3", notch=True, ax=axs[0, 1]);
sns.boxplot(x="SEX", y="AGE", data=db_data, palette="Set3", notch=True, ax=axs[1,0]);
sns.boxplot(x="SEX", y="S4", data=db_data, palette="Set3", notch=True, ax=axs[1,1]);
axs[1, 1].set_ylabel("TSH (tch)");

sex 1: 235; sex 2: 207

box plots for age, BMI, BP and S4 by sex

Ok! That is possibly more interesting.

The median BMI for both sexes is similar. And, as the notches overlap, we definitely can not say that the true medians are different. But sex 1 appears to have more variablility in BMI than sex 2. With the sex 2 distribution perhaps a bit more skewed.

With respect to blood pressure, sex 2 has a higher median pressure. And, as the notches do not overlap, we can conclude, with 95% confidence, that the true medians do differ. The variability between the sexes seems similar. Perhaps a bit more skew for sex 1.

Sex 2 has a higher median age. And the true median age for the sexes would appear to be different. Variability once again similar.

TSH certainly looks stranger. Would appear to very definitely be some skewness in the distribution for both sexes. Again variability similar. But the mean TSH for sex 2 definitely higher than for sex 1. Looks like almost 75% of sex 2 have a TSH higher tham 75% of sex 1.

Not sure this would help with any attempt to model the data, but—

Done for Now!

This post has taken me quite some time. Lot’s of hunting the web and reading as much as I could easily understand. Slightly better grasp of the value of boxplots. Lots more work to be done on understanding violin plots. And, never did get around to looking at any form of regression plots. Will tackle the latter in the next post.

And, if I manage to get a better understanding of violin plots before finishing the next post, will add whatever I had managed to learn.

If you wish to play with the above, feel free to download my notebook covering the contents of this post (well and a bit of what I was planning to work on).

Note: I had been saving the notebook with all the cells executed. All the images were saved in the notebook file as data:image/png;base64. The result was a notebook of ~975 KB. When I cleared all the output cells and saved the file it was now 14 KB. If you download it to play, you will need to execute the cells to see any results. But, the download will be relatively small. Sorry, I didn’t notice this before. The notebooks for other posts are quite large. I may try to correct that down the road.

Resources

seaborn.boxplot
seaborn.FacetGrid
matplotlib.axes.Axes.get_xlim
matplotlib.axes.Axes.set_xlim
matplotlib.axes.Axes.boxplot
Seaborn Tutorial in Python For Beginners
14 Data Visualization Plots of Seaborn
Data Visualization Tutorial Using Seaborn
How to use Seaborn Data Visualization for Machine Learning
DataCamp: Introduction to Data Visualization in Python, very complete? and lengthy
Seaborn Box Plot - Tutorial and Examples, nothing on why to use boxplot and what you learn but lots on how to display various ways
What a Boxplot Can Tell You about a Statistical Data Set
Understanding Boxplots
A Complete Guide to Box Plots
Interpreting data: boxplots and tables
Visualizing the box and whisker plot
What does a box plot tell you? (McLeod, S. A. (2019, July 19))

Too Old To Code

Data Science Basics: Data Visualization, Part III