Last time I said I was probably done with data visualization. Saw an example using lmplot()
. Seemed to have real value in the right situations. So, thought I might try to see if I could find a right situation. So, going to try visualizing the data in one or more toy datasets.
Many datasets contain multiple quantitative variables, and the goal of an analysis is often to relate those variables to each other.
Visualizing regression models
Keep in mind that Seaborn’s regplot() and lmplot() are used to visualize a linear relationship between two features via regression. regplot()
is in fact a subset of lmplot()
and their parameters are specified differently. More importantly lmplot()
can be used to see how the relationship between two variables change as a function of a third variable. Potentially a source of considerable insight — one imagines.
Oh, yes. Just so you know, I have pretty much switched to mostly using Seaborn for visualizations. Though there will likely be times I use Matplotlib or perhaps the visualizations provided by pandas.
Diabetes Dataset
I was going to use the Scikit-Learn diabetes dataset (sklearn.datasets.load_diabetes). But it has been “normalized”. I was able to find what I think is the source data for the scikit version — diabetes data from the LARS paper. See: Boos-Stefanski Variable Selection Home.
Data Set Characteristics
Number of Instances: 442
Number of Attributes: First 10 columns are numeric predictive values
Target: Column 11, Y, is a quantitative measure of disease progression one year after baseline
Attribute Information:
- AGE age in years
- SEX
- BMI body mass index
- BP average blood pressure
- S1 tc, T-Cells (a type of white blood cells)
- S2 ldl, low-density lipoproteins
- S3 hdl, high-density lipoproteins
- S4 tch, thyroid stimulating hormone
- S5 ltg, lamotrigine
- S6 glu, blood sugar level
Let’s Set Things Up
Ok, let’s set up the notebook, load the dataset and have a quick look at it’s basic characteristics.
# let's set things up
from IPython.core.interactiveshell import InteractiveShell
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_diabetes
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
plt.style.use('default')
sns.set()
pd.options.display.float_format = '{:,.2f}'.format
# original data source: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt
db_loc = "./data/diabetes.tab.txt"
db_data = pd.read_csv(db_loc, sep='\t', header=(0))
print(type(db_data))
db_data.head()
db_data.describe()
db_data.dtypes
db_data.info()
Some Basic Visualizations
We’ll start with a look at the distribution of the disease progression field (‘Y’ in our dataset).
# let's have a look at the target data
g = sns.displot(data=db_data, x='Y', kde=True, rug=True)
#g.set_title('Disease Progression Distribution');
Now, we’ll split that up by the sex field.
# and lets look at that by sex, not that I know which is which
sns.displot(data=db_data, x="Y", kde=True, col="SEX");
Similar shapes for both sexes. But there seems to be less variablility in the numbers at each level of progression for sex = 2. The distribution for sex 1 looks to be much more skewed right.
I’d like to look at the influence of BMI on the progression value. But, the above approach won’t work given the continuous, rather than categorical, nature of the BMI feature. So, let’s look at distribution of BMI by sex first.
sns.displot(data=db_data, x="BMI", kde=True, col="SEX");
Kind of funny. The BMI distribution for sex 1 looks similar to the progression distribution for sex 2. And vice versa. Though the distribution of BMI for sex 2 looks to be much less skewed that the distribution for outcome for sex 1.
Just so you know, I am guessing ‘sex 2’ is ‘male’.
Now let’s look at the distribtuion of blood pressures by sex.
# how about sex and blood pressure
sns.displot(data=db_data, x="BP", kde=True, col="SEX");
And, back we go. The distribution of blood pressure by sex is much more similar to the distribution of disease progression than to BMI. I was actually expecting BP and BMI to be related and therefore look similar. But would seem to indicate that perhaps blood pressure is more influential in disease progression than is BMI.
But, that’s purely conjecture on my part. But one would think that similar looking distributions would indicate some meaningful relationship. But I have to admit that none of my reading so far has implied any such thing.
Now, we’ll plot the basic distributions for all the dataset attributes. I.E. ignoring the target which we looked at above.
fig=plt.figure(figsize=(20,20))
for i,col in enumerate(db_data.drop(['Y'],axis=1)):
ax=fig.add_subplot(5,2,i+1);
sns.histplot(db_data[col]);
Once again, not really sure what those plots tell us.
But, there appears to be some mild skew in a few of the features. With S4 (tch/tsh) showing the greatest skew and a very weird looking plot. Likely due to how the data was determined/collected/recorded. Could be why the scikit-learn data was mean centered and scaled by the standard deviation times the number of samples (i.e. the sum of squares of each column totals 1).
Only appears to be one categorical feature (Sex).
And, BMI, BP, S2 (ldl), S3 (hdl), S4 (tch/tsh) and S5 (ltg) have distributions that are the most similar to that of disease progression (the target value).
After a wee side step, we’ll take a look at the box plots for the continuous attributes.
Wee Aside — S4
Apparently, The American Board of Internal Medicine uses an adult TSH reference range of 0.5–4.0 mIU/L. How many values are outside this range?
# count by sex
c_sx = db_data.SEX.count()
c_s1 = db_data[db_data['SEX'] == 1.0].SEX.count()
print(f"sex 1: {c_s1}, sex 2: {c_sx - c_s1}\n")
s4 = db_data.S4.unique()
# sorted(s4)
print(f's4 unique: {len(s4)}')
# The American Board of Internal Medicine uses an adult TSH reference range of 0.5–4.0 mIU/L.
# how many values are outside this range
s4_low = db_data[db_data['S4'] < 0.5]
s4_high = db_data[db_data['S4'] > 4.0]
c_s4_h_all = len(s4_high)
print(f's4_low: {len(s4_low)}, s4 ok: {c_sx - len(s4_low) - len(s4_high)}, s4_high: {c_s4_h_all}\n')
# high by sex
#s4_h_1 = db_data[db_data['S4'] > 4.0 and db_data['SEX'] == 1.0]
s4_h_1 = db_data.query("S4 > 4.0 and SEX == 1.0")
c_s4_h_1 = len(s4_h_1)
print(f"s4 high sex 1: {c_s4_h_1}; s4 high sex 2: {c_s4_h_all - c_s4_h_1}")
Perhaps one could look at turning this into a categorical value? low, ok, high or -1, 0, 1
Box Plot
Now onto box plots. A quote from the Seaborn site to start.
I have split the attributes into groups so that the ranges on the box plots are similar for each group. When I did them all on one plot, some of the boxplots were unreadable.
I am adding the means (small green triangle) to the plots.
boxp = pd.DataFrame(data=db_data, columns=['AGE', 'BMI', 'S3'])
sns.boxplot(x="variable", y="value", data=pd.melt(boxp), showmeans=True);
Okay for the next two groupings let’s play with the colour scheme and change the orientation.
boxp2 = pd.DataFrame(data=db_data, columns=['BP', 'S1', 'S2', 'S6'])
sns.boxplot(x="value", y="variable", data=pd.melt(boxp2), showmeans=True, orient="h", palette="Set3");
boxp3 = pd.DataFrame(data=db_data, columns=['S4', 'S5'])
sns.boxplot(x="value", y="variable", data=pd.melt(boxp3), showmeans=True, palette="Set2");
I kind of like the horizontal display. But, either view does the job.
Clearly some outliers in the data for a few of the features. But on the other hand, there doesn’t appear to be a significant amount of skewness in most of the feature distributions.
Now, let’s look at a similar visual: the violin plot.
Violin Plot
For comparison purposes, I will show the boxplots next to the violin plots for each group. For the last row, I also used a scatterplot to show the observations for the attributes included in that grouping.
# display boxplot and violin plot side by side for comparison
#fig, axs = plt.subplots(3, 2, figsize=(16,16), sharey='row')
fig, axs = plt.subplots(3, 2, figsize=(16,16))
#axs[0,1].ylim(0, 120)
sns.violinplot(x="variable", y="value", data=pd.melt(boxp), showmeans=True, ax=axs[0,0]);
# yticks = axs[0, 0].get_yticks()
# print(yticks)
# axs[0, 1].set_yticks(yticks)
sns.boxplot(x="variable", y="value", data=pd.melt(boxp), showmeans=True, ax=axs[0,1]);
axs[0, 1].sharey(axs[0, 0])
# spent a bunch of time trying to get y-axis on top row to have same ylim and to show tick values
# turned out to be simpler than I thought, also made x-axis on lower rows match up
# don't know if this could be done someway easier in Seaborn, don't think so
#plt.setp(axs[0,1], xticklabels=[])
# axs[0, 0].tick_params(axis='both', which='both', labelsize=7, labelbottom=True)
# axs[0, 1].tick_params(axis='both', which='both', left=True, labelsize=7)
# for tick in axs[0, 1].get_yticklabels():
# tick.set_visible(True)
sns.violinplot(x="value", y="variable", data=pd.melt(boxp2), showmeans=True, orient="h", palette="Set3", ax=axs[1,0]);
l1, r1 = axs[1, 0].get_xlim()
axs[1, 1].set_xlim(left=l1, right=r1)
sns.boxplot(x="value", y="variable", data=pd.melt(boxp2), showmeans=True, orient="h", palette="Set3", ax=axs[1,1]);
sns.violinplot(x="value", y="variable", data=pd.melt(boxp3), showmeans=True, palette="Set2", ax=axs[2,0]);
l2, r2 = axs[2, 0].get_xlim()
axs[2, 1].set_xlim(left=l2, right=r2)
sns.boxplot(x="value", y="variable", data=pd.melt(boxp3), showmeans=True, palette="Set2", ax=axs[2,1]);
# just for fun let's add a plot of the actual data to the last row
sns.stripplot(x="value", y="variable", data=pd.melt(boxp3), marker="o", alpha=0.3, color="black", ax=axs[2,0]);
sns.stripplot(x="value", y="variable", data=pd.melt(boxp3), marker="o", alpha=0.3, color="black", ax=axs[2,1]);
I really at this point have no idea how to use/interpret the violin plots. And, perhaps, only slightly more about the value of the boxplots.
However, while doing some research, I came across the following quote.
Best Use Case for Boxplots
So, let’s have look at some possible comparisons. For example BMI and/or BP by sex; as we did with histograms above. I am using the notched plot type.
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
cnt_x1 = db_data[db_data['SEX'] == 1].SEX.count()
cnt_x2 = db_data[db_data['SEX'] == 2].SEX.count()
print(f"sex 1: {cnt_x1}; sex 2: {cnt_x2}")
#sns.boxplot(x='BMI', y='Y', data=boxp3, notch=True, hue="SEX")
#sns.boxplot(x="value", y="variable", data=pd.melt(boxp2), showmeans=True, orient="h", palette="Set3", ax=axs[1,1]);
sns.boxplot(x="SEX", y="BMI", data=db_data, palette="Set3", notch=True, ax=axs[0, 0]);
sns.boxplot(x="SEX", y="BP", data=db_data, palette="Set3", notch=True, ax=axs[0, 1]);
sns.boxplot(x="SEX", y="AGE", data=db_data, palette="Set3", notch=True, ax=axs[1,0]);
sns.boxplot(x="SEX", y="S4", data=db_data, palette="Set3", notch=True, ax=axs[1,1]);
axs[1, 1].set_ylabel("TSH (tch)");
Ok! That is possibly more interesting.
The median BMI for both sexes is similar. And, as the notches overlap, we definitely can not say that the true medians are different. But sex 1 appears to have more variablility in BMI than sex 2. With the sex 2 distribution perhaps a bit more skewed.
With respect to blood pressure, sex 2 has a higher median pressure. And, as the notches do not overlap, we can conclude, with 95% confidence, that the true medians do differ. The variability between the sexes seems similar. Perhaps a bit more skew for sex 1.
Sex 2 has a higher median age. And the true median age for the sexes would appear to be different. Variability once again similar.
TSH certainly looks stranger. Would appear to very definitely be some skewness in the distribution for both sexes. Again variability similar. But the mean TSH for sex 2 definitely higher than for sex 1. Looks like almost 75% of sex 2 have a TSH higher tham 75% of sex 1.
Not sure this would help with any attempt to model the data, but—
Done for Now!
This post has taken me quite some time. Lot’s of hunting the web and reading as much as I could easily understand. Slightly better grasp of the value of boxplots. Lots more work to be done on understanding violin plots. And, never did get around to looking at any form of regression plots. Will tackle the latter in the next post.
And, if I manage to get a better understanding of violin plots before finishing the next post, will add whatever I had managed to learn.
If you wish to play with the above, feel free to download my notebook covering the contents of this post (well and a bit of what I was planning to work on).
Note: I had been saving the notebook with all the cells executed. All the images were saved in the notebook file as data:image/png;base64. The result was a notebook of ~975 KB. When I cleared all the output cells and saved the file it was now 14 KB. If you download it to play, you will need to execute the cells to see any results. But, the download will be relatively small. Sorry, I didn’t notice this before. The notebooks for other posts are quite large. I may try to correct that down the road.
Resources
- seaborn.boxplot
- seaborn.FacetGrid
- matplotlib.axes.Axes.get_xlim
- matplotlib.axes.Axes.set_xlim
- matplotlib.axes.Axes.boxplot
- Seaborn Tutorial in Python For Beginners
- 14 Data Visualization Plots of Seaborn
- Data Visualization Tutorial Using Seaborn
- How to use Seaborn Data Visualization for Machine Learning
- DataCamp: Introduction to Data Visualization in Python, very complete? and lengthy
- Seaborn Box Plot - Tutorial and Examples, nothing on why to use boxplot and what you learn but lots on how to display various ways
- What a Boxplot Can Tell You about a Statistical Data Set
- Understanding Boxplots
- A Complete Guide to Box Plots
- Interpreting data: boxplots and tables
- Visualizing the box and whisker plot
- What does a box plot tell you? (McLeod, S. A. (2019, July 19))