Slow going, but hopefully we will get to an end of this subject in this post or the next.

I am not sure I made the best choice for a dataset to use for a discussion of visualization options and techniques. But, have decided to continue using it. Push comes to shove, I will try another.

Note of Caution

This attempt at visualizing a regression model did not go the way I expected. It did not, for me, provide a real step forward in understanding data visualization. But it did allow me to look at a couple of Seaborn plot functions. And I did learn a bit more about how they work, and what I might need to do to get what I think I want to see.

So, for the sake of recording my steps, I am going to proceed with the post. But, be warned, you may find it of little real value.

Setup

I am not going to list all the set up cells in this post. But I do intend to add the created categorical features we used in the last post after I open the dataset. Refer to previous posts or the associated notebook (link at the bottom of the post). Though not sure at this point that those categorical attributes will be of any value in the visualizations I will be looking at.

lmplot() vs regplot()

I had intended to focus on lmplot(), as I thought the ability to easily display the interactions of continuous attributes with categorical attributes was pretty useful. And, it is. But doesn’t always work as one might want. So, again contrary to what I thought, things did take a bit of a turn at one point.

And, one should note that lmplot() is just a wrapper around regplot() and FacetGrid. In fact lmplot() returns a FacetGrid object.
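A quick way to check this is to look at what each function returns. The following is only a minimal sketch on a small made-up DataFrame; the `toy` data and its column names are hypothetical and not the diabetes dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical toy data, only here to have something to plot
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(size=50)})
toy["y"] = 2 * toy["x"] + rng.normal(size=50)

ax = sns.regplot(x="x", y="y", data=toy)  # axes-level: returns a Matplotlib Axes
g = sns.lmplot(x="x", y="y", data=toy)    # figure-level: returns a seaborn FacetGrid

print(isinstance(ax, matplotlib.axes.Axes))
print(isinstance(g, sns.FacetGrid))
plt.close("all")
```

Because lmplot() creates its own figure, it has no ax= parameter, which is worth keeping in mind when trying to place it into existing subplots.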

The Basics

First a quick look at a few rows of the modified dataset.

In [5]:
# for reference
db_data.head()
Out[5]:
   AGE  SEX    BMI      BP   S1      S2     S3    S4    S5  S6    Y       tsh   bmi_class
0   59    2  32.10  101.00  157   93.20  38.00  4.00  4.86  87  151  ref high       obese
1   48    1  21.60   87.00  183  103.20  70.00  3.00  3.89  69   75  ref high      normal
2   72    2  30.50   93.00  156   93.60  41.00  4.00  4.67  85  141  ref high       obese
3   24    1  25.30   84.00  198  131.40  40.00  5.00  4.89  89  206      high  overweight
4   50    1  23.00  101.00  192  125.40  52.00  4.00  4.29  80  135  ref high      normal

BMI Attribute vs Disease Progression

Okay let’s start with a simple regplot() and lmplot() of the BMI attribute against disease progression.

Note the use of trailing ; to try and prevent the display of the output from the plot functions. But didn’t always seem to work.

In [6]:
# let's start with something simple
# I am going to show both regplot and lmplot
sns.regplot(x='BMI', y="Y", data=db_data);
sns.lmplot(x='BMI', y="Y", data=db_data);
regplot() of BMI vs disease progression (Y)
lmplot() of BMI vs disease progression (Y)

Other than the shape of the plot area, you will note that they are identical. And the shape/size? That is due to the fact that lmplot() creates a FacetGrid in which to display the plot. In this case the FacetGrid is using default values for most parameters, including grid sizes. The plot itself is, in fact, a regplot().

By default you get a scatter plot on top of which is placed a line with a shaded band. The line is a linear regression fit for the trend of disease progression with respect to the initial BMI of the patients in the dataset. The shaded band shows a 95% confidence interval for that fit. You may notice that the band narrows toward a point on the regression line; it is narrowest near the mean of the BMI values.
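To see why the interval pinches in near the mean, here is a rough sketch using the textbook standard-error formula for a fitted line. Two loud assumptions: the data is made up, and this analytic formula is a stand-in of my choosing. By default Seaborn estimates the band by bootstrap resampling rather than with this formula, though the shape of the result is similar.

```python
import numpy as np

# hypothetical data: a noisy straight line
rng = np.random.default_rng(1)
x = rng.uniform(18, 42, size=200)
y = 10 * x - 100 + rng.normal(scale=60, size=200)

# ordinary least squares fit of y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
n = x.size
s = np.sqrt(np.sum(resid**2) / (n - 2))  # residual standard error
sxx = np.sum((x - x.mean())**2)

# standard error of the fitted line at each point of a grid over x
grid = np.linspace(x.min(), x.max(), 201)
se_fit = s * np.sqrt(1 / n + (grid - x.mean())**2 / sxx)

narrowest = grid[np.argmin(se_fit)]
print(f"mean of x: {x.mean():.2f}, band is narrowest at x = {narrowest:.2f}")
```

The (grid - x.mean())**2 term is zero at the mean of x and grows in both directions, which is what gives the band its hourglass shape.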

A fit has been made, i.e. calculated. Given the distribution of the data points it may seem a mite iffy. But, it really does bear some resemblance to the data. Looking at it you can see a general increase in disease progression for higher BMI values. But, I don’t get the feeling that by itself it would be a particularly accurate predictor of disease progression.

And, there seem to be at least a few outlier data points. You may recall that the box and violin plots for BMI did show some outliers above the upper adjacent value.

S4 Attribute vs Disease Progression

Now what about our crazily distributed S4 attribute?

I am using jitter to improve? the display.

In [7]:
# note the use of jitter, try it without to see what happens
sns.lmplot(x='S4', y="Y", data=db_data, x_jitter=.075);
lmplot() of S4 attribute vs disease progression (Y)

As previously stated, with BMI you can certainly see that there might be a fit. But with S4 not so much. Even though the overall data does have a bit of a similar trend. At the bottom of the scatter plot, you can almost see disease progression increase with the value of S4 at a similar angle to that of the line showing the fit. Though not so much for the higher values of disease progression (Y).

A Different Look: x_estimator

From the lmplot() documentation:

x_estimator : callable that maps vector -> scalar, optional

Apply this function to each unique value of x and plot the resulting estimate. This is useful when x is a discrete variable. If x_ci is given, this estimate will be bootstrapped and a confidence interval will be drawn.

Well, seems to me that S4 is almost a discrete attribute. So, let’s have a look.

And, let’s take advantage of lmplot()’s ability to show us how the result varies by the sex of the patients.

We’ll start by just splitting the default lmplot() for comparison with the one using an estimator.

In [8]:
# let's try splitting by sex and using an estimator
# for comparison let's first display lmplot split by sex
sns.lmplot(x='S4', y="Y", data=db_data, col='SEX', hue='SEX', x_jitter=.075);
lmplot() of S4 attribute vs disease progression (Y)
In [9]:
# now split by sex and use an estimator (mean of Y for each unique S4 value)
sns.lmplot(x='S4', y="Y", data=db_data, x_estimator=np.mean, col='SEX', hue='SEX');
lmplot() of S4 attribute vs disease progression (Y)

The mean for each bin of data is represented by a dot, and the lines above and below it show the 95% confidence interval for that estimate.
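Roughly speaking, each dot and its error bar can be reproduced by grouping on the x value, taking the mean of Y, and bootstrapping that mean. The sketch below is my own approximation of the idea, not Seaborn's actual code. The helper `mean_with_ci` and the `toy` data are hypothetical.

```python
import numpy as np
import pandas as pd

def mean_with_ci(values, n_boot=1000, ci=95, seed=0):
    """Mean of `values` plus a percentile-bootstrap confidence interval."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # resample with replacement and take the mean of each resample
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    boot_means = values[idx].mean(axis=1)
    half = (100 - ci) / 2
    low, high = np.percentile(boot_means, [half, 100 - half])
    return values.mean(), low, high

# hypothetical data: a discrete x with very different group sizes
toy = pd.DataFrame({
    "x": [2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4],
    "y": [60, 250, 90, 110, 100, 95, 105, 120, 150, 170, 160, 180],
})

rows = []
for x_val, grp in toy.groupby("x"):
    m, low, high = mean_with_ci(grp["y"])
    rows.append({"x": x_val, "n": len(grp), "mean": m, "ci_low": low, "ci_high": high})
summary = pd.DataFrame(rows)
print(summary)
```

The group with only two widely separated points ends up with by far the widest interval, which is the same effect as a bin with very few observations in the plot.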

Note the large confidence interval for sex 2 at a S4 value of 2. Then look above at how few points are available for this mean (bin). Pretty much explains that confidence interval range.

In this case a goodly number of means are along the fitted regression line. Especially for sex 2. Not quite sure how to interpret that information, but it likely means something. Especially at higher S4 values.

I am guessing that the data points displayed on the plot simply didn’t fit into a suitable bin?

Do note that the regression estimate is fit to the original dataset, not to the binned means. The binning only affects the visual presentation of the observations in the dataset.
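A tiny made-up example shows why that distinction matters: a line fit to every observation and a line fit to the per-x means need not agree when the groups have different sizes. The numbers here are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# hypothetical data: three observations at x=1, one each at x=2 and x=3
x = np.array([1, 1, 1, 2, 3], dtype=float)
y = np.array([1, 2, 3, 5, 4], dtype=float)

# fit to every observation (what the regression line is based on)
slope_raw, _ = np.polyfit(x, y, 1)

# fit to the per-x means (NOT what is drawn; shown only for contrast)
x_unique = np.unique(x)
y_means = np.array([y[x == v].mean() for v in x_unique])
slope_means, _ = np.polyfit(x_unique, y_means, 1)

print(round(slope_raw, 3), round(slope_means, 3))  # → 1.25 1.0
```

The three points at x=1 pull the raw fit more strongly than the single mean that replaces them, so the two slopes differ.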

I did look at trying to reduce the extra points, but didn’t like any of the results I initially obtained. Don’t know enough to understand the consequences of the parameters I tried (x_bins and x_ci). That said, here’s a look at my attempt to use x_bins to control the plot. Which I do think worked better.

In [10]:
# just so we can see something different, let's try 40 bins for each
sns.lmplot(x='S4', y="Y", data=db_data, x_estimator=np.mean, col='SEX', hue='SEX', x_bins=40);
lmplot() of S4 attribute vs disease progression (Y)

Looks to be a greater variety in the data for sex 1, i.e. more means (bins) displayed.

And, if I specify the bin centres?

In [11]:
# now let's specify the bin centres explicitly (16 centres, 0.5 apart)
b_cntr = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]
sns.lmplot(x='S4', y="Y", data=db_data, x_estimator=np.mean, col='SEX', hue='SEX', x_bins=b_cntr);
lmplot() of S4 attribute vs disease progression (Y)
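The lmplot() documentation says that x_bins can be either a number of bins or the positions of the bin centres. My understanding, which is an assumption and not something confirmed in this post, is that each observation is then grouped with its nearest centre. A minimal sketch of that idea, with a hypothetical helper `assign_to_centres` and made-up S4 values:

```python
import numpy as np

def assign_to_centres(x, centres):
    """Replace each x by the nearest value in `centres` (ties go to the lower centre)."""
    x = np.asarray(x, dtype=float)
    centres = np.asarray(centres, dtype=float)
    dist = np.abs(x[:, None] - centres[None, :])  # distance of every x to every centre
    return centres[np.argmin(dist, axis=1)]

centres = np.arange(2.0, 10.0, 0.5)  # 2.0, 2.5, ..., 9.5, the same values as b_cntr
s4 = np.array([2.0, 3.1, 3.3, 4.74, 5.0, 6.55, 9.09])  # hypothetical S4 values
binned = assign_to_centres(s4, centres)
print(binned)
```

With centres every 0.5, values such as 3.1 and 3.3 land in different bins (3.0 and 3.5), so closely spaced centres can split what looks like one cluster of points.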

Personally, I think the last plot is the most useful. Good chance of overfitting the means if I use too many bins. And even with the apparent outliers, the means do seem to track the regression estimate reasonably well. Might be useful to try this with some of the other attributes.

Plots for All the Continuous Attributes

I thought I’d take a look at the lmplot() by sex for all the continuous attributes.

Attempt #1

Figured I could use the FacetGrid in a similar fashion to subplots in Matplotlib. No such luck.

I tried a variety of approaches to get the ‘display’ I was hoping for to no avail. I knew I wanted to keep the overall charts somewhat viewable, so broke the attributes into three batches.

In [12]:
# let's draw lmplot for each continuous variable with respect to sex
# I'll do this in batches for easier viewing
batch1 = ['AGE', 'BMI', 'BP']
batch2 = ['S1', 'S2', 'S3']
batch3 = ['S4', 'S5', 'S6']

It seems that when the plot function is called within a loop, the trailing semi-colon doesn’t prevent the return value being printed.

In [13]:
# couldn't get it to work with a facetgrid or with matplotlib subplots
for var in batch1:
  sns.lmplot(data=db_data, x=var, y='Y', col='SEX', hue='SEX', palette="Set2", height=4, aspect=1.3);
Out[13]:
<seaborn.axisgrid.FacetGrid at 0x227bc51ff10>
<seaborn.axisgrid.FacetGrid at 0x227bc03d610>
<seaborn.axisgrid.FacetGrid at 0x227bc04cdc0>
lmplot() of AGE attribute vs disease progression (Y) split on SEX
lmplot() of BMI attribute vs disease progression (Y) split on SEX
lmplot() of BP attribute vs disease progression (Y) split on SEX

Not what I wanted or was expecting. So, what happened?

From the seaborn.lmplot documentation: “The regplot() and lmplot() functions are closely related, but the former is an axes-level function while the latter is a figure-level function that combines regplot() and FacetGrid.”

Don’t think there is any way to get a single FacetGrid figure for the plots I wanted, at least not with the data in its current wide layout.
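One approach that might get closer, which was not tried against the notebook's data, is to reshape the data into long form with pandas melt() so that the attribute name itself becomes a faceting variable. This is only a sketch under that assumption. The `toy` DataFrame is a hypothetical stand-in that borrows the column names of db_data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical stand-in for db_data with the same column names
rng = np.random.default_rng(2)
n = 120
toy = pd.DataFrame({
    "SEX": rng.integers(1, 3, size=n),
    "AGE": rng.uniform(20, 80, size=n),
    "BMI": rng.uniform(18, 42, size=n),
    "BP": rng.uniform(60, 130, size=n),
})
toy["Y"] = 5 * toy["BMI"] + rng.normal(scale=40, size=n)

# long form: one row per (patient, attribute) pair
long = toy.melt(id_vars=["SEX", "Y"], value_vars=["AGE", "BMI", "BP"],
                var_name="attribute", value_name="value")

# one row of facets per attribute, one column per sex; x axes are not shared
g = sns.FacetGrid(long, row="attribute", col="SEX", hue="SEX",
                  sharex=False, margin_titles=True, height=3, aspect=1.3)
g.map_dataframe(sns.regplot, x="value", y="Y")
print(g.axes.shape)
plt.close("all")
```

The trade-off is that every facet shares the generic x label "value", so the row titles have to carry the attribute names.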

The plots for the other 2 batches can be viewed here: S1, S2, S3, S4, S5, S6

Attempt #2

So, back to Matplotlib subplots and regplot().

In [16]:
# took a bit of research, but sorted something
fig, axs = plt.subplots(3, 2, figsize=(24,20), sharey=True)
parameters = {'axes.labelsize': 18,
          'axes.titlesize': 22}
plt.rcParams.update(parameters)
y_s1 = db_data.loc[db_data['SEX'] == 1, "Y"]
y_s2 = db_data.loc[db_data['SEX'] == 2, "Y"]
# y_s1.head()
# y_s2.head()
sns.regplot(x=db_data.loc[db_data['SEX'] == 1, "AGE"], y=y_s1, color='g', ax=axs[0,0]);
# axs[0,0].set_title('SEX = 1', fontsize=18);
axs[0,0].set_title('SEX = 1');
# axs[0,0].set_ylabel("Y ", rotation="horizontal", fontsize="large");
axs[0,0].set_ylabel("Y ", rotation="horizontal");
sns.regplot(x=db_data.loc[db_data['SEX'] == 2, "AGE"], y=y_s2, color='orange', ax=axs[0,1]);
axs[0,1].set_title('SEX = 2');
sns.regplot(x=db_data.loc[db_data['SEX'] == 1, "BMI"], y=y_s1, color='g', ax=axs[1,0]);
sns.regplot(x=db_data.loc[db_data['SEX'] == 2, "BMI"], y=y_s2, color='orange', ax=axs[1,1]);
sns.regplot(x=db_data.loc[db_data['SEX'] == 1, "BP"], y=y_s1, color='g', ax=axs[2,0]);
sns.regplot(x=db_data.loc[db_data['SEX'] == 2, "BP"], y=y_s2, color='orange', ax=axs[2,1]);
axs[1,0].set_ylabel("Y ", rotation="horizontal");
axs[2,0].set_ylabel("Y ", rotation="horizontal");
regplot() of AGE, BMI, BP attributes vs disease progression (Y) split on SEX in subplots

That’s more like what I was expecting. A single figure with multiple plots in a suitable grid. In this case: one row for each attribute and a column for each value of the SEX categorical attribute.

Just so you know, I have reduced the size of the image for inclusion in this post. The version in the notebook comes out considerably larger and easier to read. (Hint?) Though it just kills me how big the images saved in the notebook are when formatted as data:image/png;base64. Wonder if using SVG would make them any smaller? Then, SVG may not be suitable for the underlying JSON used to store the notebooks.

Anyway, let’s do the other 2 batches.

In [17]:
fig1, axs1 = plt.subplots(len(batch2), 2, figsize=(24,20), sharey=True)
for i, var in enumerate(batch2):
  sns.regplot(x=db_data.loc[db_data['SEX'] == 1, var], y=y_s1, color='g', ax=axs1[i,0]);
  sns.regplot(x=db_data.loc[db_data['SEX'] == 2, var], y=y_s2, color='orange', ax=axs1[i,1]);
axs1[0,0].set_title('SEX = 1');
axs1[0,1].set_title('SEX = 2');
axs1[0,0].set_ylabel("Y ", rotation="horizontal");
axs1[1,0].set_ylabel("Y ", rotation="horizontal");
axs1[2,0].set_ylabel("Y ", rotation="horizontal");
regplot() of S1, S2, S3 attributes vs disease progression (Y) split on SEX in subplots

Note the code got a little tidier. And, in the next cell even tidier (at least I think so).

In [18]:
fig2, axs2 = plt.subplots(len(batch3), 2, figsize=(24,20), sharey=True)
for i, var in enumerate(batch3):
  sns.regplot(x=db_data.loc[db_data['SEX'] == 1, var], y=y_s1, color='g', ax=axs2[i,0]);
  sns.regplot(x=db_data.loc[db_data['SEX'] == 2, var], y=y_s2, color='orange', ax=axs2[i,1]);
  axs2[i,0].set_ylabel("Y ", rotation="horizontal");
axs2[0,0].set_title('SEX = 1');
axs2[0,1].set_title('SEX = 2');
regplot() of S4, S5, S6 attributes vs disease progression (Y) split on SEX in subplots
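Since the three batches differ only in the list of attributes, the loop could be wrapped in a small helper. The function below is a hypothetical sketch rather than code from the notebook. The name `regplot_grid`, its parameters and the toy data are all made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def regplot_grid(data, attrs, target="Y", by="SEX", colours=("g", "orange")):
    """One row per attribute, one column per value of `by`, a regplot in each cell."""
    levels = sorted(data[by].unique())
    fig, axs = plt.subplots(len(attrs), len(levels),
                            figsize=(6 * len(levels), 4 * len(attrs)),
                            sharey=True, squeeze=False)
    for i, attr in enumerate(attrs):
        for j, level in enumerate(levels):
            subset = data[data[by] == level]
            sns.regplot(x=attr, y=target, data=subset,
                        color=colours[j % len(colours)], ax=axs[i, j])
            if i == 0:
                axs[i, j].set_title(f"{by} = {level}")  # column titles on the top row only
    return fig, axs

# hypothetical stand-in for db_data
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "SEX": np.repeat([1, 2], 40),
    "S4": rng.uniform(2, 9, size=80),
    "S5": rng.uniform(3, 6, size=80),
    "S6": rng.uniform(60, 120, size=80),
})
toy["Y"] = 20 * toy["S5"] + rng.normal(scale=30, size=80)

fig, axs = regplot_grid(toy, ["S4", "S5", "S6"])
plt.close("all")
```

Passing data= and column names to regplot() also means the axis labels come from the column names, so the separate y_s1 and y_s2 series are no longer needed.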

Well those plots all look a lot alike to me. Not sure I am getting a lot of info from them. But am also, at the moment, not really looking all that hard.

Though it does seem to me that in most cases the slopes of the fit line for both sexes are similar. Notable exception being in the case of BMI.
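Judging slopes by eye is a bit shaky, and as far as I know regplot() does not hand back the fitted coefficients. One way to put numbers on the comparison is to fit each group separately with np.polyfit. The sketch below uses hypothetical data built from two exact lines so the expected slopes are obvious. With the real data you would group db_data the same way.

```python
import numpy as np
import pandas as pd

# hypothetical data built from two exact lines: Y = 2x + 1 and Y = 5x - 3
x = np.arange(10, dtype=float)
toy = pd.DataFrame({
    "SEX": [1] * 10 + [2] * 10,
    "BMI": np.concatenate([x, x]),
    "Y": np.concatenate([2 * x + 1, 5 * x - 3]),
})

slopes = {
    sex: np.polyfit(grp["BMI"], grp["Y"], 1)[0]  # [0] is the slope, [1] the intercept
    for sex, grp in toy.groupby("SEX")
}
for sex, slope in slopes.items():
    print(f"SEX = {sex}: slope = {slope:.2f}")
```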

lmplot() at Its Best

But let’s try something else. Mostly as an example of what can be done with lmplot().

For the first one, let’s try plotting BP against disease progression (Y) splitting the plots into columns based on bmi_class and rows based on SEX. Compare this code (1 line, if we ignore setting the font-scale) to that used to generate the above plots using Matplotlib and subplots.

In [19]:
sns.set(font_scale=1.4)
g = sns.lmplot(data=db_data, x="BP", y="Y", col="bmi_class", col_order=['underweight', 'normal', 'overweight', 'obese'], row="SEX", hue="SEX", facet_kws={'margin_titles':True})
lmplot() of BP attribute vs disease progression (Y) split on bmi_class and sex

Looks like the fit line for sex 2 has a steeper slope for the higher BMI classes. Which seems to fit with the BMI chart above. Always nice to get some corroboration from different views of an attribute.

Now let’s look at another example specifically using a FacetGrid() with regplot()s. Essentially what lmplot() does for us. Still only 2 lines of code if we ignore setting the font-scale.

In [20]:
# let's try that using a facetgrid and regplot()
sns.set(font_scale=1)
g = sns.FacetGrid(db_data, col="tsh", col_order=['ref low', 'ref high', 'high', 'very high'], row="SEX", hue="SEX", margin_titles=True)
g.map_dataframe(sns.regplot, x="BMI", y="Y");
FacetGrid of regplot()s of BMI attribute vs disease progression (Y) split on tsh and sex

And, again, the slopes for sex 2 at the higher TSH classifications are steeper than for sex 1.

Find some of those charts interesting, but I really don’t know what they are trying to tell me.

So, I think I am going to leave it here for now.

Done for Now

Started Machine Learning with Python-From Linear Models to Deep Learning (MITx) last week. The math is driving me nuts and taking a lot of time. Consequently I have to admit I was not really well focused on the subject at hand in this notebook. Sorry. Am also concerned I’ll be missing a post or two as a result.

That said, I do feel better about the post than when I started it a day or two ago. There may actually have been some real learning after all.

There are any number of plots I did not look at, and possibly should have. For example catplot(). Check out the seaborn API reference for a list of a goodly number of them. As well as many other potentially useful methods/objects.

If you wish to play with the above, feel free to download my notebook covering the contents of this post. As mentioned in the previous posts, in order to keep file sizes down, I am clearing all output cells in the notebook before moving to the web server or github. (Old school me, nothing bigger on the internet than absolutely need be. I started out with very slow telephone line modems.)

Resources