It would appear that I may have done the violin plot a disservice in the last post. One, because of the way I plotted things, they were a little too small for one to really see what they offered in terms of data. Two, they in fact probably tell us more than the boxplot tells us. So, let’s have a bit more of a look at the violin plot before continuing on to other visualizations.

Violin Plot Revisited

Basic Violin Plot

Let’s start with a closer look at disease progression by sex using a violin plot. (Note: I am skipping all the setup and loading of the dataset. If you need help, see the previous post or download my notebook.)

In [12]:

boxp = pd.DataFrame(data=db_data, columns=['AGE', 'BMI', 'S3'])
sns.violinplot(x="SEX", y="Y", data=db_data, hue='SEX', orient='v', palette="Set3");

violin plot of disease progression by sex

That plot also contains the information provided by a box plot. The following image identifies all the box plot elements.

violin plot annotated with its box plot elements
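For anyone who wants the numbers behind those box plot elements, here is a minimal sketch of how they are computed. The values in `y` are made up (they are not from the diabetes dataset), and I am assuming the usual 1.5 × IQR whisker convention applies to the little box inside the violin.

```python
import numpy as np

# Made-up values standing in for one group's disease-progression scores.
y = np.array([25, 60, 85, 97, 110, 142, 150, 178, 200, 233, 270, 346], dtype=float)

# The box spans the first to third quartile; the marker inside it is the median.
q1, median, q3 = np.percentile(y, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme data points within 1.5 * IQR of the box
# (assumed convention; anything beyond would count as an outlier).
lower_whisker = y[y >= q1 - 1.5 * iqr].min()
upper_whisker = y[y <= q3 + 1.5 * iqr].max()

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {lower_whisker} to {upper_whisker}")
```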

But the violin plot also shows us the data distribution of the plotted features, something the box plot simply cannot do. The “thicker” the plot, the more probable the value.
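That “thickness” is a kernel density estimate of the data. The sketch below builds one by hand with numpy, just to show the idea. The sample is made up, the function name `gaussian_kde_1d` is my own, and the Scott's-rule bandwidth is an assumption about what a plotting library would pick by default, so treat it as an illustration rather than a copy of what seaborn does internally.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` at each grid point."""
    # One Gaussian bump per sample, centred on that sample, summed at each grid point.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

rng = np.random.default_rng(0)
# Made-up bimodal sample: a large group around 90 and a smaller one around 220.
samples = np.concatenate([rng.normal(90, 15, 200), rng.normal(220, 25, 100)])

grid = np.linspace(samples.min() - 60, samples.max() + 60, 400)
# Scott's rule of thumb for one-dimensional data (assumed default).
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)
density = gaussian_kde_1d(samples, grid, bandwidth)

# The density should integrate to roughly 1 over the grid.
area = density.sum() * (grid[1] - grid[0])
print(f"bandwidth={bandwidth:.1f}, area under curve={area:.3f}")
print(f"thickest point of the violin would be near y={grid[density.argmax()]:.0f}")
```

A box plot of this sample would show one box and give no hint of the second bump; the density curve makes it visible.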

A couple of the articles in the resource section included a diagram showing why the violin plot is more informative than the boxplot. See the “Conclusion” of Violin plots explained. You also might find Same Stats, Different Graphs of interest.

Display Variation

Here’s a variation on the above violin plot. Instead of the default 5-number summary, let’s just show the quartiles.

In [13]:

# let's look at some variations
ax = sns.violinplot(x="SEX", y="Y", inner='quartile', data=db_data)
ax.set_title('Distribution of disease progression', fontsize=16);

violin plot of disease progression by sex showing quartiles instead of 5 number summary

We can clearly see the medians are almost equal, as are the first quartiles. The upper quartile is a little higher for sex 2 than for sex 1, implying slightly more dispersed results for sex 2. And the overall distribution is very similar for both.
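Reading quartiles off a plot is a bit of an eyeball exercise, so it may be worth confirming with numbers. A minimal sketch, assuming a dataframe with `SEX` and `Y` columns like `db_data`; the `demo` frame and its values are made up so the snippet runs on its own.

```python
import pandas as pd

# Made-up stand-in for db_data with just the two columns we need.
demo = pd.DataFrame({
    "SEX": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Y":   [60, 90, 140, 200, 280, 55, 95, 140, 230, 310],
})

# Quartiles of Y per sex, one row per sex and one column per quartile.
quartiles = demo.groupby("SEX")["Y"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ["Q1", "median", "Q3"]

# The interquartile range is a simple measure of how dispersed each group is.
quartiles["IQR"] = quartiles["Q3"] - quartiles["Q1"]
print(quartiles)
```

With the real data, swapping `demo` for `db_data` should give the figures behind the dashed lines in the plot.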

Categorizing Continuous Attributes

Unfortunately, this kind of assessment only works when categorical features are present in the dataset. It would be nice if we had more than one. So, let’s add another one.

TSH

Let’s take that S4 attribute we briefly looked at in the last post and turn it into a categorical feature. We’ll break the continuous numerical values into 5 categories (see the code below). Then we’ll use a violin plot to compare the disease progression based on the new categories and sex.

In [16]:

# let's add categorical column based on TSH level
def lbl_tsh(row):
  if row['S4'] < 0.5:
    return "low"
  elif row['S4'] <= 2.0:
    return "ref low"
  elif row['S4'] <= 4.0:
    return "ref high"
  elif row['S4'] <= 6.0:
    return "high"
  else:
    return "very high"
db_data['tsh'] = db_data.apply(lbl_tsh, axis=1)

In [26]:

ax = sns.violinplot(x="tsh", y="Y", hue="SEX", split=True, data=db_data, order=['ref low', 'ref high', 'high', 'very high'])
ax.set_title('Distribution of progression by TSH level', fontsize=16);
plt.legend(loc='lower right');

violin plot of categorized S4, TSH, data by sex against disease progression

Okay. The median of disease progression appears to increase with TSH level. But given the overlapping distributions, that hardly seems conclusive.

The distributions for the two sexes are similar in the three higher classifications, but noticeably different for the sex 2 cases with TSH in the bottom half of the reference range. That seems to imply that, for sex 2 cases, TSH in the ref low class has little connection with disease progression. Though I suppose it could be the result of a small number of data points (I haven’t checked that just yet).
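One quick way to check the small-sample worry is to count the rows behind each half-violin. A minimal sketch, assuming `tsh` and `SEX` columns like the ones in `db_data`; the `demo` frame below is made up purely so the snippet runs on its own, and its counts say nothing about the real dataset.

```python
import pandas as pd

# Made-up stand-in for the 'tsh' and 'SEX' columns of db_data.
demo = pd.DataFrame({
    "tsh": ["ref low", "ref low", "ref high", "ref high", "ref high",
            "high", "high", "very high", "ref high", "high"],
    "SEX": [1, 2, 1, 1, 2, 2, 1, 2, 2, 1],
})

# Rows are TSH classes, columns are sexes, cells are the number of cases.
counts = pd.crosstab(demo["tsh"], demo["SEX"])

# Put the rows in the same order as the plot; absent classes would show up as 0.
counts = counts.reindex(["ref low", "ref high", "high", "very high"], fill_value=0)
print(counts)
```

A class with only a handful of cases in one column would suggest the shape of that half-violin should not be read too closely.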

BMI

Let’s try that with BMI. This time we’ll use four classifications.

In [23]:

# let's try classifying BMI
def lbl_bmi(row):
  if row['BMI'] < 18.5:
    return "underweight"
  elif row['BMI'] < 25.0:
    return "normal"
  elif row['BMI'] < 30.0:
    return "overweight"
  else:
    return "obese"
db_data['bmi_class'] = db_data.apply(lbl_bmi, axis=1)
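As an aside, pandas has a built-in for this kind of binning. The sketch below uses `pd.cut` with the same BMI cut-offs; `right=False` makes each bin include its lower edge and exclude its upper edge, which mirrors the `<` comparisons in `lbl_bmi`. The `demo` frame and its values are made up for illustration, this is an alternative and not what I actually ran in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up BMI values, including ones sitting exactly on the cut-offs.
demo = pd.DataFrame({"BMI": [17.9, 18.5, 24.9, 25.0, 29.9, 30.0, 41.3]})

bins = [-np.inf, 18.5, 25.0, 30.0, np.inf]
labels = ["underweight", "normal", "overweight", "obese"]

# right=False -> bins are [low, high), so 25.0 lands in "overweight", as in lbl_bmi.
demo["bmi_class"] = pd.cut(demo["BMI"], bins=bins, labels=labels, right=False)
print(demo)
```

The result is an ordered categorical column, so the classes carry their own ordering rather than relying on a separately typed-out list.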

In [28]:

ax = sns.violinplot(x="bmi_class", y="Y", hue="SEX", split=True, data=db_data, order=['underweight', 'normal', 'overweight', 'obese'])
ax.set_title('Distribution of progression by BMI level', fontsize=16);
plt.legend(loc='upper left');

violin plot of categorized BMI data by sex against disease progression

Wow, no cases of underweight sex 1 individuals. Again, median disease progression appears to increase in the overweight and obese classes. The medians are very similar for the two lower classifications (underweight and normal).

There also look to be slightly different distributions for the two sexes in the two higher classes. However, in the three higher classes, it would appear that the peaks (whether uni- or multi-modal) go up as the classification moves towards obesity. So there is likely a meaningful relationship between the two.

Short and Sweet

That was fun, but I think I will leave it there. Lots more to learn; but, perhaps, a step in the right direction.

I was planning to continue on with lmplots and the like; but, I think I will end this post now. Sometimes, short and sweet is a better option for learning/understanding than lengthy and intensive.

If you wish to play with the above, feel free to download my notebook covering the contents of this post. As mentioned in the last post, in order to keep file sizes down, I am clearing all output cells in the notebook before moving it to the web server. (Old school me, nothing bigger on the internet than it absolutely needs to be.)

Resources

seaborn.violinplot
Violin Plots 101: Visualizing Distribution and Probability Density
Violin plots explained
A Complete Guide to Violin Plots
5 reasons you should use a violin graph
Violin plots are great
How useful is the body mass index (BMI)?
Assessing Your Weight and Health Risk
Gender and socio-demographic distribution of body mass index: The nutrition transition in an adult Angolan community
Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing

Too Old To Code

Data Science Basics: Data Visualization, Part IV