Colour-intensity scales

In this tutorial we will look at how to use colours in the Sankey diagram. We have already seen how to use a palette, but in this tutorial we will also create a Sankey where the intensity of the colour is proportional to a numerical value.

The first step is to import all the required packages and data:

In [1]:
import pandas as pd
import numpy as np
from floweaver import *

df1 = pd.read_csv('holiday_data.csv')

Now take a look at the dataset we are using. This is a very insightful [made-up] dataset about how different types of people lose weight while on holiday enjoying themselves.

In [2]:
dataset = Dataset(df1)
df1
Out[2]:
    source    target          Calories Burnt  Enjoyment  Employment Job  Activity
 0  Activity  Employment Job             2.5         35  Student         Reading
 1  Activity  Employment Job             4.5         20  Student         Swimming
 2  Activity  Employment Job             8.0          5  Student         Sleeping
 3  Activity  Employment Job             1.0          5  Student         Travelling
 4  Activity  Employment Job             8.0         30  Student         Working out
 5  Activity  Employment Job             1.0         35  Trainee         Reading
 6  Activity  Employment Job             3.0         40  Trainee         Travelling
 7  Activity  Employment Job             2.0         40  Trainee         Swimming
 8  Activity  Employment Job             6.0          5  Trainee         Sleeping
 9  Activity  Employment Job            12.0         45  Trainee         Working out
10  Activity  Employment Job             4.5         20  Administrator   Swimming
11  Activity  Employment Job             9.0         10  Administrator   Sleeping
12  Activity  Employment Job             7.5         50  Administrator   Working out
13  Activity  Employment Job             1.5         35  Administrator   Reading
14  Activity  Employment Job             1.5         50  Administrator   Travelling
15  Activity  Employment Job            11.0         55  Manager         Working out
16  Activity  Employment Job             2.0         45  Manager         Reading
17  Activity  Employment Job             7.5         10  Manager         Sleeping
18  Activity  Employment Job             1.5         90  Manager         Travelling
19  Activity  Employment Job             2.0         40  Manager         Swimming
20  Activity  Employment Job             3.0         35  Pensioner       Reading
21  Activity  Employment Job             9.0         15  Pensioner       Swimming
22  Activity  Employment Job             9.0         15  Pensioner       Sleeping
23  Activity  Employment Job             3.0         60  Pensioner       Travelling
24  Activity  Employment Job             0.0          0  Pensioner       Working out
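
If holiday_data.csv is not to hand, the same column layout can be rebuilt in memory. The sketch below copies only the first three rows of the table above; the name df_sample is made up for this illustration and is not part of the tutorial.

```python
import pandas as pd

# First three rows of the table above, typed in by hand.
# `df_sample` is a hypothetical name used only for this illustration.
df_sample = pd.DataFrame({
    'source': ['Activity'] * 3,
    'target': ['Employment Job'] * 3,
    'Calories Burnt': [2.5, 4.5, 8.0],
    'Enjoyment': [35, 20, 5],
    'Employment Job': ['Student'] * 3,
    'Activity': ['Reading', 'Swimming', 'Sleeping'],
})

print(df_sample)
```

Every row has the same source and target; the distinctions between flows come from the 'Employment Job' and 'Activity' columns, which is why the partitions below are built on those two columns.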

We now define the partitions of the data. Rather than listing the categories by hand, we use np.unique to pick out a list of the unique values that occur in the dataset.

In [3]:
partition_job = Partition.Simple('Employment Job', np.unique(df1['Employment Job']))
partition_activity = Partition.Simple('Activity', np.unique(df1['Activity']))

In fact, this is pretty common so there is a built-in function to do this:

In [4]:
# These statements do the same thing as the ones above
partition_job = dataset.partition('Employment Job')
partition_activity = dataset.partition('Activity')
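
To see what np.unique contributes in the manual version, here is a small standalone check on a hand-written list (the values are taken from the 'Employment Job' column, but the list itself is only an illustration). np.unique drops duplicates and returns the remaining values in sorted order.

```python
import numpy as np

# A few repeated job labels, as they might appear in the 'Employment Job' column.
jobs = ['Student', 'Trainee', 'Student', 'Manager', 'Trainee']

# np.unique removes duplicates and sorts the result.
categories = np.unique(jobs).tolist()
print(categories)
```

Because the result is sorted, the categories come out in alphabetical order rather than in the order they first appear in the data.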

We then go on to define the structure of our Sankey: the nodes, the bundles and the ordering. In this case it's pretty straightforward:

In [5]:
nodes = {
    'Activity': ProcessGroup(['Activity'], partition_activity),
    'Job': ProcessGroup(['Employment Job'], partition_job),
}

bundles = [
    Bundle('Activity', 'Job'),
]

ordering = [
    ['Activity'],
    ['Job'],
]

Now we will plot a Sankey that shows how the calories burnt by each type of person are shared between the activities.

In [6]:
# These are the same each time, so just write them here once
size_options = dict(width=500, height=400,
                    margins=dict(left=100, right=100))

sdd = SankeyDefinition(nodes, bundles, ordering)
weave(sdd, dataset, measures='Calories Burnt').to_widget(**size_options)

We can start using colour by specifying that we want to partition the flows according to type of person. Notice that this time we are using a pre-determined palette.

You can find all sorts of palettes listed here.

In [7]:
sdd = SankeyDefinition(nodes, bundles, ordering, flow_partition=partition_job)

weave(sdd, dataset, palette='Set2_8', measures='Calories Burnt').to_widget(**size_options)

Now we want to make the colour of the flow proportional to a numerical value. Pass a QuantitativeScale as the link_color parameter, giving it the name of the measure that you want to display in colour. To start off, let’s use “Calories Burnt”, which is also the width of the lines: wider lines will be shown in a darker colour.

In [8]:
weave(sdd, dataset, link_color=QuantitativeScale('Calories Burnt'), measures='Calories Burnt').to_widget(**size_options)

It’s more interesting to use colour to show a different attribute from the flow table. But because a line in the Sankey diagram is an aggregation of multiple flows in the original data, we need to specify how the new dimension will be aggregated. For example, we’ll use the mean of the flows within each Sankey link to set the colour. In this case we will use the colour to show how much each type of person enjoys each activity. We can be interested in either the cumulative enjoyment or the mean enjoyment: try both!

Aggregation is specified with the measures parameter, which should be set to a dictionary mapping dimension names to aggregation functions ('mean', 'sum' etc).

In [9]:
weave(sdd, dataset, measures={'Calories Burnt': 'sum', 'Enjoyment': 'mean'}, link_width='Calories Burnt',
      link_color=QuantitativeScale('Enjoyment')).to_widget(**size_options)
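
The difference between 'sum' and 'mean' only shows up when several rows of the flow table end up in the same link. The sketch below uses plain pandas on three made-up rows (they are not in holiday_data.csv) to mimic that kind of aggregation; it illustrates the idea and is not how floweaver computes it internally.

```python
import pandas as pd

# Hypothetical rows: two Student/Reading flows that would share one link.
flows = pd.DataFrame({
    'Employment Job': ['Student', 'Student', 'Manager'],
    'Activity': ['Reading', 'Reading', 'Reading'],
    'Calories Burnt': [2.5, 1.5, 2.0],
    'Enjoyment': [35, 25, 45],
})

# One row per (Activity, Employment Job) pair, aggregated like the
# measures dictionary above: calories are summed, enjoyment is averaged.
links = (flows
         .groupby(['Activity', 'Employment Job'], as_index=False)
         .agg({'Calories Burnt': 'sum', 'Enjoyment': 'mean'}))

print(links)
```

In the tutorial dataset each job/activity pair occurs only once, so 'sum' and 'mean' of Enjoyment give the same value per link in the diagram as it is partitioned here.
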
In [10]:
weave(sdd, dataset, measures={'Calories Burnt': 'sum', 'Enjoyment': 'mean'}, link_width='Calories Burnt',
      link_color=QuantitativeScale('Enjoyment', intensity='Calories Burnt')).to_widget(**size_options)
/home/rick/ownCloud/devel/sankey-view/floweaver/color_scales.py:114: RuntimeWarning: invalid value encountered in true_divide
  value /= measures[self.intensity]
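
Judging by the line quoted in the warning, the colour value is divided by the intensity measure, so the colour now reflects enjoyment per calorie burnt rather than enjoyment itself. The warning is most plausibly triggered by the Pensioner / Working out row, where both measures are 0. The following is a minimal NumPy sketch of that division under this assumption, not floweaver's own code:

```python
import warnings
import numpy as np

# Two links from the table: (Student, Reading) and (Pensioner, Working out).
enjoyment = np.array([35.0, 0.0])
calories = np.array([2.5, 0.0])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    per_calorie = enjoyment / calories  # the second element is 0 / 0

print(per_calorie)
print([w.category.__name__ for w in caught])
```

The 0 / 0 entry becomes NaN and NumPy reports a RuntimeWarning, while the other link gets an ordinary value.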

You can change the colour palette using the palette argument of QuantitativeScale. The palette names are different from before, because those were categorical (or qualitative) scales, and this is now a sequential scale. The palette names are listed here.

In [11]:
scale = QuantitativeScale('Enjoyment', palette='Blues_9')
weave(sdd, dataset,
      measures={'Calories Burnt': 'sum', 'Enjoyment': 'mean'},
      link_width='Calories Burnt',
      link_color=scale) \
    .to_widget(**size_options)
In [12]:
scale.domain
Out[12]:
(0, 90)
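
The domain is the range of values that the colours span, here 0 to 90, which matches the lowest and highest Enjoyment values in the table. As a rough sketch of how a sequential scale can turn a value into a position within a nine-colour palette such as Blues_9 (the helper names normalise and palette_bin are invented for this example, and floweaver's own mapping may differ, for instance by interpolating between colours):

```python
def normalise(value, domain):
    """Map a value from the domain onto the interval [0, 1], clamping outliers."""
    lo, hi = domain
    if hi == lo:
        return 0.0
    t = (value - lo) / (hi - lo)
    return min(1.0, max(0.0, t))


def palette_bin(value, domain, n_colours=9):
    """Pick the index of the palette colour for a value (0 = lightest)."""
    t = normalise(value, domain)
    return min(n_colours - 1, int(t * n_colours))


domain = (0, 90)
for enjoyment in (0, 45, 90):
    print(enjoyment, normalise(enjoyment, domain), palette_bin(enjoyment, domain))
```

With this scheme an enjoyment of 0 lands on the first colour, 45 on the middle one and 90 on the last.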

It is possible to create a colour bar (a legend showing the range of intensity values), but it’s not currently as easy as it should be. This should be improved in future.