Sankey diagram (Alluvial Diagram)

What’s this?

Macro for Sankey diagram using SAS GRAPH.

To be more precise, it is called alluvial Diagram.

A Sankey diagram is a visualization used to depict a flow from one domain of values to another domain. Sankey graphs is useful for showing flow and relationships among multiple categories.

In pharmacoepidemiology study, this diagram is used for changes of the state of a patient over time, such as medication switching, disease stage progression, or movement through treatment pathways. 1.

In the past, there are reports of sankey diagram example using SAS GRAPH 23. But in this macro, more options such as statistics display and node focus are available.

If you can allow the use of HTML, I suggest you refer to the case study of using D3.js and SAS 4.

Elements of Sankey diagram

Sankey diagram is contained the three elements.

_images/elements_of_sankey_diagram.png
  • domain

    the group of data as x-axis

  • Node

    the category. displays as rectangle shape.

  • link (flow)

    sigmoid shapes connected from node (source node) to next node (target node). amounts of flow are displays as widths of sigmoid shapes. the links connect the nodes in order left domain to right domain.

Rules

  • the nodes are stacked in node category order.

  • all records of the input data are displayed in the all domains.

  • nodes in the first domain should be source node only, not target node.

  • the nodes of last domain should be target node only, not source node.

  • Circular links are not allowed.

Input data

the node categories are set to each domain variables like below. I recommended that the node category format is applied to each domain variable. the domain variables should be numeric.

null value is allowed. null value is defined as null node and same as other nodes.

domain1

domain2

domain3

node A

node B

node C

node B

node A

node C

node D

node A

node B

the domain format is required.

code

proc format;
  value domainfmt
  1="Domain 1"
  2="Domain 2"
  3="Domain 3"
;
run;

Syntax

 ods graphics / < graphics option > ;
 ods listing gpath=< output path >;

%macro sankey(
   data=,

   domain=,
   domainfmt=,
   domaintextattrs=(color=black size=11),
   gap=5,

   nodefmt=auto,
   nodewidth=0.2,
   nodeattrs=auto,
   nodename=true,
   nodetextattrs=(color=black size=9),

   linktext=true,
   linktext_offset=0.05,
   linkattrs=auto,
   linktextattrs=(color=black size=8),

   reverse=false,
   focus=None,
   endFollowup=None,
   stat=both,
   unit=,

   legend=false,
   palette=sns,
   note=,
   deletedata=true);

Parameters

  • data : dataset name (required)

    input data. keep and rename options are available but keep option is not. input data is copied to work library and keep only domain variable set to domain parameter.

Domain settings

  • domain : variable name (required)

    domain variable. domain variables are set in link order. for example, if domain parameter is set as “var1 var2 var3”, the var1 and var2, var2 and var3 are connected.

  • domainfmt : variable name (required)

    format for domain variable. format catalog should be saved in work library and value should be numeric. interval of domains is depends on the values of format. labels of the format are displayed on plot. values not be included into sankey should be exclude from format.

  • domaintextattrs : text appearance (optional)

  • The appearance (font size, font color and font weights) of domain text. it is same as textattrs of GTL. default is “(size=11 color=black).

Node settings

  • gap : numeric (optional)

    the gap between nodes. default is 5.

  • nodefmt : keyword or format name (optional)

    format for nodes. if “auto” is set, format is obtained from domain variable in input data. if you want to display the node which is not existed in input data in the legend. the node format should be set to this parameter. format catalog should be saved in work library. default is “auto”.

  • nodewidth : numeric (optional)

    the width of node shapes. the default is 0.2.

_images/nodewidth.png
  • nodeattrs : keyword or fill appearance (optional)

    node appearance (color, transparency). it is same as fillattrs option of GTL. if “auto” is set, the fill color of nodes are set based on the node category. if you want to be set same color to all the nodes.fill attribute is set to this parameter.

    for example, the fill appearance described below is set and the all nodes is set blue.

    nodeattrs=(color=blue)

    default is “auto”.

  • nodename : bool (optional)

    displays the node name (node category name) in the sankey diagram. default is “true”.

  • nodetextattrs : text appearance (optional)

    The node text appearance (font size, font color and font weights). It is same as textattrs option of GTL. default is “(size=9 color=black).

Other settings

  • reverse : bool (optional)

    if True the y-axis will be reversed. default is False.

  • focus : subset-if statement (optional)

    specify the focused links. The links except for focused links will be set grey color and link text will be not displayed. The focused links are specified by nodes connected by focused links using subset if statement. variables that can be used for subset if are as follows.

    • domain: domain of source_node

    • next_domain: domain of target_node

    • source_node: source node of focused link

    • target_node: target node of focused link

    for example, when links connected from “type A” node (source node) of domain 2 is focused, set the parameter described below. value of type A node is 1.

    focus=(source_node=1 and domain=2)

    default is “none” (not focused).

  • EndFollowup : node (optional)

    specify the node of end of follow-up. The color of node specified this parameter will be set grey and the percentage of follow-up at each domain will be displayed at the bottom of Sankey diagram. default is “none” (not specified).

    the percentage of follow-up is calculated using following equation.

    %follow-up = (total number - number of EndFollowup) / Total number * 100

  • stat : keyword (optional)

    displays the frequency or percentage of nodes and links. the keyword described below is available. default is “both”.

    • FREQ: frequency

    • PCT: percentage(displays second decimal place)

    • BOTH: frequency and percentage

    • NONE: not displayed

    the percentage is calculated using following equation.

    percentage = frequency of nodes or links / number observation of input data * 100

  • unit : text (optional)

    the units of frequency (suffix). if “FREQ” or “BOTH” keyword is set to the stat parameter, the units of frequency is displayed in the node text and link text.

    default is “” (null).

  • legend : bool (optional)

    if “True” the legend of node category item is displayed. default is “false”.

  • palette : keyword (optional)

    color palette for fill, line and markers. the palettes described below is available. see color palette section of introduction page. default is “SNS” (Seaborn default palette).

    • SAS

    • SNS (Seaborn)

    • STATA

    • TABLEAU

  • note : statement (optional)

    insert the text entry statement into the graph template and display the title or footnote in the output image. default is “” (not displayed)

  • deletedata : bool (optional)

    if True, the temporary datasets and catalogs generated by macros will be deleted at the end of execution. default is True.

example

output example can be executed using following code after loading SAS plotter.

code

ods listing gpath="your output path";
filename exam url "https://github.com/Superman-jp/SAS_Plotter/raw/main/example/sankey_example.sas" encoding='UTF-8';
%include exam;

Basic sankey diagram

the regimen changes of the patients is displayed as sankey diagram using this macro. variables (day0, day30, day60, day120) are selected regimen of the patient at 0 ,30, 60 and 120days. these variables is set the regimen types.

raw data

proc format;
value domainf
1="day0"
2="day30"
3="day60"
4="day120";

value nodef
0="Drug A"
1="Drug B"
2="Drug C"
3="Drug D"
4="Drug E"
;
run;
data raw;
usubjid+1;
input day0 day30 day60 day120;
format day0 day30 day60 day120 nodef.;
cards;
0 2 3 4
0 2 3 4
0 2 3 4
2 1 2 4
2 1 2 4
2 1 2 4
2 1 2 4
2 1 2 4
2 1 4 3
4 3 2 1
4 3 2 1
4 3 2 1
4 3 2 1
;
run;

code

ods graphics / height=15cm width=20cm imagefmt=png imagename="sankey_basic" noborder;
title "Bacic Sankey";

%sankey(
   data=raw,
   domain=day0 day30 day60 day120,
   domainfmt=domainf,
   note=%nrstr(entrytitle 'your title here';
            entryfootnote halign=left 'your footnote here';
            entryfootnote halign=left 'your footnote here 2';)
   );
_images/sankey_basic1.svg

Adjust the domain interval

The interval of the domain is defined as the values of the domain format. when the domain format values are adjusted, then interval of the domain will be changed.

if the format is changed described below, the plot of the previous section “basic sankey diagram” is as follows.

code

proc format;
value domain2f
1="day0"
2="day30"
3="day60"
5="day120";
run;

ods graphics / height=15cm width=20cm imagefmt=png imagename="sankey_change_intervals" noborder;
title "Adjust the domain interval";
ods html;
%sankey(
   data=raw,
   domain=day0 day30 day60 day120,
   domainfmt=domain2f
);
_images/sankey_change_intervals1.svg

Node format and legend

If the codes of the previous section “basic sankey diagram” is changed as described below, the color of the “Regimen E”, “Regimen G” and “Regimen I” is different from the previous section.

By default, the color of the nodes is defined based on the input data and domain variables. when the input dataset is modified, the color of the nodes might be changed even though same node.

code

proc format;
value domain3f
1="day0"
2="day60"
;
run;
ods graphics / height=15cm width=20cm imagefmt=png imagename="sankey_not_set_nodefmt" noborder;
title "not set nodefmt parameter";

%sankey(
   data=raw,
   domain=day0 day60,
   domainfmt=domain3f,
   legend=true
);
_images/sankey_not_set_nodefmt1.svg

if nodefmt parameter is set, the color of the nodes defined based on the node format. The node color is independent from dataset. All elements of the format is displayed in the legend if legend parameter is true.

code

ods graphics / height=15cm width=20cm imagefmt=png imagename="sankey_set_nodefmt" noborder;
title "set nodefmt parameter";

%sankey(
   data=raw,
   domain=day0 day60,
   domainfmt=domain3f,
   nodefmt=nodef,
   legend=true
);
_images/sankey_set_nodefmt1.svg

Change the fill appearance

By default, the node color and the link color are defined based on the node category. If the nodeattrs or linkattrs are set, the fill color of the all the nodes or links will be changed. Nodefmt parameter will be ignored.

code

ods graphics / height=15cm width=20cm imagefmt=png imagename="sankey_set_nodeattrs" noborder;
title "change_the_node color";
ods html;
%sankey(
   data=raw,
   domain=day0 day30 day60 day120,
   domainfmt=domainf,
   nodeattrs=(color=grey)
);
_images/sankey_set_nodeattrs1.svg

code

ods graphics / height=15cm width=20cm imagefmt=png imagename="sankey_set_linkattrs" noborder;
title "change the link color";
ods html;
%sankey(
   data=raw,
   domain=day0 day30 day60 day120,
   domainfmt=domainf,
   linkattrs=(color=skyblue transparency=0.7)
);
_images/sankey_set_linkattrs1.svg

Change the text appearance

The domaintextattrs, nodetextattrs and linktextattrs parameters are useful for modify text appearance of domains, nodes and links.

code

ods graphics / height=15cm width=20cm imagefmt=png imagename="sankey_set_textattrs" noborder;
title "change the text appearance";
ods html;
%sankey(
   data=raw,
   domain=day0 day30 day60 day120,
   domainfmt=domainf,
   domaintextattrs=(color=red size=12),
   nodetextattrs=(color=blue size=12),
   linktext_offset=0.05,
   linktextattrs=(color=green size=8)
);
_images/sankey_set_textattrs1.svg

Focus parameter

The focus parameter is useful for the highlighting specified links. The not-highlighted links are set grey color and the link texts are not displayed.

focus links are specified by nodes connected by focus links and the nodes is specified conditional statement (subset-if).

the nodes described below are available to specify the focus links.

raw data

proc format;

value druglist
1="Drug A"
2="Drug B"
3="Drug C"
4="Drug D"
99="Lost to follow-up";

value timef
1="Day 0"
2="Day 30"
3="Day 60"
4="day 90"
;
run;

data drug_switch;
length usubjid $10 Day0 Day30 Day60 Day90 8;
format Day0 Day30 Day60 Day90 druglist.;
input usubjid $ Day0 Day30 Day60 Day90;
datalines;
A001 1 1 1 3
A002 1 1 1 4
A003 1 1 1 4
A004 1 1 1 4
A005 1 2 1 4
A006 1 3 1 4
A007 1 3 1 4
A008 1 4 1 99
A009 1 4 1 99
A010 1 1 2 2
A011 1 1 2 3
A012 1 1 2 3
A013 1 1 2 3
A014 1 2 2 4
A015 1 2 2 4
A016 1 3 2 4
A017 1 1 3 1
A018 1 1 3 2
A019 1 2 3 4
A020 1 2 3 4
A021 1 2 3 4
A022 1 3 3 4
A023 1 3 3 99
A024 2 4 2 4
A025 2 1 3 3
A026 2 1 3 3
A027 2 1 4 1
A028 2 1 4 2
A029 2 2 4 3
A030 2 2 4 4
A031 2 2 4 4
A032 2 3 4 4
A033 2 3 4 4
A034 2 99 99 99
A035 2 99 99 99
A036 3 1 4 2
A037 3 2 4 4
A038 3 3 4 99
A039 3 1 99 99
A040 3 1 99 99
A041 3 99 99 99
A042 4 4 3 99
A043 4 1 99 99
A044 4 2 99 99
;
run;

example 4: multiple condition

code

ods graphics / height=15cm width=20cm imagefmt=png imagename="sankey_focus4" noborder;
title "focus parameter example 4";
%sankey(
   data=drug_switch,
   domain=day0 day30 day60 day90,
   domainfmt=timef,
   focus=((domain=1 and source_node=2) or (next_domain=4 and target_node=99)),
   gap=1,
   reverse=true,
   nodewidth=0.1,
   nodename=false,
   stat=freq,
   legend=true,
   linktext_offset=0.03,
   palette=sns);
_images/sankey_focus41.svg

Percentage of Followup

when the endfollowup parameter is set value of lost followup, percentage followup each domain will be displayed below the sankey diagram.

code

ods graphics / height=15cm width=20cm imagefmt=png imagename="sankey_endfollowup" noborder;
title "percentage of Followup";
%sankey(
   data=drug_switch,
   domain=day0 day30 day60 day90,
   domainfmt=timef,
   gap=1,
   reverse=true,
   nodewidth=0.1,
   nodename=false,
   endfollowup=99,
   stat=freq,
   legend=true,

   note=%nrstr(entrytitle 'your title here';
               entryfootnote halign=left 'your footnote here';
            entryfootnote halign=left 'your footnote here 2';)
);
_images/sankey_endfollowup1.svg

Reference

1

Nicolle M. Gatto, Shirley V. Wang, William Murk, Pattra Mattox, M. Alan Brookhart, Andrew Bate, Sebastian Schneeweiss, and Jeremy A. Rassen. Visualizations throughout pharmacoepidemiology study planning, implementation, and reporting. Pharmacoepidemiology and Drug Safety, 31(11):1140–1152, 2022. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/pds.5529, arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/pds.5529, doi:https://doi.org/10.1002/pds.5529.

2

Chapel Hill Shane Rosanbalm. Getting sankey with bar charts. 2015. URL: https://www.pharmasug.org/proceedings/2015/DV/PharmaSUG-2015-DV07.pdf.

3

Sanjay Matange. Sankey diagrams. 2015. URL: https://blogs.sas.com/content/graphicallyspeaking/2015/03/21/sankey-diagrams/.

4

Tanmay Khole. Using sankey diagram to analyze drug pipeline. 2020. URL: https://www.lexjansen.com/phuse-us/2020/dv/DV03.pdf.