
Multipathing Impact on Performance

Multipathing configuration has always played a part in storage performance, and most vendors now use some form of Round Robin multipathing in order to get full use out of all of the paths available between the host and storage.

However, even when using "round robin" multipathing, VMware (and some other OSes) default to sending a batch of IOs down a single path before switching to the next path. In the case of VMware, the default is 1000 IOs or 10MB of data per path, as can be seen in the output of the "esxcli storage nmp device list" command:

Path Selection Policy: VMW_PSP_RR
   Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0; lastPathIndex=2: NumIOsPending=0,numBytesPending=0}
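
If you want to check the current limit on every device at once, that command's output can be filtered down to just the relevant lines - for example (the grep options available in the ESXi shell may vary slightly):

esxcli storage nmp device list | grep -E "^naa\.|Path Selection Policy Device Config"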

Whilst this might serve its purpose from a redundancy perspective, this behavior in effect hot-spots the paths one at a time, which can have a significant performance impact - as generally only one path is being used at a time.

In order to test the real impact of this, I presented 4 LUNs from a VMAX 250F to an ESXi 6.0 host, and then mapped these to a Linux guest running Vdbench. Each LUN has a total of 8 paths to the host. ESXi correctly detected these LUNs as being from a "Symmetrix" and applied the default round-robin/1000 IOs per path policy to them.

In order to remove the performance of the array itself from the equation, I configured Vdbench to use only a 1GB range on each disk - allowing the VMAX to service the IOs entirely from cache. Thus any performance differences experienced are entirely down to the host-to-array connectivity and multipathing. With the default multipathing settings in place, I kicked off an 8K, 50% read/50% write, 100% random workload using a single LUN. The results were:

  interval        i/o   MB/sec   bytes   read     resp     read    write     resp     resp queue  cpu%  cpu%
                 rate  1024**2     i/o    pct     time     resp     resp      max   stddev depth sys+u   sys
avg_61-360   38334.82   299.49    8192  50.02    3.335    2.959    3.711  108.002    5.083 127.8   6.0   2.9

Despite only being able to generate a relatively mediocre 38K IOPS (clearly not enough to stress the array), the host-side latency was high for an 8K workload, at 3.3ms.
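
For reference, the workload above corresponds roughly to a Vdbench parameter file like the one below. This is only a sketch - the guest device path, thread count and run/warmup lengths are my assumptions rather than the exact file used (the thread count of 128 is a guess based on the queue depth reported in the output):

cat > rr_test.parm <<'EOF'
* hypothetical guest device; size=1g confines IO to a 1GB range so the array can serve it from cache
sd=sd1,lun=/dev/sdb,size=1g,threads=128,openflags=o_direct
* 8k transfers, 50% reads, 100% random
wd=wd1,sd=sd1,xfersize=8k,rdpct=50,seekpct=100
* run unthrottled; warmup=60 with elapsed=360 matches the avg_61-360 reporting interval shown above
rd=rd1,wd=wd1,iorate=max,elapsed=360,warmup=60,interval=1
EOF
./vdbench -f rr_test.parm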

Changing the ESXi multipathing config to do only a single IO before switching paths gave very different results:

[root@esx1:~] esxcli storage nmp psp roundrobin deviceconfig set -d naa.60000970000197800380533030303441 --iops 1 --type iops

  interval        i/o   MB/sec   bytes   read     resp     read    write     resp     resp queue  cpu%  cpu%
                 rate  1024**2     i/o    pct     time     resp     resp      max   stddev depth sys+u   sys
avg_61-360  151892.44  1186.66    8192  50.01    0.839    1.019    0.658  100.772    1.427 127.4  29.2  19.2

Changing nothing more than the number of IOs per path resulted in a 4x increase in performance AND a roughly 75% drop in latency!

This result isn't all that surprising. At 38K IOs/second and 1000 IOs per path before switching, multipathing used each path for roughly 1/38th of a second before moving to the next path - i.e. it spent around 26 milliseconds using path A, then the next 26 milliseconds using path B. In effect, only one path is ever in use; which path that is simply changes over time. Instead of having the resources of 8 paths available, we've only got the resources of 1 - not just on the path itself, but also on the host/HBA, and just as importantly on the front-end port/engine on the array side.
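
As a quick back-of-the-envelope check of that figure (illustrative arithmetic only, run on any host with bc; the IO rate is simply the one observed above):

echo "scale=1; 1000 * 1000 / 38334" | bc    # 1000 IOs per path / 38334 IO/s, in milliseconds -> 26.0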

Theoretically, increasing the number of LUNs in use should allow ESXi to make better use of the available paths - even though each LUN only uses one path at a time, across multiple LUNs we would expect to see a slightly better balance over all of the paths at any point in time. Moving to 4 LUNs does indeed show an improvement, but the performance with 1000 IO/path is still only around half that of 1 IO/path.

1000 IO/path:
  interval        i/o   MB/sec   bytes   read     resp     read    write     resp     resp queue  cpu%  cpu%
                 rate  1024**2     i/o    pct     time     resp     resp      max   stddev depth sys+u   sys
avg_61-360  130391.85  1018.69    8192  50.02    3.921    3.718    4.124  112.530    5.440 511.3  25.3  12.2
1 IO/path:
avg_61-360  240222.73  1876.74    8192  50.01    2.124    1.958    2.291  145.301    4.230 510.3  47.5  25.6

Reducing the workload being asked of the array to only 50K IOPS still shows a significant difference. Both configs were able to sustain 50K IOPS without any problems; however, with 1000 IO/path the latency was more than double that with 1 IO/path - ~0.8ms compared to ~0.34ms.

1000 IO/path:
  interval        i/o   MB/sec   bytes   read     resp     read    write     resp     resp queue  cpu%  cpu%
                 rate  1024**2     i/o    pct     time     resp     resp      max   stddev depth sys+u   sys
avg_61-360   50020.33   390.78    8192  50.01    0.796    0.686    0.906   85.592    1.132  39.8  15.3   8.6
1 IO/path:
avg_61-360   50027.45   390.84    8192  49.99    0.337    0.265    0.408   61.580    0.125  16.8  15.5   9.0

Configuring Multipathing

For VMAX, details on how to configure multipathing in ESX are included in the EMC Host Connectivity Guide for VMware ESX Server; however, at least as of the December 2016 (Rev 46) version of that guide, the command given to set the default multipathing does NOT include the option for 1 IO/path.

For XtremIO, the XtremIO Host Configuration Guide covers the details, including how to set the default to 1 IO/path.

For both platforms, there are two separate configuration changes that need to be made - one to set the default behavior, which only affects NEW LUNs presented to the host AFTER the change is made, and a second to change the behavior of any LUNs that have already been presented to the host.

For XtremIO, the command to change the default setting is (all on one line):
esxcli storage nmp satp rule add -c tpgs_off -e "XtremIO Active/Active" -M XtremApp -P VMW_PSP_RR -O iops=1 -s VMW_SATP_DEFAULT_AA -t vendor -V XtremIO

For VMAX, it's:
esxcli storage nmp satp rule add -s VMW_SATP_SYMM -V EMC -M SYMMETRIX -P VMW_PSP_RR -O iops=1
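
Either way, you can confirm the new default rule is in place by listing the SATP rules and filtering on the vendor/model strings - something along these lines (the grep pattern here is just an example):

esxcli storage nmp satp rule list | grep -E "XtremIO|SYMMETRIX"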

For LUNs that have already been presented to the host, the commands are the same regardless of the array (although the first command generally won't be needed for VMAX, as the default in recent versions of ESXi is already round robin) and need to be run for each LUN individually, replacing <naa_id> with the LUN's NAA identifier:

esxcli storage nmp device set --device <naa_id> --psp VMW_PSP_RR
esxcli storage nmp psp roundrobin deviceconfig set --device <naa_id> --iops 1 --type iops
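
If there are a lot of LUNs already presented, the two per-device commands can be wrapped in a simple loop from the ESXi shell. The sketch below assumes that all of the VMAX devices on this host have NAA IDs starting with naa.6000097 - adjust the pattern (or substitute your own list of devices) to suit your environment:

# apply round robin with 1 IO per path to every matching existing device
for dev in $(esxcli storage nmp device list | grep '^naa.6000097'); do
   esxcli storage nmp device set --device $dev --psp VMW_PSP_RR
   esxcli storage nmp psp roundrobin deviceconfig set --device $dev --iops 1 --type iops
done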

You can check the current settings for a LUN, and get the NAA identifiers to use with the above commands, using the "esxcli storage nmp device list" command, e.g.:

[root@esx1:~] esxcli storage nmp device list
naa.60000970000197800380533030303245
   Device Display Name: EMC Fibre Channel Disk (naa.60000970000197800380533030303245)
   Storage Array Type: VMW_SATP_SYMM
   Storage Array Type Device Config: {action_OnRetryErrors=off}
   Path Selection Policy: VMW_PSP_RR
   Path Selection Policy Device Config: {policy=iops,iops=1,bytes=10485760,useANO=0; lastPathIndex=0: NumIOsPending=0,numBytesPending=0}
   Path Selection Policy Device Custom Config:
   Working Paths: vmhba2:C0:T3:L17, vmhba2:C0:T4:L17, vmhba2:C0:T5:L17, vmhba2:C0:T2:L17, vmhba3:C0:T4:L17, vmhba3:C0:T5:L17, vmhba3:C0:T2:L17, vmhba3:C0:T3:L17

The important parts are the "Path Selection Policy" and "Path Selection Policy Device Config" fields, although be aware that the "policy=" setting in the latter can be either "policy=iops" or "policy=rr" - both are functionally equivalent, and which one is shown depends on whether the device was configured manually or via the default rule.

Note that you can NOT do this configuration from the vSphere UI. Although the UI will allow you to change the path selection policy, it does NOT allow setting the number of IOs per path.

As of ESXi 6.5, VMware have added the above configuration as a default for XtremIO, so new ESXi 6.5 installs will not require this configuration for XtremIO. VMAX, however, still defaults to 1000 IO/path, even in ESXi 6.5.