Troubleshooting: vSphere Web Client Plugin Installation

Requirements to Install vSphere Plugin 2.5.1+ from the PureStorage GUI

For instructions on installing and updating the plugin from the Pure GUI, see VMware vSphere Plugin Install (Version 3.0 | Version 2.5 or Older).

  • Java 1.8 (for TLS 1.1 or 1.2). If Java 1.8+ is not present, DO NOT REQUEST THAT IT BE UPDATED.
  • Network access between the array and the vSphere server over TCP port 443.
  • vCenter administrator privileges and PureStorage system administrator privileges, or an administrator with these privileges who is available to do the installation. This includes the administrative username and password for the vCenter server.
  • vSphere Web Client.
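
The Java requirement above can be checked without changing anything on the server by reading the banner that java -version prints. The following is a minimal sketch, not part of the plugin tooling: the function name is hypothetical, and it assumes the usual version "X.Y..." banner format.

```python
import re


def java_is_1_8_or_newer(version_output):
    """Return True if a `java -version` banner reports Java 1.8 or newer.

    Assumes the banner contains text such as: java version "1.8.0_151"
    or: openjdk version "11.0.2". Returns False if no version is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return False
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    # Legacy numbering uses 1.8 for Java 8; newer releases use 9, 10, 11, ...
    return (major, minor) >= (1, 8)
```

Pass in the text captured from java -version (it is written to standard error). A False result only indicates that the requirement is not met; per the note above, it is not a reason to update Java.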

Requirements to Install and Configure vSphere Plugin Using the PluginServer Method

  • Linux OS or Windows OS (either can be a VM).
  • Network access to vSphere server port 8080 or 8081.
  • Administrative username and password for vCenter server.
  • Windows OS does not need Java 1.8 (the PluginServer includes a Java runtime).
  • Linux OS needs Java JDK 1.8+ and JRE 8+ (for TLS 1.1 or 1.2).
  • The PluginServer archive files.
  • vSphere Web Client
  • PureStorage system administrator privileges to connect to PureStorage array within vSphere.
  • TCP port 443 (for vCenter to send API commands to the array).
  • vSphere 5.5, vSphere 6.0, or vSphere 6.5. Note: The space reclamation feature requires vSphere 6.0 or higher and ESXi 6.0 or higher.
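
The network requirements above (TCP 443 to the array, 8080 or 8081 for the PluginServer) can be confirmed before starting. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and a successful TCP connection is taken as evidence that the port is reachable from the machine running the check.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, tcp_port_open(array_ip, 443) run from the vSphere server, where array_ip is a placeholder for the array's address. A False result can mean a firewall, a wrong address, or a service that is not listening.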

Troubleshooting

If the plugin still does not function properly after you have followed the VMware vSphere Plugin Installation documentation, the cause is often that the VMware vSphere web client server does not support TLS 1.1 or 1.2. To confirm this, see Analyzing the vsphere_client_virgo.log below.

The PluginServer was developed because other troubleshooting steps, such as adjusting the wrapper.conf file, updating Java, or manually installing the plugin, can cause unexpected issues in the customer's environment. If the following procedure does not work, create a Jira; do not try other steps without direction from PSE.

Installing the PluginServer

The procedure below does a few different things. First, it runs the unregisterplugin script (.sh or .bat) against the IP address of the vSphere server. This tells the server to stop trying to download the vSphere plugin from the array through the API. The web client normally triggers the plugin's installation when a user logs in. To keep logins from taking a long time, it does not try to install the plugin again after a failed attempt; restarting the vSphere Web Client service clears this state. Next, startserver makes the needed files available for download from the Windows or Linux OS. Finally, registerplugin uses the API to have the vSphere server pull the needed files and to tell vSphere to install the plugin the next time someone logs in to the vSphere web client.

Windows

If any of these steps do not work, see the Notes section below.

  1. Download PureStorage_vSphere_installer.jar and PluginServer-2.5.1_201704051804+163c8cd-rel_2_5_x.zip (or the latest plugin server) from https://archive.dev.purestorage.com/flasharray/purity/customer_shipped_releases/older_releases_at_.._.._.._customer_shipped_releases/vsphere_plugin/ and then copy these files to a Google Drive (or equivalent) to share.
  2. Make a new directory for the PluginServer files and unzip them to the directory.
    1. Create a folder called pluginserver in the GUI, or in the CLI enter mkdir [drive:]path\pluginserver .
    2. Extract PluginServer-2.5.1_201704051804+163c8cd-rel_2_5_x.zip (or the latest) to the pluginserver folder.
  3. Make sure PureStorage_vSphere_installer.jar is present in the same folder as the unzipped PluginServer files.
    1. Enter cd pluginserver .
    2. Enter copy ..\PureStorage_vSphere_installer.jar .\ .
  4. Unregister the failed plugin install.
    1. Enter unregisterplugin.bat while in the folder where the extracted files and PureStorage_vSphere_installer.jar reside.
    2. Enter the vSphere server's IP address.
    3. Enter the vSphere credentials.
  5. Run startserver.bat in the command console used for creating the directory. This will start up the web server that hosts the plugin. Keep this console open and running while you perform the rest of the steps.
    1. You should see something similar to the following:
      robm$ ./startserver.bat
      Found these arguments port(8080) keystore(keystore.jks)
      File location(purestorage-vsphere-plugin.zip)
      Starting server on port 8080...
      Server started successfully!
    2. Leave this running and open a new command line window for the next step.
    3. If it doesn’t work and you see “Found these arguments port(8080)” … “Address already in use”, follow the steps in the Notes section below.
  6. In a new command window, run registerplugin.bat while in the directory where the PluginServer files were extracted and where PureStorage_vSphere_installer.jar resides.
    1. Once run, it asks for the IP address and credentials needed to run API commands against the vSphere server.
  7. Restart the vSphere Web Client service on the vCenter server and then wait five minutes for it to fully come up. On vCenter 6.5, the vSphere Web Client service is not listed in Services, so use a command prompt to stop and start the process:

C:\Program Files\VMware\vCenter Server\bin>service-control --stop vspherewebclientsvc
C:\Program Files\VMware\vCenter Server\bin>service-control --start vspherewebclientsvc

  8. Have the user log in to the vSphere web client and check for the PureStorage plugin. If it is there, you can close both command line windows.
    1. If the PureStorage plugin is still not visible, make sure the user has logged out and then back in to the vSphere Web Client.
    2. If the PureStorage plugin is still not visible, proceed to Analyzing the vsphere_client_virgo.log (steps below).
  9. Once the user has logged in to the vSphere Web Client and can see the plugin, follow the vSphere Web Client user guide.

Linux VCSA Appliance, Installing the Plugin Server on a Linux VM

The vCenter Server Appliance (VCSA) requires another OS to run the PluginServer scripts on. This can be a temporary VM running Linux or Windows. The following instructions are for a Linux VM; if the other OS is Windows, follow the Windows steps above.

If any of these steps do not work, see the Notes section below.

  1. Choose the machine that will run the PluginServer scripts. This can be a temporary VM running Linux or Windows (for Windows, use the steps above), or it can be the Windows guest OS the customer is using for their own machine.
  2. Download PureStorage_vSphere_installer.jar and PluginServer-2.5.1_201704051804+163c8cd-rel_2_5_x.zip (or the latest plugin server) from https://archive.dev.purestorage.com/flasharray/purity/customer_shipped_releases/older_releases_at_.._.._.._customer_shipped_releases/vsphere_plugin/.
  3. From the command line, make a new directory for the PluginServer files and unzip them to the directory.
    1. Enter mkdir pluginserver .
    2. Enter unzip -d pluginserver/ PluginServer-2.5.1_201704051804+163c8cd-rel_2_5_x.zip .
  4. Copy PureStorage_vSphere_installer.jar to the same directory as the unzipped PluginServer files.
    1. Enter cd pluginserver/ .
    2. Enter cp ../PureStorage_vSphere_installer.jar ./ .
  5. Make the scripts executable.
    1. Enter chmod a+x *.sh .
  6. Unregister the failed plugin install.
    1. Enter ./unregisterplugin.sh while in the directory where the files were extracted and where PureStorage_vSphere_installer.jar resides.
    2. Enter the vSphere server's IP address.
    3. Enter the vSphere credentials.
  7. Start the PluginServer.
    1. Enter ./startserver.sh .
    2. You should see something similar to the following:
      robm$ ./startserver.sh
      Found these arguments port(8080) keystore(keystore.jks)
      File location(purestorage-vsphere-plugin.zip)
      Starting server on port 8080...
      Server started successfully!
    3. If it doesn’t work and you see “Found these arguments port(8080)” … “Address already in use”, follow the steps in the Notes section below.
    4. If this fails for another reason, you may need to install Java JDK 1.8+ and JRE 8+ on this Linux VM. If this is a production VM, do not update Java; use a nonproduction VM instead.
    5. In testing, an Ubuntu 16.04.3 VM was used. The following commands were required:
      robm@ubuntu:~$ sudo apt-get update
      robm@ubuntu:~$ sudo apt-get install default-jre
  8. Leave this running and open a new command line window for the next step.
  9. In a new command window, run registerplugin.sh while in the directory where the PluginServer files were extracted and where PureStorage_vSphere_installer.jar resides.
    1. Once run, the program prompts for the IP address and credentials to run API commands against the vSphere server.
  10. Restart the vSphere Web Client service from the command line on the vCenter server (not the VM) and then wait five minutes for it to fully come up.
    1. Enter service vsphere-client restart .
  11. Have the user log in to the vSphere web client and check for the PureStorage plugin. If it is there, you can close both command line windows.
    1. If you still do not see the PureStorage plugin, make sure you have logged out and then back in to the vSphere Web Client.
    2. If you still do not see the PureStorage plugin, proceed to Analyzing the vsphere_client_virgo.log (steps below).
  12. Once the user has logged in to the vSphere Web Client and can see the plugin, follow the vSphere Web Client user guide.

Analyzing the vsphere_client_virgo.log

Locate the vsphere_client_virgo.log and have the customer copy it to a text file and email it to us. These logs are often zipped and numbered, so be sure we have the file that covers the time period when we attempted to install the plugin. The locations below, taken from the VMware KB, are where these files should be. The file is often not where we expect it to be, so searching for it may be necessary.

  1. For vSphere 5.0, all the logs for the vSphere Web Client service are located at:
    • Windows: C:\Program Files\VMware\Infrastructure\vSphere Web Client\DMServer\serviceability\.
    • Linux: /usr/lib/vmware-vsphere-client/server/serviceability/.
  2. For vSphere 5.1, all the logs for the vSphere Web Client service are located at:
    • Windows: C:\ProgramData\VMware\vSphere Web Client\serviceability\.
    • Linux: /var/log/vmware/vsphere-client/serviceability/.
  3. For vSphere 5.5, all the logs for the vSphere Web Client service are located at:
    • Windows: C:\ProgramData\VMware\vSphere Web Client\serviceability\.
    • Linux: /var/log/vmware/vsphere-client/.
  4. For vSphere 6.0+, per the VMware KB, the logs are located at:
    • Windows: C:\ProgramData\VMware\vCenterServer\logs\vsphere-client\logs.
    • Linux: /var/log/vmware/vsphere-client/logs.
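
Because the log is often not in the expected location and may have been rotated, a recursive search can help. The following is a minimal sketch, not a supported tool: both function names are hypothetical, and it assumes rotated copies are either plain text or gzip-compressed files ending in .gz (archives in other formats would need to be extracted first).

```python
import gzip
from pathlib import Path


def find_virgo_logs(root):
    """Return files named vsphere_client_virgo* under root, newest first."""
    files = [p for p in Path(root).rglob("vsphere_client_virgo*") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


def read_log_lines(path):
    """Yield text lines from a plain or gzip-compressed (.gz) log file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Start the search at one of the directories listed above, or at a higher-level directory if the file is not found there. Sorting by modification time makes it easier to pick the file that covers the install attempt.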

Search the vsphere_client_virgo log for the time when the plugin installation was attempted. Usually, the entries for the install attempt contain “purestorage”. The following are errors we have seen in this file.

[2016-02-11 16:06:03.213] ERROR [ERROR] http-bio-9443-exec-16         com.purestorage.FlashArrayHelper      javax.net.ssl.SSLException: java.lang.RuntimeException: Could not generate DH keypair javax.net.ssl.SSLException: java.lang.RuntimeException: Could not generate DH keypair

This alert was caused by not having JDK 1.8 (see this JIRA).

[2017-06-19 10:34:00.954] [ERROR] vc-service-pool-2169   70002699 100142 200004 com.vmware.vise.vim.extension.VcExtensionManager    Error unzipping https://192.168.41.131/download/pure...?version=2.5.1 to directory C:\ProgramData\VMware\vSphere Web Client\vc-packages\vsphere-client-serenity\com.purestorage.plugin.vsphere-2.5.1, check if the server process has Write Permission on this machine. java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(Unknown Source)
        at java.net.SocketInputStream.read(Unknown Source)
        at sun.security.ssl.InputRecord.readFully(Unknown Source)
        at sun.security.ssl.InputRecord.read(Unknown Source)

This alert is due to vSphere trying to establish communication with the array using TLSv1.0 (ES-27873).

If installation from the GUI does not work and you see the above messages, install the PluginServer. If the vSphere server resides on a Windows OS, the PluginServer can be installed on it as described above. If the Linux VCSA appliance is being used, put the PluginServer on a Windows OS server (VM or bare metal) or a Linux VM as described above. Make sure this is not a production Linux VM, as Java may need to be installed or updated to 1.8+.
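
The two log signatures described in this section can be matched mechanically. The following is a minimal sketch, assuming the only patterns of interest are the two errors shown above; the function name and the table of causes are illustrative and simply restate the explanations given in this section.

```python
# Substring seen in the log line -> likely cause, as described in this article.
KNOWN_ERRORS = [
    ("Could not generate DH keypair", "JDK 1.8 is not present"),
    ("Connection reset", "vSphere tried to reach the array using TLSv1.0"),
]


def diagnose(lines):
    """Return (line_number, cause) pairs for known plugin install errors.

    Only lines that mention "purestorage" are considered, since that is
    how install attempts usually appear in vsphere_client_virgo.log.
    """
    findings = []
    for number, line in enumerate(lines, start=1):
        if "purestorage" not in line.lower():
            continue
        for needle, cause in KNOWN_ERRORS:
            if needle in line:
                findings.append((number, cause))
    return findings
```

An empty result does not mean the install succeeded; it only means neither of these two known errors appears on a line that mentions the plugin.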

Notes:

If running the startserver.bat/sh script fails, the cause may be that port 8080 is already in use, as seen below:

Found these arguments port(8080) keystore(keystore.jks)
File location(purestorage-vsphere-plugin.zip)
java.net.BindException: Address already in use: bind

robm$ ./startserver.sh
Found these arguments port(8080) keystore(keystore.jks)
File location(purestorage-vsphere-plugin.zip)
java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:433)
    at sun.nio.ch.Net.bind(Net.java:425)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at sun.net.httpserver.ServerImpl.<init>(ServerImpl.java:100)
    at sun.net.httpserver.HttpsServerImpl.<init>(HttpsServerImpl.java:50)
    at sun.net.httpserver.DefaultHttpServerProvider.createHttpsServer(DefaultHttpServerProvider.java:39)
    at com.sun.net.httpserver.HttpsServer.create(HttpsServer.java:90)
    at com.purestorage.PluginServer.main(Unknown Source)
Exception in thread "main" java.lang.NullPointerException: null SSLContext
    at com.sun.net.httpserver.HttpsConfigurator.<init>(HttpsConfigurator.java:82)
    at com.purestorage.PluginServer$1.<init>(Unknown Source)
    at com.purestorage.PluginServer.main(Unknown Source)
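
The BindException above means another process already holds the port. A quick way to check a port before running startserver is to try binding it. The following is a minimal sketch with a hypothetical function name; it is not part of the PluginServer package.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use", as in the trace above.
            return False
    return True
```

If port_is_free(8080) returns False on the machine that will run the PluginServer, checking port_is_free(8081) is a reasonable next step before editing the scripts.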
 

To fix this error, modify startserver.bat/sh so that it uses port 8081.

  1. Open startserver.bat/sh in a text editor (Notepad or vim).
  2. Search for 8080 and change both occurrences to 8081.
  3. Change port 8080 to 8081 in registerplugin.bat/sh.

Here is what the startserver.bat/sh looks like after modification:

#!/bin/bash
set -e
 
if [ ! -f /usr/bin/java ]; then
   echo 'Please install java'
   exit -1
fi
 
# The listening port is 8081.  You can modify to use any port.  Make sure to modify the registerPlugin script.
java -cp ./PureStorage_PluginServer.jar com.purestorage.PluginServer 8081 keystore.jks purestorage-vsphere-plugin.zip

Additionally, edit registerplugin.bat/sh so that it uses the following command:

java -cp ./PureStorage_PluginServer.jar:./PureStorage_vSphere_installer.jar com.purestorage.RegisterPlugin 8081 3.0.0 $ip
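
The port change in both scripts is a plain text substitution, so it can also be scripted. The following is a minimal sketch, assuming the scripts contain the port as a standalone number as in the examples above; the function name is hypothetical.

```python
import re


def change_port(script_text, old="8080", new="8081"):
    """Replace standalone occurrences of the old port number.

    Returns (new_text, count) so the caller can confirm how many
    occurrences were changed. Longer numbers that merely contain the
    old port (for example 18080) are left alone.
    """
    pattern = re.compile(r"(?<!\d)" + re.escape(old) + r"(?!\d)")
    return pattern.subn(new, script_text)
```

Keep a backup copy of each script before writing the changed text back, and use the same port in both startserver and registerplugin.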

Additional troubleshooting steps include:

  1. If you get a JNI error when running the registerplugin script, most likely PureStorage_vSphere_installer.jar is not in the same directory as the unzipped PluginServer scripts.
  2. To determine where the plugin is failing, check whether it appears in the vCenter Managed Object Browser (https://vcenterIP/mob).
    1. Go to https://ipaddress_of_vSphere_server/mob.
    2. Go to Content > ExtensionManager > extensionList["com.purestorage.plugin.vsphere"] > client .
    3. From there verify the URL and ensure that this is the correct location vCenter should be looking to download and install the plugin.
  3. You can also ensure that the vCenter GUI has the plugin enabled via the following:
    1. Go to https://ipaddress_of_vSphere_server/.
    2. Go to Home > Administration > Client Plug-Ins .
    3. From there make sure the Pure Storage Plugin is enabled (and not set to disabled).


Troubleshooting when vVol Datastore Fails to Mount on UCS Server

Problem

When creating a VVol datastore on a Cisco UCS Server with Fibre Channel, failures can occur when an older driver is in use. The older fnic driver cannot detect protocol endpoints because it does not support sub-LUNs (VVols).

Impact

The FlashArray vSphere Plugin fails to mount the VVol datastore with the error:

The following hosts do not have a valid protocol endpoint connection to the selected Pure Storage Array

pluginerror.png

Or when mounting manually, the datastore is marked on the host as inaccessible.

inaccesible.png

The /var/log/vmkernel.log file on the ESXi host shows the following VVol PE warning when a “Rescan Storage” is initiated:

2018-01-09T18:04:42.098Z cpu5:65799)WARNING: ScsiPath: 705: Sanity check failed for path vmhba0:C0:T1:L1. The path to a VVol PE comes from adapter vmhba0 which is not PE capable. Path dropped.

The problem is likely caused by an outdated scsi-fnic Cisco UCS driver.

To check for general support for sub-LUNs (VVols), run the following command:

esxcli storage core adapter list

Look for Second Level Lun ID in the Capabilities column.

image.png
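
When many hosts need checking, the adapter list can be scanned for the capability text instead of being read by eye. The following is a minimal sketch under loud assumptions: the function name is hypothetical, it assumes each adapter row begins with a name starting with vmhba (as in the vmkernel.log warning above), and it only looks for the literal text Second Level Lun ID anywhere on that row.

```python
def adapters_with_sub_lun_support(esxcli_output):
    """Map each adapter name to True/False for 'Second Level Lun ID'.

    esxcli_output is the captured text of
    `esxcli storage core adapter list`. Header and separator rows are
    skipped because they do not start with an adapter name.
    """
    result = {}
    for line in esxcli_output.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("vmhba"):
            continue
        result[fields[0]] = "Second Level Lun ID" in line
    return result
```

An adapter mapped to False is a candidate for the driver update described in the Solution section.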

Solution

Check Installed Version Of scsi-fnic Cisco UCS Driver

  1. Log in to the ESXi host and execute:
esxcli software vib get -n scsi-fnic

fnicVersion.png

Update scsi-fnic Cisco UCS Driver

To install the new driver version:

  1. Download the updated driver package from http://software.cisco.com or from https://my.vmware.com/group/vmware/details?productId=491&downloadGroup=DT-ESX60-CISCO-FNIC-16033
  2. Copy the scsi-fnic VIB file to the host that needs updating.
  3. As root, execute the following command on the ESXi host:
esxcli software vib install -v <full_path_to_driver_file>

Example:

esxcli software vib install -v /tmp/scsi-fnic_1.6.0.37-1OEM.600.0.0.2494585.vib

The installation result should look similar to the output below:

Installation Result
Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
Reboot Required: true
VIBs Installed: CSCO_bootbank_scsi-fnic_1.6.0.37-1OEM.600.0.0.2494585
VIBs Removed: CSCO_bootbank_scsi-fnic_1.6.0.36-1OEM.600.0.0.2494585
VIBs Skipped:
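
The installation result is a simple list of "Key: value" lines, so the reboot flag and the installed VIB can be read programmatically. The following is a minimal sketch with a hypothetical function name; it assumes the output has the shape shown above.

```python
def parse_install_result(text):
    """Parse 'Key: value' lines from esxcli install output into a dict.

    Lines without a colon (such as the 'Installation Result' heading)
    are ignored. Only the first colon on a line separates key and value.
    """
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result[key.strip()] = value.strip()
    return result
```

For the output above, the "Reboot Required" entry is the string "true", which indicates the host must be rebooted before the new driver takes effect.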

Datastore Management

Volume Sizing and Count

A common question when first provisioning storage on the FlashArray is what capacity to use for each volume. VMware VMFS supports a maximum size of 64 TB. The FlashArray supports far larger volumes than that, but for ESXi, volumes should not be made larger than 64 TB due to the filesystem limit of VMFS.

Using a smaller number of large volumes is generally the better approach today. In the past, a larger number of smaller volumes was recommended because of performance limitations that no longer exist.

That earlier recommendation was traditionally due to two reasons:

  • VMFS scalability issues due to locking
  • Per-volume queue limitations on the underlying array.

VMware resolved the first issue with the introduction of Atomic Test and Set (ATS), also called Hardware Assisted Locking.

Prior to the introduction of VAAI ATS, VMFS used LUN-level locking via full SCSI-2 reservations to acquire exclusive metadata control for a VMFS volume. In a cluster with multiple nodes, all metadata operations were serialized and hosts had to wait until whichever host, currently holding a lock, released that lock. This behavior not only caused metadata lock queues but also prevented standard I/O to a volume from VMs on other ESXi hosts which were not currently holding the lock.

With VAAI ATS, the lock granularity is reduced to a much smaller level of control (specific metadata segments, not an entire volume) for the VMFS that a given host needs to access. This behavior makes the metadata change process not only very efficient, but more importantly provides a mechanism for parallel metadata access while still maintaining data integrity and availability. ATS allows for ESXi hosts to no longer queue metadata change requests, which consequently speeds up operations that previously had to wait for a lock. Therefore, situations with large amounts of simultaneous virtual machine provisioning operations will see the most benefit.

The standard use cases benefiting the most from ATS include:

  • High virtual machine to VMFS density.
  • Extremely dynamic environments—numerous provisioning and de-provisioning of VMs (e.g. VDI using non-persistent linked-clones).
  • High intensity virtual machine operations such as boot storms, or virtual disk growth.

The introduction of ATS removed scaling limits by removing lock contention, which moved the bottleneck down to the storage, where many traditional arrays had per-volume I/O queue limits. This limited what a single volume could do from a performance perspective compared to what the array could do in aggregate. This is not the case with the FlashArray.

A FlashArray volume is not limited by an artificial performance limit or an individual queue. A single FlashArray volume can offer the full performance of an entire FlashArray, so provisioning ten volumes instead of one is not going to empty the HBAs out any faster. From a FlashArray perspective, there is no immediate performance benefit to using more than one volume for your virtual machines.

The main point is that there is always a bottleneck somewhere, and when you fix that bottleneck, it moves somewhere else in the storage stack. ESXi was once the bottleneck due to its locking mechanism, and ATS fixed that. This, in turn, moved the bottleneck down to the array volume queue depth limit. The FlashArray doesn’t have a volume queue depth limit, so the bottleneck has now moved back to ESXi and its internal queues.

Altering VMware queue limits is not generally needed with the exception of extraordinarily intense workloads. For high-performance configuration, refer to the section of this document on ESXi queue configuration.

VMFS Version Recommendations

Pure Storage recommends using the latest supported version of VMFS that is permitted by your ESXi host.

For ESXi 5.x through 6.0, use VMFS-5. For ESXi 6.5 and later it is highly recommended to use VMFS-6. It should be noted that VMFS-6 is not the default option for ESXi 6.5, so be careful to choose the correct version when creating new VMFS datastores in ESXi 6.5.

Example of VMFS-6 in ESXi 6.5:

vmfs-version.png

When upgrading to ESXi 6.5, there is no in-place upgrade path from VMFS-5 to VMFS-6. Therefore, it is recommended to create a new volume entirely, format it as VMFS-6, and then Storage vMotion all virtual machines from the old VMFS-5 datastore to the new VMFS-6 datastore. Once the migration is completed you can then delete and remove the VMFS-5 datastore from the ESXi host and FlashArray.

BEST PRACTICE: Use the latest supported VMFS version for the in-use ESXi host

Datastore Performance Management

ESXi and vCenter offer a variety of features to control the performance capabilities of a given datastore. This section will overview FlashArray support and recommendations for these features.

Queue Depth Limits

ESXi offers the ability to configure queue depth limits for devices on an HBA or iSCSI initiator. This dictates how many I/Os can be outstanding to a given device before I/Os start queuing in the ESXi kernel. If the queue depth limit is set too low, IOPS and throughput can be limited and latency can increase due to queuing. If too high, virtual machine I/O fairness can be affected and high-volume workloads can affect other workloads from other virtual machines or other hosts. The device queue depth limit is set on the initiator and the value (and setting name) varies depending on the model and type:

Type              Default Value    Value Name
QLogic            64               ql2xmaxqdepth
Brocade           32               bfa_lun_queue_depth
Emulex            32               lpfc0_lun_queue_depth
Cisco UCS         32               fnic_max_qdepth
Software iSCSI    128              iscsivmk_LunQDepth

Changing these settings requires a host reboot. For instructions to check and set these values, please refer to this VMware KB article:

Changing the queue depth for QLogic, Emulex, and Brocade HBAs
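
To see why a device queue depth limit that is too low can cap IOPS, a rough ceiling can be estimated with Little's Law (outstanding I/Os = IOPS x latency). The sketch below is illustrative only: the function name and the 0.5 ms latency figure are assumptions chosen for the example, not measured FlashArray values.

```python
def max_iops(queue_depth_limit: int, latency_ms: float) -> float:
    """Rough IOPS ceiling for one device, derived from Little's Law.

    outstanding I/Os = IOPS * latency, so IOPS = outstanding I/Os / latency.
    """
    if queue_depth_limit <= 0 or latency_ms <= 0:
        raise ValueError("queue depth limit and latency must be positive")
    # Convert milliseconds to seconds by scaling the numerator by 1000.
    return queue_depth_limit * 1000.0 / latency_ms


if __name__ == "__main__":
    # Hypothetical 0.5 ms latency, using default limits from the table above.
    for limit in (32, 64, 128):
        print(f"limit={limit:>3}  ceiling ~{max_iops(limit, 0.5):,.0f} IOPS")
```

At very low latency even the default limits allow a large number of I/Os per second, which is consistent with the guidance that the defaults rarely need to change.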

Disk Schedule Number Requests Outstanding (DSNRO)

There is a second per-device setting called “Disk Schedule Number Requests Outstanding”, often referred to as DSNRO. This is a hypervisor-level queue depth limit for an individual device. It defaults to 32 and can be increased to a maximum of 256.

This value only comes into play for a volume when that volume is being accessed by two or more virtual machines on that host. If more than one virtual machine is active on it, the lower of the two values (DSNRO or the HBA device queue depth limit) is the value that ESXi observes as the actual device queue depth limit. In other words, if a volume has two VMs on it, DSNRO is set to 32, and the HBA device queue depth limit is set to 64, the actual queue depth limit for that device is 32. For more information on DSNRO see the VMware KB here:

Setting the Maximum Outstanding Disk Requests for virtual machines
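
The rule described above can be written as a small function. This is a minimal sketch of the stated behavior only; the function and parameter names are made up for illustration.

```python
def effective_queue_depth(hba_limit: int, dsnro: int, active_vms: int) -> int:
    """Device queue depth limit that ESXi observes, per the rule above.

    With one active VM on the volume, only the HBA device queue depth
    limit applies. With two or more, the lower of the two values wins.
    """
    if active_vms >= 2:
        return min(hba_limit, dsnro)
    return hba_limit


if __name__ == "__main__":
    # The example from the text: two VMs, DSNRO 32, HBA limit 64.
    print(effective_queue_depth(hba_limit=64, dsnro=32, active_vms=2))
```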

In general, Pure Storage does not recommend changing these values. The majority of workloads are distributed across hosts and/or not intense enough to overwhelm the default queue depths. The FlashArray is fast enough (low enough latency) that the workload has to be quite high in order to overwhelm the queue.

If the default queue depth is consistently overwhelmed, the simplest option is to provision a new datastore and distribute some virtual machines to the new datastore. If a workload from a single virtual machine is too great for the default queue depth, then increasing the queue depth limit is the better option.

If a workload demands queue depths to be increased, Pure Storage recommends making both the HBA device queue depth limit and DSNRO equal.

Do not change these values without direction from VMware or Pure Storage support as this can have performance repercussions.

You can verify the values of both of these for a given device with the command:

esxcli storage core device list -d <naa.xxxxx>
Device Max Queue Depth: 96
No of outstanding IOs with competing worlds: 64
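
When checking many devices, the two fields can be pulled out of the command output programmatically. The sketch below assumes the output has already been captured as text (how it is collected is outside the example) and that the field labels match the sample output above; the function and key names are hypothetical.

```python
import re

# Sample text using the two field labels shown in the output above.
SAMPLE = """\
   Device Max Queue Depth: 96
   No of outstanding IOs with competing worlds: 64
"""


def parse_queue_settings(esxcli_output: str) -> dict:
    """Extract the HBA device queue depth limit and DSNRO from listing text."""
    fields = {
        "device_max_queue_depth": r"Device Max Queue Depth:\s*(\d+)",
        "dsnro": r"No of outstanding IOs with competing worlds:\s*(\d+)",
    }
    result = {}
    for key, pattern in fields.items():
        match = re.search(pattern, esxcli_output)
        # A missing field is reported as None rather than guessed.
        result[key] = int(match.group(1)) if match else None
    return result


if __name__ == "__main__":
    print(parse_queue_settings(SAMPLE))
```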

BEST PRACTICE: Do not modify queue depth limits; leave them at their defaults. Only raise them when performance requirements dictate it and Pure Storage Support or VMware Support provides appropriate guidance.

Dynamic Queue Throttling

ESXi supports the ability to dynamically throttle a device queue depth limit when an array volume has been overwhelmed. An array volume is overwhelmed when the array responds to an I/O request with a sense code of QUEUE FULL or BUSY. When a certain number of these are received, ESXi will reduce the queue depth limit for that device and slowly increase it as conditions improve.

This is controlled via two settings:

  • Disk.QFullSampleSize —The count of QUEUE FULL or BUSY conditions it takes before ESXi will start throttling. Default is zero (feature disabled)
  • Disk.QFullThreshold —The count of good condition responses after a QUEUE FULL or BUSY required before ESXi starts increasing the queue depth limit again
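
The interaction of the two settings can be pictured with a toy model. This is not ESXi's implementation: the size of the reduction (halving) and the recovery step (one slot at a time) are assumptions made purely for illustration, and the class and method names are hypothetical.

```python
class ThrottledQueue:
    """Toy model of dynamic queue depth throttling (not ESXi's actual code)."""

    def __init__(self, max_depth: int, qfull_sample_size: int, qfull_threshold: int):
        self.max_depth = max_depth
        self.depth = max_depth
        self.sample_size = qfull_sample_size   # models Disk.QFullSampleSize
        self.threshold = qfull_threshold       # models Disk.QFullThreshold
        self.busy_count = 0
        self.good_count = 0

    def on_response(self, queue_full_or_busy: bool) -> int:
        """Record one array response and return the current depth limit."""
        if self.sample_size == 0:
            return self.depth  # zero means the feature is disabled
        if queue_full_or_busy:
            self.busy_count += 1
            self.good_count = 0
            if self.busy_count >= self.sample_size:
                # ASSUMPTION: halve the limit; the real step size may differ.
                self.depth = max(1, self.depth // 2)
                self.busy_count = 0
        else:
            self.good_count += 1
            if self.good_count >= self.threshold and self.depth < self.max_depth:
                # ASSUMPTION: recover one slot at a time.
                self.depth += 1
                self.good_count = 0
        return self.depth
```

With a sample size of zero the limit never moves, which matches the default (disabled) behavior described above.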

The Pure Storage FlashArray does not advertise a QUEUE FULL condition for a volume. Since every volume can use the full performance and queue of the FlashArray, this limit is unrealistically high and this sense code will likely never be issued. Therefore, there is no reason to set or alter these values for Pure Storage FlashArray volumes because QUEUE FULL should rarely (or never) occur.

Storage I/O Control

VMware vCenter offers a feature called Storage I/O Control (SIOC) that will throttle selected virtual machines when a certain average latency has been reached or when a certain percentage of peak throughput has been hit on a given datastore. ESXi throttles virtual machines by artificially reducing the number of slots that are available to them in the device queue.

Pure Storage fully supports enabling this technology on datastores residing on the FlashArray. That being said, it may not be particularly useful for a few reasons.

First, the minimum latency that can be configured for SIOC before it will begin throttling a virtual machine is 5 ms.

SIOC-5ms.png

When a latency threshold is entered, vCenter will aggregate a weighted average of all disk latencies seen by all hosts that see that particular datastore. This number does not include host-side queuing; it is only the time it takes for the I/O to be sent across the SAN to the array and acknowledged back.
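
As a rough illustration of the aggregation described above, the sketch below computes a weighted average of per-host device latency and compares it to the threshold. How vCenter actually weights each host is not stated here, so weighting by I/O count is an assumption, as are the function names and sample figures.

```python
def datastore_latency_ms(host_samples) -> float:
    """Weighted average device latency across the hosts using a datastore.

    host_samples: iterable of (io_count, device_latency_ms), one per host.
    ASSUMPTION: each host is weighted by its I/O count.
    """
    samples = list(host_samples)
    total_io = sum(io for io, _ in samples)
    if total_io == 0:
        return 0.0
    return sum(io * latency for io, latency in samples) / total_io


def sioc_would_throttle(host_samples, threshold_ms: float = 5.0) -> bool:
    """True when the aggregated latency exceeds the configured threshold."""
    return datastore_latency_ms(host_samples) > threshold_ms


if __name__ == "__main__":
    # Two hypothetical hosts with sub-millisecond device latency.
    hosts = [(1000, 0.5), (1000, 0.7)]
    print(datastore_latency_ms(hosts), sioc_would_throttle(hosts))
```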

Furthermore, SIOC uses a random-read injector to identify the capabilities of a datastore from a performance perspective. At a high-level, it runs a quick series of tests with increasing numbers of outstanding I/Os to identify the throughput maximums via high latency identification. This allows ESXi to determine what the peak throughput is, for when the “Percentage of peak throughput” is chosen.

Knowing these factors, we can make these points about SIOC and the FlashArray:

  1. SIOC is not going to be particularly helpful if there is host-side queuing since it does not take host-induced latency into account. This (the ESXi device queue) is generally where most of the latency is introduced in a FlashArray environment.
  2. The FlashArray will rarely have sustained latency above 1 ms, thus reducing the likelihood that this threshold will be reached for any meaningful amount of time on a FlashArray volume.
  3. A single FlashArray volume does not have a queue limit, so it can handle quite a high number of outstanding I/O and throughput (especially reads), therefore SIOC and its random-read injector cannot identify FlashArray limits in meaningful ways.

In short, SIOC is fully supported by Pure Storage, but Pure Storage makes no specific recommendations for configuration.

Storage DRS

VMware vCenter also offers a feature called Storage Dynamic Resource Scheduler (Storage DRS / SDRS). SDRS moves virtual machines from one datastore to another when a certain average latency threshold has been reached on the datastore or when a certain used capacity has been reached. For this section, let’s focus on the performance-based moves.

Storage DRS, like Storage IO Control, waits for a certain latency threshold to be reached before it acts. And, also like SIOC, the minimum is 5 ms.

drs.png

While this threshold is generally too high to be useful for FlashArray-induced latency, SDRS differs from SIOC in the latency it actually looks at. SDRS uses the “VMObservedLatency” (referred to as GAVG in esxtop) averages from the hosts accessing the datastore. Therefore, this latency includes time spent queuing in the ESXi kernel. So, theoretically, with a high-IOPS workload and a low configured device queue depth limit, an I/O could conceivably spend 5 ms or more queuing in the kernel. In this situation Storage DRS will suggest moving a virtual machine to a datastore which does not have an overwhelmed queue.
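
The difference between the two latencies can be sketched numerically. In esxtop, GAVG is the sum of device latency (DAVG) and kernel latency (KAVG); those two counter names are not used elsewhere in this document and are brought in here as an assumption for illustration, as are the function names and sample values.

```python
def guest_latency_ms(davg_ms: float, kavg_ms: float) -> float:
    """GAVG as reported by esxtop: device latency plus kernel latency."""
    return davg_ms + kavg_ms


def threshold_reached(davg_ms: float, kavg_ms: float, threshold_ms: float = 5.0) -> dict:
    """Compare what SIOC and SDRS would each observe against a threshold."""
    return {
        # SIOC looks at device latency only (no host-side queuing).
        "sioc": davg_ms >= threshold_ms,
        # SDRS looks at GAVG, which includes time queued in the kernel.
        "sdrs": guest_latency_ms(davg_ms, kavg_ms) >= threshold_ms,
    }


if __name__ == "__main__":
    # Hypothetical case: fast array, heavily queued host.
    print(threshold_reached(davg_ms=0.6, kavg_ms=5.2))
```

In the hypothetical case shown, only the SDRS check crosses 5 ms, because all of the delay is host-side queuing.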

That being said, this is still an unlikely scenario because:

  1. The FlashArray empties out the queue fast enough that a workload must be quite intense to fill up an ESXi queue so much that an I/O spends 5 ms or more in it. Usually, with a workload like that, the queuing is higher up the stack (in the virtual machine).
  2. Storage DRS samples for 16 hours before it makes a recommendation, so typically you will get one recommendation set per day for a datastore. This workload must therefore be consistently and extremely high, for a long time, before SDRS acts.

In short, SDRS is fully supported by Pure Storage, but Pure Storage makes no specific recommendations for performance-based move configuration.

Datastore Capacity Management

Managing the capacity usage of your VMFS datastores is an important part of regular care in your virtual infrastructure. There are a variety of mechanisms inside of ESXi and vCenter to monitor capacity. Frequently, the concept of data reduction on the FlashArray is seen as a complicating factor, when in reality it is a simplifying factor, or at worst, a non-issue.

Let’s overview some concepts on how to best manage VMFS datastores from a capacity perspective.

VMFS Usage vs. FlashArray Volume Capacity

VMFS reports how much is currently allocated in the filesystem on that volume. The type of virtual disk (thin or thick) dictates how much is consumed upon creation of the virtual machine (or virtual disk specifically). Thin disks only allocate what the guest has actually written to, and therefore VMFS only records what the virtual machine has written in its space usage. Thick type virtual disks allocate the full virtual disk immediately, so VMFS records much more space as being used than is actually used by the virtual machines.

This is one of the reasons thin virtual disks are preferred—you get better insight into how much space the guests are actually using.

Regardless of what type you choose, ESXi is going to take the sum total of the allocated space of your virtual disks and compare that to the total capacity of the filesystem of the volume. The used space is the sum of those virtual disk allocations. This number increases as virtual disks grow or new ones are added, and can decrease as old ones are deleted, moved, or even shrunk.
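
The accounting described above can be sketched as follows. The class, field, and function names are hypothetical, and the model ignores filesystem metadata and other overhead; it only shows how thin and thick disks contribute differently to VMFS used space.

```python
from dataclasses import dataclass


@dataclass
class VirtualDisk:
    provisioned_gb: float   # size presented to the guest
    written_gb: float       # what the guest has actually written
    thin: bool

    @property
    def allocated_gb(self) -> float:
        # Thin disks allocate only what has been written;
        # thick disks allocate the full provisioned size immediately.
        return self.written_gb if self.thin else self.provisioned_gb


def vmfs_used_gb(disks) -> float:
    """VMFS used space modeled as the sum of virtual disk allocations."""
    return sum(disk.allocated_gb for disk in disks)


if __name__ == "__main__":
    disks = [
        VirtualDisk(provisioned_gb=100.0, written_gb=20.0, thin=True),
        VirtualDisk(provisioned_gb=100.0, written_gb=20.0, thin=False),
    ]
    print(vmfs_used_gb(disks))
```

Both guests have written the same amount, yet the thick disk contributes five times as much to the used-space figure.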

Compare this to what the FlashArray reports for capacity. What the FlashArray reports for volume usage is NOT the amount used for that volume. What the FlashArray reports is the unique footprint of the volume on that array.

In the example below we can see that we are using a 5 TB FlashArray volume and VMFS datastore. The example confirms that the VMFS datastore reports a total of 720.72 GB of used space on the 5 TB filesystem. This tells us that there is a combined total of 720.72 GB of allocated virtual disks on this filesystem:

storage-capacity.png

Now let’s look at the FlashArray volume.

FA-capacity.png

The FlashArray volume shows that 50.33 GB is being used. Does this mean that VMFS is incorrect? No. VMFS is always the source of truth. The “Volumes” metric on the FlashArray simply represents the amount of physical capacity that has been written to the volume after data reduction that no other volume shares.

This metric can change at any time as the data set changes on that volume or any other volume on the FlashArray. If, for instance, some other host writes 2 GB to another volume (let’s call it “volume2”), and that 2 GB happens to be identical to 2 GB of that 50.33 GB on “sn1-m20-e05-28-prod-ds”, then “sn1-m20-e05-28-prod-ds” would no longer have 50.33 GB of unique space. It would drop down to 48.33 GB, even though nothing changed on “sn1-m20-e05-28-prod-ds” itself. Instead, another application just happened to write similar data, making the footprint of “sn1-m20-e05-28-prod-ds” less unique.
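
The effect in the example above can be reproduced with a toy model in which each volume is a set of block fingerprints. This is a deliberate simplification for illustration, not the FlashArray's data reduction algorithm; the function name, volume names, and block counts are made up.

```python
def unique_footprint(volumes: dict, name: str) -> int:
    """Count blocks on volume `name` that no other volume also holds.

    volumes maps a volume name to a set of block fingerprints.
    Toy model only; real data reduction accounting is more involved.
    """
    shared_elsewhere = set()
    for other_name, blocks in volumes.items():
        if other_name != name:
            shared_elsewhere |= blocks
    return len(volumes[name] - shared_elsewhere)


if __name__ == "__main__":
    volumes = {"prod-ds": set(range(10)), "volume2": set()}
    print(unique_footprint(volumes, "prod-ds"))   # all blocks are unique

    # Another host writes two blocks identical to data already on prod-ds.
    volumes["volume2"] |= {0, 1}
    print(unique_footprint(volumes, "prod-ds"))   # fewer unique blocks now
```

Nothing was written to the first volume between the two calls, yet its unique footprint shrinks, which is why this metric should not be used for capacity tracking.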

For a more detailed conversation around this, refer to this blog post:

http://www.codyhosterman.com/2017/01/vmfs-capacity-monitoring-in-a-data-reducing-world/

Why doesn’t VMFS report the same used capacity as the FlashArray for the underlying volume?

Well, because they mean different things. VMware reports what is allocated on the VMFS and the FlashArray reports what is unique to the underlying volume. The FlashArray value can change constantly. The FlashArray metric is only meant to show how reducible the data on that volume is internal to the volume and against the entire array. Conversely, VMFS capacity usage is based solely on how much capacity is allocated to it by virtual machines. The FlashArray volume space metric, on the other hand, actually relates to what is also being used on other volumes. In other words, VMFS usage is only affected by data on the VMFS volume itself. The FlashArray volume space metric is affected by the data on all of the volumes. So the two values should not be conflated.

For capacity tracking, you should refer to the VMFS usage. How do we best track VMFS usage? What do we do when it is full?

Monitoring and Managing VMFS Capacity Usage

As virtual machines grow and as new ones are added, the VMFS volume they sit on will slowly fill up. How to respond and to manage this is a common question.

In general, using a product like vRealize Operations Manager (vROps) with the FlashArray Management Pack is a great option here. But for the purposes of this document we will focus on what can be done inside of vCenter alone.

You need to decide on a few things:

  • At what percentage full of my VMFS volume do I become concerned?
  • When that happens what should I do?
  • What capacity value should I monitor on the FlashArray?

The first question is the easiest to answer. Choose either a percentage full or a certain amount of free capacity. Do you want to do something when, for example, a VMFS volume hits 75% full or when there is less than 50 GB free? Choose what makes sense to you.
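
Either style of threshold reduces to a simple check. The sketch below is a minimal illustration with hypothetical names; the 75% and 50 GB figures are the examples from the paragraph above, not recommendations.

```python
def capacity_alert(total_gb: float, used_gb: float,
                   max_pct_full: float = None, min_free_gb: float = None) -> bool:
    """Return True when either configured threshold has been crossed."""
    if total_gb <= 0:
        raise ValueError("total capacity must be positive")
    free_gb = total_gb - used_gb
    pct_full = 100.0 * used_gb / total_gb
    if max_pct_full is not None and pct_full >= max_pct_full:
        return True
    if min_free_gb is not None and free_gb < min_free_gb:
        return True
    return False


if __name__ == "__main__":
    # A 5 TB datastore with 720.72 GB used, checked against 75% full.
    print(capacity_alert(5120, 720.72, max_pct_full=75))
    # A 1 TB datastore with only 40 GB free, checked against 50 GB free.
    print(capacity_alert(1000, 960, min_free_gb=50))
```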

vCenter alerts are a great way to monitor VMFS capacity automatically. There is a default alert for datastore capacity, but it does not do anything other than tag the datastore object with the alarm state. Pure Storage recommends creating an additional alarm for capacity that executes some type of additional action when the alarm is triggered.

Configuring a script to run, an email to be issued, or a notification trap to be sent greatly diminishes the chance of a datastore running out of space unnoticed.

capacity-w-alarm.png

severity-warning.png

severity-critical.png

capacity-w-alarm-completed.png

BEST PRACTICE: Configure capacity alerts to send a message or initiate an action.

The next step is to decide what happens when a capacity warning occurs.

There are a few options:

  1. Increase the capacity of the volume
  2. Move virtual machines off of the volume
  3. Add a new volume

Your solution may be one of these options or a mix of all three. Let’s quickly walk through the options.

Option 1: Increase the capacity of the volume

This is the simplest option. If capacity has crossed the threshold you have specified, increase the volume capacity to clear the threshold.

The process is:

1. Increase the FlashArray volume capacity.

resize-1.png

2. Rescan the hosts that use the datastore.

rescan-storage.png

3. Increase the VMFS to use the new capacity.

increase-ds-capacity.png

increase-ds-capacity-2.png

4. Choose “Use ‘Free space xxx GB/TB’ to expand the datastore”.

increase-ds-capacity-3.png

There should be a note that the datastore already occupies space on this volume. If this note does not appear, you have selected the wrong device to expand. Pure Storage highly recommends that you do not create VMFS datastores that span multiple volumes—a VMFS should have a one to one relationship to a FlashArray volume.

5. This will clear the alarm and add additional capacity.
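
Before resizing, it can help to estimate how large the volume must become to clear the alarm. The sketch below is one hedged way to do that: it assumes a percentage-full threshold, rounds up to whole terabytes for convenience, and refuses sizes above the 64 TB VMFS maximum. The function name and the rounding choice are illustrative assumptions.

```python
import math

VMFS_MAX_TB = 64  # maximum VMFS datastore size noted earlier in this document


def target_size_tb(used_tb: float, max_pct_full: float) -> int:
    """Smallest whole-TB size that puts usage strictly below max_pct_full."""
    if used_tb < 0 or not 0 < max_pct_full <= 100:
        raise ValueError("used_tb must be >= 0 and max_pct_full in (0, 100]")
    # used / size < pct / 100  is equivalent to  size > used * 100 / pct
    needed = math.floor(used_tb * 100.0 / max_pct_full) + 1
    if needed > VMFS_MAX_TB:
        raise ValueError("target exceeds 64 TB; move VMs or add a new datastore")
    return needed


if __name__ == "__main__":
    # 8 TB used with a 75% alarm threshold.
    print(target_size_tb(8, 75))
```

When the result would exceed 64 TB, growing the volume is no longer an option and one of the other two options applies.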

Option 2: Move virtual machines off of the volume

Another option is to move one or more virtual machines from a more-full datastore to a less-full datastore. While this can be manually achieved through case-by-case Storage vMotion, Pure Storage recommends leveraging Storage DRS to automate this. Storage DRS provides, in addition to the performance-based moves discussed earlier in this document, the ability to automatically Storage vMotion VMs based on capacity usage of VMFS datastores. If a datastore reaches a certain percent full, SDRS can automatically move, or make recommendations for, virtual machines to be moved to balance out space usage across volumes.

1. SDRS is enabled on a datastore cluster.

sdrs-create.png

sdrs-1.png

2. When a datastore cluster is created you can enable SDRS and choose capacity threshold settings, which can either be a percentage or a capacity amount.

sdrs-2.png

Pure Storage has no specific recommendations for these values; they can be decided upon based on your own environment. Pure Storage does have a few recommendations for datastore cluster configuration in general:

  • Only include datastores on the same FlashArray in a given datastore cluster. This will allow Storage vMotion to use the VAAI XCOPY offload to accelerate the migration process of virtual machines and greatly reduce the footprint of the migration workload.
  • Include datastores with similar configurations in a datastore cluster. For example, if a datastore is replicated on the FlashArray, only include datastores that are replicated in the same FlashArray protection group so that an SDRS migration does not violate required protection for a virtual machine.

Option 3: Create a new VMFS volume

The last option is to create an entirely new VMFS volume. You might decide to do this for a few reasons:

  • The current VMFS volumes have maxed out possible capacity (64 TB each).
  • The current VMFS volumes have overloaded the queue depth inside of every ESXi server using it. Therefore, they can be grown in capacity, but cannot provide any more performance due to ESXi limits.

In this situation follow the standard VMFS provisioning steps for a new datastore. Once the creation of volumes and hosts/host groups and the volume connection is complete, the volumes will be accessible to the ESXi host(s). Using the vSphere Web Client, initiate a “Rescan Storage…” to make the newly-connected Pure Storage volume(s) fully-visible to the ESXi servers in the cluster. One can then use the “Add Storage” wizard to format the newly added volume.

Shrinking a Volume

While it is possible to shrink a FlashArray volume non-disruptively, vSphere does not have the ability to shrink a VMFS partition. Therefore, do not shrink FlashArray volumes that contain VMFS datastores, as doing so could incur data loss.

If you have mistakenly increased the size of a datastore, or a larger datastore is simply no longer required, the right approach is to create a new datastore at the required size and then migrate the VMs from the old datastore to the new one. Once the migration has been completed you can destroy the old datastore and remove the volume from the FlashArray.

Mounting a Snapshot Volume

The Pure Storage FlashArray provides the ability to take local or remote point-in-time snapshots of volumes which can then be used for backup/restore and/or test/dev. When a snapshot is taken of a volume containing VMFS, there are a few additional steps from both the FlashArray and vSphere sides to be able to access the snapshot point-in-time data.

When a FlashArray snapshot is taken, a new volume is not created—essentially it is a metadata point-in-time reference to data blocks on the array that reflect that moment’s version of the data. This snapshot is immutable and cannot be directly mounted. Instead, the metadata of a snapshot has to be “copied” to an actual volume which then allows the point-in-time, which was preserved by the snapshot metadata, to be presented to a host. This behavior allows the snapshot to be re-used again and again without changing the data in that snapshot. If a snapshot is not needed more than one time an alternative option is to create a direct snap copy from one volume to another—merging the snapshot creation step with the association step.

When a volume hosting a VMFS datastore is copied via array-based snapshots, the copied VMFS datastore is now on a volume that has a different serial number than the original source volume. Therefore, the VMFS will be reported as having an invalid signature since the VMFS datastore signature is a hash partially based on the serial of the hosting device. Consequently, the device will not be automatically mounted upon rescan—instead the new datastore wizard needs to be run to find the device and resignature the VMFS datastore. Pure Storage recommends resignaturing copied volumes rather than mounting them with an existing signature (referred to as force mounting).

BEST PRACTICE: "Assign a new signature" to copied VMFS volumes and do not force mount them.

new-ds-snapshot.png
For additional details on resignaturing and snapshot management, please refer to the following blog posts:

  • Mounting an unresolved VMFS
  • Why not force mount?
  • Why might a VMFS resignature operation fail?
  • How to correlate a VMFS and a FlashArray volume
  • How to snapshot a VMFS on the FlashArray
  • Restoring a single VM from a FlashArray snapshot

VMFS Usage vs. FlashArray Volume Capacity

VMFS reports how much space is currently allocated in the filesystem on that volume. The type of virtual disk (thin or thick) dictates how much space is consumed when the virtual machine (or, more specifically, the virtual disk) is created. Thin disks only allocate what the guest has actually written to, and therefore VMFS only records what the virtual machine has written in its space usage. Thick type virtual disks allocate the full virtual disk immediately, so VMFS records much more space as being used than is actually used by the virtual machines.

This is one of the reasons thin virtual disks are preferred—you get better insight into how much space the guests are actually using.

Regardless of what type you choose, ESXi is going to take the sum total of the allocated space of your virtual disks and compare that to the total capacity of the filesystem of the volume. The used space is the sum of those virtual disk allocations. This number increases as virtual disks grow or new ones are added, and can decrease as old ones are deleted, moved, or even shrunk.

Compare this to what the FlashArray reports for capacity. What the FlashArray reports for volume usage is NOT the amount used for that volume. What the FlashArray reports is the unique footprint of the volume on that array.

In the example below, we are using a 5 TB FlashArray volume and VMFS datastore. The example confirms that the VMFS datastore reports a total of 720.72 GB of used space on the 5 TB filesystem. This tells us that there is a combined total of 720.72 GB of allocated virtual disks on this filesystem:

storage-capacity.png

Now let’s look at the FlashArray volume.

FA-capacity.png

The FlashArray volume shows that 50.33 GB is being used. Does this mean that VMFS is incorrect? No. VMFS is always the source of truth. The “Volumes” metric on the FlashArray simply represents the amount of physical capacity that has been written to the volume after data reduction that no other volume shares.

This metric can change at any time as the data set changes on that volume or any other volume on the FlashArray. If, for instance, some other host writes 2 GB to another volume (let’s call it “volume2”), and that 2 GB happens to be identical to 2 GB of that 50.33 GB on “sn1-m20-e05-28-prod-ds”, then “sn1-m20-e05-28-prod-ds” would no longer have 50.33 GB of unique space. It would drop down to 48.33 GB, even though nothing changed on “sn1-m20-e05-28-prod-ds” itself. Instead, another application just happened to write similar data, making the footprint of “sn1-m20-e05-28-prod-ds” less unique.
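As a rough illustration of why the unique footprint can shrink without any change to the volume itself, the sketch below models each volume as a text file of block fingerprints, one per line. This is a toy model and not how Purity tracks data internally; the function name and the file layout are assumptions made for the example.

```shell
# Toy model: count the fingerprints that only the first volume holds.
# $1 and $2 are files listing one block fingerprint per line (hypothetical format).
unique_block_count() {
    a=$(mktemp) && b=$(mktemp) || return 1
    sort -u "$1" > "$a"          # distinct fingerprints on the first volume
    sort -u "$2" > "$b"          # distinct fingerprints on the other volume
    # comm -23 keeps lines found only in the first sorted file
    comm -23 "$a" "$b" | wc -l | tr -d ' '
    rm -f "$a" "$b"
}
```

If the second file later gains fingerprints that already exist in the first, the count printed for the first volume drops even though its own file is untouched, which mirrors the 50.33 GB to 48.33 GB example above.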

For a more detailed conversation around this, refer to this blog post:

http://www.codyhosterman.com/2017/01/vmfs-capacity-monitoring-in-a-data-reducing-world/

Why doesn’t VMFS report the same used capacity as the FlashArray for the underlying volume?

Because they mean different things. VMware reports what is allocated on the VMFS, and the FlashArray reports what is unique to the underlying volume. VMFS capacity usage is based solely on how much capacity is allocated to it by virtual machines, so it is only affected by data on the VMFS volume itself. The FlashArray volume space metric shows how reducible the data on that volume is, both internal to the volume and against the entire array, so it is affected by the data on all of the volumes and can change constantly. The two values should not be conflated.

For capacity tracking, you should refer to the VMFS usage. How do we best track VMFS usage? What do we do when it is full?

Monitoring and Managing VMFS Capacity Usage

As virtual machines grow and as new ones are added, the VMFS volume they sit on will slowly fill up. How to respond and to manage this is a common question.

In general, using a product like vRealize Operations Manager (vROps) with the FlashArray Management Pack is a great option here. But for the purposes of this document we will focus on what can be done inside of vCenter alone.

You need to decide on a few things:

  • At what percentage full of my VMFS volume do I become concerned?
  • When that happens what should I do?
  • What capacity value should I monitor on the FlashArray?

The first question is the easiest to answer. Choose either a percentage full or an amount of free capacity as the threshold. Do you want to do something when, for example, a VMFS volume hits 75% full or when there is less than 50 GB free? Choose what makes sense to you.
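To make the two threshold styles concrete, here is a minimal sketch of the comparison using whole-GB integer arithmetic. The function name and argument order are hypothetical, and for illustration it checks both styles at once rather than just one.

```shell
# $1: datastore capacity in GB, $2: used GB,
# $3: percent-full threshold, $4: minimum free GB
datastore_status() {
    pct_used=$(( $2 * 100 / $1 ))    # integer percent, rounded down
    free_gb=$(( $1 - $2 ))
    if [ "$pct_used" -ge "$3" ] || [ "$free_gb" -lt "$4" ]; then
        echo "ALERT ${pct_used}% used, ${free_gb} GB free"
    else
        echo "OK ${pct_used}% used, ${free_gb} GB free"
    fi
}
```

With the thresholds from the text, `datastore_status 5120 720 75 50` (roughly the 5 TB datastore shown earlier) reports OK, while `datastore_status 5120 3900 75 50` reports ALERT because the datastore has crossed 75% full.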

vCenter alerts are a great way to monitor VMFS capacity automatically. There is a default alert for datastore capacity, but it does not do anything other than tag the datastore object with the alarm state. Pure Storage recommends creating an additional alarm for capacity that executes some type of additional action when the alarm is triggered.

Configuring a script to run, an email to be issued, or a notification trap to be sent greatly diminishes the chance of a datastore running out of space unnoticed.

capacity-w-alarm.png

severity-warning.png

severity-critical.png

capacity-w-alarm-completed.png

BEST PRACTICE: Configure capacity alerts to send a message or initiate an action.

The next step is to decide what happens when a capacity warning occurs.

There are a few options:

  1. Increase the capacity of the volume
  2. Move virtual machines off of the volume
  3. Add a new volume

Your solution may be one of these options or a mix of all three. Let’s quickly walk through the options.

Option 1: Increase the capacity of the volume

This is the simplest option. If capacity has crossed the threshold you have specified, increase the volume capacity to clear the threshold.

The process is:

1. Increase the FlashArray volume capacity.

resize-1.png

2. Rescan the hosts that use the datastore.

rescan-storage.png

3. Increase the VMFS to use the new capacity.

increase-ds-capacity.png

increase-ds-capacity-2.png

4. Choose “Use ‘Free space xxx GB/TB’ to expand the datastore”.

increase-ds-capacity-3.png

There should be a note that the datastore already occupies space on this volume. If this note does not appear, you have selected the wrong device to expand. Pure Storage highly recommends that you do not create VMFS datastores that span multiple volumes—a VMFS should have a one to one relationship to a FlashArray volume.

5. This will clear the alarm and add additional capacity.

Option 2: Move virtual machines off of the volume

Another option is to move one or more virtual machines from a more-full datastore to a less-full datastore. While this can be manually achieved through case-by-case Storage vMotion, Pure Storage recommends leveraging Storage DRS to automate this. Storage DRS provides, in addition to the performance-based moves discussed earlier in this document, the ability to automatically Storage vMotion VMs based on capacity usage of VMFS datastores. If a datastore reaches a certain percent full, SDRS can automatically move, or make recommendations for, virtual machines to be moved to balance out space usage across volumes.

1. SDRS is enabled on a datastore cluster.

sdrs-create.png

sdrs-1.png

2. When a datastore cluster is created you can enable SDRS and choose capacity threshold settings, which can either be a percentage or a capacity amount.

sdrs-2.png

Pure Storage has no specific recommendations for these values; they can be decided upon based on your own environment. Pure Storage does have a few recommendations for datastore cluster configuration in general:

  • Only include datastores on the same FlashArray in a given datastore cluster. This will allow Storage vMotion to use the VAAI XCOPY offload to accelerate the migration process of virtual machines and greatly reduce the footprint of the migration workload.
  • Include datastores with similar configurations in a datastore cluster. For example, if a datastore is replicated on the FlashArray, only include datastores that are replicated in the same FlashArray protection group so that a SDRS migration does not violate required protection for a virtual machine.

Option 3: Create a new VMFS volume

The last option is to create an entirely new VMFS volume. You might decide to do this for a few reasons:

  • The current VMFS volumes have maxed out possible capacity (64 TB each).
  • The current VMFS volumes have overloaded the queue depth inside of every ESXi server using it. Therefore, they can be grown in capacity, but cannot provide any more performance due to ESXi limits.

In this situation follow the standard VMFS provisioning steps for a new datastore. Once the creation of volumes and hosts/host groups and the volume connection is complete, the volumes will be accessible to the ESXi host(s). Using the vSphere Web Client, initiate a “Rescan Storage…” to make the newly-connected Pure Storage volume(s) fully-visible to the ESXi servers in the cluster. One can then use the “Add Storage” wizard to format the newly added volume.

Shrinking a Volume

While it is possible to shrink a FlashArray volume non-disruptively, vSphere does not have the ability to shrink a VMFS partition. Therefore, do not shrink FlashArray volumes that contain VMFS datastores, as doing so could incur data loss.

If you have mistakenly increased the size of a datastore, or a larger datastore is simply no longer required, create a new datastore at the required size and then migrate the VMs from the old datastore to the new one. Once the migration has completed, you can destroy the old datastore and remove the volume from the FlashArray.

Deleting a Datastore

Prior to the deletion of a volume, ensure that all important data has been moved off or is no longer needed. From the vSphere Web Client (or CLI) delete or unmount the VMFS volume and then detach the underlying device from the appropriate host(s).

After a volume has been detached from the ESXi host(s) it must first be disconnected (from the FlashArray perspective) from the host within the Purity GUI before it can be destroyed (deleted) on the FlashArray.

BEST PRACTICE: Unmount and detach FlashArray volumes from all ESXi hosts before destroying them on the array.

1. Unmount the VMFS datastore on every host that it is mounted to.

unmount-ds-1.png

2. Detach the volume that hosted the datastore from every ESXi host that sees the volume.

unmount-ds-2.png

3. Disconnect the volume from the hosts or host groups on the FlashArray.

fa-vol-disconnect.png

4. Destroy the volume on FlashArray.

fa-vol-destroy.png
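Steps 1 and 2 above can also be carried out from the ESXi shell. The sketch below only prints the two commands so they can be reviewed before being run on each host. The function name is hypothetical, the datastore label and device identifier are placeholders, and the option spellings should be confirmed with `esxcli storage filesystem unmount --help` and `esxcli storage core device set --help` on your release.

```shell
# $1: datastore label, $2: device identifier (naa.…).
# Prints the commands without running them.
datastore_detach_commands() {
    # Step 1: unmount the VMFS datastore on this host
    printf 'esxcli storage filesystem unmount --volume-label="%s"\n' "$1"
    # Step 2: detach the underlying device from this host
    printf 'esxcli storage core device set --device=%s --state=off\n' "$2"
}
```

Both commands act on the local host only, so the output has to be applied on every ESXi host that sees the volume before the volume is disconnected on the FlashArray.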

By default a volume can be recovered after deletion for 24 hours to protect against accidental removal. Therefore, we recommend allowing the FlashArray to eradicate the volume automatically in 24 hours in case the volume is needed for recovery efforts.

(See below on how to recover a volume)

fa-vol-recover.png

This entire removal and deletion process is automated through the Pure Storage Plugin for the vSphere Web Client and its use is therefore recommended.

FlashArray Configuration

Host and Host Group Creation

This section describes the recommendations for creating provisioning objects (called hosts and host groups) on the FlashArray. The purpose is to outline the proper configuration for general understanding.

The FlashArray has two object types for volume provisioning, hosts and host groups:

  • Host: a host is a collection of initiators (Fibre Channel WWPNs, iSCSI IQNs or NVMe NQNs) that refers to a physical host. A FlashArray host object must have a one-to-one relationship with an ESXi host. Every active initiator for a given ESXi host should be added to the respective FlashArray host object. If an initiator is not yet zoned (for instance), and not intended to be, it can be omitted from the FlashArray host object. Furthermore, while the FlashArray supports multiple protocols for a single host (a mixture of FC, iSCSI and NVMe), ESXi does not support presenting VMFS storage via more than one protocol, so creating a multi-protocol host object should be avoided on the FlashArray when in use with VMware ESXi.

    In the example below, the ESXi host has two online Fibre Channel HBAs with WWPNs of 21:00:00:0e:1e:1e:7b:e0 and 21:00:00:0e:1e:1e:7b:e1

fa-host-ports.png

  • Host Group: a host group is a collection of host objects. Pure Storage recommends grouping your ESXi hosts into clusters within vCenter, as this provides a variety of benefits like High Availability and the Distributed Resource Scheduler (DRS). In order to provide simple provisioning, Pure Storage also recommends creating host groups that correspond to VMware clusters. Therefore, for every VMware cluster that will use FlashArray storage, a respective host group should be created. Every ESXi host that is in the cluster should have a corresponding host (as described above) that is added to the host group. The host group and its respective cluster should have the same number of hosts; it is recommended not to have more or fewer hosts in the host group than are in the cluster. While an unmatched count is supported, matching makes cluster-based provisioning simpler, and a variety of orchestration integrations require the two to match, so it is highly recommended.
    fa-host-group.png

BEST PRACTICE: Match FlashArray host groups with vCenter clusters.

Be aware that moving a host out of a host group will disconnect the host from any volume that is connected to the host group. Doing so will cause a Permanent Device Loss (PDL) scenario for any datastores that are using the volumes connected to that host group.

Setting the FlashArray “ESXi” Host Personality

For FlashArrays running 5.3.6 or earlier, DO NOT make this change online. If an ESXi host is running VMs on the array you are setting the host personality on, data unavailability can occur. A fabric logout and login may occur and accidental PDL can occur. To avoid this possibility, only set this personality on hosts that are in maintenance mode or are not actively using that array. If the FlashArray is running 5.3.7 or later the ESXi host personality can be set online.

In Purity 5.1 and later, there is a new host personality type for VMware ESXi hosts. Changing a host personality on a host object on the FlashArray causes the array to change some of its behavior for specific host types.

In general, we endeavor inside of Purity to automatically behave in the correct way without specific configuration changes. Due to a variety of host types supported and varying requirements (a good example is SCSI interaction for features like ActiveCluster & ActiveDR) a manual configuration was required.

In Purity 5.1 and later, it is recommended to enable the “ESXi” host personality for all host objects that represent ESXi hosts.

The ESXi personality does the following things as of Purity 5.1.0:

  • Makes the FlashArray issue a Permanent Device Loss SCSI sense response to ESXi when a pod goes offline due to a mediator loss. If this is not set, no response is sent and vSphere HA does not detect the failure properly and will not restart VMs running on the failed hosts.
  • ESXi uses peripheral LUN IDs instead of flat LUN IDs. This changes how ESXi views any LUN ID on the FlashArray above 255. Since ESXi does not properly interpret flat LUN IDs, it sees any LUN ID higher than 255 as 16,383 higher than it should be (256 is seen as 16,639), which is outside of the supported range of ESXi. Setting the ESXi personality on the FlashArray for a given host switches the FlashArray LUN methodology to peripheral, allowing ESXi to see LUN IDs higher than 255.

While this personality change is currently only relevant for specific ActiveCluster environments and/or environments that want to use higher-than-255 LUN IDs, it is still recommended to set this on all ESXi host objects. Moving forward other behavior changes for ESXi might be included and doing it now ensures it is not missed when it might be important for your environment.

BEST PRACTICE: Set FlashArray host objects to have the FlashArray “ESXi” host personality when using Purity 5.1 or later. This change is REQUIRED for all environments using Purity 6.0+.

The ESXi host personality can be set through the FlashArray GUI, the CLI or REST. To set it via the GUI, click on Storage, then Hosts, then the host you would like to configure:

hostper1.png

Next, go to the Details pane and click the vertical ellipsis and choose Set Personality…:

hostper2.png

Choose the radio button corresponding to ESXi and then click Save.

hostper3.png

hostpers4.png

Connecting Volumes to Hosts

A FlashArray volume can be connected to either host objects or host groups. If a volume is intended to be shared by the entire cluster, it is recommended to connect the volume to the host group, not the individual hosts. This makes provisioning easier and helps ensure the entire ESXi cluster has access to the volume. Generally, volumes that are intended to host virtual machines should be connected at the host group level.

Private volumes, like ESXi boot volumes, should not be connected to the host group as they should not be shared. These volumes should be connected to the host object instead.

Pure Storage has no requirement on LUN IDs for VMware ESXi environments, and users should, therefore, rely on the automatic LUN ID selection built into Purity.

ESXi Host Configuration

VMware Native Multipathing Plugin (NMP) Configuration

VMware offers a Native Multipathing Plugin (NMP) layer in vSphere through Storage Array Type Plugins (SATP) and Path Selection Policies (PSP) as part of the VMware APIs for Pluggable Storage Architecture (PSA). The SATP has all the knowledge of the storage array to aggregate I/Os across multiple channels and has the intelligence to send failover commands when a path has failed. The Path Selection Policy can be either “Fixed”, “Most Recently Used” or “Round Robin”.

Round Robin Path Selection Policy

To best leverage the active-active nature of the front end of the FlashArray, Pure Storage requires that you configure FlashArray volumes to use the Round Robin Path Selection Policy. The Round Robin PSP rotates between all discovered paths for a given volume which allows ESXi (and therefore the virtual machines running on the volume) to maximize the possible performance by using all available resources (HBAs, target ports, etc.).

BEST PRACTICE: Use the Round Robin Path Selection Policy for FlashArray volumes.

The I/O Operations Limit

The Round Robin Path Selection Policy allows for additional tuning of its path-switching behavior in the form of a setting called the I/O Operations Limit. The I/O Operations Limit (sometimes called the “IOPS” value) dictates how often ESXi switches logical paths for a given device. By default, when Round Robin is enabled on a device, ESXi will switch to a new logical path every 1,000 I/Os. In other words, ESXi will choose a logical path, and start issuing all I/Os for that device down that path. Once it has issued 1,000 I/Os for that device, down that path, it will switch to a new logical path and so on.

Pure Storage recommends tuning this value down to the minimum of 1. This will cause ESXi to change logical paths after every single I/O, instead of 1,000.
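The effect of the limit on path usage can be pictured with a simplified model in which paths are used in a fixed order and the switch happens exactly every "limit" I/Os. This is an assumption made for illustration and not a description of ESXi internals; the function name is hypothetical.

```shell
# $1: total I/Os issued, $2: I/O Operations Limit, $3: number of logical paths.
# Prints how many I/Os each path receives under the simplified model.
io_per_path() {
    awk -v total="$1" -v limit="$2" -v paths="$3" 'BEGIN {
        # I/O number i goes down path (i div limit) mod paths
        for (i = 0; i < total; i++) count[int(i / limit) % paths]++
        for (p = 0; p < paths; p++) printf "path %d: %d I/Os\n", p, count[p]
    }'
}
```

Under this model, `io_per_path 2000 1000 4` leaves two of the four paths idle for the whole burst, while `io_per_path 2000 1 4` gives each path 500 I/Os.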

This recommendation is made for a few reasons:

  1. Performance. Often the reason cited to change this value is performance. While this is true in certain cases, the performance impact of changing this value is not usually profound (generally in the single digits of a percentage performance increase). While changing this value from 1,000 to 1 can improve performance, it generally will not solve a major performance problem. Regardless, changing this value can improve performance in some use cases, especially with iSCSI.
  2. Path Failover Time. It has been noted in testing that ESXi will fail logical paths much more quickly when this value is set to the minimum of 1. During a physical failure of the storage environment (loss of an HBA, switch, cable, port, controller) ESXi, after a certain period of time, will fail any logical path that relies on that failed physical hardware and will discontinue attempting to use it for a given volume. This failure does not always happen immediately. When the I/O Operations Limit is set to the default of 1,000, path failover time can sometimes be in the tens of seconds, which can lead to noticeable disruption in performance during this failure. When this value is set to the minimum of 1, path failover generally decreases to sub-ten seconds. This greatly reduces the impact of a physical failure in the storage environment and provides greater performance resiliency and reliability.
  3. FlashArray Controller I/O Balance. When Purity is upgraded on a FlashArray, the following process is observed (at a high level): upgrade Purity on one controller, reboot it, wait for it to come back up, upgrade Purity on the other controller, reboot it and you’re done. Due to the reboots, twice during the process half of the FlashArray front-end ports go away. Because of this, we want to ensure that all hosts are actively using both controllers prior to upgrade. One method that is used to confirm this is to check the I/O balance from each host across both controllers. When volumes are configured to use Most Recently Used, an imbalance of 100% is usually observed (ESXi tends to select paths that lead to the same front end port for all devices). This then means additional troubleshooting to make sure that host can survive a controller reboot. When Round Robin is enabled with the default I/O Operations Limit, port imbalance is improved to about 20-30% difference. When the I/O Operations Limit is set to 1, this imbalance is less than 1%. This gives Pure Storage and the end user confidence that all hosts are properly using all available front-end ports.

For the three reasons above, Pure Storage highly recommends altering the I/O Operations Limit to 1. For additional information you can read the VMware KB regarding setting the IOPS limit.

BEST PRACTICE: Change the Round Robin I/O Operations Limit from 1,000 to 1 for FlashArray volumes on vSphere. This is a default configuration in all supported vSphere releases.

To fully utilize CPU resources, set the host's active power policy to high performance.

ESXi 6.0 Express Patch 5 or 6.5 Update 1 and later

Starting with ESXi 6.0 Express Patch 5 (build 5572656) and later (release notes) and ESXi 6.5 Update 1 (build 5969303) and later (release notes), Round Robin with an I/O Operations Limit of 1 is the default configuration for all Pure Storage FlashArray devices (iSCSI and Fibre Channel) and no configuration is required.

A new default SATP rule, provided by VMware, was built specifically for the FlashArray to match Pure Storage’s best practices. Inside of ESXi you will see a new system rule:

Name                 Device  Vendor    Model             Driver  Transport  Options                     Rule Group  Claim Options                        Default PSP  PSP Options     Description
-------------------  ------  --------  ----------------  ------  ---------  --------------------------  ----------  -----------------------------------  -----------  --------------  --------------------------------------------------------------------------
VMW_SATP_ALUA                PURE      FlashArray                                                       system                                           VMW_PSP_RR   iops=1

For more information, refer to this blog post:

https://www.codyhosterman.com/2017/0...e-now-default/
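To confirm that a host already carries this rule, the text of a rule listing (for example, the output of `esxcli storage nmp satp rule list`) can be scanned for the FlashArray row. The following is a minimal sketch, assuming the whitespace-separated layout shown above; the function names are hypothetical, and the simple token matching is an assumption that may need adjusting for other listings.

```python
from typing import List


def find_flasharray_rules(rule_listing: str) -> List[List[str]]:
    """Return the whitespace-split rows that name vendor PURE and model FlashArray."""
    rows = []
    for line in rule_listing.splitlines():
        tokens = line.split()
        if "PURE" in tokens and "FlashArray" in tokens:
            rows.append(tokens)
    return rows


def has_round_robin_iops_1(rule_listing: str) -> bool:
    """True if any FlashArray row uses VMW_PSP_RR with the iops=1 option."""
    return any(
        "VMW_PSP_RR" in row and "iops=1" in row
        for row in find_flasharray_rules(rule_listing)
    )


if __name__ == "__main__":
    # Row copied from the listing in this article, with the column padding shortened.
    sample = "VMW_SATP_ALUA  PURE  FlashArray  system  VMW_PSP_RR  iops=1"
    print(has_round_robin_iops_1(sample))
```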

Configuring Round Robin and the I/O Operations Limit

If you are running a release earlier than ESXi 6.0 Express Patch 5 or 6.5 Update 1, there are several ways to configure Round Robin and the I/O Operations Limit. They can be set on a per-device basis, applying the options to each new volume as it is added. This is not a particularly good option: it must be done for every new volume on every host, which makes it easy to forget and leaves considerable room for mistakes.

The recommended option is to create a rule that causes any FlashArray device added to that host in the future to automatically get the Round Robin PSP and an I/O Operations Limit of 1.

The following command creates a rule that achieves both of these for only Pure Storage FlashArray devices:

esxcli storage nmp satp rule add -s "VMW_SATP_ALUA" -V "PURE" -M "FlashArray" -P "VMW_PSP_RR" -O "iops=1" -e "FlashArray SATP Rule"

This must be repeated for each ESXi host.

The rule can also be created through PowerCLI. Once connected to a vCenter Server, the following script iterates through all of the hosts in that vCenter and creates a default rule that sets Round Robin with an I/O Operations Limit of 1 for all Pure Storage FlashArray devices.

# Connect to vCenter; replace <vCenter> with your server name.
Connect-VIServer -Server <vCenter> -Credential (Get-Credential)
# Add the FlashArray SATP rule (Round Robin, iops=1) on every host in this vCenter.
Get-VMHost | Get-EsxCli -V2 | ForEach-Object {$_.storage.nmp.satp.rule.add.Invoke(@{description='Pure Storage FlashArray SATP';model='FlashArray';vendor='PURE';satp='VMW_SATP_ALUA';psp='VMW_PSP_RR';pspoption='iops=1'})}

Furthermore, this can be configured using vSphere Host Profiles:

host-profile.png

It is important to note that existing, previously presented devices must be manually set to Round Robin and an I/O Operations Limit of 1. Alternatively, the ESXi host can be rebooted so that those devices inherit the multipathing configuration set by the new rule.

For setting a new I/O Operations Limit on an existing device, see Appendix I: Per-Device NMP Configuration.
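As a rough illustration of what the manual, per-device step involves, the sketch below generates the two esxcli commands for each FlashArray device ID on a host. It only builds command strings and does not run them. The helper name is hypothetical, and the `naa.624a9370` prefix used to recognize FlashArray devices is an assumption based on the example device ID in this article. Appendix I remains the reference for the supported procedure.

```python
from typing import Iterable, List

# Assumption: FlashArray device IDs start with this prefix, as in the
# example device ID used elsewhere in this article.
PURE_NAA_PREFIX = "naa.624a9370"


def per_device_commands(device_ids: Iterable[str]) -> List[str]:
    """Build esxcli commands that set Round Robin and iops=1 per FlashArray device."""
    commands = []
    for dev in device_ids:
        if not dev.startswith(PURE_NAA_PREFIX):
            continue  # skip devices that do not look like FlashArray volumes
        # Switch the device to the Round Robin path selection policy.
        commands.append(f"esxcli storage nmp device set --device={dev} --psp=VMW_PSP_RR")
        # Switch paths after every single I/O.
        commands.append(
            "esxcli storage nmp psp roundrobin deviceconfig set "
            f"--device={dev} --type=iops --iops=1"
        )
    return commands


if __name__ == "__main__":
    for cmd in per_device_commands(["naa.624a93708a75393becad4e43000540e8"]):
        print(cmd)
```

Because these settings are per host, the generated commands would need to be run on every ESXi host that sees the device.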

Note that an I/O Operations Limit of 1 is the default in ESXi 6.0 Express Patch 5 and later in the 6.0 code branch, 6.5 Update 1 and later in the 6.5 code branch, and all versions of 6.7 and later.

Enhanced Round Robin Load Balancing (Latency Based PSP)

With the release of vSphere 6.7 U1, Round Robin gained a sub-policy option that actively monitors individual path performance. This sub-policy is called "Enhanced Round Robin Load Balancing" (also known as the Latency Based Path Selection Policy (PSP)). Before this policy became available, the ESXi host would use all active paths by sending I/O requests down each path in a "fire and forget" fashion, sending 1 I/O down each path before moving to the next. This often resulted in performance penalties when individual paths became degraded and were not functioning as well as the other available paths, because the ESXi host would continue using the non-optimal path due to its limited insight into overall path health. The Latency Based PSP changes this by monitoring each path for latency, along with outstanding I/Os, allowing the ESXi host to make smarter, more dynamic decisions about which paths to use and which to exclude.

How it Works

Like all other Native Multipathing Plugin (NMP) policies, this sub-policy is set on a per-LUN or per-datastore basis. Once enabled, the NMP begins by assessing the first 16 user I/O requests per path and calculating their average latency. Once all of the paths have been analyzed, the NMP uses the average latency of each path to determine which paths are healthy (optimal) and which are unhealthy (non-optimal). If a path falls outside of the average latency, it is deemed non-optimal and will not be used until its latency has returned to an optimal response time.

After the initial assessment, the ESXi host repeats the same process every 3 minutes. It tests every active path, including any non-optimal paths, to confirm whether the latency has improved, worsened, or remained the same. Those results are again analyzed and used to determine which paths should continue sending I/O requests and which should be paused to see if they report better health in the next 3 minutes. Throughout this process the NMP also takes into account any outstanding I/Os for each path to make more informed decisions.
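The sampling-and-classification idea described above can be sketched as follows. This is an illustrative model only, not VMware's implementation: the `tolerance` factor that decides when a path "falls outside of the average" is a hypothetical parameter, the path names are made up, and outstanding I/Os are ignored.

```python
from statistics import mean
from typing import Dict, List


def classify_paths(samples_ms: Dict[str, List[float]], tolerance: float = 1.5) -> Dict[str, str]:
    """Label each path 'optimal' or 'non-optimal' from its sampled latencies.

    samples_ms maps a path name to the latencies (in ms) of its sampled I/Os,
    for example the first 16 user I/Os on that path.
    """
    # Average latency per path over its sampling window.
    per_path = {path: mean(values) for path, values in samples_ms.items()}
    # Average across all paths; used as the reference point.
    overall = mean(list(per_path.values()))
    return {
        path: "optimal" if avg <= overall * tolerance else "non-optimal"
        for path, avg in per_path.items()
    }


if __name__ == "__main__":
    samples = {
        "path-a": [0.5] * 16,
        "path-b": [0.6] * 16,
        "path-c": [5.0] * 16,  # a degraded path
    }
    print(classify_paths(samples))
```

In this model, re-running the function on fresh samples every evaluation interval would let a previously non-optimal path return to use once its latency recovers.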

Configuring Round Robin and the Latency Based Sub-Policy

If you are using ESXi 7.0 or later, no changes are required to enable this sub-policy, as it is the recommendation moving forward. To make things easier for end users, a new SATP rule has been added that automatically applies this policy to any Pure Storage LUNs presented to the ESXi host:

Name                 Device  Vendor    Model             Driver  Transport  Options                     Rule Group  Claim Options                        Default PSP  PSP Options     Description
VMW_SATP_ALUA                PURE      FlashArray                                                       system                                           VMW_PSP_RR   policy=latency

If your environment is running ESXi 6.7 U1 or later and you wish to use this feature, which Pure Storage supports, the best way is to create a SATP rule on each ESXi host as follows:

esxcli storage nmp satp rule add -s "VMW_SATP_ALUA" -V "PURE" -M "FlashArray" -P "VMW_PSP_RR" -O "policy=latency" -e "FlashArray SATP Rule"

Alternatively, this can be done using PowerCLI:

# Connect to vCenter; replace <vCenter> with your server name.
Connect-VIServer -Server <vCenter> -Credential (Get-Credential)
# Add the latency-based FlashArray SATP rule on every host in this vCenter.
Get-VMHost | Get-EsxCli -V2 | ForEach-Object {$_.storage.nmp.satp.rule.add.Invoke(@{description='Pure Storage FlashArray SATP';model='FlashArray';vendor='PURE';satp='VMW_SATP_ALUA';psp='VMW_PSP_RR';pspoption='policy=latency'})}

Setting a new SATP rule only changes the policy for newly presented LUNs. It is not applied to LUNs that were present before the rule was set until the host is rebooted.

Lastly, to change an individual LUN (or set of LUNs), you can run the following command to change the PSP to latency (the device ID is specific to your environment):

esxcli storage nmp psp roundrobin deviceconfig set --type=latency --device=naa.624a93708a75393becad4e43000540e8

Tuning

By default, the RR latency policy is configured to send 16 user I/O requests down each path and to evaluate each path every three minutes (180,000 ms). Based on extensive testing, Pure Storage recommends leaving these options at their defaults; no changes are required.

BEST PRACTICE: Enhanced Round Robin Load Balancing is configured by default on ESXi 7.0 and later. No configuration changes are required.

Verifying Connectivity

It is important to verify proper connectivity prior to implementing production workloads on a host or volume.

This consists of a few steps:

  1. Verifying proper multipathing settings in ESXi.
  2. Verifying the proper numbers of paths.
  3. Verifying I/O balance and redundancy on the FlashArray.

The Path Selection Policy and number of paths can be verified easily inside of the vSphere Web Client.

verify-psp.png

This will report the path selection policy and the number of logical paths. The number of logical paths will depend on the number of HBAs, zoning and the number of ports cabled on the FlashArray.

The I/O Operations Limit cannot be checked from the vSphere Web Client; it can only be verified or altered with command-line utilities. The following command checks a particular device for the PSP and I/O Operations Limit:

esxcli storage nmp device list -d naa.<device NAA>

Picture5.png

Please remember that each of these settings is a per-host setting, so while a volume might be configured properly on one host, it may not be correct on another.
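When many devices or hosts need checking, the relevant fields can be pulled out of the command output programmatically. The sketch below is a minimal example, assuming the output contains "Path Selection Policy:" and "Path Selection Policy Device Config: {...}" lines. The sample text is illustrative, modelled on typical output rather than captured from a real host, and the function name is hypothetical.

```python
import re
from typing import Dict, Optional

# Illustrative sample only; verify the exact format on your own host.
SAMPLE_OUTPUT = """naa.624a93708a75393becad4e43000540e8
   Device Display Name: PURE Fibre Channel Disk (naa.624a93708a75393becad4e43000540e8)
   Storage Array Type: VMW_SATP_ALUA
   Path Selection Policy: VMW_PSP_RR
   Path Selection Policy Device Config: {policy=rr,iops=1,bytes=10485760,useANO=0; lastPathIndex=0: NumIOsPending=0,numBytesPending=0}
"""


def parse_nmp_device(output: str) -> Dict[str, Optional[str]]:
    """Extract the PSP name plus the 'policy' and 'iops' settings from the output."""
    psp = re.search(r"Path Selection Policy:\s*(\S+)", output)
    config = re.search(r"Path Selection Policy Device Config:\s*\{([^}]*)\}", output)
    settings: Dict[str, str] = {}
    if config:
        # Settings are key=value pairs separated by commas or semicolons.
        for item in re.split(r"[,;]", config.group(1)):
            key, sep, value = item.strip().partition("=")
            if sep:
                settings[key] = value
    return {
        "psp": psp.group(1) if psp else None,
        "policy": settings.get("policy"),
        "iops": settings.get("iops"),
    }


if __name__ == "__main__":
    print(parse_nmp_device(SAMPLE_OUTPUT))
```

A device that follows the best practice in this article would report `VMW_PSP_RR` with either `iops` equal to `1` or a latency-based policy.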

Additionally, it is also possible to check multipathing from the FlashArray.

A CLI command exists to monitor I/O balance coming into the array:

purehost monitor --balance --interval <how long to sample> --repeat <how many iterations>

The command will report a few things:

  1. The host name.
  2. The individual initiators from the host. If an initiator is logged into more than one FlashArray port, it will be reported more than once. If an initiator is not logged in at all, it will not appear.
  3. The port that the initiator is logged into.
  4. The number of I/Os that came into that port from that initiator over the time period sampled.
  5. The relative percentage of I/Os for that initiator as compared to the maximum.

The balance command will count the I/Os that came down from a particular initiator during the sampled time period, and it will do that for all initiator/target relationships for that host. Whichever relationship/path has the most I/Os will be designated as 100%. The rest of the paths will be then denoted as a percentage of that number. So if a host has two paths, and the first path has 1,000 I/Os and the second path has 800, the first path will be 100% and the second will be 80%.

On a well-balanced host, the paths should be within a few percentage points of each other. A difference of more than 15% or so might be worth investigating. Refer to this post for more information.
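The percentage calculation described above can be reproduced in a few lines. This is a sketch of the arithmetic only, not the array's implementation; the function names and path labels are hypothetical, and the 15% threshold is the rough guideline from this article.

```python
from typing import Dict, List


def relative_balance(io_counts: Dict[str, int]) -> Dict[str, float]:
    """Express each path's I/O count as a percentage of the busiest path."""
    busiest = max(io_counts.values(), default=0)
    if busiest == 0:
        return {path: 0.0 for path in io_counts}
    return {path: 100.0 * count / busiest for path, count in io_counts.items()}


def paths_to_investigate(io_counts: Dict[str, int], threshold_percent: float = 15.0) -> List[str]:
    """List paths that trail the busiest path by more than the threshold."""
    balance = relative_balance(io_counts)
    return [path for path, pct in balance.items() if 100.0 - pct > threshold_percent]


if __name__ == "__main__":
    counts = {"path-1": 1000, "path-2": 800}  # the two-path example from the text
    print(relative_balance(counts))
    print(paths_to_investigate(counts))
```

With the two-path example from the text (1,000 and 800 I/Os), the paths come out at 100% and 80%, so the second path exceeds the 15% guideline and would be flagged.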

Please keep in mind that if the Latency Based PSP is in use, I/O may not be balanced 1 to 1 across all paths from the ESXi hosts to the array.

There is nothing inherently wrong with this, as the Latency Based PSP distributes I/O based on which path has the lowest latency. A difference of a few percentage points should not be cause for alarm. However, if some paths receive very little or no I/O, investigate the SAN to find out why those paths are performing poorly.

The GUI will also report on host connectivity in general, based on initiator logins.

2018-01-29_12-02-17.png

This report should be listed as redundant for all hosts, meaning that each host is connected to each controller. If it reports something else, investigate zoning and/or host configuration to correct this.

For a detailed explanation of the various reported states, please refer to the FlashArray User Guide which can be found directly in your GUI:

2018-01-26_16-25-39.png

The I/O Operations Limit

The Round Robin Path Selection Policy allows for additional tuning of its path-switching behavior in the form of a setting called the I/O Operations Limit. The I/O Operations Limit (sometimes called the “IOPS” value) dictates how often ESXi switches logical paths for a given device. By default, when Round Robin is enabled on a device, ESXi will switch to a new logical path every 1,000 I/Os. In other words, ESXi will choose a logical path, and start issuing all I/Os for that device down that path. Once it has issued 1,000 I/Os for that device, down that path, it will switch to a new logical path and so on.

Pure Storage recommends tuning this value down to the minimum of 1. This will cause ESXi to change logical paths after every single I/O, instead of 1,000.

This recommendation is made for a few reasons:

  1. Performance. Often the reason cited to change this value is performance. While this is true in certain cases, the performance impact of changing this value is not usually profound (generally in the single digits of a percentage performance increase). While changing this value from 1,000 to 1 can improve performance, it generally will not solve a major performance problem. Regardless, changing this value can improve performance in some use cases, especially with iSCSI.
  2. Path Failover Time. It has been noted in testing that ESXi will fail logical paths much more quickly when this value is set to a the minimum of 1. During a physical failure of the storage environment (loss of a HBA, switch, cable, port, controller) ESXi, after a certain period of time, will fail any logical path that relies on that failed physical hardware and will discontinue attempting to use it for a given volume. This failure does not always happen immediately. When the I/O Operations Limit is set to the default of 1,000 path failover time can sometimes be in the 10s of seconds which can lead to noticeable disruption in performance during this failure. When this value is set to the minimum of 1, path failover generally decreases to sub-ten seconds. This greatly reduces the impact of a physical failure in the storage environment and provides greater performance resiliency and reliability.
  3. FlashArray Controller I/O Balance. When Purity is upgraded on a FlashArray, the following process is observed (at a high level): upgrade Purity on one controller, reboot it, wait for it to come back up, upgrade Purity on the other controller, reboot it and you’re done. Due to the reboots, twice during the process half of the FlashArray front-end ports go away. Because of this, we want to ensure that all hosts are actively using both controllers prior to upgrade. One method that is used to confirm this is to check the I/O balance from each host across both controllers. When volumes are configured to use Most Recently Used, an imbalance of 100% is usually observed (ESXi tends to select paths that lead to the same front end port for all devices). This then means additional troubleshooting to make sure that host can survive a controller reboot. When Round Robin is enabled with the default I/O Operations Limit, port imbalance is improved to about 20-30% difference. When the I/O Operations Limit is set to 1, this imbalance is less than 1%. This gives Pure Storage and the end user confidence that all hosts are properly using all available front-end ports.

For these three above reasons, Pure Storage highly recommends altering the I/O Operations Limit to 1. For additional information you can read the VMware KB regarding setting the IOPs Limit.

BEST PRACTICE: Change the Round Robin I/O Operations Limit from 1,000 to 1 for FlashArray volumes on vSphere. This is a default configuration in all supported vSphere releases.

To fully utilize CPU resources, set the host's active power policy to high performance.

ESXi Express Patch 5 or 6.5 Update 1 and later

Starting with ESXi 6.0 Express Patch 5 (build 5572656) and later (Release notes) and ESXi 6.5 Update 1 (build 5969303) and later (release notes), Round Robin and an I/O Operations limit is the default configuration for all Pure Storage FlashArray devices (iSCSI and Fibre Channel) and no configuration is required.

A new default SATP rule, provided by VMware by default was specifically built for the FlashArray to Pure Storage’s best practices. Inside of ESXi you will see a new system rule:

Name                 Device  Vendor    Model             Driver  Transport  Options                     Rule Group  Claim Options                        Default PSP  PSP Options     Description
-------------------  ------  --------  ----------------  ------  ---------  --------------------------  ----------  -----------------------------------  -----------  --------------  --------------------------------------------------------------------------
VMW_SATP_ALUA                PURE      FlashArray                                                       system                                           VMW_PSP_RR   iops=1

For information, refer to this blog post:

https://www.codyhosterman.com/2017/0...e-now-default/

Configuring Round Robin and the I/O Operations Limit

If you are running earlier than ESXi 6.0 Express Patch 5 or 6.5 Update 1, there are a variety of ways to configure Round Robin and the I/O Operations Limit. This can be set on a per-device basis and as every new volume is added, these options can be set against that volume. This is not a particularly good option as one must do this for every new volume, which can make it easy to forget, and must do it on every host for every volume. This makes the chance of exposure to mistakes quite large.

The recommended option for configuring Round Robin and the correct I/O Operations Limit is to create a rule that will cause any new FlashArray device that is added in the future to that host to automatically get the Round Robin PSP and an I/O Operation Limit value of 1.

The following command creates a rule that achieves both of these for only Pure Storage FlashArray devices:

esxcli storage nmp satp rule add -s "VMW_SATP_ALUA" -V "PURE" -M "FlashArray" -P "VMW_PSP_RR" -O "iops=1" -e "FlashArray SATP Rule"

This must be repeated for each ESXi host.

This can also be accomplished through PowerCLI. Once connected to a vCenter Server this script will iterate through all of the hosts in that particular vCenter and create a default rule to set Round Robin for all Pure Storage FlashArray devices with an I/O Operation Limit set to 1.

Connect-VIServer -Server <vCenter> -Credential (Get-Credential)
Get-VMhost | Get-EsxCli –V2 | % {$_.storage.nmp.satp.rule.add.Invoke(@{description='Pure Storage FlashArray SATP';model='FlashArray';vendor='PURE';satp='VMW_SATP_ALUA';psp='VMW_PSP_RR'; pspoption='iops=1'})}

Furthermore, this can be configured using vSphere Host Profiles:

host-profile.png

It is important to note that existing, previously presented devices will need to be manually set to Round Robin and an I/O Operation Limit of 1. Optionally, the ESXi host can be rebooted so that it can inherit the multipathing configuration set forth by the new rule.

For setting a new I/O Operation Limit on an existing device, see Appendix I: Per-Device NMP Configuration.

Note that I/O Operations of 1 is the default in 6.0 Patch 5 and later in the 6.0 code branch, 6.5 Update 1 and later in the 6.5 code branch, and all versions of 6.7 and later.

Enhanced Round Robin Load Balancing (Latency Based PSP)

With the release of vSphere 6.7 U1, there is now a sub-policy option for Round Robin that actively monitors individual path performance. This new sub-policy is called "Enhanced Round Robin Load Balancing" (also known as Latency Based Path Selection Policy (PSP)). Before this policy became available the ESXi host would utilize all active paths by sending I/O requests down each path in a "fire and forget" type of fashion, sending 1 I/O down each path before moving to the next. Often times this resulted in performance penalties when individual paths became degraded and weren't functioning as optimally as other available paths. This performance penalty was invoked because the ESXi host would continue using the non-optimal path due to limited insight into the overall path health. This now changes with the Latency Based PSP by monitoring each path for latency, along with outstanding I/Os, allowing the  ESXi host to make smarter decisions on which paths to use and which to exclude in a more dynamic manner.

How it Works

Like all other Native Multipathing Plugin (NMP) policies this sub-policy is set on a per LUN or per datastore basis. Once enabled the NMP begins by assessing the first 16 user I/O requests per path and calculates their average latency. Once all of the paths have been successfully analyzed the NMP will then calculate the average latency of each path and use this information to determine which paths are healthy (optimal) and which are unhealthy (non-optimal). If a path falls outside of the average latency it is deemed non-optimal and will not be used until latency has reached an optimal response time once more.

After the initial assessment, the ESXi host then repeats the same process outlined above every 3 minutes. It will test every active path, including any non-optimal paths, to confirm if the latency has improved, worsened, or remained the same.  Once again those results will be analyzed and used to determine which paths should continue sending I/O requests and which should be paused to see if they report better health in the next 3 minutes. Throughout this process the NMP is also taking into account any outstanding I/Os for each path to make more informed decisions.

Configuring Round Robin and the Latency Based Sub-Policy

If you are using ESXi 7.0 or later then no changes are required to enable this new sub-policy as it is the new recommendation moving forward. In an effort to make things easier for end-users a new SATP rule has been added that will automatically apply this rule to any Pure Storage LUNs presented to the ESXi host:

Name                 Device  Vendor    Model             Driver  Transport  Options                     Rule Group  Claim Options                        Default PSP  PSP Options     Description
VMW_SATP_ALUA                PURE      FlashArray                                                       system                                           VMW_PSP_RR   policy=latency

If your environment is using ESXi 6.7U1 or later and you wish to utilize this feature, which Pure Storage supports, then the best way is to create a SATP rule on each ESXi host, which can be done as follows:

esxcli storage nmp satp rule add -s "VMW_SATP_ALUA" -V "PURE" -M "FlashArray" -P "VMW_PSP_RR" -O "policy=latency" -e "FlashArray SATP Rule"

Alternatively, this can be done using PowerShell:

Connect-VIServer -Server <vCenter> -Credential (Get-Credential)
Get-VMhost | Get-EsxCli –V2 | % {$_.storage.nmp.satp.rule.add.Invoke(@{description='Pure Storage FlashArray SATP';model='FlashArray';vendor='PURE';satp='VMW_SATP_ALUA';psp='VMW_PSP_RR'; pspoption='policy=latency'})}

Setting a new SATP rule will only change the policy for newly presented LUNs, it does not get applied to LUNs that were present before the rule was set until the host is rebooted.

Lastly, if you would like to change an individual LUN  (or set of LUNs) you can run the following command to change the PSP to latency (where device is specific to your env):

esxcli storage nmp psp roundrobin deviceconfig set --type=latency --device=naa.624a93708a75393becad4e43000540e8

Tuning

By default the RR latency policy is configured to send 16 user I/O requests down each path and evaluate each path every three minutes (180000ms). Based on extensive testing, Pure Storage's recommendation is to leave these options configured to their defaults and no changes are required.

BEST PRACTICE: Enhanced Round Robin Load Balancing is configured by default on ESXi 7.0 and later. No configuration changes are required.

How it Works

Like all other Native Multipathing Plugin (NMP) policies this sub-policy is set on a per LUN or per datastore basis. Once enabled the NMP begins by assessing the first 16 user I/O requests per path and calculates their average latency. Once all of the paths have been successfully analyzed the NMP will then calculate the average latency of each path and use this information to determine which paths are healthy (optimal) and which are unhealthy (non-optimal). If a path falls outside of the average latency it is deemed non-optimal and will not be used until latency has reached an optimal response time once more.

After the initial assessment, the ESXi host then repeats the same process outlined above every 3 minutes. It will test every active path, including any non-optimal paths, to confirm if the latency has improved, worsened, or remained the same.  Once again those results will be analyzed and used to determine which paths should continue sending I/O requests and which should be paused to see if they report better health in the next 3 minutes. Throughout this process the NMP is also taking into account any outstanding I/Os for each path to make more informed decisions.

Configuring Round Robin and the Latency Based Sub-Policy

If you are using ESXi 7.0 or later then no changes are required to enable this new sub-policy as it is the new recommendation moving forward. In an effort to make things easier for end-users a new SATP rule has been added that will automatically apply this rule to any Pure Storage LUNs presented to the ESXi host:

Name                 Device  Vendor    Model             Driver  Transport  Options                     Rule Group  Claim Options                        Default PSP  PSP Options     Description
VMW_SATP_ALUA                PURE      FlashArray                                                       system                                           VMW_PSP_RR   policy=latency

If your environment is using ESXi 6.7U1 or later and you wish to utilize this feature, which Pure Storage supports, then the best way is to create a SATP rule on each ESXi host, which can be done as follows:

esxcli storage nmp satp rule add -s "VMW_SATP_ALUA" -V "PURE" -M "FlashArray" -P "VMW_PSP_RR" -O "policy=latency" -e "FlashArray SATP Rule"

Alternatively, this can be done using PowerShell:

Connect-VIServer -Server <vCenter> -Credential (Get-Credential)
Get-VMHost | Get-EsxCli -V2 | % {$_.storage.nmp.satp.rule.add.Invoke(@{description='Pure Storage FlashArray SATP';model='FlashArray';vendor='PURE';satp='VMW_SATP_ALUA';psp='VMW_PSP_RR';pspoption='policy=latency'})}

Setting a new SATP rule only changes the policy for newly presented LUNs. It is not applied to LUNs that were present before the rule was set until the host is rebooted.

Lastly, if you would like to change an individual LUN (or set of LUNs), you can run the following command to change the PSP to latency (where the device is specific to your environment):

esxcli storage nmp psp roundrobin deviceconfig set --type=latency --device=naa.624a93708a75393becad4e43000540e8
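When several LUNs need to be changed, the same command has to be repeated per device. As a minimal sketch, the Python helper below only generates the command lines for review; it does not run them. The helper name and the NAA format check (32 hexadecimal digits, as in the example device above) are assumptions.

```python
import re

# Assumption: device IDs look like the example above ("naa." + 32 hex digits).
NAA_PATTERN = re.compile(r"^naa\.[0-9a-f]{32}$")


def latency_policy_commands(devices):
    """Return one esxcli command line per device, using the command shown above."""
    commands = []
    for device in devices:
        if not NAA_PATTERN.match(device):
            raise ValueError(f"unexpected device identifier: {device!r}")
        commands.append(
            "esxcli storage nmp psp roundrobin deviceconfig set "
            f"--type=latency --device={device}"
        )
    return commands


if __name__ == "__main__":
    for line in latency_policy_commands(["naa.624a93708a75393becad4e43000540e8"]):
        print(line)
```

Printing the commands first, rather than executing them directly, leaves room to check the device list before anything is changed on the host.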

Verifying Connectivity

It is important to verify proper connectivity prior to implementing production workloads on a host or volume.

This consists of a few steps:

  1. Verifying proper multipathing settings in ESXi.
  2. Verifying the proper numbers of paths.
  3. Verifying I/O balance and redundancy on the FlashArray.

The Path Selection Policy and number of paths can be verified easily inside of the vSphere Web Client.

verify-psp.png

This will report the path selection policy and the number of logical paths. The number of logical paths will depend on the number of HBAs, zoning and the number of ports cabled on the FlashArray.

The I/O Operations Limit cannot be checked from the vSphere Web Client—it can only be verified or altered via command line utilities. The following command can check a particular device for the PSP and I/O Operations Limit:

esxcli storage nmp device list -d naa.<device NAA>

Picture5.png

Please remember that each of these settings is a per-host setting, so while a volume might be configured properly on one host, it may not be correct on another.

Additionally, it is also possible to check multipathing from the FlashArray.

A CLI command exists to monitor I/O balance coming into the array:

purehost monitor --balance --interval <how long to sample> --repeat <how many iterations>

The command will report a few things:

  1. The host name.
  2. The individual initiators from the host. If they are logged into more than one FlashArray port, it will be reported more than once. If an initiator is not logged in at all, it will not appear.
  3. The port that the initiator is logged into.
  4. The number of I/Os that came into that port from that initiator over the time period sampled.
  5. The relative percentage of I/Os for that initiator as compared to the maximum.

The balance command will count the I/Os that came down from a particular initiator during the sampled time period, and it will do that for all initiator/target relationships for that host. Whichever relationship/path has the most I/Os will be designated as 100%. The rest of the paths will be then denoted as a percentage of that number. So if a host has two paths, and the first path has 1,000 I/Os and the second path has 800, the first path will be 100% and the second will be 80%.

On a well-balanced host, all paths should be within a few percentage points of each other. A difference of more than 15% or so might be worthy of investigation. Refer to this post for more information.
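The percentage calculation described above is simple enough to reproduce by hand. The Python sketch below mirrors it for illustration only; it is not the array's code, and the function name, path labels, and the 15% default threshold parameter are assumptions based on the guidance above.

```python
def balance_report(io_counts, threshold_pct=15.0):
    """Map each path to (percent of busiest path, needs_investigation).

    io_counts maps a path label to the number of I/Os seen in the sample.
    """
    if not io_counts:
        return {}
    busiest = max(io_counts.values())
    report = {}
    for path, count in io_counts.items():
        pct = 100.0 * count / busiest if busiest else 0.0
        # Flag paths that trail the busiest path by more than the threshold.
        report[path] = (round(pct, 1), (100.0 - pct) > threshold_pct)
    return report


if __name__ == "__main__":
    # The two-path example from the text: 1,000 I/Os and 800 I/Os.
    print(balance_report({"path-1": 1000, "path-2": 800}))
```

For the example in the text, the first path is reported at 100% and the second at 80%, which exceeds the 15% guideline and would be flagged.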

Please keep in mind that if the Latency Based PSP is in use, I/O may not be balanced 1 to 1 across all paths to the array from the ESXi hosts.

There is nothing inherently wrong with I/O not being balanced 1 to 1 across all paths, as the Latency Based PSP distributes I/O based on which path has the lowest latency. A difference of a few percentage points shouldn't be cause for alarm. However, if there are paths with very little to no I/O being sent down them, the SAN should be investigated to find out why those paths are performing poorly.

The GUI will also report on host connectivity in general, based on initiator logins.

2018-01-29_12-02-17.png

Every host should be reported as redundant, meaning that it is connected to each controller. If a host reports something else, investigate zoning and/or host configuration to correct this.
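As a rough illustration of what "connected to each controller" means, the Python sketch below checks a list of array ports that a host is logged into. It is not how the GUI computes its status. The port name format ("CT0.FC0", "CT1.ETH4"), the two controller labels, and the function names are assumptions for this example.

```python
def connected_controllers(ports):
    """Return the set of controller labels found in port names like 'CT0.FC0'."""
    return {port.split(".", 1)[0].upper() for port in ports}


def is_redundant(ports, controllers=("CT0", "CT1")):
    """True when the host has at least one login to every controller."""
    return set(controllers) <= connected_controllers(ports)


if __name__ == "__main__":
    print(is_redundant(["CT0.FC0", "CT1.FC0"]))  # logins on both controllers
    print(is_redundant(["CT0.FC0", "CT0.FC1"]))  # logins on one controller only
```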

For a detailed explanation of the various reported states, please refer to the FlashArray User Guide which can be found directly in your GUI:

2018-01-26_16-25-39.png

Disk.DiskMaxIOSize

The ESXi host setting, Disk.DiskMaxIOSize, controls the largest I/O size that ESXi will allow to be sent from ESXi to an underlying storage device. By default this is 32 MB. If an I/O is larger than the Disk.DiskMaxIOSize value, ESXi will split the I/O requests into segments under the configured limit.
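To make the splitting behavior concrete, here is a minimal Python sketch. It assumes a simple greedy split into full-size segments plus one remainder; the exact way ESXi segments a request is not described here, so treat the function as an illustration rather than the actual algorithm.

```python
DEFAULT_MAX_IO_KB = 32 * 1024  # 32 MB default, expressed in KB (32,768 KB)


def split_io(size_kb, max_io_kb=DEFAULT_MAX_IO_KB):
    """Split one I/O of size_kb into segments no larger than max_io_kb."""
    if size_kb <= 0 or max_io_kb <= 0:
        raise ValueError("sizes must be positive")
    full_segments, remainder = divmod(size_kb, max_io_kb)
    return [max_io_kb] * full_segments + ([remainder] if remainder else [])


if __name__ == "__main__":
    # A 10 MB request with the limit lowered to 4 MB (4,096 KB).
    print(split_io(10 * 1024, max_io_kb=4096))
```

With the limit at 4,096 KB, a 10 MB request would be sent as two 4 MB segments and one 2 MB segment under this model; with the 32 MB default it would be sent as a single I/O.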

This setting needs to be modified if and only if you are running an older release of ESXi (the fixed releases are listed below) and your environment matches the following scenarios:

  1. If a virtual machine is using EFI (Extensible Firmware Interface) instead of BIOS and is using VMware Hardware Version 12 or earlier.
  2. If your environment utilizes vSphere Replication.
  3. If your environment contains VMs which house applications that are sending READ or WRITE requests larger than 4 MB.
  4. The environment is using Fibre Channel with one of the above scenarios (this issue is not present with iSCSI).

VMware has resolved this issue in two places: in ESXi itself (ESXi now reads the maximum supported SCSI I/O size from the array, sends only I/Os of that size or smaller, and splits anything larger) and within VMware virtual hardware.

This is resolved in the following ESXi releases:

  • ESXi 6.0, Patch Release ESXi600-201909001
  • ESXi 6.5, Patch Release ESXi650-201811002
  • ESXi 6.7 Update 1 Release
  • ESXi 7.0 all releases

If you are not running one of these newer releases, it is necessary to reduce the ESXi parameter Disk.DiskMaxIOSize from the default of 32 MB (32,768 KB) down to 4 MB (4,096 KB) or less.

The above scenarios are only applicable if the VMs reside on a Pure Storage FlashArray. If you have VMs in your environment that are not on a Pure Storage FlashArray please consult with your vendor to verify if any changes are required.

If this is not configured for ESXi hosts running EFI-enabled VMs, the virtual machines will fail to boot properly. If it is not changed on hosts running VMs being replicated by vSphere Replication, replication will fail. If it is not changed for VMs whose applications send requests larger than 4 MB, the larger I/O requests will fail, which causes the application to fail as well.

DiskMaxIOSize-2.png

This should be set on every ESXi host in the cluster that VMs may have access to, in order to ensure vMotion is successful from one ESXi host to another. If none of the above circumstances apply to your environment, this value can remain at the default. There is no known performance impact from changing this value.

For more detail on this change, please refer to the VMware KB article here:

https://kb.vmware.com/s/article/2137402

BEST PRACTICE: Upgrade ESXi to a release that adheres to the maximum SCSI I/O size supported by the FlashArray.

VAAI Configuration

The VMware API for Array Integration (VAAI) primitives offer a way to offload and accelerate certain operations in a VMware environment.

Pure Storage requires that all VAAI features be enabled on every ESXi host that is using FlashArray storage. Disabling VAAI features can greatly reduce the efficiency and performance of FlashArray storage in ESXi environments.

All VAAI features are enabled by default (set to 1) in ESXi 5.x and later, so no action is typically required. These settings can, however, be verified via the vSphere Web Client or CLI tools:

  1. WRITE SAME—DataMover.HardwareAcceleratedInit
  2. XCOPY—DataMover.HardwareAcceleratedMove
  3. ATOMIC TEST & SET—VMFS3.HardwareAcceleratedLocking

vaai.png

BEST PRACTICE: Keep VAAI enabled. DataMover.HardwareAcceleratedInit, DataMover.HardwareAcceleratedMove, and VMFS3.HardwareAcceleratedLocking should all remain set to 1.
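When auditing many hosts, the check above can be scripted. The Python sketch below assumes the advanced settings of a host have already been collected into a dictionary by some other means (how they are collected is outside this example); the function name and the dictionary shape are hypothetical.

```python
REQUIRED_VAAI_SETTINGS = (
    "DataMover.HardwareAcceleratedInit",   # WRITE SAME
    "DataMover.HardwareAcceleratedMove",   # XCOPY
    "VMFS3.HardwareAcceleratedLocking",    # ATOMIC TEST & SET
)


def disabled_vaai_settings(host_settings):
    """Return the VAAI settings that are missing or not set to 1."""
    return [
        name
        for name in REQUIRED_VAAI_SETTINGS
        if host_settings.get(name) != 1
    ]


if __name__ == "__main__":
    example_host = {
        "DataMover.HardwareAcceleratedInit": 1,
        "DataMover.HardwareAcceleratedMove": 0,  # disabled on this example host
        "VMFS3.HardwareAcceleratedLocking": 1,
    }
    print(disabled_vaai_settings(example_host))
```

An empty list means all three primitives are enabled on that host; a missing key is reported as well, so that an incomplete collection is not mistaken for a healthy host.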

In order to provide a more efficient heart-beating mechanism for datastores, VMware introduced a new host-wide setting called /VMFS3/UseATSForHBOnVMFS5. In VMware’s own words:

“A change in the VMFS heartbeat update method was introduced in ESXi 5.5 Update 2, to help optimize the VMFS heartbeat process. Whereas the legacy method involves plain SCSI reads and writes with the VMware ESXi kernel handling validation, the new method offloads the validation step to the storage system.”

Pure Storage recommends keeping this value on whenever possible. That being said, it is a host wide setting, and it can possibly affect storage arrays from other vendors negatively.

Read the VMware KB article here:

ESXi host loses connectivity to a VMFS3 and VMFS5 datastore

Pure Storage is NOT susceptible to this issue, but in the case of the presence of an affected array from another vendor, it might be necessary to turn this off. In this case, Pure Storage supports disabling this value and reverting to traditional heart-beating mechanisms.

ats-heartbeat.png

BEST PRACTICE: Keep VMFS3.UseATSForHBOnVMFS5 enabled; this is preferred. If an array from another vendor is present and that vendor prefers it to be disabled, Pure Storage supports disabling it.

For additional information please refer to VMware Storage APIs for Array Integration with the Pure Storage FlashArray document.

iSCSI Configuration

As with any other array that supports iSCSI, Pure Storage recommends the following changes to an iSCSI-based vSphere environment for the best performance.

For a detailed walkthrough of setting up iSCSI on VMware ESXi and on the FlashArray please refer to the following VMware white paper. This is required reading for any VMware/iSCSI user:

https://core.vmware.com/resource/best-practices-running-vmware-vsphere-iscsi

Set Login Timeout to a Larger Value

For example, to set the Login Timeout value to 30 seconds, use steps similar to the following:

  1. Log in to the vSphere Web Client and select the host under Hosts and Clusters.
  2. Navigate to the Manage tab.
  3. Select the Storage option.
  4. Under Storage Adapters, select the iSCSI vmhba to be modified.
  5. Select Advanced and change the Login Timeout parameter. This can be done on the iSCSI adapter itself or on a specific target.

The default Login Timeout value is 5 seconds and the maximum value is 60 seconds.

BEST PRACTICE: Set iSCSI Login Timeout for FlashArray targets to 30 seconds. A higher value is supported but not necessary.

Disable DelayedAck

DelayedAck is an advanced iSCSI option that allows or disallows an iSCSI initiator to delay acknowledgment of received data packets.

Disabling DelayedAck:

  1. Log in to the vSphere Web Client and select the host under Hosts and Clusters.
  2. Navigate to the Configure tab.
  3. Select the Storage option.
  4. Under Storage Adapters, select the iSCSI vmhba to be modified.

Navigate to Advanced Options and modify the DelayedAck setting by using the option that best matches your requirements, as follows:

Option 1: Modify the DelayedAck setting on a particular discovery address (recommended) as follows:

  1. Select Targets.
  2. On a discovery address, select the Dynamic Discovery tab.
  3. Select the iSCSI server.
  4. Click Advanced.
  5. Change DelayedAck to false.

Option 2: Modify the DelayedAck setting on a specific target as follows:

  1. Select Targets.
  2. Select the Static Discovery tab.
  3. Select the iSCSI server and click Advanced.
  4. Change DelayedAck to false.

Option 3: Modify the DelayedAck setting globally for the iSCSI adapter as follows:

  1. Select the Advanced Options tab and click Advanced.
  2. Change DelayedAck to false.

Disabling DelayedAck is highly recommended, but is not absolutely required by Pure Storage. In highly congested networks, performance can drop if packets are lost or simply take too long to be acknowledged. With DelayedAck enabled, not every packet is acknowledged at once (instead, one acknowledgment is sent per so many packets), so far more retransmission can occur, further exacerbating congestion. This can lead to continually decreasing performance until the congestion clears. Since DelayedAck can contribute to this, it is recommended to disable it in order to greatly reduce the effect of congested networks and packet retransmission.

Enabling jumbo frames can make this worse, since retransmitted packets are far larger. If jumbo frames are enabled, disabling DelayedAck is strongly recommended.

See the following VMware KB for more information:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1002598

BEST PRACTICE: Disable DelayedAck for FlashArray iSCSI targets.

iSCSI Port Binding

For software iSCSI initiators, without additional configuration the default behavior for iSCSI pathing is for ESXi to leverage its routing tables to identify a path to its configured iSCSI targets. Without solid understanding of network configuration and routing behaviors, this can lead to unpredictable pathing and/or path unavailability in a hardware failure. To configure predictable and reliable path selection and failover it is necessary to configure iSCSI port binding (iSCSI multipathing).

Configuration and detailed discussion are out of the scope of this document, but it is recommended to read through the following VMware document that describes this and other concepts in-depth:

http://www.vmware.com/files/pdf/techpaper/vmware-multipathing-configuration-software-iSCSI-port-binding.pdf

BEST PRACTICE: Use Port Binding for ESXi software iSCSI adapters when possible.

Note that ESXi 6.5 has expanded support for port binding and features such as iSCSI routing (though the use of iSCSI routing is not usually recommended) and multiple subnets. Refer to ESXi 6.5 release notes for more information.

Jumbo Frames

In some iSCSI environments it is required to enable jumbo frames to adhere to the network configuration between the host and the FlashArray. Enabling jumbo frames is a cross-environment change so careful coordination is required to ensure proper configuration. It is important to work with your networking team and Pure Storage representatives when enabling jumbo frames. Please note that this is not a requirement for iSCSI use on the Pure Storage FlashArray—in general, Pure Storage recommends leaving MTU at the default setting.

That being said, altering the MTU is fully supported and is up to the discretion of the user.

  1. Configure jumbo frames on the FlashArray iSCSI ports.

2018-01-29_10-00-52.png

  2. Configure jumbo frames on the physical network switch/infrastructure for each port using the relevant switch CLI or GUI.
  3. Configure jumbo frames on the ESXi host (VMkernel network adapter and vSwitch):
    1. Browse to a host in the vSphere Web Client navigator.
    2. Click the Configure tab and select Networking > Virtual Switches.
    3. Select the switch from the vSwitch list.
    4. Click the name of the VMkernel network adapter.
    5. Click the pencil icon to edit.
    6. Click NIC settings and set the MTU to your desired value.
    7. Click OK.
    8. Click the pencil icon at the top to edit the vSwitch itself.
    9. Set the MTU to your desired value.
    10. Click OK.

Once jumbo frames are configured, verify end-to-end jumbo frame compatibility. To verify, try to ping an address on the storage network with vmkping.

vmkping -d -s 8972 <ip address of Pure Storage iSCSI port>

If the ping operation does not return successfully, then jumbo frames are not properly configured in ESXi, the networking devices, and/or the FlashArray port.
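The 8972 value in the vmkping command is the largest ICMP payload that fits in a 9000-byte MTU once the 20-byte IPv4 header and the 8-byte ICMP header are subtracted. The Python sketch below shows the arithmetic for other MTU values; it assumes IPv4 without header options, and the function names and the target address are placeholders.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8


def vmkping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented frame of this MTU."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def vmkping_command(mtu, target_ip):
    """Build the vmkping command line used above for a given MTU."""
    return f"vmkping -d -s {vmkping_payload(mtu)} {target_ip}"


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only placeholder address.
    print(vmkping_command(9000, "192.0.2.10"))
```

Because the -d option disallows fragmentation, a payload larger than this value cannot pass, which is what makes the test a useful end-to-end check of the MTU.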

Challenge-Handshake Authentication Protocol (CHAP)

iSCSI CHAP is supported on the FlashArray for unidirectional or bidirectional authentication. Enabling CHAP is optional and up to the discretion of the user. Please refer to the following post for a detailed walkthrough:

http://www.codyhosterman.com/2015/03/configuring-iscsi-chap-in-vmware-with-the-flasharray/

2018-01-29_10-56-13.png

Please note that iSCSI CHAP is not currently supported with dynamic iSCSI targets on the FlashArray. If CHAP is going to be used, you MUST configure your iSCSI FlashArray targets as static targets.

iSCSI Failover Times

A common question encountered at Pure Storage is why extended pauses in I/O are noted during specific operations or tests when utilizing the iSCSI protocol. Often the underlying reason for these pauses is a disconnected network cable, a misbehaving switch port, or a failover of the backend storage array, though this list is certainly not exhaustive.

When the default configuration for iSCSI is in use with VMware ESXi, the delay for these events will generally be 25-35 seconds. While the majority of environments are able to recover from these events unscathed, this is not true for all environments. On a handful of occasions, there have been environments that contain applications that need faster recovery times. Without these faster recovery times, I/O failures have been noted and manual recovery efforts were required to bring the environment back online.

While Pure Storage's official best practice is to utilize the default iSCSI configuration for failover times, we also understand that not all environments are created equal. As such, we do support modifying the necessary iSCSI advanced parameters to decrease failover times for sensitive applications.

Recovery times are controlled by the following 3 iSCSI advanced parameters:

Name                  Current     Default     Min  Max       Settable  Inherit
--------------------  ----------  ----------  ---  --------  --------  -------
NoopOutInterval       15          15          1    60            true    false
NoopOutTimeout        10          10          10   30            true     true
RecoveryTimeout       10          10          1    120           true     true
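Before changing these parameters, it can help to check proposed values against the minimum and maximum columns in the table above. The Python sketch below does that, and also adds the three timers together as a rough upper bound on the pause. The additive model is a simplifying assumption, not a description of the actual iSCSI recovery logic, and the function names are hypothetical.

```python
# (min, max) taken from the table above.
ISCSI_LIMITS = {
    "NoopOutInterval": (1, 60),
    "NoopOutTimeout": (10, 30),
    "RecoveryTimeout": (1, 120),
}
ISCSI_DEFAULTS = {"NoopOutInterval": 15, "NoopOutTimeout": 10, "RecoveryTimeout": 10}


def validate_iscsi_params(params):
    """Return a list of problems; an empty list means all values are in range."""
    errors = []
    for name, value in params.items():
        if name not in ISCSI_LIMITS:
            errors.append(f"{name}: unknown parameter")
            continue
        low, high = ISCSI_LIMITS[name]
        if not low <= value <= high:
            errors.append(f"{name}={value} is outside {low}-{high}")
    return errors


def worst_case_pause_seconds(params):
    """Rough upper bound, assuming the three timers simply add up."""
    merged = {**ISCSI_DEFAULTS, **params}
    return sum(merged[name] for name in ISCSI_LIMITS)


if __name__ == "__main__":
    print(validate_iscsi_params({"NoopOutTimeout": 5}))
    print(worst_case_pause_seconds({}))
```

With the defaults, the sum is 35 seconds, which is consistent with the upper end of the 25-35 second range quoted above.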

To better understand how these parameters are used in iSCSI recovery efforts it is recommended you read the following blog posts for deeper insight:

iSCSI: A 25-second pause in I/O during a single link loss? What gives?

iSCSI Advanced Settings

Once these iSCSI options have been thoroughly reviewed, additional testing within your own environment is strongly recommended to ensure no additional issues are introduced as a result of these changes.

Network Time Protocol (NTP)

No matter how well an environment is configured, there will always come a time when troubleshooting an issue is required. This is inevitable when dealing with large and complex environments. One way to help alleviate some of the stress that comes with troubleshooting is ensuring that the Network Time Protocol (NTP) is enabled on all components in the environment. NTP ensures that the timestamps for servers, arrays, switches, etc. are all aligned and in sync. It is for this reason that Pure Storage recommends as a best practice that NTP be enabled and configured on all components.

Please refer to VMware KB Configuring Network Time Protocol (NTP) on an ESXi host using the vSphere Client for steps on how to configure NTP on your ESXi hosts.

Often the VMware vCenter Server is configured to sync time with the ESXi host it resides on. If you do not use this option, please ensure the vCenter Server has NTP properly configured and enabled as well.

Remote Syslog Server

Another helpful troubleshooting tool is a remote syslog server. There may be times when an investigation is required in the environment, but the logs are found to be no longer available when attempting to review them. Often this is a result of the increased logging that happened during the time of the issue: the thresholds for file size and count are exceeded, and the older logs are automatically deleted as a result.

Pure Storage recommends the use of the VMware vRealize Log Insight OVA. This provides for a quick and easy integration for the ESXi hosts and vCenter. Additionally, the Pure Storage Content Pack can be used with vRealize Log Insight which provides a single logging destination for both the vSphere and Pure Storage environments.

Configuring vCenter Server and ESXi with Log Insight

As explained above, configuring vCenter Server and ESXi is a relatively quick and simple process.

  • Log in to VMware vRealize Log Insight.
  • Click on Administration.
  • Under Integration click on vSphere.
  • Click + Add vCenter Server.
  • Fill in the applicable vCenter Server information and Test Connection.
  • Ensure the following boxes are checked:
    • Collect vCenter Server events, tasks, and alarms
    • Configure ESXi hosts to send logs to Log Insight
  • Click Save to commit all of the requested changes.

loginsight-configuration.png

The screenshot above is applicable for vRealize Log Insight 8.x. If you have an earlier version of Log Insight, you can refer to the VMware documentation here on how to properly configure vCenter and ESXi.

Additional Remote syslog Options

It is understood that not every customer or environment will have vRealize Log Insight installed or available. If your environment takes advantage of a different solution, please refer to the third-party documentation on the best way to integrate it with your vSphere environment. You can also refer to VMware's Knowledge Base article Configuring syslog on ESXi for additional options and configuration information.
