vRA 7.3 and NSX Integration: Network Security Data Collection Failure

We are building out vRA 7.3. We added vCenter and NSX Manager as endpoints in vRA and associated the NSX Manager with vCenter. All of the compute resource data collection works well, but the NSX (network and security) data collection does not:

As a result, in the vRA reservation we can only see the vSphere cluster and vDS port groups/logical switches, but not transport zones, security groups or security tags.

When checking the log, we see the following:

Workflow ‘vSphereVCNSInventory’ failed with the following exception:

One or more errors occurred.

Inner Exception: An error occurred while sending the request.

at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)

at DynamicOps.VCNSModel.Interface.NSXClient.GetDatacenters()

at DynamicOps.VCNSModel.Activities.CollectDatacenters.Execute(CodeActivityContext context)

at System.Activities.CodeActivity.InternalExecute(ActivityInstance instance, ActivityExecutor executor, BookmarkManager bookmarkManager)

at System.Activities.Runtime.ActivityExecutor.ExecuteActivityWorkItem.ExecuteBody(ActivityExecutor executor, BookmarkManager bookmarkManager, Location resultLocation)

Inner Exception:

VCNS Workflow failure

I tried deleting the NSX endpoint and recreating it from vRA, but no luck. I raised the issue in the VMware community but didn't get any really valuable feedback.

After a few hours of investigation, I finally found a fix:

Run the "create a NSX endpoint" workflow in vRO as shown below.

[Screenshot: running the "create a NSX endpoint" workflow in vRO]

Then I restarted the network & security data collection in vRA. Everything works, and I can see all defined NSX transport zones, security groups and DLRs in the vRA network reservations.

I hope this fix helps others who have the same issue.

Perform Packet Capture on VMware ESXi Host for NSX Troubleshooting

VMware offers a great and powerful tool, pktcap-uw, to perform packet captures on an ESXi host.

Pktcap-uw offers a lot of options for packet capture.

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2051814

Here I show the options I use most in my daily work for your reference. I normally perform a packet capture based on the vSwitch port ID or the DV filter (NSX DFW).

To do that, I first need to find the vSwitch port ID and DV filter ID on the ESXi host so that I can reference them in the packet capture. I normally use the "summarize-dvfilter" CLI to find this information.

[root@esx4005:/tmp]
summarize-dvfilter | grep -C 10 1314
slowPathID: none
 filter source: Dynamic Filter Creation
 vNic slot 1
 name: nic-18417802-eth0-dvfilter-generic-vmware-swsec.1
 agentName: dvfilter-generic-vmware-swsec
 state: IOChain Attached
 vmState: Detached
 failurePolicy: failClosed
 slowPathID: none
 filter source: Alternate Opaque Channel
 world 18444553 vmm0:auslslnxsd1314-113585a5-f6ed-4eb3-abd2-12083901e942 vcUuid:'11 35 85 a5 f6 ed 4e b3-ab d2 12 08 39 01 e9 42'
port 33554558 (vSwitch PortID) auslslnxsd1314-113585a5-f6ed-4eb3-abd2-12083901e942.eth0
 vNic slot 2
 name: nic-18444553-eth0-vmware-sfw.2 (DV Filter ID)
 agentName: vmware-sfw
 state: IOChain Attached
 vmState: Detached
 failurePolicy: failClosed
 slowPathID: none
 filter source: Dynamic Filter Creation
 vNic slot 1
 name: nic-18444553-eth0-dvfilter-generic-vmware-swsec.1

After I have the vSwitch port ID and DV filter ID, I can start my packet capture.

  • Packet capture to a VM based on vSwitch PortID

pktcap-uw --switchport 33554558 --dir 0 -o /tmp/from1314.pcap

  • Packet capture from a VM based on vSwitch PortID

pktcap-uw --switchport 33554558 --dir 1 -o /tmp/to1314.pcap

  • Packet capture from a VM based on DV filter

pktcap-uw --capture PreDVFilter --dvfilter nic-18444553-eth0-vmware-sfw.2 -o /tmp/1314v3.pcap

Below is a brief explanation of the parameters used above.

-o (output): save the capture to a packet capture file;

--dir (direction): 0 for traffic to the VM and 1 for traffic from the VM;

--capture PreDVFilter: perform the packet capture before DFW rules are applied;

--capture PostDVFilter: perform the packet capture after DFW rules are applied.
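For example, to check whether the DFW is dropping a specific flow, you can capture the same vNIC both before and after the filter (reusing the DV filter name found above) and compare the two files:

pktcap-uw --capture PreDVFilter --dvfilter nic-18444553-eth0-vmware-sfw.2 -o /tmp/1314_pre.pcap
pktcap-uw --capture PostDVFilter --dvfilter nic-18444553-eth0-vmware-sfw.2 -o /tmp/1314_post.pcap

If a packet shows up in the pre-capture but not in the post-capture, a DFW rule on that vNIC is dropping it.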

In addition, you can add filters to your capture:

pktcap-uw --switchport 33554558 --tcpport 9000 --dir 1 -o /tmp/from1314.pcap

I list all available filter options here for your reference:

--srcmac
The Ethernet source MAC address.
--dstmac
The Ethernet destination MAC address.
--mac
The Ethernet MAC address (src or dst).
--ethtype
The Ethernet type. HEX format.
--vlan
The Ethernet VLAN ID.
--srcip
The source IP address.
--dstip
The destination IP address.
--ip
The IP address (src or dst).
--proto 0x
The IP protocol.
--srcport
The TCP source port.
--dstport
The TCP destination port.
--tcpport
The TCP port (src or dst).
--vxlan
The VXLAN ID of the flow.
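These filters can be combined. For example, a capture of HTTPS traffic leaving the VM from one source IP could look like the below (the IP address is just an illustrative value):

pktcap-uw --switchport 33554558 --srcip 10.10.80.24 --tcpport 443 --dir 1 -o /tmp/filtered.pcap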

Update:

Start two captures at the same time:

pktcap-uw --switchport 50331665 -o /tmp/50331665.pcap & pktcap-uw --uplink vmnic2 -o /tmp/vmnic2.pcap &

Stop all packet captures:

kill $(lsof | grep pktcap-uw | awk '{print $1}' | sort -u)
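To confirm that no capture processes are left running, you can re-run the same lsof check:

lsof | grep pktcap-uw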

Of course, you can also perform basic packet captures from NSX Manager via the Central CLI. If you are interested, please refer to my other blog post on the topic.

NSX IPSec Throughput in IBM Softlayer

To understand the real throughput capacity of NSX IPSec in Softlayer, I built a quick IPSec performance testing environment.

Below is the network topology of my testing environment:

[Diagram: NSX IPSec performance testing topology]

NSX version: 6.2.4
NSX Edge: X-Large (6 vCPUs and 8 GB memory), which is the largest size NSX offers. All of the Edges in this testing environment reside in the same vSphere cluster, which includes 3 ESXi hosts. Each ESXi host has 64 GB DDR4 memory and 2 processors (2.4 GHz Intel Xeon Haswell E5-2620 v3, hex-core).
IPerf client: Red Hat 7.1 (2 vCPUs and 4 GB memory)
IPerf server: Red Hat 7.1 (2 vCPUs and 4 GB memory)
IPerf version: IPerf3

Two IPsec tunnels are built as per the above diagram. The IPsec settings are:

  • Encryption: AES-GCM
  • Diffie-Hellman Group: DH5
  • PFS (Perfect Forward Secrecy): Enabled
  • AES-NI: Enabled
I include 3 test cases in my testing; the bandwidth utilisation chart for each test is shown below.

[Chart: Test 1 bandwidth utilisation]

  • Test Case 2: 2 IPerf clients (172.16.31.0/24) to 2 IPerf servers (172.16.38.0/24) via 1 IPsec tunnel. Result: around 1.6-2.3 Gbit/s in total

[Chart: Test 2 bandwidth utilisation]

[Chart: Test 3 bandwidth utilisation]
Please note:
  1. The firewall function on the NSX Edge is disabled in all test cases.
  2. TCP traffic is used in all 3 test cases. 10 parallel streams are used on each IPerf client to push the performance test to the max (see the sample commands after this list).
  3. I didn't see any CPU or memory contention in any of the test cases: the CPU utilisation of the NSX Edge was less than 40% and memory utilisation was nearly zero.
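For reference, the IPerf3 commands were along these lines. The server address and test duration below are illustrative values only, not the exact ones I used:

iperf3 -s                              # on each IPerf server
iperf3 -c 172.16.38.10 -P 10 -t 60     # on each IPerf client: 10 parallel TCP streams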

[Chart: NSX Edge CPU and memory utilisation during the tests]

Simple Python Script Creating a Dynamic Membership Security Group

In this blog, I develop a very simple Python script to create an NSX security group whose membership is based on a security tag. Please note this script only shows the basics and is not ready for a production environment.

Two Python functions are included in this script:

  1. create_tag is used to create an NSX security tag;
  2. create_sg is used to create a security group and define a criterion that adds all virtual machines tagged with the specified security tag into the newly created security group.
import requests
from base64 import b64encode
import getpass
username=raw_input('Enter Your NSXManager Username: ')
yourpass = getpass.getpass('Enter Your NSXManager Password: ')
sg_name=raw_input('Enter Security Group Name: ')
vm_tag=raw_input('Enter Tag Name: ')
userandpass=username+":"+yourpass
userpass = b64encode(userandpass).decode("ascii")
auth ="Basic " + userpass
payload_tag="<securityTag>\r\n<objectTypeName>SecurityTag</objectTypeName>\r\n<type>\r\n<typeName>SecurityTag</typeName>\r\n</type>\r\n<name>"+vm_tag+"</name>\r\n<isUniversal>false</isUniversal>\r\n<description>This tag is created by API</description>\r\n<extendedAttributes></extendedAttributes>\r\n</securityTag>"
payload_sg= "<securitygroup>\r\n <objectId></objectId>\r\n <objectTypeName>SecurityGroup</objectTypeName>\r\n <type>\r\n <typeName>SecurityGroup</typeName>\r\n </type>\r\n <description></description>\r\n <name>"+sg_name+"</name>\r\n <revision>0</revision>\r\n<dynamicMemberDefinition>\r\n <dynamicSet>\r\n <operator>OR</operator>\r\n <dynamicCriteria>\r\n <operator>OR</operator>\r\n <key>VM.SECURITY_TAG</key>\r\n <criteria>contains</criteria>\r\n <value>"+vm_tag+"</value>\r\n </dynamicCriteria>\r\n </dynamicSet>\r\n</dynamicMemberDefinition>\r\n</securitygroup>"

def create_tag():
        try:
                response = requests.post(
                url="https://NSX-Manager-IP/api/2.0/services/securitytags/tag",
                verify=False,
                headers={
                        "Authorization": auth,
                        "Content-Type": "application/xml",
                    },
                data=payload_tag
                    )
                print('Response HTTP Status Code: {status_code}'.format(status_code=response.status_code))
                #print('Response HTTP Response Body: {content}'.format(content=response.content))
                if response.status_code == 403:
                        print "***********************************************************************"
                        print "WARNING: your username or password is wrong, please retry again!"
                        print "***********************************************************************"
                if  response.status_code == 201:
                        print "***********************************************************************"
                        print('Response HTTP Response Body: {content}'.format(content=response.content))
                api_response=response.text
                print api_response
        except requests.exceptions.RequestException:
                print('HTTP Request failed')

def create_sg():
        try:
                response = requests.post(
                url="https://NSX-Manager-IP/api/2.0/services/securitygroup/bulk/globalroot-0",
                verify=False,
                headers={
                        "Authorization": auth,
                        "Content-Type": "application/xml",
                    },
                data=payload_sg
                    )
                print('Response HTTP Status Code: {status_code}'.format(status_code=response.status_code))
                #print('Response HTTP Response Body: {content}'.format(content=response.content))
                if response.status_code == 403:
                        print "***********************************************************************"
                        print "WARNING: your username or password is wrong, please retry again!"
                        print "***********************************************************************"
                if  response.status_code == 201:
                        print "***********************************************************************"
                        print('Response HTTP Response Body: {content}'.format(content=response.content))
                api_response=response.text
                print api_response
        except requests.exceptions.RequestException:
                print('HTTP Request failed')

# Create the security tag first, then the security group that references it
create_tag()
create_sg()

Running this script in our O-Dev environment:

[root]$ python create_sg_dynamic_member_20170429.py

Enter Your NSXManager Username: admin

Enter Your NSXManager Password:

Enter Security Group Name: sg_app1_web

Enter Tag Name: tag_app1_web

Response HTTP Status Code: 201

***********************************************************************

Response HTTP Response Body: securitytag-14

securitytag-14

Response HTTP Status Code: 201

***********************************************************************

Response HTTP Response Body: securitygroup-485

securitygroup-485

In NSX Manager, we can see that a security group sg_app1_web has been created, as shown below:

[Screenshot: security group sg_app1_web in NSX Manager]

And its dynamic membership criterion is:

[Screenshot: dynamic membership criterion of sg_app1_web]
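To double-check the effective membership from the API side, you can query the virtual machines translated from the security group. A minimal sketch, assuming the securitygroup-485 ID returned above and the same NSX-Manager-IP placeholder (verify the exact path against the NSX API guide for your version):

curl -k -u admin https://NSX-Manager-IP/api/2.0/services/securitygroup/securitygroup-485/translation/virtualmachines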

NSX-v DLR OSPF Adjacencies Configuration Maximums

In one of the NSX docs, the following configuration maximum is suggested for DLR OSPF adjacencies:

OSPF adjacencies per DLR: 10

This maximum applies to NSX 6.1, 6.2 and 6.3.

OSPF optimizes the LSA flooding process on multi-access networks by using a DR (designated router) and a BDR (backup DR). Routers that are neither DR nor BDR are called DROther routers. DROther routers only form full adjacencies with the DR and BDR. Between themselves, DROther routers stay in the 2-Way state and form an OSPF neighborship but not a full adjacency.

To clarify whether "adjacencies" in the VMware NSX configuration maximums doc means "full adjacency" or "neighbor/2-Way state", I raised an SR with VMware GSS. The response from VMware GSS is:

  • their "adjacencies" means "neighborship", not "full adjacency";
  • the 2-Way state is also included in the configuration limit of 10 OSPF adjacencies per DLR.
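In other words, every OSPF neighbour counts toward the limit of 10, whether or not it reaches the Full state. To see how many neighbours a DLR currently has, you can log in to the DLR control VM console and run the standard command:

show ip ospf neighbor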

Automate OpenStack Security Group with Terraform

Heat is the main project in the OpenStack Orchestration program. We can use Heat to automate security group implementation. If you have the NSXv plugin integrated with your OpenStack environment, you can use a Heat template to automate your NSX DFW rule implementation as well. Here I will show you how to use Terraform to do the same magic: automate security group deployment.

Below is my Terraform template, which creates a security group and 5 rules within the newly created security group.

provider "openstack" {
  user_name   = "${var.openstack_user_name}"
  password    = "${var.openstack_password}"
  tenant_name = "tenant1"
  auth_url    = "http://keystone.ops.com.au:5000/v3"
  domain_name = "domain1"
}

resource "openstack_networking_secgroup_v2" "secgroup_2" {
  name        = "secgroup_2"
  description = "Terraform security group"
  tenant_id   = "2b8d09cb778346a4ae70c16ee65a5c69"
}

resource "openstack_networking_secgroup_rule_v2" "secgroup_rule_1" {
  direction         = "egress"
  ethertype         = "IPv4"
  protocol          = "tcp"
  port_range_min    = 22
  port_range_max    = 22
  remote_ip_prefix  = "10.41.129.12/32"
  security_group_id = "${openstack_networking_secgroup_v2.secgroup_2.id}"
  tenant_id         = "2b8d09cb778346a4ae70c16ee65a5c69"
  depends_on        = ["openstack_networking_secgroup_v2.secgroup_2"]
}

resource "openstack_networking_secgroup_rule_v2" "secgroup_rule_2" {
  direction         = "ingress"
  ethertype         = "IPv4"
  protocol          = "tcp"
  port_range_min    = 443
  port_range_max    = 443
  remote_ip_prefix  = "10.41.129.12/32"
  security_group_id = "${openstack_networking_secgroup_v2.secgroup_2.id}"
  tenant_id         = "2b8d09cb778346a4ae70c16ee65a5c69"
  depends_on = [
    "openstack_networking_secgroup_v2.secgroup_2",
    "openstack_networking_secgroup_rule_v2.secgroup_rule_1"
  ]
}

resource "openstack_networking_secgroup_rule_v2" "secgroup_rule_3" {
  direction         = "ingress"
  ethertype         = "IPv4"
  protocol          = "tcp"
  port_range_min    = 443
  port_range_max    = 443
  remote_ip_prefix  = "10.41.129.11/32"
  security_group_id = "${openstack_networking_secgroup_v2.secgroup_2.id}"
  tenant_id         = "2b8d09cb778346a4ae70c16ee65a5c69"
  depends_on = [
    "openstack_networking_secgroup_v2.secgroup_2",
    "openstack_networking_secgroup_rule_v2.secgroup_rule_2"
  ]
}

resource "openstack_networking_secgroup_rule_v2" "secgroup_rule_4" {
  direction         = "ingress"
  ethertype         = "IPv4"
  protocol          = "tcp"
  port_range_min    = 8080
  port_range_max    = 8080
  remote_ip_prefix  = "10.41.129.11/32"
  security_group_id = "${openstack_networking_secgroup_v2.secgroup_2.id}"
  tenant_id         = "2b8d09cb778346a4ae70c16ee65a5c69"
  depends_on = [
    "openstack_networking_secgroup_v2.secgroup_2",
    "openstack_networking_secgroup_rule_v2.secgroup_rule_3"
  ]
}

resource "openstack_networking_secgroup_rule_v2" "secgroup_rule_5" {
  direction         = "ingress"
  ethertype         = "IPv4"
  protocol          = "tcp"
  port_range_min    = 22
  port_range_max    = 22
  remote_ip_prefix  = "10.41.129.11/32"
  security_group_id = "${openstack_networking_secgroup_v2.secgroup_2.id}"
  tenant_id         = "2b8d09cb778346a4ae70c16ee65a5c69"
  depends_on = [
    "openstack_networking_secgroup_v2.secgroup_2",
    "openstack_networking_secgroup_rule_v2.secgroup_rule_4"
  ]
}

Please make sure that you add the resource dependency for each firewall rule via "depends_on".

If not, you will see errors like the one below when you run "terraform apply", and only one rule will be added per "terraform apply" run.

2017/03/06 19:47:46 [TRACE] Preserving existing state lineage “607d13a8-c268-498a-bbb4-07f98f0dd6b4”
Error applying plan:

1 error(s) occurred:

2017/03/06 19:47:46 [DEBUG] plugin: waiting for all plugin processes to complete…
* openstack_networking_secgroup_rule_v2.secgroup2_rule_2: Internal Server Error

Terraform does not automatically rollback in the face of errors.

The above issue is a known issue (Issue #7519) with Terraform (see https://github.com/hashicorp/terraform/issues/7519).

Unfortunately, the issue still exists in version 0.8.7. The current solution is to specify explicit dependencies when creating the firewall rules.
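A workaround that is sometimes suggested for this kind of concurrency problem is to limit Terraform to a single concurrent operation. I have not tested it in this environment, so treat it as something to experiment with rather than a confirmed fix:

terraform apply -parallelism=1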

Limitation of NSX Central CLI Packet Capture

Packet capture is a very useful and powerful troubleshooting tool, as the packets always tell you the truth. From NSX 6.2.3, you can use the Central CLI to perform a packet capture for an individual VM.

My friend Tony has published a very good blog on how to use this tool. You can find it at the link below:

https://tonysangha.com/2016/11/15/nsxv-central-cli-packet-capture/

However, you have to understand some limitations of this tool when you use it. I have found 2 limitations so far:

  1. You can only use this tool to capture a maximum of 20,000 packets;
  2. The packet capture size is limited to 20 MB.

So if you are going to capture a big number of packets, you still have to use pktcap-uw on the ESXi host directly.

I have just found something new which can help you capture only the traffic you are interested in:

debug packet capture host host-3287 vnic 50068de0-9f44-0601-4f69-71d2391345ec.000 dir input parameters --ip 10.10.80.24

# show packet capture help host host-3287

Help information for capture options from host
        --srcmac
                The Ethernet source MAC address.
        --dstmac
                The Ethernet destination MAC address.
        --mac
                The Ethernet MAC address(src or dst).
        --ethtype 0x
                The Ethernet type. HEX format.
        --vlan
                The Ethernet VLAN ID.
        --srcip <x.x.x.x[/]>
                The source IP address.
        --dstip <x.x.x.x[/]>
                The destination IP address.
        --ip
                The IP address(src or dst).
        --proto 0x
                The IP protocol.
        --srcport
                The TCP source port.
        --dstport
                The TCP destination port.
        --tcpport
                The TCP port(src or dst).
        --vxlan
                The vxlan id of flow.

nsxmanager#debug packet capture host host-203
vnic capture vnic
vmknic capture vmknic
vmnic capture vmnic (pnic)
vdrport capture vdrport

A quick summary of the packet capture steps:

Step 0: Get the host-id and vnic-id by running the CLI command "show vm vm-id" on NSX Manager.

Step 1: Start the packet capture (here we capture inbound traffic for a VM):

debug packet capture host host-id vnic vnic-id dir input parameters --ip 10.10.80.24

You will get a packet capture session-id from the above CLI, which will be used in the following steps.

Step 2: Stop the capture:

no debug packet capture session session-id

Step 3: Copy your packet capture to your SFTP server and use your packet analysis tool (e.g. Wireshark) to analyse it:

debug packet capture scp session session-id url sftpuser@sftp-server-IP:file1.pcap

Step 4: Clear the capture session and delete the packet capture file from NSX Manager:

no debug packet capture session session-id discard

You can list the current capture sessions at any time with:

show packet capture sessions

Be Careful of NSX DFW AutoSave feature

By default, NSX will save your DFW configuration whenever you make a change. Up to 90 configurations can be saved. This feature is called auto-save or auto-draft, and it is designed to help you restore your DFW configuration.

However, this feature can be A BOMB as well when you have a "big" number of firewall rules (like over 20k rules; I know that is not big at all for an enterprise client).

Recently we had a P1 issue because of it.

We saw two symptoms:

  1. The NSX Manager daily backup failed. In the NSX syslog we see the below:

2016-12-22 18:04:03.566 AEDT ERROR taskScheduler-1 VsmServiceBackupRestoreExecutor:254 - Run backup script - Failure due to NSX Manager database data dump operation failed. vsm-appliance-mgmt:150500:Exception occurred while taking backup.:NSX Manager database data dump operation failed.
2016-12-23 18:04:11.835 AEDT ERROR taskScheduler-1 VsmServiceBackupRestoreExecutor:254 - Run backup script - Failure due to NSX Manager database data dump operation failed. vsm-appliance-mgmt:150500:Exception occurred while taking backup.:NSX Manager database data dump operation failed.

  2. We couldn't make any changes to the NSX DFW, including the exclusion list, using the NSX GUI or API calls, although we were still able to view the current DFW configuration in the GUI and perform GET API calls.

We worked with the VMware support team to fix the issue. Finally, we identified that the issue was due to an "over-sized" table in NSX Manager (in our case around 13 GB, while we still had nearly 9 GB of free space in the DB partition). The naughty table is firewall_draft_compact_rule.

We had to disable the DFW auto-save feature and then delete the saved configurations to restore our service.

It is reasonable that the NSX backup fails when you don't have enough disk space available in NSX Manager. However, we are still waiting for a formal explanation of why we couldn't change the DFW configuration. My current guess is that the firewall_draft_compact_rule table is put into "read-only" mode when its size exceeds some kind of threshold. Once we get the feedback, I will update this post accordingly.

Note: the auto-save/auto-draft feature can only be disabled on NSX 6.2.3 and onwards.
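For reference, auto-drafts are disabled through the DFW global configuration API. The sketch below is from memory, so double-check the exact element names against the NSX API guide for your version, and replace NSX-Manager-IP and the credentials with your own:

# Assumed endpoint and payload - verify against the NSX API guide before use
curl -k -u admin -X PUT -H "Content-Type: application/xml" \
  -d '<globalConfiguration><autoDraftDisabled>true</autoDraftDisabled></globalConfiguration>' \
  https://NSX-Manager-IP/api/4.0/firewall/config/globalconfiguration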

Information from VMware Support:

Backups were failing due to one of the DB tables consuming a large amount of space (firewall_draft_compact_rule). This can happen because concurrent firewall config operations sometimes leave drafts in an inconsistent state. Once a draft lands in an inconsistent state, the cleanup operation does not work as expected. In addition, the scheduled compaction task keeps piling up new compacted configurations in the firewall_draft_compact_config, firewall_draft_compact_section, firewall_draft_compact_rule and firewall_draft_config_change tables, eventually filling up the disk.

NSX Edge Packet Capture on Multiple vNICs Simultaneously

In NSX 6.1.4, I tried to perform packet captures to analyse end-to-end connectivity restoration during an Edge HA failover, but I could only capture packets on a single vNIC at a time. Somebody may say this can be worked around by performing another packet capture on the other vNIC on the ESXi host using "pktcap-uw". However, "pktcap-uw" can only capture uni-directional traffic on the ESXi host, which brings extra challenges for packet analysis.

Luckily, in the newer NSX 6.2.4 it looks like we can capture on different vNICs at the same time by running "debug packet capture interface vNIC" multiple times, like the below:

debug packet capture interface vNIC_2
debug packet capture interface vNIC_3

[Screenshot: NSX Edge packet capture on vNic_2 and vNic_3]

You can see that I successfully captured packets on vNic_2 and vNic_3. Then you can upload the packet captures to your SFTP server for further analysis with the following CLI:

debug copy scp user@url:path file-name/all

[Screenshot: uploading the packet captures via the debug copy scp CLI]

When you perform the packet capture, you can use a filter to only capture the traffic you are interested in:

debug packet display interface vNic_0 host_192.168.11.3_and_host_192.168.11.41
debug packet capture interface vNic_0 host_192.168.1.2_and_host_192.168.2.2_and_port_80

vSphere DRS Anti-Affinity Rules Block ESXi Host NSX Upgrade

In our OpenDev NSX environment, we have 3 vSphere clusters: each cluster has 3 ESXi hosts.

During the NSX upgrade from 6.2.3 to 6.2.4, we saw the following message when we tried to upgrade the ESXi cluster that the NSX Controllers are running on: "DRS recommends hosts to evacuate".

[Screenshot: "DRS recommends hosts to evacuate" message during the NSX host upgrade]

After a quick investigation, I realised that the issue is due to the default DRS anti-affinity rule for the NSX Controllers. This anti-affinity rule prevents any host in the management cluster from going into maintenance mode, as we have only 3 ESXi hosts in this cluster.

So I just temporarily disabled the NSX Controller anti-affinity rule so that EAM (ESX Agent Manager) could continue to update the ESXi host VIBs to the new version.