Overview

After exploring XDP for some time for L2TPv3 and VXLAN, I wanted to dig deeper into hardware offload capabilities, specifically for VXLAN, but without using Open vSwitch (OVS). Mellanox has some great blog posts to get me started:

https://community.mellanox.com/s/article/Configuring-VXLAN-Encap-Decap-Offload-Using-tc

Plus some help from their official documentation:
https://community.mellanox.com/s/article/Configuring-VXLAN-Encap-Decap-Offload-Using-tc

What follows is a detailed description of how I recreated their VXLAN offload example setup: two Linux hosts with Mellanox ConnectX-5 EN adapters, connected via a Direct Attach Cable (DAC). Host1 uses VXLAN hardware offload, while Host2 uses kernel-based VXLAN.

Lab Setup

I’m using two Linux servers running Ubuntu 20.04, both equipped with a dual port 25 Gbps Mellanox ConnectX-5 PCI Express card. A DAC cable interconnects port 1 (port 0 remains unused) on both servers.

I installed Mellanox OFED Linux drivers from https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed and
their firmware tools from https://www.mellanox.com/products/adapter-software/firmware-tools to upgrade the firmware.

After the upgrade, the following firmware is active:

~$ sudo mlxfwmanager 
Querying Mellanox devices firmware …
Device #1:
Device Type: ConnectX5
Part Number: MCX512A-ACA_Ax_Bx
Description: ConnectX-5 EN network interface card; 10/25GbE dual-port SFP28; PCIe3.0 x8; tall bracket; ROHS R6
PSID: MT_0000000080
PCI Device Name: /dev/mst/mt4119_pciconf0
Base GUID: 043f720300ad0512
Base MAC: 043f72ad0512
Versions: Current Available
FW 16.29.2002 N/A
PXE 3.6.0204 N/A
UEFI 14.22.0016 N/A

The connected 25 Gbps interfaces are configured with IPv4 (and v6) addresses in the same network and connectivity is verified via ping:

Host1$ ifconfig enp6s0f1
enp6s0f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.203.51 netmask 255.255.255.0 broadcast 192.168.203.255
inet6 2a02:168:5f67:203::51 prefixlen 64 scopeid 0x0<global>
inet6 fe80::63f:72ff:fead:513 prefixlen 64 scopeid 0x20<link>
ether 04:3f:72:ad:05:13 txqueuelen 1000 (Ethernet)
Host2$ ifconfig enp6s0f1
enp6s0f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.203.56 netmask 255.255.255.0 broadcast 192.168.203.255
inet6 fe80::63f:72ff:fead:1d7f prefixlen 64 scopeid 0x20<link>
inet6 2a02:168:5f67:203::56 prefixlen 64 scopeid 0x0<global>
ether 04:3f:72:ad:1d:7f txqueuelen 1000 (Ethernet)

Test connectivity and show the learned ARP entry. I know, nothing magic to see here yet, but we’ll get to that shortly.

Host1$ sudo ping -f -c 100 192.168.203.56 
PING 192.168.203.56 (192.168.203.56) 56(84) bytes of data.

--- 192.168.203.56 ping statistics ---
100 packets transmitted, 100 received, 0% packet loss, time 5ms
rtt min/avg/max/mdev = 0.027/0.033/0.248/0.022 ms, ipg/ewma 0.052/0.029 ms

Host1$ arp -na 192.168.203.56
? (192.168.203.56) at 04:3f:72:ad:1d:7f [ether] on enp6s0f1

Host2 without offload

A single VXLAN virtual interface is configured on Host2 using VNI 16 and pointing to Host1 (192.168.203.51). It gets the overlay IP address 1.1.1.3:

Host2# ip link add name vxlan16 type vxlan id 16 dev enp6s0f1 remote 192.168.203.51 dstport 4789

Host2# ifconfig vxlan16 1.1.1.3/24 up

Host2# ifconfig vxlan16
vxlan16: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 1.1.1.3 netmask 255.255.255.0 broadcast 1.1.1.255
inet6 fe80::78fb:e8ff:fe3c:637b prefixlen 64 scopeid 0x20<link>
ether 7a:fb:e8:3c:63:7b txqueuelen 1000 (Ethernet)

Initially I had some trouble bringing the interface up, because the port 4789 was in use by another interface or service. If you face the same issue, you can simply use another UDP port on both hosts. My “culprit” was a running Juniper cRPD instance in host mode.
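To see whether something already occupies the VXLAN UDP port before creating the interface, the list of UDP sockets can be inspected. The following is a minimal sketch, assuming `ss` from iproute2 is available; the helper name `udp_port_in_use` is made up for illustration and it only parses text on stdin, so it can be tried without touching the network stack:

```shell
# Hypothetical helper (not part of any tool): reads `ss -H -uln` style lines
# on stdin and succeeds if any field ends in ":<port>", i.e. a local UDP
# socket is bound to that port.
udp_port_in_use() (
    port="$1"
    awk -v p=":$port" '
        { for (i = 1; i <= NF; i++)
              if (substr($i, length($i) - length(p) + 1) == p) found = 1 }
        END { exit found ? 0 : 1 }'
)

# On a real host (assumption: iproute2 is installed):
#   ss -H -uln | udp_port_in_use 4789 && echo "UDP 4789 is already taken"
```

If the port turns out to be taken, `ss -ulpn` run as root may also show the owning process, although sockets opened by the kernel itself have no process attached.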

The Ethernet address of the vxlan16 interface will be required further down when configuring the static offload entries. This address can change across restarts, in which case the filters need to be adjusted accordingly. Typically, MAC learning for offload would be done via BGP type-2 routes when using EVPN/VXLAN, but here I wanted to understand the very basics of hardware offload.
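Since the address has to be looked up again whenever it changes, it can be handy to read it from sysfs rather than from ifconfig output. A small sketch; the function name `iface_mac` and its optional second argument are assumptions made here purely so the helper can be tried against a fake directory:

```shell
# Hypothetical helper: print the MAC address of an interface from sysfs.
# $1 = interface name, $2 = sysfs root (defaults to /sys/class/net).
iface_mac() (
    ifname="$1"
    root="${2:-/sys/class/net}"
    cat "$root/$ifname/address"
)

# On Host2:
#   iface_mac vxlan16
```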

With the vxlan16 interface ready on Host2, but nothing configured on Host1 yet, there isn’t much to test apart from checking whether ARP requests are being sent to Host1 over the DAC connection.

With tcpdump running on enp6s0f1 on Host2 (or Host1), we can send a ping to 1.1.1.2 from Host2:

Host2$ ping 1.1.1.2
PING 1.1.1.2 (1.1.1.2) 56(84) bytes of data.
^C
--- 1.1.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1011ms

From another window, while the ping is running:

Host2$ sudo tcpdump -n -i enp6s0f1 -e
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp6s0f1, link-type EN10MB (Ethernet), capture size 262144 bytes

14:06:36.363483 04:3f:72:ad:1d:7f > 04:3f:72:ad:05:13, ethertype IPv4 (0x0800), length 92: 192.168.203.56.57222 > 192.168.203.51.4789: VXLAN, flags [I] (0x08), vni 16
82:8a:29:ad:4e:ef > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 1.1.1.2 tell 1.1.1.3, length 28

14:06:36.363602 04:3f:72:ad:05:13 > 04:3f:72:ad:1d:7f, ethertype IPv4 (0x0800), length 120: 192.168.203.51 > 192.168.203.56: ICMP 192.168.203.51 udp port 4789 unreachable, length 86

14:06:37.374091 04:3f:72:ad:1d:7f > 04:3f:72:ad:05:13, ethertype IPv4 (0x0800), length 92: 192.168.203.56.57222 > 192.168.203.51.4789: VXLAN, flags [I] (0x08), vni 16
82:8a:29:ad:4e:ef > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 1.1.1.2 tell 1.1.1.3, length 28

14:06:37.374183 04:3f:72:ad:05:13 > 04:3f:72:ad:1d:7f, ethertype IPv4 (0x0800), length 120: 192.168.203.51 > 192.168.203.56: ICMP 192.168.203.51 udp port 4789 unreachable, length 86

^C

Perfect! Ok, no response yet, but we can see ARP requests for 1.1.1.2 being sent via VXLAN to Host1.

Host1 with VXLAN offload

Hardware offload on ConnectX-5 works by enabling the NIC’s per-port eSwitch via /sbin/devlink (part of the iproute2 package), followed by enabling TC hardware offload (hw-tc-offload) and configuring TC rules. But let’s go through this step by step…

My goal here is to apply VXLAN offload to a Virtual Function (VF) interface (represented via a PCI address) and use that VF either in a VM, in a container using the SR-IOV plugin, or simply as a network interface. The main problem with VFs “handed over” to another network namespace, VM or container is: how do you program the offload filters when the interface is no longer in the default namespace? This is where the so-called eSwitch vport / VF representors, simply called “representors”, come in. I learned about them from a talk and slides from netdev 1.2 in 2016 by Or Gerlitz, Hadar Hen-Zion, Amir Vadai and Rony Efraim.

There are tons of instructions on the Internet for enabling SR-IOV and creating Virtual Functions (VFs). The kernel must have IOMMU enabled, which can be done by adding the following keywords to the GRUB_CMDLINE_LINUX_DEFAULT variable in /etc/default/grub (on Ubuntu):

Host1$ grep iommu /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="maybe-ubiquity default_hugepagesz=1G hugepagesz=1G hugepages=4 iommu=pt intel_iommu=on"

The hugepages aren’t required AFAIK, but I have them in there for DPDK applications like pktgen. Activate the change via ‘update-grub2’ followed by a reboot:

Host1$ sudo update-grub2 && sudo reboot
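After the reboot it is worth confirming that the parameters actually made it onto the running kernel's command line. A minimal sketch, assuming the two IOMMU parameters shown above are the ones that matter; the function name `iommu_params_present` is hypothetical:

```shell
# Hypothetical helper: succeeds if the kernel command line passed as $1
# contains both IOMMU parameters used in this setup as whole words.
iommu_params_present() (
    cmdline=" $1 "
    case "$cmdline" in *" intel_iommu=on "*) ;; *) exit 1 ;; esac
    case "$cmdline" in *" iommu=pt "*) ;; *) exit 1 ;; esac
)

# On Host1, after the reboot:
#   iommu_params_present "$(cat /proc/cmdline)" && echo "IOMMU parameters active"
```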

Before creating VFs and enabling switchdev, we need an automated mechanism to rename the created eth interfaces. Mellanox provides a repo with the required files at https://github.com/Mellanox/mlnx-tools

Clone the repo or download the files 82-net-setup-link.rules and vf-net-link-name.sh, then copy them into /etc/udev/rules.d and /lib/udev:

Host1# cp 82-net-setup-link.rules /etc/udev/rules.d/
Host1# cp vf-net-link-name.sh /lib/udev/
Host1# chmod +x /lib/udev/vf-net-link-name.sh

Enable 2 SR-IOV VFs (we will only need 1 though):

Host1# echo 2 > /sys/class/net/enp6s0f1/device/sriov_numvfs
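The write above fails if the device supports fewer VFs than requested. A small sketch that checks the `sriov_totalvfs` attribute first; the function name `set_numvfs` is an assumption, and the device directory is passed in as a parameter so the logic can be exercised on a scratch directory:

```shell
# Hypothetical helper: set the number of VFs only if the device supports
# that many. $1 = device directory (e.g. /sys/class/net/enp6s0f1/device),
# $2 = requested VF count.
set_numvfs() (
    dev="$1"; want="$2"
    max="$(cat "$dev/sriov_totalvfs")" || exit 1
    if [ "$want" -gt "$max" ]; then
        echo "requested $want VFs, device supports at most $max" >&2
        exit 1
    fi
    echo "$want" > "$dev/sriov_numvfs"
)

# On Host1 (as root):
#   set_numvfs /sys/class/net/enp6s0f1/device 2
```

Note that the kernel generally expects `sriov_numvfs` to be reset to 0 before a different non-zero value is written; the sketch does not handle that case.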

When verifying the created VFs, you’ll notice that they don’t have a valid Ethernet address, so we need to assign one manually:

Host1$ ip -d link show dev enp6s0f1
7: enp6s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 04:3f:72:ad:05:13 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 9978 addrgenmode eui64 numtxqueues 640 numrxqueues 32 gso_max_size 65536 gso_max_segs 65535 portname p1
vf 0 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 1 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
Host1# ip link set enp6s0f1 vf 0 mac e4:11:11:11:11:10
Host1# ip link set enp6s0f1 vf 1 mac e4:11:11:11:11:11

Check again the output of ‘ip -d link’:

Host1# ip -d link show dev enp6s0f1
7: enp6s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 04:3f:72:ad:05:13 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 9978 addrgenmode eui64 numtxqueues 640 numrxqueues 32 gso_max_size 65536 gso_max_segs 65535 portname p1
vf 0 link/ether e4:11:11:11:11:10 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 1 link/ether e4:11:11:11:11:11 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off

We must unbind the newly created VFs from the kernel driver before enabling the card’s eSwitch (switchdev). First, find out the PCI addresses of the Virtual Functions (06:01.2 and 06:01.3 in my case):

Host1# lspci |grep Ethernet |grep 06
06:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
06:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
06:01.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
06:01.3 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]

Unbind them from the kernel driver:

Host1# echo 0000:06:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind
Host1# echo 0000:06:01.3 > /sys/bus/pci/drivers/mlx5_core/unbind
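Instead of reading the addresses off the lspci output, they can also be derived from the `virtfn*` symlinks that the kernel creates in the physical function's device directory. A sketch under that assumption; the function name `vf_pci_addrs` is made up for illustration:

```shell
# Hypothetical helper: print the PCI address of every VF belonging to a
# physical function. $1 = device directory of the PF,
# e.g. /sys/class/net/enp6s0f1/device
vf_pci_addrs() (
    dev="$1"
    for link in "$dev"/virtfn*; do
        [ -L "$link" ] || continue          # skip if the glob did not match
        basename "$(readlink "$link")"      # "../0000:06:01.2" -> "0000:06:01.2"
    done
)

# On Host1 (as root), unbinding all VFs of port 1 in one go:
#   for addr in $(vf_pci_addrs /sys/class/net/enp6s0f1/device); do
#       echo "$addr" > /sys/bus/pci/drivers/mlx5_core/unbind
#   done
```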

Now set port 1 to switchdev mode:

Host1# devlink dev eswitch set pci/0000:06:00.1 mode switchdev   

Verify the change to switchdev by looking for the switchid in the ip link output:

Host1# ip -d link show dev enp6s0f1
7: enp6s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 04:3f:72:ad:05:13 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 9978 addrgenmode eui64 numtxqueues 640 numrxqueues 32 gso_max_size 65536 gso_max_segs 65535 portname p1 switchid 1205ad0003723f04
vf 0 link/ether e4:11:11:11:11:10 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
vf 1 link/ether e4:11:11:11:11:11 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
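Besides the switchid, the mode can also be queried from devlink directly with `devlink dev eswitch show`. The sketch below extracts the mode from that output; the helper name `eswitch_mode` is hypothetical, and the one-line output layout it expects (`pci/...: mode switchdev inline-mode ...`) is an assumption that may differ between iproute2 versions:

```shell
# Hypothetical helper: reads `devlink dev eswitch show` output on stdin and
# prints the word following the standalone token "mode".
eswitch_mode() {
    awk '{ for (i = 1; i < NF; i++)
               if ($i == "mode") { print $(i + 1); exit } }'
}

# On Host1:
#   devlink dev eswitch show pci/0000:06:00.1 | eswitch_mode
```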

With the helper files installed properly for udev, new representor interfaces have been created:

Host1# ls -l /sys/class/net/enp6s0f1*
lrwxrwxrwx 1 root root 0 Feb 15 14:04 /sys/class/net/enp6s0f1 -> ../../devices/pci0000:00/0000:00:03.1/0000:06:00.1/net/enp6s0f1
lrwxrwxrwx 1 root root 0 Feb 15 15:23 /sys/class/net/enp6s0f1_0 -> ../../devices/virtual/net/enp6s0f1_0
lrwxrwxrwx 1 root root 0 Feb 15 15:23 /sys/class/net/enp6s0f1_1 -> ../../devices/virtual/net/enp6s0f1_1

Bring up the representor interface for VF0 on port 1, then check that it’s up:

Host1# ip link set dev enp6s0f1_0 up

Host1# ip link show dev enp6s0f1_0
11: enp6s0f1_0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 56:42:16:30:c6:2d brd ff:ff:ff:ff:ff:ff

Enable hardware TC offload via ethtool and add an ingress qdisc to the physical port 1 (enp6s0f1) and the VF representor interface (enp6s0f1_0):

Host1# ethtool --features enp6s0f1 hw-tc-offload on
Host1# ethtool --features enp6s0f1_0 hw-tc-offload on

Host1# tc qdisc add dev enp6s0f1 ingress
Host1# tc qdisc add dev enp6s0f1_0 ingress
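Whether the feature really got switched on can be read back with `ethtool --show-features`. A minimal sketch that parses that listing; the function name `tc_offload_enabled` is an assumption, and it expects lines of the form `hw-tc-offload: on`:

```shell
# Hypothetical helper: reads `ethtool --show-features <dev>` output on stdin
# and succeeds only if the hw-tc-offload line reports "on".
tc_offload_enabled() {
    awk -F': *' '
        $1 ~ /^[ \t]*hw-tc-offload$/ { state = $2 }
        END { exit (state ~ /^on/) ? 0 : 1 }'
}

# On Host1:
#   ethtool --show-features enp6s0f1   | tc_offload_enabled && echo "port ok"
#   ethtool --show-features enp6s0f1_0 | tc_offload_enabled && echo "representor ok"
```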

Bind the VF back to the kernel driver in the default namespace, because we simply want to assign an IP address to test VXLAN offload. Usually, that VF would be assigned to a container or Kubernetes Pod via the SR-IOV CNI.

Host1# echo 0000:06:01.2 > /sys/bus/pci/drivers/mlx5_core/bind

This will create the VF interface (enp6s0f1np0v0). The actual name might be different on your system. Use this command to find out:

Host1# ls -l /sys/class/net/enp6s0f1*
lrwxrwxrwx 1 root root 0 Feb 15 14:04 /sys/class/net/enp6s0f1 -> ../../devices/pci0000:00/0000:00:03.1/0000:06:00.1/net/enp6s0f1
lrwxrwxrwx 1 root root 0 Feb 15 15:23 /sys/class/net/enp6s0f1_0 -> ../../devices/virtual/net/enp6s0f1_0
lrwxrwxrwx 1 root root 0 Feb 15 15:23 /sys/class/net/enp6s0f1_1 -> ../../devices/virtual/net/enp6s0f1_1
lrwxrwxrwx 1 root root 0 Feb 15 15:35 /sys/class/net/enp6s0f1np0v0 -> ../../devices/pci0000:00/0000:00:03.1/0000:06:01.2/net/enp6s0f1np0v0

The last entry contains the familiar ’06:01.2′ we have seen earlier for VF0. This will become our “vxlan overlay” interface, connected to the vxlan16 interface on Host2. Assign an IP address and bring it up:

Host1# ifconfig enp6s0f1np0v0 1.1.1.2/24 up

Check the interface and you’ll find our manually assigned ethernet address:

Host1# ifconfig enp6s0f1np0v0
enp6s0f1np0v0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 1.1.1.2 netmask 255.255.255.0 broadcast 1.1.1.255
inet6 fe80::e611:11ff:fe11:1110 prefixlen 64 scopeid 0x20<link>
ether e4:11:11:11:11:10 txqueuelen 1000 (Ethernet)

Before adding tc ingress filter rules, we need to create a VXLAN interface. This step wasn’t intuitive to me; after all, we want all encap/decap to happen in hardware. But it turns out this step is required, even though no packets are ever “seen” on it.

Host1# ip link add name vxlan16 type vxlan id 16 dev enp6s0f1 remote 192.168.203.56 dstport 4789
Host1# ifconfig vxlan16 up

Reduce the MTU of the VF and representor interface to make room for the VXLAN overhead:

Host1# ifconfig enp6s0f1np0v0 mtu 1450
Host1# ifconfig enp6s0f1_0 mtu 1450
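The value 1450 follows from the headers that VXLAN over IPv4 adds inside the 1500-byte underlay packet. A short worked calculation, assuming an outer IPv4 header without options and an untagged inner Ethernet frame (the variable names are mine):

```shell
# Bytes that must fit inside the underlay MTU in addition to the inner
# IP packet (assumptions: IPv4 underlay, no IP options, no VLAN tag).
OUTER_IPV4=20   # outer IPv4 header
OUTER_UDP=8     # outer UDP header
VXLAN_HDR=8     # VXLAN header
INNER_ETH=14    # inner Ethernet header carried as payload

UNDERLAY_MTU=1500
overhead=$((OUTER_IPV4 + OUTER_UDP + VXLAN_HDR + INNER_ETH))
overlay_mtu=$((UNDERLAY_MTU - overhead))

echo "VXLAN overhead: $overhead bytes"   # prints "VXLAN overhead: 50 bytes"
echo "overlay MTU:    $overlay_mtu"      # prints "overlay MTU:    1450"
```

The same 50 bytes show up in the packet captures in this post: an inner frame of length 98 travels in an outer frame of length 148.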

We enabled the ingress qdisc earlier on the physical port and the VF representor interface, but it is also needed on the vxlan16 interface:

Host1# tc qdisc add dev vxlan16 ingress

Now we can finally add TC rules to encapsulate traffic from the VF in a VXLAN header. In order to match the L2 packets, we need the Ethernet address of the vxlan16 interface on Host2 (which I mentioned earlier). Let’s grab it again from Host2:

Host2$ ifconfig vxlan16
vxlan16: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 1.1.1.3 netmask 255.255.255.0 broadcast 1.1.1.255
inet6 fe80::808a:29ff:fead:4eef prefixlen 64 scopeid 0x20<link>
ether 82:8a:29:ad:4e:ef txqueuelen 1000 (Ethernet)

Now add the VXLAN encap filters for IP and ARP (unicast and broadcast):

tc filter add dev enp6s0f1_0 protocol ip parent ffff: prio 1 \
    flower \
    dst_mac 82:8a:29:ad:4e:ef \
    src_mac e4:11:11:11:11:10 \
    action tunnel_key set \
    src_ip 192.168.203.51 \
    dst_ip 192.168.203.56 \
    dst_port 4789 \
    id 16 \
    action mirred egress redirect dev vxlan16

tc filter add dev enp6s0f1_0 protocol arp parent ffff: prio 2 \
    flower \
    dst_mac 82:8a:29:ad:4e:ef \
    src_mac e4:11:11:11:11:10 \
    action tunnel_key set \
    src_ip 192.168.203.51 \
    dst_ip 192.168.203.56 \
    dst_port 4789 \
    id 16 \
    action mirred egress redirect dev vxlan16

tc filter add dev enp6s0f1_0 protocol arp parent ffff: prio 3 \
    flower \
    dst_mac ff:ff:ff:ff:ff:ff \
    src_mac e4:11:11:11:11:10 \
    action tunnel_key set \
    src_ip 192.168.203.51 \
    dst_ip 192.168.203.56 \
    dst_port 4789 \
    id 16 \
    action mirred egress redirect dev vxlan16

And decap filters on vxlan16, redirected to enp6s0f1_0, also for IP, ARP unicast and broadcast:

tc filter add dev vxlan16 protocol ip parent ffff: prio 1 \
    flower \
    dst_mac e4:11:11:11:11:10 \
    src_mac 82:8a:29:ad:4e:ef \
    enc_src_ip 192.168.203.56 \
    enc_dst_ip 192.168.203.51 \
    enc_dst_port 4789 \
    enc_key_id 16 \
    action tunnel_key unset \
    action mirred egress redirect dev enp6s0f1_0

tc filter add dev vxlan16 protocol arp parent ffff: prio 2 \
    flower \
    dst_mac e4:11:11:11:11:10 \
    src_mac 82:8a:29:ad:4e:ef \
    enc_src_ip 192.168.203.56 \
    enc_dst_ip 192.168.203.51 \
    enc_dst_port 4789 \
    enc_key_id 16 \
    action tunnel_key unset \
    action mirred egress redirect dev enp6s0f1_0

tc filter add dev vxlan16 protocol arp parent ffff: prio 3 \
    flower \
    dst_mac ff:ff:ff:ff:ff:ff \
    src_mac 82:8a:29:ad:4e:ef \
    enc_src_ip 192.168.203.56 \
    enc_dst_ip 192.168.203.51 \
    enc_dst_port 4789 \
    enc_key_id 16 \
    action tunnel_key unset \
    action mirred egress redirect dev enp6s0f1_0
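Because the remote MAC address can change and all six filters then have to be re-created, it may help to generate the commands from a handful of parameters. The following is only a sketch: the function name `offload_filters` and its argument order are assumptions, and it merely prints the same tc commands shown above (one per line) rather than running them:

```shell
# Hypothetical generator: prints the three encap and three decap filters.
# $1 = VF representor   $2 = VXLAN device
# $3 = local VF MAC     $4 = remote overlay MAC
# $5 = local underlay IP  $6 = remote underlay IP  $7 = VNI  $8 = UDP port
offload_filters() (
    rep="$1"; vx="$2"; lmac="$3"; rmac="$4"
    lip="$5"; rip="$6"; vni="$7"; port="$8"
    prio=1
    for match in "ip unicast" "arp unicast" "arp broadcast"; do
        set -- $match                  # $1 = protocol, $2 = unicast|broadcast
        proto="$1"
        if [ "$2" = broadcast ]; then
            to_remote=ff:ff:ff:ff:ff:ff; to_local=ff:ff:ff:ff:ff:ff
        else
            to_remote="$rmac"; to_local="$lmac"
        fi
        # encap: VF representor -> VXLAN device
        echo "tc filter add dev $rep protocol $proto parent ffff: prio $prio" \
             "flower dst_mac $to_remote src_mac $lmac" \
             "action tunnel_key set src_ip $lip dst_ip $rip dst_port $port id $vni" \
             "action mirred egress redirect dev $vx"
        # decap: VXLAN device -> VF representor
        echo "tc filter add dev $vx protocol $proto parent ffff: prio $prio" \
             "flower dst_mac $to_local src_mac $rmac" \
             "enc_src_ip $rip enc_dst_ip $lip enc_dst_port $port enc_key_id $vni" \
             "action tunnel_key unset" \
             "action mirred egress redirect dev $rep"
        prio=$((prio + 1))
    done
)

# Review the generated commands first, then pipe them to a root shell:
#   offload_filters enp6s0f1_0 vxlan16 e4:11:11:11:11:10 82:8a:29:ad:4e:ef \
#       192.168.203.51 192.168.203.56 16 4789
```

Existing filters with the old MAC address would still have to be deleted beforehand, for example with `tc filter del`; the sketch leaves that out.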

Verify that they got applied properly on the interfaces via ‘tc -s filter show …’. Watch out for the ‘in_hw’ flag and the ‘in_hw_count’:

Host1# tc -s filter show dev enp6s0f1_0 root | grep --color -E 'in_hw|'
filter parent ffff: protocol ip pref 1 flower chain 0
filter parent ffff: protocol ip pref 1 flower chain 0 handle 0x1
dst_mac 82:8a:29:ad:4e:ef
src_mac e4:11:11:11:11:10
eth_type ipv4
in_hw in_hw_count 1
action order 1: tunnel_key set
src_ip 192.168.203.51
dst_ip 192.168.203.56
key_id 16
dst_port 4789
csum pipe
index 1 ref 1 bind 1 installed 45 sec used 45 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

action order 2: mirred (Egress Redirect to device vxlan16) stolen
index 1 ref 1 bind 1 installed 45 sec used 45 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

filter parent ffff: protocol arp pref 2 flower chain 0
filter parent ffff: protocol arp pref 2 flower chain 0 handle 0x1
dst_mac 82:8a:29:ad:4e:ef
src_mac e4:11:11:11:11:10
eth_type arp
in_hw in_hw_count 1
action order 1: tunnel_key set
src_ip 192.168.203.51
dst_ip 192.168.203.56
key_id 16
dst_port 4789
csum pipe
index 2 ref 1 bind 1 installed 45 sec used 45 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

action order 2: mirred (Egress Redirect to device vxlan16) stolen
index 2 ref 1 bind 1 installed 45 sec used 45 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

filter parent ffff: protocol arp pref 3 flower chain 0
filter parent ffff: protocol arp pref 3 flower chain 0 handle 0x1
dst_mac ff:ff:ff:ff:ff:ff
src_mac e4:11:11:11:11:10
eth_type arp
in_hw in_hw_count 1
action order 1: tunnel_key set
src_ip 192.168.203.51
dst_ip 192.168.203.56
key_id 16
dst_port 4789
csum pipe
index 3 ref 1 bind 1 installed 45 sec used 45 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

action order 2: mirred (Egress Redirect to device vxlan16) stolen
index 3 ref 1 bind 1 installed 45 sec used 45 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
Host1# tc -s filter show dev vxlan16 root | grep --color -E 'in_hw|'
filter parent ffff: protocol ip pref 1 flower chain 0
filter parent ffff: protocol ip pref 1 flower chain 0 handle 0x1
dst_mac e4:11:11:11:11:10
src_mac 82:8a:29:ad:4e:ef
eth_type ipv4
enc_dst_ip 192.168.203.51
enc_src_ip 192.168.203.56
enc_key_id 16
enc_dst_port 4789
in_hw in_hw_count 1
action order 1: tunnel_key unset pipe
index 4 ref 1 bind 1 installed 135 sec used 135 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

action order 2: mirred (Egress Redirect to device enp6s0f1_0) stolen
index 4 ref 1 bind 1 installed 135 sec used 135 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

filter parent ffff: protocol arp pref 2 flower chain 0
filter parent ffff: protocol arp pref 2 flower chain 0 handle 0x1
dst_mac e4:11:11:11:11:10
src_mac 82:8a:29:ad:4e:ef
eth_type arp
enc_dst_ip 192.168.203.51
enc_src_ip 192.168.203.56
enc_key_id 16
enc_dst_port 4789
in_hw in_hw_count 1
action order 1: tunnel_key unset pipe
index 5 ref 1 bind 1 installed 135 sec used 135 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

action order 2: mirred (Egress Redirect to device enp6s0f1_0) stolen
index 5 ref 1 bind 1 installed 135 sec used 135 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

filter parent ffff: protocol arp pref 3 flower chain 0
filter parent ffff: protocol arp pref 3 flower chain 0 handle 0x1
dst_mac ff:ff:ff:ff:ff:ff
src_mac 82:8a:29:ad:4e:ef
eth_type arp
enc_dst_ip 192.168.203.51
enc_src_ip 192.168.203.56
enc_key_id 16
enc_dst_port 4789
in_hw in_hw_count 1
action order 1: tunnel_key unset pipe
index 6 ref 1 bind 1 installed 135 sec used 135 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

action order 2: mirred (Egress Redirect to device enp6s0f1_0) stolen
index 6 ref 1 bind 1 installed 135 sec used 135 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
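With this much output it is easy to overlook a filter that silently stayed in software. A small sketch that summarises the flags; the helper name `hw_offload_summary` is hypothetical, and it relies on tc printing either a line starting with `in_hw` or one starting with `not_in_hw` per filter, as in the listings above:

```shell
# Hypothetical helper: reads `tc -s filter show ...` output on stdin, counts
# offloaded vs. software-only filters, and fails if any filter is not in
# hardware or if no filter was found at all.
hw_offload_summary() {
    awk '$1 == "in_hw"     { ok++ }
         $1 == "not_in_hw" { bad++ }
         END { printf "%d offloaded, %d software-only\n", ok, bad
               exit (bad > 0 || ok == 0) ? 1 : 0 }'
}

# On Host1:
#   tc -s filter show dev enp6s0f1_0 root | hw_offload_summary
#   tc -s filter show dev vxlan16 root    | hw_offload_summary
```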

OK, I think we are all set now to finally test it!
Ping 1.1.1.3 from Host1:

Host1# ping 1.1.1.3
PING 1.1.1.3 (1.1.1.3) 56(84) bytes of data.
64 bytes from 1.1.1.3: icmp_seq=1 ttl=64 time=0.411 ms
64 bytes from 1.1.1.3: icmp_seq=2 ttl=64 time=0.233 ms
^C
--- 1.1.1.3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.233/0.322/0.411/0.089 ms

I had a tcpdump running on Host2 before launching the ping. It captured the ARP resolution and ICMP exchange over VXLAN nicely:

Host2$ sudo tcpdump -n -i enp6s0f1 -e
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp6s0f1, link-type EN10MB (Ethernet), capture size 262144 bytes
16:04:30.095106 04:3f:72:ad:05:13 > 04:3f:72:ad:1d:7f, ethertype IPv4 (0x0800), length 110: 192.168.203.51.49152 > 192.168.203.56.4789: VXLAN, flags [I] (0x08), vni 16
e4:11:11:11:11:10 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 1.1.1.3 tell 1.1.1.2, length 46
16:04:30.095212 04:3f:72:ad:1d:7f > 04:3f:72:ad:05:13, ethertype IPv4 (0x0800), length 92: 192.168.203.56.57222 > 192.168.203.51.4789: VXLAN, flags [I] (0x08), vni 16
82:8a:29:ad:4e:ef > e4:11:11:11:11:10, ethertype ARP (0x0806), length 42: Reply 1.1.1.3 is-at 82:8a:29:ad:4e:ef, length 28
16:04:30.095309 04:3f:72:ad:05:13 > 04:3f:72:ad:1d:7f, ethertype IPv4 (0x0800), length 148: 192.168.203.51.58743 > 192.168.203.56.4789: VXLAN, flags [I] (0x08), vni 16
e4:11:11:11:11:10 > 82:8a:29:ad:4e:ef, ethertype IPv4 (0x0800), length 98: 1.1.1.2 > 1.1.1.3: ICMP echo request, id 1, seq 1, length 64
16:04:30.095351 04:3f:72:ad:1d:7f > 04:3f:72:ad:05:13, ethertype IPv4 (0x0800), length 148: 192.168.203.56.47046 > 192.168.203.51.4789: VXLAN, flags [I] (0x08), vni 16
82:8a:29:ad:4e:ef > e4:11:11:11:11:10, ethertype IPv4 (0x0800), length 98: 1.1.1.3 > 1.1.1.2: ICMP echo reply, id 1, seq 1, length 64
16:04:31.096154 04:3f:72:ad:05:13 > 04:3f:72:ad:1d:7f, ethertype IPv4 (0x0800), length 148: 192.168.203.51.58743 > 192.168.203.56.4789: VXLAN, flags [I] (0x08), vni 16
e4:11:11:11:11:10 > 82:8a:29:ad:4e:ef, ethertype IPv4 (0x0800), length 98: 1.1.1.2 > 1.1.1.3: ICMP echo request, id 1, seq 2, length 64
16:04:31.096193 04:3f:72:ad:1d:7f > 04:3f:72:ad:05:13, ethertype IPv4 (0x0800), length 148: 192.168.203.56.47046 > 192.168.203.51.4789: VXLAN, flags [I] (0x08), vni 16
82:8a:29:ad:4e:ef > e4:11:11:11:11:10, ethertype IPv4 (0x0800), length 98: 1.1.1.3 > 1.1.1.2: ICMP echo reply, id 1, seq 2, length 64
^C
6 packets captured
6 packets received by filter
0 packets dropped by kernel

But how do we know the encap/decap actually happened in hardware on the Ethernet card? By running tcpdump on vxlan16 on Host1, which doesn’t capture any traffic:

Host1$ sudo tcpdump -n -i vxlan16 -e
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vxlan16, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

We can also check the hardware transmit counters using tc:

Host1$ tc -s filter show dev enp6s0f1_0 root  | grep 'Sent hardware'
Sent hardware 888 bytes 6 pkt
Sent hardware 330 bytes 3 pkt
Sent hardware 110 bytes 1 pkt

Host1$ tc -s filter show dev vxlan16 root | grep 'Sent hardware'
Sent hardware 588 bytes 6 pkt
Sent hardware 168 bytes 4 pkt
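To get a single figure per interface, the per-filter hardware counters can be added up. A minimal sketch; the helper name `hw_counter_totals` is an assumption, and it expects the `Sent hardware <bytes> bytes <pkts> pkt` line format shown above:

```shell
# Hypothetical helper: reads `tc -s filter show ...` output on stdin and sums
# the byte and packet counts of all "Sent hardware" lines.
hw_counter_totals() {
    awk '$1 == "Sent" && $2 == "hardware" { bytes += $3; pkts += $5 }
         END { printf "%d bytes %d pkt\n", bytes, pkts }'
}

# On Host1:
#   tc -s filter show dev enp6s0f1_0 root | hw_counter_totals
```

Fed with the three enp6s0f1_0 lines above, this adds up to 1328 bytes in 10 packets.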

Summary

This was my very first attempt at using VXLAN hardware offload on the Mellanox ConnectX-5 card, and I wanted to document all the steps that finally got the offload working. A setup like this won’t be used 1:1 in a production deployment, but it served me well to learn what’s required “under the hood” and how to troubleshoot it.

Some readers might simply point to OVS, which not only simplifies many of the steps taken, but also automates the MAC learning.

Maybe I’ll find some time to explore this hardware offload further, e.g. by “outsourcing” the learning of layer2 addresses to BGP via EVPN/VXLAN, handled by Juniper cRPD.