Proxmox: Packet loss on bonded NICs in LACP mode

Got a bit of a strange problem. I have a machine running Proxmox 5.3 that has, among its hardware, a 4-port Intel Gigabit NIC (PCIe) in addition to a fifth Gigabit Ethernet port on the motherboard.



I have the machine configured so that the onboard NIC is the management interface, and the four Gigabit NICs are bonded together with LACP (and connected to an HP ProCurve 1810G managed switch) - all the VMs and containers on the box get network connectivity through the bonded NIC. The switch is managed and supports LACP, and I have configured a trunk on it for the four ports.
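
For reference, the 802.3ad negotiation can be inspected from the Linux side through the standard bonding proc interface (nothing Proxmox-specific assumed here); each slave should show the same Aggregator ID and a Partner Mac Address belonging to the switch:

# Shows bond mode, LACP partner details and per-slave aggregator membership
cat /proc/net/bonding/bond0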



Everything seems to work fine, or so I thought.



Over the weekend I installed netdata on the Proxmox host, and now I'm getting continual alarms about packet loss on bond0 (the 4 bonded NICs). I'm a little perplexed as to why.



Looking at the statistics for bond0, RX packets are being dropped fairly regularly (currently ~160 RX packets dropped in the last 10 minutes; no TX packets appear to be dropped).



Interface output is below; you'll note that the bridge interface to the VMs has no dropped packets - it's happening only on bond0 and its slaves. The MTU is set to 9000 (jumbo frames are enabled on the switch); I was still seeing this issue with the MTU at 1500. enp12s0 is the management NIC; the other four NICs are the bond slaves.



bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 9000
ether 00:1b:21:c7:40:d8 txqueuelen 1000 (Ethernet)
RX packets 347300 bytes 146689725 (139.8 MiB)
RX errors 0 dropped 11218 overruns 0 frame 0
TX packets 338459 bytes 132985798 (126.8 MiB)
TX errors 0 dropped 2 overruns 0 carrier 0 collisions 0

enp12s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.3 netmask 255.255.255.0 broadcast 192.168.1.255
inet6 fe80::7285:c2ff:fe67:19b9 prefixlen 64 scopeid 0x20<link>
ether 70:85:c2:67:19:b9 txqueuelen 1000 (Ethernet)
RX packets 25416597 bytes 36117733348 (33.6 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 16850795 bytes 21472508786 (19.9 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp3s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9000
ether 00:1b:21:c7:40:d8 txqueuelen 1000 (Ethernet)
RX packets 225363 bytes 113059352 (107.8 MiB)
RX errors 0 dropped 2805 overruns 0 frame 0
TX packets 15162 bytes 2367657 (2.2 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp3s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9000
ether 00:1b:21:c7:40:d8 txqueuelen 1000 (Ethernet)
RX packets 25499 bytes 6988254 (6.6 MiB)
RX errors 0 dropped 2805 overruns 0 frame 0
TX packets 263442 bytes 123302293 (117.5 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp4s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9000
ether 00:1b:21:c7:40:d8 txqueuelen 1000 (Ethernet)
RX packets 33208 bytes 11681537 (11.1 MiB)
RX errors 0 dropped 2804 overruns 0 frame 0
TX packets 42729 bytes 2258949 (2.1 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp4s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9000
ether 00:1b:21:c7:40:d8 txqueuelen 1000 (Ethernet)
RX packets 63230 bytes 14960582 (14.2 MiB)
RX errors 0 dropped 2804 overruns 0 frame 0
TX packets 17126 bytes 5056899 (4.8 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

vmbr0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 192.168.1.4 netmask 255.255.255.0 broadcast 192.168.1.255
inet6 fe80::21b:21ff:fec7:40d8 prefixlen 64 scopeid 0x20<link>
ether 00:1b:21:c7:40:d8 txqueuelen 1000 (Ethernet)
RX packets 54616 bytes 5852177 (5.5 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 757 bytes 61270 (59.8 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
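
For anyone wanting to watch these counters outside netdata, they can be sampled directly; something along these lines (assuming iproute2, procps-ng watch and ethtool are available on the host) shows whether the drops accrue steadily or in bursts:

# Kernel-level drop counters, refreshed every second
watch -n 1 'ip -s link show bond0; cat /sys/class/net/bond0/statistics/rx_dropped'

# Driver/hardware counters on one slave (counter names vary between NIC drivers)
ethtool -S enp3s0f0 | grep -iE 'drop|miss|err'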


Initially suspecting some kind of buffer issue, I did some sysctl tuning to make sure the buffer sizes were adequate. The sysctl tweaks can be found here (they did not appear to make any difference):



https://paste.linux.community/view/3b5f2b63
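
(The paste itself is not reproduced here. For context, receive-buffer tuning of this kind usually touches sysctls along the following lines - the values shown are purely illustrative and are not taken from the paste:)

# Illustrative examples only - not the values from the linked paste
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.rmem_default=262144
sysctl -w net.core.netdev_max_backlog=5000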



Network config is:



auto lo
iface lo inet loopback

auto enp12s0
iface enp12s0 inet static
address 192.168.1.3
netmask 255.255.255.0

iface enp3s0f0 inet manual

iface enp3s0f1 inet manual

iface enp4s0f0 inet manual

iface enp4s0f1 inet manual

auto bond0
iface bond0 inet manual
bond-slaves enp3s0f0 enp3s0f1 enp4s0f0 enp4s0f1
bond-miimon 100
bond-mode 802.3ad
mtu 9000

auto vmbr0
iface vmbr0 inet static
address 192.168.1.4
netmask 255.255.255.0
gateway 192.168.1.1
bridge-ports bond0
bridge-stp off
bridge-fd 0
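
Assuming the standard Linux bonding sysfs interface, the running bond can be compared against this config to confirm that the mode, miimon and slave set were actually applied:

cat /sys/class/net/bond0/bonding/mode      # expect: 802.3ad 4
cat /sys/class/net/bond0/bonding/miimon    # expect: 100
cat /sys/class/net/bond0/bonding/slaves    # expect: enp3s0f0 enp3s0f1 enp4s0f0 enp4s0f1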


Troubleshooting steps I took:

a) sysctl tweaks (as linked above)
b) MTU increase and enabling jumbo frames on the switch (no change - see the jumbo-frame check below)
c) Reset the switch and recreated the LACP trunk (no change)
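
For step b), a quick end-to-end jumbo-frame check, assuming iputils ping and another jumbo-capable host on the LAN (192.168.1.1 is just an example from the config above); 8972 bytes of payload plus 28 bytes of IP/ICMP headers fills a 9000-byte MTU:

# Don't-fragment ping at the largest payload that fits MTU 9000
ping -M do -s 8972 -c 5 192.168.1.1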



Any ideas on what I should try next? I'm starting to think there's something I don't understand about NIC teaming. As I said, everything seems to work fine, but the high packet loss concerns me a little.



Other machines on the network that are connected to the switch do not have this issue (the 5th NIC on the machine is fine, too).










linux networking network-adapter bonding proxmox






asked Feb 11 at 10:18 by NOP

1 Answer

I have seen this before: HP switches sometimes seem to send broadcast packets to all members of an LACP trunk. The kernel then sees these packets as duplicates and drops them (apart from the first one to arrive, of course).

While this is of course not elegant, it does not seem to cause problems in real life. You can verify whether this is the effect by deliberately sending a lot of broadcast packets and checking whether that lines up with the drop statistics.

– Eugen Rieck, answered Feb 11 at 10:32
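
One way to run that check, sketched here under the assumption of the 192.168.1.0/24 network from the question (these are not the exact commands used by either party):

# On another machine on the LAN: generate a steady stream of broadcast pings
ping -b -i 0.2 -c 500 192.168.1.255

# On the Proxmox host, in parallel: see whether bond0's RX drop counter climbs in step with it
watch -n 1 cat /sys/class/net/bond0/statistics/rx_dropped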






• Hmm, I did suspect something like this. That would make sense (from the math, given that there's basically the same number of dropped packets across each NIC). Is there a way to tell exactly what's getting dropped? I ran dropwatch on the machine, but I really couldn't interpret the output well enough to conclude one way or the other.
  – NOP, Feb 11 at 10:41











• My diagnostics were rather basic: increase the rate of broadcast packets and see if the drop rate increases accordingly. I didn't investigate much further, as everything was working fine with no loss of payload packets.
  – Eugen Rieck, Feb 11 at 10:59











• I did some more digging around. I wrote a script to fire out a whole bunch of broadcast packets. At the same time, I ran tcpdump on all four individual NICs plus the bond interface. While the broadcast traffic was mostly appearing on only one NIC, the odd packet did end up hitting another NIC (a minimal version of this per-slave capture is sketched after these comments). It didn't seem to increase the rate of dropped packets, though. The box has been running for a couple of hours, and with several VMs and containers running plus me pelting it with broadcast packets, bond0 has dropped ~700 of 1,437,352 RX packets. This is representative of what I've been seeing.
  – NOP, Feb 13 at 2:13













• OK, to try to do some further digging I shut down a particularly chatty VM on the box, which cleared up tcpdump quite a bit. Looking at the tcpdump of the bond interface while watching the interface statistics, there's a very strong correlation between ARP requests being on the wire and packets getting dropped. Still sitting at around ~1.5% dropped RX packets on the bond. I guess this makes sense, right?
  – NOP, Feb 13 at 2:44











• ARP packets are the canonical example of broadcast traffic, so this does tend to confirm our suspicions. TBH I don't know how ARP over an LACP trunk is actually supposed to be broadcast - maybe an "all ports" policy is the standard?
  – Eugen Rieck, Feb 13 at 8:05
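
The per-slave capture mentioned in the comments can be reproduced with something like the following (assuming tcpdump is installed); if the switch really floods broadcasts down the whole trunk, the identical ARP frame shows up on more than one slave at the same moment:

# One capture per slave, e.g. in separate terminals; compare timestamps and frame contents
tcpdump -eni enp3s0f0 arp
tcpdump -eni enp3s0f1 arp
tcpdump -eni enp4s0f0 arp
tcpdump -eni enp4s0f1 arp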










