As a little "heads-up": if you are running Ubuntu Bionic 18.04, or Xenial 16.04 with "HWE" kernels, and have systems running kernel 4.15.0-36 - which was the default between 2018-10-01 and 2018-10-22 - please consider upgrading to the latest 4.15.0-38 ASAP (or downgrading to 4.15.0-34).

4.15.0-36 has a TCP bug[1] that can occasionally slow down a TCP connection to a trickle of 2.5 Kbytes/s (512-byte segments every 200ms). Once a TCP connection is in this state, it never recovers.

This started happening within our Ceph clusters after we reinstalled a few servers as part of our Bluestore migration. The effect on our RBD users (OpenStack VMs) was pretty terrible - the typical 4MB transaction would take about 27 MINUTES at this rate, causing timeouts and crashes.

This was absolutely painful to diagnose, because it happened so rarely and was hard to reproduce. Fortunately the fix is easy - just don't run this kernel.

I should note that our Ceph clusters run over IPv6. The bug was reported for IPv6 as well; I'm not sure whether it can also hit with IPv4, although I see no reason why it shouldn't.

--
Simon.

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1796895
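
For anyone who wants to check a fleet quickly, here is a rough Python sketch, not a definitive tool. It assumes the running kernel release string looks like "4.15.0-36-generic" (what `uname -r` typically prints on Ubuntu); the helper names `parse_release`, `is_affected` and `transfer_seconds` are made up for this example. It also redoes the arithmetic from the post: 512 bytes every 200ms is 2560 bytes/s, so a 4MB transfer takes roughly 27 minutes.

```python
import platform
import re

# Kernel build named in the post as affected (version 4.15.0, ABI 36).
AFFECTED = (4, 15, 0, 36)


def parse_release(release):
    """Turn '4.15.0-36-generic' into (4, 15, 0, 36).

    Returns None if the string does not follow the assumed
    'X.Y.Z-ABI[-flavour]' layout.
    """
    m = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    return tuple(int(g) for g in m.groups()) if m else None


def is_affected(release):
    """True only for the exact build called out in the post."""
    return parse_release(release) == AFFECTED


def transfer_seconds(size_bytes, segment_bytes=512, interval_s=0.2):
    """Time to move size_bytes at one segment per interval."""
    rate = segment_bytes / interval_s  # bytes per second
    return size_bytes / rate


if __name__ == "__main__":
    rel = platform.release()  # same string as `uname -r`
    if is_affected(rel):
        print(f"{rel}: affected, upgrade to 4.15.0-38 or downgrade to 4.15.0-34")
    else:
        print(f"{rel}: not the 4.15.0-36 build")
    minutes = transfer_seconds(4 * 1024 * 1024) / 60
    print(f"4MB at the stalled rate: about {minutes:.0f} minutes")
```

Note that this only looks at the running kernel; a host that has 4.15.0-38 installed but has not been rebooted will still report 4.15.0-36.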