Following back up. We ended up increasing min_free_kbytes to 6GB and dropping the MTU back down to 1500. For now, we seem to have stopped the transmit packet drops, though there are other bugs we (and RH) are chasing down. It got bad enough last week that even an ifconfig command resulted in a 32KB page allocation failure. This much min_free_kbytes is definitely not ideal, but for now, it’s helping us get by.
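For anyone who wants to watch for this kind of fragmentation, here is a rough Python sketch (not something from this thread) that reads /proc/buddyinfo and counts free blocks large enough for a 32KB contiguous allocation. It assumes 4KB base pages, so 32KB corresponds to order 3; the function names are made up for illustration.

```python
PAGE_SIZE_KB = 4  # assumption: 4 KB base pages (typical on x86_64)


def parse_buddyinfo(text):
    """Return {(node, zone): [free block count per order]} from /proc/buddyinfo text."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0, zone Normal c0 c1 c2 ..."
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(c) for c in parts[4:]]
    return zones


def free_blocks_at_or_above(counts, size_kb, page_size_kb=PAGE_SIZE_KB):
    """Count free blocks big enough for one contiguous allocation of size_kb."""
    order = 0
    while (page_size_kb << order) < size_kb:
        order += 1
    return sum(counts[order:])


if __name__ == "__main__":
    try:
        with open("/proc/buddyinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not on Linux, or /proc not readable
    for (node, zone), counts in parse_buddyinfo(text).items():
        n = free_blocks_at_or_above(counts, 32)
        print(f"node {node} zone {zone}: {n} free blocks >= 32KB")
```

A count near zero in the Normal zone would be consistent with the allocation failures described above, though it is only a snapshot and the numbers move quickly under load.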
Warren Wang <Warren.Wang@xxxxxxxxxxx>
We’re already at 4GB min_free_kbytes. I hesitate to increase it more because we seem to already be short on memory at times. We see commit % over 100% for hours, and I’m not entirely sure what the effect of increasing it more would be during those times. I do agree that it might help before we hit critical mass. Our recovery event finally ended after over a week (one host in, one host out), and things are more stable now.
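As a side note on the commit % figure: the message does not say which tool reported it. A minimal sketch, assuming it is computed the way sar reports %commit (Committed_AS relative to RAM plus swap), could derive it from /proc/meminfo like this. The helper names are hypothetical.

```python
def parse_meminfo(text):
    """Return {field name: value} for the numeric fields of /proc/meminfo text."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])  # most fields are in kB
    return info


def commit_percent(info):
    """Committed_AS as a percentage of MemTotal + SwapTotal (assumed definition)."""
    total_kb = info["MemTotal"] + info.get("SwapTotal", 0)
    return 100.0 * info["Committed_AS"] / total_kb


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        print(f"commit: {commit_percent(info):.1f}%")
    except (OSError, KeyError):
        print("could not read /proc/meminfo")
```

A value over 100% means the kernel has promised more memory than RAM plus swap can back, which is the situation described above.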
Long term, perhaps the memory recommendations for large busy clusters should be much higher than what they are now, or the memory management will need to get better.
Ben England <bengland@xxxxxxxxxx>
If you are having memory fragmentation problems with jumbo frames, you could try increasing vm.min_free_kbytes so the system doesn't have to work so hard to find chunks of memory with the right size, and reclaims memory from inactive pages sooner (i.e. before it is needed). Usually you can double the default for this and still keep less than 1% of memory free. Mostly I had to do this on RHEL6 systems and have not had to do it for a long time (because Intel NIC drivers don't allocate contiguous physical memory for jumbo frames anymore?).
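To sanity-check the "double the default and stay under 1%" rule of thumb on a given host, something like the following Python sketch could be used. It only does the arithmetic and changes nothing; the function names and the example numbers in the comments are illustrative assumptions, not values from this thread.

```python
def reserve_percent(min_free_kb, mem_total_kb):
    """min_free_kbytes expressed as a percentage of total RAM."""
    return 100.0 * min_free_kb / mem_total_kb


def doubled_reserve(min_free_kb, mem_total_kb, limit_percent=1.0):
    """Return (proposed kB, percent of RAM, True if under limit_percent)."""
    proposed_kb = 2 * min_free_kb
    pct = reserve_percent(proposed_kb, mem_total_kb)
    return proposed_kb, pct, pct < limit_percent


def read_first_int(path):
    """Read the first integer found in a file such as a /proc entry."""
    with open(path) as f:
        for token in f.read().split():
            if token.isdigit():
                return int(token)
    raise ValueError(f"no integer found in {path}")


if __name__ == "__main__":
    try:
        current_kb = read_first_int("/proc/sys/vm/min_free_kbytes")
        # MemTotal is the first line of /proc/meminfo, reported in kB.
        total_kb = read_first_int("/proc/meminfo")
        proposed, pct, ok = doubled_reserve(current_kb, total_kb)
        print(f"current {current_kb} kB -> proposed {proposed} kB "
              f"({pct:.3f}% of RAM, under 1%: {ok})")
    except (OSError, ValueError):
        print("could not read /proc values")
```

If the proposed value looks reasonable, it can be applied with sysctl (vm.min_free_kbytes) and made persistent through the usual sysctl configuration.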
Since Ceph (and other distributed storage systems) distributes load evenly across OSDs in a mostly static way, dropping cache on any system will significantly lower system throughput by increasing latency on the affected OSDs. I/O requests will tend to queue up on the "slow" OSDs, increasing latency further, until the cache recovers. Increasing free memory as described above might be better than lobotomizing the cache with a cache-drop command.
Just curious whether this applies here: I recently found out about a behavior where there is a "learn" cycle in which the writeback cache is temporarily shut off while a storage controller determines its backup battery status. A spot check indicated a correlation with higher Ceph OSD latency and device utilization. I suspect this too can alter Ceph cluster performance, again by suddenly increasing latency on a set of OSDs. At minimum, the recommendation is to track the status and health of your storage controller's writeback cache, if you use that feature.
On Fri, Oct 13, 2017 at 6:20 PM, Steve Taylor <steve.taylor@xxxxxxxxxxxxxxxx> wrote:
_______________________________________________ Ceph-large mailing list Ceph-large@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com