Re: EXT: Re: Is this thing still on? Memory pressure/fragmentation



Following back up. We ended up increasing min_free_kbytes to 6GB and dropping the MTU back down to 1500. For now, we seem to have stopped the transmit packet drops, though there are other bugs we (and RH) are chasing down. It got bad enough last week that even an ifconfig command resulted in a 32KB page allocation failure. This much min_free_kbytes is definitely not ideal, but for now it's helping us get by.


Warren Wang



From: Warren Wang <Warren.Wang@xxxxxxxxxxx>
Date: Monday, October 16, 2017 at 10:10 AM
To: Ben England <bengland@xxxxxxxxxx>, Steve Taylor <steve.taylor@xxxxxxxxxxxxxxxx>
Cc: "ceph-large@xxxxxxxxxxxxxx" <ceph-large@xxxxxxxxxxxxxx>, ceph-perf-scale <ceph-perf-scale@xxxxxxxxxx>
Subject: Re: EXT: Re: [Ceph-large] Is this thing still on? Memory pressure/fragmentation


We're already at 4GB min_free_kbytes. I hesitate to increase it further because we already seem to be short on memory at times; we see commit % over 100% for hours, and I'm not entirely sure what the effect of increasing it would be during those times. I do agree that it might help before we hit critical mass. Our recovery event finally ended after over a week (one host in, one host out), and things are more stable now.


Long term, perhaps the memory recommendations for large, busy clusters should be much higher than they are now, or the memory management will need to get better.


Warren Wang



From: Ben England <bengland@xxxxxxxxxx>
Date: Monday, October 16, 2017 at 9:18 AM
To: Steve Taylor <steve.taylor@xxxxxxxxxxxxxxxx>
Cc: "ceph-large@xxxxxxxxxxxxxx" <ceph-large@xxxxxxxxxxxxxx>, Warren Wang <Warren.Wang@xxxxxxxxxxx>, ceph-perf-scale <ceph-perf-scale@xxxxxxxxxx>
Subject: EXT: Re: [Ceph-large] Is this thing still on? Memory pressure/fragmentation




If you are having memory fragmentation problems with jumbo frames, you could try increasing vm.min_free_kbytes so the system doesn't have to work as hard to find chunks of memory of the right size, and reclaims memory from inactive pages sooner (i.e. before it is needed). Usually you can double the default for this and still have < 1% of memory held free. Mostly I had to do this on RHEL6 systems and have not had to do it for a long time (because Intel NIC drivers don't allocate contiguous physical memory for jumbo frames anymore?).
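To put numbers on the "double the default, stay under 1% of memory" rule of thumb, here is a small sketch. The default and node sizes below are illustrative placeholders, not values from this thread (aside from the 128GB node size and 6GB reserve mentioned elsewhere in it):

```python
# Sanity-check a candidate vm.min_free_kbytes value against total RAM.
# Values here are illustrative; read the real ones from
# /proc/sys/vm/min_free_kbytes and /proc/meminfo on an actual node.

def reserve_percent(min_free_kb: int, mem_total_kb: int) -> float:
    """Return the reserve as a percentage of total memory."""
    return 100.0 * min_free_kb / mem_total_kb

mem_total_kb = 128 * 1024 * 1024   # a 128 GB node, in kB
default_kb = 90112                 # hypothetical autotuned default (~88 MB)
doubled_kb = default_kb * 2

print(f"doubled reserve: {doubled_kb} kB "
      f"({reserve_percent(doubled_kb, mem_total_kb):.3f}% of RAM)")
print(f"6 GB reserve: {reserve_percent(6 * 1024 * 1024, mem_total_kb):.2f}% of RAM")
```

Doubling a typical default stays well under 1% of RAM; the 6GB reserve this thread ended up with is closer to 5% of a 128GB node, which illustrates why it was called "not ideal".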


Since Ceph (and other distributed storage systems) distributes load across OSDs in a mostly static way, dropping cache on any system will significantly lower system throughput by increasing latency on the affected OSDs: I/O requests will tend to queue up on the "slow" OSDs, increasing latency further, until the cache recovers. Increasing free memory as described above is likely better than lobotomizing the cache with a cache-drop command.


I'm curious whether this applies here: I recently found out about a behavior where writeback cache is temporarily shut off during a "learn" cycle while a storage controller determines its backup battery status. A spot check indicated a correlation with higher Ceph OSD latency and device utilization. I suspect this too can alter Ceph cluster performance, again by suddenly increasing latency on a set of OSDs. At minimum, the recommendation is to track the status and health of your storage controller's writeback cache, if you use that feature.





-ben e



On Fri, Oct 13, 2017 at 6:20 PM, Steve Taylor <steve.taylor@xxxxxxxxxxxxxxxx> wrote:

We have several Hammer clusters (0.94.9) that have 1,400-1,500 OSDs and have been running for months or years depending on the cluster. We run either 24 3TB or 32 4TB OSDs per host with 192GB of memory.

We have seen similar memory issues in the past, but we have been able to mitigate them by mostly avoiding swapping (vm.swappiness=0), keeping some free memory around (vm.min_free_kbytes>0), and by adjusting vfs_cache_pressure to keep the Linux page cache from using too much memory.
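As a concrete illustration, those three knobs can be set persistently in a sysctl fragment. The file name and values below are placeholders, not the settings Steve's clusters actually use:

```
# /etc/sysctl.d/90-ceph-memory.conf -- illustrative values only
vm.swappiness = 0              # mostly avoid swapping
vm.min_free_kbytes = 4194304   # keep 4 GB free for high-order allocations
vm.vfs_cache_pressure = 200    # reclaim dentry/inode caches more aggressively
```

Apply with `sysctl --system` (or `sysctl -p <file>`); values above 100 for vfs_cache_pressure make the kernel prefer reclaiming dentry/inode caches over page cache.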

Around the same time we made these changes, we switched from Ubuntu 14.04's default 3.13 kernel to 3.16 to pick up a fix for an XFS bug that was causing frequent crashes. That may or may not be related to yours.

Hope this is helpful.


Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 |



If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.


From: Ceph-large <ceph-large-bounces@xxxxxxxxxxxxxx> on behalf of Warren Wang <Warren.Wang@xxxxxxxxxxx>
Sent: Friday, October 13, 2017 2:49:16 PM
To: ceph-large@xxxxxxxxxxxxxx
Subject: [Ceph-large] Is this thing still on? Memory pressure/fragmentation


Hi folks, I’m not even sure this list is still active.


Curious to hear from anyone else out in the community. Does anyone out there have large clusters that have been running for a long time (100+ days) and have successfully done significant amounts of recovery? We are seeing all sorts of memory pressure problems. I'll try to keep this short. I didn't send it to the normal users list because I keep getting punted; our corporate mail server apparently doesn't like the incoming volume.


37 nodes in our busiest dedicated object storage cluster (we have lots of clusters…)

15 8TB drives

2x 1.6TB NVMe for journal + LVM cache for spinning rust

128GB DDR4 (regretfully small)

2x E5-2650, pinned at max frequency, C-state 0

2x 25GbE (1 public, 1 cluster; jumbo frames enabled on the cluster NIC)

Ubuntu 16.04, 4.4.0-78 and 4.4.0-96

RHCS Ceph ( we do have cases open w/ RH, wanted to hear from the other users out there )


Over time, we start seeing symptoms of high memory pressure, such as:

- Kswapd churning

- Dropped tx packets (almost always heartbeats, causing "wrongly marked down" alerts, don't mask this!)

- XFS crashes (unsure if this is related)

- RGW oddities, like stale index entries and false 5xx responses (unsure if this is related)


Our normal traffic is measured in GBps, not Mbps :) Anything under 2GBps is considered a slow day. We have figured out a few things along the way. Don't drop cache while OSDs are running; this triggers the XFS crash pretty quickly. /proc/buddyinfo is a good indicator of memory problems: we see a lack of 8K and larger pages, which causes problems for the jumbo frame config. We've also asked our high-traffic generators to back off during recovery.
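For anyone following along, each /proc/buddyinfo line lists free-block counts per order for a memory zone, where order n means a 2^n-page (2^n x 4 KB) contiguous block. A 9000-byte jumbo-frame buffer needs an order-2 (16 KB) block, and the 32KB allocation failures mentioned in this thread are order-3. A minimal parsing sketch, using a made-up sample line:

```python
# Parse one line of /proc/buddyinfo and check high-order availability.
# The sample line is invented for illustration; on a real node, read the
# file and apply free_blocks() to each line.

sample = ("Node 0, zone   Normal   4096   2048     12      3"
          "      0      0      0      0      0      0      0")

def free_blocks(line: str) -> list[int]:
    # Counts start after "Node N, zone <name>"; orders run 0..10
    # (4 KB single pages up to 4 MB blocks).
    return [int(x) for x in line.split()[4:]]

counts = free_blocks(sample)
print(f"order-2 (16 KB) free blocks: {counts[2]}")
print(f"order-3 (32 KB) free blocks: {counts[3]}")
if sum(counts[3:]) < 100:
    print("order-3+ memory looks badly fragmented")
```

Plenty of order-0/order-1 pages alongside near-zero order-3+ counts is exactly the fragmentation signature that makes 16 KB/32 KB allocations fail even when "free" memory looks fine.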


Our future machines will have 256GB, but even then, the memory will eventually get fragmented with enough use. I know this completely changes with bluestore, since we wouldn't have page cache or normal slab info to manage in memory, but I think bluestore at our scale in prod is likely quite a way off.


Warren Wang


Ceph-large mailing list


