Re: EXT: Re: Is this thing still on? Memory pressure/fragmentation

Warren Wang <Warren.Wang@xxxxxxxxxxx> · Fri, 13 Oct 2017 22:30:50 +0000

I forgot to mention, we have not seen this behavior in our Hammer clusters. Those seem pretty stable. RGW had a major rewrite in Jewel.

Warren Wang
Strati Cloud Storage

M: 703-598-1643

Walmart ✻

From: Steve Taylor <steve.taylor@xxxxxxxxxxxxxxxx>

Date: Friday, October 13, 2017 at 6:20 PM

To: "ceph-large@xxxxxxxxxxxxxx" <ceph-large@xxxxxxxxxxxxxx>, Warren Wang <Warren.Wang@xxxxxxxxxxx>

Subject: EXT: Re: [Ceph-large] Is this thing still on? Memory pressure/fragmentation

We have several Hammer clusters (0.94.9) that have 1,400-1,500 OSDs and have been running for months or years depending on the
 cluster. We run either 24 3TB or 32 4TB OSDs per host with 192GB of memory.

We have seen similar memory issues in the past, but we have been able to mitigate them by mostly avoiding swapping (vm.swappiness=0),
 keeping some free memory around (vm.min_free_kbytes>0), and by adjusting vfs_cache_pressure to keep the Linux page cache from using too much memory.
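To make the three settings above easier to audit across hosts, here is a minimal sketch that reads them back from /proc/sys/vm. The helper names (`read_vm_knobs`, `review`) are hypothetical. The only concrete targets it checks are the two stated above (swappiness of 0, min_free_kbytes above 0); no value for vfs_cache_pressure was given, so it is only reported.

```python
from pathlib import Path

# The three knobs named in the message above.
VM_KNOBS = ("swappiness", "min_free_kbytes", "vfs_cache_pressure")


def read_vm_knobs(root="/proc/sys/vm"):
    """Return {knob: int value}; a knob that cannot be read maps to None."""
    values = {}
    for name in VM_KNOBS:
        try:
            values[name] = int((Path(root) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


def review(values):
    """Return notes for knobs that differ from what the message describes."""
    notes = []
    swappiness = values.get("swappiness")
    if swappiness is not None and swappiness != 0:
        notes.append(f"vm.swappiness is {swappiness}; the thread sets it to 0")
    min_free = values.get("min_free_kbytes")
    if min_free is not None and min_free <= 0:
        notes.append("vm.min_free_kbytes is 0; the thread keeps it above 0")
    # vfs_cache_pressure: no target value was given, so nothing is checked.
    return notes
```

Calling `review(read_vm_knobs())` on an OSD host returns an empty list when both stated targets are met. The `root` parameter exists only so the reader can be pointed at a copy of the directory for testing.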

Around the same time we made these changes, we switched from Ubuntu 14.04's default 3.13 kernel to 3.16 to pick up a fix for an XFS bug
 that was causing frequent crashes. That bug may or may not be related to the XFS crashes you are seeing.

Hope this is helpful.

Steve Taylor | Senior Software Engineer |
StorageCraft Technology Corporation

380 Data Drive Suite 300 | Draper | Utah | 84020

Office: 801.871.2799 | 

If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or
 copying of this message is prohibited.

From: Ceph-large <ceph-large-bounces@xxxxxxxxxxxxxx> on behalf of Warren Wang <Warren.Wang@xxxxxxxxxxx>

Sent: Friday, October 13, 2017 2:49:16 PM

To: ceph-large@xxxxxxxxxxxxxx

Subject: [Ceph-large] Is this thing still on? Memory pressure/fragmentation

Hi folks, I’m not even sure this list is still active.

Curious to hear from anyone else out in the community. Does anyone out there have large clusters that have been running for a long time (100+ days) and have successfully done significant amounts
 of recovery? We are seeing all sorts of memory pressure problems. I’ll try to keep this short. I didn’t send it to the normal users list because I keep getting punted, since our corporate mail server apparently doesn’t like the incoming volume.

37 nodes in our busiest dedicated object storage cluster (we have lots of clusters…)
15 8TB drives
2x 1.6TB NVMe for journal + LVM cache for spinning rust
128GB DDR4 (regretfully small)
2x E5-2650 pinned at max freq, cstate 0
2x 25GbE (1 public, 1 cluster, cluster NIC set with jumbo frames on)
Ubuntu 16.04, 4.4.0-78 and 4.4.0-96
RHCS Ceph (we do have cases open w/ RH, wanted to hear from the other users out there)

Over time, we start seeing symptoms of high memory pressure, such as:
- kswapd churning
- Dropped tx packets (almost always heartbeats, causing “wrongly marked down” alerts, don’t mask this!)
- XFS crashes (unsure this is related)
- RGW oddities, like stale index entries and false 5xx responses (unsure this is related)
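One way to watch for the dropped tx packets mentioned above is to sample the per-interface counters in /proc/net/dev and compare two readings. This is only a rough sketch: the function names are hypothetical, and it reports cumulative tx drops per interface without distinguishing heartbeats from other traffic.

```python
def parse_tx_drops(text):
    """Return {interface: cumulative tx drop count} from /proc/net/dev text."""
    drops = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        iface, stats = line.split(":", 1)
        fields = stats.split()
        if len(fields) >= 12:
            # 8 receive columns come first; transmit columns are then
            # bytes, packets, errs, drop, so tx drop is index 11.
            drops[iface.strip()] = int(fields[11])
    return drops


def tx_drop_deltas(before, after):
    """Return {interface: new drops} for interfaces whose counter grew."""
    return {
        iface: after[iface] - before[iface]
        for iface in after
        if iface in before and after[iface] > before[iface]
    }
```

Reading /proc/net/dev twice, some seconds apart, and passing both parsed results to `tx_drop_deltas` shows which interfaces dropped transmit packets in that interval.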

Our normal traffic is measured in GBps, not Mbps :) Anything under 2GBps is considered a slow day. We have figured out a few things along the way:
- Don’t drop caches while OSDs are running. This triggers the XFS crash pretty quickly.
- /proc/buddyinfo is a good indicator of memory problems. We see a lack of free blocks of 8K and larger, which will cause problems for the jumbo frame config.
- We asked our high traffic generators to back off during recovery.
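For anyone who wants to track the /proc/buddyinfo symptom over time, here is a minimal sketch that counts free blocks of a given size or larger per zone. It assumes 4 KiB base pages (typical on x86_64), so column N of each line is the number of free blocks of 4K * 2^N. The function names are hypothetical.

```python
import re

PAGE_SIZE_KB = 4  # assumption: 4 KiB base pages

LINE_RE = re.compile(r"Node\s+(\d+),\s+zone\s+(\S+)\s+(.*)")


def parse_buddyinfo(text):
    """Return {(node, zone): [free block count for order 0, 1, 2, ...]}."""
    zones = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            node, zone, counts = match.groups()
            zones[(int(node), zone)] = [int(c) for c in counts.split()]
    return zones


def free_blocks_at_or_above(counts, min_kb):
    """Count free blocks whose size is at least min_kb kilobytes."""
    return sum(
        n for order, n in enumerate(counts)
        if PAGE_SIZE_KB * (2 ** order) >= min_kb
    )
```

Passing the contents of /proc/buddyinfo to `parse_buddyinfo` and then calling `free_blocks_at_or_above(counts, 8)` for each zone gives the number of free 8K-or-larger blocks. A value near zero in the Normal zone would match the fragmentation described above.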

Our future machines will have 256GB, but even still, the memory will eventually get fragmented with enough use. I know this completely changes with bluestore, since we wouldn’t have page cache, or normal slab
 info to handle in memory, but I think bluestore at our scale in prod is likely quite a way off.

Warren Wang
Walmart ✻

_______________________________________________
Ceph-large mailing list
Ceph-large@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com