Re: EXT: Re: EXT: Re: Is this thing still on? Memory pressure/fragmentation

Aaron Bassett <Aaron.Bassett@xxxxxxxxxxxxx> · Mon, 16 Oct 2017 15:53:23 +0000

What is the nature of the troublesome workload? Small vs large objects, amount of concurrency, etc?

On Oct 16, 2017, at 10:17 AM, Warren Wang <Warren.Wang@xxxxxxxxxxx> wrote:

I think for most people, this is likely okay, but if the cluster is really busy and has a lot of objects in it, that’s when these problems start cropping up. Maybe the kernel or tcmalloc version has something to do with it too?

We’re at about 300 million objects in this cluster now, and it’s common for us to get traffic (according to Ceph) over 10GBps. RGW only. This cluster does nothing but RGW. Honestly, I think the cluster would likely be far more stable if it weren’t for one particular workload, except that is our most important workload, and also happens to consume the majority of our capacity. As a result of the workloads, we seem to have also uncovered a bunch of crazy bugs in RGW in Jewel.

Warren Wang

Walmart ✻

From: Ceph-large <ceph-large-bounces@xxxxxxxxxxxxxx>
 on behalf of Aaron Bassett <Aaron.Bassett@xxxxxxxxxxxxx>

Date: Monday, October 16, 2017 at 9:20 AM

To: Kjetil Joergensen <kjetil@xxxxxxxxxxxx>

Cc: "ceph-large@xxxxxxxxxxxxxx" <ceph-large@xxxxxxxxxxxxxx>

Subject: EXT: Re: [Ceph-large] EXT: Re: Is this thing still on? Memory pressure/fragmentation

Wow I feel like we got off easy. We're on our second generation of hardware running Jewel:

32 Nodes

35 8TB spinners w/collocated journals

128GB Mem

Dual E5-2620

Dual 10G nics w/jumbo frames

14.04 on 4.2 kernel

5 RGWs - these are the only clients

This was designed as "warm" object storage with performance secondary to price, hence the lack of flash for journals. Still we've gotten perfectly adequate performance and had nearly 0 problems with it. Jumbo frames, turning off swap and upping some ulimits are pretty much the only tuning we've had to do. We've gone through recovery from removing 2 nodes with no problems.

Aaron 

On Oct 13, 2017, at 8:16 PM, Kjetil Joergensen <kjetil@xxxxxxxxxxxx> wrote:

Hi, 

when you say "XFS crashes", is it actual crashes, or is it "soft" allocation failures?

We were "experimenting" with dense nodes, as in 31 x 8TB spinning rust, NVMe journal, 256GB RAM, 40 GigE, and ended up on the wrong end of how linux-xfs reclaims memory.

Our symptoms were packet drops (I forget which direction) and latencies jumping up into the second range when packets weren't dropped, which caused OSDs to be marked as non-responsive by the cluster, and sometimes this spiraled out of hand. Essentially, the XFS reclaim was something along the lines of: for each XFS filesystem, try to reclaim without the "big fat lock"; if that fails, grab the "big fat lock" and reclaim. The drives were hideously slow, sadness ensued. In the end, we ended up systemtap'ing the xfs driver into not grabbing the big fat lock, hoping that it could do reclaim on one of the other XFS filesystems without the "big fat lock". It's so far worked really well for us, and our current plan is to leave the ugly hack in place until we feel bluestore is the way to go.

A more detailed and better writeup: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-February/016094.html

-KJ

On Fri, Oct 13, 2017 at 3:53 PM, Steve Taylor <steve.taylor@xxxxxxxxxxxxxxxx> wrote:

Those settings seem pretty reasonable. I don't think your XFS crash is the same one we were seeing. It's been quite a while, but the call stack doesn't appear to be the same.

We don't see kswapd chewing up CPUs anymore, but we certainly did before changing those kernel settings.

We don't use jumbo frames. We also don't use RGW. Our use case is strictly RBDs.


Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799

If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that
 any dissemination or copying of this message is prohibited.

From: Warren Wang

Sent: Friday, October 13, 4:31 PM

Subject: Re: EXT: Re: [Ceph-large] Is this thing still on? Memory pressure/fragmentation

To: Steve Taylor, ceph-large@xxxxxxxxxxxxxx

I forgot to mention, we have not seen this behavior in our Hammer clusters. Those seem pretty stable. RGW had a major rewrite in Jewel.

Warren Wang

Strati Cloud Storage

M: 703-598-1643

Walmart ✻

From: Steve Taylor <steve.taylor@xxxxxxxxxxxxxxxx>

Date: Friday, October 13, 2017 at 6:20 PM

To: "ceph-large@xxxxxxxxxxxxxx"
 <ceph-large@xxxxxxxxxxxxxx>, Warren Wang <Warren.Wang@xxxxxxxxxxx>

Subject: EXT: Re: [Ceph-large] Is this thing still on? Memory pressure/fragmentation

We have several Hammer clusters (0.94.9) that have 1,400-1,500 OSDs and have been running for months or years depending on the cluster. We run either 24 3TB or 32 4TB OSDs per host with 192GB of memory.

We have seen similar memory issues in the past, but we have been able to mitigate them by mostly avoiding swapping (vm.swappiness=0), keeping some free memory around (vm.min_free_kbytes>0), and by adjusting vfs_cache_pressure to keep the Linux page cache from using too much memory.
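A minimal sketch of how the three settings above could be checked on a node. It assumes the usual /proc/sys/vm layout on Linux; the function names and message wording are made up for illustration. The thread gives no concrete value for vfs_cache_pressure, so the sketch only reports it.

```python
from pathlib import Path

# Settings named in the message above.
VM_KEYS = ("swappiness", "min_free_kbytes", "vfs_cache_pressure")


def read_vm_settings(root="/proc/sys/vm"):
    """Read current values as integers; unreadable entries are skipped."""
    values = {}
    for key in VM_KEYS:
        try:
            values[key] = int((Path(root) / key).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            continue
    return values


def review_vm_settings(values):
    """Return notes comparing the values with the advice in the thread."""
    notes = []
    if values.get("swappiness", 0) != 0:
        notes.append("vm.swappiness=%d (thread suggests 0)" % values["swappiness"])
    if "min_free_kbytes" in values and values["min_free_kbytes"] <= 0:
        notes.append("vm.min_free_kbytes is not positive")
    if "vfs_cache_pressure" in values:
        # No target value is given in the thread, so this is report-only.
        notes.append(
            "vm.vfs_cache_pressure=%d (no target given in thread)"
            % values["vfs_cache_pressure"]
        )
    return notes


if __name__ == "__main__":
    for note in review_vm_settings(read_vm_settings()):
        print(note)
```

Run on an OSD host, this prints one line per setting that differs from the advice, plus the current vfs_cache_pressure for reference.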

Around the same time we made these changes we switched from Ubuntu 14.04's default 3.13 kernel to 3.16 to pick up an XFS fix that was causing frequent crashes. That may or may not be related to yours.

Hope this is helpful.


Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799


From: Ceph-large <ceph-large-bounces@xxxxxxxxxxxxxx>
 on behalf of Warren Wang <Warren.Wang@xxxxxxxxxxx>

Sent: Friday, October 13, 2017 2:49:16 PM

To: ceph-large@xxxxxxxxxxxxxx

Subject: [Ceph-large] Is this thing still on? Memory pressure/fragmentation

Hi folks, I’m not even sure this list is still active.

Curious to hear from anyone else out in the community. Does anyone out there have large clusters that have been running for a long time (100+ days), and have successfully done significant amounts of recovery? We are seeing all sorts of memory pressure problems. I’ll try to keep this short. I didn’t send it to the normal users list because I keep getting punted, since our corporate mail server apparently doesn’t like the incoming volume.

37 nodes in our busiest dedicated object storage cluster (we have lots of clusters…)

15 8TB drives

2x 1.6TB NVMe for journal + LVM cache for spinning rust

128GB DDR4 (regretfully small)

2x E5-2650 pinned at max freq, cstate 0

2x 25GbE (1 public, 1 cluster, cluster NIC set with jumbo frames on)

Ubuntu 16.04, 4.4.0-78 and 4.4.0-96

RHCS Ceph (we do have cases open w/ RH, wanted to hear from the other users out there)

Over time, we start seeing symptoms of high memory pressure, such as:

- Kswapd churning

- Dropped tx packets (almost always heartbeats, causing “wrongly marked down” alerts, don’t mask this!)

- XFS crashes (unsure this is related)

- RGW oddities, like stale index entries and false 5xx responses (unsure this is related)
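One way to watch for the dropped tx packets listed above is to read the per-interface counters the kernel exposes in /proc/net/dev. This is a rough sketch, not something taken from the thread. It assumes the standard column layout, where the transmit "drop" counter is the 12th value after the interface name, and the function name is hypothetical.

```python
def parse_tx_drops(text):
    """Return {interface: tx_drop_count} from /proc/net/dev style text."""
    drops = {}
    # The first two lines are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, stats = line.split(":", 1)
        fields = stats.split()
        if len(fields) < 16:
            continue
        # 8 receive columns come first; transmit columns are
        # bytes, packets, errs, drop, ... so tx drop is index 11.
        drops[name.strip()] = int(fields[11])
    return drops


if __name__ == "__main__":
    try:
        with open("/proc/net/dev") as handle:
            for iface, count in sorted(parse_tx_drops(handle.read()).items()):
                print(iface, count)
    except OSError:
        print("/proc/net/dev is not available on this system")
```

The counters are cumulative since boot, so sampling twice and comparing the difference tells you whether drops are happening now.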

Our normal traffic is measured in GBps, not Mbps :) Anything under 2GBps is considered a slow day. We have figured out a few things along the way. Don’t drop cache while OSDs are running. This triggers the XFS crash pretty quickly. /proc/buddyinfo is a good indicator of memory problems. We see a lack of 8K and larger pages, which will cause problems for the jumbo frame config. Asked our high traffic generators to back off during recovery.
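As a sketch of the /proc/buddyinfo check described above: each row lists, per zone, the number of free blocks at order 0, 1, 2 and so on, where an order-n block is 2^n contiguous pages. Assuming a 4 KiB page size (typical on x86-64, not stated in the thread), "8K and larger" means order 1 and up. The function names here are made up for illustration.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block count per order]}."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0, zone Normal c0 c1 c2 ..."
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(count) for count in parts[4:]]
    return zones


def free_blocks_at_or_above(counts, min_order=1):
    """Count free blocks of at least 2**min_order contiguous pages."""
    return sum(counts[min_order:])


if __name__ == "__main__":
    try:
        with open("/proc/buddyinfo") as handle:
            info = parse_buddyinfo(handle.read())
    except OSError:
        info = {}
    for (node, zone), counts in sorted(info.items()):
        # min_order=1 corresponds to 8K blocks when pages are 4 KiB.
        print(node, zone, free_blocks_at_or_above(counts, 1))
```

A zone whose higher-order counts sit at or near zero while order 0 stays large is fragmented: plenty of free memory, but little of it contiguous.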

Our future machines will have 256GB, but even still, the memory will eventually get fragmented with enough use. I know this completely changes with bluestore, since we wouldn’t have page cache, or normal slab info to handle in memory, but I think bluestore at our scale in prod is likely quite a way off.

Warren Wang

Walmart ✻

_______________________________________________

Ceph-large mailing list

Ceph-large@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com

-- 

Kjetil Joergensen <kjetil@xxxxxxxxxxxx>

SRE, Medallia Inc


