Re: RGW returning HTTP 500 during resharding

> On Sep 28, 2024, at 5:21 PM, Floris Bos <bos@xxxxxxxxxxxxxxxxxx> wrote:
> 
> "Anthony D'Atri" <aad@xxxxxxxxxxxxxx> schreef op 28 september 2024 16:24:
>>> No retries.
>>> Is it expected that resharding can take so long?
>>> (in a setup with all NVMe drives)
>> 
>> Which drive SKU(s)? How full are they? Is their firmware up to date? How many RGWs? Have you tuned
>> your server network stack? Disabled Nagle? How many bucket OSDs? How many index OSDs? How many PGs
>> in the bucket and index pools? How many buckets? Do you have like 200M objects per? Do you have the
>> default max objects/shard setting?
>> 
>> Tiny objects are the devil of many object systems. I can think of cases where the above questions
>> could affect this case. I think resharding in advance might help.
> 
> - Drives advertise themselves as “Dell Ent NVMe v2 AGN MU U.2 6.4TB” (think that is Samsung under the Dell sticker)

AGN or AG is Dell for “Agnostic”, i.e. whichever drive happens to be cheapest on the shelf at the time.  Samsung is indeed one of the possibilities.  `smartctl -a` should show you what it actually is.
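
Something like the below, assuming smartmontools is installed (the device path is just an example; substitute your own namespaces):

   smartctl -a /dev/nvme0n1 | grep -E 'Model Number|Firmware Version'

That shows both the real manufacturer/model and the firmware revision that is actually loaded.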

> newest 2.5.0 firmware.

Verified by a DSU run?

>  They are pretty empty. Although there is some 10% capacity being used by other stuff (RBD images)
> 
> - Single bucket. My import application already errored out after only 72 M objects/476 GiB of data, and need a lot more. Objects are between 0 bytes and 1 MB, 7 KB average.

Only 72M?  That’s a rather sizable bucket.  Were there existing objects as well?  Do you have the ability to spread across multiple buckets?  That would decrease your need to reshard.  As I interpret the docs, 199M is the default max number of objects above which auto-resharding won’t happen.
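
To confirm what your cluster is actually using (these are the option names in recent releases; the defaults of 100k objects per shard and 1999 max dynamic shards are where that ~199M figure comes from):

   ceph config get client.rgw rgw_max_objs_per_shard
   ceph config get client.rgw rgw_max_dynamic_shards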

Since you *know* that you will be pulling in extreme numbers of objects, consider pre-sharding the bucket while it’s empty.  That will be dramatically faster in every way.
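
Roughly (the bucket name below is only a placeholder):

   radosgw-admin bucket reshard --bucket=import-bucket --num-shards=10069

Resharding an empty bucket finishes in seconds; resharding one with tens of millions of index entries to migrate is what you are waiting on now.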

> - Currently using only 1 RGW during my test run to simplify looking at logs, although I have 4.

That’s a lot of ingest even for 4.  I would not be surprised if you’re saturating your connection limit or the Linux networking stack.

> - I cannot touch TCP socket options settings in my Java application.

Your RGW daemons are running on a Java application, not a Linux system????

> When you build a S3AsyncClient with the Java AWS SDK using the .crtBuilder(), the SDK outsources the communication to the AWS aws-c-s3/aws-c-http/aws-io CRT libraries written in C, and I never get to see the raw socket in Java.
> Looking at the source I don’t think Amazon is disabling the nagle algorithm in their code.

On the server(s).  You’re unhappy with the *server* performance, no?  RGW can configure the frontend options to disable Nagle; search the archives for an article where doing so significantly improved small-object latency.
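
If memory serves it’s a Beast frontend option, along the lines of (the port is just an example; keep whatever your spec already has):

   ceph config set client.rgw rgw_frontends "beast port=8080 tcp_nodelay=1"

and then restart the RGW daemons.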

> At least I don’t see TCP_NODELAY or similar options being used at the place they seem to set the socket options:
> https://github.com/awslabs/aws-c-io/blob/c345d77274db83c0c2e30331814093e7c84c45e2/source/posix/socket.c#L1216
> 
> - Did not tune any network settings, and it is pretty quiet on the network side, nowhere near saturating bandwidth because objects are so small.

There’s more to life than bandwidth: somaxconn, nf_conntrack tables filling up, buffers filling, etc.
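
Quick things to eyeball on the RGW hosts (stock kernel knobs; whether conntrack is loaded at all depends on your firewall setup):

   sysctl net.core.somaxconn
   cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max
   ss -s

If conntrack is anywhere near its max, or `ss` shows piles of half-open or TIME-WAIT connections, there’s your smoking gun.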

> - Did not really tune anything else either yet. Pretty much a default cephadm setup for now.
> 
> - See it (automagically) allocated 1024 PGs for .data and 32 for .index.

Those numbers are nearly useless without context, i.e. the rest of the info I requested.  There was a reason for every question I asked.  Folks contribute to the list out of the goodness of their hearts, and aren’t paid for back-and-forth tooth-pulling.  If your index pool and bucket pool share, say, 3x HDDs or 3x coarse-IU QLC, then don’t expect much.


Sounds like you have the PG autoscaler enabled, which probably doesn’t help.  Your index pool almost certainly needs more PGs.  The log pool probably does as well, or else set the RGW log levels to 0.
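
Assuming the stock pool names of a default zone (yours may differ; check `ceph osd pool ls`), something along these lines:

   ceph osd pool set default.rgw.buckets.index pg_autoscale_mode off
   ceph osd pool set default.rgw.buckets.index pg_num 128
   ceph config set client.rgw debug_rgw 0/0
   ceph config set client.rgw debug_ms 0/0

The pg_num is only a starting point; size it against how many NVMe OSDs actually back the index pool.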

> 
> - Think the main delay is just Ceph wanting to make sure everything is sync’ed to storage before reporting success. So that is why I am making a lot of concurrent connections to perform multiple PUT requests simultaneously. But even with 250 connections, it only does around 5000 objects per second according to the “object ingress/egress” Grafana graph. Can probably raise it some more…

With one RGW you aren’t going to get far, unless you have a 500-core CPU, and probably not even then.

> 
> 
> Had the default max. objects per shard settings for the dynamic sharding.
> But have now manually resharded to 10069 shards, and will have a go to see if it works better now.
> 
> 
> Yours sincerely,
> 
> Floris Bos
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



