Re: RGW returning HTTP 500 during resharding

"Anthony D'Atri" <anthony.datri@xxxxxxxxx> schreef op 29 september 2024 02:01:

>> On Sep 28, 2024, at 5:21 PM, Floris Bos <bos@xxxxxxxxxxxxxxxxxx> wrote:
>> 
>> "Anthony D'Atri" <aad@xxxxxxxxxxxxxx> schreef op 28 september 2024 16:24:
>> No retries.
>> Is it expected that resharding can take so long?
>> (in a setup with all NVMe drives)
>>> Which drive SKU(s)? How full are they? Is their firmware up to date? How many RGWs? Have you tuned
>>> your server network stack? Disabled Nagle? How many bucket OSDs? How many index OSDs? How many PGs
>>> in the bucket and index pools? How many buckets? Do you have like 200M objects per? Do you have the
>>> default max objects/shard setting?
>>> 
>>> Tiny objects are the devil of many object systems. I can think of cases where the above questions
>>> could affect this case. I think you resharding in advance might help.
>> 
>> - Drives advertise themselves as “Dell Ent NVMe v2 AGN MU U.2 6.4TB” (think that is Samsung under
>> the Dell sticker)
> 
> AGN or AG is Dell for “Agnostic”, i.e. whatever the cheapest is they have on the shelf. Samsung
> indeed is one of the potentials. `smartctl -a` should show you what it actually is.

Smartctl only gave “Dell”.
But I see iDRAC is more telling and says 90% Samsung, 10% SK hynix (*sigh*).

>> newest 2.5.0 firmware.
> 
> Verified by a DSU run?

It was updated through iDRAC.
I cannot install DSU on the OS itself.
 
>> They are pretty empty. Although there is some 10% capacity being used by other stuff (RBD images)
>> 
>> - Single bucket. My import application already errored out after only 72 M objects/476 GiB of data,
>> and need a lot more. Objects are between 0 bytes and 1 MB, 7 KB average.
> 
> Only 72M? That’s a rather sizable bucket. Were there existing objects as well? Do you have the
> ability to spread across multiple buckets? That would decrease your need to reshard. As I interpret
> the docs, 199M is the default max number of objects above which auto-resharding won’t happen.
> 
> Since you *know* that you will be pulling in extreme numbers of objects, consider pre-sharding the
> bucket while it’s empty. That will be dramatically faster in every way.

No existing data.
And yes, I have now manually set the bucket to 10069 shards.
So it should not happen again, since that is above the rgw_max_dynamic_shards default of 1999.

It still feels a bit wrong to me to have to set this manually, though.
I am not against having to tune applications for performance gains, but I think it is unfortunate that one seems to have to do so just to prevent the “500 Internal Server Error” responses that resharding can effectively cause.
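
For reference, the pre-sharding itself is just a one-off command before the import starts, something along the lines of the following (the bucket name here is made up):

radosgw-admin bucket reshard --bucket=import-bucket --num-shards=10069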


>> - I cannot touch TCP socket options settings in my Java application.
> 
> Your RGW daemons are running on a Java application, not a Linux system????

Sorry, I thought you were asking whether my (Java-based) import-existing-stuff-to-S3 program disables Nagle in its communication with the RGW.
Ceph is at its default settings, which means Nagle is disabled (ms_tcp_nodelay = true).


> Those numbers are nearly useless without context: the rest of the info I requested. There was a
> reason for everything on the list. Folks contribute to the list out of the goodness of their
> hearts, and aren’t paid for back-and-forth tooth-pulling. If your index pool and bucket pool share
> say 3x HDDs or 3x coarse-IU QLC, then don’t expect much.
> 
> Sounds like you have the pg autoscaler enabled, which probably doesn’t help. Your index pool almost
> certainly needs more PGs. Probably log as well, or set the rgw log levels to 0.

The reason I am a bit vague about the number of OSDs and such is that the numbers are not set in stone yet.
I am currently using 40 in total, but may or may not be able to “steal” more gear from another project if the need arises. So I am fishing for advice more in the form of “if you have >X OSDs, tune this and that” than advice specific to my exact current situation. ;-)
I am not doing anything special placement-wise at the moment (I have not configured anything that would make the index and data go to different OSDs).

Also note that my use case is a bit non-standard in other ways.
While I have a lot of objects, the application itself is internal to an organization and does not have many concurrent users.
The performance demand is lower than what one would need for, say, a public-facing web application with lots of visitors and traffic peaks.
So my main concern right now is having it run without internal server errors (the resharding issue), and fast enough that my initial import completes within a couple of weeks…

I do have control over the import application, but I cannot change the application that will be using the bucket afterwards, so I am stuck with their design choice of tiny objects and a single bucket.

I will play a bit with PGs and log levels.
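
In case it helps anyone else reading along, the kind of knobs I mean are along these lines (assuming the default index pool name, and the PG count here is just a placeholder, not a recommendation):

ceph osd pool set default.rgw.buckets.index pg_autoscale_mode off
ceph osd pool set default.rgw.buckets.index pg_num 128
ceph config set client.rgw debug_rgw 0
ceph config set client.rgw debug_ms 0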

>> - Think the main delay is just Ceph wanting to make sure everything is sync’ed to storage before
>> reporting success. So that is why I am making a lot of concurrent connections to perform multiple
>> PUT requests simultaneously. But even with 250 connections, it only does around 5000 objects per
>> second according to the “object ingress/egress” Grafana graph. Can probably raise it some more…
> 
> With one RGW you aren’t going to get far, unless you have a 500 core CPU, and probably not even
> than.


I wonder whether that also applies to tiny objects,
as I have the impression I am slowed down by RGW/Ceph waiting for individual objects to be synced to storage, rather than by it having much to compute for a tiny object.

I am experimenting a bit with the settings at the moment.
If I raise the number of concurrent connections from 250 to 1024 against a single RGW, it does raise the number of ingress objects per second from 5k to 9k,
which means the RGW was not quite saturated before.
And the server the RGW is on is only using 14% CPU (1-minute average, taken over all CPU cores).

If I create a DNS entry that includes all 4 RGW IPs and tell the client to connect to that with 1024 connections, I see in netstat that it indeed spreads the connections over the 4 RGWs, but I get the same 9k objects/sec.
No improvement at all from that, at least not when it is just my import program running with that number of connections...

(I also noticed that the AWS SDK is not really written with small objects in mind.
By default it only sets up a mere 25 connections.
You have to lie to the SDK that you have a 400 Gbit connection by calling .targetThroughputInGbps(400) to persuade it to set up 1024 connections,
even though those connections only use roughly 400 Mbit of client bandwidth in practice due to the tiny objects, plus a similar amount for Ceph’s internal traffic.
Nowhere near the 2x25 Gbit SFP28 the servers have.)
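
For what it is worth, the relevant client setup in my importer boils down to roughly the following sketch (AWS SDK for Java 2.x with the CRT-based async client; the endpoint, region, credentials, bucket and key names are placeholders):

import java.net.URI;

import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class RgwImportSketch {
    public static void main(String[] args) {
        // Placeholder round-robin DNS name that resolves to all four RGW IPs.
        URI endpoint = URI.create("http://rgw.example.internal:8080");

        S3AsyncClient s3 = S3AsyncClient.crtBuilder()
                .endpointOverride(endpoint)
                .region(Region.US_EAST_1)   // ignored by RGW, but required by the SDK
                .credentialsProvider(StaticCredentialsProvider.create(
                        AwsBasicCredentials.create("ACCESS_KEY", "SECRET_KEY")))
                .forcePathStyle(true)       // path-style addressing towards RGW
                // The CRT client sizes its connection pool from the advertised
                // throughput, hence the absurd value; maxConcurrency caps it.
                .targetThroughputInGbps(400.0)
                .maxConcurrency(1024)
                .build();

        // One tiny PUT as an example; the real importer fires many of these
        // concurrently and joins the returned CompletableFutures in batches.
        PutObjectRequest req = PutObjectRequest.builder()
                .bucket("import-bucket")
                .key("example/object-0000001")
                .build();
        s3.putObject(req, AsyncRequestBody.fromBytes(new byte[7 * 1024])).join();

        s3.close();
    }
}

The real importer just submits a lot of these putObject calls concurrently and waits on the returned futures in batches.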


Yours sincerely,

Floris Bos
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



