"Anthony D'Atri" <aad@xxxxxxxxxxxxxx> wrote on 28 September 2024 16:24:

>> No retries.
>> Is it expected that resharding can take so long?
>> (in a setup with all NVMe drives)
>
> Which drive SKU(s)? How full are they? Is their firmware up to date? How many RGWs? Have you tuned
> your server network stack? Disabled Nagle? How many bucket OSDs? How many index OSDs? How many PGs
> in the bucket and index pools? How many buckets? Do you have like 200M objects per? Do you have the
> default max objects/shard setting?
>
> Tiny objects are the devil of many object systems. I can think of cases where the above questions
> could affect this case. I think resharding in advance might help.

- The drives advertise themselves as "Dell Ent NVMe v2 AGN MU U.2 6.4TB" (I think that is Samsung under the Dell sticker) and run the newest 2.5.0 firmware. They are fairly empty, although about 10% of the capacity is used by other things (RBD images).

- Single bucket. My import application already errored out after only 72 M objects/476 GiB of data, and I need a lot more. Objects are between 0 bytes and 1 MB, 7 KB on average.

- Currently using only 1 RGW during my test run to simplify looking at logs, although I have 4.

- I cannot touch TCP socket options in my Java application. When you build an S3AsyncClient with the Java AWS SDK using .crtBuilder(), the SDK outsources the communication to the AWS aws-c-s3/aws-c-http/aws-io CRT libraries written in C, and I never get to see the raw socket in Java. Looking at the source, I don't think Amazon disables the Nagle algorithm in their code; at least I don't see TCP_NODELAY or a similar option being set at the place they configure socket options:
https://github.com/awslabs/aws-c-io/blob/c345d77274db83c0c2e30331814093e7c84c45e2/source/posix/socket.c#L1216

- Did not tune any network settings. It is pretty quiet on the network side, nowhere near saturating bandwidth, because the objects are so small.
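For comparison, here is a minimal sketch of what disabling Nagle looks like on a plain java.net.Socket, where the application does hold the raw socket. The class and helper names are hypothetical, purely for illustration; the point is that the CRT-based S3 client offers no equivalent hook, since the socket lives inside the C libraries.

```java
import java.io.IOException;
import java.net.Socket;

// Hypothetical helper showing the plain-Java way to disable Nagle's
// algorithm. With the CRT-based AWS SDK client there is no place to
// call this, because the raw socket is never exposed to Java code.
public class NoDelayDemo {
    public static void disableNagle(Socket socket) throws IOException {
        // TCP_NODELAY=true turns Nagle's algorithm off: small writes
        // are sent immediately instead of being coalesced.
        socket.setTcpNoDelay(true);
    }

    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket()) { // unconnected socket
            disableNagle(socket);
            System.out.println("TCP_NODELAY = " + socket.getTcpNoDelay());
        }
    }
}
```

With many small PUTs, leaving Nagle enabled can add per-request latency as the kernel coalesces small writes, which may matter more than bandwidth here.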
- Did not really tune anything else yet either; it is pretty much a default cephadm setup for now. I see it (automagically) allocated 1024 PGs for the .data pool and 32 for the .index pool.

- I think the main delay is simply Ceph wanting to make sure everything is synced to storage before reporting success. That is why I am making a lot of concurrent connections to perform multiple PUT requests simultaneously. But even with 250 connections, it only does around 5000 objects per second according to the "object ingress/egress" Grafana graph. I can probably raise it some more...

- I had the default max objects per shard setting for dynamic resharding, but have now manually resharded to 10069 shards, and will have a go to see if it works better now.

Yours sincerely,

Floris Bos
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx