Re: RGW returning HTTP 500 during resharding


 



>> 
>> AGN or AG is Dell for “Agnostic”, i.e. whatever the cheapest is they have on the shelf. Samsung
>> indeed is one of the potentials. `smartctl -a` should show you what it actually is.
> 
> Smartctl only gave “Dell”.


That’s weird.  Send me a full `smartctl -a` privately please.  I’m very close to forking smartmontools publicly or at least drivedb.h
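If the drive sits behind a PERC rather than in passthrough mode, you may need to address it explicitly, something like this (device names here are just examples):

    smartctl -a /dev/nvme0n1              # NVMe, direct
    smartctl -a -d megaraid,0 /dev/sda    # SAS/SATA behind a MegaRAID/PERC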



> But see iDRAC is more telling and says 90% Samsung, 10% SkHynix (*sigh*)

iDRAC uses the same interfaces, so that’s weird.  Last year I found Dell shipping Hynix drives that my contacts said had been cancelled.  They wouldn’t give me a SMART reference, so I ended up getting upstream to accept a regex assumption.

> 
>>> newest 2.5.0 firmware.
>> 
>> Verified by a DSU run?
> 
> Was updated through iDRAC.
> Cannot install DSU on the OS itself.

Why not?  I’ve done so thousands of times.  
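FWIW, on EL-family distros the usual bootstrap is roughly this (per Dell’s repo instructions, untested on your exact box):

    curl -s https://linux.dell.com/repo/hardware/dsu/bootstrap.cgi | bash
    dnf install -y dell-system-update
    dsu --inventory    # then run dsu again to apply updates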

> 
>>> They are pretty empty. Although there is some 10% capacity being used by other stuff (RBD images)
>>> 
>>> - Single bucket. My import application already errored out after only 72 M objects/476 GiB of data,
>>> and need a lot more. Objects are between 0 bytes and 1 MB, 7 KB average.
>> 
>> Only 72M? That’s a rather sizable bucket. Were there existing objects as well? Do you have the
>> ability to spread across multiple buckets? That would decrease your need to reshard. As I interpret
>> the docs, 199M is the default max number of objects above which auto-resharding won’t happen.
>> 
>> Since you *know* that you will be pulling in extreme numbers of objects, consider pre-sharding the
>> bucket while it’s empty. That will be dramatically faster in every way.
> 
> No existing data.
> And yes, I set it manually to 10069 shards now.
> So now it should not happen again, since that is above the 1999 rgw_max_dynamic_shards

Cool.  
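For the archives, the manual knob is something along the lines of:

    radosgw-admin bucket reshard --bucket=<bucket> --num-shards=10069

(10069 is the figure from your message; the general guidance is a prime-ish shard count sized at roughly 100k objects per shard.)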

> It still feels a bit wrong to me to have to set this manually though.

Might be a failsafe against fat fingers.  Your bucket is an outlier in my world.  

> I am not against having to tune applications for performance gains, but think it is unfortunate that one seems to have to do so just to prevent the “500 internal server errors” that the resharding can effectively cause.

I’m just speculating with my limited info.  

> 
> 
>>> - I cannot touch TCP socket options settings in my Java application.
>> 
>> Your RGW daemons are running on a Java application, not a Linux system????
> 
> Sorry, thought you were asking if my (Java based) import-existing-stuff-to-S3 program disabled nagle, in its communication with the rgw.
> Ceph has default settings, which is nagle disabled (ms_tcp_nodelay true).
> 
> 
>> Those numbers are nearly useless without context: the rest of the info I requested. There was a
>> reason for everything on the list. Folks contribute to the list out of the goodness of their
>> hearts, and aren’t paid for back-and-forth tooth-pulling. If your index pool and bucket pool share
>> say 3x HDDs or 3x coarse-IU QLC, then don’t expect much.
>> 
>> Sounds like you have the pg autoscaler enabled, which probably doesn’t help. Your index pool almost
>> certainly needs more PGs. Probably log as well, or set the rgw log levels to 0.
> 
> Reason I am a bit vague about the number of OSDs and such is that the numbers are not set in stone yet.

But it can inform what you’re seeing today, especially if your PoC is a gating one.

> Currently using 40 in total

Which drives? Replicated pools?

You aren’t equipped for the workload you’re throwing at it.

> , but may or may not be able to “steal” more gear from another project, if the need arises. So I am rather fishing for advice in the form of “if you have >X OSDs, tune this and that” than specific to my exact current situation. ;-)

That’s the thing, it isn’t one size fits all.  That advice is a function of your hardware choices.  


> Not doing anything special placement wise at the moment (not configured anything that would make the index and data go to different OSDs)

All pools on all OSDs?  3x replication?
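Something like this would answer both, whatever your pool names are:

    ceph osd pool ls detail    # replicated vs EC, size/min_size, crush rule per pool
    ceph osd crush rule ls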

> 
> Also note that my use-case is a bit non-standard in other ways.
> While I have a lot of objects, the application itself is internal within an organization and does not have many concurrent users.

Sounds ripe for tweaking for larger objects, and this kind of workload is more common than you think.  Yours must be like <1 KB?  Tiny objects don’t work well with EC, so you’ll want 3x replication.  Which costs more.

> The performance demand is less than one would need if you have say a public facing web application with lots of visitors and peaks.
> So my main concern right now is having it run without internal server errors (the resharding thing), and fast enough so that my initial import completes within a couple weeks…

Trying to help.  Can’t do that without info.  Network and other tuning likely would help dramatically.  Improving your PoC is the first step toward improving prod.   

> 
> I do have control over the import application, but cannot change the application that will be using the bucket afterwards, so am stuck with their design choice for tiny objects and a single bucket.
> 
> Will play a bit with PGs and loglevel

PGs are a function of your replication and media.  Disable the autoscaler and set pg_num manually so that the PGS column at the right of `ceph osd df` shows something like 300, assuming you have conventional TLC.  The index pool is crucial for you; I’d set its pg_num to something like 256.  You probably have something like 32 now.  The buckets pool is important too.  Put an RGW (or two!) on every node and use nginx, haproxy, or spendy hardware to load balance, exposing a VIP endpoint.  I was trying to advise on this, but can’t without the requested node details.  This is the last time I’ll ask.
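Concretely, assuming the stock pool name default.rgw.buckets.index (substitute yours):

    ceph osd pool set default.rgw.buckets.index pg_autoscale_mode off
    ceph osd pool set default.rgw.buckets.index pg_num 256
    ceph osd df    # watch the PGS column settle around 200-300 per OSD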

> 
>>> - Think the main delay is just Ceph wanting to make sure everything is sync’ed to storage before
>>> reporting success. So that is why I am making a lot of concurrent connections to perform multiple
>>> PUT requests simultaneously. But even with 250 connections, it only does around 5000 objects per
>>> second according to the “object ingress/egress” Grafana graph. Can probably raise it some more…
>> 
>> With one RGW you aren’t going to get far, unless you have a 500 core CPU, and probably not even
>> then.
> 
> 
> I wonder if that also applies to tiny objects.

Especially so.  Metadata and intake cost way more than shoveling bits.  The header handling, metadata, index updates, and persistence ops are the lion’s share of the overall incremental cost, versus background noise when you’re storing pirated TV episodes dubbed into Portuguese.


> As I have the impression I am slowed down by RGW/Ceph waiting for individual objects to be synced to storage, instead of having to compute much for a tiny object.

In part.  Far more to it than throughput.   

> 
> I am experimenting a bit with the settings at the moment.
> If I raise the amount of concurrent connections from 250 to 1024 with a single RGW it does raise the number of ingress objects per second from 5k to 9k.
> Which means that the RGW was not quite saturated before.

Fair enough. Probably doesn’t do your average latency any favors, but you don’t care.  

> And the server the RGW is on, is only using 14% CPU (1 min avg, taken over all CPU cores)

Yep, cores usually aren’t the limitation.  Run `powertop` and see if you’re dipping into C6, which can be more crippling than people often realize.  You will need to load-balance multiple RGWs behind a proxy in prod anyway, so start thinking that way.  RADOS itself often isn’t the bottleneck here.
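A quick look, assuming the usual tools are installed:

    cpupower idle-info                       # which C-states are enabled
    turbostat --quiet sleep 10               # actual C6 residency over 10 s
    tuned-adm profile latency-performance    # one way to cap deep C-states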

> 
> If I create a DNS entry that includes all 4 RGW IPs, and tell it to connect to that with 1024 connections, I see in netstat that it indeed spreads the connections over the 4 RGWs,

Round robin LB. Not for pros but easy to do.  
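If you want a notch better than DNS round robin, a minimal haproxy sketch, with made-up addresses and ports:

    defaults
        mode http
        timeout connect 5s
        timeout client  60s
        timeout server  60s

    frontend rgw_in
        bind *:80
        default_backend rgw_be

    backend rgw_be
        balance leastconn
        server rgw1 10.0.0.11:8080 check
        server rgw2 10.0.0.12:8080 check
        server rgw3 10.0.0.13:8080 check
        server rgw4 10.0.0.14:8080 check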

> but I get the same 9k objects/sec.

Aggregate?

> No improvement at all with that, at least not when it is just my import program running with that number of connections...
> 
> (I also noticed that the AWS SDK is not really written with small objects in mind.
> By default it only sets up a mere 25 connections.
> You have to lie to the SDK that you have a 400 Gbit connection by calling .targetThroughputInGbps(400), to persuade it to setup 1024 connections.

I’m not at all familiar with that SDK.  Are you using the STANDARD storage class?


> Even though those connections only use up roughly 400 Mbit of client bandwidth in practice due to the tiny objects, and a similar amount for Ceph’s internal traffic.
> Nowhere near the 2x25 Gbit SFP28 the servers have).

LACP bonded together, no replication network?  Which xmit hash policy?
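Easy to check, assuming the bond is bond0:

    cat /proc/net/bonding/bond0    # mode, xmit hash policy, per-slave state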

> 
> 
> Yours sincerely,
> 
> Floris Bos
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



