Mariusz Gronczewski writes:
> listing itself is bugged in version
> I'm running: https://tracker.ceph.com/issues/45955

Ouch! Are your OSDs all running the same version as your RadosGW? The
message looks a bit as if your RadosGW might be a newer version than
the OSDs, so the new optimized bucket listing operation would be
missing the corresponding client<->OSD protocol extensions on the OSD
side.

> But yes, our structure is generally /bucket/prefix/prefix/file so there
> is not many big directories (we're migrating from GFS where that was a
> problem)

>> Paul Emmerich has written about performance issues with large buckets
>> on this list, see
>> https://lists.ceph.io/hyperkitty/list/dev@xxxxxxx/thread/36P62BOOCJBVVJCVUX5F5J7KYCGAAICV/
>>
>> Let's say that there are opportunities for further improvements.
>>
>> You could look for the specific queries that cause the high read load
>> in your system. Maybe there's something that can be done on the
>> client side. This could also provide input for Ceph development as
>> to what kinds of index operations are used by applications "in the
>> wild". Those might be worth optimizing first :-)

> Is there a way to debug which query exactly is causing that ?

What I usually do is grep through the HTTP request logs of the
front-end proxy/load balancer (Nginx in our case) and look for GET
requests on a bucket that have a long duration. It's a bit crude, I
know; a rough sketch of the idea is appended at the end of this
message.

(If someone knows better techniques for this, I'd also be interested!
Maybe something based on Jaeger/OpenTracing, or clever log
correlation?)

> Currently there is a lot of incoming traffic (mostly from aws cli sync)
> as we're migrating data over but that's at most hundreds of requests
> per sec.
>>
>> > running 15.2.3, nothing special in terms of tunning aside from
>> > disabling some logging as to not overflow the logs.
>>
>> > We've had similar test cluster on 12.x (and way slower hardware)
>> > getting similar traffic and haven't observed that magnitude of
>> > difference.
>>
>> Was your bucket index sharded in 12.x?

> we didn't touch default settings so I assume not ? "radosgw-admin
> metadata get" and "radosgw-admin bucket stat" doesn't say anything
> about shards on old cluster, while on new cluster there is from 11 to
> few hundred on the biggest buckets.

Yes, I think it's the sharding that causes the read amplification.

>> Hm, I don't understand enough about the operations that this
>> represents, but maybe one of the RadosGW developers can explain why a
>> single OSD would perform so many similar requests in such a short
>> timeframe.

> I'm getting similar logs on any osd/pg that takes part in the .index

Right, that's what I thought. Again, I can't tell whether these log
messages are to be expected... the repetitions look a bit odd.

Best regards,
-- Simon.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
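
Appended: a rough, illustrative sketch of the log filtering described
above, for anyone who wants a starting point. It assumes an Nginx
log_format whose request line is the usual quoted "METHOD /uri HTTP/1.x"
and whose last field is $request_time in seconds; the regex, the
threshold and the "looks like a listing" heuristic are all assumptions
that will need adjusting to your own log format and setup.

#!/usr/bin/env python3
"""Rough sketch: find slow bucket GET/listing requests in an Nginx access log.

Assumes a log_format whose request line is quoted ("GET /uri HTTP/1.1")
and whose last field is $request_time in seconds -- adjust LINE_RE to
match your own format.
"""

import re
import sys

# Hypothetical log line layout; edit to match your nginx log_format.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<uri>\S+) HTTP/[\d.]+".* (?P<reqtime>\d+\.\d+)$'
)

SLOW_SECONDS = 1.0  # arbitrary threshold; tune to taste


def main(path: str) -> None:
    with open(path, encoding="utf-8", errors="replace") as logfile:
        for line in logfile:
            m = LINE_RE.search(line)
            if not m or m.group("method") != "GET":
                continue
            if float(m.group("reqtime")) < SLOW_SECONDS:
                continue
            uri = m.group("uri")
            path_part, _, query = uri.partition("?")
            # For path-style requests, a URI with no object key
            # ("/bucket" or "/bucket/") is most likely a bucket listing;
            # listings also tend to carry prefix=, delimiter= or
            # list-type=2 query parameters. Virtual-hosted-style buckets
            # would need the Host header instead.
            looks_like_listing = (
                path_part.rstrip("/").count("/") <= 1
                or "list-type=" in query
                or "delimiter=" in query
            )
            if looks_like_listing:
                print(f"{m.group('reqtime'):>8}s  GET {uri}")


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "/var/log/nginx/access.log")

Run it against an access log (e.g. "python3 slow_bucket_gets.py
/var/log/nginx/access.log") and it prints the slow GETs that look like
bucket listings, slowest-threshold first field being the request time.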