On 2020-06-18 at 10:51:31, Simon Leinen <simon.leinen@xxxxxxxxx> wrote:

> Dear Mariusz,
>
> > we're using Ceph as S3-compatible storage to serve static files
> > (mostly css/js/images + some videos) and I've noticed that there
> > seems to be huge read amplification for the index pool.
>
> we have observed that too, under Nautilus (14.2.4-14.2.8).
>
> > Incoming traffic magnitude is around 15k req/sec (mostly sub-1MB
> > requests), but the index pool is getting hammered:
> >
> >   pool pl-war1.rgw.buckets.index id 10
> >     client io 632 MiB/s rd, 277 KiB/s wr, 129.92k op/s rd, 415 op/s wr
> >
> >   pool pl-war1.rgw.buckets.data id 11
> >     client io 4.5 MiB/s rd, 6.8 MiB/s wr, 640 op/s rd, 1.65k op/s wr
> >
> > and is getting an order of magnitude more requests
>
> Our hypothesis is that this is due to the way that RadosGW maps bucket
> index queries (ListObjects/ListObjectsV2) to Rados-level operations
> against a *sharded* index.
>
> For certain types of S3 index queries, the response must be collected
> from multiple (potentially all) shards of the index.
>
> S3 index queries are always "bounded" by the response limitation (1000
> keys by default). But when your index is distributed over, let's say,
> 2000 shards, RadosGW must collect some data from those 2000 shards,
> then throw away most of what it gets, and return the next 1000 keys.
> This could explain the kind of read amplification that you are seeing.
>
> (In practice, S3 index queries often use "prefix" and "delimiter" to
> emulate a hierarchical directory structure. A recently merged change,
> https://github.com/ceph/ceph/pull/30272 , should make such queries
> much more efficient in RadosGW (note that the change contains some
> extensions to the OSD-side Rados protocol). But if I read it
> correctly, that change is already in the version you are using.)

Listing itself is buggy in the version I'm running:
https://tracker.ceph.com/issues/45955

But yes, our structure is generally /bucket/prefix/prefix/file, so there
aren't many big directories (we're migrating from GFS, where that was a
problem).

> Paul Emmerich has written about performance issues with large buckets
> on this list, see
> https://lists.ceph.io/hyperkitty/list/dev@xxxxxxx/thread/36P62BOOCJBVVJCVUX5F5J7KYCGAAICV/
>
> Let's say that there are opportunities for further improvements.
>
> You could look for the specific queries that cause the high read load
> in your system. Maybe there's something that can be done on the
> client side. This could also provide input for Ceph development as
> to what kinds of index operations are used by applications "in the
> wild". Those might be worth optimizing first :-)

Is there a way to debug which query exactly is causing that? There is
currently a lot of incoming traffic (mostly from aws cli sync) as we're
migrating data over, but that's at most hundreds of requests per second.

> > running 15.2.3, nothing special in terms of tuning aside from
> > disabling some logging so as not to overflow the logs.
>
> > We've had a similar test cluster on 12.x (and way slower hardware)
> > getting similar traffic and haven't observed that magnitude of
> > difference.
>
> Was your bucket index sharded in 12.x?

We didn't touch the default settings, so I assume not? "radosgw-admin
metadata get" and "radosgw-admin bucket stats" don't say anything about
shards on the old cluster, while on the new cluster there are anywhere
from 11 to a few hundred shards on the biggest buckets.
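For anyone else digging into this, a quick way to get a per-bucket
overview of index sharding should be something like the following --
a sketch from memory, the exact field names and output vary a bit
between releases:

  # num_shards, objects per shard and fill status for every bucket
  radosgw-admin bucket limit check

  # num_shards is also visible in the bucket instance metadata
  radosgw-admin metadata get bucket.instance:<bucket-name>:<instance-id>

As for my question above about pinpointing the offending queries: I
guess the RGW ops log would be the closest thing, unless there is a
better way -- again a sketch, option names from memory:

  rgw_enable_ops_log = true
  rgw_ops_log_socket_path = /var/run/ceph/rgw-opslog.sock

and then reading the records off that socket, or alternatively bumping
debug_rgw for a short while and grepping the request lines out of the
RGW log.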
> > when enabling debug on affected OSD I only get spam of
> >
> > 2020-06-17T12:35:05.700+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) omap_get_header 10.51_head oid #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head# = 0
> > 2020-06-17T12:35:05.700+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> > 2020-06-17T12:35:05.700+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> > 2020-06-17T12:35:05.700+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> > 2020-06-17T12:35:05.700+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> > 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> > 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> > 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> > 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> > 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) omap_get_header 10.51_head oid #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head# = 0
> > 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> > 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> > 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> > 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> > 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> > 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> > 2020-06-17T12:35:05.708+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> > 2020-06-17T12:35:05.708+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> > 2020-06-17T12:35:05.716+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) omap_get_header 10.51_head oid #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head# = 0
> > 2020-06-17T12:35:05.716+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head#
> > 2020-06-17T12:35:05.716+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head#
> > 2020-06-17T12:35:05.720+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head#
> > 2020-06-17T12:35:05.720+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head#
> > 2020-06-17T12:35:05.720+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head#
> > 2020-06-17T12:35:05.720+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head#
> > 2020-06-17T12:35:05.720+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head#
> > 2020-06-17T12:35:05.720+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head#
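If it helps with interpreting the above: as far as I understand, those
.dir.<instance-id>.<shard> objects are the per-shard bucket index
objects, so the bucket they belong to can be found by matching the
instance id from the log against the bucket metadata -- roughly like
this (a sketch, output formats may differ between releases):

  # find the bucket instance matching the id seen in the log
  radosgw-admin metadata list bucket.instance | grep 88d4f221-0da5-444d-81a8-517771278350.454759.8

  # dump it to see the owning bucket and num_shards
  radosgw-admin metadata get bucket.instance:<bucket-name>:88d4f221-0da5-444d-81a8-517771278350.454759.8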
> Hm, I don't understand enough about the operations that this
> represents, but maybe one of the RadosGW developers can explain why a
> single OSD would perform so many similar requests in such a short
> timeframe.

I'm getting similar logs on any OSD/PG that is part of the index pool.

Cheers

--
Mariusz Gronczewski (XANi) <xani666@xxxxxxxxx>
GnuPG: 0xEA8ACE64
http://devrandom.pl
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx