Dear Mariusz,

> we're using Ceph as S3-compatible storage to serve static files (mostly
> css/js/images + some videos) and I've noticed that there seems to be
> huge read amplification for the index pool.

We have observed that too, under Nautilus (14.2.4-14.2.8).

> Incoming traffic magnitude is around 15k req/sec (mostly sub-1MB
> requests), but the index pool is getting hammered:
>
>   pool pl-war1.rgw.buckets.index id 10
>     client io 632 MiB/s rd, 277 KiB/s wr, 129.92k op/s rd, 415 op/s wr
>   pool pl-war1.rgw.buckets.data id 11
>     client io 4.5 MiB/s rd, 6.8 MiB/s wr, 640 op/s rd, 1.65k op/s wr
>
> and it is getting an order of magnitude more requests.

Our hypothesis is that this is due to the way RadosGW maps bucket index
queries (ListObjects/ListObjectsV2) to Rados-level operations against a
*sharded* index. For certain types of S3 index queries, the response
must be collected from multiple (potentially all) shards of the index.

S3 index queries are always bounded by the response limit (1000 keys by
default). But when your index is distributed over, say, 2000 shards,
RadosGW must collect some data from each of those 2000 shards, throw
away most of what it gets, and return only the next 1000 keys. This
could explain the kind of read amplification you are seeing (there is a
small illustrative sketch of this further down).

(In practice, S3 index queries often use "prefix" and "delimiter" to
emulate a hierarchical directory structure. A recently merged change,
https://github.com/ceph/ceph/pull/30272 , should make such queries much
more efficient in RadosGW; note that the change includes some extensions
to the OSD-side Rados protocol. But if I read it correctly, that change
is already in the version you are using.)

Paul Emmerich has written about performance issues with large buckets on
this list, see
https://lists.ceph.io/hyperkitty/list/dev@xxxxxxx/thread/36P62BOOCJBVVJCVUX5F5J7KYCGAAICV/

Let's just say that there are opportunities for further improvement.

You could look for the specific queries that cause the high read load in
your system; maybe there is something that can be done on the client
side. This would also provide useful input for Ceph development as to
which kinds of index operations are used by applications "in the wild";
those might be worth optimizing first :-)

> running 15.2.3, nothing special in terms of tuning aside from
> disabling some logging so as not to overflow the logs.
>
> We've had a similar test cluster on 12.x (and way slower hardware)
> getting similar traffic and haven't observed that magnitude of
> difference.

Was your bucket index sharded in 12.x?
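Here is the small sketch I promised above: a toy model in Python, not
actual RadosGW code. The assumption that every shard is asked for up to
max-keys entries is mine; the real per-shard chunk size may well differ,
but the fan-out pattern is the point.

    # Toy model of an ordered bucket listing over a sharded index.
    # Assumption (mine, not taken from RadosGW source): to return the
    # next `max_keys` keys in global order, the gateway fetches up to
    # `max_keys` entries from every shard, merges them, and throws the
    # rest away.
    import heapq
    from itertools import islice

    def list_objects_sharded(shards, max_keys=1000):
        """shards: one sorted list of keys per index shard."""
        entries_read = 0
        chunks = []
        for shard in shards:
            chunk = shard[:max_keys]      # every shard gets consulted
            entries_read += len(chunk)
            chunks.append(chunk)
        # merge-sort the per-shard chunks, keep only the first max_keys
        page = list(islice(heapq.merge(*chunks), max_keys))
        return page, entries_read

    # scaled-down example: 200 shards, page size 100
    shards = [sorted(f"{s:03d}-{i:05d}" for i in range(500))
              for s in range(200)]
    page, read = list_objects_sharded(shards, max_keys=100)
    print(f"returned {len(page)} keys, read {read} index entries")
    # -> returned 100 keys, read 20000 index entries; with 2000 shards
    #    and a 1000-key page the same pattern touches ~2,000,000 index
    #    entries per request.

The only point is that the number of index entries read per ListObjects
call scales with the shard count, not with the 1000 keys actually
returned.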
> when enabling debug on the affected OSD I only get spam of
>
> 2020-06-17T12:35:05.700+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) omap_get_header 10.51_head oid #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head# = 0
> 2020-06-17T12:35:05.700+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> 2020-06-17T12:35:05.700+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> 2020-06-17T12:35:05.700+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> 2020-06-17T12:35:05.700+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) omap_get_header 10.51_head oid #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head# = 0
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> 2020-06-17T12:35:05.708+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> 2020-06-17T12:35:05.708+0200 7f80694c4700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> 2020-06-17T12:35:05.716+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) omap_get_header 10.51_head oid #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head# = 0
> 2020-06-17T12:35:05.716+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head#
> 2020-06-17T12:35:05.716+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head#
> 2020-06-17T12:35:05.720+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head#
> 2020-06-17T12:35:05.720+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head#
> 2020-06-17T12:35:05.720+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head#
> 2020-06-17T12:35:05.720+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head#
> 2020-06-17T12:35:05.720+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head#
> 2020-06-17T12:35:05.720+0200 7f806d4cc700 10 bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head #10:8b5ed205:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.151:head#

Hm, I don't understand enough about what operations these represent, but
maybe one of the RadosGW developers can explain why a single OSD would
perform so many similar requests in such a short timeframe.

Cheers,
-- 
Simon.
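P.S. If you want to see which index shard objects (and therefore which
buckets) are being hit hardest, a quick-and-dirty script over the OSD
debug log along the lines below might help. It is only a sketch: the
regular expression assumes exactly the log format you quoted, and the
log path handling is just an example.

    # Count get_omap_iterator calls per bucket-index shard object in an
    # OSD debug log of the form quoted above (rough sketch; adjust the
    # regex to your actual log format).
    import re
    import sys
    from collections import Counter

    pattern = re.compile(r"get_omap_iterator \S+ #(\S+?):head#")
    counts = Counter()

    with open(sys.argv[1]) as log:   # path to a copy of the OSD log
        for line in log:
            m = pattern.search(line)
            if m:
                counts[m.group(1)] += 1

    # print the 20 most frequently touched index shard objects
    for obj, n in counts.most_common(20):
        print(f"{n:8d}  {obj}")

If I read the object names right, the trailing number in
.dir.<bucket-id>.<N> is the shard number, so this should show whether
the load is spread across many shards of one bucket (which would fit the
listing fan-out theory above) or concentrated on a few.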