On Thu, Feb 14, 2019 at 11:13 AM Wido den Hollander <wido@xxxxxxxx> wrote:
>
> On 2/14/19 10:20 AM, Dan van der Ster wrote:
> > On Thu., Feb. 14, 2019, 6:17 a.m. Wido den Hollander <wido@xxxxxxxx> wrote:
> >>
> >> Hi,
> >>
> >> On a cluster running RGW only I'm running into BlueStore 12.2.11 OSDs
> >> being 100% busy sometimes.
> >>
> >> This cluster has 85k stale indexes (stale-instances list) and I've been
> >> slowly trying to remove them.
> >>
> >
> > Is your implication here that 'stale-instances rm' isn't going to work
> > for your case?
> >
>
> No, just saying that I've been running it to remove stale indexes. But
> suddenly the OSDs are spiking to 100% busy when reading from BlueFS.

How do you "slowly" rm the stale-instances? If you called
stale-instances rm, doesn't that mean there's some rgw in your cluster
now busily trying to delete 85k indices? I imagine that could be
triggering this load spike.

(I'm trying to grok the stale-instances rm implementation -- it seems to
do the work 1000 keys at a time, but it will nevertheless queue up the
work to rm all indices from one call to the command <--- please correct
me if I'm wrong.)
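If that is what's happening, you should be able to see the index pool
shrinking while the affected disks go busy. Something like the below is
roughly how I'd try to correlate the two -- untested, pool and object
names just copied from your mails, adjust to taste:

    # watch the object count in the index pool drop while the cleanup runs
    watch -n 60 'ceph df detail | grep rgw.buckets.index'

    # map the two bucket index objects to their PG/OSDs, to compare
    # against the 100% busy devices you see in iostat
    ceph osd map .rgw.buckets.index .dir.ams02.36062237.821.79
    ceph osd map .rgw.buckets.index .dir.ams02.39023047.682.114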
-- dan

> > (We haven't updated to 12.2.11 yet but have a similar number of stale
> > instances to remove).
> >
>
> In this case the index pool already went from 222k objects to 186k and
> is still going down if we run the GC.
>
> > For the rest below, I didn't understand *when* the OSDs are getting
> > 100% busy -- is that just during normal operations (after the 12.2.11
> > upgrade) or is it while listing the indexes?
> >
>
> Not sure either. For certain objects this can be triggered by just
> listing the OMAP keys. This OSD eats all I/O at that moment.
>
> > Also, have you already tried compacting the omap on the relevant osds?
> >
>
> I tried that, but on an offline OSD I cancelled it after 25 minutes as it
> was still running.
>
> Not sure yet why this happens on those OSDs.
>
> Wido
>
> > -- Dan
> >
> >>
> >> I noticed that regularly OSDs read their HDD heavily and that device
> >> then becomes 100% busy (iostat).
> >>
> >> $ radosgw-admin reshard stale-instances list > stale.json
> >> $ cat stale.json|jq -r '.[]'|wc -l
> >>
> >> I increased debug_bluefs and debug_bluestore to 10 and I found:
> >>
> >> 2019-02-14 05:11:18.417097 7f627732d700 10
> >> bluestore(/var/lib/ceph/osd/ceph-266) omap_get_header 13.205_head oid
> >> #13:a05231a1:::.dir.ams02.36062237.821.79:head# = 0
> >> 2019-02-14 05:11:18.417127 7f627732d700 10
> >> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator 13.205_head
> >> #13:a05231a1:::.dir.ams02.36062237.821.79:head#
> >> 2019-02-14 05:11:18.417133 7f627732d700 10
> >> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator has_omap = 1
> >>
> >> 2019-02-14 05:11:18.417169 7f627732d700 10 bluefs _read_random h
> >> 0x560cb77c5080 0x17a8cd0~fba from file(ino 71562 size 0x2d96a43 mtime
> >> 2019-02-14 02:52:16.370746 bdev 1 allocated 2e00000 extents
> >> [1:0x3228f00000+2e00000])
> >> 2019-02-14 05:11:23.129645 7f627732d700 10 bluefs _read_random h
> >> 0x560c14167780 0x17bb6b7~f52 from file(ino 68900 size 0x41919ef mtime
> >> 2019-02-01 01:19:59.216218 bdev 1 allocated 4200000 extents
> >> [1:0x8b31a00000+200000,1:0x8b31e00000+e00000,1:0x8b32d00000+1700000,1:0x8b3ce00000+1b00000])
> >> 2019-02-14 05:11:23.144550 7f627732d700 10 bluefs _read_random h
> >> 0x560c14c86b80 0x96d020~ef3 from file(ino 67189 size 0x419b603 mtime
> >> 2019-02-01 00:45:12.743836 bdev 1 allocated 4200000 extents
> >> [1:0x53da9a00000+4200000])
> >>
> >> 2019-02-14 05:11:23.149958 7f627732d700 10
> >> bluestore(/var/lib/ceph/osd/ceph-266) omap_get_header 13.e8_head oid
> >> #13:171bcbd3:::.dir.ams02.39023047.682.114:head# = 0
> >> 2019-02-14 05:11:23.149975 7f627732d700 10
> >> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator 13.e8_head
> >> #13:171bcbd3:::.dir.ams02.39023047.682.114:head#
> >> 2019-02-14 05:11:23.149981 7f627732d700 10
> >> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator has_omap = 1
> >>
> >> 2019-02-14 05:11:23.150012 7f627732d700 10 bluefs _read_random h
> >> 0x560c14e42500 0x1a18670~ff0 from file(ino 71519 size 0x417a60f mtime
> >> 2019-02-14 02:51:35.125629 bdev 1 allocated 4200000 extents
> >> [1:0x1c30d00000+4200000])
> >> 2019-02-14 05:11:23.155679 7f627732d700 10 bluefs _read_random h
> >> 0x560c1c1ab980 0xedad4c~fde from file(ino 71391 size 0x25d4a89 mtime
> >> 2019-02-13 22:25:22.801676 bdev 1 allocated 2600000 extents
> >> [1:0x38b00000+2600000])
> >> 2019-02-14 05:11:23.158995 7f627732d700 10 bluefs _read_random h
> >> 0x560c1c1ab980 0xedbd2a~fba from file(ino 71391 size 0x25d4a89 mtime
> >> 2019-02-13 22:25:22.801676 bdev 1 allocated 2600000 extents
> >> [1:0x38b00000+2600000])
> >> 2019-02-14 05:11:23.159233 7f627732d700 10 bluefs _read_random h
> >> 0x560c14c86b80 0x96df13~fca from file(ino 67189 size 0x419b603 mtime
> >> 2019-02-01 00:45:12.743836 bdev 1 allocated 4200000 extents
> >> [1:0x53da9a00000+4200000])
> >> 2019-02-14 05:11:23.159456 7f627732d700 10 bluefs _read_random h
> >> 0x560c14c86b80 0x96eedd~f1b from file(ino 67189 size 0x419b603 mtime
> >> 2019-02-01 00:45:12.743836 bdev 1 allocated 4200000 extents
> >> [1:0x53da9a00000+4200000])
> >> 2019-02-14 05:11:23.159639 7f627732d700 10 bluefs _read_random h
> >> 0x560c14c86b80 0x96fdf8~eba from file(ino 67189 size 0x419b603 mtime
> >> 2019-02-01 00:45:12.743836 bdev 1 allocated 4200000 extents
> >> [1:0x53da9a00000+4200000])
> >>
> >> After this the _read_random just continues for thousands of lines and
> >> the OSD becomes very slow: slow requests, heartbeat timeouts and even
> >> OSDs being marked as down.
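That wall of _read_random under a single get_omap_iterator looks like
rocksdb having to skip over a huge pile of deleted/stale entries for that
bucket index, which is exactly the sort of thing a compaction should make
go away. Since the offline compaction was too slow to wait for, maybe try
the online, per-OSD one on osd.266 only and then re-time the listomapkeys
calls further down. Roughly what I mean -- I haven't double-checked that
both forms exist in 12.2.11, and the compaction will keep that disk busy
while it runs:

    # online, via the admin socket on the host running osd.266
    ceph daemon osd.266 compact

    # offline alternative, with osd.266 stopped
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-266 compact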
> >>
> >> So I tried to list the omap keys:
> >>
> >> $ rados -p .rgw.buckets.index listomapkeys .dir.ams02.36062237.821.79
> >> $ rados -p .rgw.buckets.index listomapkeys .dir.ams02.39023047.682.114
> >>
> >> .dir.ams02.36062237.821.79: <1s
> >> .dir.ams02.39023047.682.114: ~30s
> >>
> >> Both objects are on the same OSD (266), but if I list the omap keys for
> >> the last object the disk jumps to 100% busy and stays there for ~1min.
> >>
> >> I've seen this before with RBD and SSD-backed OSDs, where on an omap
> >> key listing the disks would jump to 100% busy and cause slow requests.
> >>
> >> This case seems very similar to that one.
> >>
> >> I can keep triggering it in this case by listing the omap keys for
> >> object .dir.ams02.39023047.682.114, which doesn't have any keys though.
> >>
> >> Has anybody seen this before?
> >>
> >> Some information:
> >>
> >> - Ceph 12.2.11
> >> - BlueStore (default memory target of 4G)
> >> - RGW-only use case
> >> - WAL+DB+DATA on HDD
> >>
> >> Thanks,
> >>
> >> Wido
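PS: if a compaction on osd.266 does go through, it would be interesting
to see a before/after timing of those two index objects, something along
these lines (bash, object names copied from your mail; note the slow one
will again keep the disk busy for a minute or so while it runs):

    for obj in .dir.ams02.36062237.821.79 .dir.ams02.39023047.682.114; do
        echo "== $obj"
        time rados -p .rgw.buckets.index listomapkeys "$obj" | wc -l
    done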