Re: HDD OSD 100% busy reading OMAP keys RGW

On Thu, Feb 14, 2019 at 12:07 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>
>
>
> On 2/14/19 11:26 AM, Dan van der Ster wrote:
> > On Thu, Feb 14, 2019 at 11:13 AM Wido den Hollander <wido@xxxxxxxx> wrote:
> >>
> >> On 2/14/19 10:20 AM, Dan van der Ster wrote:
> >>> On Thu., Feb. 14, 2019, 6:17 a.m. Wido den Hollander <wido@xxxxxxxx> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> On a cluster running RGW only I'm running into BlueStore 12.2.11 OSDs
> >>>> being 100% busy sometimes.
> >>>>
> >>>> This cluster has 85k stale indexes (stale-instances list) and I've been
> >>>> slowly trying to remove them.
> >>>>
> >>>
> >>> Is your implication here that 'stale-instances rm' isn't going to work
> >>> for your case?
> >>>
> >>
> >> No, just saying that I've been running it to remove stale indexes. But
> >> suddenly the OSDs are spiking to 100% busy when reading from BlueFS.
> >
> > How do you "slowly" rm the stale-instances? If you called
> > stale-instances rm, doesn't that mean there's some rgw in your cluster
> > busily now trying to delete 85k indices? I imagine that could be
> > triggering this load spike.
>
> I ran it with:
>
> $ timeout 900 <command>
>
> Just to see what would happen.
>
> The problem is that even without this running I now see OSDs spiking to
> 100% busy.
>
> The bluefs debug logs tell me that when a particular object is queried
> it will cause the OSD to scan its RocksDB database and render the OSD
> useless for a few minutes.
>
> > (I'm trying to grok the stale-instances rm implementation -- it seems
> > to do the work 1000 keys at a time, but will nevertheless queue up the
> > work to rm all indices from one call to the command <--- please
> > correct me if I'm wrong).
> >
> > -- dan
> >
> >
> >>
> >>> (We haven't updated to 12.2.11 yet but have a similar number of stale
> >>> instances to remove).
> >>>
> >>
> >> In this case the index pool already went from 222k objects to 186k and
> >> is still going down if we run the GC.
> >>
> >>> For the rest below, I didn't understand *when* are the osd's getting
> >>> 100% busy -- is that just during normal operations (after the 12.2.11
> >>> upgrade) or is it while listing the indexes?
> >>>
> >>
> >> Not sure either. For certain objects this can be triggered by just
> >> listing the OMAP keys. The OSD eats all I/O at that moment.
> >>
> >>> Also, have you already tried compacting the omap on the relevant osds?
> >>>
> >>
> >> I tried that, but on an offline OSD I cancelled it after 25 minutes as it
> >> was still running.
> >>
> >> Not sure yet why this happens on those OSDs.
> >>

Do the osd ops show anything suspicious? Are these coming from some
rgw iterating over keys or some internal osd process not resulting
from an op?
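
For example, something like this (using osd.266 from your logs) should
show whether a client op is behind those reads:

$ ceph daemon osd.266 dump_ops_in_flight
$ ceph daemon osd.266 dump_historic_ops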

It reminds me of this old rgw scrub issue, which I resolved by simply
rados rm'ing the object with millions of keys:
   http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018565.html

(Symptom is similar, though I doubt the cause is related).
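
If it does turn out to be one index object with a huge pile of keys (or
tombstones), what I did back then was basically count the keys and then
rados rm the object (only if you're certain the instance really is
stale, of course). Roughly, using the object name from your mail:

$ rados -p .rgw.buckets.index listomapkeys .dir.ams02.39023047.682.114 | wc -l
$ rados -p .rgw.buckets.index rm .dir.ams02.39023047.682.114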

-- dan


> >> Wido
> >>
> >>> -- Dan
> >>>
> >>>
> >>>>
> >>>>
> >>>> I noticed that regularly OSDs read their HDD heavily and that device
> >>>> then becomes 100% busy. (iostat)
> >>>>
> >>>> $ radosgw-admin reshard stale-instances list > stale.json
> >>>> $ cat stale.json|jq -r '.[]'|wc -l
> >>>>
> >>>> I increased debug_bluefs and debug_bluestore to 10 and I found:
> >>>>
> >>>> 2019-02-14 05:11:18.417097 7f627732d700 10
> >>>> bluestore(/var/lib/ceph/osd/ceph-266) omap_get_header 13.205_head oid
> >>>> #13:a05231a1:::.dir.ams02.36062237.821.79:head# = 0
> >>>> 2019-02-14 05:11:18.417127 7f627732d700 10
> >>>> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator 13.205_head
> >>>> #13:a05231a1:::.dir.ams02.36062237.821.79:head#
> >>>> 2019-02-14 05:11:18.417133 7f627732d700 10
> >>>> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator has_omap = 1
> >>>>
> >>>> 2019-02-14 05:11:18.417169 7f627732d700 10 bluefs _read_random h
> >>>> 0x560cb77c5080 0x17a8cd0~fba from file(ino 71562 size 0x2d96a43 mtime
> >>>> 2019-02-14 02:52:16.370746 bdev 1 allocated 2e00000 extents
> >>>> [1:0x3228f00000+2e00000])
> >>>> 2019-02-14 05:11:23.129645 7f627732d700 10 bluefs _read_random h
> >>>> 0x560c14167780 0x17bb6b7~f52 from file(ino 68900 size 0x41919ef mtime
> >>>> 2019-02-01 01:19:59.216218 bdev 1 allocated 4200000 extents
> >>>> [1:0x8b31a00000+200000,1:0x8b31e00000+e00000,1:0x8b32d00000+1700000,1:0x8b3ce00000+1b00000])
> >>>> 2019-02-14 05:11:23.144550 7f627732d700 10 bluefs _read_random h
> >>>> 0x560c14c86b80 0x96d020~ef3 from file(ino 67189 size 0x419b603 mtime
> >>>> 2019-02-01 00:45:12.743836 bdev 1 allocated 4200000 extents
> >>>> [1:0x53da9a00000+4200000])
> >>>>
> >>>> 2019-02-14 05:11:23.149958 7f627732d700 10
> >>>> bluestore(/var/lib/ceph/osd/ceph-266) omap_get_header 13.e8_head oid
> >>>> #13:171bcbd3:::.dir.ams02.39023047.682.114:head# = 0
> >>>> 2019-02-14 05:11:23.149975 7f627732d700 10
> >>>> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator 13.e8_head
> >>>> #13:171bcbd3:::.dir.ams02.39023047.682.114:head#
> >>>> 2019-02-14 05:11:23.149981 7f627732d700 10
> >>>> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator has_omap = 1
> >>>>
> >>>> 2019-02-14 05:11:23.150012 7f627732d700 10 bluefs _read_random h
> >>>> 0x560c14e42500 0x1a18670~ff0 from file(ino 71519 size 0x417a60f mtime
> >>>> 2019-02-14 02:51:35.125629 bdev 1 allocated 4200000 extents
> >>>> [1:0x1c30d00000+4200000])
> >>>> 2019-02-14 05:11:23.155679 7f627732d700 10 bluefs _read_random h
> >>>> 0x560c1c1ab980 0xedad4c~fde from file(ino 71391 size 0x25d4a89 mtime
> >>>> 2019-02-13 22:25:22.801676 bdev 1 allocated 2600000 extents
> >>>> [1:0x38b00000+2600000])
> >>>> 2019-02-14 05:11:23.158995 7f627732d700 10 bluefs _read_random h
> >>>> 0x560c1c1ab980 0xedbd2a~fba from file(ino 71391 size 0x25d4a89 mtime
> >>>> 2019-02-13 22:25:22.801676 bdev 1 allocated 2600000 extents
> >>>> [1:0x38b00000+2600000])
> >>>> 2019-02-14 05:11:23.159233 7f627732d700 10 bluefs _read_random h
> >>>> 0x560c14c86b80 0x96df13~fca from file(ino 67189 size 0x419b603 mtime
> >>>> 2019-02-01 00:45:12.743836 bdev 1 allocated 4200000 extents
> >>>> [1:0x53da9a00000+4200000])
> >>>> 2019-02-14 05:11:23.159456 7f627732d700 10 bluefs _read_random h
> >>>> 0x560c14c86b80 0x96eedd~f1b from file(ino 67189 size 0x419b603 mtime
> >>>> 2019-02-01 00:45:12.743836 bdev 1 allocated 4200000 extents
> >>>> [1:0x53da9a00000+4200000])
> >>>> 2019-02-14 05:11:23.159639 7f627732d700 10 bluefs _read_random h
> >>>> 0x560c14c86b80 0x96fdf8~eba from file(ino 67189 size 0x419b603 mtime
> >>>> 2019-02-01 00:45:12.743836 bdev 1 allocated 4200000 extents
> >>>> [1:0x53da9a00000+4200000])
> >>>>
> >>>> After this the _read_random just continues for thousands of lines and
> >>>> the OSD becomes very slow. Slow requests, heartbeat timeouts and even
> >>>> OSDs being marked as down.
> >>>>
> >>>> So I tried to list the omap keys:
> >>>>
> >>>> $ rados -p .rgw.buckets.index listomapkeys .dir.ams02.36062237.821.79
> >>>> $ rados -p .rgw.buckets.index listomapkeys .dir.ams02.39023047.682.114
> >>>>
> >>>> .dir.ams02.36062237.821.79: <1s
> >>>> .dir.ams02.39023047.682.114: ~30s
> >>>>
> >>>> Both objects are on the same OSD (266), but if I list the omapkeys for
> >>>> the last object the disk jumps to 100% busy and stays there for ~1min.
> >>>>
> >>>> I've seen this before with RBD and SSD-backed OSDs where on an omap key
> >>>> list the disks would jump to 100% busy and cause slow requests.
> >>>>
> >>>> This case seems very similar to that one.
> >>>>
> >>>> I can keep triggering it in this case by listing the omap keys for
> >>>> object .dir.ams02.39023047.682.114, which doesn't have any keys though.
> >>>>
> >>>> Has anybody seen this before?
> >>>>
> >>>> Some information:
> >>>>
> >>>> - Ceph 12.2.11
> >>>> - BlueStore (default memory target of 4G)
> >>>> - RGW-only use-case
> >>>> - WAL+DB+DATA on HDD
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Wido
> >>>>
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-users@xxxxxxxxxxxxxx
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


