Re: HDD OSD 100% busy reading OMAP keys RGW

On 2/14/19 2:08 PM, Dan van der Ster wrote:
> On Thu, Feb 14, 2019 at 12:07 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>>
>>
>>
>> On 2/14/19 11:26 AM, Dan van der Ster wrote:
>>> On Thu, Feb 14, 2019 at 11:13 AM Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>
>>>> On 2/14/19 10:20 AM, Dan van der Ster wrote:
>>>>> On Thu., Feb. 14, 2019, 6:17 a.m. Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On a cluster running RGW only I'm running into BlueStore 12.2.11 OSDs
>>>>>> being 100% busy sometimes.
>>>>>>
>>>>>> This cluster has 85k stale indexes (stale-instances list) and I've been
>>>>>> slowly trying to remove them.
>>>>>>
>>>>>
>>>>> Is your implication here that 'stale-instances rm' isn't going to work
>>>>> for your case?
>>>>>
>>>>
>>>> No, just saying that I've been running it to remove stale indexes. But
>>>> suddenly the OSDs are spiking to 100% busy when reading from BlueFS.
>>>
>>> How do you "slowly" rm the stale-instances? If you called
>>> stale-instances rm, doesn't that mean there's some rgw in your cluster
>>> busily now trying to delete 85k indices? I imagine that could be
>>> triggering this load spike.
>>
>> I ran it with:
>>
>> $ timeout 900 <command>
>>
>> Just to see what would happen.
>>
>> The problem is that even without this running I now see OSDs spiking to
>> 100% busy.
>>
>> The bluefs debug logs tell me that when a particular object is queried
>> it will cause the OSD to scan its RocksDB database and render the OSD
>> useless for a few minutes.
>>
>>> (I'm trying to grok the stale-instances rm implementation -- it seems
>>> to do the work 1000 keys at a time, but will nevertheless queue up the
>>> work to rm all indices from one call to the command <--- please
>>> correct me if I'm wrong).
>>>
>>> -- dan
>>>
>>>
>>>>
>>>>> (We haven't updated to 12.2.11 yet but have a similar number of stale
>>>>> instances to remove).
>>>>>
>>>>
>>>> In this case the index pool already went from 222k objects to 186k and
>>>> is still going down if we run the GC.
>>>>
>>>>> For the rest below, I didn't understand *when* the OSDs are getting
>>>>> 100% busy -- is that just during normal operations (after the 12.2.11
>>>>> upgrade) or is it while listing the indexes?
>>>>>
>>>>
>>>> Not sure either. For certain objects this can be triggered by just
>>>> listing the OMAP keys. The OSD eats all I/O at that moment.
>>>>
>>>>> Also, have you already tried compacting the omap on the relevant osds?
>>>>>
>>>>
>>>> I tried that, but on an offline OSD I cancelled it after 25 minutes as it
>>>> was still running.
>>>>
>>>> Not sure yet why this happens on those OSDs.
>>>>
> 
> Do the osd ops show anything suspicious? Are these coming from some
> rgw iterating over keys or some internal osd process not resulting
> from an op?
> 

Nope. It's just a random .dir (index object) of the RGW being queried
which results in the OSD going crazy.
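
For reference, the ops were checked through the OSD admin socket with
something like the following (osd.266 being the OSD from the logs in my
first mail); nothing stood out there:

$ ceph daemon osd.266 dump_ops_in_flight
$ ceph daemon osd.266 dump_historic_ops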

We eventually managed to compact all 80 OSDs, but that took about an
hour per OSD and they also required a restart before the compaction took
full effect.
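
For anyone wanting to try the same: an offline compaction with
ceph-kvstore-tool looks roughly like this (a sketch, not necessarily the
exact steps we ran; osd.266 just as the example, and the OSD has to be
stopped first):

$ systemctl stop ceph-osd@266
$ ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-266 compact
$ systemctl start ceph-osd@266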

> It reminds me of this old rgw scrub issue, which I resolved by simply
> rados rm'ing the object with millions of keys:
>    http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018565.html
> 
> (Symptom is similar, though I doubt the cause is related).
> 

Thanks, but I doubt it. I saw a similar case with RBD and Luminous a few
months ago. Opening a certain RBD image would also result in the OSD
eating up all CPU and (SSD!) disk I/O for a few minutes, leading to
massive slow requests.

There is something with RocksDB going on in both cases where it starts
to scan the whole DB.

If that DB is a couple of GB in size that can take a very long time.
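
To get a feel for how big that DB actually is on a given OSD, the bluefs
perf counters should give a rough idea, e.g. (again osd.266 as the
example):

$ ceph daemon osd.266 perf dump | grep -E 'db_(total|used)_bytes'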

Wido

> -- dan
> 
> 
>>>> Wido
>>>>
>>>>> -- Dan
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> I noticed that OSDs regularly read their HDD heavily and that the device
>>>>>> then becomes 100% busy. (iostat)
>>>>>>
>>>>>> $ radosgw-admin reshard stale-instances list > stale.json
>>>>>> $ cat stale.json|jq -r '.[]'|wc -l
>>>>>>
>>>>>> I increased debug_bluefs and debug_bluestore to 10 and I found:
>>>>>>
>>>>>> 2019-02-14 05:11:18.417097 7f627732d700 10
>>>>>> bluestore(/var/lib/ceph/osd/ceph-266) omap_get_header 13.205_head oid
>>>>>> #13:a05231a1:::.dir.ams02.36062237.821.79:head# = 0
>>>>>> 2019-02-14 05:11:18.417127 7f627732d700 10
>>>>>> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator 13.205_head
>>>>>> #13:a05231a1:::.dir.ams02.36062237.821.79:head#
>>>>>> 2019-02-14 05:11:18.417133 7f627732d700 10
>>>>>> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator has_omap = 1
>>>>>>
>>>>>> 2019-02-14 05:11:18.417169 7f627732d700 10 bluefs _read_random h
>>>>>> 0x560cb77c5080 0x17a8cd0~fba from file(ino 71562 size 0x2d96a43 mtime
>>>>>> 2019-02-14 02:52:16.370746 bdev 1 allocated 2e00000 extents
>>>>>> [1:0x3228f00000+2e00000])
>>>>>> 2019-02-14 05:11:23.129645 7f627732d700 10 bluefs _read_random h
>>>>>> 0x560c14167780 0x17bb6b7~f52 from file(ino 68900 size 0x41919ef mtime
>>>>>> 2019-02-01 01:19:59.216218 bdev 1 allocated 4200000 extents
>>>>>> [1:0x8b31a00000+200000,1:0x8b31e00000+e00000,1:0x8b32d00000+1700000,1:0x8b3ce00000+1b00000])
>>>>>> 2019-02-14 05:11:23.144550 7f627732d700 10 bluefs _read_random h
>>>>>> 0x560c14c86b80 0x96d020~ef3 from file(ino 67189 size 0x419b603 mtime
>>>>>> 2019-02-01 00:45:12.743836 bdev 1 allocated 4200000 extents
>>>>>> [1:0x53da9a00000+4200000])
>>>>>>
>>>>>> 2019-02-14 05:11:23.149958 7f627732d700 10
>>>>>> bluestore(/var/lib/ceph/osd/ceph-266) omap_get_header 13.e8_head oid
>>>>>> #13:171bcbd3:::.dir.ams02.39023047.682.114:head# = 0
>>>>>> 2019-02-14 05:11:23.149975 7f627732d700 10
>>>>>> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator 13.e8_head
>>>>>> #13:171bcbd3:::.dir.ams02.39023047.682.114:head#
>>>>>> 2019-02-14 05:11:23.149981 7f627732d700 10
>>>>>> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator has_omap = 1
>>>>>>
>>>>>> 2019-02-14 05:11:23.150012 7f627732d700 10 bluefs _read_random h
>>>>>> 0x560c14e42500 0x1a18670~ff0 from file(ino 71519 size 0x417a60f mtime
>>>>>> 2019-02-14 02:51:35.125629 bdev 1 allocated 4200000 extents
>>>>>> [1:0x1c30d00000+4200000])
>>>>>> 2019-02-14 05:11:23.155679 7f627732d700 10 bluefs _read_random h
>>>>>> 0x560c1c1ab980 0xedad4c~fde from file(ino 71391 size 0x25d4a89 mtime
>>>>>> 2019-02-13 22:25:22.801676 bdev 1 allocated 2600000 extents
>>>>>> [1:0x38b00000+2600000])
>>>>>> 2019-02-14 05:11:23.158995 7f627732d700 10 bluefs _read_random h
>>>>>> 0x560c1c1ab980 0xedbd2a~fba from file(ino 71391 size 0x25d4a89 mtime
>>>>>> 2019-02-13 22:25:22.801676 bdev 1 allocated 2600000 extents
>>>>>> [1:0x38b00000+2600000])
>>>>>> 2019-02-14 05:11:23.159233 7f627732d700 10 bluefs _read_random h
>>>>>> 0x560c14c86b80 0x96df13~fca from file(ino 67189 size 0x419b603 mtime
>>>>>> 2019-02-01 00:45:12.743836 bdev 1 allocated 4200000 extents
>>>>>> [1:0x53da9a00000+4200000])
>>>>>> 2019-02-14 05:11:23.159456 7f627732d700 10 bluefs _read_random h
>>>>>> 0x560c14c86b80 0x96eedd~f1b from file(ino 67189 size 0x419b603 mtime
>>>>>> 2019-02-01 00:45:12.743836 bdev 1 allocated 4200000 extents
>>>>>> [1:0x53da9a00000+4200000])
>>>>>> 2019-02-14 05:11:23.159639 7f627732d700 10 bluefs _read_random h
>>>>>> 0x560c14c86b80 0x96fdf8~eba from file(ino 67189 size 0x419b603 mtime
>>>>>> 2019-02-01 00:45:12.743836 bdev 1 allocated 4200000 extents
>>>>>> [1:0x53da9a00000+4200000])
>>>>>>
>>>>>> After this the _read_random calls just continue for thousands of lines
>>>>>> and the OSD becomes very slow: slow requests, heartbeat timeouts and even
>>>>>> OSDs being marked as down.
>>>>>>
>>>>>> So I tried to list the omap keys:
>>>>>>
>>>>>> $ rados -p .rgw.buckets.index listomapkeys .dir.ams02.36062237.821.79
>>>>>> $ rados -p .rgw.buckets.index listomapkeys .dir.ams02.39023047.682.114
>>>>>>
>>>>>> .dir.ams02.36062237.821.79: <1s
>>>>>> .dir.ams02.39023047.682.114: ~30s
>>>>>>
>>>>>> Both objects are on the same OSD (266), but if I list the omap keys for
>>>>>> the latter object the disk jumps to 100% busy and stays there for ~1min.
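>>>>>>
>>>>>> For reference, which OSD an index object maps to can be checked with
>>>>>> something like:
>>>>>>
>>>>>> $ ceph osd map .rgw.buckets.index .dir.ams02.39023047.682.114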
>>>>>>
>>>>>> I've seen this before with RBD and SSD-backed OSDs where on an omap key
>>>>>> listing the disks would jump to 100% busy and cause slow requests.
>>>>>>
>>>>>> This case seems very similar to that one.
>>>>>>
>>>>>> I can keep triggering it in this case by listing the omap keys for
>>>>>> object .dir.ams02.39023047.682.114, which doesn't have any keys though.
>>>>>>
>>>>>> Has anybody seen this before?
>>>>>>
>>>>>> Some information:
>>>>>>
>>>>>> - Ceph 12.2.11
>>>>>> - BlueStore (default memory target of 4G)
>>>>>> - RGW-only use-case
>>>>>> - WAL+DB+DATA on HDD
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Wido
>>>>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


