Re: Hangup during scrubbing - possible solutions

On Sat, Dec 1, 2012 at 9:07 AM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
> Just pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7.
> Let me know if it persists.  Thanks for the logs!
> -Sam
>

Very nice, thanks!

There is one corner case: the ``on-the-fly'' upgrade works well only if
your patch is cherry-picked onto a ``generic'' 0.54 build. An online
upgrade from tagged 0.54 to next-dccf6ee causes the osd processes on the
upgraded nodes to crash shortly after restart, with the backtrace you
may see below. An offline upgrade, i.e. shutting down the entire cluster
first, works fine, so the only problem is preserving the running state
of the cluster across the upgrade, which may confuse some users (at
least those who run production setups).

http://xdel.ru/downloads/ceph-log/bt-recovery-sj-patch.out.gz
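For reference, the cherry-pick route mentioned above looks roughly like
this; the branch name is arbitrary and the clone URL is just the usual
public repo:

    git clone git://github.com/ceph/ceph.git
    cd ceph
    git checkout -b v0.54-scrub-fix v0.54
    git cherry-pick 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7
    # then rebuild and restart the osds one node at a time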



> On Fri, Nov 30, 2012 at 2:04 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>> Hah!  Thanks for the log, it's our handling of active_pushes.  I'll
>> have a patch shortly.
>>
>> Thanks!
>> -Sam
>>
>> On Fri, Nov 30, 2012 at 4:14 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>> http://xdel.ru/downloads/ceph-log/ceph-scrub-stuck.log.gz
>>> http://xdel.ru/downloads/ceph-log/cluster-w.log.gz
>>>
>>> Here, please.
>>>
>>> I initiated a deep-scrub of osd.1, which led to forever-stuck I/O
>>> requests within a short time (a plain scrub does the same). The
>>> second log may be useful for proper timestamps, since seeking through
>>> the original one may take a long time. The osd processes on that node
>>> were restarted twice: at the beginning, to make sure all the config
>>> options were applied, and at the end, to do the same and to get rid
>>> of the stuck requests.
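The trigger above was simply the per-osd command, shown with the osd id
from this run (double-check the exact argument form against your
version):

    ceph osd deep-scrub 1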
>>>
>>>
>>> On Wed, Nov 28, 2012 at 5:35 AM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>> If you can reproduce it again, what we really need are the osd logs
>>>> from the acting set of a pg stuck in scrub with
>>>> debug osd = 20
>>>> debug ms = 1
>>>> debug filestore = 20.
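A minimal way to apply those options, assuming the usual ceph.conf
layout and the injectargs facility, would be something like:

    [osd]
        debug osd = 20
        debug ms = 1
        debug filestore = 20

or, at runtime:

    ceph tell osd.1 injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'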
>>>>
>>>> Thanks,
>>>> -Sam
>>>>
>>>> On Sun, Nov 25, 2012 at 2:08 PM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>>>> On Fri, Nov 23, 2012 at 12:35 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>>>>>> On Thu, 22 Nov 2012, Andrey Korolyov wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Recent versions of Ceph introduce some unexpected behavior for
>>>>>>> permanent connections (VM or kernel clients): after crash
>>>>>>> recovery, I/O will hang on the next planned scrub in the
>>>>>>> following scenario:
>>>>>>>
>>>>>>> - launch a bunch of clients doing non-intensive writes,
>>>>>>> - lose one or more osds, mark them down, wait for recovery to
>>>>>>> complete,
>>>>>>> - do a slow scrub, e.g. scrubbing one osd per 5 minutes inside a
>>>>>>> bash script (sketched below), or wait for Ceph to do the same,
>>>>>>> - observe a rising number of pgs stuck in the
>>>>>>> active+clean+scrubbing state (they took over the primary role from
>>>>>>> the ones on the killed osds and almost certainly were being
>>>>>>> written to at the time of the crash),
>>>>>>> - some time later, the clients hang hard and the ceph log reports
>>>>>>> stuck (old) I/O requests.
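The ``slow scrub inside a bash script'' mentioned above is nothing
fancy; a sketch, with the osd ids and the interval made up for
illustration:

    for osd in 0 1 2 3; do
        ceph osd scrub $osd
        sleep 300      # one osd per 5 minutes
    done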
>>>>>>>
>>>>>>> The only way to bring the clients back without losing their I/O
>>>>>>> state is a per-osd restart, which also helps to get rid of the
>>>>>>> active+clean+scrubbing pgs.
>>>>>>>
>>>>>>> First of all, I'll be happy to help solve this problem by
>>>>>>> providing logs.
>>>>>>
>>>>>> If you can reproduce this behavior with 'debug osd = 20' and 'debug ms =
>>>>>> 1' logging on the OSD, that would be wonderful!
>>>>>>
>>>>>
>>>>> I have tested a slightly different recovery flow, please see
>>>>> below. There is no real harm like frozen I/O this time, but the
>>>>> placement groups still got stuck forever in the
>>>>> active+clean+scrubbing state until I restarted all the osds (see
>>>>> the end of the log):
>>>>>
>>>>> http://xdel.ru/downloads/ceph-log/recover-clients-later-than-osd.txt.gz
>>>>>
>>>>> - start the healthy cluster
>>>>> - start persistent clients
>>>>> - add another host with a pair of OSDs and let them join the data
>>>>> placement
>>>>> - wait for the data to rearrange
>>>>> - [22:06 timestamp] mark the OSDs out, or simply kill them and wait
>>>>> (since I have a half-hour delay before automatic readjustment in
>>>>> that case, I ran ``ceph osd out'' manually)
>>>>> - watch the data rearrange again
>>>>> - [22:51 timestamp] when it ends, start a manual rescrub; a non-zero
>>>>> number of placement groups end up in the active+clean+scrubbing
>>>>> state at the end of the process and stay there forever until
>>>>> something happens
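Roughly what was run at the marked timestamps; the osd ids are from
this particular setup, so purely illustrative:

    ceph osd out 12
    ceph osd out 13
    # watch for pgs stuck in scrubbing
    ceph pg dump | grep 'active+clean+scrubbing'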
>>>>>
>>>>> After that, I can restart the osds one by one if I want to get rid
>>>>> of the scrubbing states immediately and then do a deep-scrub (if I
>>>>> don't, those states will return at the next Ceph self-scrub), or do
>>>>> a per-osd deep-scrub if I have a lot of time. The case I described
>>>>> in the previous message occurs when I remove an osd from the data
>>>>> placement that already existed at the moment the client(s) started,
>>>>> and it is indeed more harmful than the current one (the frozen I/O
>>>>> hangs the entire guest, for example). Since testing those flows
>>>>> takes a lot of time, I'll send the logs related to that case
>>>>> tomorrow.
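The ``restart one by one, then deep-scrub'' workaround, sketched with a
sysvinit-style service script and illustrative osd ids; adjust to
whatever init system actually runs the daemons:

    for osd in 0 1 2; do
        /etc/init.d/ceph restart osd.$osd   # clears the stuck scrubbing state
        ceph osd deep-scrub $osd            # rescrub so it does not come back
    done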
>>>>>
>>>>>>> The second question is not directly related to this problem, but
>>>>>>> I have thought about it for a long time: are there planned
>>>>>>> features to control the scrub process more precisely, e.g. a pg
>>>>>>> scrub rate or scheduled scrubs, instead of the current set of
>>>>>>> timeouts, which of course are not very predictable as to when
>>>>>>> they run?
>>>>>>
>>>>>> Not yet.  I would be interested in hearing what kind of control/config
>>>>>> options/whatever you (and others) would like to see!
>>>>>
>>>>> Of course it would be awesome to have a deterministic scheduler,
>>>>> or at least an option to disable automated scrubbing, since it is
>>>>> not very deterministic in time and a deep-scrub eats a lot of I/O
>>>>> when the command is issued against an entire OSD. Rate limiting is
>>>>> not the first priority, at least it can be recreated in an external
>>>>> script, but for those who prefer to leave control to Ceph it may be
>>>>> very useful.
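For the record, the ``set of timeouts'' referred to above are the
per-osd scrub knobs along these lines; the option names are from memory
and the values are placeholders, so check them against the docs for
your release:

    [osd]
        osd scrub min interval   = 300       # seconds
        osd scrub max interval   = 86400
        osd deep scrub interval  = 604800
        osd scrub load threshold = 0.5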
>>>>>
>>>>> Thanks!
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

