Re: :-( Re: F20 Kernel 3.14.x / raid-check and freezes

Reindl Harald <h.reindl@xxxxxxxxxxxxx> · Sun, 08 Jun 2014 01:53:40 +0200



On 06.06.2014 00:28, Reindl Harald wrote:
> sadly both bugs were re-opened today: after some hours of
> uptime, shutdown and raid-check with 3.14 are a
> pain, and that is *for sure* 3.14.x because my
> co-developer stayed on the last 3.13 until i
> thought it was fine now and never had these problems,
> and both 3.14.x shutdowns froze with mdraid

the two commits below smell related
not sure how to understand the second one
filesystems on affected machines are mounted with nobarrier

https://www.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.14.6
____________________________________________________________________________________________

commit 0bc4091108e8f2e65faef3082e5261f2c35cd2b4
Author: NeilBrown <neilb@xxxxxxx>
Date:   Tue May 6 09:36:08 2014 +1000

    md: avoid possible spinning md thread at shutdown.

    commit 0f62fb220aa4ebabe8547d3a9ce4a16d3c045f21 upstream.

    If an md array with externally managed metadata (e.g. DDF or IMSM)
    is in use, then we should not set safemode==2 at shutdown because:

    1/ this is ineffective: user-space need to be involved in any 'safemode' handling,
    2/ The safemode management code doesn't cope with safemode==2 on external metadata
       and md_check_recover enters an infinite loop.

    Even at shutdown, an infinite-looping process can be problematic, so this
    could cause shutdown to hang.

    Signed-off-by: NeilBrown <neilb@xxxxxxx>
    Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
____________________________________________________________________________________________

commit 8c7311a1c4a8d804bde91b00a2f2c1a22a954c30
Author: NeilBrown <neilb@xxxxxxx>
Date:   Mon May 5 13:34:37 2014 +1000

    md/raid10: call wait_barrier() for each request submitted.

    commit cc13b1d1500656a20e41960668f3392dda9fa6e2 upstream.

    wait_barrier() includes a counter, so we must call it precisely once
    (unless balanced by allow_barrier()) for each request submitted.

    Since
    commit 20d0189b1012a37d2533a87fb451f7852f2418d1
        block: Introduce new bio_split()
    in 3.14-rc1, we don't call it for the extra requests generated when
    we need to split a bio.

    When this happens the counter goes negative, any resync/recovery will
    never start, and  "mdadm --stop" will hang.

    Reported-by: Chris Murphy <lists@xxxxxxxxxxxxxxxxx>
    Fixes: 20d0189b1012a37d2533a87fb451f7852f2418d1
    Cc: Kent Overstreet <kmo@xxxxxxxxxxxxx>
    Signed-off-by: NeilBrown <neilb@xxxxxxx>
    Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
____________________________________________________________________________________________
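
To make the second commit message easier to follow, here is a toy model of the counter it describes. This is only a sketch of the accounting as the changelog words it, not the kernel code: the class `BarrierModel`, the helper `submit_split_request` and the check `resync_can_start` are hypothetical names made up for illustration. As far as I can tell, wait_barrier() here is md's own bookkeeping of in-flight requests and is a different thing from the filesystem "nobarrier" mount option.

```python
class BarrierModel:
    """Toy model of a pending-request counter (hypothetical, not kernel code)."""

    def __init__(self):
        self.nr_pending = 0

    def wait_barrier(self):
        # One request enters the array: count it as in flight.
        self.nr_pending += 1

    def allow_barrier(self):
        # One request completes: remove it from the in-flight count.
        self.nr_pending -= 1

    def resync_can_start(self):
        # Assumption for this sketch: resync only proceeds once
        # nothing is counted as in flight.
        return self.nr_pending == 0


def submit_split_request(model, pieces, wait_per_piece):
    """Submit one request that is split into `pieces` parts.

    wait_per_piece=False models the behaviour described as broken:
    wait_barrier() is called once, but every piece calls allow_barrier().
    wait_per_piece=True models the fix: one wait_barrier() per piece.
    """
    waits = pieces if wait_per_piece else 1
    for _ in range(waits):
        model.wait_barrier()
    for _ in range(pieces):
        model.allow_barrier()
    return model.nr_pending


if __name__ == "__main__":
    broken = BarrierModel()
    print("broken:", submit_split_request(broken, 3, wait_per_piece=False),
          "resync can start:", broken.resync_can_start())

    fixed = BarrierModel()
    print("fixed: ", submit_split_request(fixed, 3, wait_per_piece=True),
          "resync can start:", fixed.resync_can_start())
```

In the broken case the counter ends up below zero and never returns to zero on its own, which matches the changelog's statement that resync/recovery never starts and "mdadm --stop" hangs.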


> On 03.06.2014 21:10, Reindl Harald wrote:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1096414
>> https://bugzilla.redhat.com/show_bug.cgi?id=1092937
>>
>> *both* seem to be fixed with 3.14.5-200.fc20.x86_64
>> while i am unable to find the relevant change in
>> the kernel-upstream-changelog
>>
>> would be nice to know that it is fixed for a reason
>> and not at random, the way it came - *maybe* the change
>> to "data=ordered" instead of "data=writeback" is
>> responsible but i don't want to test that since
>> there was enough data damage from those freezes
>>
>> however, i rebooted my workstation 40 times after the
>> update from koji and did 4 raid-check runs without any
>> freeze on two different machines
>>
>> On 08.05.2014 11:10, Reindl Harald wrote:
>>> On 02.05.2014 11:09, Reindl Harald wrote:
>>>> On 01.05.2014 15:06, Reindl Harald wrote:
>>>>> On 01.05.2014 14:55, Bruno Wolff III wrote:
>>>>>> The bug I reported to upstream is:
>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=68061
>>>>>>
>>>>>> That was for 3.13, so if you were running into a variant of the problem I was seeing, you should have seen it in
>>>>>> 3.13. I did not see the problem with 3.14 or 3.15 pre-release kernels. I'm using raid 1 instead of raid 10
>>>>>
>>>>> different problem class
>>>>> you were able to look at "cat /proc/mdstat"
>>>>>
>>>>> in my case the machine is just dead and doesn't react to any
>>>>> input including the ACPI power button, this morning i saw a frozen
>>>>> KDE desktop with high CPU load from the past in the monitoring
>>>>> widgets, last week it froze during login (after entering username
>>>>> and password in KDM, the first black screen before the desktop
>>>>> is built appeared and all operations stopped)
>>>>>
>>>>> i will wait some hours for possible feedback here and then
>>>>> file a bugreport, sadly there is not much constructive i
>>>>> can report because of the complete system freeze and no logs
>>>>> or anything after hard power off and boot again
>>>>
>>>> ok, it's not only raid-check, 15 minutes ago it happened again
>>>> freeze number 3 - i fear a bugreport does not make much sense
>>>> because "system freezes once or twice per week before lunch"
>>>> is not that helpful, most likely it does not hit
>>>> only me and will go away with a following 3.14.x update
>>>>
>>>> Apr 21 23:34:54 Installed: kernel-3.14.1-200.fc20.x86_64
>>>> Apr 28 20:50:54 Installed: kernel-3.14.2-200.fc20.x86_64
>>>
>>> 3.14.3-200.fc20.x86_64 and sadly today the same on two
>>> machines - the only thing i can say for sure is that after
>>> whatever happens in the background *any* write to disks
>>> hangs forever
>>>
>>> * woke up -> no music
>>> * turn on screen, move the mouse, KDM login screen
>>> * try to enter password
>>> * see the first asterisk, all further input is ignored
>>> * moving the mouse pointer still works
>>> * CTRL+ALT+F3 -> thank god there is an active root session
>>> * type "sync" -> no disk activity, no response
>>> * well, power off hard
>>> _____________________________________________
>>>
>>> second machine in the office:
>>>
>>> * KDM hangs
>>> * login on TTY3 possible by luck
>>> * dmesg -> only the usual lines about starting raid-check
>>>   and delaying the two other raid devices
>>> * nothing else in dmesg or syslog
>>> * also here: any write to disk stalls

_______________________________________________
kernel mailing list
kernel@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/kernel