Re: Assembling journaled array fails

On 5/12/20 3:27 AM, Song Liu wrote:
> On Mon, May 11, 2020 at 4:13 AM Michal Soltys <msoltyspl@xxxxxxxxx> wrote:
>>
>> On 5/10/20 1:57 AM, Michal Soltys wrote:
>>> Anyway, I did some tests with manually snapshotted component devices
>>> (using the dm snapshot target, so as not to touch the underlying devices).
>>>
>>> The raid manages to force-assemble in read-only mode with the journal
>>> device missing, so we will probably be able to recover most of the data
>>> underneath this way (as a last resort).
>>>
>>> The situation I'm in now is likely from an unclean shutdown after all (why
>>> the machine failed to react to the UPS properly is another subject).
>>>
>>> I'd still like to find out why the journal device is - apparently - causing
>>> issues (contrary to what I'd expect it to do ...), most notably:
>>>
>>> 1) mdadm hangs (unkillable, so I presume somewhere in the kernel) and eats
>>> one CPU when trying to assemble the raid with the journal device present;
>>> once that happens I can't do anything with the array (stop, run, etc.) and
>>> can only reboot the server to "fix" it
>>>
>>> 2) mdadm -D shows a nonsensical device size after the assembly attempt
>>> (Used Dev Size : 18446744073709551615)
>>>
>>> 3) the journal device (which is itself an md raid1 consisting of 2 SSDs)
>>> assembles and checks (0 mismatch_cnt) fine - and overall looks ok.
>>>
>>>
>>> Among other interesting things, I also attempted to assemble the raid with
>>> the snapshotted journal. From what I can see it does attempt to do
>>> something, judging from:
>>>
>>> dmsetup status:
>>>
>>> snap_jo2: 0 536870912 snapshot 40/33554432 16
>>> snap_sdi1: 0 7812500000 snapshot 25768/83886080 112
>>> snap_jo1: 0 536870912 snapshot 40/33554432 16
>>> snap_sdg1: 0 7812500000 snapshot 25456/83886080 112
>>> snap_sdj1: 0 7812500000 snapshot 25928/83886080 112
>>> snap_sdh1: 0 7812500000 snapshot 25352/83886080 112
>>>
>>> But it doesn't move past those values (with mdadm doing nothing but eating
>>> 100% CPU, as mentioned earlier).
>>>
>>>
>>> Any suggestions on how to proceed would be very much appreciated.
>>
>>
>> I've added Song to the CC, in case you have any suggestions on how to
>> proceed with / debug this (mdadm gets stuck somewhere in the kernel, as far
>> as I can see, while attempting to assemble the array).
>>
>> For the record, I can assemble the raid successfully without the journal
>> (using the snapshotted component devices as above), and we did recover some
>> data this way from some of the filesystems - but for a few others I'd like
>> to keep that option as the very last resort.
> 
> Sorry for the delayed response.
> 
> A few questions.
> 
> For these two outputs:
> #1
>                 Name : xs22:r5_big  (local to host xs22)
>                 UUID : d5995d76:67d7fabd:05392f87:25a91a97
>               Events : 56283
> 
>       Number   Major   Minor   RaidDevice State
>          -       0        0        0      removed
>          -       0        0        1      removed
>          -       0        0        2      removed
>          -       0        0        3      removed
> 
>          -       8      145        3      sync   /dev/sdj1
>          -       8      129        2      sync   /dev/sdi1
>          -       9      127        -      spare   /dev/md/xs22:r1_journal_big
>          -       8      113        1      sync   /dev/sdh1
>          -       8       97        0      sync   /dev/sdg1
> 
> #2
> /dev/md/r1_journal_big:
>             Magic : a92b4efc
>           Version : 1.1
>       Feature Map : 0x200
>        Array UUID : d5995d76:67d7fabd:05392f87:25a91a97
>              Name : xs22:r5_big  (local to host xs22)
>     Creation Time : Tue Mar  5 19:28:58 2019
>        Raid Level : raid5
>      Raid Devices : 4
> 
>    Avail Dev Size : 536344576 (255.75 GiB 274.61 GB)
>        Array Size : 11718355968 (11175.50 GiB 11999.60 GB)
>     Used Dev Size : 7812237312 (3725.17 GiB 3999.87 GB)
>       Data Offset : 262144 sectors
>      Super Offset : 0 sectors
>      Unused Space : before=261872 sectors, after=0 sectors
>             State : clean
>       Device UUID : c3a6f2f6:7dd26b0c:08a31ad7:cc8ed2a9
> 
>       Update Time : Sat May  9 15:05:22 2020
>     Bad Block Log : 512 entries available at offset 264 sectors
>          Checksum : c854904f - correct
>            Events : 56289
> 
>            Layout : left-symmetric
>        Chunk Size : 512K
> 
>      Device Role : Journal
>      Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
> 
> 
> Were these captured back to back? I am asking because they show different
> "Events" numbers.

Nah, they were captured between reboots. Captured back to back, all the Events fields show the identical value (56291 now).

> 
> Also, when mdadm -A hangs, could you please capture /proc/$(pidof mdadm)/stack ?
> 

The output is empty:

xs22:/☠ ps -eF fww | grep mdadm
root     10332  9362 97   740  1884  25 12:47 pts/1    R+     6:59  |   \_ mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdi1 /dev/sdg1 /dev/sdj1 /dev/sdh1
xs22:/☠ cd /proc/10332
xs22:/proc/10332☠ cat stack
xs22:/proc/10332☠ 
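
I suppose the empty output is because the task is in state R (actually
running), so there is no blocked kernel stack to print. If it helps, I can
try dumping the backtraces of the active CPUs via sysrq while mdadm is
spinning - a rough sketch, assuming the sysrq interface is enabled:

# allow the 'l' sysrq command (show backtrace of all active CPUs)
echo 1 > /proc/sys/kernel/sysrq
# trigger it while mdadm -A is stuck; the traces end up in the kernel log
echo l > /proc/sysrq-trigger
dmesg | tail -n 80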


> 18446744073709551615 is 0xffffffffffffffffL, so it is not initialized
> by data from the disk.
> I suspect we hang somewhere before this value is initialized.
> 
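
Right - that value is 2^64 - 1 (all bits set), so it clearly isn't coming
from the on-disk metadata. Quick check with bash printf, for reference:

$ printf '%u\n' 0xffffffffffffffff
18446744073709551615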



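For completeness, the snapshot-based assembly mentioned above boils down to
roughly the following (a sketch - the COW devices, their sizes and the
snapshot chunk size are illustrative here, not the exact commands used):

# size of the origin device in 512-byte sectors
SECTORS=$(blockdev --getsz /dev/sdg1)
# dm snapshot target: all writes go to the COW device, /dev/sdg1 stays untouched
# table format: <start> <length> snapshot <origin> <COW device> <P|N> <chunksize>
dmsetup create snap_sdg1 \
    --table "0 $SECTORS snapshot /dev/sdg1 /dev/vg0/cow_sdg1 P 8"
# (/dev/vg0/cow_sdg1 is just a placeholder for any scratch block device;
#  repeat for sdh1, sdi1, sdj1)

# then force-assemble read-only from the snapshots, without the journal
mdadm --assemble --force --readonly /dev/md/r5_big \
    /dev/mapper/snap_sdg1 /dev/mapper/snap_sdh1 \
    /dev/mapper/snap_sdi1 /dev/mapper/snap_sdj1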