It seems that the problem didn't happen again after upgrading to Debian 10
(mdadm 4.1-1) - this is not proof, but it gives hope that it is solved! Let's
hope nobody will ever complain about such a problem starting from this
version.
----------------------------------------------------------------------
Given that RAID has been a pretty fixed standard for a long time already,
seeing the subjects in the mailing list archive (unknown errors that
nobody clearly understands, reverts, patches everywhere, deadlocks, null
pointers, RFCs...) after so many years of mdadm development tends to
indicate that mdadm isn't converging into something really stable.
Unless it is converging, just slowly, in the right direction?
Only a precise bug tracking system would be able to provide that
information in a dependable way.
Also, since this is a critical feature meant to protect valuable data as
much as possible (even if it's not a sufficient backup in case something
is overwritten or several disks are lost), changes that are not
* mandatory,
* proven to solve a real, existing bug or problem, and
* proven to have no chance of adding any new bug or any unnecessary
complexity
should probably be forbidden in such a critical functionality. On the
other hand, I also understand that freedom is an essential part of this
kind of project, but here this is quite critical!
----------------------------------------------------------------------
Anyway, that may be all for this case. Thanks again for a piece of work
that I'm still happy to be able to use (though a little more cautious
than before!).
I'll probably propose some bits of new documentation and help about
serious failure recovery for the wiki, using some knowledge and
observations I gained during my searches and tests (after making sure
that none of those bits of information could turn out to be wrong, or
likely to introduce any wrong behavior or data loss - because this can
be a really serious and distressing subject).
Best regards
Julien
On 5/1/19 6:13 PM, Julien ROBIN wrote:
Hi folks,
tl;dr: Some more information/confirmation, and an mdadm 4.1-1 test coming.
Even though I have some difficulty reproducing the issue on another
computer (it happened only once in a large number of tests), on the real
server it failed in exactly the same way a second time during the night,
so it seems it can be reproduced more easily there. This time the ext4
filesystem wasn't mounted.
So I'll upgrade it to Debian 10, which uses Linux 4.19 and mdadm 4.1-1,
and run the test again, in order to tell you whether the problem is
still there.
By the way, while doing some tests on another computer, I found that
running --create over an existing array after switching from 3.4-4 to
4.1-1 requires specifying the data offset, because the default changed.
If it changed and is not given, the array's filesystem isn't readable
until you re-create the array with the right data-offset value.
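For reference, a quick way to check which data offset the existing
members currently use (a sketch, assuming the 1.2 superblocks are still
readable) is:

  # Print the superblock of one member; the current value appears as
  # "Data Offset : ... sectors"
  mdadm --examine /dev/sdd1 | grep -i offset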
That is, in case of the same failure (no actual data changed, but mdadm
cannot assemble anymore), after the upgrade, the exact command for
recovering my server's RAID will be:
mdadm --create /dev/md0 --level=5 --chunk=512K --metadata=1.2 \
  --layout=left-symmetric --data-offset=262144s --raid-devices=3 \
  /dev/sdd1 /dev/sde1 /dev/sdb1 --assume-clean
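In case it helps, a non-destructive way to verify that such a --create
got everything right (a sketch, assuming the filesystem is ext4 as it is
here, and /mnt as a scratch mount point) is:

  # Check the filesystem read-only, without writing anything to it:
  fsck.ext4 -n /dev/md0
  # Or mount it read-only and look around:
  mount -o ro /dev/md0 /mnt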
The command also assumes that /dev/sdd1, /dev/sde1 and /dev/sdb1 haven't
moved (I know the associated serial numbers, so it's easy to check, as
sketched below). If the order is wrong, the array's filesystem isn't
readable until you re-create it with the devices in the right positions.
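For completeness, this is the kind of check I mean (a sketch using my
server's device names; the second command assumes smartmontools is
installed):

  # Persistent names include the drive serial numbers, so this shows
  # which /dev/sdX1 currently belongs to which physical disk:
  ls -l /dev/disk/by-id/ | grep -E 'sdb1|sdd1|sde1'
  # Or ask a drive directly:
  smartctl -i /dev/sdd | grep -i serial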
By doing some archaeology in the list archive around "grow" subjects, I
found someone who suffered from what looks like the same problem, also
on Debian 9 (his detail about attaching the disk to a VM doesn't seem to
be a real difference - and he found another way to get his array started
again).
https://marc.info/?t=153183310600004&r=1&w=2
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=884719
I'll probably keep you informed tonight!