Hi,
On 2023/05/02 19:30, Peter Neuwirth wrote:
Hello Kuai,
thank you for your suggestion!
It is true: when I read the source of the error message in drivers/md/raid5.c,
I saw that growing and replacement together are too much to handle.
So I did what you suggested and started the raid 5 (which was in the middle of
a raid 6 transformation, with two more drives being added) with only the
5 members, which should run as a degraded raid 5.
mdadm --assemble --run /dev/md0 /dev/sdd /dev/sdc /dev/sdb /dev/sdi /dev/sdj
This worked and the array was assembled:
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active (auto-read-only) raid6 sdd[0] sdi[6] sdj[4] sdb[2] sdc[1]
4883151360 blocks super 1.2 level 6, 256k chunk, algorithm 18
[7/5] [UUU_UU_]
bitmap: 0/8 pages [0KB], 65536KB chunk
unused devices: <none>
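For readers unfamiliar with the mdstat status fields: `[7/5]` means 7 raid slots with 5 of them active, and each `_` in `[UUU_UU_]` marks a slot without a working member. A minimal Python sketch of decoding that string follows; the function name `missing_slots` is made up for illustration and is not part of mdadm or the kernel.

```python
import re
from typing import List

def missing_slots(status: str) -> List[int]:
    """Return the raid slot indices shown as '_' in an mdstat status string.

    `status` is expected to look like '[7/5] [UUU_UU_]'.
    """
    m = re.search(r"\[(\d+)/(\d+)\]\s*\[([U_]+)\]", status)
    if m is None:
        raise ValueError(f"unrecognised mdstat status: {status!r}")
    total, active, flags = int(m.group(1)), int(m.group(2)), m.group(3)
    # Sanity check: the counters must agree with the flag string.
    if len(flags) != total or flags.count("U") != active:
        raise ValueError("slot counters and flag string disagree")
    return [i for i, flag in enumerate(flags) if flag == "_"]

print(missing_slots("[7/5] [UUU_UU_]"))  # [3, 6]
```

Slots 3 and 6 are the same two positions that `mdadm --detail` lists as "removed" further down.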
mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Mon Mar 6 18:17:30 2023
Raid Level : raid6
Array Size : 4883151360 (4656.94 GiB 5000.35 GB)
Used Dev Size : 976630272 (931.39 GiB 1000.07 GB)
Raid Devices : 7
Total Devices : 5
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Fri Apr 28 04:21:03 2023
State : clean, degraded
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric-6
Chunk Size : 256K
Consistency Policy : bitmap
New Layout : left-symmetric
Name : solidsrv11:0 (local to host solidsrv11)
UUID : 1a87479e:7513dd65:37c61ca1:43184f65
Events : 6336
Number Major Minor RaidDevice State
0 8 48 0 active sync /dev/sdd
1 8 32 1 active sync /dev/sdc
2 8 16 2 active sync /dev/sdb
- 0 0 3 removed
4 8 144 4 active sync /dev/sdj
6 8 128 5 active sync /dev/sdi
- 0 0 6 removed
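The sizes reported above are consistent with the new 7-slot RAID 6 geometry: two slots' worth of capacity hold parity (P and Q), leaving five for data. A quick arithmetic check, assuming both `Used Dev Size` and `Array Size` are printed in KiB (which matches the GiB figures mdadm shows next to them):

```python
RAID_DEVICES = 7                      # "Raid Devices" from mdadm --detail
PARITY_DEVICES = 2                    # RAID 6 keeps two parity blocks (P, Q) per stripe
USED_DEV_SIZE_KIB = 976630272         # "Used Dev Size"
REPORTED_ARRAY_SIZE_KIB = 4883151360  # "Array Size"

data_devices = RAID_DEVICES - PARITY_DEVICES
expected_kib = data_devices * USED_DEV_SIZE_KIB

print(expected_kib, expected_kib == REPORTED_ARRAY_SIZE_KIB)  # 4883151360 True
```

So the array already advertises the size of the target geometry. This says nothing about how far the reshape itself had progressed.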
But when I try to mount it as an xfs fs:
mount: /mnt/image: mount(2) system call failed: Structure needs cleaning.
When I try to repair the xfs fs, it tells me that no superblock was
found.
Sorry to hear that. It seems the data is already corrupted, and this
really is a kernel issue: somehow replacement (resync?) and reshape
got mixed up. I suspect that rebooting while a reshape is in progress and
a replacement exists can trigger this...
I have no idea for now, but I'll try to reproduce this problem and fix
it.
Thanks,
Kuai
xfs_repair -n /dev/md0
Phase 1 - find and verify superblock...
couldn't verify primary superblock - not enough secondary superblocks
with matching geometry !!!
attempting to find secondary superblock...
.................found candidate secondary superblock...
unable to verify superblock, continuing...
.found candidate secondary superblock...
unable to verify superblock, continuing...
...
.found candidate secondary superblock...
unable to verify superblock, continuing...
.found candidate secondary superblock...
unable to verify superblock, continuing...
...........................................
Sadly, I do not understand exactly what happens in the grow+replacement
phase, where all the evil began. As far as I can see, the two added hard disk
drives still have their old partition tables, so I suppose the rebuild process
was still moving the raid 5 geometry to a transient raid-5-to-6 geometry. I'm
not sure whether the raid 5 promise (1 drive may fail) still holds during this
process. However, since the reboot the two additional drives have been treated
as spares. And one drive of the former raid5, now raid6, seems to be defective.
Is it possible that the restart of the process somehow scrambled some raidset
information and messed up my raid level striping as the growth process
continued? Then the still-mounted device crashed and disappeared from the
mounts. And from this point on, was there no way to reconstruct the scrambled
raidset information and striping?
This whole matter of striped data, transient raid geometries, expansion and
growth processing, etc. seems so complex and opaque to me that I am starting
to consider my data on this raidset as lost :(
I would be very happy about any tools or suggestions that help to save at
least parts of the data on the raid.
regards,
Peter
On 28.04.23 at 04:01, Yu Kuai wrote:
Hi,
On 2023/04/28 5:09, Peter Neuwirth wrote:
------------------------------------------------------------------------------------------------------------------------
Some Logs:
------------------------------------------------------------------------------------------------------------------------
uname -a ; mdadm --version
Linux srv11 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
mdadm - v4.1 - 2018-10-01
srv11:~# mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Mon Mar 6 18:17:30 2023
Raid Level : raid6
Used Dev Size : 976630272 (931.39 GiB 1000.07 GB)
Raid Devices : 7
Total Devices : 6
Persistence : Superblock is persistent
Update Time : Thu Apr 27 17:36:15 2023
State : active, FAILED, Not Started
Active Devices : 5
Working Devices : 6
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric-6
Chunk Size : 256K
Consistency Policy : unknown
New Layout : left-symmetric
Name : solidsrv11:0 (local to host solidsrv11)
UUID : 1a87479e:7513dd65:37c61ca1:43184f65
Events : 4700
Number Major Minor RaidDevice State
- 0 0 0 removed
- 0 0 1 removed
- 0 0 2 removed
- 0 0 3 removed
- 0 0 4 removed
- 0 0 5 removed
- 0 0 6 removed
- 8 32 2 sync /dev/sdc
- 8 144 4 sync /dev/sdj
- 8 80 0 sync /dev/sdf
- 8 16 1 sync /dev/sdb
- 8 128 5 sync /dev/sdi
- 8 96 4 spare rebuilding /dev/sdg
It looks like /dev/sdg is not the original device: the log above shows that
RaidDevice 3 is missing, and /dev/sdg is the replacement for /dev/sdj.
So the reshape is still in progress, and somehow sdg is the replacement of
sdj. This matches the condition in raid5_run:
    if (rcu_access_pointer(conf->disks[i].replacement) &&
        conf->reshape_progress != MaxSector) {
            /* replacements and reshape simply do not mix. */
            pr_warn("md: cannot handle concurrent replacement and reshape.\n");
            goto abort;
    }
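To make the quoted condition easier to follow, here is a small user-space model of it in Python. It is only a sketch of the logic shown in the excerpt, not kernel code: `can_start` and its arguments are hypothetical names, and `MAX_SECTOR` is a stand-in that assumes the kernel's `MaxSector` is an all-ones 64-bit value.

```python
MAX_SECTOR = 2**64 - 1  # assumed stand-in for MaxSector ("no reshape in progress")

def can_start(has_replacement, reshape_progress):
    """Model of the quoted raid5_run() check, for illustration only.

    has_replacement: one bool per raid slot, True if a replacement
                     device is attached to that slot.
    reshape_progress: MAX_SECTOR when no reshape is pending.
    """
    reshaping = reshape_progress != MAX_SECTOR
    for slot_has_replacement in has_replacement:
        if slot_has_replacement and reshaping:
            # Corresponds to: "cannot handle concurrent replacement and reshape"
            return False
    return True

# Seven slots, slot 4 carrying a replacement (sdg for sdj in the listing above).
slots = [False, False, False, False, True, False, False]

print(can_start(slots, reshape_progress=1024))        # False: reshape pending (1024 is an arbitrary position)
print(can_start(slots, reshape_progress=MAX_SECTOR))  # True: replacement alone is accepted
print(can_start([False] * 7, reshape_progress=1024))  # True: reshape alone is accepted
```

In this model either condition on its own is accepted and only the combination is refused, which is why taking the replacement device out of the picture lets the check pass.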
I'm by no means a raid5 expert, but I would suggest removing /dev/sdg and
trying to assemble again.
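The slot reasoning above can also be checked mechanically from the `mdadm -D` listing. The following Python sketch uses the device/slot pairs copied from that listing. Reading a doubly claimed slot as "original plus replacement" is the interpretation given in this mail, not something the script proves.

```python
from collections import defaultdict

RAID_DEVICES = 7  # "Raid Devices" from the mdadm -D output

# (device, RaidDevice slot) pairs copied from the listing above
members = [
    ("/dev/sdc", 2),
    ("/dev/sdj", 4),
    ("/dev/sdf", 0),
    ("/dev/sdb", 1),
    ("/dev/sdi", 5),
    ("/dev/sdg", 4),  # listed as "spare rebuilding"
]

by_slot = defaultdict(list)
for device, slot in members:
    by_slot[slot].append(device)

missing = [slot for slot in range(RAID_DEVICES) if slot not in by_slot]
shared = {slot: devs for slot, devs in by_slot.items() if len(devs) > 1}

print("slots with no device:", missing)
print("slots claimed twice:", shared)
```

It reports slots 3 and 6 as empty and slot 4 as claimed by both /dev/sdj and /dev/sdg.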
Thanks,
Kuai