Re: Fwd: mdadm RAID5 to RAID6 migration thrown exceptions, access to data lost

Krzysztof Jakóbczyk <krzysiek.jakobczyk@xxxxxxxxx> · Tue, 3 Sep 2019 17:27:06 +0200

Good afternoon Phil,

I'm sorry if I didn't follow the etiquette for the mailing lists. It's
actually the first time I'm using one, so I need to do some research
on that topic, but the situation was quite stressful in the begining
and I've skipped that. I hope that the message looks better now.

I think the situation looks worse now. The reshape finished and resync
has begun. During the resync I've found many errors concerning
/dev/sdf which is the part of the md127. Example below:

[77929.632264] md: md127: reshape done.
[77930.353038] md: resync of RAID array md127
[78358.476585] ata5.00: exception Emask 0x0 SAct 0xffffffff SErr 0x0 action 0x0
[78358.477393] ata5.00: irq_stat 0x40000008
[78358.477826] ata5.00: failed command: READ FPDMA QUEUED
[78358.478260] ata5.00: cmd 60/40:b0:f0:41:cd/05:00:07:00:00/40 tag 22
ncq dma 688128 in
                        res 41/40:00:b8:46:cd/00:00:07:00:00/40 Emask
0x409 (media error) <F>
[78358.479118] ata5.00: status: { DRDY ERR }
[78358.479520] ata5.00: error: { UNC }
[78358.481178] ata5.00: configured for UDMA/133
[78358.481348] sd 4:0:0:0: [sdf] tag#22 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[78358.481352] sd 4:0:0:0: [sdf] tag#22 Sense Key : Medium Error [current]
[78358.481355] sd 4:0:0:0: [sdf] tag#22 Add. Sense: Unrecovered read
error - auto reallocate failed
[78358.481359] sd 4:0:0:0: [sdf] tag#22 CDB: Read(16) 88 00 00 00 00
00 07 cd 41 f0 00 00 05 40 00 00
[78358.481362] print_req_error: I/O error, dev sdf, sector 130893496
[78358.481821] ata5: EH complete
[78363.606588] ata5.00: exception Emask 0x0 SAct 0xffffffff SErr 0x0 action 0x0
[78363.607550] ata5.00: irq_stat 0x40000008
[78363.608086] ata5.00: failed command: READ FPDMA QUEUED
[78363.608650] ata5.00: cmd 60/40:98:70:76:cd/05:00:07:00:00/40 tag 19
ncq dma 688128 in
                        res 41/40:00:78:77:cd/00:00:07:00:00/40 Emask
0x409 (media error) <F>
[78363.609867] ata5.00: status: { DRDY ERR }
[78363.610494] ata5.00: error: { UNC }
[78363.612536] ata5.00: configured for UDMA/133
[78363.612671] sd 4:0:0:0: [sdf] tag#19 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[78363.612675] sd 4:0:0:0: [sdf] tag#19 Sense Key : Medium Error [current]
[78363.612678] sd 4:0:0:0: [sdf] tag#19 Add. Sense: Unrecovered read
error - auto reallocate failed
[78363.612682] sd 4:0:0:0: [sdf] tag#19 CDB: Read(16) 88 00 00 00 00
00 07 cd 76 70 00 00 05 40 00 00
[78363.612685] print_req_error: I/O error, dev sdf, sector 130905976
[78363.613474] ata5: EH complete
[78367.196566] ata5.00: exception Emask 0x0 SAct 0xffffffff SErr 0x0 action 0x0
[78367.198040] ata5.00: irq_stat 0x40000008
[78367.198833] ata5.00: failed command: READ FPDMA QUEUED
[78367.199653] ata5.00: cmd 60/40:b8:30:47:cd/05:00:07:00:00/40 tag 23
ncq dma 688128 in
                        res 41/40:00:78:47:cd/00:00:07:00:00/40 Emask
0x409 (media error) <F>
[78367.201382] ata5.00: status: { DRDY ERR }
[78367.202276] ata5.00: error: { UNC }
[78367.204497] ata5.00: configured for UDMA/133
[78367.204626] sd 4:0:0:0: [sdf] tag#23 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[78367.204630] sd 4:0:0:0: [sdf] tag#23 Sense Key : Medium Error [current]
[78367.204634] sd 4:0:0:0: [sdf] tag#23 Add. Sense: Unrecovered read
error - auto reallocate failed
[78367.204638] sd 4:0:0:0: [sdf] tag#23 CDB: Read(16) 88 00 00 00 00
00 07 cd 47 30 00 00 05 40 00 00
[78367.204641] print_req_error: I/O error, dev sdf, sector 130893688
[78367.205718] ata5: EH complete
[78372.686561] ata5.00: exception Emask 0x0 SAct 0xffffffff SErr 0x0 action 0x0
[78372.688585] ata5.00: irq_stat 0x40000008
[78372.689654] ata5.00: failed command: READ FPDMA QUEUED
[78372.690751] ata5.00: cmd 60/40:78:b0:7b:cd/05:00:07:00:00/40 tag 15
ncq dma 688128 in
                        res 41/40:00:e8:7b:cd/00:00:07:00:00/40 Emask
0x409 (media error) <F>
[78372.693034] ata5.00: status: { DRDY ERR }
[78372.694194] ata5.00: error: { UNC }

These information repeats between 77929 and 78721 seconds. There's
really a lot of such information. What does that mean? Doesn't look
good, however I don't observer further errors since `dmesg | tail`
shows:

[78723.057686] raid5_end_read_request: 22 callbacks suppressed
[78723.057688] md/raid:md127: read error corrected (8 sectors at
130891448 on sdf1)
[78723.057719] md/raid:md127: read error corrected (8 sectors at
130903928 on sdf1)
[78723.057761] md/raid:md127: read error corrected (8 sectors at
130904144 on sdf1)
[78723.057804] md/raid:md127: read error corrected (8 sectors at
130904224 on sdf1)
[78723.061817] md/raid:md127: read error corrected (8 sectors at
130904280 on sdf1)
[78723.063796] md/raid:md127: read error corrected (8 sectors at
130891488 on sdf1)
[78723.063846] md/raid:md127: read error corrected (8 sectors at
130904024 on sdf1)
[78723.063889] md/raid:md127: read error corrected (8 sectors at
130904312 on sdf1)
[78723.063934] md/raid:md127: read error corrected (8 sectors at
130904376 on sdf1)
[78723.063980] md/raid:md127: read error corrected (8 sectors at
130904512 on sdf1)
[79225.164085] audit: type=1006 audit(1567523533.835:50): pid=19706
uid=0 old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=3
res=1
[79225.182760] audit: type=1130 audit(1567523533.855:51): pid=1 uid=0
auid=4294967295 ses=4294967295 msg='unit=user-runtime-dir@0
comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=?
terminal=? res=success'
[79225.188489] audit: type=1006 audit(1567523533.865:52): pid=19722
uid=0 old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=4
res=1
[79225.235782] audit: type=1130 audit(1567523533.905:53): pid=1 uid=0
auid=4294967295 ses=4294967295 msg='unit=user@0 comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=?
res=success'

The contents of `/proc/mdstat` are the following:

[root@sysresccd ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sda1[5] sdg1[6] sdd1[4] sdf1[3]
      7813771264 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
      [=>...................]  resync =  5.9% (234117656/3906885632)
finish=399.3min speed=153265K/sec
      bitmap: 7/30 pages [28KB], 65536KB chunk

unused devices: <none>

The `smartctl -a /def/sdf` shows 1 current pending sector:

[root@sysresccd ~]# smartctl -a /dev/sdf
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.34-1-lts] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E2ZTJ6S9
LU WWN Device Id: 5 0014ee 2629cfe27
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Sep  3 15:24:22 2019 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (51060) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection
on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 511) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE
UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail
Always       -       229
  3 Spin_Up_Time            0x0027   196   175   021    Pre-fail
Always       -       7175
  4 Start_Stop_Count        0x0032   096   096   000    Old_age
Always       -       4476
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail
Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age
Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age
Always       -       6548
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age
Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age
Always       -       0
 12 Power_Cycle_Count       0x0032   096   096   000    Old_age
Always       -       4476
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age
Always       -       89
193 Load_Cycle_Count        0x0032   198   198   000    Old_age
Always       -       8128
194 Temperature_Celsius     0x0022   106   106   000    Old_age
Always       -       46
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age
Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age
Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age
Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age
Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age
Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Do you think this drive should be replaced?
Why the array is resyncing?
Is the data in danger in this state?

Best regards,
Krzysztof Jakobczyk