Re: Please show descriptive message about degraded raid when booting

Roger Heflin <rogerheflin@xxxxxxxxx> · Mon, 23 Mar 2020 13:13:50 -0500

The system had hung.  The disks are failing inside the SCSI subsystem,
and I don't believe the layers above it (raid, lvm, multipath) will
know anything about what is going on inside the SCSI layer.

The default SCSI timeouts are usually at least 30 seconds, but in the
past the SCSI subsystem also did some retrying internally.  The timeout
needs to be higher than the longest time the disk could take to respond.
Non-enterprise, non-raid disks generally have their own timeout set to
60-120 seconds, hence MD waiting to see whether the failure is a sector
read failure (no response until the disk's timeout expires) or a
complete disk failure (no response ever).

cat /sys/block/sda/device/timeout shows the SCSI layer's timeout for
sda, in seconds.
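
To check every disk at once rather than just sda, something like the
following sketch should work.  The function name show_scsi_timeouts and
its optional directory argument are made up for this example (the
argument only exists so it can be pointed at a test directory); the
sysfs path is the same one as above.

```shell
#!/bin/sh
# Print the SCSI command timeout (in seconds) for every sd* disk.
# The optional first argument is a hypothetical override of the sysfs
# block directory; it defaults to /sys/block.
show_scsi_timeouts() {
    root="${1:-/sys/block}"
    for f in "$root"/sd*/device/timeout; do
        # Skip the unexpanded glob when no sd* disks exist.
        [ -r "$f" ] || continue
        disk=${f#"$root"/}    # strip the leading directory
        disk=${disk%%/*}      # keep only the disk name, e.g. sda
        printf '%s: %s\n' "$disk" "$(cat "$f")"
    done
}

show_scsi_timeouts
```

Each output line is the disk name followed by its timeout, so a disk
still at the default is easy to spot.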

Read about scterc, TLER and smartctl for discussions of what is going on.

If you can turn down your disk's maximum timeout with the smartctl
commands, the disk will report a sector failure sooner, and a sector
failure is usually what is happening.  If you turn the disk's timeout
down to a maximum of, say, 7 seconds, you can set the SCSI layer's
timeout to, say, 10 seconds.  Then the only time the SCSI timeout
matters is if the disk is there but not responding.
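
A rough sketch of the 7 second / 10 second example, assuming the drive
supports SCT ERC at all (many desktop drives do not).  The function
name set_disk_timeouts and the DRY_RUN switch are invented for this
example; "smartctl -l scterc,READ,WRITE" takes its values in units of
100 ms, so 7 seconds is 70.

```shell
#!/bin/sh
# Cap the drive's error-recovery time, then set the SCSI layer's
# timeout above it.  DRY_RUN is a hypothetical switch for this sketch:
# when set to 1 the commands are printed instead of run.
set_disk_timeouts() {
    disk=$1        # e.g. sda
    erc_secs=$2    # drive-side limit in whole seconds, e.g. 7
    scsi_secs=$3   # SCSI-layer limit in seconds, e.g. 10
    if [ "$scsi_secs" -le "$erc_secs" ]; then
        echo "SCSI timeout must be higher than the disk's limit" >&2
        return 1
    fi
    erc_ds=$((erc_secs * 10))   # smartctl wants units of 100 ms
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "smartctl -l scterc,$erc_ds,$erc_ds /dev/$disk"
        echo "echo $scsi_secs > /sys/block/$disk/device/timeout"
    else
        # Both steps need root.
        smartctl -l "scterc,$erc_ds,$erc_ds" "/dev/$disk" &&
            echo "$scsi_secs" > "/sys/block/$disk/device/timeout"
    fi
}

# Preview the commands for sda without changing anything.
DRY_RUN=1 set_disk_timeouts sda 7 10
```

Neither setting normally survives a reboot (and the drive setting is
usually lost on a power cycle), so it would have to be reapplied at
boot, for example from a udev rule or a startup script.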

On Fri, Mar 20, 2020 at 11:50 AM Patrick Dung <patdung100@xxxxxxxxx> wrote:
>
> Hello,
>
> Bump.
>
> Got a reply from Fedora support, but they asked me to take this upstream.
> https://bugzilla.redhat.com/show_bug.cgi?id=1794139
>
> Thanks,
> Patrick
>
> On Thu, Mar 5, 2020 at 10:57 PM Patrick Dung <patdung100@xxxxxxxxx> wrote:
> >
> > Hello,
> >
> > The system has Linux software raid (md) raid 1.
> > One of the disks is missing or has a problem.
> >
> > The raid is degraded.
> > When the OS boots, the kernel output stops at about the three-second
> > mark and the boot appears to hang.
> > There is no descriptive message that the RAID is degraded.
> > I know the problem because I had written zeros to one of the disks of
> > the raid 1. If I didn't know the problem (maybe a cable is loose or a
> > disk has failed), it would be confusing.
> >
> > Related log:
> >
> > [    2.917387] sd 32:0:0:0: [sda] 56623104 512-byte logical blocks:
> > (29.0 GB/27.0 GiB)
> > [    2.917446] sd 32:0:1:0: [sdb] 56623104 512-byte logical blocks:
> > (29.0 GB/27.0 GiB)
> > [    2.917499] sd 32:0:0:0: [sda] Write Protect is off
> > [    2.917516] sd 32:0:0:0: [sda] Mode Sense: 61 00 00 00
> > [    2.917557] sd 32:0:1:0: [sdb] Write Protect is off
> > [    2.917575] sd 32:0:1:0: [sdb] Mode Sense: 61 00 00 00
> > [    2.917615] sd 32:0:0:0: [sda] Cache data unavailable
> > [    2.917636] sd 32:0:0:0: [sda] Assuming drive cache: write through
> > [    2.917661] sd 32:0:1:0: [sdb] Cache data unavailable
> > [    2.917677] sd 32:0:1:0: [sdb] Assuming drive cache: write through
> > [    2.927076] sd 32:0:0:0: [sda] Attached SCSI disk
> > [    2.927458]  sdb: sdb1 sdb2 sdb3 sdb4
> > [    2.929018] sd 32:0:1:0: [sdb] Attached SCSI disk
> > [    3.060855] vmxnet3 0000:0b:00.0 ens192: intr type 3, mode 0, 3
> > vectors allocated
> > [    3.061826] vmxnet3 0000:0b:00.0 ens192: NIC Link is Up 10000 Mbps
> > [  139.411464] md/raid1:md125: active with 1 out of 2 mirrors
> > [  139.412176] md125: detected capacity change from 0 to 1073676288
> > [  139.433441] md/raid1:md126: active with 1 out of 2 mirrors
> > [  139.434182] md126: detected capacity change from 0 to 314507264
> > [  139.436894]  md126:
> > [  139.455511] md/raid1:md127: active with 1 out of 2 mirrors
> > [  139.456739] md127: detected capacity change from 0 to 27582726144
> >
> > So there are about 130 seconds without any descriptive messages. I
> > thought the system had hung.
> >
> > Could the kernel display more descriptive messages about the RAID?
> >
> > If I use the rd.debug boot parameter, I can see the kernel is still
> > running. But the output scrolls by very fast without showing what the
> > actual problem is.
> >
> > Thanks,
> > Patrick