Re: Please show descriptive message about degraded raid when booting

By the way, regarding my original post: it's a virtual machine. I
disconnected one of the members of the RAID 1 array.
I can't simulate a hardware failure with a VM, so there is no 'SCT
Error Recovery Control/TLER' timeout involved.

Thanks,
Patrick

On Tue, Mar 24, 2020 at 2:33 AM Patrick Dung <patdung100@xxxxxxxxx> wrote:
>
> Thanks for reply.
>
> The problem occurs on my physical hardware and in a virtual machine
> (where TLER can't be set).
> The log in my original post was captured from a virtual machine.
>
> The system is not 'hung'. If I boot with rd.debug, there are lots of
> messages scrolling by too quickly to read clearly.
>
> What I am asking for is a more descriptive message from MD RAID that
> displays the status, like:
> Trying to activate md/raid1:md125, currently 1 of 2 disks online.
> Timeout in X seconds.
> Something like that.
>
> Thanks,
> Patrick
>
> On Tue, Mar 24, 2020 at 2:14 AM Roger Heflin <rogerheflin@xxxxxxxxx> wrote:
> >
> > The system had hung: the disks are failing inside the SCSI
> > subsystem, and I don't believe the upper layers (RAID, LVM,
> > multipath) will know anything about what is going on inside the
> > SCSI layer.
> >
> > Those default timeouts are usually at least 30 seconds, but in the
> > past the SCSI subsystem did some retrying internally.  The timeout
> > needs to be higher than the length of time the disk could take.
> > Non-enterprise, non-RAID disks generally have this timeout set to
> > 60-120 seconds, hence MD waits to see whether the failure is a
> > sector read failure (no response until the disk timeout) or a
> > complete disk failure (no response ever).
> >
> > cat /sys/block/sda/device/timeout shows the timeout.
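> >
> > For example, to check it and (temporarily, until reboot) lower it,
> > something like this should work (sda is just an example device):
> >
> >   cat /sys/block/sda/device/timeout
> >   echo 10 > /sys/block/sda/device/timeout    # as root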
> >
> > Read about SCT ERC (scterc), TLER and smartctl for discussions about what is going on.
> >
> > If you can turn down your disk's maximum error-recovery timeout
> > with the smartctl commands, the disk will report back a sector
> > failure faster, and that is usually what is happening.  If you turn
> > down the disk's timeout to a maximum of, say, 7 seconds, then you
> > can set the SCSI layer's timeout to, say, 10 seconds.  Then the
> > only time the SCSI timeout matters is if the disk is there but not
> > responding.
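> >
> > For example, on a disk that supports SCT ERC (the values are in
> > tenths of a second, so 70 means 7 seconds; /dev/sda is just an
> > example device):
> >
> >   smartctl -l scterc /dev/sda          # show the current ERC settings
> >   smartctl -l scterc,70,70 /dev/sda    # set read/write ERC to 7 seconds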
> >
> >
> > On Fri, Mar 20, 2020 at 11:50 AM Patrick Dung <patdung100@xxxxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > > Bump.
> > >
> > > Got a reply from Fedora support but asking me to find upstream.
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1794139
> > >
> > > Thanks,
> > > Patrick
> > >
> > > On Thu, Mar 5, 2020 at 10:57 PM Patrick Dung <patdung100@xxxxxxxxx> wrote:
> > > >
> > > > Hello,
> > > >
> > > > The system has Linux software RAID (md) RAID 1.
> > > > One of the disks is missing or has a problem.
> > > >
> > > > The RAID is degraded.
> > > > When the OS boots, it appears to hang at the kernel messages
> > > > timestamped at about three seconds.
> > > > There is no descriptive message that the RAID is degraded.
> > > > I know the cause because I had written zeros to one of the disks
> > > > of the RAID 1. If I didn't know the cause (maybe a loose cable
> > > > or a disk failure), it would be confusing.
> > > >
> > > > Related log:
> > > >
> > > > [    2.917387] sd 32:0:0:0: [sda] 56623104 512-byte logical blocks:
> > > > (29.0 GB/27.0 GiB)
> > > > [    2.917446] sd 32:0:1:0: [sdb] 56623104 512-byte logical blocks:
> > > > (29.0 GB/27.0 GiB)
> > > > [    2.917499] sd 32:0:0:0: [sda] Write Protect is off
> > > > [    2.917516] sd 32:0:0:0: [sda] Mode Sense: 61 00 00 00
> > > > [    2.917557] sd 32:0:1:0: [sdb] Write Protect is off
> > > > [    2.917575] sd 32:0:1:0: [sdb] Mode Sense: 61 00 00 00
> > > > [    2.917615] sd 32:0:0:0: [sda] Cache data unavailable
> > > > [    2.917636] sd 32:0:0:0: [sda] Assuming drive cache: write through
> > > > [    2.917661] sd 32:0:1:0: [sdb] Cache data unavailable
> > > > [    2.917677] sd 32:0:1:0: [sdb] Assuming drive cache: write through
> > > > [    2.927076] sd 32:0:0:0: [sda] Attached SCSI disk
> > > > [    2.927458]  sdb: sdb1 sdb2 sdb3 sdb4
> > > > [    2.929018] sd 32:0:1:0: [sdb] Attached SCSI disk
> > > > [    3.060855] vmxnet3 0000:0b:00.0 ens192: intr type 3, mode 0, 3
> > > > vectors allocated
> > > > [    3.061826] vmxnet3 0000:0b:00.0 ens192: NIC Link is Up 10000 Mbps
> > > > [  139.411464] md/raid1:md125: active with 1 out of 2 mirrors
> > > > [  139.412176] md125: detected capacity change from 0 to 1073676288
> > > > [  139.433441] md/raid1:md126: active with 1 out of 2 mirrors
> > > > [  139.434182] md126: detected capacity change from 0 to 314507264
> > > > [  139.436894]  md126:
> > > > [  139.455511] md/raid1:md127: active with 1 out of 2 mirrors
> > > > [  139.456739] md127: detected capacity change from 0 to 27582726144
> > > >
> > > > So there are about 130 seconds without any descriptive messages.
> > > > I thought the system had hung.
> > > >
> > > > Could the kernel display more descriptive messages about the RAID?
> > > >
> > > > If I use the rd.debug boot parameter, I know the kernel is
> > > > still running, but the output scrolls by very fast without
> > > > actually showing what the problem is.
> > > >
> > > > Thanks,
> > > > Patrick


