Re: RAID6: "Bad block number requested"

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Mon, 11 Jun 2018 08:24:28 -0700

On Mon, 2018-06-11 at 11:18 -0400, Douglas Gilbert wrote:
> On 2018-06-11 11:06 AM, James Bottomley wrote:
> > On Mon, 2018-06-11 at 16:24 +0200, Sebastian Hegler wrote:
> > > Dear all,
> > > 
> > > First off: sorry for cross-posting. I don't know if this is a
> > > RAID
> > > issue or a SCSI issue, so I'll just ask y'all.
> > > 
> > > 
> > > For a RAID6 capacity upgrade (higher capacity drives), we bought
> > > some
> > > 10TB disks:
> > > ==================
> > > Apr 17 11:16:05 kuiper kernel: [12795386.862031] scsi 6:0:36:0:
> > > Direct-Access     ATA      HGST HUH721010AL T21D PQ: 0 ANSI: 6
> > > Apr 17 11:16:05 kuiper kernel: [12795386.919904] scsi 6:0:36:0:
> > > atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y),
> > > sw_preserve(y)
> > > Apr 17 11:16:05 kuiper kernel: [12795386.974186] sd 6:0:36:0:
> > > [sdl]
> > > 2441609216 4096-byte logical blocks: (10.0 TB/9.10 TiB)
> > 
> > Well, this is the problem: a 4k logical (presumably 4k physical)
> > drive
> > cannot be addressed in block sectors that are not divisible by
> > 8.  This
> > type of drive configuration is very unusual (although it was
> > something
> > we tested years ago before the industry realised it had to ship
> > drives
> > with 4k physical but 512 byte logical sectors because of the legacy
> > problem).
> > 
> > > Apr 17 11:16:05 kuiper kernel: [12795386.998016] sd 6:0:36:0:
> > > [sdl]
> > > Write Protect is off
> > > Apr 17 11:16:05 kuiper kernel: [12795387.000625] sd 6:0:36:0:
> > > Attached scsi generic sg12 type 0
> > > Apr 17 11:16:05 kuiper kernel: [12795387.035341] sd 6:0:36:0:
> > > [sdl]
> > > Mode Sense: 7f 00 10 08
> > > Apr 17 11:16:05 kuiper kernel: [12795387.035679] sd 6:0:36:0:
> > > [sdl]
> > > Write cache: enabled, read cache: enabled, supports DPO and FUA
> > > Apr 17 11:16:05 kuiper kernel: [12795387.098315] sd 6:0:36:0:
> > > [sdl]
> > > Attached SCSI disk
> > > ==================
> > > 
> > > RAID add and rebuild operations went fine. However, some minutes
> > > after rebuild completion, several hundreds of these error
> > > messages
> > > started to appear:
> > > ==================
> > > Apr 20 03:37:29 kuiper kernel: [13027072.454811] sd 6:0:36:0:
> > > [sdl]
> > > Bad block number requested
> > 
> > This means that somehow, something sent a non 4k aligned 4k sized
> > request. SCSI here is just the messenger.  However, if you apply
> > this
> > patch, it will capture the stack trace of what above it triggered
> > this,
> > which may help us in debugging.  It could be we may also want to
> > see
> > what the values of block and blk_rq_sectors(rq) actually are, but
> > lets
> > begin with the stack trace.
> > 
> > James
> > 
> > ---
> > 
> > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> > index 9421d9877730..ac865e048533 100644
> > --- a/drivers/scsi/sd.c
> > +++ b/drivers/scsi/sd.c
> > @@ -1109,6 +1109,7 @@ static int sd_setup_read_write_cmnd(struct
> > scsi_cmnd *SCpnt)
> >   		if ((block & 7) || (blk_rq_sectors(rq) & 7)) {
> >   			scmd_printk(KERN_ERR, SCpnt,
> >   				    "Bad block number
> > requested\n");
> 
> Not a very informative error message. How about a quasi SCSI one
> like:
>      Logical Block out of range, due to different block sizes

Well, this is supposed to be an impossible condition: if someone wants
to use non-512 byte logical drives, they're supposed to align the stack
to ensure it works.  In this case, it looks like the stack is aligned
but one component isn't completely 4k safe.  I don't think there's any
message we can print in SCSI that would help with debugging that.

I agree the message we print could be more informative about what went
wrong, but it's unlikely to be helpful to the user who sees it.

James