Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems

"Verma, Vishal L" <vishal.l.verma@xxxxxxxxx> · Fri, 20 Jan 2017 00:55:15 +0000



On Tue, 2017-01-17 at 18:01 -0800, Andiry Xu wrote:
> On Tue, Jan 17, 2017 at 4:16 PM, Andreas Dilger <adilger@xxxxxxxxx>
> wrote:
> > On Jan 17, 2017, at 3:15 PM, Andiry Xu <andiry@xxxxxxxxx> wrote:
> > > On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@inte
> > > l.com> wrote:
> > > > On 01/16, Darrick J. Wong wrote:
> > > > > On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> > > > > > On 01/14, Slava Dubeyko wrote:
> > > > > > > 
> > > > > > > ---- Original Message ----
> > > > > > > Subject: [LSF/MM TOPIC] Badblocks checking/representation
> > > > > > > in filesystems
> > > > > > > Sent: Jan 13, 2017 1:40 PM
> > > > > > > From: "Verma, Vishal L" <vishal.l.verma@xxxxxxxxx>
> > > > > > > To: lsf-pc@xxxxxxxxxxxxxxxxxxxxxxxxxx
> > > > > > > Cc: linux-nvdimm@xxxxxxxxxxxx, linux-block@xxxxxxxxxxxxxxx
> > > > > > > , linux-fsdevel@xxxxxxxxxxxxxxx
> > > > > > > 
> > > > > > > > The current implementation of badblocks, where we
> > > > > > > > consult the
> > > > > > > > badblocks list for every IO in the block driver works,
> > > > > > > > and is a
> > > > > > > > last option failsafe, but from a user perspective, it
> > > > > > > > isn't the
> > > > > > > > easiest interface to work with.
> > > > > > > 
> > > > > > > As I remember, FAT and HFS+ specifications contain
> > > > > > > description of bad blocks
> > > > > > > (physical sectors) table. I believe that this table was
> > > > > > > used for the case of
> > > > > > > floppy media. But, finally, this table becomes to be the
> > > > > > > completely obsolete
> > > > > > > artefact because mostly storage devices are reliably
> > > > > > > enough. Why do you need
> > > > > 
> > > > > ext4 has a badblocks inode to own all the bad spots on disk,
> > > > > but ISTR it
> > > > > doesn't support(??) extents or 64-bit filesystems, and might
> > > > > just be a
> > > > > vestigial organ at this point.  XFS doesn't have anything to
> > > > > track bad
> > > > > blocks currently....
> > > > > 
> > > > > > > in exposing the bad blocks on the file system level?  Do
> > > > > > > you expect that next
> > > > > > > generation of NVM memory will be so unreliable that file
> > > > > > > system needs to manage
> > > > > > > bad blocks? What's about erasure coding schemes? Do file
> > > > > > > system really need to suffer
> > > > > > > from the bad block issue?
> > > > > > > 
> > > > > > > Usually, we are using LBAs and it is the responsibility of
> > > > > > > storage device to map
> > > > > > > a bad physical block/page/sector into valid one. Do you
> > > > > > > mean that we have
> > > > > > > access to physical NVM memory address directly? But it
> > > > > > > looks like that we can
> > > > > > > have a "bad block" issue even we will access data into
> > > > > > > page cache's memory
> > > > > > > page (if we will use NVM memory for page cache, of
> > > > > > > course). So, what do you
> > > > > > > imply by "bad block" issue?
> > > > > > 
> > > > > > We don't have direct physical access to the device's address
> > > > > > space, in
> > > > > > the sense the device is still free to perform remapping of
> > > > > > chunks of NVM
> > > > > > underneath us. The problem is that when a block or address
> > > > > > range (as
> > > > > > small as a cache line) goes bad, the device maintains a
> > > > > > poison bit for
> > > > > > every affected cache line. Behind the scenes, it may have
> > > > > > already
> > > > > > remapped the range, but the cache line poison has to be kept
> > > > > > so that
> > > > > > there is a notification to the user/owner of the data that
> > > > > > something has
> > > > > > been lost. Since NVM is byte addressable memory sitting on
> > > > > > the memory
> > > > > > bus, such a poisoned cache line results in memory errors and
> > > > > > SIGBUSes.
> > > > > > Compared to tradational storage where an app will get nice
> > > > > > and friendly
> > > > > > (relatively speaking..) -EIOs. The whole badblocks
> > > > > > implementation was
> > > > > > done so that the driver can intercept IO (i.e. reads) to
> > > > > > _known_ bad
> > > > > > locations, and short-circuit them with an EIO. If the driver
> > > > > > doesn't
> > > > > > catch these, the reads will turn into a memory bus access,
> > > > > > and the
> > > > > > poison will cause a SIGBUS.
> > > > > 
> > > > > "driver" ... you mean XFS?  Or do you mean the thing that
> > > > > makes pmem
> > > > > look kind of like a traditional block device? :)
> > > > 
> > > > Yes, the thing that makes pmem look like a block device :) --
> > > > drivers/nvdimm/pmem.c
> > > > 
> > > > > 
> > > > > > This effort is to try and make this badblock checking
> > > > > > smarter - and try
> > > > > > and reduce the penalty on every IO to a smaller range, which
> > > > > > only the
> > > > > > filesystem can do.
> > > > > 
> > > > > Though... now that XFS merged the reverse mapping support,
> > > > > I've been
> > > > > wondering if there'll be a resubmission of the device errors
> > > > > callback?
> > > > > It still would be useful to be able to inform the user that
> > > > > part of
> > > > > their fs has gone bad, or, better yet, if the buffer is still
> > > > > in memory
> > > > > someplace else, just write it back out.
> > > > > 
> > > > > Or I suppose if we had some kind of raid1 set up between
> > > > > memories we
> > > > > could read one of the other copies and rewrite it into the
> > > > > failing
> > > > > region immediately.
> > > > 
> > > > Yes, that is kind of what I was hoping to accomplish via this
> > > > discussion. How much would filesystems want to be involved in
> > > > this sort
> > > > of badblocks handling, if at all. I can refresh my patches that
> > > > provide
> > > > the fs notification, but that's the easy bit, and a starting
> > > > point.
> > > > 
> > > 
> > > I have some questions. Why moving badblock handling to file system
> > > level avoid the checking phase? In file system level for each I/O
> > > I
> > > still have to check the badblock list, right? Do you mean during
> > > mount
> > > it can go through the pmem device and locates all the data
> > > structures
> > > mangled by badblocks and handle them accordingly, so that during
> > > normal running the badblocks will never be accessed? Or, if there
> > > is
> > > replicataion/snapshot support, use a copy to recover the
> > > badblocks?
> > 
> > With ext4 badblocks, the main outcome is that the bad blocks would
> > be
> > pemanently marked in the allocation bitmap as being used, and they
> > would
> > never be allocated to a file, so they should never be accessed
> > unless
> > doing a full device scan (which ext4 and e2fsck never do).  That
> > would
> > avoid the need to check every I/O against the bad blocks list, if
> > the
> > driver knows that the filesystem will handle this.
> > 
> 
> Thank you for explanation. However this only works for free blocks,
> right? What about allocated blocks, like file data and metadata?
> 
Like Andreas said, the ext4 badblocks feature has not been in use, and
the current block layer badblocks are distinct and unrelated to these.
Can the ext4 badblocks infrastructure be revived and extended if we
decide to add badblocks to filesystems? Maybe - that was one of the
topics I was hoping to discuss/find out more about.��.n��������+%������w��{.n�����{���)��jg��������ݢj����G�������j:+v���w�m������w�������h�����٥