On Tue, Jan 17, 2017 at 4:16 PM, Andreas Dilger <adilger@xxxxxxxxx> wrote:
> On Jan 17, 2017, at 3:15 PM, Andiry Xu <andiry@xxxxxxxxx> wrote:
>> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@xxxxxxxxx> wrote:
>>> On 01/16, Darrick J. Wong wrote:
>>>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
>>>>> On 01/14, Slava Dubeyko wrote:
>>>>>>
>>>>>> ---- Original Message ----
>>>>>> Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
>>>>>> Sent: Jan 13, 2017 1:40 PM
>>>>>> From: "Verma, Vishal L" <vishal.l.verma@xxxxxxxxx>
>>>>>> To: lsf-pc@xxxxxxxxxxxxxxxxxxxxxxxxxx
>>>>>> Cc: linux-nvdimm@xxxxxxxxxxxx, linux-block@xxxxxxxxxxxxxxx, linux-fsdevel@xxxxxxxxxxxxxxx
>>>>>>
>>>>>>> The current implementation of badblocks, where we consult the
>>>>>>> badblocks list for every IO in the block driver works, and is a
>>>>>>> last option failsafe, but from a user perspective, it isn't the
>>>>>>> easiest interface to work with.
>>>>>>
>>>>>> As I remember, FAT and HFS+ specifications contain description of bad blocks
>>>>>> (physical sectors) table. I believe that this table was used for the case of
>>>>>> floppy media. But, finally, this table becomes to be the completely obsolete
>>>>>> artefact because mostly storage devices are reliably enough. Why do you need
>>>>
>>>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
>>>> doesn't support(??) extents or 64-bit filesystems, and might just be a
>>>> vestigial organ at this point. XFS doesn't have anything to track bad
>>>> blocks currently....
>>>>
>>>>>> in exposing the bad blocks on the file system level? Do you expect that next
>>>>>> generation of NVM memory will be so unreliable that file system needs to manage
>>>>>> bad blocks? What's about erasure coding schemes? Do file system really need to suffer
>>>>>> from the bad block issue?
>>>>>>
>>>>>> Usually, we are using LBAs and it is the responsibility of storage device to map
>>>>>> a bad physical block/page/sector into valid one. Do you mean that we have
>>>>>> access to physical NVM memory address directly? But it looks like that we can
>>>>>> have a "bad block" issue even we will access data into page cache's memory
>>>>>> page (if we will use NVM memory for page cache, of course). So, what do you
>>>>>> imply by "bad block" issue?
>>>>>
>>>>> We don't have direct physical access to the device's address space, in
>>>>> the sense the device is still free to perform remapping of chunks of NVM
>>>>> underneath us. The problem is that when a block or address range (as
>>>>> small as a cache line) goes bad, the device maintains a poison bit for
>>>>> every affected cache line. Behind the scenes, it may have already
>>>>> remapped the range, but the cache line poison has to be kept so that
>>>>> there is a notification to the user/owner of the data that something has
>>>>> been lost. Since NVM is byte addressable memory sitting on the memory
>>>>> bus, such a poisoned cache line results in memory errors and SIGBUSes.
>>>>> Compared to traditional storage where an app will get nice and friendly
>>>>> (relatively speaking..) -EIOs. The whole badblocks implementation was
>>>>> done so that the driver can intercept IO (i.e. reads) to _known_ bad
>>>>> locations, and short-circuit them with an EIO. If the driver doesn't
>>>>> catch these, the reads will turn into a memory bus access, and the
>>>>> poison will cause a SIGBUS.
>>>>
>>>> "driver" ... you mean XFS? Or do you mean the thing that makes pmem
>>>> look kind of like a traditional block device? :)
>>>
>>> Yes, the thing that makes pmem look like a block device :) --
>>> drivers/nvdimm/pmem.c
>>>
>>>>
>>>>> This effort is to try and make this badblock checking smarter - and try
>>>>> and reduce the penalty on every IO to a smaller range, which only the
>>>>> filesystem can do.
>>>>
>>>> Though... now that XFS merged the reverse mapping support, I've been
>>>> wondering if there'll be a resubmission of the device errors callback?
>>>> It still would be useful to be able to inform the user that part of
>>>> their fs has gone bad, or, better yet, if the buffer is still in memory
>>>> someplace else, just write it back out.
>>>>
>>>> Or I suppose if we had some kind of raid1 set up between memories we
>>>> could read one of the other copies and rewrite it into the failing
>>>> region immediately.
>>>
>>> Yes, that is kind of what I was hoping to accomplish via this
>>> discussion. How much would filesystems want to be involved in this sort
>>> of badblocks handling, if at all. I can refresh my patches that provide
>>> the fs notification, but that's the easy bit, and a starting point.
>>>
>>
>> I have some questions. Why moving badblock handling to file system
>> level avoid the checking phase? In file system level for each I/O I
>> still have to check the badblock list, right? Do you mean during mount
>> it can go through the pmem device and locates all the data structures
>> mangled by badblocks and handle them accordingly, so that during
>> normal running the badblocks will never be accessed? Or, if there is
>> replication/snapshot support, use a copy to recover the badblocks?
>
> With ext4 badblocks, the main outcome is that the bad blocks would be
> permanently marked in the allocation bitmap as being used, and they would
> never be allocated to a file, so they should never be accessed unless
> doing a full device scan (which ext4 and e2fsck never do). That would
> avoid the need to check every I/O against the bad blocks list, if the
> driver knows that the filesystem will handle this.
>

Thank you for the explanation. However, this only works for free blocks,
right? What about allocated blocks, like file data and metadata?
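To make the distinction concrete, here is a toy sketch in Python. It is
not ext4 code: ToyAllocator, retire() and allocate() are made-up names,
and the single flat bitmap is an assumption for illustration only.
Marking a bad block as in-use keeps the allocator away from it with no
per-I/O check, but a block that was already allocated when it went bad
is a different case, and the bitmap can only report that something may
have been lost.

```python
class ToyAllocator:
    """Toy block allocator for illustration; hypothetical, not ext4 code."""

    def __init__(self, nblocks, badblocks=()):
        self.in_use = [False] * nblocks  # allocation bitmap, one flag per block
        self.bad = set()                 # blocks known to be bad
        for block in badblocks:          # e.g. known-bad blocks at mkfs/mount time
            self.retire(block)

    def retire(self, block):
        """Mark a block bad by setting its bitmap bit.

        Returns True if the block was already allocated to something,
        i.e. file data or metadata may have been lost.
        """
        was_allocated = self.in_use[block] and block not in self.bad
        self.in_use[block] = True        # the allocator will never hand it out
        self.bad.add(block)
        return was_allocated

    def allocate(self):
        """Return the first free block; bad blocks are skipped via the bitmap."""
        for block, used in enumerate(self.in_use):
            if not used:
                self.in_use[block] = True
                return block
        raise OSError("no space left")


alloc = ToyAllocator(8, badblocks=[0, 2])
first = alloc.allocate()       # skips block 0, which is marked bad
second = alloc.allocate()      # skips block 2 as well
lost = alloc.retire(second)    # an allocated block going bad is the open case
print(first, second, lost)
```

In this sketch the free-block case needs no lookup at I/O time, while the
retire() return value is where recovery (a replica, a snapshot, or an
error to the owner of the data) would have to hook in.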
Thanks,
Andiry

> The one caveat is that ext4 only allows 32-bit block numbers in the
> badblocks list, since this feature hasn't been used in a long time.
> This is good for up to 16TB filesystems, but if there was a demand to
> use this feature again it would be possible to allow 64-bit block numbers.
>
> Cheers, Andreas
>