Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems

On Tue, Jan 17, 2017 at 4:16 PM, Andreas Dilger <adilger@xxxxxxxxx> wrote:
> On Jan 17, 2017, at 3:15 PM, Andiry Xu <andiry@xxxxxxxxx> wrote:
>> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@xxxxxxxxx> wrote:
>>> On 01/16, Darrick J. Wong wrote:
>>>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
>>>>> On 01/14, Slava Dubeyko wrote:
>>>>>>
>>>>>> ---- Original Message ----
>>>>>> Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
>>>>>> Sent: Jan 13, 2017 1:40 PM
>>>>>> From: "Verma, Vishal L" <vishal.l.verma@xxxxxxxxx>
>>>>>> To: lsf-pc@xxxxxxxxxxxxxxxxxxxxxxxxxx
>>>>>> Cc: linux-nvdimm@xxxxxxxxxxxx, linux-block@xxxxxxxxxxxxxxx, linux-fsdevel@xxxxxxxxxxxxxxx
>>>>>>
>>>>>>> The current implementation of badblocks, where we consult the
>>>>>>> badblocks list for every IO in the block driver, works and is a
>>>>>>> last-option failsafe, but from a user perspective it isn't the
>>>>>>> easiest interface to work with.
>>>>>>
>>>>>> As I remember, the FAT and HFS+ specifications contain a description of a
>>>>>> bad blocks (physical sectors) table. I believe this table was used for
>>>>>> floppy media, but it has since become a completely obsolete artefact,
>>>>>> because most storage devices are reliable enough. Why do you need
>>>>
>>>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
>>>> doesn't support(??) extents or 64-bit filesystems, and might just be a
>>>> vestigial organ at this point.  XFS doesn't have anything to track bad
>>>> blocks currently....
>>>>
>>>>>> to expose the bad blocks at the file system level?  Do you expect that the
>>>>>> next generation of NVM memory will be so unreliable that the file system
>>>>>> needs to manage bad blocks? What about erasure coding schemes? Does the file
>>>>>> system really need to suffer from the bad block issue?
>>>>>>
>>>>>> Usually we use LBAs, and it is the responsibility of the storage device to
>>>>>> map a bad physical block/page/sector to a valid one. Do you mean that we
>>>>>> have direct access to the physical NVM memory address space? But it looks
>>>>>> like we can hit a "bad block" issue even when accessing data in a page
>>>>>> cache memory page (if we use NVM memory for the page cache, of course). So,
>>>>>> what do you mean by the "bad block" issue?
>>>>>
>>>>> We don't have direct physical access to the device's address space, in the
>>>>> sense that the device is still free to perform remapping of chunks of NVM
>>>>> underneath us. The problem is that when a block or address range (as
>>>>> small as a cache line) goes bad, the device maintains a poison bit for
>>>>> every affected cache line. Behind the scenes, it may have already
>>>>> remapped the range, but the cache line poison has to be kept so that
>>>>> there is a notification to the user/owner of the data that something has
>>>>> been lost. Since NVM is byte-addressable memory sitting on the memory
>>>>> bus, such a poisoned cache line results in memory errors and SIGBUSes,
>>>>> compared to traditional storage, where an app will get nice and friendly
>>>>> (relatively speaking..) -EIOs. The whole badblocks implementation was
>>>>> done so that the driver can intercept IO (i.e. reads) to _known_ bad
>>>>> locations, and short-circuit them with an EIO. If the driver doesn't
>>>>> catch these, the reads will turn into a memory bus access, and the
>>>>> poison will cause a SIGBUS.
>>>>
>>>> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
>>>> look kind of like a traditional block device? :)
>>>
>>> Yes, the thing that makes pmem look like a block device :) --
>>> drivers/nvdimm/pmem.c
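
For illustration, here is a minimal user-space sketch of that per-IO check.
This is not the actual drivers/nvdimm/pmem.c code; the badblocks table and
all names below are made up:

/*
 * Sketch: the driver keeps a table of known-bad sector ranges and consults
 * it on every read, failing the request with -EIO instead of letting the
 * access reach a poisoned cache line and raise a SIGBUS.
 */
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct bad_range {
	uint64_t start;	/* first bad sector */
	uint64_t len;	/* number of bad sectors */
};

/* hypothetical per-device badblocks table */
static const struct bad_range badblocks[] = {
	{ .start = 1024, .len = 8 },
};

static bool range_is_bad(uint64_t sector, uint64_t nr_sectors)
{
	for (size_t i = 0; i < sizeof(badblocks) / sizeof(badblocks[0]); i++) {
		uint64_t bad_start = badblocks[i].start;
		uint64_t bad_end = bad_start + badblocks[i].len;

		/* does the request overlap a known-bad range? */
		if (sector < bad_end && sector + nr_sectors > bad_start)
			return true;
	}
	return false;
}

/* short-circuit reads that touch a known-bad range with -EIO */
static int pmem_read_sketch(void *dst, const void *pmem_base,
			    uint64_t sector, uint64_t nr_sectors)
{
	if (range_is_bad(sector, nr_sectors))
		return -EIO;

	memcpy(dst, (const char *)pmem_base + sector * 512, nr_sectors * 512);
	return 0;
}
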
>>>
>>>>
>>>>> This effort is to try to make this badblock checking smarter, and to
>>>>> reduce the per-IO penalty by narrowing the check to a smaller range,
>>>>> which only the filesystem can do.
>>>>
>>>> Though... now that XFS merged the reverse mapping support, I've been
>>>> wondering if there'll be a resubmission of the device errors callback?
>>>> It still would be useful to be able to inform the user that part of
>>>> their fs has gone bad, or, better yet, if the buffer is still in memory
>>>> someplace else, just write it back out.
>>>>
>>>> Or I suppose if we had some kind of raid1 set up between memories we
>>>> could read one of the other copies and rewrite it into the failing
>>>> region immediately.
>>>
>>> Yes, that is kind of what I was hoping to accomplish via this
>>> discussion. How much would filesystems want to be involved in this sort
>>> of badblocks handling, if at all? I can refresh my patches that provide
>>> the fs notification, but that's the easy bit, and a starting point.
>>>
>>
>> I have some questions. Why does moving badblock handling to the file
>> system level avoid the checking phase? At the file system level, for each
>> I/O I still have to check the badblock list, right? Do you mean that during
>> mount it can go through the pmem device, locate all the data structures
>> mangled by badblocks, and handle them accordingly, so that during normal
>> operation the badblocks will never be accessed? Or, if there is
>> replication/snapshot support, use a copy to recover the badblocks?
>
> With ext4 badblocks, the main outcome is that the bad blocks would be
> permanently marked in the allocation bitmap as being used, and they would
> never be allocated to a file, so they should never be accessed unless
> doing a full device scan (which ext4 and e2fsck never do).  That would
> avoid the need to check every I/O against the bad blocks list, if the
> driver knows that the filesystem will handle this.
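
A toy sketch of that approach (not the real ext4/e2fsprogs code; the names
and sizes are made up): the bad blocks are set once in the block allocation
bitmap, and the allocator then simply never hands them out, so no per-IO
check is needed.

#include <stddef.h>
#include <stdint.h>

#define BLOCKS_PER_GROUP 32768

static uint8_t block_bitmap[BLOCKS_PER_GROUP / 8];	/* one bit per block */

static void mark_block_used(uint32_t blk)
{
	block_bitmap[blk / 8] |= 1u << (blk % 8);
}

/* run once, e.g. while building the filesystem or during fsck */
static void reserve_bad_blocks(const uint32_t *bad, size_t nr_bad)
{
	for (size_t i = 0; i < nr_bad; i++)
		mark_block_used(bad[i]);
}

/* the allocator only ever returns blocks whose bit is still clear */
static int alloc_block(uint32_t *out)
{
	for (uint32_t blk = 0; blk < BLOCKS_PER_GROUP; blk++) {
		if (!(block_bitmap[blk / 8] & (1u << (blk % 8)))) {
			mark_block_used(blk);
			*out = blk;
			return 0;
		}
	}
	return -1;	/* no free blocks */
}
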
>

Thank you for the explanation. However, this only works for free blocks,
right? What about allocated blocks, like file data and metadata?

Thanks,
Andiry

> The one caveat is that ext4 only allows 32-bit block numbers in the
> badblocks list, since this feature hasn't been used in a long time.
> This is good for up to 16TB filesystems, but if there were demand to use
> this feature again, it would be possible to allow 64-bit block numbers.
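
(For reference, with the default 4 KiB block size, 2^32 block numbers cover
2^32 * 4096 bytes = 16 TiB, which is where the 16TB figure above comes from.)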
>
> Cheers, Andreas
>
>
>
>
>