Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems

Andiry Xu <andiry@xxxxxxxxx> · Tue, 17 Jan 2017 17:58:43 -0800

On Tue, Jan 17, 2017 at 3:51 PM, Vishal Verma <vishal.l.verma@xxxxxxxxx> wrote:
> On 01/17, Andiry Xu wrote:
>
> <snip>
>
>> >>
>> >> The pmem_do_bvec() read logic is like this:
>> >>
>> >> pmem_do_bvec()
>> >>     if (is_bad_pmem())
>> >>         return -EIO;
>> >>     else
>> >>         memcpy_from_pmem();
>> >>
>> >> Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
>> >> that even if a block is not in the badblock list, it still can be bad
>> >> and causes MCE? Does the badblock list get changed during file system
>> >> running? If that is the case, should the file system get a
>> >> notification when it gets changed? If a block is good when I first
>> >> read it, can I still trust it to be good for the second access?
>> >
>> > Yes, if a block is not in the badblocks list, it can still cause an
>> > MCE. This is the latent error case I described above. For a simple read()
>> > via the pmem driver, this will get handled by memcpy_mcsafe. For mmap,
>> > an MCE is inevitable.
>> >
>> > Yes the badblocks list may change while a filesystem is running. The RFC
>> > patches[1] I linked to add a notification for the filesystem when this
>> > happens.
>> >
>>
>> This is really bad and it makes file system implementation much more
>> complicated. And badblock notification does not help very much,
>> because any block can be bad potentially, no matter it is in badblock
>> list or not. And file system has to perform checking for every read,
>> using memcpy_mcsafe. This is disaster for file system like NOVA, which
>> uses pointer de-reference to access data structures on pmem. Now if I
>> want to read a field in an inode on pmem, I have to copy it to DRAM
>> first and make sure memcpy_mcsafe() does not report anything wrong.
>
> You have a good point, and I don't know if I have an answer for this..
> Assuming a system with MCE recovery, maybe NOVA can add a mce handler
> similar to nfit_handle_mce(), and handle errors as they happen, but I'm
> being very hand-wavey here and don't know how much/how well that might
> work..
>
>>
>> > No, if the media, for some reason, 'dvelops' a bad cell, a second
>> > consecutive read does have a chance of being bad. Once a location has
>> > been marked as bad, it will stay bad till the ACPI clear error 'DSM' has
>> > been called to mark it as clean.
>> >
>>
>> I wonder what happens to write in this case? If a block is bad but not
>> reported in badblock list. Now I write to it without reading first. Do
>> I clear the poison with the write? Or still require a ACPI DSM?
>
> With writes, my understanding is there is still a possibility that an
> internal read-modify-write can happen, and cause a MCE (this is the same
> as writing to a bad DRAM cell, which can also cause an MCE). You can't
> really use the ACPI DSM preemptively because you don't know whether the
> location was bad. The error flow will be something like write causes the
> MCE, a badblock gets added (either through the mce handler or after the
> next reboot), and the recovery path is now the same as a regular badblock.
>

This is different from my understanding. Right now write_pmem() in
pmem_do_bvec() does not use memcpy_mcsafe(). If the block is bad it
clears poison and writes to pmem again. Seems to me writing to bad
blocks does not cause MCE. Do we need memcpy_mcsafe for pmem stores?

Thanks,
Andiry

>>
>> > [1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html
>> >
>>
>> Thank you for the patchset. I will look into it.
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html