Re: [PATCH 3/5] vfs: add a zero-initialization mode to fallocate

On 9/27/2021 2:07 PM, Dave Chinner wrote:
On Thu, Sep 23, 2021 at 06:35:16PM -0700, Darrick J. Wong wrote:
On Thu, Sep 23, 2021 at 06:21:19PM -0700, Jane Chu wrote:

On 9/23/2021 6:18 PM, Dan Williams wrote:
On Thu, Sep 23, 2021 at 3:54 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:

On Wed, Sep 22, 2021 at 10:42:11PM -0700, Dan Williams wrote:
On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:

On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
[..]
Hence this discussion leads me to conclude that fallocate() simply
isn't the right interface to clear storage hardware poison state, and
it's much simpler for everyone - kernel and userspace - to provide a
pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
clear hardware error state before issuing this user write to the
hardware.

That flag would slot in nicely in dax_iomap_iter() as the gate for
whether dax_direct_access() should allow mapping over error ranges,
and then as a flag to dax_copy_from_iter() to indicate that it should
compare the incoming write to known poison and clear it before
proceeding.
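
For illustration only, a minimal userspace sketch of what such a write
could look like. RWF_CLEAR_HWERROR is just the name proposed above (no
such flag exists upstream), so the flag value and the repair_write()
helper below are hypothetical; pwritev2() and its RWF_* flags argument
are the existing interface, and a kernel that doesn't know the bit
would reject it like any other unknown RWF_* flag:

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_CLEAR_HWERROR
#define RWF_CLEAR_HWERROR	0x00000080	/* hypothetical value, illustration only */
#endif

/*
 * Hypothetical helper: overwrite a known-poisoned file range with good
 * data, asking the DAX write path to clear the hardware error state as
 * part of servicing the write.
 */
static ssize_t repair_write(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = {
		.iov_base = (void *)buf,
		.iov_len  = len,
	};

	return pwritev2(fd, &iov, 1, off, RWF_CLEAR_HWERROR);
}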

I like the distinction, because there's a chance the application did
not know that the page had experienced data loss and might want the
error behavior. The other service the driver could offer with this
flag is to do a precise check of the incoming write to make sure it
overlaps known poison and then repair the entire page. Repairing whole
pages makes for a cleaner implementation of the code that tries to
keep poison out of the CPU speculation path, {set,clear}_mce_nospec().

This flag could also be useful for preadv2() as there is currently no
way to read the good data in a PMEM page with poison via DAX. So the
flag would tell dax_direct_access() to again proceed in the face of
errors, but then the driver's dax_copy_to_iter() operation could
either read up to the precise byte offset of the error in the page, or
auto-replace error data with zeros to try to maximize data recovery.
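
A matching read-side sketch, equally hypothetical (the flag name here
follows the RWF_IGNORE_HWERROR suggestion in the reply below; only
preadv2() itself is an existing interface):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_IGNORE_HWERROR
#define RWF_IGNORE_HWERROR	0x00000100	/* hypothetical value, illustration only */
#endif

/*
 * Hypothetical helper: pull whatever is still readable out of a file
 * range that is known or suspected to contain poison.  Today such a
 * read on DAX fails with EIO; with the proposed flag the driver would
 * copy out the good bytes and either stop at the first bad byte or
 * substitute zeros for it, depending on the semantics chosen.
 */
static ssize_t salvage_read(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = {
		.iov_base = buf,
		.iov_len  = len,
	};

	return preadv2(fd, &iov, 1, off, RWF_IGNORE_HWERROR);
}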

Yes, it could. I like the idea - say RWF_IGNORE_HWERROR - to read
everything that can be read from the bad range because it's the
other half of the problem RWF_RESET_HWERROR is trying to address.
That is, the operation we want to perform on a range with an error
state is -data recovery-, not "reinitialisation". Data recovery
requires two steps:

- "try to recover the data from the bad storage"; and
- "reinitialise the data and clear the error state"

These naturally map to read() and write() operations, not
fallocate(). With RWF flags they become explicit data recovery
operations, unlike fallocate() which needs to imply that "writing
zeroes" == "reset hardware error state". While that reset method
may be true for a specific pmem hardware implementation, it is not a
requirement for all storage hardware. It's most definitely not a
requirement for future storage hardware, either.

It also means that applications have no choice about what data they
reinitialise the damaged range with, because fallocate() only supports
writing zeroes. If we've recovered data via a read() as you
suggest we could, then we can rebuild the data from other redundant
information and immediately write that back to the storage, hence
repairing the fault.

That, in turn, allows the filesystem to turn the RWF_RESET_HWERROR
write into an exclusive operation and hence allow the
reinitialisation with the recovered/repaired state to run atomically
w.r.t. all other filesystem operations, i.e. the reset write
completes the recovery operation instead of requiring separate
"reset" and "write recovered data into zeroed range" steps that
cannot be executed atomically by userspace...

/me nods

Jane, want to take a run at patches for this ^^^?


Sure, I'll give it a try.

Thank you all for the discussions!

Cool, thank you!

I'd like to propose a slight modification to the API: a single RWF
flag called RWF_RECOVER_DATA. On read, this means the storage tries
to read all the data it can from the range, and the parts it can't
read (cachelines, sectors, whatever) are returned as zeroes.

On write, this means the errors over the range get cleared and the
user data provided gets written over the top of whatever was there.
Filesystems should perform this as an exclusive operation to that
range of the file.

That way we only need one IOCB_RECOVERY flag, and for communicating
with lower storage layers (e.g. dm/md raid and/or hardware) only one
REQ_RECOVERY flag is needed in the bio.
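
To make the end-to-end flow concrete, a sketch of the recovery
sequence an application might run on one damaged extent under this
proposal. Everything named here is hypothetical except
preadv2()/pwritev2() themselves; rebuild_from_redundancy() stands in
for whatever redundancy the application has (mirror, parity, backup):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_RECOVER_DATA
#define RWF_RECOVER_DATA	0x00000080	/* hypothetical value, illustration only */
#endif

/* Application-provided: rebuild the lost bytes from redundant copies. */
extern void rebuild_from_redundancy(char *buf, size_t len, off_t off);

/*
 * Hypothetical recovery of one poisoned extent:
 *  1. read back what is still readable (unreadable parts arrive as zeroes),
 *  2. rebuild the lost bytes from the application's redundant data,
 *  3. write the repaired buffer back with the same flag, which clears the
 *     error state and lands the new data as one exclusive operation.
 */
static int recover_range(int fd, char *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	if (preadv2(fd, &iov, 1, off, RWF_RECOVER_DATA) != (ssize_t)len)
		return -1;

	rebuild_from_redundancy(buf, len, off);

	if (pwritev2(fd, &iov, 1, off, RWF_RECOVER_DATA) != (ssize_t)len)
		return -1;

	return 0;
}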

Thoughts?

Sounds cleaner.  Dan, your thoughts?

thanks,
-jane


Cheers,

Dave.



