Re: [PATCH 0/7] RFC: high-order folio support for I/O

On Thu, Jun 15, 2023 at 08:21:10AM +0200, Hannes Reinecke wrote:
> On 6/15/23 01:53, Dave Chinner wrote:
> > On Wed, Jun 14, 2023 at 05:06:14PM +0200, Hannes Reinecke wrote:
> > All you need to do now is run the BS > PS filesystems through a full
> > fstests pass (reflink + rmap enabled, auto group), and then we can
> > start on the real data integrity validation work. It'll need tens of
> > billions of fsx ops run on it, days of recoveryloop testing, days of
> > fsstress based exercise, etc. before we can actually enable it in
> > XFS....
> > 
> Hey, c'mon. I do know _that_. All I'm saying is that now we can _start_
> running tests and figure out corner cases (like NFS crashing on me :-).
> With this patchset we now have some infrastructure in place making it
> even _possible_ to run those tests.

I got to this same point several years ago. You know, that patchset
that Luis went back to when he brought up this whole topic again?
That's right when I started running fsx, and I realised it
didn't cover FICLONERANGE, FIDEDUPERANGE and copy_file_range().

Yep, that's when we first realised we had -zero- test coverage of
those operations. Darrick and I spent the next *3 months* pretty
much rewriting the VFS level of those operations and fixing all the
other bugs in the implementations, just so we could validate they
worked correctly on BS <= PS.

But by then Willy had started working over iomap and filemap for
folios, and the BS > PS patches were completely bitrotted and needed
rewriting from scratch. Which I now didn't have time to do....

So today is deja vu all over again: the first time I run fsx on
a new 64kB BS on 4KB PS implementation it hangs doing something
-really weird- and unexpected in copy_file_range(). It shouldn't
even be in the splice code doing a physical data copy.  So something
went wrong in ->remap_file_range(), or maybe in the truncate before
it, before it bugged out over out-of-range readahead in the page
cache...

I got only 3 fsx ops in today, and at least three bugs have already
manifested themselves....

> Don't be so pessimistic ...

I'm not pessimistic. I'm being realistic. I'm speaking from
experience. Not just as a Linux filesystem engineer who has had to
run this fsx-based data integrity validation process from the ground
up multiple times in the past decade, but also as an Irix OS
engineer that spent many, many hours digging out nasty, subtle BS > PS
data corruption bugs of the Irix buffer/page cache.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


