Re: [RFC] Tux3 for review

Daniel Phillips <daniel@xxxxxxxxx> · Mon, 23 Jun 2014 17:19:28 -0700

On Saturday, June 21, 2014 12:29:01 PM PDT, James Bottomley wrote:
On Thu, 2014-06-19 at 14:58 -0700, Daniel Phillips wrote:
On Thursday, June 19, 2014 2:26:48 AM PDT, Lukáš Czerner wrote:
 ...

the concern has always been how page forking interacted with 
writeback.

More accurately, that is just one of several concerns that Tux3
necessarily addresses in order to benefit from this powerful
optimization. We are pleased that the details continue to be of
general interest.

Direct IO is a spurious issue. To recap: direct IO does 
not introduce any new page forking issues. All of the page forking
issues already exist with normal buffered IO and mmap. We have 
little interest and scant available time for heading off on a 
tangent to implement direct IO at this point just as a 
precondition for merging.
 ...

The specific concern is that page forking cannot be made to work
with direct io. Asserting that it doesn't cause any additional
problems isn't an answer to that concern. 

Yes it is. We are satisfied that direct IO introduces no new issues
with page forking. If you are concerned about a specific issue then 
the onus is on you to specify it.

Direct IO isn't actually a huge issue for most filesystems (I mean
even vfat has it).

You might consider asking Hirofumi about that (VFAT maintainer).

...The fact that you think it is such a huge deal...

(Surely you could have found a less disparaging way to express
yourself...)

...to implement for tux3 tends to lend credence to this viewpoint.

It is purely a matter of concentrating on what is actually 
important, as opposed to imagined or manufactured. We do not wish 
to spend time on direct IO at this point in time. If you have 
identified a specific issue then please raise it.

For the record, there is a genuine reason why direct IO requires
extra work for Tux3, which has nothing to do with page forking. 
Tux3 has an asynchronous backend, unlike any other local Linux 
filesystem (but like Matt Dillon's Hammer, from which we took 
inspiration). Direct IO thus requires implementing a new 
synchronization mechanism to allow frontend direct IO to use the 
backend allocation and writeback mechanisms, because direct IO is 
synchronous. There is nothing new, magical or particularly 
challenging about that. It is just time-consuming work that we do 
not intend to do right now because other, more important things 
need to be done.
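
To make the shape of that work concrete, here is a minimal userspace
sketch of such a handoff, assuming POSIX threads. It is not Tux3 code:
struct backend, struct dio_request, dio_write_sync() and the toy block
counter are hypothetical names invented for illustration. A synchronous
caller queues a request to an asynchronous backend thread, then sleeps
until the backend reports the block it allocated.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request passed from a synchronous frontend to the backend. */
struct dio_request {
	long block;			/* result, filled in by the backend */
	bool done;			/* set once the backend has finished */
	pthread_mutex_t lock;
	pthread_cond_t completed;
	struct dio_request *next;
};

/* Hypothetical asynchronous backend: one thread draining a FIFO queue. */
struct backend {
	pthread_mutex_t lock;
	pthread_cond_t work;
	struct dio_request *head, *tail;
	long next_free_block;		/* toy stand-in for a block allocator */
	bool stopping;
	pthread_t thread;
};

void *backend_main(void *arg)
{
	struct backend *be = arg;

	pthread_mutex_lock(&be->lock);
	for (;;) {
		while (!be->head && !be->stopping)
			pthread_cond_wait(&be->work, &be->lock);
		if (!be->head)		/* stopping and queue drained */
			break;

		struct dio_request *req = be->head;
		be->head = req->next;
		if (!be->head)
			be->tail = NULL;
		long block = be->next_free_block++;
		pthread_mutex_unlock(&be->lock);

		/* Stand-in for allocation and writeback; then wake the caller. */
		pthread_mutex_lock(&req->lock);
		req->block = block;
		req->done = true;
		pthread_cond_signal(&req->completed);
		pthread_mutex_unlock(&req->lock);

		pthread_mutex_lock(&be->lock);
	}
	pthread_mutex_unlock(&be->lock);
	return NULL;
}

int backend_start(struct backend *be)
{
	pthread_mutex_init(&be->lock, NULL);
	pthread_cond_init(&be->work, NULL);
	be->head = be->tail = NULL;
	be->next_free_block = 0;
	be->stopping = false;
	return pthread_create(&be->thread, NULL, backend_main, be);
}

void backend_stop(struct backend *be)
{
	pthread_mutex_lock(&be->lock);
	be->stopping = true;
	pthread_cond_signal(&be->work);
	pthread_mutex_unlock(&be->lock);
	pthread_join(be->thread, NULL);
	pthread_cond_destroy(&be->work);
	pthread_mutex_destroy(&be->lock);
}

/* Synchronous frontend entry point: queue a request, block until done. */
long dio_write_sync(struct backend *be)
{
	struct dio_request req = { .block = -1, .done = false, .next = NULL };

	pthread_mutex_init(&req.lock, NULL);
	pthread_cond_init(&req.completed, NULL);

	pthread_mutex_lock(&be->lock);
	if (be->tail)
		be->tail->next = &req;
	else
		be->head = &req;
	be->tail = &req;
	pthread_cond_signal(&be->work);
	pthread_mutex_unlock(&be->lock);

	pthread_mutex_lock(&req.lock);
	while (!req.done)
		pthread_cond_wait(&req.completed, &req.lock);
	long block = req.block;
	pthread_mutex_unlock(&req.lock);

	pthread_cond_destroy(&req.completed);
	pthread_mutex_destroy(&req.lock);
	return block;
}
```

A caller would run backend_start() once, call dio_write_sync() from any
number of frontend threads, and finish with backend_stop(). The only
point of the sketch is that the caller waits on a per-request completion
while the backend keeps its own schedule. A real implementation would
also have to order such requests against the backend's normal commit
cycle.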

In the fullness of time, Tux3 will have direct IO just like VFAT.
However, that work is a good candidate for post-merge development. 
For example, it could be a good ramp-up project for a new team 
member or a student looking to make their mark on the kernel world.

The bottom line is that direct IO has nothing to do with compiling
the kernel or operating a cell phone efficiently, so it is not 
interesting to us right now. It will become more interesting when 
Tux3 is ready to scale to servers running Oracle and the like.

The point is that if page forking won't work with direct IO at
all, then it's a broken design and there's no point merging it.

You can rest assured that direct IO will work with page forking,
given that buffered IO does. We are now discussing details of how 
to make core Linux a more hospitable environment for page forking, 
not whether page forking can be made to work at all, a question that 
was settled by example some time ago.

On the other hand, page forking itself has a number of
interesting issues. Hirofumi is currently preparing a set of 
core kernel patches for review. These patches explicitly do 
not attempt to package page forking up into a nice and easy 
API that other filesystems could patch in tomorrow. That would 
be an unreasonable research burden on our small development 
team. 
 ...

OK, can we take a step back and ask why you're so keen to push
this into the tree?

If you mean, why are we keen to merge Tux3, I should not need to
explain that to you.

If you mean, why are we keen to push page forking per se into
mainline, then the answer is, we are by no means keen to push page 
forking into core kernel. Rather, that request comes from other 
filesystem developers who recognize it as a plausible way to avoid 
the pain of stable pages.

Based on our experience, page forking is properly implemented within
the filesystem, not core kernel, and we are keen only to push the 
requisite hooks into core. If somebody disagrees and feels the need 
to prove their point by implementing page forking entirely in core, 
then they should post patches and we will be the first to applaud.

The usual reason is ease of maintenance because in-tree
filesystems get updated as the vfs and mm APIs change.  However,
the reciprocal side of that is using standard VFS and MM APIs to 
make this update and maintenance easy.  The reason no-one wants
an in-tree filesystem that implements its own writeback by 
hacking into the current writeback system is that it's a huge 
maintenance burden.

Every filesystem is a maintenance burden. Core kernel simply must
provide the mechanisms that are required to make the kernel a good 
place for filesystems to exist. The fact that some ancient core 
hackery needs to be tweaked to better accommodate the requirements 
of a modern filesystem is not unusual in any way. Essentially, that 
is the entire story of Linux kernel development.

Every time writeback gets tweaked, tux3 will break meaning either 
we double the burden on people updating writeback (to try to 
figure out how to replicate the change in tux3) or we just accept 
that tux3 gets broken.

No. Tux3 will be less of a burden for writeback maintenance than
other filesystems because it hooks in above the messy writepages 
machinery and therefore is not sensitive to subtle changes in that 
creaky code.

The former is unacceptable to the filesystem and mm people and the
latter would mean there's not really much point merging tux3 if we
just keep breaking it ... it's better to keep it out of tree
where the breakages can be fixed by people who understand them on 
their own timescales.

On the face of it you are arguing the case that Tux3 should be 
blocked from merging forever, as should every new filesystem, as 
Pavel succinctly pointed out. That is less than helpful. But if 
your goal is to buttress the public perception that LKML has
become a toxic forum for contributors then you do an admirable
job.

By the way, after reading your polemic an observer might draw the 
conclusion that I am not one of the "filesystem and mm people". When 
did that change?

...
That was already fixed as noted above, and all the relevant
changes were already posted as an independent patch set. After
that, some developers weighed in with half formed ideas about 
how the same thing could be done better, but without concrete 
suggestions. There is nothing wrong with half formed ideas, 
except when they turn into a way of blocking forward progress
 ...

Could you post the url to the new series, please, I must have  
missed it; seeing the patches that implement the API for 
insertion into the writeback code would certainly help frame
this discussion.

We think that our most recently posted patch is the best approach 
at this time. Which is to say that it relies on exactly the 
existing writeback scheduling heuristics. We think that Dave Chinner 
and others are wrong to advocate experimental development of a new 
writeback mechanism at this juncture while the current scheme 
already works perfectly well for Tux3, either with our writeback 
hack or with the new hook.

We further suggest that the new hook is easy to understand and
imposes insignificant new maintenance burden. In any case we will be 
happy to assume whatever maintenance burden might arise. Obviously, 
that is entirely academic while we are the only user.

It is worth noting that we (the kernel community) have been
thrashing away at the writeback problem for more than twenty 
years, and the current solution still leaves much to be 
desired. It is unfair to expect us, the Tux3 team, to fix that 
mess in a week or two, just to merge our filesystem. We prefer 
to adapt the existing infrastructure for now, as expressed in 
the currently proposed patch set. With that, we allow core to 
mark our inodes dirty just as it has always done, and we 
continue to use the usual inode writeback lists for writeback
scheduling, which work just fine.

So that's a misunderstanding of expectations...

I did not misunderstand. It is clear from the context you deleted
that we are being pushed to engineer a new core writeback mechanism 
instead of adapting the existing one.

...the actual expectation is that you won't make the writeback
problem more difficult to tackle.

We do not make the writeback problem more difficult, which is 
obvious from the patch.

Reimplementing writeback within your code in a way that's hacked
into the system is fragile and burdensome ... it becomes double 
the code to maintain ... and tux3 breaks if it's not updated.

You are preaching to the converted. As you know, we posted a patch
set that eliminates this particular instance of core duplication. 
Upcoming patches will eliminate the remaining core duplication. It 
is unnecessary to belabor that point further.

Regards,

Daniel
