On 7 Feb 2018, at 11:37, Jan Kara wrote:
On Wed 07-02-18 09:51:02, Chris Mason wrote:
On 7 Feb 2018, at 5:32, Jan Kara wrote:
On Thu 25-01-18 08:41:58, Chris Mason wrote:
With ext4, the data=ordered IO done during transaction commits creates priority inversions that I don't see a way around. It's dramatically better than ext3, but it still happens often enough that we can't enforce IO limits at all. It really only takes one low-priority IO sneaking into kjournald's list to wreck everything.
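The inversion can be sketched with a toy model (pure illustration, not ext4 code): the commit can only finish when the slowest IO on the ordered list finishes, so a single throttled IO sets the commit latency for everyone.

```python
# Toy model: in data=ordered mode, the transaction commit must wait
# for every data IO queued on the ordered list, so one low-priority
# (slow) IO stalls the whole commit for every container.

def commit_latency(ordered_list):
    """Commit finishes only when the slowest queued IO finishes."""
    return max(io["latency_ms"] for io in ordered_list)

# Nine high-priority IOs complete quickly...
ios = [{"cgroup": "fast", "latency_ms": 2} for _ in range(9)]
print(commit_latency(ios))   # 2 ms: commit is fast

# ...but one throttled, low-priority IO sneaking onto the list
# drags the commit (and everyone waiting on it) to its latency.
ios.append({"cgroup": "throttled", "latency_ms": 500})
print(commit_latency(ios))   # 500 ms: priority inversion
```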
AFAIU we could do a similar thing to what Tejun implemented for btrfs metadata, where the submitter can override the blkcg to which the IO is accounted. In ext4's case, if kjournald is doing the writeback, it would get accounted to the root blkcg. It would allow containers to somewhat violate the bounds set on their blkcg, but the priority inversion should be rarer. Sadly, we cannot easily make it go away completely: if the original process not only attaches the inode to the transaction but also submits the data blocks with low priority, the transaction commit still has to wait for that IO to complete, so the whole commit will still be blocked.
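A minimal sketch of why the override only half-helps, using a toy two-field IO record (nothing here is a kernel API): reaccounting kjournald's own submissions to the root blkcg makes those fast, but a data block the task already submitted at low priority still gates the commit.

```python
# Toy model: the blkcg override only helps IO that the journal thread
# itself submits; data blocks the original task already submitted at
# low priority keep their original (throttled) latency.

def commit_latency(ios):
    return max(io["latency_ms"] for io in ios)

def reaccount(ios):
    # kjournald-submitted writeback gets charged to the root blkcg
    # and is no longer throttled; everything else is left alone.
    return [{**io, "latency_ms": 2} if io["submitter"] == "kjournald"
            else io for io in ios]

ios = [
    {"submitter": "kjournald", "latency_ms": 500},  # would be throttled
    {"submitter": "task", "latency_ms": 500},       # pre-submitted, low prio
]
print(commit_latency(reaccount(ios)))  # still 500: inversion remains
```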
Yeah, I think this was the problem we hit. balance_dirty_pages and friends will trigger low-priority writeback, and if kjournald ends up waiting on that, we're out of luck.
So probably a better fix would be to introduce another data journalling mode for ext4 where we'd unconditionally use unwritten extents for data writeback. We actually have it implemented in ext4, hidden behind the 'dioread_nolock' mount option, but it needs more polishing and possibly testing.
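The unwritten-extent scheme can be illustrated with a toy state machine (names invented for illustration, not ext4 internals): the extent is allocated as unwritten, the data is written, and only on IO completion is it converted to written. A crash before the conversion exposes an unwritten extent, which reads back as zeros rather than stale blocks, so the commit no longer has to wait for the data IO.

```python
# Toy sketch of unwritten-extent writeback (simplified illustration):
# 1. allocate the extent UNWRITTEN and journal that mapping,
# 2. write the data blocks,
# 3. convert to WRITTEN when the IO completes.
# A crash between 1 and 3 leaves an unwritten extent that reads as
# zeros -- stale data is never exposed, removing the ordered-mode wait.

class Extent:
    def __init__(self):
        self.state = "unwritten"
        self.data = None

    def read(self):
        # Reads of an unwritten extent return zeros, not stale blocks.
        return self.data if self.state == "written" else b"\0\0\0\0"

    def complete_io(self, data):
        self.data = data
        self.state = "written"

e = Extent()
print(e.read())          # b'\x00\x00\x00\x00' -- crash-safe window
e.complete_io(b"real")
print(e.read())          # b'real'
```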
I wonder how that compares in performance to my old data=guarded idea. I think a better step one might be to add tracepoints when blocks are added to the ordered list, so we can better understand if we're adding them in error. It felt like it was happening more often than it should.
In ext4 / jbd2 this mechanism is actually different from ext3. We don't track individual blocks in the ordered list anymore; we just track inodes and flush all mapped & dirty pages from those inodes when doing a transaction commit. It does potentially imply more work, but the locking imposed by the original jbd scheme was incompatible with reasonably efficient writeback, and the very special data writeback path generally caused various troubles. We do take care to add an inode to a transaction only when we allocate a new block for it, so in this sense we shouldn't ever be adding them "in error".
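A toy model of the inode-level tracking (illustrative only; the real jbd2 code is far more involved): at commit time, every mapped & dirty page of each tracked inode gets flushed, including pages unrelated to the newly allocated block that put the inode on the list.

```python
# Toy contrast with ext3-style per-block tracking: the ext4/jbd2
# scheme tracks whole inodes, so commit flushes every dirty page of
# each tracked inode -- potentially more work, much simpler locking.

def commit_flush(ordered_inodes):
    flushed = []
    for inode in ordered_inodes:
        for page, state in sorted(inode["pages"].items()):
            if state == "dirty":
                flushed.append((inode["ino"], page))
                inode["pages"][page] = "clean"
    return flushed

# Inode 7 got a newly allocated block backing page 0, which put it on
# the ordered list; the unrelated dirty page 3 is flushed along with it.
inode = {"ino": 7, "pages": {0: "dirty", 3: "dirty"}}
print(commit_flush([inode]))   # [(7, 0), (7, 3)]
```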
You are right that we could optimize this with special handling for writeback of blocks beyond end of file (which is what data=guarded was about, if I remember right). However, with delayed allocation you then have to be careful not to flush the new inode size to disk before flushing the data blocks, so it gets more complicated.
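The ordering hazard with delayed allocation can be shown with a tiny model (hypothetical helper, not ext4 code): if the enlarged inode size reaches disk before the data backing it, a crash exposes garbage between the old and new EOF.

```python
# Toy model of the i_size vs. data ordering hazard: bytes covered by
# the on-disk i_size that were never flushed read back as garbage.

def crash_view(on_disk_isize, flushed):
    """File contents visible after a crash: flushed bytes up to the
    on-disk i_size, never-flushed bytes shown as garbage '?'."""
    visible = flushed[:on_disk_isize]
    visible += b"?" * (on_disk_isize - len(visible))
    return visible

# Wrong order: i_size=8 hit disk, but only 4 data bytes were flushed.
print(crash_view(8, b"good"))   # b'good????' -- garbage exposed
# Right order: data flushed first, on-disk i_size still the old 4.
print(crash_view(4, b"good"))   # b'good'
```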
But creating unwritten extents and converting them to written on IO completion certainly has a non-negligible cost (that's why this is not the default yet), so possibly the complication is worth it.
This isn't too far from what btrfs does, where we just update the
metadata to point to the new blocks after the IO is done. data=guarded
was similar but much more of a hack ;)
-chris