Re: [LSF/MM TOPIC] filesystems, btrfs, cgroups, debugging tools

Chris Mason <clm@xxxxxx> · Wed, 7 Feb 2018 09:51:02 -0500

On 7 Feb 2018, at 5:32, Jan Kara wrote:

Hi Chris,

On Thu 25-01-18 08:41:58, Chris Mason wrote:

With ext4, the data=ordered IO done during transaction commits makes
priority inversions that I don't see a way around.  It's dramatically
better than ext3, but still happens enough that we can't enforce IO
limits at all.  It really only takes one low prio IO to sneak into
kjournald's list to wreck everything.

AFAIU we could do something similar to what Tejun implemented for btrfs
metadata, where the submitter can override the blkcg to which the IO is
accounted. In ext4's case, if kjournald is doing the writeback, it would
get accounted to the root blkcg. That will allow containers to somewhat
violate the bounds set on their blkcg, but the priority inversion should
be rarer. Sadly we cannot easily make it go away completely: if the
original process not only attaches the inode to the transaction but also
submits the data blocks with low priority, the transaction commit still
has to wait for this IO to complete, so the whole commit will still be
blocked.

Yeah, I think this was the problem we hit.  balance_dirty_pages and 
friends will trigger low priority write back, and if kjournald ends up 
waiting on that, we're out of luck.
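
To make the two cases above concrete, here is a toy model, not kernel code. It assumes the commit has to wait for the slowest ordered write, that each write proceeds at the bandwidth limit of the blkcg it is accounted to, and that only writes kjournald submits itself can be re-accounted to the root blkcg. The names (OrderedWrite, commit_wait) and the bandwidth numbers are made up for illustration.

```python
from dataclasses import dataclass

ROOT_BW = 100_000  # KB/s, assumed bandwidth of the unthrottled root blkcg
LOW_BW = 250       # KB/s, assumed limit on a low-priority blkcg


@dataclass
class OrderedWrite:
    size_kb: int
    owner_bw: int            # limit of the blkcg that dirtied the data
    in_flight: bool = False  # already submitted by the owner, e.g. from
                             # balance_dirty_pages-triggered writeback


def commit_wait(writes, override_to_root):
    """Seconds the commit waits: it ends when the slowest write completes."""
    def seconds(w):
        # Only IO that kjournald submits itself can be charged to root;
        # IO the owner already submitted keeps the owner's limit.
        if override_to_root and not w.in_flight:
            bw = ROOT_BW
        else:
            bw = w.owner_bw
        return w.size_kb / bw
    return max((seconds(w) for w in writes), default=0.0)


fast = [OrderedWrite(1000, ROOT_BW) for _ in range(8)]
slow_pending = OrderedWrite(1000, LOW_BW)
slow_in_flight = OrderedWrite(1000, LOW_BW, in_flight=True)

print(commit_wait(fast, False))                    # high-priority data only
print(commit_wait(fast + [slow_pending], False))   # one low prio write
print(commit_wait(fast + [slow_pending], True))    # same, charged to root
print(commit_wait(fast + [slow_in_flight], True))  # already submitted
```

With these made-up numbers, a single low prio write stretches the commit from hundredths of a second to several seconds. The override brings it back down only when kjournald is the one submitting the write. Once the low prio write is already in flight, the override does not help, which is the case described above.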

So probably a better fix would be to introduce another data journalling
mode for ext4 where we'd unconditionally use unwritten extents for data
writeback. We actually have it implemented in ext4 hidden behind
'dioread_nolock' mount option but it needs more polishing and possibly
testing.

I wonder how that compares in performance to my old data=guarded idea.  
I think a better step one might be to add tracepoints when blocks are 
added to the ordered list, so we can better understand if we're adding 
them in error.  It felt like it was happening more often than it should.

On the FB side, we found one more prio inversion in btrfs from the free 
space cache (IO going down as data instead of metadata) and we're 
testing the fix for that.  It should hopefully be the last one, and then 
we can compare how effective the different options are.

-chris