Re: [LSF/MM TOPIC] filesystems, btrfs, cgroups, debugging tools

Jan Kara <jack@xxxxxxx> · Wed, 7 Feb 2018 11:32:45 +0100

Hi Chris,

On Thu 25-01-18 08:41:58, Chris Mason wrote:
> On 01/25/2018 04:48 AM, Jan Kara wrote:
> > Hi Chris,
> > 
> > On Wed 24-01-18 17:02:47, Chris Mason wrote:
> > > I'm really looking forward to LSF/MM this year.  I can bring along a fair
> > > amount of data from production about benchmarking and stability.
> > > 
> > > We've been expanding our btrfs rollout, and we're also fixing up priority
> > > inversions when cgroup IO controllers are put in place.  I think we have
> > > btrfs fixed up, but ext4 seems to be incompatible with IO controllers due to
> > > data=ordered IO.
> > 
> > Yeah, I suspect I know what you hit but still I'd be interested in hearing
> > more details about your usecase and the problems you see. Maybe it could be
> > helped.
> > 
> 
> Both btrfs and ext4 are root drive filesystems for us.  The IO controller is
> basically making sure the root drive isn't saturated by lower priority
> tasks, which might be anything from system updates to log files to actually
> part of the workload.
> 
> With ext4, the data=ordered IO done during transaction commits makes
> priority inversions that I don't see a way around.  It's dramatically better
> than ext3, but still happens enough that we can't enforce IO limits at all.
> It really only takes one low prio IO to sneak into kjournald's list to wreck
> everything.

AFAIU we could do something similar to what Tejun implemented for btrfs
metadata, where the submitter can override the blkcg to which the IO is
accounted. In ext4's case, if kjournald is doing the writeback, the IO would
get accounted to the root blkcg. That would allow containers to somewhat
violate the bounds set for their blkcg, but the priority inversion should be
rarer. Sadly, we cannot easily make it go away completely: if the original
process not only attaches the inode to the transaction but also submits the
data blocks with low priority, the transaction commit still has to wait for
that IO to complete, so the whole commit will still be blocked.
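
To make the argument above concrete, here is a toy Python model. It is not
kernel code, and every name in it (OrderedWrite, commit_wait_seconds, the
"lowprio" cgroup, the rate numbers) is made up for illustration. It assumes
each ordered write finishes in size/rate seconds for whichever blkcg it is
charged to, with no queueing, and that the commit waits for the slowest one.

```python
from dataclasses import dataclass

ROOT = "root"  # stand-in name for the root blkcg


@dataclass
class OrderedWrite:
    size_kb: int
    owner_cg: str            # blkcg of the process that dirtied the data
    already_submitted: bool  # True if the owner issued the IO itself


def commit_wait_seconds(writes, limits_kbps, override_to_root):
    """Return how long the commit waits for its ordered data (toy model)."""
    worst = 0.0
    for w in writes:
        if w.already_submitted or not override_to_root:
            # IO is (or already was) charged to the owner's throttled blkcg.
            cg = w.owner_cg
        else:
            # The committing thread submits the IO and charges it to root.
            cg = ROOT
        worst = max(worst, w.size_kb / limits_kbps[cg])
    return worst


limits = {ROOT: 100_000, "lowprio": 100}  # KB/s, arbitrary example rates
pending = [OrderedWrite(1000, "lowprio", already_submitted=False)]
in_flight = [OrderedWrite(1000, "lowprio", already_submitted=True)]

print(commit_wait_seconds(pending, limits, override_to_root=False))
print(commit_wait_seconds(pending, limits, override_to_root=True))
print(commit_wait_seconds(in_flight, limits, override_to_root=True))
```

In this model the override only helps for writes the commit submits itself.
A write the low priority process has already put in flight stays throttled,
so the commit waits just as long with or without the override.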

So probably a better fix would be to introduce another data journalling
mode for ext4 where we'd unconditionally use unwritten extents for data
writeback. We actually have it implemented in ext4, hidden behind the
'dioread_nolock' mount option, but it needs more polishing and possibly
more testing.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR