Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

Paolo Valente <paolo.valente@xxxxxxxxxx> · Mon, 20 May 2019 12:45:58 +0200

> Il giorno 20 mag 2019, alle ore 11:15, Jan Kara <jack@xxxxxxx> ha scritto:
> 
> On Sat 18-05-19 15:28:47, Theodore Ts'o wrote:
>> On Sat, May 18, 2019 at 08:39:54PM +0200, Paolo Valente wrote:
>>> I've addressed these issues in my last batch of improvements for
>>> BFQ, which landed in the upcoming 5.2. If you give it a try, and
>>> still see the problem, then I'll be glad to reproduce it, and
>>> hopefully fix it for you.
>> 
>> Hi Paolo, I'm curious if you could give a quick summary about what you
>> changed in BFQ?
>> 
>> I was considering adding support so that if userspace calls fsync(2)
>> or fdatasync(2), to attach the process's CSS to the transaction, and
>> then charge all of the journal metadata writes the process's CSS.  If
>> there are multiple fsync's batched into the transaction, the first
>> process which forced the early transaction commit would get charged
>> the entire journal write.  OTOH, journal writes are sequential I/O, so
>> the amount of disk time for writing the journal is going to be
>> relatively small, and especially, the fact that work from other
>> cgroups is going to be minimal, especially if hadn't issued an
>> fsync().
> 
> But this makes priority-inversion problems with ext4 journal worse, doesn't
> it? If we submit journal commit in blkio cgroup of some random process, it
> may get throttled which then effectively blocks the whole filesystem. Or do
> you want to implement a more complex back-pressure mechanism where you'd
> just account to different blkio cgroup during journal commit and then
> throttle as different point where you are not blocking other tasks from
> progress?
> 
>> In the case where you have three cgroups all issuing fsync(2) and they
>> all landed in the same jbd2 transaction thanks to commit batching, in
>> the ideal world we would split up the disk time usage equally across
>> those three cgroups.  But it's probably not worth doing that...
>> 
>> That being said, we probably do need some BFQ support, since in the
>> case where we have multiple processes doing buffered writes w/o fsync,
>> we do charnge the data=ordered writeback to each block cgroup. Worse,
>> the commit can't complete until the all of the data integrity
>> writebacks have completed.  And if there are N cgroups with dirty
>> inodes, and slice_idle set to 8ms, there is going to be 8*N ms worth
>> of idle time tacked onto the commit time.
> 
> Yeah. At least in some cases, we know there won't be any more IO from a
> particular cgroup in the near future (e.g. transaction commit completing,
> or when the layers above IO scheduler already know which IO they are going
> to submit next) and in that case idling is just a waste of time.

Yep.  Issues like this are targeted exactly by the improvement I
mentioned in my previous reply.

> But so far
> I haven't decided how should look a reasonably clean interface for this
> that isn't specific to a particular IO scheduler implementation.
> 

That's an interesting point.  So far, I've assumed that nobody would
have told anything to BFQ.  But if you guys think that such a
communication may be acceptable at some degree, then I'd be glad to
try to come up with some solution.  For instance: some hook that any
I/O scheduler may export if meaningful.

Thanks,
Paolo

> 								Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR

Attachment:
signature.asc

Description: Message signed with OpenPGP