On Tue, Jan 22, 2019 at 11:52:06PM -0800, Liu Bo wrote:
> On Fri, Jan 18, 2019 at 5:51 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
> >
> > On 1/18/19 6:39 PM, Liu Bo wrote:
> > > On Fri, Jan 18, 2019 at 8:43 AM Josef Bacik <josef@xxxxxxxxxxxxxx> wrote:
> > >>
> > >> On Fri, Jan 18, 2019 at 09:28:06AM -0700, Jens Axboe wrote:
> > >>> On 1/18/19 9:21 AM, Josef Bacik wrote:
> > >>>> On Fri, Jan 18, 2019 at 05:58:18AM -0700, Jens Axboe wrote:
> > >>>>> On 1/14/19 12:21 PM, Liu Bo wrote:
> > >>>>>> Our test reported the following stack, and the vmcore showed that
> > >>>>>> the ->inflight counter is -1.
> > >>>>>>
> > >>>>>> [ffffc9003fcc38d0] __schedule at ffffffff8173d95d
> > >>>>>> [ffffc9003fcc3958] schedule at ffffffff8173de26
> > >>>>>> [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6
> > >>>>>> [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb
> > >>>>>> [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3
> > >>>>>> [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a
> > >>>>>> [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49
> > >>>>>> [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d
> > >>>>>> [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4]
> > >>>>>> [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4]
> > >>>>>> [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae
> > >>>>>> [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188
> > >>>>>> [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301
> > >>>>>> [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4]
> > >>>>>> [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b
> > >>>>>> [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd
> > >>>>>> [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50
> > >>>>>> [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04
> > >>>>>> [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e
> > >>>>>>
> > >>>>>> The ->inflight counter may go negative (-1) if
> > >>>>>>
> > >>>>>> 0) blk-throttle had been enabled when the IO was issued, so its bio
> > >>>>>> has an associated blkg,
> > >>>>>>
> > >>>>>> 1) blk-iolatency was disabled when the IO was issued, so the
> > >>>>>> iolatency_grp in this blkg was not yet available,
> > >>>>>>
> > >>>>>> 2) blk-iolatency was enabled before this IO reached its endio, so
> > >>>>>> the iolatency_grp had become available by the time the IO hit
> > >>>>>> endio, and
> > >>>>>>
> > >>>>>> 3) the ->inflight counter is decremented from 0 to -1.
> > >>>>>>
> > >>>>>> This patch uses atomic_dec_if_positive() instead to avoid driving
> > >>>>>> the inflight counter negative.
> > >>>>>
> > >>>>> The problem with that is that it'll hide a lot of other issues, too.
> > >>>>> Any way we can either track whether this rqw is in flight and only
> > >>>>> decrement if it is, or quiesce when enabling?
> > >>>>
> > >>>> I worried about this too, but really the side-effect of mis-counting
> > >>>> is just that we let more IO through.  I think maybe we add a debug
> > >>>> option that we can turn on to see if we're messing up accounting,
> > >>>> but in general I don't see a problem with this approach.
> > >>>
> > >>> The problem is that an accounting bug elsewhere (like a missing
> > >>> increment) will now go unnoticed, which could completely screw it up.
> > >>> An occasional blip like the one described is totally fine, but that's
> > >>> not guaranteed to be the case.
> > >>
> > >> Yeah I agree it's kind of shitty.
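To make the semantics concrete, the endio side of Liu's patch boils down to
something like this (a sketch only; the helper name and surrounding context
are illustrative stand-ins, not the actual blk-iolatency code):

    /*
     * Illustrative sketch: "inflight" stands in for rqw->inflight.
     * atomic_dec_if_positive() decrements only when the old value is
     * > 0 and returns (old value - 1) either way, so a negative return
     * means no decrement happened -- i.e. this bio was never counted
     * at submit time because iolatency was enabled after it was issued.
     */
    static void iolat_dec_inflight(atomic_t *inflight)
    {
            if (atomic_dec_if_positive(inflight) < 0)
                    return; /* not accounted at submit; nothing to undo */

            /* accounted normally: wake up throttled waiters as before */
    }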
> > >>
> > >>>> The problem we're running into here is there's not really a good
> > >>>> way to tag a bio as "seen by io.latency."  We just have to assume
> > >>>> that if we're on and there's a bi_blkg associated, we saw it at
> > >>>> submit time.  We can't just add a flag for every io controller that
> > >>>> starts tracking inflight IOs, so for now I think this is a
> > >>>> reasonable solution.  Thanks,
> > >>>
> > >>> Can we quiesce instead when enabling/disabling?
> > >>
> > >> Actually that's not a bad idea, you want to look into that Liu?  That
> > >> way we can make sure everything is correct always.  Thanks,
> > >>
> > > Just to confirm, does "quiesce when enabling/disabling" mean "waiting
> > > for inflight IOs to complete when enabling/disabling iolatency"?
> > >
> So it might not be an easy thing to quiesce the queue, because in
>
> iolatency_set_limit()
>   blkg_conf_prep() // returns with rcu and the queue's lock held
>
> neither freeze_queue nor quiesce_queue is allowed in an RCU reader-side
> context.
>
> If we do it outside of blkg_conf_prep/finish(), I feel like the overhead
> is more than it's worth.

I had to do this for io.weight yesterday, I just put it in
blkg_conf_prep/finish.  I don't think that's unreasonable, it's not like
we mess with settings a lot; it'll get set up on cgroup creation before
anything is really doing IO, and it saves us if things are running
currently.  Thanks,

Josef
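P.S.  For the record, the shape I mean is roughly the following -- a hedged
sketch against the 5.0-era blk-cgroup code, with the ctx layout, lookup,
and locking details approximated rather than taken from the actual
io.weight patch:

    /*
     * Freeze before entering the rcu/queue_lock section, since
     * blk_mq_freeze_queue() may sleep and therefore can't run inside
     * rcu_read_lock() -- which is exactly the concern above.
     */
    int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
                       char *input, struct blkg_conf_ctx *ctx)
    {
            struct request_queue *q;

            /* ... parse input and look up ctx->disk as before ... */
            q = ctx->disk->queue;

            blk_mq_freeze_queue(q); /* drain all inflight IO first */

            rcu_read_lock();
            spin_lock_irq(&q->queue_lock);
            /* ... look up or create the blkg as before ... */
            return 0;
    }

    void blkg_conf_finish(struct blkg_conf_ctx *ctx)
    {
            struct request_queue *q = ctx->disk->queue;

            spin_unlock_irq(&q->queue_lock);
            rcu_read_unlock();
            blk_mq_unfreeze_queue(q);
            /* ... drop the disk reference as before ... */
    }

With that, any counter a policy keeps can be swapped or reset while nothing
is in flight, so enable/disable races like the one in this thread can't
drive it negative in the first place.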