Hello, Michal. On Thu, Jan 05, 2023 at 08:22:47PM +0100, Michal Koutný wrote: > On Thu, Jan 05, 2023 at 07:35:59AM -1000, Tejun Heo <tj@xxxxxxxxxx> wrote: > > Hard limits tend to make this sort of problems a lot more pronounced because > > the existing mechanisms tend to break down for the users which are severely > > throttled down even while the device as a whole is fairly idle. cpu.max > > often triggers severe priority inversions too, so it isn't too surprising > > that people hit severe priority inversion issues w/ io.max. > > To be on the same page: > 1) severe PI == priority inversion across cgroups (progated e.g. via > global locks (as with cpu.max) or FS journal (as with io.max)), Another important inversion vector is memory reclaim as writeback operations get trapped in blk-throttle and the system can pile on waiting for clean pages. > 2) ordinary PI == priority inversion contained within a single cgroup, > i.e. no different from an under-provisioned system. I didn't use the term in a precise manner but yeah both can be problematic. The former a lot more so. > The reported issue sounds like 2) but even with the separated queues 1) > is still possible :-/ Yeah, definitely. > > Another problem with blk-throttle is that it doesn't prioritize shared IOs > > identified by bio_issue_as_root_blkg() like iolatency and iocost do, so > > there can be very severe priority inversions when e.g. journal commit gets > > trapped in a low priority cgroup further exacerbating issues like this. > > Thanks for the broader view. So the separated queues are certainly an > improvement but ultimately a mechanism based on bio_issue_as_root_blkg() > predicate and deferred throttling would be better? Or is permanent limit > enforcement more important? Generally true but the specific scenario raised by Jinke is rather unusual and isn't covered by issue_as_root mechanism as it doesn't promote REQ_SYNC data writes. Then again, this usually isn't a problem as in most cases dirty throttling through writeback should stave off severe starvations. blk-throttle is pretty outdated and kinda broken and separating out sync writes isn't gonna solve most of its problems. However, it will help in some cases like the one described by Jinke and can sometimes lessen wider scope inversions too. Given the partial nature, I don't feel too enthusiastic but at the same time can't find good reasons to reject either. I don't know. I don't feel too strong either way. On a tangential note, I've been thinking about implementing io.max on top of iocost so that each cgroup can just configure the max % of total device IO capacity that's allowed for it, which should be a lot easier to use and has many of the problems already solved. Thanks. -- tejun