On Tue, Feb 04, 2020 at 11:30:45AM +0800, Weiping Zhang wrote:
> This series try to add Weighted Round Robin for block cgroup and nvme
> driver. When multiple containers share a single nvme device, we want
> to protect IO critical container from not be interfernced by other
> containers. We add blkio.wrr interface to user to control their IO
> priority. The blkio.wrr accept five level priorities, which contains
> "urgent", "high", "medium", "low" and "none", the "none" is used for
> disable WRR for this cgroup.

The NVMe protocol really doesn't define WRR to be a mechanism to mitigate
interference, though. It defines credits among the weighted queues only
for command fetching, and an urgent strict priority class that starves
the rest. It has nothing to do with how the controller should prioritize
completion of those commands, even if it may be reasonable to assume
influencing when the command is fetched should affect its completion.

On the "weighted" strict priority, there's nothing separating "high" from
"low" other than the name: the "set features" credit assignment can
invert which queues have higher command fetch rates such that the "low"
is favoured over the "high".

There's no protection against the "urgent" class starving others: normal
IO will time out and trigger repeated controller resets, while polled IO
will consume 100% of CPU cycles without making any progress if we make
this type of queue available without any additional code to ensure the
host behaves.

On the driver implementation, the number of module parameters being added
here is problematic. We already have 2 special classes of queues, and
defining this at the module level is considered too coarse when the
system has different devices on opposite ends of the capability spectrum.
For example, users want polled queues for the fast devices, and none for
the slower tier. We just don't have a good mechanism to define
per-controller resources, and more queue classes will make this problem
worse.

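
To make the two arbitration points above concrete (credit inversion and
urgent starvation), here is a toy model of command fetching. It is only a
sketch: the function name, the per-round credit numbers and the fixed
fetch budget are all made up for illustration, and a real controller also
has arbitration burst and per-queue round robin that this ignores.

```python
# Toy model of NVMe command-fetch arbitration; not a real controller.
CLASSES = ("urgent", "high", "medium", "low")
WEIGHTED = ("high", "medium", "low")


def arbitrate(pending, credits, budget):
    """Return how many commands are fetched per class.

    pending: commands queued per class (missing classes count as 0)
    credits: commands fetched per weighted class in each round
    budget:  total commands the controller fetches in this run
    """
    pending = {c: pending.get(c, 0) for c in CLASSES}
    fetched = {c: 0 for c in CLASSES}
    while budget > 0:
        # Strict priority: any urgent command is fetched before the
        # weighted classes get another round.
        if pending["urgent"] > 0:
            pending["urgent"] -= 1
            fetched["urgent"] += 1
            budget -= 1
            continue
        progressed = False
        for c in WEIGHTED:
            take = min(credits[c], pending[c], budget)
            pending[c] -= take
            fetched[c] += take
            budget -= take
            progressed = progressed or take > 0
        if not progressed:
            break  # nothing left to fetch
    return fetched


backlog = {"high": 1000, "low": 1000}
# Credits chosen (hypothetically) so that "low" outweighs "high".
inverted = arbitrate(backlog, {"high": 1, "medium": 1, "low": 8}, 90)
# An always-busy urgent queue on top of the same backlog.
starved = arbitrate({**backlog, "urgent": 1000},
                    {"high": 8, "medium": 4, "low": 1}, 90)
print(inverted)
print(starved)
```

With these made-up numbers the "low" class gets 80 of the 90 fetches and
"high" gets 10, and in the second run the urgent class takes all 90 while
everything else gets nothing. And none of this says anything about
completion order.
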
On the blk-mq side, this implementation doesn't work with the IO
schedulers. If one is in use, requests may be reordered such that a
request on your high-priority hctx may be dispatched later than more
recent ones associated with lower priority. I don't think that's what
you'd want to happen, so priority should be considered with schedulers
too.

But really, though, NVMe's WRR is too heavyweight and difficult to use.
The technical work group can come up with something better, but it looks
like they've lost interest in TPAR 4011 (no discussion in 2 years,
afaics).
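
Going back to the scheduler point, a minimal sketch of the reordering I
mean. The scheduler here is hypothetical and does nothing but sort by
start sector; the Request fields and the sector numbers are invented for
the example and don't correspond to any kernel structure.

```python
from dataclasses import dataclass


@dataclass
class Request:
    seq: int      # submission order, lower is older
    prio: str     # priority class of the hctx it was mapped to
    sector: int   # start sector


def sector_sorted_dispatch(requests):
    """Hypothetical scheduler: dispatch in ascending sector order."""
    return sorted(requests, key=lambda r: r.sector)


submitted = [
    Request(seq=0, prio="high", sector=9000),
    Request(seq=1, prio="low", sector=100),
    Request(seq=2, prio="low", sector=200),
]
dispatched = sector_sorted_dispatch(submitted)
print([(r.seq, r.prio) for r in dispatched])
```

The oldest request is the "high" one, yet it is dispatched last because
the ordering never looks at priority.
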