On Fri, 2024-04-12 at 08:06 +0200, Hannes Reinecke wrote:
>
> We have gone to great pains in the kernel to ensure the queue
> limits are sane, and updated correctly. Even for stacking devices.

This is true, but only for the creation of stacked devices (table
activation, as far as device mapper is concerned). Admins are free to
change max_sectors_kb at any time; changed settings are not propagated
along the device stack, and no sanity check in the kernel prevents
admins from setting values that will cause I/O errors.

> I sincerely doubt we need this patch from multipath anymore.
> Having to adjust max_sectors_kb really should be reserved for
> corner-cases where the user has dodgy hardware which doesn't
> report correct limits.

Right. We've seen a couple of cases where decreasing max_sectors_kb
from the default value was the only remedy for weird I/O failures.
This happened with remote storage reporting wrong limits, misbehaving
elements in the fabric, and even with virtualized IO stacks.

> But even that should rather be handled by blacklisting.
> Can't we just set max_sectors_kb to readonly in the kernel and
> be done with it?

Personally, I think this goes a bit too far. I believe the kernel
should disallow changing (more specifically, decreasing) the
max_sectors_kb sysfs attribute for block devices that are either in
use (bd_openers > 0) or held by other block devices
(bd_holder != NULL). That would eliminate a large portion of bad
cases, AFAICS. Admins could still increase max_sectors_kb at the top
of the device stack, but that would arguably count as shooting oneself
in the foot.

Errors in valid configurations are possible, even without changing
max_sectors_kb in sysfs. Consider a multipath map consisting of
devices with different max_sectors (for example mixed iSCSI/tcp and
iSCSI/bnx2i). If only the paths with large max_sectors are initially
detected, and others are added later, the map's max_sectors will be
decreased while in use, and the change will not be propagated to
stacked block layers above multipath: bummer.

The only way to avoid this in general is implementing limit
propagation. I assume that the implementation of block limit
propagation in the kernel would be a major effort with lots of
possible race conditions. It's far easier to have admins simply impose
max_sectors_kb on multipath maps in corner case scenarios like this.

Regards,
Martin
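
For illustration only, here is a minimal sketch of how the mismatch
described above could be spotted from user space. It is not part of
multipath-tools or the kernel; the function names are hypothetical. It
assumes the usual sysfs layout, where /sys/block/<dev>/slaves lists the
devices a stacked device sits on and /sys/block/<dev>/queue/max_sectors_kb
holds the current limit. Slaves that are partitions have no queue
directory under /sys/block and are silently skipped.

```python
import os


def read_max_sectors_kb(sys_block, dev):
    """Return queue/max_sectors_kb of dev as an int, or None if unreadable."""
    path = os.path.join(sys_block, dev, "queue", "max_sectors_kb")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def find_mismatches(sys_block="/sys/block"):
    """Return (upper, upper_kb, lower, lower_kb) tuples for every stacked
    device whose max_sectors_kb is larger than that of one of its slaves."""
    mismatches = []
    if not os.path.isdir(sys_block):
        return mismatches
    for dev in sorted(os.listdir(sys_block)):
        upper_kb = read_max_sectors_kb(sys_block, dev)
        slaves_dir = os.path.join(sys_block, dev, "slaves")
        if upper_kb is None or not os.path.isdir(slaves_dir):
            continue
        for slave in sorted(os.listdir(slaves_dir)):
            lower_kb = read_max_sectors_kb(sys_block, slave)
            # A lower device with a smaller limit than the device stacked
            # on top of it may receive requests that are too large for it.
            if lower_kb is not None and lower_kb < upper_kb:
                mismatches.append((dev, upper_kb, slave, lower_kb))
    return mismatches


if __name__ == "__main__":
    for upper, upper_kb, lower, lower_kb in find_mismatches():
        print(f"{upper}: max_sectors_kb={upper_kb} exceeds {lower} ({lower_kb})")
```

A check like this only reports the state at the moment it runs; it does
not address the race between a limit change and I/O already in flight.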