Hello Ben, hello Etienne, On Wed, 2023-08-23 at 21:51 +0200, Martin Wilck wrote: > On Wed, 2023-08-23 at 16:24 +0000, Etienne Aujames wrote: > > From: Etienne AUJAMES <eaujames@xxxxxxx> > > > > A user can change the value of max_sectors_kb on the multipath > > device > > and its path devices. > > So when a path is deleted and then re-added, its value will not be > > the > > same as the multipath device. In that case the IOs could be mis- > > sized. > > > > On reload, this patch re-apply max_sectors_kb value of the > > multipath > > device on its path devices. > > > > Signed-off-by: Etienne AUJAMES <eaujames@xxxxxxx> > > Looks good to me. > Reviewed-by: Martin Wilck <mwilck@xxxxxxxx> We've seen issues in our beta-testing with this patch. Consider a device with an odd max_sectors number (in 512 byte blocks). If you look through the kernel sources, you'll see that there are quite a few devices with odd max_sectors defaults. In our testing, we used bnx2i (max_sectors = 127). When a multipath map is set up, the kernel will call blk_stack_limits(), making sure that the map has a max_sectors value that doesn't exceed the max of all path devices. The map will have max_sectors = 127 in the bnx2i case. When a device is added, multipathd will read max_sectors_kb from sysfs, and obtain (127 >> 1) == 63. It will then write this value to the path devices, causing the kernel-internal max_sectors value to be set to (63 << 1) = 126. But the map's max_sectors is still 127. The path devices have now a lower limit than the map. This leads to errors like this: [ 254.652934][ T820] blk_insert_cloned_request: over max size limit. (127 > 126) If the map was reloaded after that, the kernel would fix this discrepancy autmatically through the blk_stack_limits() algorithm. But as long as this doesn't happen, the blk_insert_cloned_request() error will repeat and possibly cause the system to hang, and even if a reload happens, if the map is currently active, there will be a race window during which IO may fail. I'm working on a fix for this. My first idea was just to revert Etienne's patch bbb77f3 ("libmultipath: fix max_sectors_kb on adding path"). The code that exhibits the above issue is actually from 8fd4868 ("libmultipath: don't set max_sectors_kb on reloads"). My thoughts so far: 1) We must strictly avoid max_sectors(path) < max_sectors(map). 2) max_sectors(path) > max_sectors(map) doesn't hurt. 3) We currently have no way to tell if the kernel value for max_sectors of a path device is even or odd. 4) As soon as we or someone else writes anything to max_sectors_kb, the kernel internal value will be even. This applies to both path and map devices. 5) The max_sectors value of a map device will be equal to the lowest max_sectors of all path devices after each reload. 6) Therefore, if max_sectors_kb is set in multipath.conf and we have set it at least once in a reload, we're on the safe side wrt the scenario above. This is why we see no issue with 8fd4868 ("libmultipath: don't set max_sectors_kb on reloads"), even though that commit introduced the code that rounds the max_sectors value down. 7) Arguably, the kernel should be patched to provide the raw max_sectors value to user space, but that will take time. 8) If users change max_sectors_kb, either for path devices or for map devices or other stacked devices, after the mapping has been set up, they are asking for trouble. We can't avoid errors resulting from such user actions, and I believe it's futile to try. This is not a multipath-specific problem, it applies to any stacked block device. AFAIU, bbb77f3 ("libmultipath: fix max_sectors_kb on adding path") just "fixes" a scenario where a user had manipulated max_sectors_kb, which is a situation we can't correctly deal with anyway (see (8)). So perhaps just reverting bbb77f3 is enough. OTOH, 8fd4868 is somewhat inconsistent. If it's correct to set max_sectors_kb when path devices are added to make the value consistent with the map, it should also be correct if max_sectors_kb was not configured. This is what bbb77f3 tried to fix. In light of (2) above, I think we shouldn't change a path devices' max_sectors_kb unless it's smaller than the value of the map. If we do it this way, the scenario above couldn't occur. Thoughts / comments? Martin