On 4/11/24 19:41, Martin Wilck wrote:
Hello Ben, hello Etienne,
On Wed, 2023-08-23 at 21:51 +0200, Martin Wilck wrote:
On Wed, 2023-08-23 at 16:24 +0000, Etienne Aujames wrote:
From: Etienne AUJAMES <eaujames@xxxxxxx>
A user can change the value of max_sectors_kb on the multipath
device
and its path devices.
So when a path is deleted and then re-added, its value will not be
the
same as the multipath device. In that case the IOs could be mis-
sized.
On reload, this patch re-apply max_sectors_kb value of the
multipath
device on its path devices.
Signed-off-by: Etienne AUJAMES <eaujames@xxxxxxx>
Looks good to me.
Reviewed-by: Martin Wilck <mwilck@xxxxxxxx>
We've seen issues in our beta-testing with this patch.
Consider a device with an odd max_sectors number (in 512 byte blocks).
If you look through the kernel sources, you'll see that there are quite
a few devices with odd max_sectors defaults. In our testing, we used
bnx2i (max_sectors = 127).
When a multipath map is set up, the kernel will call
blk_stack_limits(), making sure that the map has a max_sectors value
that doesn't exceed the max of all path devices.
The map will have max_sectors = 127 in the bnx2i case.
When a device is added, multipathd will read max_sectors_kb from sysfs,
and obtain (127 >> 1) == 63. It will then write this value to the path
devices, causing the kernel-internal max_sectors value to be set to
(63 << 1) = 126. But the map's max_sectors is still 127. The path
devices have now a lower limit than the map. This leads to errors like
this:
[ 254.652934][ T820] blk_insert_cloned_request: over max size limit. (127 > 126)
If the map was reloaded after that, the kernel would fix this
discrepancy autmatically through the blk_stack_limits() algorithm. But
as long as this doesn't happen, the blk_insert_cloned_request() error
will repeat and possibly cause the system to hang, and even if a reload
happens, if the map is currently active, there will be a race window
during which IO may fail.
I'm working on a fix for this. My first idea was just to revert
Etienne's patch bbb77f3 ("libmultipath: fix max_sectors_kb on adding
path"). The code that exhibits the above issue is actually from 8fd4868
("libmultipath: don't set max_sectors_kb on reloads").
My thoughts so far:
1) We must strictly avoid max_sectors(path) < max_sectors(map).
2) max_sectors(path) > max_sectors(map) doesn't hurt.
3) We currently have no way to tell if the kernel value for max_sectors
of a path device is even or odd.
4) As soon as we or someone else writes anything to max_sectors_kb, the
kernel internal value will be even. This applies to both path and map
devices.
5) The max_sectors value of a map device will be equal to the lowest
max_sectors of all path devices after each reload.
6) Therefore, if max_sectors_kb is set in multipath.conf and we have
set it at least once in a reload, we're on the safe side wrt the
scenario above. This is why we see no issue with 8fd4868
("libmultipath: don't set max_sectors_kb on reloads"), even though that
commit introduced the code that rounds the max_sectors value down.
7) Arguably, the kernel should be patched to provide the raw
max_sectors value to user space, but that will take time.
8) If users change max_sectors_kb, either for path devices or for map
devices or other stacked devices, after the mapping has been set up,
they are asking for trouble. We can't avoid errors resulting from such
user actions, and I believe it's futile to try. This is not a
multipath-specific problem, it applies to any stacked block device.
AFAIU, bbb77f3 ("libmultipath: fix max_sectors_kb on adding path") just
"fixes" a scenario where a user had manipulated max_sectors_kb, which
is a situation we can't correctly deal with anyway (see (8)). So
perhaps just reverting bbb77f3 is enough.
OTOH, 8fd4868 is somewhat inconsistent. If it's correct to set
max_sectors_kb when path devices are added to make the value consistent
with the map, it should also be correct if max_sectors_kb was not
configured. This is what bbb77f3 tried to fix.
In light of (2) above, I think we shouldn't change a path devices'
max_sectors_kb unless it's smaller than the value of the map.
If we do it this way, the scenario above couldn't occur.
We have gone into great pains in the kernel to ensure the queue limits
are sane, and updated correctly. Even for stacking devices.
I sinserely doubt we need this patch from multipath anymore.
Having to adjust max_sectors_kb really should be reserved for
corner-cases where the user has a dodgy hardware which doesn't
report correct limits.
But even that should rather be handled by blacklisting.
Can't we just set max_sectors_kb to readonly in the kernel and
be done with it?
BTW: Brilliant analysis!
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxxx +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich