On 01/13/2016 06:52 PM, Benjamin Marzinski wrote:
On Wed, Jan 13, 2016 at 10:10:43AM +0100, Hannes Reinecke wrote:
Hi all,
I'd like to attend LSF/MM and would like to present my ideas for a multipath
redesign.
The overall idea is to break up the centralized multipath handling in
device-mapper (and multipath-tools) and delegate to the appropriate
sub-systems.
Individually the plan is:
a) use the 'wwid' sysfs attribute to detect multipath devices;
this removes the need of the current 'path_id' functionality
in multipath-tools
If all the devices that we support advertise their WWID through sysfs,
I'm all for this. Not needing to worry about callouts or udev sounds
great.
As of now, multipath-tools pretty much requires VPD page 0x83 to be
implemented. So that's not a big issue. Plus I would leave the old
infrastructure in place, as there are vendors which do provide their
own path_id mechanism.
b) leverage topology information from scsi_dh_alua (which we will
have once my ALUA handler update is in) to detect the multipath
topology. This removes the need of a 'prio' infrastructure
in multipath-tools
What about devices that don't use alua? Or users who want to be able to
pick a specific path to prefer? While I definitely prefer simple, we
can't drop real funtionality to get there. Have you posted your
scsi_dh_alua update somewhere?
Yep. Check the linux-scsi mailing list.
I've recently had requests from users to
1. make a path with the TPGS pref bit set be in its own path group with
the highest priority
Isn't that always the case?
Paths with TPGS pref bit set will have a different priority than
those without the pref bit, and they should always have the highest
priority.
I would rather consider this an error in the prioritizer ...
2. make the weighted prioritizer use persistent information to make its
choice, so its actually useful. This is to deal with the need to prefer a
specific path in a non-alua setup.
yeah, I had a similar request. And we should distinguish between the
individual transports, as paths might be coming in via different
protocols/transports.
Some of the complexity with priorities is there out of necessity.
Agree.
c) implement block or scsi events whenever a remote port becomes
unavailable. This removes the need of the 'path_checker'
functionality in multipath-tools.
I'm not convinced that we will be able to find out when paths come back
online in all cases without some sort of actual polling. Again, I'd love
this to be simpler, but asking all the types of storage we plan to
support to notify us when they are up and down may not be realistic.
Currently we have three main transports: FC, iSCSI, and SAS.
FC has reliable path events via RSCN, as this is also what the
drivers rely on internally (hello, zfcp :-)
If _that_ doesn't work we're in a deep hole anyway, cf the
eh_deadline mechanism we had to implement.
iSCSI has the NOP mechanism, which in effect is polling on the iSCSI
level. That would provide equivalent information; unfortunately not
every target supports that.
But even without iSCSI has it's own error recovery logic, which will
kick in whenever an error is detected. So we can as well hook into
that and use it to send events.
And for SAS we have a far better control over the attached fabric,
so it should be possible to get reliable events there, too.
That only leaves the non-transport drivers like virtio or the
various RAID-like cards, which indeed might not be able to provide
us with events.
So I would propose to make that optional; if events are supported
(which could be figured out via sysfs) we should be using them and
don't insist on polling, but fall back to the original methods if we
don't have them.
d) leverage these events to handle path-up/path-down events
in-kernel
If polling is necessary, I'd rather it be done in userspace. Personally,
I think the checker code is probably the least obectionable part of the
multipath-tools (It's getting all the device information to set up the
devices in the first place and coordinating with uevents that's really
ugly, IMHO).
And this is where I do disagree.
The checker code is causing massive lock congestion on large-scale
systems as there is precisely _one_ checker thread, having to check
all devices serially. If paths go down on a large system we're
having a flood of udev events, which we cannot handle in-time as the
checkerloop holds the lock trying to check all those paths.
So being able to do away with the checkerloop is a major improvement
there.
Cheers,
Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
hare@xxxxxxx +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel