On 01/13/2016 06:52 PM, Benjamin Marzinski wrote:
> On Wed, Jan 13, 2016 at 10:10:43AM +0100, Hannes Reinecke wrote:
>> Hi all,
>> I'd like to attend LSF/MM and would like to present my ideas for a
>> multipath redesign.
>> The overall idea is to break up the centralized multipath handling in
>> device-mapper (and multipath-tools) and delegate it to the appropriate
>> sub-systems.
>> Individually, the plan is:
>> a) use the 'wwid' sysfs attribute to detect multipath devices;
>>    this removes the need for the current 'path_id' functionality
>>    in multipath-tools
> If all the devices that we support advertise their WWID through sysfs,
> I'm all for this. Not needing to worry about callouts or udev sounds
> great.
As of now, multipath-tools pretty much requires VPD page 0x83 to be
implemented. So that's not a big issue. Plus I would leave the old
infrastructure in place, as there are vendors which do provide their
own path_id mechanism.
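
To illustrate how little userspace would need for (a), here is a minimal
sketch that groups paths purely by the sysfs 'wwid' attribute (assuming
the device exports it; error handling trimmed):

/*
 * Sketch only: read the WWID of a SCSI disk straight from sysfs.
 * Paths reporting the same string belong to the same multipath map.
 * Assumes the device exports the (VPD 0x83 based) 'wwid' attribute.
 */
#include <stdio.h>
#include <string.h>

static int get_wwid(const char *blkdev, char *wwid, size_t len)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/device/wwid", blkdev);
        f = fopen(path, "r");
        if (!f)
                return -1;      /* no wwid attribute: fall back to path_id */
        if (!fgets(wwid, (int)len, f)) {
                fclose(f);
                return -1;
        }
        fclose(f);
        wwid[strcspn(wwid, "\n")] = '\0';
        return 0;
}

int main(int argc, char **argv)
{
        char wwid[128];

        for (int i = 1; i < argc; i++)
                if (get_wwid(argv[i], wwid, sizeof(wwid)) == 0)
                        printf("%s -> %s\n", argv[i], wwid);
        return 0;
}

Run against e.g. sda and sdb: two entries printing the same WWID are two
paths to the same LUN, which is exactly the grouping path_id provides
today.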
>> b) leverage topology information from scsi_dh_alua (which we will
>>    have once my ALUA handler update is in) to detect the multipath
>>    topology. This removes the need for a 'prio' infrastructure
>>    in multipath-tools
> What about devices that don't use alua? Or users who want to be able to
> pick a specific path to prefer? While I definitely prefer simple, we
> can't drop real functionality to get there. Have you posted your
> scsi_dh_alua update somewhere?
Yep. Check the linux-scsi mailing list.
> I've recently had requests from users to
> 1. make a path with the TPGS pref bit set be in its own path group with
>    the highest priority
Isn't that always the case?
Paths with TPGS pref bit set will have a different priority than
those without the pref bit, and they should always have the highest
priority.
I would rather consider this an error in the prioritizer ...
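
Just to be explicit about what I mean, roughly (a sketch of the mapping,
not the actual alua prioritizer code; the numbers are illustrative only):

/* Sketch: derive a path priority from the ALUA state, with the pref
 * bit always dominating.  Constants are made up for illustration. */
enum alua_state { ALUA_AO, ALUA_ANO, ALUA_STANDBY, ALUA_UNAVAILABLE };

static int alua_prio(enum alua_state state, int pref)
{
        int prio;

        switch (state) {
        case ALUA_AO:
                prio = 50;              /* active/optimized */
                break;
        case ALUA_ANO:
                prio = 10;              /* active/non-optimized */
                break;
        case ALUA_STANDBY:
                prio = 1;
                break;
        default:
                prio = 0;
        }
        if (pref)
                prio += 80;             /* preferred port group always wins */
        return prio;
}

With something along those lines a path with the pref bit set always
ends up in its own, highest-priority path group; if it doesn't, that's a
prioritizer bug, not a missing feature.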
> 2. make the weighted prioritizer use persistent information to make its
>    choice, so it's actually useful. This is to deal with the need to
>    prefer a specific path in a non-alua setup.
yeah, I had a similar request. And we should distinguish between the
individual transports, as paths might be coming in via different
protocols/transports.
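
Telling the transports apart is already possible from sysfs alone; a
rough sketch (the substring matching is simplistic, real code would walk
the device hierarchy properly):

/*
 * Sketch: classify a SCSI path by transport, based on the parents that
 * show up in its resolved sysfs path (rport-* for FC, session* for
 * iSCSI, end_device-* for SAS).  Simplified, no real error handling.
 */
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *path_transport(const char *blkdev)
{
        char link[PATH_MAX];
        char *resolved;
        const char *type = "other";

        snprintf(link, sizeof(link), "/sys/block/%s/device", blkdev);
        resolved = realpath(link, NULL);
        if (!resolved)
                return "unknown";
        if (strstr(resolved, "/rport-"))
                type = "fc";
        else if (strstr(resolved, "/session"))
                type = "iscsi";
        else if (strstr(resolved, "/end_device-"))
                type = "sas";
        free(resolved);
        return type;
}

A persistent "prefer this path" setting could then be keyed on the WWID
plus a transport-specific port identifier instead of the unstable sdX
names.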
> Some of the complexity with priorities is there out of necessity.
Agree.
>> c) implement block or scsi events whenever a remote port becomes
>>    unavailable. This removes the need for the 'path_checker'
>>    functionality in multipath-tools.
> I'm not convinced that we will be able to find out when paths come back
> online in all cases without some sort of actual polling. Again, I'd love
> this to be simpler, but asking all the types of storage we plan to
> support to notify us when they are up and down may not be realistic.
Currently we have three main transports: FC, iSCSI, and SAS.
FC has reliable path events via RSCN, as this is also what the
drivers rely on internally (hello, zfcp :-)
If _that_ doesn't work we're in a deep hole anyway, cf the
eh_deadline mechanism we had to implement.
iSCSI has the NOP mechanism, which in effect is polling on the iSCSI
level. That would provide equivalent information; unfortunately not
every target supports that.
But even without that, iSCSI has its own error recovery logic, which
will kick in whenever an error is detected. So we can just as well hook
into that and use it to send events.
And for SAS we have far better control over the attached fabric, so it
should be possible to get reliable events there, too.
That only leaves the non-transport drivers like virtio or the
various RAID-like cards, which indeed might not be able to provide
us with events.
So I would propose to make that optional: if events are supported
(which could be figured out via sysfs) we should use them and not
insist on polling, but fall back to the original methods if we don't
have them.
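
The fallback could be as dumb as this (libudev sketch; which 'change'
events the transports would actually emit is exactly the part being
proposed here, not existing ABI, so take the filter as an assumption):

/*
 * Sketch: prefer kernel uevents for path state, fall back to a classic
 * poll-style check when nothing arrives.  Illustrates the idea only.
 */
#include <libudev.h>
#include <poll.h>
#include <stdio.h>

int main(void)
{
        struct udev *udev = udev_new();
        struct udev_monitor *mon =
                udev_monitor_new_from_netlink(udev, "kernel");
        struct pollfd pfd;

        udev_monitor_filter_add_match_subsystem_devtype(mon, "block", "disk");
        udev_monitor_enable_receiving(mon);
        pfd.fd = udev_monitor_get_fd(mon);
        pfd.events = POLLIN;

        for (;;) {
                if (poll(&pfd, 1, 5000) > 0) {
                        struct udev_device *dev =
                                udev_monitor_receive_device(mon);
                        if (dev) {
                                printf("event: %s %s\n",
                                       udev_device_get_action(dev),
                                       udev_device_get_sysname(dev));
                                udev_device_unref(dev);
                        }
                } else {
                        /* no event within 5s: run the old path checker here */
                }
        }
        return 0;
}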
>> d) leverage these events to handle path-up/path-down events
>>    in-kernel
> If polling is necessary, I'd rather it be done in userspace. Personally,
> I think the checker code is probably the least objectionable part of the
> multipath-tools (it's getting all the device information to set up the
> devices in the first place and coordinating with uevents that's really
> ugly, IMHO).
And this is where I do disagree.
The checker code is causing massive lock contention on large-scale
systems, as there is precisely _one_ checker thread which has to check
all devices serially. If paths go down on a large system we get a flood
of udev events which we cannot handle in time, as the checkerloop holds
the lock while checking all those paths.
So being able to do away with the checkerloop would be a major
improvement there.
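
To make the contention point concrete, this is roughly the shape of the
problem (names are illustrative, not the multipathd source):

/*
 * Sketch of the problem: one thread walks every path under a single
 * lock, so uevent processing that needs the same lock stalls for the
 * whole pass.
 */
#include <pthread.h>
#include <unistd.h>

struct path;                            /* opaque here */
extern struct path *paths[];
extern int n_paths;
extern int check_path(struct path *p);  /* may block on I/O per path */

pthread_mutex_t vecs_lock = PTHREAD_MUTEX_INITIALIZER;

void *checkerloop(void *arg)
{
        for (;;) {
                pthread_mutex_lock(&vecs_lock);
                for (int i = 0; i < n_paths; i++)
                        check_path(paths[i]);   /* serial, lock held */
                pthread_mutex_unlock(&vecs_lock);
                /* uevent handlers contend on vecs_lock and pile up here */
                sleep(1);
        }
}

With per-port events delivered and handled per device, that
serialization point simply goes away.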
Cheers,
Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
hare@xxxxxxx +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel