On 11/28/2016 12:51 PM, Zdenek Kabelac wrote:
> Dne 28.11.2016 v 11:42 Hannes Reinecke napsal(a):
>> On 11/28/2016 11:06 AM, Zdenek Kabelac wrote:
>>> Dne 28.11.2016 v 03:19 tang.junhui@xxxxxxxxxx napsal(a):
>>>> Hello Christophe, Ben, Hannes, Martin, Bart,
>>>> I am a member of the host-side software development team of the ZXUSP
>>>> storage project at ZTE Corporation. Facing market demand, our team has
>>>> decided to write code next month to improve multipath efficiency. The
>>>> whole idea is in the mail below. We hope to participate in and make
>>>> progress with the open source community, so any suggestions and
>>>> comments would be welcome.
>>>>
>>>
>>>
>>> Hi
>>>
>>> First - we are aware of these issues.
>>>
>>> The solution proposed in this mail would surely help - but there is
>>> likely a bigger issue to be solved first.
>>>
>>> The core trouble is to avoid the 'blkid' disk identification being
>>> executed at all.
>>> Recent versions of multipath already mark a plain 'RELOAD' of the table
>>> (which should not change disk content) with an extra DM bit, so the
>>> udev rules currently skip 'pvscan' - we would also like to extend this
>>> to skip more rules and to reimport the existing 'symlinks' from the
>>> udev database (so they would not get deleted).
>>>
>>> I believe the processing of udev rules is 'relatively' quick as long
>>> as it does not need to read/write ANYTHING from real disks.
>>>
>> Hmm. Are you sure this is an issue?
>> We definitely need to skip uevent handling when a path goes down (but I
>> think we do that already), but for 'add' events we absolutely need to
>> call blkid to figure out whether the device has changed.
>> There are storage arrays out there which use a 'path down/path up' cycle
>> to inform initiators about any device layout change.
>> So we wouldn't be able to handle those properly if we didn't call blkid
>> here.
>
> The core trouble is -
>
> With a multipath device you ONLY want to 'scan' the device (with blkid)
> when the initial first member device of the multipath map comes in.
>
> So when you start multipath (resume -> CHANGE) - that should be the ONLY
> place to run the 'blkid' test (which really reads over 3/4 MB from the
> disk, to check that there is no ZFS somewhere).
>
> Any further disk that is a member of the multipath map (recognized by
> 'multipath -c') should NOT be scanned - yet as far as I can tell the
> current order is the opposite: first comes 'blkid' (60) and only then
> does rule (62) recognize a mpath_member.
>
> Thus every added disk fires a very lengthy blkid scan.
>
> Of course I'm not an expert on dm multipath rules, so I'm passing this
> on to prajnoha@ - but I'd guess this is the primary source of slowdowns.
>
> There should be exactly ONE blkid scan for a single multipath device -
> as long as 'RELOAD' only adds/removes paths there is no reason to scan
> the component devices.
>

ATM 'multipath -c' is just a simple test of whether the device is supposed
to be handled by multipath.
And the number of bytes read by blkid shouldn't be _that_ large; a simple
'blkid' on my device caused it to read 35k ...

Also udev will become very unhappy if we're not calling blkid for every
device; you'd have a hard time reconstructing the event for those devices.
While it's trivial to import variables from parent devices, it's impossible
to do that from unrelated devices; you'd need a dedicated daemon for that.
So we cannot skip blkid without additional tooling.
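For illustration only: the reordering described above (recognize a multipath
path member before the blkid rule at 60 runs) might look roughly like the
sketch below. The file name 59-mpath-member.rules is hypothetical, and
60-persistent-storage.rules would still need an explicit guard to actually
skip its blkid import, so this is a sketch of the idea rather than a
drop-in fix.

    # Hypothetical /etc/udev/rules.d/59-mpath-member.rules - illustration only.
    # Mark SCSI path devices as multipath members before the blkid rule
    # in 60-persistent-storage.rules gets a chance to run.
    ACTION!="add|change", GOTO="mpath_member_end"
    SUBSYSTEM!="block", GOTO="mpath_member_end"
    KERNEL!="sd*", GOTO="mpath_member_end"

    # 'multipath -c' exits 0 when the device is supposed to be a path of a
    # multipath map; set the mpath_member marker mentioned above.
    PROGRAM=="/sbin/multipath -c $devnode", ENV{ID_FS_TYPE}="mpath_member"

    LABEL="mpath_member_end"

    # 60-persistent-storage.rules would then need something like
    #   ENV{ID_FS_TYPE}=="mpath_member", GOTO="persistent_storage_end"
    # before its IMPORT{builtin}="blkid" line, otherwise the blkid
    # builtin still runs and overwrites ID_FS_TYPE.

Note that this only moves the 'multipath -c' check earlier; spawning a
helper for every add/change event on every SCSI disk has a cost of its own.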
>>
>>> So while aggregation of 'uevents' in multipath would 'shorten' the
>>> queue processing of events - it would still not speed up the scanning
>>> itself.
>>>
>>> We need to drastically shorten unnecessary disk re-scanning.
>>>
>>> Also note - if you have a lot of disks - it might be worth checking
>>> whether udev picks the 'right' number of udev workers.
>>> There is heuristic logic to avoid system overload - but it might be
>>> worth checking whether the computed number is the best for scaling on
>>> your system with your amount of CPU/RAM/disks - i.e. if you double the
>>> number of workers, do you get any better performance?
>>>
>> That doesn't help, as we only have one queue (within multipath) to
>> handle all uevents.
>
> This was meant for systems with many different multipath devices.
> Obviously it would not help with a single multipath device.
>

I'm talking about the multipath daemon. There will be exactly _one_
instance of the multipath daemon running for the entire system, and it
handles _all_ udev events with a single queue, independent of the number
of attached devices.

Cheers,

Hannes
--
Dr. Hannes Reinecke                   Teamlead Storage & Networking
hare@xxxxxxx                                      +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
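As an aside on the udev worker-count question raised above: a minimal way
to experiment with the limit is sketched below. The numbers are examples
only, and retriggering events is disruptive enough that this belongs on a
test machine rather than a production host.

    # Raise the udev worker limit at runtime (affects newly queued events):
    udevadm control --children-max=64

    # Replay 'add' events for block devices and time how long it takes
    # for the event queue to drain:
    time ( udevadm trigger --action=add --subsystem-match=block && udevadm settle )

    # A persistent setting can be passed on the kernel command line as
    # udev.children_max=<N> (see systemd-udevd(8)).

Comparing the timing at the default limit against a doubled limit gives a
rough answer to the "do you get any better performance?" question quoted
above.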