Disconcerting observation with multipathd's dmevent polling code

Martin Wilck <mwilck@xxxxxxx> · Wed, 21 Nov 2018 11:50:38 +0100

Hi Ben,

I recently encountered a strange and disturbing phenomenon with the
latest upstream code.

The system in question had two iSCSI connections on separate networks.
On one of them, I tested iSCSI disconnects and reconnects, as well as
broken storage configurations (different LUNs with different sizes but
the same WWID). The other one was idle.

During one test, I observed that multipathd's internal representation
of the map representing the *idle* connection had vanished. The map was
still present on the device-mapper level, but multipathd seemed to have
"forgotten" it completely. The respective paths were tracked as
orphans.

Here is a log excerpt. multipathd is working on path additions for the
map I was testing with (149455...), when a DM event is received. Right
after ev_add_path() releases the lock, the idle map
(36001405...) is removed (not flushed, as the messages suggest).

Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdg: add path (uevent)
# uev_add_path->pathinfo()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdg: udev property SCSI_IDENT_LUN_T10 whitelisted
# ev_add_path()->adopt_paths()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sde: ownership set to 149455400000000006d706174683100000000000000000000
...
# ev_add_path->setup_map()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: 149455400000000006d706174683100000000000000000000: failback = "manual" (setting: multipath internal)
# ev_add_path->domap()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: 149455400000000006d706174683100000000000000000000: load table [0 2097152 multipath 0 0 3 1 service-time 0 1 1 8:64 ...
# ev_add_path->setup_multipath()->update_multipath_strings()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: 149455400000000006d706174683100000000000000000000: disassemble map [0 0 3 1 service-time 0 1 2 8:64 1 1 service-time ...
# (!!!!) devmap event for LIO map
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: 36001405d08f748dd4c5481f9ec22b888: devmap event #20
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdg [8:96]: path added to devmap 149455400000000006d706174683100000000000000000000
# ev_add_path() returns, vecs->lock is released
# uev_add_path() goes on to next path, lock currently not held
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdh: add path (uevent)
# (!!!!) The LIO map is flushed (36001405d08f748dd4c5481f9ec22b888)
# this looks like remove_map() or remove_map_and_stop_waiter()
> Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdd: orphan path, map flushed
> Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sysfs prioritizer refcount 2
> Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdc: orphan path, map flushed

To me this looks like the map was removed by the DM event handler.
More details can be found at
https://bugzilla.opensuse.org/show_bug.cgi?id=1116272.
I've also tried to reproduce the problem, with no success so far
(actually I haven't observed any dmevents in those reproduction
attempts).

After encountering this problem, I've scrutinized your polling dmevent
handler code over and over last week, but I couldn't spot any possible
problems :-)

The concern that I have is with the design of the polling dmevent API,
in particular the fact that the default action is EVENT_REMOVE. If, for
whatever reason, one map is missing from the return value of the
DM_DEVICE_LIST ioctl, multipath removes the map immediately, and
there's nothing short of a "reconfigure" or "add map" CLI command that
would reinstate the map. IOW, we remove the map not on a kernel event
saying "this map has been removed", but on receiving a list where this
element happens to be missing. We handle this differently for paths,
where we wait for a "remove" uevent before we really delete the path
from our data structures. (Note that the messages file I saved from the
event above shows no sign of such a uevent ever being sent; as I said,
the dm map was still present after the above occurred).
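
To make the concern concrete, here is a minimal sketch of the "default
action is remove" pattern. It is not the multipathd code: the struct and
function names (tracked_map, listed_map, classify_maps) are made up for
illustration, EVENT_NOTHING and EVENT_UPDATE are assumed companions to
EVENT_REMOVE, and the "listed" array merely stands in for whatever the
DM_DEVICE_LIST ioctl returned.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this only illustrates the pattern. */
enum event_action { EVENT_NOTHING, EVENT_REMOVE, EVENT_UPDATE };

/* A map as tracked by the daemon (assumed layout). */
struct tracked_map {
	const char *name;
	unsigned int event_nr;      /* last event number seen for this map */
	enum event_action action;   /* result of the latest classification */
};

/* One entry of a device listing (stand-in for DM_DEVICE_LIST output). */
struct listed_map {
	const char *name;
	unsigned int event_nr;
};

/*
 * Classify every tracked map against a single device listing.
 * Each map starts out as EVENT_REMOVE and is only "rescued" if its
 * name shows up in the listing. Returns the number of maps that are
 * still marked EVENT_REMOVE afterwards.
 */
static size_t classify_maps(struct tracked_map *tracked, size_t n_tracked,
			    const struct listed_map *listed, size_t n_listed)
{
	size_t i, j, n_remove = 0;

	assert(tracked != NULL || n_tracked == 0);
	assert(listed != NULL || n_listed == 0);

	for (i = 0; i < n_tracked; i++) {
		/* The default: absence from the list means removal. */
		tracked[i].action = EVENT_REMOVE;

		for (j = 0; j < n_listed; j++) {
			if (strcmp(tracked[i].name, listed[j].name) != 0)
				continue;
			if (listed[j].event_nr != tracked[i].event_nr) {
				/* Map is listed and has a new event. */
				tracked[i].action = EVENT_UPDATE;
				tracked[i].event_nr = listed[j].event_nr;
			} else {
				/* Map is listed, nothing changed. */
				tracked[i].action = EVENT_NOTHING;
			}
			break;
		}
		if (tracked[i].action == EVENT_REMOVE)
			n_remove++;
	}
	return n_remove;
}
```

In this sketch, a map that is tracked but absent from one listing ends
up marked EVENT_REMOVE, even though nothing ever reported that the map
was removed. That is the difference to the path case, where deletion
waits for an explicit "remove" uevent.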

What do you think about this? Could you maybe inspect those logs I
took, to make sure I didn't get on a totally wrong track here?

Regards,
Martin

-- 
Dr. Martin Wilck <mwilck@xxxxxxxx>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
