On Fri, 2018-01-12 at 09:38 +0100, Julian Andres Klode wrote:
>
> and then we get I/O error on the device and it's rendered unusable.
> It's also crashing in uev_pathfail_check() occasionally because
> find_path_by_devt() returns NULL, so I applied the following patch to
> at least continue, but that's obviously wrong - We get an udev event
> for a device which does not exist in /dev (but it should)?

Adding Guan, as the pathfail check is from his code.

> --- a/multipathd/main.c
> +++ b/multipathd/main.c
> @@ -1090,6 +1090,11 @@ uev_pathfail_check(struct uevent *uev, s
>  	lock(&vecs->lock);
>  	pthread_testcancel();
>  	pp = find_path_by_devt(vecs->pathvec, devt);
> +	if (!pp) {
> +		condlog(3, "%s: Cannot find path by dm path %s", uev->kernel, devt);
> +		FREE(devt);
> +		goto out;
> +	}
>  	r = io_err_stat_handle_pathfail(pp);
>  	lock_cleanup_pop(vecs->lock);

You need to clean up the lock in the error path. I'd prefer checking for
a NULL path argument in io_err_stat_handle_pathfail(). See attachment.

I'm assuming that you are not using the "marginal path" logic. In
general I don't like the fact that PATH_FAILED events are handled at all
in multipathd if this logic is inactive; that code path is only needed
for this purpose. But that's just a side note.

> Jan 12 09:17:52 autopkgtest kernel: device-mapper: multipath: Failing path 8:16.
> Jan 12 09:17:52 autopkgtest kernel: sd 3:0:0:1: [sdb] Synchronizing SCSI cache
> Jan 12 09:17:52 autopkgtest multipath[6909]: 8:16: cannot find block device
> Jan 12 09:17:52 autopkgtest multipath[6909]: 8:16: Empty device name
> Jan 12 09:17:52 autopkgtest multipath[6909]: 8:16: Empty device name
> Jan 12 09:17:52 autopkgtest multipath[6909]: get_udev_device: failed to look up 8:16 with type 1
> Jan 12 09:17:52 autopkgtest multipath[6909]: dm-0: usable paths found
> Jan 12 09:17:53 autopkgtest iscsid[649]: Connection2:0 to [target: iqn.2016-11.foo.com:target.iscsi, portal: 127.0.0.1,3260] through [iface: default] is shutdown.
>
> We can see that it correctly removed the first device (sda) - except
> well, it seems to try again and fail with the part where it would have
> crashed. But when it tries to lookup the second one it fails.
>
> Given that this works in 0.6.4, I think it's a bug that appeared later
> on, but I can't really pin point the source of it.

Well, it may be because of the locking being broken by your patch. If
you look at the journal you sent, multipathd never prints a single
message after the removal of sda, until it says

Jan 12 09:18:37 autopkgtest multipathd[1980]: exit (signal)

That makes me think it hangs somehow, which could well be explained by
the lock not being released. Please retry with the attached patch.

We are seeing the *multipath* messages ([6909]) which are printed from
multipath during udev rule processing, because the map still holds
references to the deleted path.

Regards,
Martin

-- 
Dr. Martin Wilck <mwilck@xxxxxxxx>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
commit c4d48c633b0825941024a34acf2304a6f5a2d17d (HEAD -> upstream)
Author: Martin Wilck <mwilck@xxxxxxxx>
Date:   Fri Jan 12 21:21:49 2018 +0100

    libmultipath: deal with NULL path in pathfail handler

    This avoids a crash for paths which are already deleted.

    Reported-by: Julian Andres Klode <julian.klode@xxxxxxxxxxxxx>

diff --git a/libmultipath/io_err_stat.c b/libmultipath/io_err_stat.c
index 75a6df67c207..d2d2276a523e 100644
--- a/libmultipath/io_err_stat.c
+++ b/libmultipath/io_err_stat.c
@@ -315,6 +315,10 @@ int io_err_stat_handle_pathfail(struct path *path)
 	struct timespec curr_time;
 	int res;
 
+	if (path == NULL) {
+		io_err_stat_log(1, "%s: called with empty path", __func__);
+		return 1;
+	}
 	if (path->io_err_disable_reinstate) {
 		io_err_stat_log(3, "%s: reinstate is already disabled",
 			path->dev);
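
For illustration only, here is a minimal standalone sketch of why the NULL
check is better placed in the callee than in the caller. It is not
multipathd code: demo_path, demo_lock, demo_handle_pathfail(),
demo_pathfail_check() and demo_lock_is_free() are hypothetical stand-ins,
and a plain pthread mutex with pthread_cleanup_push()/pthread_cleanup_pop()
is assumed in place of the lock helpers used in multipathd. Because the
caller has a single exit path, the cleanup handler always runs and the lock
is released even when the path is NULL.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a path object; not the libmultipath struct. */
struct demo_path {
	int io_err_disable_reinstate;
};

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Cleanup handler: releases the mutex passed as argument. */
static void demo_unlock(void *arg)
{
	int rc = pthread_mutex_unlock((pthread_mutex_t *)arg);

	assert(rc == 0);
	(void)rc;
}

/* Callee tolerates a NULL path, so callers need no extra exit path. */
static int demo_handle_pathfail(struct demo_path *pp)
{
	if (pp == NULL)
		return 1;	/* nothing to do for an already deleted path */
	return 0;
}

/* Caller: lock, call, unlock. There is no early "goto out" that could
 * skip the cleanup pop and leave the lock held. */
static int demo_pathfail_check(struct demo_path *pp)
{
	int r;

	pthread_mutex_lock(&demo_lock);
	pthread_cleanup_push(demo_unlock, &demo_lock);
	pthread_testcancel();
	r = demo_handle_pathfail(pp);
	pthread_cleanup_pop(1);	/* non-zero argument runs demo_unlock() */
	return r;
}

/* Returns 1 if the lock can be taken right now, 0 if it is still held. */
static int demo_lock_is_free(void)
{
	if (pthread_mutex_trylock(&demo_lock) != 0)
		return 0;
	pthread_mutex_unlock(&demo_lock);
	return 1;
}
```

Calling demo_pathfail_check(NULL) and then demo_lock_is_free() from a small
main() shows that the lock is free again after the NULL case. A variant that
jumps past pthread_cleanup_pop() would leave it held, and the next locker
would block, which would match the hang suspected above.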
-- dm-devel mailing list dm-devel@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/dm-devel