Re: Failure of program map to recover after failure

Ian Kent <raven@xxxxxxxxxx> · Tue, 10 Dec 2019 12:49:02 +0800

On Tue, 2019-12-10 at 12:41 +0800, Ian Kent wrote:
> On Thu, 2019-12-05 at 04:26 -0500, Doug Nazar wrote:
> > On autofs 5.1.6, after an unsuccessful mount attempt (stopped server)
> > using a program map for /net, it'll never recover once the server is
> > started again.
> > 
> > Here's the initial debug log for the failure:
> > 
> > handle_packet: type = 3
> > handle_packet_missing_indirect: token 6631, name wraith, request pid 32245
> > attempting to mount entry /net/wraith
> > lookup_mount: lookup(program): looking up wraith
> > lookup_mount: lookup(program): wraith -> -fstype=nfs,hard,intr,nodev,nosuid,sec=krb5 / wraith:/
> > parse_mount: parse(sun): expanded entry: -fstype=nfs,hard,intr,nodev,nosuid,sec=krb5 / wraith:/
> > parse_mount: parse(sun): gathered options: fstype=nfs,hard,intr,nodev,nosuid,sec=krb5
> > parse_mount: parse(sun): dequote("/") -> /
> > parse_mapent: parse(sun): gathered options: fstype=nfs,hard,intr,nodev,nosuid,sec=krb5
> > parse_mapent: parse(sun): dequote("wraith:/") -> wraith:/
> > update_offset_entry: parse(sun): updated multi-mount offset / -> -fstype=nfs,hard,intr,nodev,nosuid,sec=krb5 wraith:/
> > parse_mapent: parse(sun): gathered options: fstype=nfs,hard,intr,nodev,nosuid,sec=krb5
> > parse_mapent: parse(sun): dequote("wraith:/") -> wraith:/
> > sun_mount: parse(sun): mounting root /net/wraith/, mountpoint wraith, what wraith:/, fstype nfs, options hard,intr,nodev,nosuid,sec=krb5
> > mount(nfs): root=/net/wraith/ name=wraith what=wraith:/, fstype=nfs, options=hard,intr,nodev,nosuid,sec=krb5
> > mount(nfs): nfs options="hard,intr,nodev,nosuid,sec=krb5", nobind=0, nosymlink=0, ro=0
> > get_nfs_info: called with host wraith(192.168.21.90) proto 6 version 0x20
> > get_nfs_info: called with host wraith(192.168.21.90) proto 17 version 0x20
> > get_nfs_info: called with host wraith(fde2:2b6c:2d24:21::5a) proto 6 version 0x20
> > get_nfs_info: called with host wraith(fde2:2b6c:2d24:21::5a) proto 17 version 0x20
> > mount(nfs): no hosts available
> > dev_ioctl_send_fail: token = 6631
> > failed to mount /net/wraith
> > 
> > A few minutes later, after I'd restarted the server on the target,
> > another attempt gives:
> > 
> > handle_packet: type = 3
> > handle_packet_missing_indirect: token 6635, name wraith, request pid 32309
> > attempting to mount entry /net/wraith
> > lookup_mount: lookup(program): wraith -> -fstype=nfs,hard,intr,nodev,nosuid,sec=krb5 / wraith:/
> > lookup(program): unexpected lookup for active multi-mount key wraith, returning fail
> > dev_ioctl_send_fail: token = 6635
> > failed to mount /net/wraith
> > 
> > I'm currently running this patch but don't have much confidence in it.
> > I'm unsure of the lifetime rules for me->multi; maybe it should have
> > been cleared after the mount failed?
> 
> I've returned to look at this a few times now but don't have a
> proper answer for you just yet; I thought I'd let you know I am
> thinking about it.
> 
> > diff --git a/modules/lookup_program.c b/modules/lookup_program.c
> > index fcb1af7..b6f854b 100644
> > --- a/modules/lookup_program.c
> > +++ b/modules/lookup_program.c
> > @@ -646,7 +646,7 @@ int lookup_mount(struct autofs_point *ap, const
> > char 
> > *name, int name_len, void *
> >                                   name_len, ent, ctxt->parse-
> > > context);
> >                          goto out_free;
> >                  } else {
> > -                       if (me->multi) {
> > +                       if (me->multi && me->multi != me) {
> >                                  cache_unlock(mc);
> >                                  warn(ap->logopt, MODPREFIX
> >                                       "unexpected lookup for
> > active 
> > multi-mount"
> 
> Yes, the problem occurs because it's a top-level singleton multi-mount;
> otherwise you wouldn't get a lookup taking this code path.

I also need to work out why you don't get caught by the negative
map entry check that's meant to prevent multiple retries for a
failing map entry for a configured time.
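
For readers unfamiliar with the idea, here is a rough sketch of what a
negative entry check of that kind does: after a failure, the entry is
marked as "known bad" until some deadline, and lookups before the
deadline are refused without retrying the mount. This is only an
illustration under stated assumptions. The struct, its field, and both
function names are hypothetical and are not taken from the autofs
source; the real check in autofs is structured differently.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical, simplified cache entry; not the autofs struct mapent. */
struct negative_entry_sketch {
    time_t negative_until; /* 0 means no failure has been recorded */
};

/* Record a failed mount: suppress retries until now + negative_timeout. */
static void record_failure(struct negative_entry_sketch *e,
                           time_t now, time_t negative_timeout)
{
    e->negative_until = now + negative_timeout;
}

/* True while the entry is still inside its negative window. */
static bool should_skip_lookup(const struct negative_entry_sketch *e,
                               time_t now)
{
    return e->negative_until != 0 && now < e->negative_until;
}
```

With a check like this, a second lookup shortly after the first failure
would be rejected by the negative window rather than reaching the
multi-mount test, which is why it is surprising that the log above shows
the multi-mount warning instead.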

> 
> And even the entry delete below it should be ok because it will
> just do a lookup (i.e. run the program map again to get the map
> entry) and then update the multi-mount during the entry parse.
> 
> So while the change above isn't strictly the way this should be
> handled it probably should be ok.
> 
> I haven't worked out how to handle it immediately after the fail
> just yet but the change above probably should be kept as part of
> that as well, not sure yet.
> 
> Ian
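
To make the effect of the one-line change in the diff concrete, here is
a minimal standalone sketch comparing the original and patched
conditions. It assumes, as the patch itself implies, that the root entry
of a multi-mount has its multi pointer set to itself, while offset
entries point at their root. The struct and function names below are
hypothetical stand-ins, not the autofs definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for a map entry cache entry. */
struct mapent_sketch {
    const char *key;
    struct mapent_sketch *multi; /* NULL, self (root), or owning root */
};

/* Original condition: any entry with multi set is rejected. */
static bool rejects_lookup_original(const struct mapent_sketch *me)
{
    return me->multi != NULL;
}

/* Patched condition: only entries owned by a different root are rejected. */
static bool rejects_lookup_patched(const struct mapent_sketch *me)
{
    return me->multi != NULL && me->multi != me;
}
```

Under that assumption, a top-level singleton entry such as wraith
(multi pointing at itself) is rejected by the original condition but
allowed through by the patched one, while an offset entry belonging to
another root is still rejected by both.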