Re: Failure of program map to recover after failure

Ian Kent <raven@xxxxxxxxxx> · Tue, 10 Dec 2019 12:41:59 +0800

On Thu, 2019-12-05 at 04:26 -0500, Doug Nazar wrote:
> On autofs 5.1.6, after an unsuccessful mount attempt (stopped server)
> using a program map for /net, it'll never recover once the server is
> started again.
> 
> Here's the initial debug log for the failure:
> 
> handle_packet: type = 3
> handle_packet_missing_indirect: token 6631, name wraith, request pid 32245
> attempting to mount entry /net/wraith
> lookup_mount: lookup(program): looking up wraith
> lookup_mount: lookup(program): wraith -> -fstype=nfs,hard,intr,nodev,nosuid,sec=krb5 / wraith:/
> parse_mount: parse(sun): expanded entry: -fstype=nfs,hard,intr,nodev,nosuid,sec=krb5 / wraith:/
> parse_mount: parse(sun): gathered options: fstype=nfs,hard,intr,nodev,nosuid,sec=krb5
> parse_mount: parse(sun): dequote("/") -> /
> parse_mapent: parse(sun): gathered options: fstype=nfs,hard,intr,nodev,nosuid,sec=krb5
> parse_mapent: parse(sun): dequote("wraith:/") -> wraith:/
> update_offset_entry: parse(sun): updated multi-mount offset / -> -fstype=nfs,hard,intr,nodev,nosuid,sec=krb5 wraith:/
> parse_mapent: parse(sun): gathered options: fstype=nfs,hard,intr,nodev,nosuid,sec=krb5
> parse_mapent: parse(sun): dequote("wraith:/") -> wraith:/
> sun_mount: parse(sun): mounting root /net/wraith/, mountpoint wraith, what wraith:/, fstype nfs, options hard,intr,nodev,nosuid,sec=krb5
> mount(nfs): root=/net/wraith/ name=wraith what=wraith:/, fstype=nfs, options=hard,intr,nodev,nosuid,sec=krb5
> mount(nfs): nfs options="hard,intr,nodev,nosuid,sec=krb5", nobind=0, nosymlink=0, ro=0
> get_nfs_info: called with host wraith(192.168.21.90) proto 6 version 0x20
> get_nfs_info: called with host wraith(192.168.21.90) proto 17 version 0x20
> get_nfs_info: called with host wraith(fde2:2b6c:2d24:21::5a) proto 6 version 0x20
> get_nfs_info: called with host wraith(fde2:2b6c:2d24:21::5a) proto 17 version 0x20
> mount(nfs): no hosts available
> dev_ioctl_send_fail: token = 6631
> failed to mount /net/wraith
> 
> After a few minutes another attempt after I've re-started the server
> on target:
> 
> handle_packet: type = 3
> handle_packet_missing_indirect: token 6635, name wraith, request pid 32309
> attempting to mount entry /net/wraith
> lookup_mount: lookup(program): wraith -> -fstype=nfs,hard,intr,nodev,nosuid,sec=krb5 / wraith:/
> lookup(program): unexpected lookup for active multi-mount key wraith, returning fail
> dev_ioctl_send_fail: token = 6635
> failed to mount /net/wraith
> 
> I'm currently running this patch but don't have much confidence in it.
> I'm unsure of the lifetime rules for me->multi, maybe it should have
> been cleared after failure mounting?

I've returned to look at this a few times now but don't have a
proper answer for you just yet; I thought I'd let you know I am
thinking about it.

> 
> diff --git a/modules/lookup_program.c b/modules/lookup_program.c
> index fcb1af7..b6f854b 100644
> --- a/modules/lookup_program.c
> +++ b/modules/lookup_program.c
> @@ -646,7 +646,7 @@ int lookup_mount(struct autofs_point *ap, const char *name, int name_len, void *
>                                   name_len, ent, ctxt->parse->context);
>                          goto out_free;
>                  } else {
> -                       if (me->multi) {
> +                       if (me->multi && me->multi != me) {
>                                  cache_unlock(mc);
>                                  warn(ap->logopt, MODPREFIX
>                                       "unexpected lookup for active multi-mount"

Yes, the problem occurs because it's a top-level singleton multi-mount;
otherwise you wouldn't get a lookup taking this code path.

And even the entry delete below it should be ok, because it will
just do a lookup (i.e. run the program map again to get the map
entry) and then update the multi-mount during the entry parse.

So while the change above isn't strictly the way this should be
handled, it should probably be ok.
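
To make the pointer relationship concrete, here is a minimal sketch of
the two checks. The struct is a cut-down stand-in for struct mapent
(only key and multi are kept), and the helper names
old_check_rejects() and patched_check_rejects() are made up for
illustration. It assumes, as the patch implies, that the top-level
entry of a multi-mount has multi pointing at itself.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for autofs's struct mapent; only the fields
 * needed to illustrate the check are kept (an assumption, not the
 * real definition). */
struct mapent_sketch {
	const char *key;
	struct mapent_sketch *multi;	/* owner of the multi-mount, or NULL */
};

/* Original test: any entry that belongs to a multi-mount is rejected. */
int old_check_rejects(const struct mapent_sketch *me)
{
	assert(me != NULL);
	return me->multi != NULL;
}

/* Patched test: only entries owned by some *other* entry are rejected,
 * so a top-level entry that owns itself falls through to the re-lookup. */
int patched_check_rejects(const struct mapent_sketch *me)
{
	assert(me != NULL);
	return me->multi != NULL && me->multi != me;
}
```

For an entry whose multi points at itself (presumably the state of
"wraith" after the failed mount), old_check_rejects() returns 1 and
patched_check_rejects() returns 0. An offset entry whose multi points
at a different owner is still rejected by both.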

I haven't worked out how to handle it immediately after the failure
just yet, but the change above should probably be kept as part of
that as well. I'm not sure yet.

Ian