Help with autofs hang

"NeilBrown" <neilb@xxxxxxx> · Mon, 20 Feb 2023 09:40:27 +1100

Hi,
 I have a customer who is experiencing problems with automountd.  I
 think I know what is happening, but I'm not sure if what I imagine is
 possible, or what the best solution is.

 The kernel is 4.12 and automountd is 5.1.3 - so not the newest, but not
 ancient.  I cannot see any changes since then that look like they might
 be relevant.

 The problem is that after a while automountd stops expiring direct
 mounts, and doesn't mount any new direct mounts that are added to the
 map.

 When this happens, an automountd thread has sent an
 AUTOFS_IOC_EXPIRE_MULTI ioctl to the kernel, and the kernel has sent an
 NFY_EXPIRE back up to automountd.  automountd reported

   handle_packet_expire_direct: can't find map entry for ....

 and the kernel never gets an ACK for the message and things hang.

 When I look, the mount point that the kernel is asking automountd to
 expire has already been unmounted.

 The mount map uses LDAP and changes quite often.  My guess is that
 automountd notices that some directory has been removed from the map,
 and so removes the map entry.  This presumably races with the expiry
 process.  The mount gets unmounted because it is removed from the map
 at the same time that expiry wants to remove it, and confusion results.

 My current thought for a solution is to change the way the kernel waits
 for NFY_EXPIRE replies.  Instead of waiting indefinitely it waits with
 a timeout.  If the wait times out and the filesystem is still mounted,
 it just loops around and waits again.  If after the timeout the
 filesystem has been unmounted it waits one more time (just in case
 automountd is about to reply) and then aborts the wait with -EAGAIN.
 I've provided the customer with a patch to do this using a 5 second
 wait.  I don't have test results yet.
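
 As a rough illustration of the wait loop described above, here is a
 small userspace model in C.  It is only a sketch of the control flow,
 not the patch given to the customer and not kernel code: the names
 wait_for_expire_reply, expire_wait_ops, wait_result and the sim_*
 helpers are all hypothetical, and each call to wait_for_reply stands
 for one timed wait (5 seconds in the patch).

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome of one bounded wait; not a kernel type. */
enum wait_result { WAIT_REPLIED, WAIT_TIMED_OUT };

/* Hypothetical hooks standing in for the real kernel operations. */
struct expire_wait_ops {
    enum wait_result (*wait_for_reply)(void *ctx); /* one timed wait */
    bool (*still_mounted)(void *ctx);              /* is the fs still mounted? */
};

/*
 * Returns 0 once the daemon replies.  Returns -EAGAIN if the mount has
 * gone away and one further grace wait also times out.
 */
int wait_for_expire_reply(const struct expire_wait_ops *ops, void *ctx)
{
    for (;;) {
        if (ops->wait_for_reply(ctx) == WAIT_REPLIED)
            return 0;
        /* Timed out: keep looping only while the filesystem is mounted. */
        if (!ops->still_mounted(ctx))
            break;
    }

    /* Already unmounted: wait once more in case a reply is in flight. */
    if (ops->wait_for_reply(ctx) == WAIT_REPLIED)
        return 0;
    return -EAGAIN;
}

/* Scripted stand-in for the daemon and the mount state (illustration only). */
struct sim {
    int reply_on_wait;   /* wait number (1-based) that gets a reply; 0 = never */
    int unmounted_after; /* mount is gone once this many waits have completed */
    int waits;           /* number of waits performed so far */
};

static enum wait_result sim_wait(void *ctx)
{
    struct sim *s = ctx;

    s->waits++;
    if (s->reply_on_wait && s->waits >= s->reply_on_wait)
        return WAIT_REPLIED;
    return WAIT_TIMED_OUT;
}

static bool sim_mounted(void *ctx)
{
    struct sim *s = ctx;

    return s->waits < s->unmounted_after;
}

static const struct expire_wait_ops sim_ops = { sim_wait, sim_mounted };
```

 With a struct sim of { 0, 2, 0 } (the daemon never replies and the
 mount disappears during the second wait) the model performs three
 waits and returns -EAGAIN, which is the case that currently hangs.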

 So my questions are:
  - is this race really possible? Can removal-from-map race with expiry?
  - is my timeout fix reasonable?  Might it cause other problems?  Is
    there a better way to fix this inside automountd?

Thanks,
NeilBrown