Re: Help with autofs hang

Ian Kent <raven@xxxxxxxxxx> · Thu, 2 Mar 2023 08:37:09 +0800

On 20/2/23 12:49, NeilBrown wrote:
On Mon, 20 Feb 2023, Ian Kent wrote:
On 20/2/23 06:40, NeilBrown wrote:
Hi,
   I have a customer who is experiencing problems with automountd.  I
   think I know what is happening, but I'm not sure if what I imagine is
   possible, or what the best solution is.

   The kernel is 4.12 and automountd is 5.1.3 - so not the newest, but not
   ancient.  I cannot see any changes since that look like they might be
   relevant.

   The problem is that after a while automountd stops expiring direct
   mounts, and doesn't mount any new direct mounts that are added to the
   map.

   When this happens an automountd thread has sent an
   AUTOFS_IOC_EXPIRE_MULTI ioctl to the kernel, the kernel has sent a
   NFY_EXPIRE back up to automountd.  automountd reported

     handle_packet_expire_direct: can't find map entry for ....

   and the kernel never gets an ACK for the message and things hang.
Yes, that case is fatal.

Because the kernel communications pipe might not be able to convey the
direct mount path, a bogus value is encoded into the packet and an
inode-number-to-path index is used to look up the path. Without the
path we can't continue.
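
To make the failure mode concrete, here is a minimal sketch of such an index, assuming a small fixed-size table keyed on (dev, ino). All names here (ino_index, ino_index_add and so on) are hypothetical and are not the automount source. Once an entry has been removed, the lookup for an expire packet has nothing to return, which is the "can't find map entry" case.

```c
#include <stddef.h>
#include <sys/types.h>

#define INO_INDEX_SLOTS 16      /* hypothetical capacity, for the sketch only */

struct ino_entry {
    int in_use;
    dev_t dev;
    ino_t ino;
    const char *path;           /* direct mount path this inode stands for */
};

struct ino_index {
    struct ino_entry slot[INO_INDEX_SLOTS];
};

/* Record a direct mount path under its (dev, ino) key; -1 if the table is full. */
int ino_index_add(struct ino_index *idx, dev_t dev, ino_t ino, const char *path)
{
    for (size_t i = 0; i < INO_INDEX_SLOTS; i++) {
        struct ino_entry *e = &idx->slot[i];
        if (!e->in_use) {
            e->in_use = 1;
            e->dev = dev;
            e->ino = ino;
            e->path = path;
            return 0;
        }
    }
    return -1;
}

/* Return the path for (dev, ino), or NULL when no entry is indexed. */
const char *ino_index_lookup(const struct ino_index *idx, dev_t dev, ino_t ino)
{
    for (size_t i = 0; i < INO_INDEX_SLOTS; i++) {
        const struct ino_entry *e = &idx->slot[i];
        if (e->in_use && e->dev == dev && e->ino == ino)
            return e->path;
    }
    return NULL;
}

/* Drop the entry for (dev, ino); -1 if it was not indexed. */
int ino_index_remove(struct ino_index *idx, dev_t dev, ino_t ino)
{
    for (size_t i = 0; i < INO_INDEX_SLOTS; i++) {
        struct ino_entry *e = &idx->slot[i];
        if (e->in_use && e->dev == dev && e->ino == ino) {
            e->in_use = 0;
            return 0;
        }
    }
    return -1;
}
```

If ino_index_remove() runs for an entry while an expire for the same (dev, ino) is still in flight, the later ino_index_lookup() returns NULL and the daemon has no path to act on.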

But this hasn't happened to me for a long time.

   When I look, the mount point that the kernel is asking automountd to
   expire has already been unmounted.
That's not right ...

   The mount map uses LDAP and changes quite often.  My guess is that
   automountd notices that some directory has been removed from the map,
   and so removes the map entry.  This presumably races with the expiry
   process.  The mount gets unmounted because it is removed from the map
   at the same time that expiry wants to remove it, and confusion results.
That sounds different to the terminology I'd use but I think I get what
you're saying.

I would describe it as: a map entry has been removed from the map while
it's in use, causing expires for that map entry to be done on an entry
that's been removed from the index we need for the map entry lookup.

This map entry shouldn't be removed in this case.

   My current thought for a solution is to change the way the kernel waits
   for NFY_EXPIRE replies.  Instead of waiting indefinitely it waits with
   a timeout.  If the wait times out and the filesystem is still mounted,
   it just loops around and waits again.  If after the timeout the
   filesystem has been unmounted it waits one more time (just in case
   automountd is about to reply) and then aborts the wait with -EAGAIN.
   I've provided the customer with a patch to do this using a 5 second
   wait.  I don't have test results yet.
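
The control flow described above can be modelled in user space roughly as follows. This is only a sketch of the loop logic, not the kernel patch: the struct, its fields and the function name are hypothetical, each loop iteration stands for one timed wait (5 seconds in the patch), and the max_rounds bound with its -ETIMEDOUT result exists only so the model terminates.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of what happens while the kernel waits for the ACK. */
struct expire_wait_model {
    int ack_round;      /* round in which the daemon replies, -1 for never */
    int unmount_round;  /* first round in which the fs is seen unmounted, -1 for never */
};

/*
 * Returns 0 when the reply arrives, -EAGAIN when the wait is abandoned
 * because the filesystem is already gone, and -ETIMEDOUT when the model
 * runs out of rounds (the proposal itself would keep waiting instead).
 */
int wait_for_expire_ack(const struct expire_wait_model *m, int max_rounds)
{
    bool grace_used = false;

    for (int round = 0; round < max_rounds; round++) {
        /* One timed wait: did the daemon reply during this round? */
        if (m->ack_round >= 0 && round >= m->ack_round)
            return 0;

        /* Timed out. Still mounted: loop around and wait again. */
        bool mounted = m->unmount_round < 0 || round < m->unmount_round;
        if (mounted)
            continue;

        /* Unmounted: wait one more time in case a reply is on its way. */
        if (!grace_used) {
            grace_used = true;
            continue;
        }

        /* Unmounted and still no reply after the extra wait: give up. */
        return -EAGAIN;
    }
    return -ETIMEDOUT;
}
```

With ack_round = -1 and unmount_round = 1 the model returns -EAGAIN after the extra wait; with ack_round = 2 and unmount_round = 1 the late reply is still picked up and it returns 0.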
I really don't think this is a kernel problem, it's a user space problem.

Some time ago there was a weird case where an active map entry was being
removed from the map entry cache. I had a little trouble even working out
what I had done when I came across it in a clean up a while ago. So if
this is what you're seeing we'll need to do some work to work out what
I saw and what I was doing to fix it.

Let me check 5.1.3 and get back to you.

   So my questions are:
    - is this race really possible? Can removal-from-map race with expiry?
Well, maybe, but it shouldn't because walking into an expiring mount
or one that's being mounted shouldn't be possible and I haven't seen
symptoms of that happening for a very long time, certainly not with
a kernel as recent as 4.12.

I really think it's a mistake I'm making in the user space code.

    - is my timeout fix reasonable?  Might it cause other problems?  Is
      there a better way to fix this inside automountd?
Probably and don't know.

I think user space is the problem here and I suspect trying to change
the kernel won't actually fix the problem because it's a user space
mistake that could still happen.

I'm not sure about the wisdom of my not trying to recover from this
either. Originally it was done because if this happened things would
only get worse and the problem would become hidden. So I made the
failure fatal so I could get a core of the state at the time it happened,
and that would be more likely to yield information about the cause. And
this should never happen so the only choice is to fix it.

Thanks - you've given me some useful pointers.  I'll look some more.

I have a core of automountd while it is hanging (so after the initial
problem) and also a core of the kernel.  So if you do find more time to
look and want me to find something in a core file, just let me know.

Umm ... sounds like you didn't see my second reply to this.

It refers to a commit that resolves a problem that sounds a lot like
what you're seeing:

https://www.spinics.net/lists/autofs/msg02557.html

Ian