Re: Help with autofs hang

Ian Kent <raven@xxxxxxxxxx> · Mon, 20 Feb 2023 09:40:50 +0800

On 20/2/23 08:42, Ian Kent wrote:
  The mount map uses LDAP and changes quite often.  My guess is that
  automountd notices that some directory has been removed from the map,
  and so removes the map entry.  This presumably races with the expiry
  process.  The mount gets unmounted because it is removed from the map
  at the same time that expiry wants to remove it, and confusion results.

That sounds different to the terminology I'd use but I think I get what
you're saying.

I would describe it as: a map entry has been removed from the map while
it's in use, causing expires for that map entry to be done on an entry
that's been removed from the index we need for the map entry lookup.

The map entry shouldn't be removed in this case.

  My current thought for a solution is to change the way the kernel waits
  for NFY_EXPIRE replies.  Instead of waiting indefinitely it waits with
  a timeout.  If the wait times out and the filesystem is still mounted,
  it just loops around and waits again.  If after the timeout the
  filesystem has been unmounted it waits one more time (just in case
  automountd is about to reply) and then aborts the wait with -EAGAIN.
  I've provided the customer with a patch to do this using a 5 second
  wait.  I don't have test results yet.

I really don't think this is a kernel problem, it's a user space problem.

Some time ago there was a weird case where an active map entry was being
removed from the map entry cache. I had a little trouble even working out
what I had done when I came across it in a clean up a while ago. So if
this is what you're seeing we'll need to do some work to work out what
I saw and what I was doing to fix it.

Let me check 5.1.3 and get back to you.

I had a look and what I was thinking of is already present in 5.1.3.

I did however find something that looks like it's worth considering,
have a look at this, it might help, not sure though:

commit 21ce28df1f4529948df876243fc977908e070296
Author: Ian Kent <raven@xxxxxxxxxx>
Date:   Tue Aug 7 12:05:21 2018 +0800

    autofs-5.1.4 - mark removed cache entry negative

    When re-reading a map, entries that have been removed are detected
    and deleted from the map entry cache by lookup_prune_cache().

    If a removed map entry is mounted at the time lookup_prune_cache()
    is called the map entry is skipped. This is done because the next
    lookup (following the mount expire, which needs the cache entry to
    remain) will detect the stale cache entry and a map update done
    resulting in the stale entry being removed.

    But if a map re-read is performed while the cache entry is mounted
    the cache will appear to be up to date so the removed entry will remain
    valid even after it has expired.

    To cover this case it's sufficient to mark the mounted cache entry
    negative during the cache prune which prevents further lookups from
    using the stale entry.

    Signed-off-by: Ian Kent <raven@xxxxxxxxxx>
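
As a rough illustration of what that commit message describes, here is a toy
cache prune in C. It is a sketch under my own assumptions, not the autofs
implementation of lookup_prune_cache(): struct cache_entry, enum entry_state,
prune_cache() and lookup_usable() are hypothetical names, and the cache is
just a flat array rather than the real map entry cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical entry states for the illustration. */
enum entry_state { ENTRY_VALID, ENTRY_NEGATIVE, ENTRY_DELETED };

struct cache_entry {
    const char *key;
    bool mounted;            /* entry currently has an active mount */
    enum entry_state state;
};

/* True if key is still present in the freshly re-read map. */
static bool key_in_map(const char *key, const char *const *map_keys, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(key, map_keys[i]) == 0)
            return true;
    return false;
}

/*
 * Prune entries that are no longer in the map. An unmounted entry is
 * deleted. A mounted entry is kept, because the later expire still needs
 * it, but is marked negative so new lookups do not use the stale entry.
 */
void prune_cache(struct cache_entry *cache, size_t ncache,
                 const char *const *map_keys, size_t nkeys)
{
    assert(cache != NULL || ncache == 0);

    for (size_t i = 0; i < ncache; i++) {
        struct cache_entry *e = &cache[i];

        if (e->state == ENTRY_DELETED || key_in_map(e->key, map_keys, nkeys))
            continue;
        e->state = e->mounted ? ENTRY_NEGATIVE : ENTRY_DELETED;
    }
}

/* A lookup may only use entries that are neither negative nor deleted. */
bool lookup_usable(const struct cache_entry *e)
{
    return e->state == ENTRY_VALID;
}
```

The point of the design is that the mounted entry stays in the cache for the
expire to find, while lookups treat it as absent.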

There might have been other patches at the time but it doesn't look
like it from the patch description, worth checking though.

Mostly I would be looking at debug logs to find out where the map entry
mistakenly gets deleted, not at all straightforward but I think it's the
only way to tackle this problem.

I'd like to do more to help but I have a difficult problem of my own to
work out how to fix just now.

Anyway, maybe I can put some time into it a bit later if needed, ;)

Ian