We had a report that after "some runtime" (days) the time to umount some file system became "huge" (minutes), even though the file system was nearly "empty" (a few thousand files), while umount after only some "short" run time was sub-second. Analysis and a workaround (for this use case) follow.

The file system in question is mostly used as a "spool" area, so basically lots of

    echo $stuff > $tempfile
    process $tempfile
    rm $tempfile

Investigation shows that this creates huge amounts of negative entries in the dentry cache. There is no memory pressure, and the directory is not removed either, so they stay around.

Reproducer in shell:

    while true; do
        F=$RANDOM
        touch $F
        rm $F
    done

and then

    watch 'cat /proc/sys/fs/dentry-state ; slabtop -o | grep dentry ; grep ^SReclaimable /proc/meminfo'

(Obviously in C, perl or python you can get orders of magnitude more iterations per second; a minimal Python sketch is appended below.)

So this accumulates unused negative dentries quickly, and after some time, given enough RAM, we have gigabytes worth of dentry cache, but no inodes used. Umount of that empty spool file system takes 30 seconds; it will take minutes if you let it run even longer.

In real life, after days of real load, umounting the spool file system (with ~30 GB of accumulated dentry cache, but only a few thousand remaining inodes) took minutes, and produced soft lockups: "BUG: soft lockup - CPU... stuck for XYZ seconds".

The Workaround:
---------------

The interesting part is that this (almost) identical reproducer behaves completely differently:

    while true; do
        F=$RANDOM
        touch $F
        rm $F <$F     #### mind the redirection ####
    done

(unlink before last close)

This does *not* accumulate negative dentries at all, which is how I'd expected the other case to behave as well. If we look at vfs_unlink(), there is a d_delete(dentry) in there.

Total dentries vs. seconds of runtime (with a python reproducer): the upper, linearly increasing dots are "open; close; unlink"; the flat dots are "open; unlink; close". Mind the log scale on the "total dentries" y-axis.

[ASCII plot: "total dentries" (log scale, 1e+04 to 1e+07) over 0 to 900 seconds of runtime; the "open; close; unlink" curve climbs steadily past 1e+07, while the "open; unlink; close" curve stays flat at about 1e+05.]

time umount after 15 minutes and 50 million iterations, so about 50 million dentries: 30 seconds.
time umount after 15 minutes and the same number of iterations, but no "stale negative dentries": 60 *milli*seconds.

So what I suggest is to fix the "touch $F; rm $F" case to have the same effect as the "touch $F; rm $F <$F" case: drop the corresponding dentry.

Would some VFS people kindly look into this, or, in case it "works as designed", explain why and point out what I am missing? Pretty please? :-)

Also, obviously anyone can produce negative dentries by just stat()ing non-existing file names. This comes up repeatedly (I found posts going back ~15 years and more) in various contexts. Once the dentry cache grows beyond what is a reasonable working set size for the cache levels, inserting new (negative) dentries costs about as much as the caching was supposed to save...
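For reference, here is a minimal Python sketch of the two patterns compared above (close-then-unlink vs. unlink-then-close). It is not the exact reproducer used for the plot; the spool path, file names and iteration count are illustrative placeholders:

    #!/usr/bin/env python3
    # Sketch of the two access patterns; adjust SPOOL and ITERATIONS as needed.
    import os

    SPOOL = "/mnt/spool"      # assumed mount point of the spool file system
    ITERATIONS = 1000000

    def close_then_unlink(path):
        # "touch $F; rm $F": the dentry turns negative and stays cached
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
        os.close(fd)
        os.unlink(path)

    def unlink_then_close(path):
        # "touch $F; rm $F <$F": unlink before the last close; per the
        # observation above, this does not leave a negative dentry behind
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
        os.unlink(path)
        os.close(fd)

    def dentry_state():
        # /proc/sys/fs/dentry-state: the first two fields are nr_dentry
        # and nr_unused
        with open("/proc/sys/fs/dentry-state") as f:
            return f.read().split()

    if __name__ == "__main__":
        print("before:", dentry_state())
        for i in range(ITERATIONS):
            close_then_unlink(os.path.join(SPOOL, "tmp-%d" % i))
            # swap in unlink_then_close() to get the flat curve instead
        print("after: ", dentry_state())

Running this under the watch command above should show the same difference as the two shell loops.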
I think it would be useful to introduce different sets of tunables for positive and negative dentry "expiry", or a limit on the number of negative dentries (maybe relative to the total rather than absolute), to avoid such a massive (several gigabytes of slab) buildup of dentry cache on a mostly empty file system, without the indirection via global (or cgroup) memory pressure. Maybe, instead of always allocating and inserting a new dentry, prefer to recycle some existing negative dentry that has not been looked at for a long time.

Thanks,

    Lars