Re: Name hashing function causing a perf regression


Linus Torvalds wrote:
> On Sun, Sep 14, 2014 at 7:49 PM, Tetsuo Handa
> <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > I care, at least as of RHEL6 (2.6.32) kernels. Some users are using tmpfs
> > in order to avoid dentry cache bouncing bomb (a series of simple stat() calls
> > take about 10 seconds when there are 300,000,000+ unused dentries, P31 of
> > http://I-love.SAKURA.ne.jp/tomoyo/LCJ2014-en.pdf ). I don't know performance
> > on recent kernels.
> 
> Hmm. Probably not hugely different. It's still fairly easy to generate
> a lot of negative dentries. Do you have an actual test program that
> does this and is annoying? Do people actually do that?

OK. I quote comments #5 and #9 of RHBZ 1061562 below.
The test program I used (the source code of a.out) is in comment #5. The
customer in comment #9 is now using tmpfs, and so far I have heard of no
problems.

----- comment #5 start -----

At least two enterprise systems which create a lot of dentries are suffering
from unexpected system reboots/hangups. One of the systems has about 100GB of
RAM, and the sar report showed that the number of unused dentries was over
300,000,000 just before an unexpected reboot happened. Also, by using the
dentry producer program described in comment #0, the system successfully
triggered another unexpected reboot.
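
For reference, the unused-dentry count that sar reports (dentunusd) comes
from /proc/sys/fs/dentry-state, which can also be read directly. A minimal
sketch of such a reader (the field layout, nr_dentry followed by nr_unused,
is the standard one; this program is mine, not part of the original test):

---------- a dentry-state reader start ----------
#include <stdio.h>

int main(void)
{
	long nr_dentry, nr_unused;
	FILE *f = fopen("/proc/sys/fs/dentry-state", "r");

	/* First two fields are the total and the unused dentry counts. */
	if (!f || fscanf(f, "%ld %ld", &nr_dentry, &nr_unused) != 2) {
		perror("dentry-state");
		return 1;
	}
	fclose(f);
	printf("dentries: %ld total, %ld unused\n", nr_dentry, nr_unused);
	return 0;
}
---------- a dentry-state reader end ----------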

Therefore, I am examining the cause of these unexpected reboots. Since I had
a chance to access a machine with 64GB of RAM this week, I ran a test using
SystemTap and an updated version of the dentry producer program, both shown
below.

---------- a SystemTap command line start ----------
stap -e '
global stat_start;   # per-thread entry timestamp for vfs_stat()
global shrink_start; # per-thread entry timestamp for shrink_slab()
global counter;      # total number of vfs_stat() calls observed
probe kernel.function("vfs_stat") {
  counter++;
  stat_start[tid()] = gettimeofday_ns();
}
probe kernel.function("vfs_stat").return {
  t = gettimeofday_ns() - stat_start[tid()];
  delete stat_start[tid()];  # avoid leaking map entries
  if (t >= 1000000)          # report calls that took 1 ms or longer
    printf("%s: vfs_stat(%u) took %u ms\n", execname(), counter, t / 1000000);
}
probe kernel.function("shrink_slab") {
  printf("%s: shrink_slab() started\n", execname());
  shrink_start[tid()] = gettimeofday_ns();
}
probe kernel.function("shrink_slab").return {
  t = gettimeofday_ns() - shrink_start[tid()];
  delete shrink_start[tid()];
  printf("%s: shrink_slab() took %u ms\n", execname(), t / 1000000);
}
'
---------- a SystemTap command line end ----------

---------- an example of very-fast dentry producer start ----------
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
        unsigned int i = 0;
        struct stat buf;
        char buffer[128] = { 0 };

        /* stat() a never-ending sequence of nonexistent names; each failed
         * lookup leaves a negative (unused) dentry in the dentry cache. */
        while (1) {
                snprintf(buffer, sizeof(buffer) - 1, "%u", i++);
                stat(buffer, &buf);
        }
        return 0;
}
---------- an example of very-fast dentry producer end ----------

The test result showed that a single stat() call can take a few seconds while
kswapd is reclaiming memory in shrink_slab(), and that a series of just a few
stat() calls can take more than 10 seconds in total.

---------- SystemTap output start ----------
bash: shrink_slab() started
bash: shrink_slab() took 18 ms
bash: shrink_slab() started
bash: shrink_slab() took 0 ms
crond: vfs_stat(9964303) took 10 ms
crond: vfs_stat(79137289) took 6 ms
sa1: vfs_stat(79200631) took 9 ms
bash: vfs_stat(98784249) took 10 ms
run-parts: vfs_stat(98787397) took 5 ms
run-parts: vfs_stat(98791464) took 3 ms
0anacron: vfs_stat(98801077) took 15 ms
0anacron: vfs_stat(98804416) took 10 ms
kswapd0: shrink_slab() started
kswapd1: shrink_slab() started
a.out: vfs_stat(305840783) took 282 ms
a.out: vfs_stat(305840784) took 999 ms
a.out: vfs_stat(305840785) took 335 ms
a.out: vfs_stat(305840786) took 3654 ms
a.out: vfs_stat(305840787) took 3009 ms
a.out: vfs_stat(305840788) took 999 ms
a.out: vfs_stat(305840790) took 1228 ms
a.out: vfs_stat(305847338) took 2 ms
a.out: vfs_stat(305849179) took 3 ms
a.out: vfs_stat(305849870) took 2 ms
a.out: vfs_stat(305850478) took 1 ms
a.out: vfs_stat(305854046) took 1 ms
a.out: vfs_stat(305856251) took 1 ms
a.out: vfs_stat(305864002) took 5 ms
a.out: vfs_stat(305876155) took 6 ms
a.out: vfs_stat(305885622) took 12 ms
a.out: vfs_stat(305886187) took 9 ms
a.out: vfs_stat(305888365) took 2 ms
a.out: vfs_stat(305892807) took 2 ms
a.out: vfs_stat(305896127) took 2 ms
a.out: vfs_stat(305897223) took 2 ms
a.out: vfs_stat(305910449) took 1 ms
a.out: vfs_stat(305911688) took 8 ms
a.out: vfs_stat(305914482) took 12 ms
kswapd1: shrink_slab() took 11810 ms
kswapd1: shrink_slab() started
a.out: vfs_stat(305939079) took 7 ms
a.out: vfs_stat(305943797) took 21 ms
a.out: vfs_stat(305944628) took 12 ms
a.out: vfs_stat(305964155) took 3 ms
a.out: vfs_stat(305965994) took 15 ms
a.out: vfs_stat(305966685) took 15 ms
a.out: vfs_stat(305967124) took 1 ms
a.out: vfs_stat(305973015) took 2 ms
a.out: vfs_stat(305975511) took 4 ms
a.out: vfs_stat(305978080) took 3 ms
a.out: vfs_stat(305982128) took 3 ms
a.out: vfs_stat(305988765) took 4 ms
a.out: vfs_stat(305991667) took 3 ms
a.out: vfs_stat(305995990) took 3 ms
a.out: vfs_stat(306000642) took 1 ms
a.out: vfs_stat(306010369) took 3 ms
a.out: vfs_stat(306023386) took 10 ms
a.out: vfs_stat(306023881) took 9 ms
a.out: vfs_stat(306024374) took 3 ms
a.out: vfs_stat(306033201) took 5 ms
kswapd1: shrink_slab() took 1373 ms
a.out: vfs_stat(306038979) took 4 ms
a.out: vfs_stat(306045321) took 1 ms
a.out: vfs_stat(306048306) took 1 ms
a.out: vfs_stat(306050274) took 5 ms
a.out: vfs_stat(306053855) took 4 ms
a.out: vfs_stat(306054220) took 1 ms
a.out: vfs_stat(306075005) took 8 ms
a.out: vfs_stat(306076840) took 17 ms
a.out: vfs_stat(306085567) took 3 ms
a.out: vfs_stat(306101199) took 1 ms
a.out: vfs_stat(306102299) took 20 ms
a.out: vfs_stat(306109915) took 2 ms
a.out: vfs_stat(306111825) took 14 ms
a.out: vfs_stat(306114245) took 3 ms
a.out: vfs_stat(306114878) took 4 ms
kswapd0: shrink_slab() took 14025 ms
kswapd0: shrink_slab() started
a.out: vfs_stat(306133947) took 29 ms
a.out: vfs_stat(306138301) took 5 ms
a.out: vfs_stat(306144529) took 5 ms
a.out: vfs_stat(306149390) took 5 ms
a.out: vfs_stat(306164696) took 8 ms
a.out: vfs_stat(306165424) took 5 ms
a.out: vfs_stat(306176530) took 1 ms
a.out: vfs_stat(306180342) took 20 ms
a.out: vfs_stat(306187037) took 7 ms
a.out: vfs_stat(306192183) took 12 ms
a.out: vfs_stat(306201618) took 9 ms
a.out: vfs_stat(306203887) took 2 ms
a.out: vfs_stat(306204688) took 4 ms
a.out: vfs_stat(306206134) took 1 ms
a.out: vfs_stat(306207974) took 2 ms
a.out: vfs_stat(306220292) took 12 ms
a.out: vfs_stat(306224972) took 6 ms
a.out: vfs_stat(306231423) took 6 ms
a.out: vfs_stat(306232489) took 1 ms
a.out: vfs_stat(306236307) took 4 ms
a.out: vfs_stat(306238302) took 1 ms
a.out: vfs_stat(306245840) took 1 ms
a.out: vfs_stat(306249541) took 14 ms
a.out: vfs_stat(306251562) took 6 ms
a.out: vfs_stat(306260444) took 2 ms
a.out: vfs_stat(306275088) took 4 ms
a.out: vfs_stat(306278510) took 2 ms
a.out: vfs_stat(306278795) took 3 ms
kswapd0: shrink_slab() took 1693 ms
kswapd0: shrink_slab() started
a.out: vfs_stat(306295322) took 22 ms
a.out: vfs_stat(306307052) took 17 ms
a.out: vfs_stat(306309420) took 1 ms
a.out: vfs_stat(306315293) took 4 ms
a.out: vfs_stat(306325204) took 10 ms
a.out: vfs_stat(306325362) took 17 ms
a.out: vfs_stat(306329728) took 2 ms
a.out: vfs_stat(306334637) took 5 ms
a.out: vfs_stat(306350149) took 7 ms
a.out: vfs_stat(306352355) took 1 ms
a.out: vfs_stat(306354047) took 1 ms
a.out: vfs_stat(306356301) took 1 ms
a.out: vfs_stat(306384456) took 1 ms
a.out: vfs_stat(306388446) took 10 ms
a.out: vfs_stat(306391249) took 2 ms
a.out: vfs_stat(306393632) took 1 ms
a.out: vfs_stat(306407549) took 6 ms
a.out: vfs_stat(306414708) took 2 ms
a.out: vfs_stat(306438538) took 1 ms
a.out: vfs_stat(306447413) took 9 ms
a.out: vfs_stat(306453184) took 2 ms
a.out: vfs_stat(306459036) took 1 ms
kswapd0: shrink_slab() took 1691 ms
---------- SystemTap output end ----------

If a watchdog program does something like

  while (1) {
    read from /proc/$pid/status for process check
    sleep for 1 second for CPU saving
    write to /dev/watchdog for resetting timer
  }

the watchdog program could be blocked long enough to miss writing to
/dev/watchdog before the timeout expires, causing a hardware reset.
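
To make the failure mode concrete, here is a minimal C sketch of such a
watchdog loop (the device handling follows the standard Linux watchdog API;
the monitored pid and the rest are placeholders of mine, not taken from the
customer's actual program). If step 1 or step 3 is delayed past the watchdog
timeout, for instance by a multi-second stall like the ones in the SystemTap
output above, the hardware resets the machine:

---------- a watchdog loop sketch start ----------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
	int iterations = argc > 1 ? atoi(argv[1]) : 1; /* real daemons loop forever */
	char path[64], buf[256];

	/* The sketch monitors its own pid; a real watchdog program would
	 * monitor the service it supervises. */
	snprintf(path, sizeof(path), "/proc/%d/status", (int)getpid());
	/* The open fails harmlessly on machines without watchdog hardware. */
	int wd = open("/dev/watchdog", O_WRONLY);

	while (iterations-- > 0) {
		int fd = open(path, O_RDONLY);      /* 1. process check */
		if (fd < 0)
			return 1;
		if (read(fd, buf, sizeof(buf) - 1) < 0)
			buf[0] = '\0';
		close(fd);
		sleep(1);                           /* 2. save CPU */
		if (wd >= 0)
			write(wd, "", 1);           /* 3. reset the timer */
	}
	if (wd >= 0) {
		write(wd, "V", 1);  /* magic close: disarm the watchdog */
		close(wd);
	}
	puts("kept the watchdog fed");
	return 0;
}
---------- a watchdog loop sketch end ----------

Note that closing /dev/watchdog without first writing 'V' does not disarm
the timer, so a daemon that simply crashes also leads to a reset.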

So far, I suspect that the cause of the unexpected reboots is the watchdog
program being unexpectedly blocked by dcache_lock contention while kswapd
calls cond_resched_lock(&dcache_lock) from __shrink_dcache_sb(), reached via
prune_dcache() from shrink_dcache_memory() from shrink_slab().

If my guess is correct, this is a time bomb for systems with a huge amount of
RAM and a huge amount of dentries.

----- comment #5 end -----

----- comment #9 start -----

If I remember correctly, the system is an active/standby cluster, and this
problem occurs on the standby machine, which is almost idle.

On the standby machine, "service drbd status" is issued by heartbeat software
twice per second. The /etc/init.d/drbd script uses /bin/sort , and /bin/sort
creates temporary files. Since the standby machine is nearly idle, nothing
but dentries consumes the RAM. Therefore, by the time uptime reached 120
days, the number of unused dentries (left behind by the temporary files
/bin/sort creates) had exceeded 300,000,000.

Since the cause of these unused dentries turned out to be /etc/init.d/drbd ,
I suggested that the customer create temporary files on tmpfs (rather than
ext4) by running

  # echo "export TMPDIR=/dev/shm" >> /etc/default/drbd

as a workaround.

Also, I suggested that the customer use a serial console or netconsole to
capture kernel messages (if any) in case of unexpected reboots/hangups, since
I currently suspect a watchdog-like timeout when kswapd starts memory
reclaim.

----- comment #9 end -----
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



