Re: [PATCH 0/4] seqlock: Add new blocking reader type & use rwlock

Waiman Long <waiman.long@xxxxxx> · Fri, 23 Aug 2013 18:49:54 -0400

On 07/03/2013 09:52 PM, Waiman Long wrote:
During some perf-record sessions of the kernel running the high_systime
workload of the AIM7 benchmark, it was found that quite a large portion
of the spinlock contention was due to the perf_event_mmap_event()
function itself. This perf kernel function calls d_path() which,
in turn, call path_get() and dput() indirectly. These 3 functions
were the hottest functions shown in the perf-report output of
the _raw_spin_lock() function in an 8-socket system with 80 cores
(hyperthreading off) with a 3.10-rc1 kernel:

-  13.91%           reaim  [kernel.kallsyms]     [k] _raw_spin_lock
    - _raw_spin_lock
       + 35.54% path_get
       + 34.85% dput
       + 19.49% d_path

In fact, the output of the "perf record -s -a" (without call-graph)
showed:

  13.37%           reaim  [kernel.kallsyms]     [k] _raw_spin_lock
   7.61%              ls  [kernel.kallsyms]     [k] _raw_spin_lock
   3.54%            true  [kernel.kallsyms]     [k] _raw_spin_lock

Without using the perf monitoring tool, the actual execution profile
will be quite different. In fact, with this patch set and my other
lockless reference count update patch applied, the output of the same
"perf record -s -a" command became:

   2.82%           reaim  [kernel.kallsyms]     [k] _raw_spin_lock
   1.11%              ls  [kernel.kallsyms]     [k] _raw_spin_lock
   0.26%            true  [kernel.kallsyms]     [k] _raw_spin_lock

So the time spent on _raw_spin_lock() function went down from 24.52%
to 4.19%. It can be seen that the performance data collected by the
perf-record command can be heavily skewed in some cases on a system
with a large number of CPUs. This set of patches enables the perf
command to give a more accurate and reliable picture of what is really
happening in the system. At the same time, they can also improve the
general performance of systems especially those with a large number
of CPUs.

The d_path() function takes the following two locks:

1. dentry->d_lock [spinlock] from dget()/dget_parent()/dput()
2. rename_lock    [seqlock]  from d_path()

This set of patches address the rename_lock bottleneck by changing the
way seqlock is implemented so that we can optionally use a read/write
lock as the underlying implementation instead of the default spinlock.

Incidentally, this patch also provides slight 5% performance boost
over just the the lockless reference count update patch in the short
workload of the AIM7 benchmark running on a 8-socket 80-core DL980
machine on a 3.10-based kernel. There were still a few percentage
points of contention in d_path() and getcwd syscall left due to their
use of the rename_lock.

Signed-off-by: Waiman Long<Waiman.Long@xxxxxx>

Waiman Long (4):
   seqlock: Add a new blocking reader type
   dcache: Use blocking reader seqlock when protected data are not
     changed
   seqlock: Allow the use of rwlock in seqlock
   dcache: Use rwlock as the underlying lock in rename_lock

  fs/dcache.c             |   28 ++++----
  include/linux/seqlock.h |  167 ++++++++++++++++++++++++++++++++++++++++-------
  2 files changed, 158 insertions(+), 37 deletions(-)

I haven't received any feedback on this patchset. Would you mind letting 
me know if any further change will be needed to make it acceptable to be 
merged?

Thank,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html