Re: [RFC PATCH 2/2] vfs: Use per-cpu list for superblock's inode list

Waiman Long <waiman.long@xxxxxxx> · Wed, 17 Feb 2016 10:40:27 -0500

On 02/17/2016 02:16 AM, Ingo Molnar wrote:
* Waiman Long <Waiman.Long@xxxxxxx> wrote:

When many threads are trying to add inodes to or delete inodes from
a superblock's s_inodes list, spinlock contention on the list can
become a performance bottleneck.

This patch changes the s_inodes field to become a per-cpu list with
per-cpu spinlocks.
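
For readers unfamiliar with the idea, here is a minimal userspace sketch of
a list split into several sub-lists, each with its own lock. It is not the
code from the patch: the names (sharded_list, NR_SHARDS, and so on) are
hypothetical, pthread mutexes stand in for kernel spinlocks, and a fixed
shard count stands in for the number of CPUs. Each node remembers the
sub-list it was added to, since a deletion may run on a different CPU than
the addition.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NR_SHARDS 4	/* assumption: stand-in for the number of CPUs */

struct node {
	struct node *prev, *next;
	int shard;		/* sub-list this node was added to */
};

struct shard {
	pthread_mutex_t lock;	/* protects only this sub-list */
	struct node head;	/* sentinel of a circular list */
};

struct sharded_list {
	struct shard shards[NR_SHARDS];
};

static void sharded_list_init(struct sharded_list *l)
{
	for (int i = 0; i < NR_SHARDS; i++) {
		struct shard *s = &l->shards[i];

		pthread_mutex_init(&s->lock, NULL);
		s->head.prev = s->head.next = &s->head;
		s->head.shard = i;
	}
}

/* Add to the sub-list of the given CPU; only that sub-list's lock is taken. */
static void sharded_list_add(struct sharded_list *l, struct node *n, int cpu)
{
	assert(cpu >= 0);
	int idx = cpu % NR_SHARDS;
	struct shard *s = &l->shards[idx];

	pthread_mutex_lock(&s->lock);
	n->shard = idx;
	n->next = s->head.next;
	n->prev = &s->head;
	s->head.next->prev = n;
	s->head.next = n;
	pthread_mutex_unlock(&s->lock);
}

/* Delete from whichever sub-list the node was originally added to. */
static void sharded_list_del(struct sharded_list *l, struct node *n)
{
	struct shard *s = &l->shards[n->shard];

	pthread_mutex_lock(&s->lock);
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
	pthread_mutex_unlock(&s->lock);
}

/* Walking the whole list means visiting every sub-list in turn. */
static size_t sharded_list_count(struct sharded_list *l)
{
	size_t total = 0;

	for (int i = 0; i < NR_SHARDS; i++) {
		struct shard *s = &l->shards[i];

		pthread_mutex_lock(&s->lock);
		for (struct node *n = s->head.next; n != &s->head; n = n->next)
			total++;
		pthread_mutex_unlock(&s->lock);
	}
	return total;
}
```

The trade-off in such a layout is that adds and deletes on different
sub-lists never contend with each other, while a full traversal has to
take every lock in turn.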

An exit microbenchmark creates a large number of threads, attaches
many inodes to them and then exits. The runtimes of that
microbenchmark with 1000 threads before and after the patch on a
4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
follows:

   Kernel            Elapsed Time    System Time
   ------            ------------    -----------
   Vanilla 4.5-rc4      65.29s         82m14s
   Patched 4.5-rc4      22.81s         23m03s

Before the patch, spinlock contention in the inode_sb_list_add()
function during the startup phase and in the inode_sb_list_del()
function during the exit phase accounted for about 79% and 93% of
total CPU time respectively (as measured by perf). After the patch,
the percpu_list_add() function consumed only about 0.04% of CPU time
during the startup phase. The percpu_list_del() function consumed
about 0.4% of CPU time during the exit phase. There was still some
spinlock contention, but it happened elsewhere.

Pretty impressive IMHO!

Just for the record, here's your former 'batched list' number inserted into the
above table:

    Kernel                       Elapsed Time    System Time
    ------                       ------------    -----------
    Vanilla      [v4.5-rc4]      65.29s          82m14s
    batched list [v4.4]          45.69s          49m44s
    percpu list  [v4.5-rc4]      22.81s          23m03s

i.e. the proper per-CPU data structure and the resulting improvement in cache
locality gave another doubling in performance.
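
The ratios behind that statement can be checked with a small helper. This is
only an illustrative sketch: the function names are hypothetical and the
inputs are the figures from the table above. Going from the batched list to
the percpu list works out to roughly 2.0x in elapsed time and roughly 2.2x
in system time. Note that the batched list number was taken on v4.4 rather
than v4.5-rc4, so the comparison is approximate.

```c
#include <assert.h>
#include <stdio.h>

/* Convert a time given as minutes and seconds (e.g. 82m14s) to seconds. */
static double to_seconds(int minutes, double seconds)
{
	assert(minutes >= 0 && seconds >= 0.0);
	return minutes * 60.0 + seconds;
}

/* Speedup factor: how many times faster "after" is than "before". */
static double speedup(double before, double after)
{
	assert(after > 0.0);
	return before / after;
}

/* Print the speedups implied by the table (figures copied from it). */
static void print_speedups(void)
{
	printf("elapsed, vanilla -> batched: %.2fx\n", speedup(65.29, 45.69));
	printf("elapsed, batched -> percpu:  %.2fx\n", speedup(45.69, 22.81));
	printf("elapsed, vanilla -> percpu:  %.2fx\n", speedup(65.29, 22.81));
	printf("system,  batched -> percpu:  %.2fx\n",
	       speedup(to_seconds(49, 44), to_seconds(23, 3)));
	printf("system,  vanilla -> percpu:  %.2fx\n",
	       speedup(to_seconds(82, 14), to_seconds(23, 3)));
}
```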

Just out of curiosity, could you post the profile of the latest patches - is there
any (bigger) SMP overhead left, or is the profile pretty flat now?

Thanks,

	Ingo

Yes, there was still spinlock contention elsewhere in the exit path.
The bulk of the CPU time is now spent in:

-   79.23%    79.23%         a.out  [kernel.kallsyms]    [k] native_queued_spin
   - native_queued_spin_lock_slowpath
      - 99.99% queued_spin_lock_slowpath
         - 100.00% _raw_spin_lock
            - 99.98% list_lru_del
               - d_lru_del
                  - 100.00% select_collect
                       detach_and_collect
                       d_walk
                       d_invalidate
                       proc_flush_task
                       release_task
                       do_exit
                       do_group_exit
                       get_signal
                       do_signal
                       exit_to_usermode_loop
                       syscall_return_slowpath
                       int_ret_from_sys_call

The contended locks were the nlru->lock instances. On the 4-node system
that I used, there are four of those.

Cheers,
Longman