Hi all
Short summary : Nice speedups for allocation/deallocation of sockets/pipes
(From 27.5 seconds to 2.9 seconds (2.3 seconds with SLUB tweaks))
Long version :
For this second version, I removed the mntput()/mntget() optimization
since most reviewers are not convinced it is usefull.
This is a four lines patch that can be reconsidered later.
I chose the name SINGLE instead of SPECIAL to name
isolated dentries (for sockets, pipes, anonymous fd) that
have no parent and no relationship in the vfs.
Thanks all
To allocate a socket or a pipe we :
0) Do the usual file table manipulation (pretty scalable these days,
but would be faster if 'struct files' were using SLAB_DESTROY_BY_RCU
and avoid call_rcu() cache killer)
1) allocate an inode with new_inode()
This function :
- locks inode_lock,
- dirties nr_inodes counter
- dirties inode_in_use list (for sockets/pipes, this is useless)
- dirties superblock s_inodes. - dirties last_ino counter
All these are in different cache lines unfortunatly.
2) allocate a dentry
d_alloc() takes dcache_lock,
insert dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root)
dirties nr_dentry
3) d_instantiate() dentry (dcache_lock taken again)
4) init_file() -> atomic_inc() on sock_mnt->refcount
At close() time, we must undo the things. Its even more expensive because
of the _atomic_dec_and_lock() that stress a lot, and because of two cache
lines that are touched when an element is deleted from a list
(previous and next items)
This is really bad, since sockets/pipes dont need to be visible in dcache
or an inode list per super block.
This patch series get rid of all but one contended cache lines for
sockets, pipes and anonymous fd (signalfd, timerfd, ...)
Sample program :
for (i = 0; i < 1000000; i++)
close(socket(AF_INET, SOCK_STREAM, 0));
Cost if one cpu runs the program :
real 1.561s
user 0.092s
sys 1.469s
Cost if 8 processes are launched on a 8 CPU machine
(benchmark named socket8) :
real 27.496s <<<< !!!! >>>>
user 0.657s
sys 3m39.092s
Oprofile results (for the 8 process run, 3 times):
CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum. % symbol name
3347352 3347352 28.0232 28.0232 _atomic_dec_and_lock
3301428 6648780 27.6388 55.6620 d_instantiate
2971130 9619910 24.8736 80.5355 d_alloc
241318 9861228 2.0203 82.5558 init_file
146190 10007418 1.2239 83.7797 __slab_free
144149 10151567 1.2068 84.9864 inotify_d_instantiate
143971 10295538 1.2053 86.1917 inet_create
137168 10432706 1.1483 87.3401 new_inode
117549 10550255 0.9841 88.3242 add_partial
110795 10661050 0.9275 89.2517 generic_drop_inode
107137 10768187 0.8969 90.1486 kmem_cache_alloc
94029 10862216 0.7872 90.9358 tcp_close
82837 10945053 0.6935 91.6293 dput
67486 11012539 0.5650 92.1943 dentry_iput
57751 11070290 0.4835 92.6778 iput
54327 11124617 0.4548 93.1326 tcp_v4_init_sock
49921 11174538 0.4179 93.5505 sysenter_past_esp
47616 11222154 0.3986 93.9491 kmem_cache_free
30792 11252946 0.2578 94.2069 clear_inode
27540 11280486 0.2306 94.4375 copy_from_user
26509 11306995 0.2219 94.6594 init_timer
26363 11333358 0.2207 94.8801 discard_slab
25284 11358642 0.2117 95.0918 __fput
22482 11381124 0.1882 95.2800 __percpu_counter_add
20369 11401493 0.1705 95.4505 sock_alloc
18501 11419994 0.1549 95.6054 inet_csk_destroy_sock
17923 11437917 0.1500 95.7555 sys_close
This patch serie avoids all contented cache lines and makes this "bench"
pretty fast.
New cost if run on one cpu :
real 1.325s (instead of 1.561s)
user 0.091s
sys 1.234s
If run on 8 CPUS :
real 0m2.971s
user 0m0.726s
sys 0m21.310s
CPU: Core 2, speed 3000.04 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100
000
samples cum. samples % cum. % symbol name
189772 189772 12.7205 12.7205 _atomic_dec_and_lock
140467 330239 9.4155 22.1360 __slab_free
128210 458449 8.5940 30.7300 add_partial
121578 580027 8.1494 38.8794 kmem_cache_alloc
72626 652653 4.8681 43.7475 init_file
62720 715373 4.2041 47.9517 __percpu_counter_add
51632 767005 3.4609 51.4126 sysenter_past_esp
49196 816201 3.2976 54.7102 tcp_close
47933 864134 3.2130 57.9231 kmem_cache_free
29628 893762 1.9860 59.9091 copy_from_user
28443 922205 1.9065 61.8157 init_timer
25602 947807 1.7161 63.5318 __slab_alloc
22139 969946 1.4840 65.0158 discard_slab
20428 990374 1.3693 66.3851 __call_rcu
18174 1008548 1.2182 67.6033 alloc_fd
17643 1026191 1.1826 68.7859 __fput
17374 1043565 1.1646 69.9505 d_alloc
17196 1060761 1.1527 71.1031 sys_close
17024 1077785 1.1411 72.2442 inet_create
15208 1092993 1.0194 73.2636 alloc_inode
12201 1105194 0.8178 74.0815 fd_install
12167 1117361 0.8156 74.8970 lock_sock_nested
12123 1129484 0.8126 75.7096 get_empty_filp
11648 1141132 0.7808 76.4904 release_sock
11509 1152641 0.7715 77.2619 dput
11335 1163976 0.7598 78.0216 sock_init_data
11038 1175014 0.7399 78.7615 inet_csk_destroy_sock
10880 1185894 0.7293 79.4908 drop_file_write_access
10083 1195977 0.6759 80.1667 inotify_d_instantiate
9216 1205193 0.6178 80.7844 local_bh_enable_ip
8881 1214074 0.5953 81.3797 sysenter_do_call
8759 1222833 0.5871 81.9668 setup_object
8489 1231322 0.5690 82.5359 iput_single
So we now hit mntput()/mntget() and SLUB.
The last point is about SLUB being hit hard, unless we
use slub_min_order=3 (or slub_min_objects=45) at boot,
or we use Christoph Lameter patch (struct file RCU optimizations)
http://thread.gmane.org/gmane.linux.kernel/418615
If we boot machine with slub_min_order=3, SLUB overhead disappears.
If run on 8 CPUS :
real 0m2.315s
user 0m0.752s
sys 0m17.324s
CPU: Core 2, speed 3000.15 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum. % symbol name
199409 199409 15.6440 15.6440 _atomic_dec_and_lock (mntput())
141606 341015 11.1092 26.7532 kmem_cache_alloc
76071 417086 5.9679 32.7211 init_file
70595 487681 5.5383 38.2595 __percpu_counter_add
51595 539276 4.0477 42.3072 sysenter_past_esp
49313 588589 3.8687 46.1759 tcp_close
45503 634092 3.5698 49.7457 kmem_cache_free
41413 675505 3.2489 52.9946 __slab_free
29911 705416 2.3466 55.3412 copy_from_user
28979 734395 2.2735 57.6146 init_timer
22251 756646 1.7456 59.3602 get_empty_filp
19942 776588 1.5645 60.9247 __call_rcu
18348 794936 1.4394 62.3642 __fput
18328 813264 1.4379 63.8020 alloc_fd
17395 830659 1.3647 65.1667 sys_close
17301 847960 1.3573 66.5240 d_alloc
16570 864530 1.2999 67.8239 inet_create
15522 880052 1.2177 69.0417 alloc_inode
13185 893237 1.0344 70.0761 setup_object
12359 905596 0.9696 71.0456 fd_install
12275 917871 0.9630 72.0086 lock_sock_nested
11924 929795 0.9355 72.9441 release_sock
11790 941585 0.9249 73.8690 sock_init_data
11310 952895 0.8873 74.7563 dput
10924 963819 0.8570 75.6133 drop_file_write_access
10903 974722 0.8554 76.4687 inet_csk_destroy_sock
10184 984906 0.7990 77.2676 inotify_d_instantiate
9372 994278 0.7353 78.0029 local_bh_enable_ip
8901 1003179 0.6983 78.7012 sysenter_do_call
8569 1011748 0.6723 79.3735 iput_single
8194 1019942 0.6428 80.0163 inet_release
This patch serie contains 5 patches, against net-next-2.6 tree
(because this tree already contains network improvement on this
subject, but should apply on other trees)
[PATCH 1/5] fs: Use a percpu_counter to track nr_dentry
Adding a percpu_counter nr_dentry avoids cache line ping pongs
between cpus to maintain this metric, and dcache_lock is
no more needed to protect dentry_stat.nr_dentry
We centralize nr_dentry updates at the right place :
- increments in d_alloc()
- decrements in d_free()
d_alloc() can avoid taking dcache_lock if parent is NULL
(socket8 bench result : 27.5s to 25s)
[PATCH 2/5] fs: Use a percpu_counter to track nr_inodes
Avoids cache line ping pongs between cpus and prepare next patch,
because updates of nr_inodes dont need inode_lock anymore.
(socket8 bench result : no difference at this point)
[PATCH 3/5] fs: Introduce a per_cpu last_ino allocator
new_inode() dirties a contended cache line to get increasing
inode numbers.
Solve this problem by providing to each cpu a per_cpu variable,
feeded by the shared last_ino, but once every 1024 allocations.
This reduce contention on the shared last_ino, and give same
spreading ino numbers than before.
(same wraparound after 2^32 allocations)
(socket8 bench result : no difference)
[PATCH 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd
Sockets, pipes and anonymous fds have interesting properties.
Like other files, they use a dentry and an inode.
But dentries for these kind of files are not hashed into dcache,
since there is no way someone can lookup such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)
Still, allocating and freeing such dentries are expensive processes,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.
This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fd), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.
Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock()
for such dentries.
Differences betwen an SINGLE dentry and a normal one are :
1) SINGLE dentry has the DCACHE_SINGLE flag
2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
This to avoid taking a reference on sb 'root' dentry, shared
by too many dentries.
3) They are not hashed into global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty
(socket8 bench result : from 25s to 19.9s)
[PATCH 5/5] fs: new_inode_single() and iput_single()
Goal of this patch is to not touch inode_lock for socket/pipes/anonfd
inodes allocation/freeing.
SINGLE dentries are attached to inodes that dont need to be linked
in a list of inodes, being "inode_in_use" or "sb->s_inodes"
As inode_lock was taken only to protect these lists, we avoid
taking it as well.
Using iput_single() from dput_single() avoids taking inode_lock
at freeing time.
This patch has a very noticeable effect, because we avoid dirtying of
three contended cache lines in new_inode(), and five cache lines
in iput()
(socket8 bench result : from 19.9s to 2.3s)
Signed-off-by: Eric Dumazet <dada1@xxxxxxxxxxxxx>
---
Overall diffstat :
fs/anon_inodes.c | 18 ------
fs/dcache.c | 100 ++++++++++++++++++++++++++++++--------
fs/fs-writeback.c | 2
fs/inode.c | 101 +++++++++++++++++++++++++++++++--------
fs/pipe.c | 25 +--------
include/linux/dcache.h | 9 +++
include/linux/fs.h | 17 ++++++
kernel/sysctl.c | 6 +-
mm/page-writeback.c | 2
net/socket.c | 26 +---------
10 files changed, 200 insertions(+), 106 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe kernel-testers" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html