Hi Dmitry,
On Thu, Nov 26, 2020 at 10:44 AM Dmitry Antipov <dmantipov@xxxxxxxxx> wrote:
BTW, did anyone try to profile the brick process? I did, and got this
for the default replica 3 volume ('perf record -F 2500 -g -p [PID]'):
+ 3.29% 0.02% glfs_epoll001 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 3.17% 0.01% glfs_epoll001 [kernel.kallsyms] [k] do_syscall_64
+ 3.17% 0.02% glfs_epoll000 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 3.06% 0.02% glfs_epoll000 [kernel.kallsyms] [k] do_syscall_64
+ 2.75% 0.01% glfs_iotwr00f [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.74% 0.01% glfs_iotwr00b [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.74% 0.01% glfs_iotwr001 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.73% 0.00% glfs_iotwr003 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.72% 0.00% glfs_iotwr000 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.72% 0.01% glfs_iotwr00c [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.70% 0.01% glfs_iotwr003 [kernel.kallsyms] [k] do_syscall_64
+ 2.69% 0.00% glfs_iotwr001 [kernel.kallsyms] [k] do_syscall_64
+ 2.69% 0.01% glfs_iotwr008 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.68% 0.00% glfs_iotwr00b [kernel.kallsyms] [k] do_syscall_64
+ 2.68% 0.00% glfs_iotwr00c [kernel.kallsyms] [k] do_syscall_64
+ 2.68% 0.00% glfs_iotwr00f [kernel.kallsyms] [k] do_syscall_64
+ 2.68% 0.01% glfs_iotwr000 [kernel.kallsyms] [k] do_syscall_64
+ 2.67% 0.00% glfs_iotwr00a [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.65% 0.00% glfs_iotwr008 [kernel.kallsyms] [k] do_syscall_64
+ 2.64% 0.00% glfs_iotwr00e [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.64% 0.01% glfs_iotwr00d [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.63% 0.01% glfs_iotwr00a [kernel.kallsyms] [k] do_syscall_64
+ 2.63% 0.01% glfs_iotwr007 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.63% 0.00% glfs_iotwr005 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.63% 0.01% glfs_iotwr006 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.63% 0.00% glfs_iotwr009 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.61% 0.01% glfs_iotwr004 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.61% 0.01% glfs_iotwr00e [kernel.kallsyms] [k] do_syscall_64
+ 2.60% 0.00% glfs_iotwr006 [kernel.kallsyms] [k] do_syscall_64
+ 2.59% 0.00% glfs_iotwr005 [kernel.kallsyms] [k] do_syscall_64
+ 2.59% 0.00% glfs_iotwr00d [kernel.kallsyms] [k] do_syscall_64
+ 2.58% 0.00% glfs_iotwr002 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.58% 0.01% glfs_iotwr007 [kernel.kallsyms] [k] do_syscall_64
+ 2.58% 0.00% glfs_iotwr004 [kernel.kallsyms] [k] do_syscall_64
+ 2.57% 0.00% glfs_iotwr009 [kernel.kallsyms] [k] do_syscall_64
+ 2.54% 0.00% glfs_iotwr002 [kernel.kallsyms] [k] do_syscall_64
+ 1.65% 0.00% glfs_epoll000 [unknown] [k] 0x0000000000000001
+ 1.65% 0.00% glfs_epoll001 [unknown] [k] 0x0000000000000001
+ 1.48% 0.01% glfs_rpcrqhnd [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 1.44% 0.08% glfs_rpcrqhnd libpthread-2.32.so [.] pthread_cond_wait@@GLIBC_2.3.2
+ 1.40% 0.01% glfs_rpcrqhnd [kernel.kallsyms] [k] do_syscall_64
+ 1.36% 0.01% glfs_rpcrqhnd [kernel.kallsyms] [k] __x64_sys_futex
+ 1.35% 0.03% glfs_rpcrqhnd [kernel.kallsyms] [k] do_futex
+ 1.34% 0.01% glfs_iotwr00a libpthread-2.32.so [.] __libc_pwrite64
+ 1.32% 0.00% glfs_iotwr00a [kernel.kallsyms] [k] __x64_sys_pwrite64
+ 1.32% 0.00% glfs_iotwr001 libpthread-2.32.so [.] __libc_pwrite64
+ 1.31% 0.01% glfs_iotwr002 libpthread-2.32.so [.] __libc_pwrite64
+ 1.31% 0.00% glfs_iotwr00b libpthread-2.32.so [.] __libc_pwrite64
+ 1.31% 0.01% glfs_iotwr00a [kernel.kallsyms] [k] vfs_write
+ 1.30% 0.00% glfs_iotwr001 [kernel.kallsyms] [k] __x64_sys_pwrite64
+ 1.30% 0.00% glfs_iotwr008 libpthread-2.32.so [.] __libc_pwrite64
+ 1.30% 0.00% glfs_iotwr00a [kernel.kallsyms] [k] new_sync_write
+ 1.30% 0.00% glfs_iotwr00c libpthread-2.32.so [.] __libc_pwrite64
+ 1.29% 0.00% glfs_iotwr00a [kernel.kallsyms] [k] xfs_file_write_iter
+ 1.29% 0.01% glfs_iotwr00a [kernel.kallsyms] [k] xfs_file_dio_aio_write
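For anyone who wants to reproduce this, the recipe is roughly the following
('myvol' is just a placeholder for the volume name):

  # the brick PID is shown in the Pid column of the status output
  gluster volume status myvol
  # (or simply: pgrep -f glusterfsd, if there is only one brick on the host)

  # attach to the brick while the workload runs, stop with Ctrl-C
  perf record -F 2500 -g -p [PID]

  # browse the recorded call graphs in text form
  perf report --stdio | less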
And on replica 3 with storage.linux-aio enabled:
+ 11.76% 0.05% glfs_posixaio [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 11.42% 0.01% glfs_posixaio [kernel.kallsyms] [k] do_syscall_64
+ 8.81% 0.00% glfs_posixaio [unknown] [k] 0x00000000baadf00d
+ 8.81% 0.00% glfs_posixaio [unknown] [k] 0x0000000000000004
+ 8.74% 0.06% glfs_posixaio libc-2.32.so [.] __GI___writev
+ 8.33% 0.02% glfs_posixaio [kernel.kallsyms] [k] do_writev
+ 8.23% 0.03% glfs_posixaio [kernel.kallsyms] [k] vfs_writev
+ 8.12% 0.05% glfs_posixaio [kernel.kallsyms] [k] do_iter_write
+ 8.02% 0.05% glfs_posixaio [kernel.kallsyms] [k] do_iter_readv_writev
+ 7.96% 0.04% glfs_posixaio [kernel.kallsyms] [k] sock_write_iter
+ 7.92% 0.01% glfs_posixaio [kernel.kallsyms] [k] sock_sendmsg
+ 7.86% 0.01% glfs_posixaio [kernel.kallsyms] [k] tcp_sendmsg
+ 7.28% 0.15% glfs_posixaio [kernel.kallsyms] [k] tcp_sendmsg_locked
+ 6.49% 0.01% glfs_posixaio [kernel.kallsyms] [k] __tcp_push_pending_frames
+ 6.48% 0.10% glfs_posixaio [kernel.kallsyms] [k] tcp_write_xmit
+ 6.31% 0.02% glfs_posixaio [unknown] [k] 0000000000000000
+ 6.05% 0.13% glfs_posixaio [kernel.kallsyms] [k] __tcp_transmit_skb
+ 5.71% 0.06% glfs_posixaio [kernel.kallsyms] [k] __ip_queue_xmit
+ 4.15% 0.03% glfs_rpcrqhnd [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 4.07% 0.08% glfs_posixaio [kernel.kallsyms] [k] ip_finish_output2
+ 3.75% 0.02% glfs_posixaio [kernel.kallsyms] [k] asm_call_sysvec_on_stack
+ 3.75% 0.01% glfs_rpcrqhnd [kernel.kallsyms] [k] do_syscall_64
+ 3.70% 0.03% glfs_rpcrqhnd [kernel.kallsyms] [k] __x64_sys_futex
+ 3.68% 0.06% glfs_posixaio [kernel.kallsyms] [k] __local_bh_enable_ip
+ 3.67% 0.07% glfs_rpcrqhnd [kernel.kallsyms] [k] do_futex
+ 3.62% 0.05% glfs_posixaio [kernel.kallsyms] [k] do_softirq
+ 3.61% 0.01% glfs_posixaio [kernel.kallsyms] [k] do_softirq_own_stack
+ 3.59% 0.06% glfs_posixaio [kernel.kallsyms] [k] __softirqentry_text_start
+ 3.44% 0.06% glfs_posixaio [kernel.kallsyms] [k] net_rx_action
+ 3.34% 0.04% glfs_posixaio [kernel.kallsyms] [k] process_backlog
+ 3.28% 0.02% glfs_posixaio [kernel.kallsyms] [k] __netif_receive_skb_one_core
+ 3.08% 0.02% glfs_epoll000 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 3.02% 0.03% glfs_epoll001 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 2.97% 0.01% glfs_epoll000 [kernel.kallsyms] [k] do_syscall_64
+ 2.89% 0.01% glfs_epoll001 [kernel.kallsyms] [k] do_syscall_64
+ 2.73% 0.08% glfs_posixaio [kernel.kallsyms] [k] nf_hook_slow
+ 2.25% 0.04% glfs_posixaio libc-2.32.so [.] fgetxattr
+ 2.16% 0.14% glfs_rpcrqhnd [kernel.kallsyms] [k] futex_wake
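The second run is the same volume with storage.linux-aio turned on before
repeating the same capture, e.g.:

  # switch the storage/posix translator to Linux AIO for this volume
  gluster volume set myvol storage.linux-aio on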
According to these profiles, the brick process is essentially a thin wrapper around
the system calls and the kernel network subsystem behind them.
Mostly. However, there is one issue that isn't so obvious in this perf capture but that we have identified in other setups: when the system calls are processed very fast (as should be the case when NVMe is used), the io-threads thread pool is constantly processing the request queue. This queue is currently synchronized with a mutex, and the small latency per request makes contention on that mutex quite high. This means that the thread pool tends to be serialized by the lock, which kills most of the parallelism and also causes a lot of additional system calls (increased CPU utilization and higher latencies).
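If you want to check whether this is happening on your setup, a rough indication (assuming your kernel exposes syscall tracepoints) is to watch futex calls and context switches on the brick process while the workload runs:

  # lots of futex calls and context switches on the brick usually
  # point at contention on the io-threads queue lock
  perf stat -e context-switches,syscalls:sys_enter_futex -p [PID] -- sleep 30

High numbers here, combined with mostly idle disks, are a hint that the lock rather than the storage is the bottleneck.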
For now, the only way I know to minimize this effect is to reduce the number of threads in the io-threads pool. It's hard to say what a good number would be; it depends on many things. But you can run some tests with different values to find the best one (after changing the number of threads, it's better to restart the volume).
Reducing the number of threads reduces the CPU power that Gluster can use, but it also reduces the contention, so it's expected (though not guaranteed) that at some point, even with fewer threads, performance could be a bit better.
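For example, something like this can be used to experiment with the pool size (performance.io-thread-count is the option that controls it; 'myvol' is a placeholder):

  # try a smaller io-threads pool, e.g. 8 threads
  gluster volume set myvol performance.io-thread-count 8

  # restart the volume so the bricks pick up the new thread count
  gluster volume stop myvol
  gluster volume start myvol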
Regards,
Xavi
For anyone interested, the following replica 3 volume options:
performance.io-cache-pass-through: on
performance.iot-pass-through: on
performance.md-cache-pass-through: on
performance.nl-cache-pass-through: on
performance.open-behind-pass-through: on
performance.read-ahead-pass-through: on
performance.readdir-ahead-pass-through: on
performance.strict-o-direct: on
features.ctime: off
features.selinux: off
performance.write-behind: off
performance.open-behind: off
performance.quick-read: off
storage.linux-aio: on
storage.fips-mode-rchecksum: off
are likely to improve the I/O performance of GFAPI clients (fio with the gfapi and gfapi_async
engines, qemu -drive file=gluster://XXX, etc.) by ~20%. But beware that they may kill the I/O
performance of FUSE clients.
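If you want to try them, each option is applied with the usual volume set command, following the same pattern for every entry in the list above ('myvol' is a placeholder):

  # example for two of the options; the rest follow the same form
  gluster volume set myvol performance.iot-pass-through on
  gluster volume set myvol storage.linux-aio on

  # verify what the volume ended up with
  gluster volume info myvol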
Dmitry
________
Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users