> On Dec 3, 2015, at 2:53 PM, Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
>
> Hi Chuck,
>
> On Thu, Dec 3, 2015 at 10:31 AM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>> Hi-
>>
>> I’m looking into NFS WRITE latency on the Linux NFS server.
>> With a high speed network fabric and ultra-fast storage (think
>> persistent memory) we can get the per-WRITE latency under 50us
>> (as observed on the wire at the server).
>>
>> One source of the remaining latency appears to be nfsd_open.
>> Here is an example of a call graph captured during a simple
>> iozone 4KB direct write workload (v4.4-rc3):
>>
>>  2)               |  nfsd_open [nfsd]() {
>>  2)               |    fh_verify [nfsd]() {
>>  2)               |      rqst_exp_find [nfsd]() {
>>  2)               |        exp_find [nfsd]() {
>>  2)               |          exp_find_key [nfsd]() {
>>  2)   0.043 us    |            svc_expkey_hash [nfsd]();
>>  2)               |            sunrpc_cache_lookup [sunrpc]() {
>>  2)   0.027 us    |              _raw_read_lock();
>>  2)               |              getboottime64() {
>>  2)   0.022 us    |                ns_to_timespec();
>>  2)   0.210 us    |              }
>>  2)   0.021 us    |              get_seconds();
>>  2)   0.854 us    |            }
>>  2)               |            cache_check [sunrpc]() {
>>  2)               |              getboottime64() {
>>  2)   0.021 us    |                ns_to_timespec();
>>  2)   0.206 us    |              }
>>  2)   0.022 us    |              get_seconds();
>>  2)   0.596 us    |            }
>>  2)   2.050 us    |          }
>>  2)               |          exp_get_by_name [nfsd]() {
>>  2)               |            svc_export_lookup [nfsd]() {
>>  2)               |              sunrpc_cache_lookup [sunrpc]() {
>>  2)   0.022 us    |                _raw_read_lock();
>>  2)   0.024 us    |                svc_export_match [nfsd]();
>>  2)               |                getboottime64() {
>>  2)   0.023 us    |                  ns_to_timespec();
>>  2)   0.210 us    |                }
>>  2)   0.022 us    |                get_seconds();
>>  2)   0.997 us    |              }
>>  2)   1.181 us    |            }
>>  2)               |            cache_check [sunrpc]() {
>>  2)               |              getboottime64() {
>>  2)   0.022 us    |                ns_to_timespec();
>>  2)   0.210 us    |              }
>>  2)   0.022 us    |              get_seconds();
>>  2)   0.592 us    |            }
>>  2)   2.147 us    |          }
>>  2)   4.560 us    |        }
>>  2)   4.768 us    |      }
>>  2)               |      prepare_creds() {
>>  2)               |        kmem_cache_alloc() {
>>  2)   0.021 us    |          _cond_resched();
>>  2)   0.233 us    |        }
>>  2)               |        security_prepare_creds() {
>>  2)               |          selinux_cred_prepare() {
>>  2)               |            kmemdup() {
>>  2)               |              __kmalloc_track_caller() {
>>  2)   0.022 us    |                kmalloc_slab();
>>  2)   0.023 us    |                _cond_resched();
>>  2)   0.448 us    |              }
>>  2)   0.640 us    |            }
>>  2)   0.838 us    |          }
>>  2)   1.129 us    |        }
>>  2)   1.808 us    |      }
>>  2)   0.022 us    |      override_creds();
>>  2)               |      exportfs_decode_fh() {
>>  2)               |        shmem_fh_to_dentry() {
>>  2)               |          ilookup5() {
>>  2)               |            ilookup5_nowait() {
>>  2)   0.026 us    |              _raw_spin_lock();
>>  2)               |              find_inode() {
>>  2)   0.031 us    |                shmem_match();
>>  2)   0.029 us    |                _raw_spin_lock();
>>  2)   0.429 us    |              }
>>  2)   0.826 us    |            }
>>  2)   0.022 us    |            _cond_resched();
>>  2)   0.021 us    |            _cond_resched();
>>  2)   1.417 us    |          }
>>  2)               |          d_find_alias() {
>>  2)   0.022 us    |            _raw_spin_lock();
>>  2)   0.027 us    |            _raw_spin_lock();
>>  2)   0.423 us    |          }
>>  2)   0.035 us    |          iput();
>>  2)   2.410 us    |        }
>>  2)               |        find_acceptable_alias() {
>>  2)   0.026 us    |          nfsd_acceptable [nfsd]();
>>  2)   0.220 us    |        }
>>  2)   3.032 us    |      }
>>  2)               |      nfsd_setuser_and_check_port [nfsd]() {
>>  2)   0.023 us    |        nfsexp_flags [nfsd]();
>>  2)               |        nfsd_setuser [nfsd]() {
>>  2)               |          revert_creds() {
>>  2)               |            __put_cred() {
>>  2)               |              call_rcu_sched() {
>>  2)               |                __call_rcu() {
>>  2)   0.049 us    |                  __call_rcu_nocb_enqueue();
>>  2)   0.265 us    |                }
>>  2)   0.446 us    |              }
>>  2)   0.661 us    |            }
>>  2)   0.854 us    |          }
>>  2)               |          prepare_creds() {
>>  2)               |            kmem_cache_alloc() {
>>  2)   0.022 us    |              _cond_resched();
>>  2)   0.219 us    |            }
>>  2)               |            security_prepare_creds() {
>>  2)               |              selinux_cred_prepare() {
>>  2)               |                kmemdup() {
>>  2)               |                  __kmalloc_track_caller() {
>>  2)   0.024 us    |                    kmalloc_slab();
>>  2)   0.022 us    |                    _cond_resched();
>>  2)   0.412 us    |                  }
>>  2)   0.615 us    |                }
>>  2)   0.801 us    |              }
>>  2)   0.993 us    |            }
>>  2)   1.651 us    |          }
>>  2)   0.033 us    |          set_groups();
>>  2)   0.023 us    |          override_creds();
>>  2)   3.323 us    |        }
>>  2)   0.029 us    |        nfserrno [nfsd]();
>>  2)   3.919 us    |      }
>>  2)   0.025 us    |      check_nfsd_access [nfsd]();
>>  2)   0.022 us    |      nfsd_permission [nfsd]();
>>  2) + 14.904 us   |    }
>>
>> So, 15us for fh_verify by itself.
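[Editor's note: a per-function total like the 15us above can be cross-checked by summing the leaf durations in a function_graph dump. The helper below is a hypothetical illustration, not part of any kernel tooling; it only assumes the standard function_graph line format shown in this thread.]

```python
import re
from collections import defaultdict

# Matches function_graph lines that carry a duration, e.g.
#  " 2)   0.043 us    |  svc_expkey_hash [nfsd]();"
# Entry lines ("func() {") have no duration and are skipped.
LINE_RE = re.compile(r"^\s*\d+\)\s*\+?\s*([\d.]+)\s*us\s*\|\s*(.+)$")

def leaf_durations(trace):
    """Sum reported durations of leaf calls (lines ending in ';')."""
    totals = defaultdict(float)
    for line in trace.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        duration, body = float(m.group(1)), m.group(2).strip()
        if body.endswith(";"):                    # leaf call, not a "}" close
            func = body.split("(")[0].split()[0]  # drop " [module]();" suffix
            totals[func] += duration
    return dict(totals)

sample = """\
 2)   0.043 us    |  svc_expkey_hash [nfsd]();
 2)   0.027 us    |  _raw_read_lock();
 2)   0.022 us    |  _raw_read_lock();
 2)   0.854 us    |  }
"""
print(leaf_durations(sample))
```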
>>
>>  2)               |    dentry_open() {
>>  2)               |      get_empty_filp() {
>>  2)               |        kmem_cache_alloc() {
>>  2)   0.021 us    |          _cond_resched();
>>  2)   0.325 us    |        }
>>  2)               |        security_file_alloc() {
>>  2)               |          selinux_file_alloc_security() {
>>  2)               |            kmem_cache_alloc() {
>>  2)   0.022 us    |              _cond_resched();
>>  2)   0.251 us    |            }
>>  2)   0.441 us    |          }
>>  2)   0.633 us    |        }
>>  2)   0.022 us    |        __mutex_init();
>>  2)   1.572 us    |      }
>>  2)               |      vfs_open() {
>>  2)               |        do_dentry_open() {
>>  2)               |          path_get() {
>>  2)   0.025 us    |            mntget();
>>  2)   0.313 us    |          }
>>  2)   0.029 us    |          try_module_get();
>>  2)               |          security_file_open() {
>>  2)               |            selinux_file_open() {
>>  2)   0.024 us    |              avc_policy_seqno();
>>  2)               |              inode_has_perm.isra.21() {
>>  2)   0.148 us    |                avc_has_perm();
>>  2)   0.332 us    |              }
>>  2)   0.728 us    |            }
>>  2)   0.023 us    |            __fsnotify_parent();
>>  2)   0.034 us    |            fsnotify();
>>  2)   1.359 us    |          }
>>  2)   0.045 us    |          file_ra_state_init();
>>  2)   2.548 us    |        }
>>  2)   2.733 us    |      }
>>  2)   4.667 us    |    }
>>  2)               |    ima_file_check() {
>>  2)   0.028 us    |      process_measurement();
>>  2)   0.211 us    |    }
>>  2)   0.023 us    |    nfserrno [nfsd]();
>>  2) + 20.549 us   |  }
>>
>> And that’s 20us total for this nfsd_open. Seems like a lot of
>> work for each NFSv3 WRITE operation, and plenty of opportunities
>> for lock contention if my server happened to be otherwise busy.
>>
>> Note these figures are a little inflated because ftrace itself
>> adds some latency.
>>
>> Anyway, making fh_verify faster could have a significant
>> impact on per-op latency with fast storage. The recently-proposed
>> NFSD open cache by itself could eliminate the need to perform
>> fh_verify so often, for example.
>>
>
> Have you compared with Jeff's patchset? It would be very interesting
> to see how that affects your numbers.

Bruce asked me to post what I had so far, to help justify adding the
NFSD open cache, which has been tabled. Do you know where Jeff keeps
this work, and is it merged up to 4.4-rc?
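[Editor's note: the open cache discussed above amounts to memoizing the expensive open path by filehandle, so the fh_verify/dentry_open cost is paid once per file rather than once per WRITE. The sketch below is a toy model of that idea only; all names are hypothetical, and the real kernel patchset must also handle reference counting, invalidation, and garbage collection.]

```python
# Toy model of a filehandle -> open-file cache (hypothetical names).
class OpenCache:
    def __init__(self, opener):
        self._opener = opener   # the slow path: full lookup + open
        self._cache = {}        # filehandle -> previously opened file
        self.opens = 0          # how many times the slow path ran

    def get(self, fh):
        f = self._cache.get(fh)
        if f is None:
            self.opens += 1     # pay the open cost once per filehandle
            f = self._opener(fh)
            self._cache[fh] = f
        return f                # later WRITEs reuse the cached open

# 100 WRITEs to the same filehandle trigger a single expensive open.
cache = OpenCache(opener=lambda fh: object())
for _ in range(100):
    cache.get(b"\x01\x02\x03")
assert cache.opens == 1
```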
--
Chuck Lever