On 12/3/20 5:46 PM, Jeff Layton wrote:
On Thu, 2020-12-03 at 12:01 +0100, Stefan Kooman wrote:
Hi,
We have a CephFS Linux kernel client (5.4.0-53-generic) workload (rsync) that
seems to be limited by a single ceph-msgr thread consuming close to 100%
CPU. We would like to investigate what this thread is so busy with.
What would be the easiest way to do that? On a related note: what would
be the best way to scale CephFS client performance for a single process
(if that is possible at all)?
Thanks for any pointers.
Usually kernel profiling (e.g. with perf) is the way to go about this. You
may also want to try a more recent kernel and see whether it fares any
better. With a new enough MDS and kernel, you can additionally enable async
creates and see whether that helps performance.
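For reference, one way to profile a single busy kworker is to attach perf to it directly; a minimal sketch (`<PID>` is a placeholder for the thread's PID, perf needs root, and useful symbol resolution requires kernel symbols to be readable):

```
# Find the busy kworker's PID by its thread name (e.g. kworker/4:1)
ps -eLo pid,comm | grep kworker

# Sample its kernel stacks with call graphs for 30 seconds, then summarize
perf record -g -p <PID> -- sleep 30
perf report

# Or watch the hottest functions live
perf top -p <PID>
```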
The thread is mostly busy with "build_snap_context":
  + 94.39% 94.23% kworker/4:1-cep [kernel.kallsyms] [k] build_snap_context
Do I understand correctly that this code checks for any potential
snapshots? Grepping through the kernel CephFS code gives a hit in snap.c.
Our CephFS filesystem was created on Luminous and upgraded through
Mimic to Nautilus. We have never enabled snapshot support (ceph fs set
cephfs allow_new_snaps true), but the filesystem does seem to support it
(.snap dirs are present). The data rsync is processing contains a lot of
directories, which might explain the amount of time spent in this code path.
Would that be a plausible explanation?
Thanks,
Stefan