Re: One cephFS snapshot kills performance

Hi,

On 11/5/21 01:36, Sebastian Mazza wrote:


However, if I take a single snapshot in another folder (e.g. `mkdir /mnt/shares/users/.snap/test-01`) that is not even related to the `/mnt/shares/backup-remote/` test folder, the runtime of `du` with cold client caches jumps to 19m 42s. An immediate second run of `du` takes only 12s, but after unmounting and remounting the cephFS it again takes nearly 20 minutes. That is 10 times longer than without a single snapshot. I need to do a bit more testing, but at the moment it looks like every further snapshot adds around 1 minute of additional runtime.
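A minimal sketch of the test, assuming the kernel client is mounted at /mnt/shares (the monitor address, mount options and exact `du` invocation below are illustrative placeholders, not necessarily what was actually used):

    # create a single snapshot in an unrelated directory
    mkdir /mnt/shares/users/.snap/test-01

    # cold-cache run: remount to drop the kernel client caches, then time du
    umount /mnt/shares
    mount -t ceph <mon-host>:/ /mnt/shares -o name=admin
    time du -sh /mnt/shares/backup-remote/

    # warm-cache run immediately afterwards
    time du -sh /mnt/shares/backup-remote/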

During such a run of `du` with a snapshot anywhere in the file system, all the Ceph daemons seem to be bored, and the OSDs do hardly any IO. The only thing in the system that I can find that looks busy is a kernel worker of the client that mounts the FS and runs `du`. A process named "kworker/0:1+ceph-msgr" is constantly near 100% CPU usage. The fact that the kernel seems to spend all of its time in a function called "ceph_update_snap_trace" makes me even more confident that the problem is a result of snapshots.
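One possible way to check where such a kworker spends its time (a sketch only, not necessarily the method used above; <PID> is a placeholder, and reading /proc/<PID>/stack needs root):

    # find the busy kworker thread and its PID
    ps -eo pid,%cpu,comm --sort=-%cpu | head

    # sample its kernel stack a few times; ceph_update_snap_trace shows up there
    cat /proc/<PID>/stack

    # or profile it for a few seconds with perf, if available
    perf top -p <PID>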

Your report looks very much like behavior we have on a backup system (also rsync) on a Nautilus cluster (upgraded from Luminous), with many small files in the fs. We could not reproduce the issue on a separate cluster with identical data. However, we see this behavior without any snapshots: no snapshots have ever been made on this CephFS. Even though there are no snapshots, it still spends a lot of time on "snap" tasks, weirdly enough. Although the problem might get worse with more snapshots, having (a) snapshot(s) or not does not seem to be a requirement per se. It might point in the right direction ...

<snip>


I would be very interested in an explanation for this behaviour. Of course, I would also be very thankful for a solution to the problem or any advice that could help.

No solution, but good to know there are more workloads out there that hit this issue. If there are any CephFS devs interested in investigating this issue we are more than happy to provide more info.

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



