One CephFS snapshot kills performance

Hi all!

I’m new to CephFS. My test file system uses a replicated pool on NVMe SSDs for metadata and an erasure-coded pool on HDDs for data. All OSDs use BlueStore.
All daemons run Ceph version 16.2.6 (the cluster was created with this version and is still running it). The Linux kernel I use for mounting CephFS is the one from Debian: Linux file2 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 GNU/Linux
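
For context, this is roughly how a layout like mine can be set up. The pool names, PG counts, EC profile (k=4, m=2) and CRUSH rule below are placeholders, not my exact commands:
------------------------------------------
# replicated metadata pool pinned to the NVMe device class (names/PG counts are examples)
ceph osd crush rule create-replicated nvme-rule default host nvme
ceph osd pool create cephfs_metadata 32 replicated
ceph osd pool set cephfs_metadata crush_rule nvme-rule

# erasure-coded data pool on the HDD device class; EC overwrites must be enabled for CephFS
ceph osd erasure-code-profile set ec-hdd k=4 m=2 crush-device-class=hdd
ceph osd pool create cephfs_data 64 erasure ec-hdd
ceph osd pool set cephfs_data allow_ec_overwrites true

# --force is needed because the default data pool is erasure coded
ceph fs new shares cephfs_metadata cephfs_data --force
------------------------------------------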

Until I created the first snapshot (e.g. `mkdir /mnt/shares/users/.snap/test-01`), the performance of the CephFS (mount point: `/mnt/shares`) seemed fine to me. I first noticed the performance problem while re-syncing a directory with `rsync`, because the re-sync / update took longer than the initial `rsync` run. After multiple days of investigation, I’m rather sure that the performance problem is directly related to snapshots. I first experienced the problem with `rsync`, but it can also be observed with a simple run of `du`. I therefore guess that some sort of "stat" call in combination with snapshots is responsible for the bad performance.

My test folder `/mnt/shares/backup-remote/` contains lots of small files and many hard links spread over many subfolders.

After a restart of the whole cluster and the client, and without a single snapshot in the whole file system, a run of `du` takes 4m 17s. When all the OSD, MON and client caches are warm, the same `du` takes only 12s. After unmounting and mounting the CephFS again, which should empty all the client caches but keep the caches on the OSD and MON side warm, the `du` takes 1m 56s. These runtimes are all perfectly fine for me.

However, if I take a single snapshot in another folder (e.g. `mkdir /mnt/shares/users/.snap/test-01`) that is not even related to the `/mnt/shares/backup-remote/` test folder, the runtime of `du` with cold client caches jumps to 19m 42s. An immediate second run of `du` takes only 12s, but after unmounting and mounting the CephFS it again takes nearly 20 minutes. That is 10 times longer than without a single snapshot. I need to do a bit more testing, but at the moment it looks like every further snapshot adds around 1 minute of additional runtime.
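
For reference, the reproduction boils down to something like the following; the monitor address, client name and secret file are placeholders, not my real values:
------------------------------------------
# mount with the kernel client (addresses/credentials are placeholders)
mount -t ceph 10.0.0.1:6789:/ /mnt/shares \
    -o name=file2,secretfile=/etc/ceph/client.file2.secret

# baseline with cold client caches
time du -sh /mnt/shares/backup-remote/

# take a single snapshot in an unrelated directory
mkdir /mnt/shares/users/.snap/test-01

# drop the client caches (OSD/MON caches stay warm) and measure again
umount /mnt/shares
mount -t ceph 10.0.0.1:6789:/ /mnt/shares \
    -o name=file2,secretfile=/etc/ceph/client.file2.secret
time du -sh /mnt/shares/backup-remote/
------------------------------------------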

During such a run of `du` with a snapshot anywhere in the file system, all the Ceph daemons seem to be bored and the OSDs do hardly any IO. The only thing in the system that looks busy is a kernel worker on the client that mounts the FS and runs `du`: a thread named "kworker/0:1+ceph-msgr" is constantly near 100% CPU usage. The fact that the kernel seems to spend all of this time in a function called `ceph_update_snap_trace` makes me even more confident that the problem is a result of snapshots.
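
For completeness, this is roughly how the busy kworker shows up; `perf top` is just an alternative way to see where the kernel spends its time, in addition to the sysrq backtraces below:
------------------------------------------
# list kernel worker threads and their CPU usage on the client
top -H -b -n 1 | grep kworker

# sample kernel symbols with call graphs while `du` is running
perf top -g
------------------------------------------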

Kernel Stack Trace examples (`echo l > /proc/sysrq-trigger` and `dmesg`)
------------------------------------------
[11316.757494] Call Trace:
[11316.757494]  ceph_queue_cap_snap+0x37/0x4e0 [ceph]
[11316.757496]  ? ceph_put_snap_realm+0x28/0xd0 [ceph]
[11316.757497]  ceph_update_snap_trace+0x3f0/0x4f0 [ceph]
[11316.757498]  dispatch+0x79d/0x1520 [ceph]
[11316.757499]  ceph_con_workfn+0x1a5f/0x2850 [libceph]
[11316.757500]  ? finish_task_switch+0x72/0x250
[11316.757502]  process_one_work+0x1b6/0x350
[11316.757503]  worker_thread+0x53/0x3e0
[11316.757504]  ? process_one_work+0x350/0x350
[11316.757505]  kthread+0x11b/0x140
[11316.757506]  ? __kthread_bind_mask+0x60/0x60
[11316.757507]  ret_from_fork+0x22/0x30
------------------------------------------
[36120.030685] Call Trace:
[36120.030686]  sort_r+0x173/0x210
[36120.030687]  build_snap_context+0x115/0x260 [ceph]
[36120.030688]  rebuild_snap_realms+0x23/0x70 [ceph]
[36120.030689]  rebuild_snap_realms+0x3d/0x70 [ceph]
[36120.030690]  ceph_update_snap_trace+0x2eb/0x4f0 [ceph]
[36120.030691]  dispatch+0x79d/0x1520 [ceph]
[36120.030692]  ceph_con_workfn+0x1a5f/0x2850 [libceph]
[36120.030693]  ? finish_task_switch+0x72/0x250
[36120.030694]  process_one_work+0x1b6/0x350
[36120.030695]  worker_thread+0x53/0x3e0
[36120.030695]  ? process_one_work+0x350/0x350
[36120.030696]  kthread+0x11b/0x140
[36120.030697]  ? __kthread_bind_mask+0x60/0x60
[36120.030698]  ret_from_fork+0x22/0x30
[36120.030960] NMI backtrace for cpu 3 skipped: idling at native_safe_halt+0xe/0x10
------------------------------------------

Deleting all snapshots does not restore the original performance. Only after recursively copying the whole `backup-remote` folder (with rsync) to a new location and running `du` on that new folder is the performance as it was before taking the first snapshot.
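
The workaround, roughly (the target path is just an example; `-H` preserves the hard links in my backup tree):
------------------------------------------
# removing the snapshot(s) alone does not restore the performance
rmdir /mnt/shares/users/.snap/test-01

# only a fresh recursive copy of the affected folder helps
rsync -aH /mnt/shares/backup-remote/ /mnt/shares/backup-remote-new/
time du -sh /mnt/shares/backup-remote-new/
------------------------------------------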

Related issue reports I have found:
* https://tracker.ceph.com/issues/44100?next_issue_id=44099
* https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/IDMLNQMFGTJRR5QXFZ2YAYPN67UZH4Q4/


I would be very interested in an explanation for this behaviour. Of course, I would also be very thankful for a solution to the problem or any advice that could help.


Thanks in advance.

Best wishes,
Sebastian
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



