CephFS Snapshot Mirroring slow due to repeating attribute sync

Dear Ceph developers and users,

We are using ceph version 17.2.1 
(ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable).
We have been using cephadm since version 15 (Octopus).

We mirror several CephFS directories from our main cluster to a 
second mirror cluster.
Particularly with bigger directories (over 900 TB and 186 million 
files), we noticed that mirroring is very slow.
On the mirror, most of the time we only observe a write speed of 0 to 
10 MB/s in the client IO.
The target peer directory often doesn't show an increase in size during 
synchronization (when we check with: getfattr -n ceph.dir.rbytes).
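
For example, we repeatedly query the recursive byte count of the sync 
target on a client mount of the mirror cluster, roughly like this (the 
mount point is just illustrative):

watch -n 60 getfattr -n ceph.dir.rbytes --only-values /mnt/mirror-fs/irods/sodar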

The status of the syncs is always fine, i.e. syncing and not failing:

0|0[root@osd-1 /var/run/ceph/55633ec3-6c0c-4a02-990c-0f87e0f7a01f]# ceph 
--admin-daemon 
ceph-client.cephfs-mirror.osd-1.ydsqsw.7.94552861013544.asok fs mirror 
peer status cephfs@1 c66afb80-593f-4c42-a120-dd3b6fca26bc
{
     "/irods/sodar": {
         "state": "syncing",
         "current_sycning_snap": {
             "id": 7552,
             "name": "scheduled-2022-08-22-13_00_00"
         },
         "last_synced_snap": {
             "id": 7548,
             "name": "scheduled-2022-08-22-12_00_00",
             "sync_duration": 37828.164744490001,
             "sync_time_stamp": "13240678.542916s"
         },
         "snaps_synced": 1,
         "snaps_deleted": 11,
         "snaps_renamed": 0
     }
}

The cluster nodes (6 per cluster) are connected to the switches with 
dual 40G NICs.
The connection between the switches is 2x 100G.
Simple write operations from other clients to the mirror CephFS result 
in writes of e.g. 300 to 400 MB/s.
So the network doesn't seem to be the issue here.

We started digging into the debug logs of the cephfs-mirror daemon / 
docker container.
We set the debug level to 20; otherwise there are no messages at all 
(so no errors).
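
For reference, this is roughly how we raised the log level via the 
daemon's admin socket (using the socket name from the status command 
below; debug_cephfs_mirror is the log subsystem of the mirror daemon):

ceph --admin-daemon ceph-client.cephfs-mirror.osd-1.ydsqsw.7.94552861013544.asok config set debug_cephfs_mirror 20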

We observed a lot of messages with "need_data_sync=0, need_attr_sync=1", 
leading us to the assumption that, instead of actual data, mostly 
attributes are being synced.

We started looking at specific examples in the logs and tried to make 
sense of the source code to understand which steps are happening.
Most of the messages come from cephfs::mirror::PeerReplayer:
https://github.com/ceph/ceph/blob/6fee777d603aebce492c57b41f3b5760d50ddb07/src/tools/cephfs_mirror/PeerReplayer.cc

We figured out that the do_synchronize function uses should_sync_entry 
to check whether data (need_data_sync) or attributes (need_attr_sync) 
need to be synchronized, and if necessary performs the sync using 
remote_file_op.

should_sync_entry reports different ctimes for our examples, e.g.:
local cur statx: mode=33152, uid=996, gid=993, size=154701172, 
ctime=2022-01-28T12:54:21.176004+0000, ...
local prev statx: mode=33152, uid=996, gid=993, size=154701172, 
ctime=2022-08-22T11:03:18.578380+0000, ...

Based on these different ctimes, should_sync_entry then decides that 
attributes need to be synced:
*need_attr_sync = (cstx.stx_ctime != pstx.stx_ctime)
https://github.com/ceph/ceph/blob/6fee777d603aebce492c57b41f3b5760d50ddb07/src/tools/cephfs_mirror/PeerReplayer.cc#L911

We assume cur statx/cstx refers to the file in the snapshot currently 
being mirrored.
But what exactly is prev statx/pstx? Is it the peer path or the last 
snapshot on the mirror peer?

We can confirm that the ctimes are different on the main cluster and 
the mirror.
On the main cluster, the ctimes are consistent in every snapshot (since 
the files didn't change).
On the mirror, the ctimes increase with every snapshot towards more 
current dates.
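
We checked this by statting the same file in two adjacent snapshots on 
a client mount of each cluster, roughly like this (mount points and 
file path are just illustrative):

stat -c 'ctime=%z' /mnt/main-fs/irods/sodar/.snap/scheduled-2022-08-22-12_00_00/some/file
stat -c 'ctime=%z' /mnt/main-fs/irods/sodar/.snap/scheduled-2022-08-22-13_00_00/some/file
stat -c 'ctime=%z' /mnt/mirror-fs/irods/sodar/.snap/scheduled-2022-08-22-12_00_00/some/file
stat -c 'ctime=%z' /mnt/mirror-fs/irods/sodar/.snap/scheduled-2022-08-22-13_00_00/some/file

On the main cluster the two values are identical; on the mirror the 
second one is newer.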

Given that the CephFS mirror daemon writes the data to the mirror as a 
CephFS client, it seems to make sense that data on the mirror has 
different / more recent ctimes (from writing).
Also, when the mirror daemon is syncing the attributes to the mirror, 
wouldn't this trigger a new/current ctime as well?
So our assumption is that syncing an old ctime will actually result in 
a new ctime, and thus trigger the sync of attributes over and over 
(at least with every snapshot synced).
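
This effect is easy to see with any scratch file on a client mount of 
the mirror (path is hypothetical): a client can set mtime, ownership or 
mode, but every such change bumps the ctime to "now", so the old ctime 
from the main cluster can never be reproduced:

touch -d '2022-01-28T12:54:21' /mnt/mirror-fs/scratch/testfile
stat -c 'mtime=%y ctime=%z' /mnt/mirror-fs/scratch/testfile

The mtime is set back to the old date, but the ctime is the current 
time.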

So is ctime the proper parameter to test whether attributes need to be 
synced, or should it rather be excluded?
In other words, is this check the right thing to do: *need_attr_sync = 
(cstx.stx_ctime != pstx.stx_ctime)

Is it reasonable to assume that these attribute syncs are responsible 
for our slow mirroring?
Or is there anything else we should look out for?

And are there actually commands or logs showing us the speed of the 
mirroring?
We only know about sync_duration and sync_time_stamp (as in the status 
above).
But then, how can we actually determine the size of a snapshot or the 
difference between two snapshots, so that one can make speed 
calculations for the latest sync?
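
One rough measure we can think of is comparing the recursive byte 
counts of two adjacent snapshots on the primary, assuming the 
ceph.dir.rbytes vxattr is also readable inside .snap directories 
(mount point again illustrative):

getfattr -n ceph.dir.rbytes --only-values /mnt/main-fs/irods/sodar/.snap/scheduled-2022-08-22-12_00_00
getfattr -n ceph.dir.rbytes --only-values /mnt/main-fs/irods/sodar/.snap/scheduled-2022-08-22-13_00_00

But the difference only reflects net growth, not data rewritten in 
place, so it is at best a lower bound.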

What is your general experience with mirroring performance, in 
particular with bigger CephFS directories approaching petabytes?

Mirroring (backing up) our data is a really crucial issue for us (and 
certainly many others).
So we are looking forward to your input. Thanks a lot in advance.

Best Wishes,
Mathias