We are transferring data (300 million small files) with rsync between two CephFS clusters, from version 12.2.13 to 18.2.1. After roughly the same amount of time on each run (about 7 hours in this case), copying stalls for about a minute and the cluster reports:
```
  health: HEALTH_WARN
          1 clients failing to advance oldest client/flush tid
          1 MDSs report slow metadata IOs
          1 MDSs behind on trimming
```
The only interesting messages in the logs are:
```
ceph-mds[640738]: mds.beacon.cephfs.X missed beacon ack from the monitors
```
I watched the client-side (destination) debug metrics via watch -n1 on
/sys/kernel/debug/ceph/451eea44-d7a0-11ee-9117-b496914b4c02.client32497:
```
item                          total
------------------------------------------
opened files  / total inodes  1 / 866757
pinned i_caps / total inodes  866757 / 866757
opened inodes / total inodes  1 / 866757

item      total     avg_lat(us)  min_lat(us)  max_lat(us)  stdev(us)
---------------------------------------------------------------------
read      0         0            0            0            0
write     5129252   194689       9080         54693415     1023
metadata  29045143  670          161          87794124     369

item      total     miss      hit
-------------------------------------------------
d_lease   5361      3026      315488442
caps      866757    11        483177478
```
During the copy, "pinned i_caps / total inodes" gradually increases until it
reaches mds_max_caps_per_client (default: 1Mi). Then "pinned i_caps / total
inodes" drops to almost 0; at that moment HEALTH_WARN appears and the transfer
stops, while "op/s wr" jumps from ~200 to ~1.5k. After that, total inodes
starts growing again, copying resumes, and the cluster returns to HEALTH_OK.
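To follow this cycle from the MDS side, a one-liner like the following can be used (jq is assumed to be installed; client id 32497 as in the session dump further below):
```
# Watch how many caps the MDS currently counts for this client session.
watch -n5 "ceph daemon mds.cephfs.X client ls 32497 | jq '.[0].num_caps'"
```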
Mount options:
```
/mnt/cephfs-old 10.77.12.90:6789,10.77.12.91:6789,10.77.12.92:6789:/ ceph rw,noatime,nodiratime,name=admin,secret=<hidden>,acl
/mnt/cephfs-new 10.77.12.139:6789,10.77.12.140:6789,10.77.12.141:6789:/ ceph rw,noatime,nodiratime,name=admin,secret=<hidden>,acl,caps_max=10000
```
Client properties on the MDS server (unnecessary fields removed):
```
ceph daemon mds.cephfs.X client ls 32497
[
    {
        ...
        "id": 32497,
        "state": "open",
        "num_leases": 0,
        "num_caps": 980679,
        "request_load_avg": 7913,
        "requests_in_flight": 466,
        "num_completed_flushes": 464,
        "recall_caps": {
            "value": 0,
            "halflife": 60
        },
        "release_caps": {
            "value": 1732.2552002208533,
            "halflife": 60
        },
        "recall_caps_throttle": {
            "value": 0,
            "halflife": 1.35000000000000001
        },
        "recall_caps_throttle2o": {
            "value": 0,
            "halflife": 0.5
        },
        "session_cache_liveness": {
            "value": 42186.620275326415,
            "halflife": 300
        },
        "cap_acquisition": {
            "value": 0,
            "halflife": 30
        },
        ...
    }
]
```
ceph daemonperf mds.cephfs.X
```
-----------------------------------------mds--------------------------------------- --mds_cache-- ------mds_log------ -mds_mem- ---mds_server----- mds_ ----objecter----- purg
req rlat slr fwd inos caps exi imi hifc crev cgra ctru cfsa cfa hcc hccd hccr prcr|stry recy recd|subm evts segs repl|ino  dn  |hcr hcs hsr cre cat|sess|actv rd  wr  rdwr|purg|
114    0   0   0 1.9M 438k   0   0    0    0  253    0    0   0   0    0    0   59|   0    0    0| 128 123k  129    0|1.3M 1.9M|114   0   0   0   0|   3|   0   0 440    0|   0
101    0   0   0 1.9M 438k   0   0    0    0    0    0    0   0   0    0    0   53|   0    0    0| 106 123k  129    0|1.3M 1.9M|101   0   0   0   0|   3|   0   0   0    0|   0
...
```
From this output it is clear that the client is not sending any cap releases
at all (column hccr; the column meanings can be decoded with "ceph daemon
mds.X perf schema").
I was then able to find the right search query to google similar problems:
https://www.spinics.net/lists/ceph-users/msg50573.html
https://ceph-users.ceph.narkive.com/mcyPtEyz/rsync-kernel-client-cepfs-mkstemp-no-space-left-on-device
https://www.spinics.net/lists/ceph-users/msg50158.html
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/B7K6B5VXM3I7TODM4GRF3N7S254O5ETY/
So it turns out that the problem is with rsync, with the way it works?
The only "solution" is to run "echo 2 > /proc/sys/vm/drop_caches" on the
client, either on a schedule or once a certain number of caps has been
acquired (see the sketch below). After this command the cephfs client releases
its cached caps, and if there were a lot of them, the MDS becomes slow again.
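A minimal sketch of such a scheduled check, assuming the counter can be scraped from the client's debugfs metrics (the exact file layout under /sys/kernel/debug/ceph/<fsid>.client<id> differs between kernel versions, and the 800000 threshold is just an example):
```
#!/bin/bash
# Sketch: drop the client's cached dentries/inodes (and with them the pinned
# caps) once the pinned cap count crosses a threshold. CLIENT_DIR and
# THRESHOLD are placeholders - adjust to your fsid/client id and environment.
CLIENT_DIR=/sys/kernel/debug/ceph/451eea44-d7a0-11ee-9117-b496914b4c02.client32497
THRESHOLD=800000

# Line format as in the metrics output above:
# "pinned i_caps / total inodes  <pinned> / <total>"
pinned=$(grep -rh 'pinned i_caps' "$CLIENT_DIR" 2>/dev/null \
         | head -n1 | awk '{print $(NF-2)}')

if [ "${pinned:-0}" -ge "$THRESHOLD" ]; then
    # 2 = free reclaimable slab objects (dentries and inodes), which is what
    # makes the cephfs kernel client release its cached caps.
    echo 2 > /proc/sys/vm/drop_caches
fi
```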
We also tried mounting cephfs with the option "caps_max=10000" so that the
client would force a release once that value is reached, but this did not
help.
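For reference, the destination mount with that option looks roughly like this (addresses taken from the mount table above, secret hidden):
```
mount -t ceph 10.77.12.139:6789,10.77.12.140:6789,10.77.12.141:6789:/ /mnt/cephfs-new \
    -o name=admin,secret=<hidden>,noatime,nodiratime,acl,caps_max=10000
```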
We could lower mds_max_caps_per_client (not tested), but that affects all
clients at once.
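If we went that route, it would presumably be a cluster-wide setting like this (262144 is only an example value):
```
# Lower the cap limit the MDS enforces per client session (applies to all
# MDS daemons and therefore to all clients).
ceph config set mds mds_max_caps_per_client 262144
```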
The command "ceph daemon mds.cephfs.X cache drop" (with or without the
additional parameter) does not help either.
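For clarity, the two forms are the following (as far as I understand, the optional parameter is a recall timeout in seconds; 300 is only an example):
```
# Ask the MDS to trim its cache and recall client state.
ceph daemon mds.cephfs.X cache drop
# The same, with the optional timeout argument.
ceph daemon mds.cephfs.X cache drop 300
```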
Tested on Linux kernels (client side): 5.10 and 6.1
Did I understand everything correctly? Is this the expected behavior when
running rsync?
And one more problem (I don't know whether it's related or not): when rsync
finishes copying, all caps are released except the last two ("pinned i_caps /
total inodes 2 / 2"). At that moment the warning "1 clients failing to advance
oldest client/flush tid" appears (or remains from the earlier release of a
large number of caps), and then it does not go away. I waited 12 hours.
The warning disappears only after running the "sync" command on the client;
after that, the client metrics show "pinned i_caps / total inodes 1 / 1".
Note: running "echo 2 > /proc/sys/vm/drop_caches" does not help in this case.