We are transferring data (300 million small files) with rsync between two CephFS clusters, from version 12.2.13 to 18.2.1. After roughly the same amount of time on each run (about 7 hours in this case), copying stalls for about a minute and the cluster reports:
```
  health: HEALTH_WARN
          1 clients failing to advance oldest client/flush tid
          1 MDSs report slow metadata IOs
          1 MDSs behind on trimming
```
The only interesting messages in the logs are:
```
ceph-mds[640738]: mds.beacon.cephfs.X missed beacon ack from the monitors
```
I watched the client-side (destination) debug metrics via watch -n1 on
/sys/kernel/debug/ceph/451eea44-d7a0-11ee-9117-b496914b4c02.client32497:
```
item                          total
------------------------------------------
opened files  / total inodes  1 / 866757
pinned i_caps / total inodes  866757 / 866757
opened inodes / total inodes  1 / 866757

item      total     avg_lat(us)  min_lat(us)  max_lat(us)  stdev(us)
---------------------------------------------------------------------
read      0         0            0            0            0
write     5129252   194689       9080         54693415     1023
metadata  29045143  670          161          87794124     369

item      total     miss      hit
-------------------------------------------------
d_lease   5361      3026      315488442
caps      866757    11        483177478
```
During the copy, "pinned i_caps / total inodes" gradually increases until it
reaches mds_max_caps_per_client (default: 1Mi). Then "pinned i_caps / total
inodes" drops to almost 0; at that moment HEALTH_WARN appears and the transfer
stops, while "op/s wr" jumps from ~200 to ~1.5k. After that, total inodes
starts growing again, copying resumes, and the cluster returns to HEALTH_OK.
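To follow this cycle from the MDS side, a one-liner like the following can be used (jq is assumed to be installed; client id 32497 as in the session dump further below):
```
# Watch how many caps the MDS currently counts for this client session.
watch -n5 "ceph daemon mds.cephfs.X client ls 32497 | jq '.[0].num_caps'"
```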
Mount options:
```
/mnt/cephfs-old 10.77.12.90:6789,10.77.12.91:6789,10.77.12.92:6789:/ ceph rw,noatime,nodiratime,name=admin,secret=<hidden>,acl
/mnt/cephfs-new 10.77.12.139:6789,10.77.12.140:6789,10.77.12.141:6789:/ ceph rw,noatime,nodiratime,name=admin,secret=<hidden>,acl,caps_max=10000
```
Client properties on the MDS server (unnecessary fields removed):
```
ceph daemon mds.cephfs.X client ls 32497
[
    {
        ...
        "id": 32497,
        "state": "open",
        "num_leases": 0,
        "num_caps": 980679,
        "request_load_avg": 7913,
        "requests_in_flight": 466,
        "num_completed_flushes": 464,
        "recall_caps": {
            "value": 0,
            "halflife": 60
        },
        "release_caps": {
            "value": 1732.2552002208533,
            "halflife": 60
        },
        "recall_caps_throttle": {
            "value": 0,
            "halflife": 1.35000000000000001
        },
        "recall_caps_throttle2o": {
            "value": 0,
            "halflife": 0.5
        },
        "session_cache_liveness": {
            "value": 42186.620275326415,
            "halflife": 300
        },
        "cap_acquisition": {
            "value": 0,
            "halflife": 30
        },
        ...
    }
]
```
ceph daemonperf mds.cephfs.X
```
-----------------------------------------mds--------------------------------------- --mds_cache-- ------mds_log------ -mds_mem- ---mds_server----- mds_ ----objecter----- purg
req rlat slr fwd inos caps exi imi hifc crev cgra ctru cfsa cfa hcc hccd hccr prcr|stry recy recd|subm evts segs repl|ino  dn  |hcr hcs hsr cre cat|sess|actv rd  wr  rdwr|purg|
114    0   0   0 1.9M 438k   0   0    0    0  253    0    0   0   0    0    0   59|   0    0    0| 128 123k  129    0|1.3M 1.9M|114   0   0   0   0|   3|   0   0 440    0|   0
101    0   0   0 1.9M 438k   0   0    0    0    0    0    0   0   0    0    0   53|   0    0    0| 106 123k  129    0|1.3M 1.9M|101   0   0   0   0|   3|   0   0   0    0|   0
...
```
From this output it is clear that the client is not sending any cap releases
at all (column hccr; the column meanings can be decoded with "ceph daemon
mds.X perf schema").
I was then able to find the right search query to google similar problems:
https://www.spinics.net/lists/ceph-users/msg50573.html
https://ceph-users.ceph.narkive.com/mcyPtEyz/rsync-kernel-client-cepfs-mkstemp-no-space-left-on-device
https://www.spinics.net/lists/ceph-users/msg50158.html
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/B7K6B5VXM3I7TODM4GRF3N7S254O5ETY/
So it turns out that the problem is with rsync, with the way it works?
The only "solution" is to run "echo 2 > /proc/sys/vm/drop_caches" on the
client, either on a schedule or once a certain number of caps has been
acquired (see the sketch below). After this command the cephfs client releases
its cached caps, and if there were a lot of them, the MDS becomes slow again.
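A minimal sketch of such a scheduled check, assuming the counter can be scraped from the client's debugfs metrics (the exact file layout under /sys/kernel/debug/ceph/<fsid>.client<id> differs between kernel versions, and the 800000 threshold is just an example):
```
#!/bin/bash
# Sketch: drop the client's cached dentries/inodes (and with them the pinned
# caps) once the pinned cap count crosses a threshold. CLIENT_DIR and
# THRESHOLD are placeholders - adjust to your fsid/client id and environment.
CLIENT_DIR=/sys/kernel/debug/ceph/451eea44-d7a0-11ee-9117-b496914b4c02.client32497
THRESHOLD=800000

# Line format as in the metrics output above:
# "pinned i_caps / total inodes  <pinned> / <total>"
pinned=$(grep -rh 'pinned i_caps' "$CLIENT_DIR" 2>/dev/null \
         | head -n1 | awk '{print $(NF-2)}')

if [ "${pinned:-0}" -ge "$THRESHOLD" ]; then
    # 2 = free reclaimable slab objects (dentries and inodes), which is what
    # makes the cephfs kernel client release its cached caps.
    echo 2 > /proc/sys/vm/drop_caches
fi
```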
We also tried mounting cephfs with the option "caps_max=10000" so that the
client would force a release once that value is reached, but this did not
help.
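For reference, the destination mount with that option looks roughly like this (addresses taken from the mount table above, secret hidden):
```
mount -t ceph 10.77.12.139:6789,10.77.12.140:6789,10.77.12.141:6789:/ /mnt/cephfs-new \
    -o name=admin,secret=<hidden>,noatime,nodiratime,acl,caps_max=10000
```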
We could lower mds_max_caps_per_client (not tested), but that affects all
clients at once.
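If we went that route, it would presumably be a cluster-wide setting like this (262144 is only an example value):
```
# Lower the cap limit the MDS enforces per client session (applies to all
# MDS daemons and therefore to all clients).
ceph config set mds mds_max_caps_per_client 262144
```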
The command "ceph daemon mds.cephfs.X cache drop" (with or without the
additional parameter) does not help either.
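For clarity, the two forms are the following (as far as I understand, the optional parameter is a recall timeout in seconds; 300 is only an example):
```
# Ask the MDS to trim its cache and recall client state.
ceph daemon mds.cephfs.X cache drop
# The same, with the optional timeout argument.
ceph daemon mds.cephfs.X cache drop 300
```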
Tested on Linux kernels (client side): 5.10 and 6.1
Did I understand everything correctly? Is this the expected behavior when
running rsync?
And one more problem (I don't know whether it's related or not): when rsync
finishes copying, all caps are released except the last two ("pinned i_caps /
total inodes 2 / 2"). At that moment the warning "1 clients failing to advance
oldest client/flush tid" appears (or remains from the earlier release of a
large number of caps), and then it does not go away. I waited 12 hours.
The warning disappears only after running the "sync" command on the client;
after that, the client metrics show "pinned i_caps / total inodes 1 / 1".
Note: running "echo 2 > /proc/sys/vm/drop_caches" does not help in this case.