We had some network problems (high packet drop rates) affecting some
CephFS client nodes that run ceph-fuse (14.2.13) against a Nautilus
cluster (on version 14.2.8). As a result a couple of clients got
evicted (as one would expect). What was really odd is that the clients
kept trying to flush the data they had in cache and kept getting
rejected by the OSDs for almost an hour, and then magically the data
flush worked. When queried afterwards, the client reported that it was
no longer blacklisted. How would that happen? I certainly didn't run
any commands to un-blacklist a client, and the docs say that otherwise
the client will stay blacklisted until the file system gets remounted.
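For reference, the current blacklist entries (and, if I'm reading the
output right, their expiry times) can be listed on the monitor side
with:

ceph osd blacklist ls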
Here is the status of the client while it was blacklisted:
[root@worker2033 ceph]# ceph daemon
/var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok status
{
    "metadata": {
        "ceph_sha1": "1778d63e55dbff6cedb071ab7d367f8f52a8699f",
        "ceph_version": "ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)",
        "entity_id": "cephfs2",
        "hostname": "worker2033",
        "mount_point": "/mnt/ceph",
        "pid": "7698",
        "root": "/"
    },
    "dentry_count": 252,
    "dentry_pinned_count": 9,
    "id": 111995680,
    "inst": {
        "name": {
            "type": "client",
            "num": 111995680
        },
        "addr": {
            "type": "v1",
            "addr": "10.254.65.33:0",
            "nonce": 410851087
        }
    },
    "addr": {
        "type": "v1",
        "addr": "10.254.65.33:0",
        "nonce": 410851087
    },
    "inst_str": "client.111995680 10.254.65.33:0/410851087",
    "addr_str": "10.254.65.33:0/410851087",
    "inode_count": 251,
    "mds_epoch": 3376260,
    "osd_epoch": 1717896,
    "osd_epoch_barrier": 1717893,
    "blacklisted": true
}
This corresponds to these server-side log messages:
2020-11-09 15:56:31.578 7fffe59a4700 1 mds.0.3376160 Evicting (and
blacklisting) client session 111995680 (10.254.65.33:0/410851087)
2020-11-09 15:56:31.578 7fffe59a4700 0 log_channel(cluster) log [INF] :
Evicting (and blacklisting) client session 111995680
(10.254.65.33:0/410851087)
2020-11-09 15:56:31.706 7fffe59a4700 1 mds.0.3376160 Evicting (and
blacklisting) client session 111995680 (10.254.65.33:0/410851087)
2020-11-09 15:56:31.706 7fffe59a4700 0 log_channel(cluster) log [INF] :
Evicting (and blacklisting) client session 111995680
(10.254.65.33:0/410851087)
and then some time later (perhaps half an hour or so) I got this from
the client:
[root@worker2033 ceph]# ceph daemon
/var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok status
{
    "metadata": {
        "ceph_sha1": "1778d63e55dbff6cedb071ab7d367f8f52a8699f",
        "ceph_version": "ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)",
        "entity_id": "cephfs2",
        "hostname": "worker2033",
        "mount_point": "/mnt/ceph",
        "pid": "7698",
        "root": "/"
    },
    "dentry_count": 252,
    "dentry_pinned_count": 9,
    "id": 111995680,
    "inst": {
        "name": {
            "type": "client",
            "num": 111995680
        },
        "addr": {
            "type": "v1",
            "addr": "10.254.65.33:0",
            "nonce": 410851087
        }
    },
    "addr": {
        "type": "v1",
        "addr": "10.254.65.33:0",
        "nonce": 410851087
    },
    "inst_str": "client.111995680 10.254.65.33:0/410851087",
    "addr_str": "10.254.65.33:0/410851087",
    "inode_count": 251,
    "mds_epoch": 3376260,
    "osd_epoch": 1717897,
    "osd_epoch_barrier": 1717893,
    "blacklisted": false
}
The cluster was otherwise healthy - nothing wrong with the MDSs, any
placement groups, etc. I also don't see any further log messages
regarding eviction/blacklisting in the MDS logs. I didn't run any ceph
commands that would change the state of the cluster - I was just
looking around and increasing log levels.
Any ideas how that could have happened?
A separate problem (which perhaps needs a ticket filed) is that while
the ceph-fuse client was in a blacklisted state, it kept retrying in an
infinite loop to flush data to the OSDs and got rejected every time. I
have logs with the details of this too.
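In case it helps, I believe the outstanding OSD operations that were
being retried can also be dumped from the client's admin socket
(assuming the objecter_requests asok command behaves on ceph-fuse the
way it does on other librados clients), e.g.:

ceph daemon /var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok objecter_requests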
Andras