We had some network problems (high packet drop rates) affecting some
CephFS client nodes that run ceph-fuse (14.2.13) against a Nautilus
cluster (on version 14.2.8). As a result a couple of clients got
evicted (as one would expect). What was really odd is that the clients
kept trying to flush the data they had in cache and kept getting
rejected by the OSDs for almost an hour, and then magically the data
flush worked. When queried afterwards, the client reported that it was
no longer blacklisted. How would that happen? I certainly didn't run
any commands to un-blacklist a client, and the docs say that otherwise
the client will stay blacklisted until the file system gets remounted.
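For reference, the current blacklist entries (and, if I'm reading the
output right, their expiry times) can be listed on the monitor side
with:

ceph osd blacklist ls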
Here is the status of the client while it was blacklisted:
[root@worker2033 ceph]# ceph daemon
/var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok status
{
    "metadata": {
        "ceph_sha1": "1778d63e55dbff6cedb071ab7d367f8f52a8699f",
        "ceph_version": "ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)",
        "entity_id": "cephfs2",
        "hostname": "worker2033",
        "mount_point": "/mnt/ceph",
        "pid": "7698",
        "root": "/"
    },
    "dentry_count": 252,
    "dentry_pinned_count": 9,
    "id": 111995680,
    "inst": {
        "name": {
            "type": "client",
            "num": 111995680
        },
        "addr": {
            "type": "v1",
            "addr": "10.254.65.33:0",
            "nonce": 410851087
        }
    },
    "addr": {
        "type": "v1",
        "addr": "10.254.65.33:0",
        "nonce": 410851087
    },
    "inst_str": "client.111995680 10.254.65.33:0/410851087",
    "addr_str": "10.254.65.33:0/410851087",
    "inode_count": 251,
    "mds_epoch": 3376260,
    "osd_epoch": 1717896,
    "osd_epoch_barrier": 1717893,
    "blacklisted": true
}
This corresponds to these server-side log messages:
2020-11-09 15:56:31.578 7fffe59a4700 1 mds.0.3376160 Evicting (and
blacklisting) client session 111995680 (10.254.65.33:0/410851087)
2020-11-09 15:56:31.578 7fffe59a4700 0 log_channel(cluster) log [INF] :
Evicting (and blacklisting) client session 111995680
(10.254.65.33:0/410851087)
2020-11-09 15:56:31.706 7fffe59a4700 1 mds.0.3376160 Evicting (and
blacklisting) client session 111995680 (10.254.65.33:0/410851087)
2020-11-09 15:56:31.706 7fffe59a4700 0 log_channel(cluster) log [INF] :
Evicting (and blacklisting) client session 111995680
(10.254.65.33:0/410851087)
and then some time later (perhaps half an hour or so) I got this from
the client:
[root@worker2033 ceph]# ceph daemon
/var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok status
{
    "metadata": {
        "ceph_sha1": "1778d63e55dbff6cedb071ab7d367f8f52a8699f",
        "ceph_version": "ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)",
        "entity_id": "cephfs2",
        "hostname": "worker2033",
        "mount_point": "/mnt/ceph",
        "pid": "7698",
        "root": "/"
    },
    "dentry_count": 252,
    "dentry_pinned_count": 9,
    "id": 111995680,
    "inst": {
        "name": {
            "type": "client",
            "num": 111995680
        },
        "addr": {
            "type": "v1",
            "addr": "10.254.65.33:0",
            "nonce": 410851087
        }
    },
    "addr": {
        "type": "v1",
        "addr": "10.254.65.33:0",
        "nonce": 410851087
    },
    "inst_str": "client.111995680 10.254.65.33:0/410851087",
    "addr_str": "10.254.65.33:0/410851087",
    "inode_count": 251,
    "mds_epoch": 3376260,
    "osd_epoch": 1717897,
    "osd_epoch_barrier": 1717893,
    "blacklisted": false
}
The cluster was otherwise healthy - nothing wrong with the MDSs, any
placement groups, etc. I also don't see any further log messages
regarding eviction/blacklisting in the MDS logs. I didn't run any ceph
commands that would change the state of the cluster - I was just
looking around and increasing log levels.
Any ideas how that could have happened?
A separate problem (which perhaps needs a ticket filed) is that while
the ceph-fuse client was in a blacklisted state, it kept retrying in an
infinite loop to flush data to the OSDs and got rejected every time. I
have logs with the details of this too.
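In case it helps, I believe the outstanding OSD operations that were
being retried can also be dumped from the client's admin socket
(assuming the objecter_requests asok command behaves on ceph-fuse the
way it does on other librados clients), e.g.:

ceph daemon /var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok objecter_requests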
Andras