Hi Andras,

The osd blocklist entries expire after 1hr by default:

  Option("mon_osd_blacklist_default_expire", Option::TYPE_FLOAT, Option::LEVEL_ADVANCED)
  .set_default(1_hr)
  .add_service("mon")
  .set_description("Duration in seconds that blacklist entries for clients "
                   "remain in the OSD map"),

(Check mon/OSDMonitor.cc for the implementation)

If you want to confirm that on your cluster, I've put a few commands at the
bottom of this mail.

Cheers, Dan

On Mon, Nov 9, 2020 at 11:59 PM Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> We had some network problems (high packet drops) to some cephfs client
> nodes that run ceph-fuse (14.2.13) against a Nautilus cluster (on
> version 14.2.8). As a result a couple of clients got evicted (as one
> would expect). What was really odd is that the clients were trying to
> flush data they had in cache and kept getting rejected by OSD's for
> almost an hour, and then magically the data flush worked. When asked
> afterwards, the client reported that it was no longer blacklisted. How
> would that happen? I certainly didn't run any commands to un-blacklist
> a client and the docs say that otherwise the client will stay
> blacklisted until the file system gets remounted.
>
> Here is the status of the client when it was blacklisted:
>
> [root@worker2033 ceph]# ceph daemon /var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok status
> {
>     "metadata": {
>         "ceph_sha1": "1778d63e55dbff6cedb071ab7d367f8f52a8699f",
>         "ceph_version": "ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)",
>         "entity_id": "cephfs2",
>         "hostname": "worker2033",
>         "mount_point": "/mnt/ceph",
>         "pid": "7698",
>         "root": "/"
>     },
>     "dentry_count": 252,
>     "dentry_pinned_count": 9,
>     "id": 111995680,
>     "inst": {
>         "name": {
>             "type": "client",
>             "num": 111995680
>         },
>         "addr": {
>             "type": "v1",
>             "addr": "10.254.65.33:0",
>             "nonce": 410851087
>         }
>     },
>     "addr": {
>         "type": "v1",
>         "addr": "10.254.65.33:0",
>         "nonce": 410851087
>     },
>     "inst_str": "client.111995680 10.254.65.33:0/410851087",
>     "addr_str": "10.254.65.33:0/410851087",
>     "inode_count": 251,
>     "mds_epoch": 3376260,
>     "osd_epoch": 1717896,
>     "osd_epoch_barrier": 1717893,
>     "blacklisted": true
> }
>
> This corresponds to server side log messages:
>
> 2020-11-09 15:56:31.578 7fffe59a4700 1 mds.0.3376160 Evicting (and blacklisting) client session 111995680 (10.254.65.33:0/410851087)
> 2020-11-09 15:56:31.578 7fffe59a4700 0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 111995680 (10.254.65.33:0/410851087)
> 2020-11-09 15:56:31.706 7fffe59a4700 1 mds.0.3376160 Evicting (and blacklisting) client session 111995680 (10.254.65.33:0/410851087)
> 2020-11-09 15:56:31.706 7fffe59a4700 0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 111995680 (10.254.65.33:0/410851087)
>
> and then some time later (perhaps half an hour or so) I got this from
> the client:
>
> [root@worker2033 ceph]# ceph daemon /var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok status
> {
>     "metadata": {
>         "ceph_sha1": "1778d63e55dbff6cedb071ab7d367f8f52a8699f",
>         "ceph_version": "ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)",
>         "entity_id": "cephfs2",
>         "hostname": "worker2033",
>         "mount_point": "/mnt/ceph",
>         "pid": "7698",
>         "root": "/"
>     },
>     "dentry_count": 252,
>     "dentry_pinned_count": 9,
>     "id": 111995680,
>     "inst": {
>         "name": {
>             "type": "client",
>             "num": 111995680
>         },
>         "addr": {
>             "type": "v1",
>             "addr": "10.254.65.33:0",
>             "nonce": 410851087
>         }
>     },
>     "addr": {
>         "type": "v1",
>         "addr": "10.254.65.33:0",
>         "nonce": 410851087
>     },
>     "inst_str": "client.111995680 10.254.65.33:0/410851087",
>     "addr_str": "10.254.65.33:0/410851087",
>     "inode_count": 251,
>     "mds_epoch": 3376260,
>     "osd_epoch": 1717897,
>     "osd_epoch_barrier": 1717893,
>     "blacklisted": false
> }
>
> The cluster was otherwise healthy - nothing wrong with MDS's, or any
> placement groups, etc. I also don't see any further log messages
> regarding eviction/blacklisting in the MDS logs. I didn't run any ceph
> commands that would change the state of the cluster - I was just looking
> around, increasing log levels.
>
> Any ideas how that could have happened?
>
> A separate problem (perhaps needs a ticket filed) is that while the
> ceph-fuse client was in a blacklisted state, it was retrying in an
> infinite loop to flush data to the OSD's and got rejected every time. I
> have some logs for the details of this too.
>
> Andras