Hi Andras,

The osd blocklist entries expire after 1hr by default:

  Option("mon_osd_blacklist_default_expire", Option::TYPE_FLOAT, Option::LEVEL_ADVANCED)
  .set_default(1_hr)
  .add_service("mon")
  .set_description("Duration in seconds that blacklist entries for clients "
                   "remain in the OSD map"),

(Check mon/OSDMonitor.cc for the implementation)

If you want to confirm that on your cluster, I've put a few commands at the
bottom of this mail.

Cheers, Dan

On Mon, Nov 9, 2020 at 11:59 PM Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> We had some network problems (high packet drops) to some cephfs client
> nodes that run ceph-fuse (14.2.13) against a Nautilus cluster (on
> version 14.2.8). As a result a couple of clients got evicted (as one
> would expect). What was really odd is that the clients were trying to
> flush data they had in cache and kept getting rejected by OSD's for
> almost an hour, and then magically the data flush worked. When asked
> afterwards, the client reported that it was no longer blacklisted. How
> would that happen? I certainly didn't run any commands to un-blacklist
> a client and the docs say that otherwise the client will stay
> blacklisted until the file system gets remounted.
>
> Here is the status of the client when it was blacklisted:
>
> [root@worker2033 ceph]# ceph daemon /var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok status
> {
>     "metadata": {
>         "ceph_sha1": "1778d63e55dbff6cedb071ab7d367f8f52a8699f",
>         "ceph_version": "ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)",
>         "entity_id": "cephfs2",
>         "hostname": "worker2033",
>         "mount_point": "/mnt/ceph",
>         "pid": "7698",
>         "root": "/"
>     },
>     "dentry_count": 252,
>     "dentry_pinned_count": 9,
>     "id": 111995680,
>     "inst": {
>         "name": {
>             "type": "client",
>             "num": 111995680
>         },
>         "addr": {
>             "type": "v1",
>             "addr": "10.254.65.33:0",
>             "nonce": 410851087
>         }
>     },
>     "addr": {
>         "type": "v1",
>         "addr": "10.254.65.33:0",
>         "nonce": 410851087
>     },
>     "inst_str": "client.111995680 10.254.65.33:0/410851087",
>     "addr_str": "10.254.65.33:0/410851087",
>     "inode_count": 251,
>     "mds_epoch": 3376260,
>     "osd_epoch": 1717896,
>     "osd_epoch_barrier": 1717893,
>     "blacklisted": true
> }
>
> This corresponds to server side log messages:
>
> 2020-11-09 15:56:31.578 7fffe59a4700 1 mds.0.3376160 Evicting (and blacklisting) client session 111995680 (10.254.65.33:0/410851087)
> 2020-11-09 15:56:31.578 7fffe59a4700 0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 111995680 (10.254.65.33:0/410851087)
> 2020-11-09 15:56:31.706 7fffe59a4700 1 mds.0.3376160 Evicting (and blacklisting) client session 111995680 (10.254.65.33:0/410851087)
> 2020-11-09 15:56:31.706 7fffe59a4700 0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 111995680 (10.254.65.33:0/410851087)
>
> and then some time later (perhaps half an hour or so) I got this from
> the client:
>
> [root@worker2033 ceph]# ceph daemon /var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok status
> {
>     "metadata": {
>         "ceph_sha1": "1778d63e55dbff6cedb071ab7d367f8f52a8699f",
>         "ceph_version": "ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)",
>         "entity_id": "cephfs2",
>         "hostname": "worker2033",
>         "mount_point": "/mnt/ceph",
>         "pid": "7698",
>         "root": "/"
>     },
>     "dentry_count": 252,
>     "dentry_pinned_count": 9,
>     "id": 111995680,
>     "inst": {
>         "name": {
>             "type": "client",
>             "num": 111995680
>         },
>         "addr": {
>             "type": "v1",
>             "addr": "10.254.65.33:0",
>             "nonce": 410851087
>         }
>     },
>     "addr": {
>         "type": "v1",
>         "addr": "10.254.65.33:0",
>         "nonce": 410851087
>     },
>     "inst_str": "client.111995680 10.254.65.33:0/410851087",
>     "addr_str": "10.254.65.33:0/410851087",
>     "inode_count": 251,
>     "mds_epoch": 3376260,
>     "osd_epoch": 1717897,
>     "osd_epoch_barrier": 1717893,
>     "blacklisted": false
> }
>
> The cluster was otherwise healthy - nothing wrong with MDS's, or any
> placement groups, etc. I also don't see any further log messages
> regarding eviction/blacklisting in the MDS logs. I didn't run any ceph
> commands that would change the state of the cluster - I was just looking
> around, increasing log levels.
>
> Any ideas how that could have happened?
>
> A separate problem (perhaps needs a ticket filed) is that while the
> ceph-fuse client was in a blacklisted state, it was retrying in an
> infinite loop to flush data to the OSD's and got rejected every time. I
> have some logs for the details of this too.
>
> Andras