Hi, I have a 3-node Proxmox Ceph cluster that's been acting up whenever I try to do anything with one of its pools (fastwrx). `rbd pool stats fastwrx` just hangs on one node, but responds instantly on the other two. `ceph -s` looks like this:

root@ibnmajid:~# ceph -s
  cluster:
    id:     310af567-1607-402b-bc5d-c62286a129d5
    health: HEALTH_WARN
            insufficient standby MDS daemons available

  services:
    mon: 3 daemons, quorum ibnmajid,ganges,riogrande (age 47h)
    mgr: riogrande(active, since 47h)
    mds: 2/2 daemons up, 1 hot standby
    osd: 18 osds: 18 up (since 47h), 18 in (since 47h)

  data:
    volumes: 2/2 healthy
    pools:   7 pools, 1537 pgs
    objects: 793.24k objects, 1.9 TiB
    usage:   4.1 TiB used, 10 TiB / 14 TiB avail
    pgs:     1537 active+clean

  io:
    client:   1.5 MiB/s rd, 243 KiB/s wr, 3 op/s rd, 19 op/s wr

I don't really know where to begin here; nothing jumps out at me in syslog. It's as if the rbd client on that node, not anything involved in actually serving data, is just somehow broken. `ceph status` on that node works fine, and the problem appears to be limited to only the one pool.

root@ibnmajid:~# ceph status
  cluster:
    id:     310af567-1607-402b-bc5d-c62286a129d5
    health: HEALTH_WARN
            insufficient standby MDS daemons available

  services:
    mon: 3 daemons, quorum ibnmajid,ganges,riogrande (age 2d)
    mgr: riogrande(active, since 2d)
    mds: 2/2 daemons up, 1 hot standby
    osd: 18 osds: 18 up (since 2d), 18 in (since 2d)

  data:
    volumes: 2/2 healthy
    pools:   7 pools, 1537 pgs
    objects: 793.28k objects, 1.9 TiB
    usage:   4.1 TiB used, 10 TiB / 14 TiB avail
    pgs:     1537 active+clean

  io:
    client:   2.3 MiB/s rd, 137 KiB/s wr, 2 op/s rd, 18 op/s wr

If I try a different pool on the same node, that works fine:

root@ibnmajid:~# rbd pool stats largewrx
Total Images: 0
Total Snapshots: 0
Provisioned Size: 0 B

(Those statistics are correct; that pool isn't in use for RBD directly but rather via CephFS.)

Similarly, the CephFS pools related to fastwrx don't work on this node either, but the others do:

root@ibnmajid:~# rbd pool stats fastwrxFS_data
^C
root@ibnmajid:~# rbd pool stats fastwrxFS_metadata
^C
root@ibnmajid:~# rbd pool stats largewrxFS_data
Total Images: 0
Total Snapshots: 0
Provisioned Size: 0 B
root@ibnmajid:~# rbd pool stats largewrxFS_metadata
Total Images: 0
Total Snapshots: 0
Provisioned Size: 0 B
root@ibnmajid:~#

On another node, everything returns results instantly, but fastwrxFS is definitely in use, so I'm not sure why it says this:

root@ganges:~# rbd pool stats fastwrx
Total Images: 17
Total Snapshots: 0
Provisioned Size: 1.3 TiB
root@ganges:~# rbd pool stats fastwrxFS_data
Total Images: 0
Total Snapshots: 0
Provisioned Size: 0 B
root@ganges:~# rbd pool stats fastwrxFS_metadata
Total Images: 0
Total Snapshots: 0
Provisioned Size: 0 B

Here's what happens if I try `ceph osd pool stats` on a "good" node:

root@ganges:~# ceph osd pool stats
pool fastwrx id 9
  client io 0 B/s rd, 105 KiB/s wr, 0 op/s rd, 14 op/s wr

pool largewrx id 10
  nothing is going on

pool fastwrxFS_data id 17
  nothing is going on

pool fastwrxFS_metadata id 18
  client io 852 B/s rd, 1 op/s rd, 0 op/s wr

pool largewrxFS_data id 20
  client io 2.9 MiB/s rd, 2 op/s rd, 0 op/s wr

pool largewrxFS_metadata id 21
  nothing is going on

pool .mgr id 22
  nothing is going on

And on the broken node:

root@ibnmajid:~# ceph osd pool stats
pool fastwrx id 9
  client io 0 B/s rd, 93 KiB/s wr, 0 op/s rd, 5 op/s wr

pool largewrx id 10
  nothing is going on

pool fastwrxFS_data id 17
  nothing is going on

pool fastwrxFS_metadata id 18
  client io 852 B/s rd, 1 op/s rd, 0 op/s wr

pool largewrxFS_data id 20
  client io 1.9 MiB/s rd, 0 op/s rd, 0 op/s wr

pool largewrxFS_metadata id 21
  nothing is going on

pool .mgr id 22
  nothing is going on

So whatever interface that command uses seems to interact with the pool fine, I guess. How do I get started fixing this? Thanks!
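
In case it's useful, here's roughly what I was planning to try next from the broken node to narrow down where the client stalls. This is just a sketch; the object name and osd.0 below are made-up placeholders, and I'd swap in whatever `ceph pg ls-by-pool` / `ceph osd map` actually report for fastwrx:

# Re-run the hanging command with client-side debug logging turned up,
# to see the last message exchanged before it stalls.
rbd pool stats fastwrx --debug-rbd=20 --debug-ms=1

# See which PGs and OSDs the pool maps to from this node, in case only
# a single OSD session is wedged.
ceph pg ls-by-pool fastwrx
ceph osd map fastwrx test-object        # "test-object" is a made-up name, just to see the mapping

# On whichever host carries a suspect OSD, look for ops stuck in flight.
ceph daemon osd.0 dump_ops_in_flight    # osd.0 is a placeholder; repeat for the OSDs involved

Does that seem like a sensible starting point, or is there something more obvious I should check first?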