Hi,

I have defined a pool hdd which is used exclusively by the virtual disks of multiple KVMs / LXCs.

Yesterday I ran these commands:
    osdmaptool om --upmap out.txt --upmap-pool hdd
    source out.txt
and Ceph started rebalancing this pool.

However, since then no KVM / LXC is responding anymore. If I try to start a new KVM, it hangs in the boot process.

This is the output of ceph health detail:

root@ld3955:/mnt/rbd# ceph health detail
HEALTH_ERR 28 nearfull osd(s); 1 pool(s) nearfull; Reduced data availability: 1 pg inactive, 1 pg peering; Degraded data redundancy (low space): 8 pgs backfill_toofull; 1 subtrees have overcommitted pool target_size_bytes; 1 subtrees have overcommitted pool target_size_ratio; 2 pools have too many placement groups; 672 slow requests are blocked > 32 sec; 4752 stuck requests are blocked > 4096 sec
OSD_NEARFULL 28 nearfull osd(s)
    osd.42 is near full
    osd.44 is near full
    osd.45 is near full
    osd.77 is near full
    osd.84 is near full
    osd.94 is near full
    osd.101 is near full
    osd.103 is near full
    osd.106 is near full
    osd.109 is near full
    osd.113 is near full
    osd.118 is near full
    osd.120 is near full
    osd.136 is near full
    osd.138 is near full
    osd.142 is near full
    osd.147 is near full
    osd.156 is near full
    osd.159 is near full
    osd.161 is near full
    osd.168 is near full
    osd.192 is near full
    osd.202 is near full
    osd.206 is near full
    osd.208 is near full
    osd.226 is near full
    osd.234 is near full
    osd.247 is near full
POOL_NEARFULL 1 pool(s) nearfull
    pool 'hdb_backup' is nearfull
PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg peering
    pg 30.1b9 is stuck peering for 4722.750977, current state peering, last acting [183,27,63]
PG_DEGRADED_FULL Degraded data redundancy (low space): 8 pgs backfill_toofull
    pg 11.465 is active+remapped+backfill_wait+backfill_toofull, acting [308,351,58]
    pg 11.5c4 is active+remapped+backfill_wait+backfill_toofull, acting [318,336,54]
    pg 11.afd is active+remapped+backfill_wait+backfill_toofull, acting [347,220,315]
    pg 11.b82 is active+remapped+backfill_toofull, acting [314,320,254]
    pg 11.1803 is active+remapped+backfill_wait+backfill_toofull, acting [88,363,302]
    pg 11.1aac is active+remapped+backfill_wait+backfill_toofull, acting [328,275,95]
    pg 11.1c09 is active+remapped+backfill_wait+backfill_toofull, acting [55,124,278]
    pg 11.1e36 is active+remapped+backfill_wait+backfill_toofull, acting [351,92,315]
POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted pool target_size_bytes
    Pools ['hdb_backup'] overcommit available storage by 1.708x due to target_size_bytes 0 on pools []
POOL_TARGET_SIZE_RATIO_OVERCOMMITTED 1 subtrees have overcommitted pool target_size_ratio
    Pools ['hdb_backup'] overcommit available storage by 1.708x due to target_size_ratio 0.000 on pools []
POOL_TOO_MANY_PGS 2 pools have too many placement groups
    Pool hdd has 512 placement groups, should have 128
    Pool pve_cephfs_metadata has 32 placement groups, should have 4
REQUEST_SLOW 672 slow requests are blocked > 32 sec
    249 ops are blocked > 2097.15 sec
    284 ops are blocked > 1048.58 sec
    108 ops are blocked > 524.288 sec
    9 ops are blocked > 262.144 sec
    22 ops are blocked > 131.072 sec
    osd.9 has blocked requests > 524.288 sec
    osds 0,2,6,68 have blocked requests > 1048.58 sec
    osd.3 has blocked requests > 2097.15 sec
REQUEST_STUCK 4752 stuck requests are blocked > 4096 sec
    1431 ops are blocked > 67108.9 sec
    513 ops are blocked > 33554.4 sec
    909 ops are blocked > 16777.2 sec
    1809 ops are blocked > 8388.61 sec
    90 ops are blocked > 4194.3 sec
    osd.63 has stuck requests > 67108.9 sec
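
For completeness, the full upmap sequence looked roughly like this; the first command is only the usual way the om file gets produced and is noted here for context, the filenames are the ones used above:

    # dump the current osdmap into the file 'om'
    ceph osd getmap -o om
    # compute upmap entries for pool hdd only and write the resulting
    # 'ceph osd pg-upmap-items ...' commands into out.txt
    osdmaptool om --upmap out.txt --upmap-pool hdd
    # apply the generated commands to the cluster
    source out.txt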
My interpretation is that Ceph
a) is busy remapping PGs of pool hdb_backup
b) has identified several OSDs with either blocked or stuck requests.

All of these OSDs belong to pool hdd, though. osd.9 belongs to node A, osd.63 and osd.68 belong to node C (there are 4 nodes serving OSDs in the cluster).

I have tried to fix this issue, but without success:
- ceph osd set noout
- restart of the relevant OSDs with systemctl restart ceph-osd@<id>
- finally a server reboot
(see the PS below for the exact commands)

I also tried to migrate the virtual disks to another pool, but this fails, too.

There have been no changes on the server side, like network or disks or anything else.

How can I resolve this issue?

THX
Thomas
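
PS: In case it matters, the recovery attempts above looked roughly like this; osd.9 is just one example ID, the restart was repeated for the other affected OSDs before the node was rebooted:

    # keep OSDs from being marked out while working on them
    ceph osd set noout
    # restart one of the affected OSD daemons (repeated per OSD ID)
    systemctl restart ceph-osd@9
    # last resort: reboot the whole OSD node
    reboot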