Hello,

TL;DR: what should I do when my cluster reports stuck unclean PGs?

Detailed description:

One of the nodes in my cluster died. Ceph correctly rebalanced itself and reached the HEALTH_OK state. I looked at the failed server and decided to take it out of the cluster permanently, because the hardware is indeed faulty. It used to host two OSDs, which were marked down and out in "ceph osd dump". So, starting from HEALTH_OK, I ran the following commands:

# ceph auth del osd.20
# ceph auth del osd.21
# ceph osd rm osd.20
# ceph osd rm osd.21

After that, Ceph started to rebalance itself again, but now it reports some PGs as "stuck unclean", and there is no recovery I/O visible in "ceph -s":

# ceph -s
    cluster 3065224c-ea2e-4558-8a81-8f935dde56e5
     health HEALTH_WARN
            350 pgs stuck unclean
            recovery 26/1596390 objects degraded (0.002%)
            recovery 58772/1596390 objects misplaced (3.682%)
     monmap e16: 3 mons at {...}
            election epoch 584, quorum 0,1,2 ...
     osdmap e61435: 58 osds: 58 up, 58 in; 350 remapped pgs
            flags require_jewel_osds
      pgmap v35959908: 3776 pgs, 6 pools, 2051 GB data, 519 kobjects
            6244 GB used, 40569 GB / 46814 GB avail
            26/1596390 objects degraded (0.002%)
            58772/1596390 objects misplaced (3.682%)
                3426 active+clean
                 349 active+remapped
                   1 active
  client io 5818 B/s rd, 8457 kB/s wr, 0 op/s rd, 71 op/s wr

# ceph health detail
HEALTH_WARN 350 pgs stuck unclean; recovery 26/1596390 objects degraded (0.002%); recovery 58772/1596390 objects misplaced (3.682%)
pg 28.fa is stuck unclean for 14408925.966824, current state active+remapped, last acting [38,52,4]
pg 28.e7 is stuck unclean for 14408925.966886, current state active+remapped, last acting [29,42,22]
pg 23.dc is stuck unclean for 61698.641750, current state active+remapped, last acting [50,33,23]
pg 23.d9 is stuck unclean for 61223.093284, current state active+remapped, last acting [54,31,23]
pg 28.df is stuck unclean for 14408925.967120, current state active+remapped, last acting [33,7,15]
pg 34.38 is stuck unclean for 60904.322881, current state active+remapped, last acting [18,41,9]
pg 34.fe is stuck unclean for 60904.241762, current state active+remapped, last acting [58,1,44]
[...]
pg 28.8f is stuck unclean for 66102.059671, current state active, last acting [8,40,5]
[...]
recovery 26/1596390 objects degraded (0.002%)
recovery 58772/1596390 objects misplaced (3.682%)

Apart from that, the data stored in the Ceph pools seems to be reachable and usable as before.

The nodes run CentOS 7 and Ceph 10.2.5 (RPMs downloaded from the Ceph repository).

What other debugging info should I provide, or what can I do to unstick the stuck PGs?

Thanks!

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| http://www.fi.muni.cz/~kas/                        GPG: 4096R/A45477D5  |
> That's why this kind of vulnerability is a concern: deploying stuff is  <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server.                         --pboddie at LWN  <
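
P.S. To double-check my own procedure: below is a minimal sketch of the manual OSD removal sequence as I understand it from the Ceph docs, using osd.20 as the example. Note that the "ceph osd crush remove" step is not among the commands I actually ran above.

# ceph osd crush remove osd.20   # remove the OSD from the CRUSH map
# ceph auth del osd.20           # delete its cephx key
# ceph osd rm osd.20             # remove the OSD id from the osdmap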
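
Also, in case it is useful, I can attach the output of the following commands (28.fa below is just the first stuck PG from the list above, taken as an example):

# ceph osd tree                  # current CRUSH hierarchy and OSD up/down state
# ceph osd df                    # per-OSD utilization and weights
# ceph pg dump_stuck unclean     # full list of stuck PGs
# ceph pg 28.fa query            # detailed state of one remapped PG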