Hello,

TL;DR: what should I do when my cluster reports stuck unclean PGs?

Detailed description:

One of the nodes in my cluster died. Ceph correctly rebalanced itself and reached the HEALTH_OK state. I looked at the failed server and decided to take it out of the cluster permanently, because the hardware is indeed faulty. It used to host two OSDs, which were marked down and out in "ceph osd dump". So, starting from HEALTH_OK, I ran the following commands:

# ceph auth del osd.20
# ceph auth del osd.21
# ceph osd rm osd.20
# ceph osd rm osd.21

After that, Ceph started to rebalance itself again, but now it reports some PGs as "stuck unclean", and there is no recovery I/O visible in "ceph -s":

# ceph -s
    cluster 3065224c-ea2e-4558-8a81-8f935dde56e5
     health HEALTH_WARN
            350 pgs stuck unclean
            recovery 26/1596390 objects degraded (0.002%)
            recovery 58772/1596390 objects misplaced (3.682%)
     monmap e16: 3 mons at {...}
            election epoch 584, quorum 0,1,2 ...
     osdmap e61435: 58 osds: 58 up, 58 in; 350 remapped pgs
            flags require_jewel_osds
      pgmap v35959908: 3776 pgs, 6 pools, 2051 GB data, 519 kobjects
            6244 GB used, 40569 GB / 46814 GB avail
            26/1596390 objects degraded (0.002%)
            58772/1596390 objects misplaced (3.682%)
                3426 active+clean
                 349 active+remapped
                   1 active
  client io 5818 B/s rd, 8457 kB/s wr, 0 op/s rd, 71 op/s wr

# ceph health detail
HEALTH_WARN 350 pgs stuck unclean; recovery 26/1596390 objects degraded (0.002%); recovery 58772/1596390 objects misplaced (3.682%)
pg 28.fa is stuck unclean for 14408925.966824, current state active+remapped, last acting [38,52,4]
pg 28.e7 is stuck unclean for 14408925.966886, current state active+remapped, last acting [29,42,22]
pg 23.dc is stuck unclean for 61698.641750, current state active+remapped, last acting [50,33,23]
pg 23.d9 is stuck unclean for 61223.093284, current state active+remapped, last acting [54,31,23]
pg 28.df is stuck unclean for 14408925.967120, current state active+remapped, last acting [33,7,15]
pg 34.38 is stuck unclean for 60904.322881, current state active+remapped, last acting [18,41,9]
pg 34.fe is stuck unclean for 60904.241762, current state active+remapped, last acting [58,1,44]
[...]
pg 28.8f is stuck unclean for 66102.059671, current state active, last acting [8,40,5]
[...]
recovery 26/1596390 objects degraded (0.002%)
recovery 58772/1596390 objects misplaced (3.682%)

Apart from that, the data stored in the Ceph pools seems to be reachable and usable as before.

The nodes run CentOS 7 and Ceph 10.2.5 (RPMs downloaded from the Ceph repository).

What other debugging info should I provide, or what can I do to unstick the stuck PGs?

Thanks!

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| http://www.fi.muni.cz/~kas/                        GPG: 4096R/A45477D5  |
> That's why this kind of vulnerability is a concern: deploying stuff is  <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server.                         --pboddie at LWN  <
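
P.S. To double-check my own procedure: below is a minimal sketch of the manual OSD removal sequence as I understand it from the Ceph docs, using osd.20 as the example. Note that the "ceph osd crush remove" step is not among the commands I actually ran above.

# ceph osd crush remove osd.20   # remove the OSD from the CRUSH map
# ceph auth del osd.20           # delete its cephx key
# ceph osd rm osd.20             # remove the OSD id from the osdmap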
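
Also, in case it is useful, I can attach the output of the following commands (28.fa below is just the first stuck PG from the list above, taken as an example):

# ceph osd tree                  # current CRUSH hierarchy and OSD up/down state
# ceph osd df                    # per-OSD utilization and weights
# ceph pg dump_stuck unclean     # full list of stuck PGs
# ceph pg 28.fa query            # detailed state of one remapped PG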