Re: pgs stuck unclean after removing OSDs

I've been using this procedure to remove OSDs...

OSD_ID=                                       # ID of the OSD being removed
ceph auth del osd.${OSD_ID}                   # delete the OSD's cephx key
ceph osd down ${OSD_ID}                       # mark the OSD down
ceph osd out ${OSD_ID}                        # mark the OSD out
ceph osd rm ${OSD_ID}                         # remove the OSD from the osdmap
ceph osd crush remove osd.${OSD_ID}           # remove the OSD from the crush map
systemctl disable ceph-osd@${OSD_ID}.service  # don't start the daemon on boot
systemctl stop ceph-osd@${OSD_ID}.service     # stop the daemon
sed -i "/ceph-${OSD_ID}/d" /etc/fstab         # drop the OSD's fstab entry
umount /var/lib/ceph/osd/ceph-${OSD_ID}       # unmount the OSD's data directory

Would you say this is the correct order of events?

Thanks!


On Wed, Jun 28, 2017 at 9:34 AM, David Turner <drakonstein@xxxxxxxxx> wrote:
A couple of things. You didn't `ceph osd crush remove osd.21` after doing the other bits. You will also want to remove the bucket (i.e. the host) from the crush map, as it will then be empty. Right now you have a host in the crush map with a weight, but no OSDs to put that data on. It has a weight because of the 2 OSDs still in it that were removed from the cluster but not from the crush map. That confuses your cluster.

If you had removed the OSDs from the crush map when you ran the other commands, then the dead host would have still been in the crush map but with a weight of 0 and wouldn't cause any problems.
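Concretely, the cleanup would presumably look something like the following (the bucket name is just a placeholder here; it is whatever the dead node is called in `ceph osd tree`):

ceph osd crush remove osd.20           # drop the already-removed OSDs from the crush map
ceph osd crush remove osd.21
ceph osd crush remove <dead-hostname>  # remove the now-empty host bucket

Once those stale entries are gone, CRUSH should stop accounting for the dead host and the remapped PGs should be able to settle.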

On Wed, Jun 28, 2017 at 4:15 AM Jan Kasprzak <kas@xxxxxxxxxx> wrote:
        Hello,

TL;DR: what to do when my cluster reports stuck unclean pgs?

Detailed description:

One of the nodes in my cluster died. CEPH correctly rebalanced itself,
and reached the HEALTH_OK state. I looked at the failed server and decided
to take it out of the cluster permanently, because the hardware
is indeed faulty. It used to host two OSDs, which were marked down and out
in "ceph osd dump".

So from the HEALTH_OK I ran the following commands:

# ceph auth del osd.20
# ceph auth del osd.21
# ceph osd rm osd.20
# ceph osd rm osd.21

After that, CEPH started to rebalance itself, but now it reports some PGs
as "stuck unclean", and there is no "recovery I/O" visible in "ceph -s":

# ceph -s
    cluster 3065224c-ea2e-4558-8a81-8f935dde56e5
     health HEALTH_WARN
            350 pgs stuck unclean
            recovery 26/1596390 objects degraded (0.002%)
            recovery 58772/1596390 objects misplaced (3.682%)
     monmap e16: 3 mons at {...}
            election epoch 584, quorum 0,1,2 ...
     osdmap e61435: 58 osds: 58 up, 58 in; 350 remapped pgs
            flags require_jewel_osds
      pgmap v35959908: 3776 pgs, 6 pools, 2051 GB data, 519 kobjects
            6244 GB used, 40569 GB / 46814 GB avail
            26/1596390 objects degraded (0.002%)
            58772/1596390 objects misplaced (3.682%)
                3426 active+clean
                 349 active+remapped
                   1 active
  client io 5818 B/s rd, 8457 kB/s wr, 0 op/s rd, 71 op/s wr

# ceph health detail
HEALTH_WARN 350 pgs stuck unclean; recovery 26/1596390 objects degraded (0.002%); recovery 58772/1596390 objects misplaced (3.682%)
pg 28.fa is stuck unclean for 14408925.966824, current state active+remapped, last acting [38,52,4]
pg 28.e7 is stuck unclean for 14408925.966886, current state active+remapped, last acting [29,42,22]
pg 23.dc is stuck unclean for 61698.641750, current state active+remapped, last acting [50,33,23]
pg 23.d9 is stuck unclean for 61223.093284, current state active+remapped, last acting [54,31,23]
pg 28.df is stuck unclean for 14408925.967120, current state active+remapped, last acting [33,7,15]
pg 34.38 is stuck unclean for 60904.322881, current state active+remapped, last acting [18,41,9]
pg 34.fe is stuck unclean for 60904.241762, current state active+remapped, last acting [58,1,44]
[...]
pg 28.8f is stuck unclean for 66102.059671, current state active, last acting [8,40,5]
[...]
recovery 26/1596390 objects degraded (0.002%)
recovery 58772/1596390 objects misplaced (3.682%)

Apart from that, the data stored in CEPH pools seems to be reachable
and usable as before.

The nodes run CentOS 7 and ceph 10.2.5 (RPMs downloaded from the CEPH repository).

What other debugging info should I provide, and what should I do to get
the stuck pgs unstuck? Thanks!

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| http://www.fi.muni.cz/~kas/                         GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is  <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server.                          --pboddie at LWN <