PGs stuck unclean "active+remapped" after an osd marked out

Hi,

I had a Ceph cluster in the "HEALTH_OK" state running Firefly 0.80.9. I
just wanted to remove an OSD (one that was working fine). So after:

    ceph osd out 3

I waited for the rebalancing, but I ended up with PGs stuck unclean:

---------------------------------------------------------------
~# ceph -s
    cluster e865b3d0-535a-4f18-9883-2793079d400b
     health HEALTH_WARN 180 pgs stuck unclean; recovery -3/29968 objects degraded (-0.010%)
     monmap e3: 3 mons at {0=10.0.2.150:6789/0,1=10.0.2.151:6789/0,2=10.0.2.152:6789/0}, election epoch 224, quorum 0,1,2 0,1,2
     mdsmap e109: 1/1/1 up {0=1=up:active}, 1 up:standby
     osdmap e979: 21 osds: 21 up, 20 in
      pgmap v465311: 8704 pgs, 14 pools, 61438 MB data, 14984 objects
            160 GB used, 4886 GB / 5046 GB avail
            -3/29968 objects degraded (-0.010%)
                 180 active+remapped
                8524 active+clean
  client io 6862 B/s wr, 2 op/s
---------------------------------------------------------------

The cluster stayed in this state.
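
In case it is useful, here is how I have been inspecting the stuck PGs
(the <pgid> in the last command is just a placeholder for one of the PG
ids listed by the second command):

    ceph health detail
    ceph pg dump_stuck unclean
    ceph pg <pgid> query

As far as I understand, a PG is "remapped" when its acting set differs
from its up set, and both sets are visible in the query output.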

I have read the documentation, which says in this case "You may need to
review settings in the Pool, PG and CRUSH Config Reference and make
appropriate adjustments". But I don't see where the mistake in my
configuration is (I give my detailed configuration below). I don't know
if it's important, but I upgraded my cluster from 0.80.8 to 0.80.9 today
(before my attempt to remove osd.3), running the following on each node
of the cluster:

    apt-get update && apt-get upgrade
    restart ceph-mon-all 
    restart ceph-osd-all 
    restart ceph-mds-all

and then:

    ceph osd crush set-tunable straw_calc_version 1
    ceph osd crush reweight-all

I had no problem with this upgrade; the rebalancing was very fast (my
cluster contains little data).
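
For what it's worth, I believe the upgrade and the tunables change can be
checked with something like the following (I am not sure these are the
canonical commands, but they seem to report the right information):

    ceph tell osd.* version
    ceph osd crush show-tunables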

So my cluster was stuck in the state described above. I could get back
to HEALTH_OK with just:

    ceph osd in 3

But I really would like to remove this OSD. Every time I try "ceph osd
out 3", it reproduces the issue described above. I really think my
configuration is OK, so I have no idea how to solve this problem.
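
For reference, the full removal procedure I intend to follow, once the
data has moved off osd.3, is the usual one. The initial "crush reweight
... 0" step is only an idea I have seen suggested to drain an OSD before
marking it out; I have not verified that it avoids the remapped state:

    ceph osd crush reweight osd.3 0   # drain it first, then wait for HEALTH_OK
    ceph osd out 3
    stop ceph-osd id=3                # on silo-1 (upstart on Ubuntu 14.04)
    ceph osd crush remove osd.3
    ceph auth del osd.3
    ceph osd rm 3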

Thanks in advance for your help.
Regards

François Lafont

PS: here is my configuration.

I have 3 nodes running Ubuntu 14.04, kernel 3.16.0-31-generic and Ceph
version 0.80.9:
- node1 -> just a monitor
- node2 -> a monitor, some OSD daemons and an MDS
- node3 -> a monitor, some OSD daemons and an MDS

Each pool has "size == 2" and "min_size == 1":

---------------------------------------------------------------
~# ceph osd dump | grep -oE '^pool.*size [0-9]+' | column -t
pool  0   'data'                replicated  size  2  min_size  1
pool  1   'metadata'            replicated  size  2  min_size  1
pool  2   'rbd'                 replicated  size  2  min_size  1
pool  3   'volumes'             replicated  size  2  min_size  1
pool  4   'images'              replicated  size  2  min_size  1
pool  5   '.rgw.root'           replicated  size  2  min_size  1
pool  6   '.rgw.control'        replicated  size  2  min_size  1
pool  7   '.rgw'                replicated  size  2  min_size  1
pool  8   '.rgw.gc'             replicated  size  2  min_size  1
pool  9   '.users.uid'          replicated  size  2  min_size  1
pool  10  '.users.email'        replicated  size  2  min_size  1
pool  11  '.users'              replicated  size  2  min_size  1
pool  12  '.rgw.buckets.index'  replicated  size  2  min_size  1
pool  13  '.rgw.buckets'        replicated  size  2  min_size  1
---------------------------------------------------------------
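
As far as I can see, all the pools use the only replicated ruleset present
in the crush map below (ruleset 0). If it matters, this can be checked per
pool with, for example ("rbd" here is just an example pool name):

    ceph osd pool get rbd crush_ruleset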

Here is my ceph.conf:

---------------------------------------------------------------
~# cat /etc/ceph/ceph.conf 
### This file is managed by Puppet, don't edit it. ###

[global]
  auth client required      = cephx
  auth cluster required     = cephx
  auth service required     = cephx
  cluster network           = 192.168.22.0/24
  filestore xattr use omap  = true
  fsid                      = xxxxxxxxxxxxxxxxxxxxxxxxxxxx
  mds cache size            = 1000000
  osd crush chooseleaf type = 1
  osd journal size          = 2048
  osd max backfills         = 2
  osd pool default min size = 1
  osd pool default pg num   = 512
  osd pool default pgp num  = 512
  osd pool default size     = 2
  osd recovery max active   = 2
  public network            = 10.0.2.0/24

[mon.0]
  host     = monitor-a
  mon addr = 10.0.2.150

[mon.1]
  host     = silo-1
  mon addr = 10.0.2.151

[mon.2]
  host     = silo-2
  mon addr = 10.0.2.152

[client.radosgw.gw1]
  host            = ostore-1
  rgw dns name    = ostore
  rgw socket path = /var/run/ceph/ceph.radosgw.gw1.fastcgi.sock
  keyring         = /etc/ceph/ceph.client.radosgw.gw1.keyring
  log file        = /var/log/radosgw/client.radosgw.gw1.log

[client.radosgw.gw2]
  host            = ostore-2
  rgw dns name    = ostore
  rgw socket path = /var/run/ceph/ceph.radosgw.gw2.fastcgi.sock
  keyring         = /etc/ceph/ceph.client.radosgw.gw2.keyring
  log file        = /var/log/radosgw/client.radosgw.gw2.log
---------------------------------------------------------------

Here is my crush map:

---------------------------------------------------------------
~# cat /tmp/crushmap.txt 
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host silo-2 {
	id -2		# do not change unnecessarily
	# weight 8.800
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 0.400
	item osd.2 weight 0.400
	item osd.4 weight 1.000
	item osd.6 weight 1.000
	item osd.8 weight 1.000
	item osd.10 weight 1.000
	item osd.12 weight 1.000
	item osd.14 weight 1.000
	item osd.16 weight 1.000
	item osd.18 weight 1.000
}
host silo-1 {
	id -3		# do not change unnecessarily
	# weight 9.800
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 0.400
	item osd.3 weight 0.400
	item osd.5 weight 1.000
	item osd.7 weight 1.000
	item osd.9 weight 1.000
	item osd.11 weight 1.000
	item osd.13 weight 1.000
	item osd.15 weight 1.000
	item osd.17 weight 1.000
	item osd.19 weight 1.000
	item osd.20 weight 1.000
}
root default {
	id -1		# do not change unnecessarily
	# weight 18.600
	alg straw
	hash 0	# rjenkins1
	item silo-2 weight 8.800
	item silo-1 weight 9.800
}

# rules
rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}

# end crush map
---------------------------------------------------------------
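
(I extracted this text form of the crush map with the usual
getcrushmap/crushtool steps. I suppose crushtool's test mode could also
check whether the rule can still find two hosts for every PG, but I have
not tried it:)

    ceph osd getcrushmap -o /tmp/crushmap.bin
    crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
    crushtool -i /tmp/crushmap.bin --test --rule 0 --num-rep 2 --show-bad-mappings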

I have no disk space problem:

---------------------------------------------------------------
~# ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED 
    5172G     5009G         163G          3.16 
POOLS:
    NAME                   ID     USED       %USED     MAX AVAIL     OBJECTS 
    data                   0        512M         0         2468G         128 
    metadata               1        122M         0         2468G          51 
    rbd                    2           8         0         2468G           1 
    volumes                3      51457M      0.97         2468G       13237 
    images                 4       9345M      0.18         2468G        1186 
    .rgw.root              5         840         0         2468G           3 
    .rgw.control           6           0         0         2468G           8 
    .rgw                   7        2352         0         2468G          13 
    .rgw.gc                8           0         0         2468G          32 
    .users.uid             9         706         0         2468G           4 
    .users.email           10         18         0         2468G           2 
    .users                 11         18         0         2468G           2 
    .rgw.buckets.index     12          0         0         2468G           9 
    .rgw.buckets           13       802k         0         2468G         308 
---------------------------------------------------------------