stuck unclean since forever

I've been struggling with a broken Ceph node, and I have very limited Ceph knowledge. With only 3-4 days of actually using it, I was tasked with upgrading it. Everything seemed to go fine at first, but it didn't last.

The next day I was informed that people were unable to create volumes (we had successfully created a volume immediately after the upgrade, but we could no longer do so). After some investigation, I discovered that 'rados -p volumes ls' just hangs. Another pool (images) behaves the same way; the rest don't seem to have any issues.

We are running 6 Ceph servers with 72 OSDs. Here is what 'ceph -s' currently reports:

root@CTR01:~# ceph -s
    cluster c14740db-4771-4f95-8268-689bba5598eb
     health HEALTH_WARN
            1538 pgs stale
            282 pgs stuck inactive
            1538 pgs stuck stale
            282 pgs stuck unclean
            too many PGs per OSD (747 > max 300)
            election epoch 3066, quorum 0,1,2 Ceph02,Ceph04,Ceph06
     osdmap e1325: 72 osds: 72 up, 72 in
      pgmap v2515322: 18232 pgs, 19 pools, 1042 GB data, 437 kobjects
            3143 GB used, 127 TB / 130 TB avail
               16412 active+clean
                1538 stale+active+clean
                 282 creating
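As a side note on the 'too many PGs per OSD (747 > max 300)' warning: my understanding is that the figure is roughly total PG replicas divided by OSD count. A back-of-the-envelope check, assuming replica size 3 on every pool (an assumption on my part; actual pool sizes vary, which would explain the gap from 747):

```python
# Rough estimate of PGs per OSD, assuming all 19 pools use replica size 3
pgs, replicas, osds = 18232, 3, 72
print(round(pgs * replicas / osds, 1))  # prints 759.7, in the ballpark of the reported 747
```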

Some notes:

1538 stale+active+clean -
Most of these (around 1250-1350) were left over from the initial installation and weren't actually being used by the system. I inherited the cluster with them and was told nobody knew how to get rid of them; apparently they were part of a Ceph false start.

282 creating -
While I was looking at the issue, I noticed a 'ceph -s' warning about another pool (one we use for Swift). It complained about too few PGs, so I increased pg_num and pgp_num from 1024 to 2048, hoping the two problems were related. I think that's what added the 'creating' status line (also, the stuck PGs are all in 19.xx - is that osd.19?).
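(For what it's worth, if I understand the naming correctly, PG IDs have the form '<pool-id>.<seed-in-hex>', so 19.xx would refer to pool 19 - whichever pool 'ceph osd lspools' shows with ID 19 - rather than to osd.19. A tiny sketch of that split:)

```python
def parse_pg_id(pg_id: str):
    """Split a Ceph PG id of the form '<pool-id>.<seed-hex>' into its parts."""
    pool, seed = pg_id.split(".")
    return int(pool), int(seed, 16)

print(parse_pg_id("19.5b1"))  # (19, 1457): pool 19, PG seed 0x5b1
```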


root@MUC1-Tab-CTR01:~# ceph health detail | grep unclean
HEALTH_WARN 1538 pgs stale; 282 pgs stuck inactive; 1538 pgs stuck stale; 282 pgs stuck unclean; too many PGs per OSD (747 > max 300)
pg 19.5b1 is stuck unclean since forever, current state creating, last acting []
pg 19.c5 is stuck unclean since forever, current state creating, last acting []
pg 19.c6 is stuck unclean since forever, current state creating, last acting []
pg 19.c0 is stuck unclean since forever, current state creating, last acting []
pg 19.c2 is stuck unclean since forever, current state creating, last acting []
pg 19.726 is stuck unclean since forever, current state creating, last acting []
pg 19.727 is stuck unclean since forever, current state creating, last acting []
pg 19.412 is stuck unclean since forever, current state creating, last acting []
.
.
.
pg 19.26c is stuck unclean since forever, current state creating, last acting []
pg 19.5be is stuck unclean since forever, current state creating, last acting []
pg 19.264 is stuck unclean since forever, current state creating, last acting []
pg 19.5b4 is stuck unclean since forever, current state creating, last acting []
pg 19.260 is stuck unclean since forever, current state creating, last acting []

Looking at the osd.19 logs, I see the same messages that appear in osd.20's log:

root@Ceph02:~# tail -10 /var/log/ceph/ceph-osd.19.log
2016-11-12 02:18:36.047803 7f973fe58700  0 -- 192.168.92.12:6818/4289 >> 192.168.92.195:6814/4099 pipe(0xc057000 sd=87 :57536 s=2 pgs=1039 cs=21 l=0 c=0xa3a34a0).fault with nothing to send, going to standby
2016-11-12 02:22:49.242045 7f974045e700  0 -- 192.168.92.12:6818/4289 >> 192.168.92.193:6812/4067 pipe(0xa402000 sd=25 :48529 s=2 pgs=986 cs=21 l=0 c=0xa3a5b20).fault with nothing to send, going to standby
2016-11-12 02:22:49.244093 7f973e741700  0 -- 192.168.92.12:6818/4289 >> 192.168.92.196:6810/4118 pipe(0xba4e000 sd=51 :50137 s=2 pgs=933 cs=35 l=0 c=0xb7af760).fault with nothing to send, going to standby
2016-11-12 02:25:20.699763 7f97383e5700  0 -- 192.168.92.12:6818/4289 >> 192.168.92.194:6806/4108 pipe(0xba76000 sd=134 :6818 s=2 pgs=972 cs=21 l=0 c=0xb7afb80).fault with nothing to send, going to standby
2016-11-12 02:28:02.526393 7f9720669700  0 -- 192.168.92.12:6818/4289 >> 192.168.92.193:6806/3964 pipe(0xbb54000 sd=210 :6818 s=0 pgs=0 cs=0 l=0 c=0xc5bc840).accept connect_seq 41 vs existing 41 state standby
2016-11-12 02:28:02.526750 7f9720669700  0 -- 192.168.92.12:6818/4289 >> 192.168.92.193:6806/3964 pipe(0xbb54000 sd=210 :6818 s=0 pgs=0 cs=0 l=0 c=0xc5bc840).accept connect_seq 42 vs existing 41 state standby
2016-11-12 02:33:40.838728 7f973d933700  0 -- 192.168.92.12:6818/4289 >> 192.168.92.193:6822/4147 pipe(0xbbae000 sd=92 :6818 s=0 pgs=0 cs=0 l=0 c=0x5a939c0).accept connect_seq 27 vs existing 27 state standby
2016-11-12 02:33:40.839052 7f973d933700  0 -- 192.168.92.12:6818/4289 >> 192.168.92.193:6822/4147 pipe(0xbbae000 sd=92 :6818 s=0 pgs=0 cs=0 l=0 c=0x5a939c0).accept connect_seq 28 vs existing 27 state standby
2016-11-12 02:34:00.187408 7f9719706700  0 -- 192.168.92.12:6818/4289 >> 192.168.92.193:6818/4140 pipe(0xc052000 sd=65 :6818 s=0 pgs=0 cs=0 l=0 c=0x5a91760).accept connect_seq 31 vs existing 31 state standby
2016-11-12 02:34:00.187686 7f9719706700  0 -- 192.168.92.12:6818/4289 >> 192.168.92.193:6818/4140 pipe(0xc052000 sd=65 :6818 s=0 pgs=0 cs=0 l=0 c=0x5a91760).accept connect_seq 32 vs existing 31 state standby

At this point I'm stuck. I have no idea what to do to fix the 'volumes' pool. Does anybody have any suggestions?

-- Joel

JOEL GRIFFITHS
LINUX SYSTEMS ENGINEER

UNITAS GLOBAL 
M +1 480.717 5635 
JOEL.GRIFFITHS@UNITASGLOBAL.COM

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
