Hello,

I'm a newbie to Ceph, gaining some familiarity by hosting a few virtual machines on a test cluster. I'm using a virtualisation product called Proxmox Virtual Environment, which conveniently handles cluster setup, pool setup, OSD creation, etc. During an attempted removal of an OSD, my pool appeared to stop serving IO to virtual machines, and I'm wondering whether I did something wrong or whether there's more to the process of removing an OSD.

The Ceph cluster is small: 9 OSDs in total across 3 nodes. There's a pool called 'vmpool' with size=3 and min_size=1. It's a bit slow, but I see plenty of information on how to troubleshoot that, and I understand I should separate cluster communication onto its own network segment to improve performance (my understanding of that is in a P.P.S. below).

The Ceph version is Firefly (0.80.7).

So, the issue was: I marked osd.0 as down & out (or possibly out & down, if order matters), and virtual machines hung. Almost immediately, 78 pgs were 'stuck inactive', and after some activity overnight, they remained that way:

    cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
     health HEALTH_WARN 290 pgs degraded; 78 pgs stuck inactive; 496 pgs stuck unclean; 4 requests are blocked > 32 sec; recovery 69696/685356 objects degraded (10.169%)
     monmap e3: 3 mons at {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0}, election epoch 50, quorum 0,1,2 0,1,2
     osdmap e669: 9 osds: 8 up, 8 in
      pgmap v100175: 1216 pgs, 4 pools, 888 GB data, 223 kobjects
            2408 GB used, 7327 GB / 9736 GB avail
            69696/685356 objects degraded (10.169%)
                  78 inactive
                 720 active+clean
                 290 active+degraded
                 128 active+remapped

I started the OSD to bring it back 'up'. It was still 'out':

    cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
     health HEALTH_WARN 59 pgs degraded; 496 pgs stuck unclean; recovery 30513/688554 objects degraded (4.431%)
     monmap e3: 3 mons at {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0}, election epoch 50, quorum 0,1,2 0,1,2
     osdmap e671: 9 osds: 9 up, 8 in
      pgmap v103181: 1216 pgs, 4 pools, 892 GB data, 224 kobjects
            2408 GB used, 7327 GB / 9736 GB avail
            30513/688554 objects degraded (4.431%)
                 720 active+clean
                  59 active+degraded
                 437 active+remapped
      client io 2303 kB/s rd, 153 kB/s wr, 85 op/s

The inactive pgs had disappeared. I stopped the OSD again, making it 'down' and 'out' as it was previously. At this point I started my virtual machines again, and they functioned correctly:

    cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
     health HEALTH_WARN 368 pgs degraded; 496 pgs stuck unclean; recovery 83332/688554 objects degraded (12.102%)
     monmap e3: 3 mons at {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0}, election epoch 50, quorum 0,1,2 0,1,2
     osdmap e673: 9 osds: 8 up, 8 in
      pgmap v103248: 1216 pgs, 4 pools, 892 GB data, 224 kobjects
            2408 GB used, 7327 GB / 9736 GB avail
            83332/688554 objects degraded (12.102%)
                 720 active+clean
                 368 active+degraded
                 128 active+remapped
      client io 19845 B/s wr, 6 op/s

Removing the OSD then succeeded, without any IO hanging.

--------

Have I tried to remove an OSD in an incorrect manner? I'm also wondering what would happen in a legitimate failure scenario: what if a disk failure were followed by a host failure?

Apologies if this is something that's been observed already; I've seen mentions of the same symptom, but seemingly for causes other than OSD removal.

Thank you in advance,
Chris
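
P.S. Having re-read the documentation, the sequence I now believe is intended — mark the OSD out first, wait for the data to rebalance, and only then stop and remove it — looks roughly like the below. This is only my reconstruction from the docs, not a verified procedure, and the daemon-stop invocation may differ between distributions:

    # Mark the OSD out while its daemon is still up, so it can
    # participate in recovery while its data migrates elsewhere:
    ceph osd out 0

    # Wait until all pgs report active+clean again
    # (progress can be watched with 'ceph -w').

    # Only then stop the daemon; on a Debian-based Proxmox node,
    # something like:
    service ceph stop osd.0

    # Finally remove it from the CRUSH map, delete its auth key,
    # and remove the OSD entry itself:
    ceph osd crush remove osd.0
    ceph auth del osd.0
    ceph osd rm 0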
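
P.P.S. On the performance aside: my understanding is that the public/cluster traffic split is configured in ceph.conf along these lines. The 10.10.10.0/24 segment here is made up for illustration; the public network matches my existing monitor addresses:

    [global]
        # front-side network used by clients and monitors
        # (matches my 192.168.12.x addresses)
        public network = 192.168.12.0/24
        # hypothetical dedicated segment for OSD replication
        # and heartbeat traffic
        cluster network = 10.10.10.0/24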