Re: Placement groups stuck inactive after down & out of 1/9 OSDs


Thank you for your assistance, Craig. At the time I hadn't noted the placement group details, but I'll know to do that if I get inactive placement groups again. I'm still getting familiar with the cluster, which now has 15 OSDs across five hosts: a mix of good and bad drives, XFS and BTRFS, and with and without SSD journals, so I can start to understand what sort of difference each option makes.

 

Thanks again.

 

From: Craig Lewis [mailto:clewis@xxxxxxxxxxxxxxxxxx]
Sent: 19 December 2014 23:22
To: Chris Murray
Cc: ceph-users
Subject: Re: [ceph-users] Placement groups stuck inactive after down & out of 1/9 OSDs

 

With only one OSD down and size = 3, you shouldn't have had any PGs go inactive.  At worst, they should have been active+degraded.

 

The only thought I have is that some of your PGs aren't mapping to the correct number of OSDs.  That isn't supposed to happen unless your CRUSH rules have been misconfigured.
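If you want to rule that out, the rules are easy to inspect; the file paths below are just examples:

  ceph osd crush rule dump
  ceph osd getcrushmap -o /tmp/crushmap
  crushtool -d /tmp/crushmap -o /tmp/crushmap.txt

The decompiled map should show each replicated rule choosing leaves of type host, so that the three copies land on three different hosts.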

 

You might go through ceph pg dump and verify that every PG has 3 OSDs in the up and acting columns, and that there are no duplicate OSDs in those lists.

 

With your 1216 PGs, it might be faster to write a script to parse the JSON than to do it manually.  If you happen to remember some PGs that were inactive or degraded, you could spot check those.
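
A rough sketch of such a check, assuming the Firefly-era JSON layout (a top-level "pg_stats" array where each entry carries "pgid", "up" and "acting"; adjust the keys if your release formats the dump differently):

#!/usr/bin/env python
# Rough check: every PG should map to 3 distinct OSDs in both the
# 'up' and 'acting' lists.
import json
import subprocess

dump = json.loads(subprocess.check_output(
    ["ceph", "pg", "dump", "--format", "json"]).decode())

for pg in dump["pg_stats"]:
    for field in ("up", "acting"):
        osds = pg[field]
        if len(osds) != 3 or len(set(osds)) != len(osds):
            print("%s: bad %s mapping %s" % (pg["pgid"], field, osds))

Anything it prints is a PG worth a ceph pg query.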

On Fri, Dec 19, 2014 at 11:45 AM, Chris Murray <chrismurray84@xxxxxxxxx> wrote:

Interesting indeed; those tunables were suggested on the pve-user mailing list too, and they certainly sound like they'll ease the pressure during recovery. What I might not have explained very well, though, is that the VMs hung indefinitely, well past the end of the recovery process, rather than just being slow; almost as if the 78 stuck-inactive placement groups contained data critical to VM operation. Looking at IO and performance in the cluster is certainly on the to-do list, along with scaling out the nodes and moving journals to SSD, but of course that needs some investment and I'd like to prove things first. It's a bit catch-22 :-)

To my knowledge, the cluster was HEALTH_OK before and it is HEALTH_OK now, BUT ... I haven't followed my usual advice of stopping and thinking about things before trying something else, so I suppose marking the OSD 'up' this morning (which turned those 78 into other ACTIVE+* states) has spoiled the chance of troubleshooting. I've been messing around with osd.0 since then too, and the health is now:

    cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
     health HEALTH_OK
     monmap e3: 3 mons at {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0}, election epoch 58, quorum 0,1,2 0,1,2
     osdmap e1205: 9 osds: 9 up, 9 in
      pgmap v120175: 1216 pgs, 4 pools, 892 GB data, 224 kobjects
            2679 GB used, 9790 GB / 12525 GB avail
                1216 active+clean

If it helps at all, the other details are as follows. There's nothing from 'dump stuck' now, although I expect there would have been this morning.

root@ceph25:~# ceph osd tree
# id    weight  type name       up/down reweight
-1      12.22   root default
-2      4.3             host ceph25
3       0.9                     osd.3   up      1
6       0.68                    osd.6   up      1
0       2.72                    osd.0   up      1
-3      4.07            host ceph26
1       2.72                    osd.1   up      1
4       0.9                     osd.4   up      1
7       0.45                    osd.7   up      1
-4      3.85            host ceph27
2       2.72                    osd.2   up      1
5       0.68                    osd.5   up      1
8       0.45                    osd.8   up      1
root@ceph25:~# ceph osd dump | grep ^pool
pool 0 'data' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 3 'vmpool' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 187 flags hashpspool stripe_width 0
root@ceph25:~# ceph pg dump_stuck
ok


The more I think about this problem, the less I think there'll be an easy answer; it's more likely that I'll have to reproduce the scenario and actually pause next time so I can troubleshoot it properly.

From: Craig Lewis [mailto:clewis@xxxxxxxxxxxxxxxxxx]
Sent: 19 December 2014 19:17
To: Chris Murray
Cc: ceph-users
Subject: Re: [ceph-users] Placement groups stuck inactive after down & out of 1/9 OSDs


That seems odd.  So you have 3 nodes, with 3 OSDs each.  You should've been able to mark osd.0 down and out, then stop the daemon without having those issues.

It's generally best to mark an osd down, then out, and wait until the cluster has recovered completely before stopping the daemon and removing it from the cluster.  That guarantees that you always have 3+ copies of the data.

Disks don't always fail gracefully though.  If you have a sudden and complete failure, you can't do it the nice way.  At that point, just mark the osd down and out.  If your cluster was healthy before this event, you shouldn't have any data problems.  If the cluster wasn't HEALTH_OK before the event, you will likely have some problems.
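
For reference, the graceful removal of osd.0 looks roughly like this (Proxmox may wrap some of these steps for you, and the exact service command depends on your init setup):

  ceph osd out 0                  # let the cluster re-replicate its data first
  # wait until ceph -s reports HEALTH_OK again
  service ceph stop osd.0         # only stop the daemon once recovery has finished
  ceph osd crush remove osd.0
  ceph auth del osd.0
  ceph osd rm 0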

Is your cluster HEALTH_OK now?  If not, can you give me the following?
• ceph -s
• ceph osd tree
• ceph osd dump | grep ^pool
• ceph pg dump_stuck
• ceph pg query <pgid>  # For just one of the stuck PGs

I'm a bit confused why your cluster has a bunch of PGs in the remapped state but none actually recovering or backfilling.  It's supposed to be recovering, and something is blocking that.



As to the hung VMs: during any recovery or backfill, you'll probably have IO problems.  The ceph.conf defaults are intended for large clusters, probably with SSD journals.  In my 3-node, 24-OSD cluster with no SSD journals, recovery was IO-starving my clients.  I de-prioritized recovery with:
[osd]
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1

It was still painful, but those values kept my cluster usable.  Since I've grown to 5 nodes and added SSD journals, I've been able to increase max backfills and recovery max active to 3.  I found those values through trial and error, watching my RadosGW latency and playing with ceph tell osd.\* injectargs ...
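
For example, bumping the values at runtime (no OSD restart needed) looks something like:

  ceph tell osd.\* injectargs '--osd-max-backfills 3 --osd-recovery-max-active 3'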

I've found that I have problems if more than 20% of my OSDs are involved in a backfill operation.  With your 9 OSDs, any single failure event is guaranteed to hit at least 22% of your OSDs, and probably more.  If you're unable to add more disks, I would highly recommend adding SSD journals.



On Fri, Dec 19, 2014 at 8:08 AM, Chris Murray <chrismurray84@xxxxxxxxx> wrote:
Hello,

I'm a newbie to CEPH, gaining some familiarity by hosting some virtual
machines on a test cluster. I'm using a virtualisation product called
Proxmox Virtual Environment, which conveniently handles cluster setup,
pool setup, OSD creation etc.

During the attempted removal of an OSD, my pool appeared to cease
serving IO to virtual machines, and I'm wondering if I did something
wrong or if there's something more to the process of removing an OSD.

The CEPH cluster is small; 9 OSDs in total across 3 nodes. There's a
pool called 'vmpool', with size=3 and min_size=1. It's a bit slow, but I
see plenty of information on how to troubleshoot that, and understand I
should be separating cluster communication onto a separate network
segment to improve performance. The CEPH version is Firefly (0.80.7).

So, the issue was: I marked osd.0 as down & out (or possibly out & down,
if order matters), and virtual machines hung. Almost immediately, 78 pgs
were 'stuck inactive', and after some activity overnight, they remained
that way:


    cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
     health HEALTH_WARN 290 pgs degraded; 78 pgs stuck inactive; 496 pgs
stuck unclean; 4 requests are blocked > 32 sec; recovery 69696/685356
objects degraded (10.169%)
     monmap e3: 3 mons at
{0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0},
election epoch 50, quorum 0,1,2 0,1,2
     osdmap e669: 9 osds: 8 up, 8 in
      pgmap v100175: 1216 pgs, 4 pools, 888 GB data, 223 kobjects
            2408 GB used, 7327 GB / 9736 GB avail
            69696/685356 objects degraded (10.169%)
                  78 inactive
                 720 active+clean
                 290 active+degraded
                 128 active+remapped


I started the OSD to bring it back 'up'. It was still 'out'.


    cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
     health HEALTH_WARN 59 pgs degraded; 496 pgs stuck unclean; recovery
30513/688554 objects degraded (4.431%)
     monmap e3: 3 mons at
{0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0},
election epoch 50, quorum 0,1,2 0,1,2
     osdmap e671: 9 osds: 9 up, 8 in
      pgmap v103181: 1216 pgs, 4 pools, 892 GB data, 224 kobjects
            2408 GB used, 7327 GB / 9736 GB avail
            30513/688554 objects degraded (4.431%)
                 720 active+clean
                  59 active+degraded
                 437 active+remapped
  client io 2303 kB/s rd, 153 kB/s wr, 85 op/s


The inactive PGs had disappeared.
I stopped the OSD again, making it 'down' and 'out', as it was before.
At this point, I started my virtual machines again, and they functioned
correctly.


    cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
     health HEALTH_WARN 368 pgs degraded; 496 pgs stuck unclean;
recovery 83332/688554 objects degraded (12.102%)
     monmap e3: 3 mons at
{0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0},
election epoch 50, quorum 0,1,2 0,1,2
     osdmap e673: 9 osds: 8 up, 8 in
      pgmap v103248: 1216 pgs, 4 pools, 892 GB data, 224 kobjects
            2408 GB used, 7327 GB / 9736 GB avail
            83332/688554 objects degraded (12.102%)
                 720 active+clean
                 368 active+degraded
                 128 active+remapped
  client io 19845 B/s wr, 6 op/s


At this point, removing the OSD was successful, without any IO hanging.


--------

Have I tried to remove an OSD in an incorrect manner? I'm also wondering
what would happen in a legitimate failure scenario; what if a disk
failure were followed by a host failure? Apologies if this is something
that's been observed already; I've seen mentions of the same symptom,
but seemingly for causes other than OSD removal.

Thank you in advance,
Chris


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
