Re: ceph warning

Christian Balzer <chibi@xxxxxxx> · Fri, 2 Sep 2016 09:59:01 +0900

Hello,

On Thu, 1 Sep 2016 16:24:28 +0200 Ishmael Tsoaela wrote:

> I did configure the following during my initial setup:
> 
> osd pool default size = 3
> 
Ah yes, so not this.
(though the default "rbd" pool that's initially created tended to ignore
that parameter and would default to 3 in any case)

In fact, I now remember writing about this before: you're looking at CRUSH
in action and the corner cases of a small cluster.

What happened here is that when the OSDs of your 3rd node were gone (down
and out), CRUSH recalculated the locations of PGs based on the new reality
and started to move things around.
And unlike with a larger cluster (4+ nodes) or a single OSD failure, it
did NOT remove the "old" data after the move, since your replication level
of 3 wasn't achievable with only 2 nodes left.

So this is what filled up your OSDs, moving (copying really) PGs to their
newly calculated location while not deleting the data at the old location
afterwards.

As said before, you will want to set
"mon_osd_down_out_subtree_limit = host"
at least until your cluster is n+1 sized (one node more than your replica
count, so 4 nodes or more).

Adding more OSDs (and keeping usage below 60% or so) would also achieve
this, but a 4th node would be more helpful performance-wise.
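
As a rough sketch of why 4 nodes and the 60% figure go together (a
simplified model, not anything Ceph computes itself): assume equal-sized
hosts, a host failure domain, and that everything from the lost host gets
re-replicated evenly onto the survivors. The function names are made up
for illustration, and the 0.85 nearfull threshold is assumed to be the
default; check the settings on your own cluster.

```python
NEARFULL_RATIO = 0.85  # assumed default nearfull threshold; verify on your cluster


def can_restore_replicas(hosts: int, pool_size: int) -> bool:
    """True if the pool's replica count still fits after losing one host.

    With a host failure domain, every replica needs its own host.
    """
    return hosts - 1 >= pool_size


def usage_after_host_loss(hosts: int, usage: float) -> float:
    """Average usage of the surviving hosts once the lost host's data moved over.

    Simplified: equal-sized hosts and perfectly even data distribution.
    """
    if hosts < 2:
        raise ValueError("need at least 2 hosts")
    return usage * hosts / (hosts - 1)


if __name__ == "__main__":
    for hosts in (3, 4):
        after = usage_after_host_loss(hosts, 0.60)
        print(
            f"{hosts} hosts at 60%: "
            f"size 3 restorable={can_restore_replicas(hosts, 3)}, "
            f"survivors at {after:.0%}, "
            f"under nearfull={after < NEARFULL_RATIO}"
        )
```

With 4 hosts at 60% usage, losing one host puts the survivors at about
80% on average, still under the assumed nearfull threshold. With 3 hosts
and size 3, the 3rd replica has nowhere to go at all. Real clusters are
never perfectly balanced, so individual OSDs will sit above the average.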

Christian

> 
> 
> root@nodeC:/mnt/vmimages# ceph osd dump | grep "replicated size"
> pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 217 flags hashpspool
> stripe_width 0
> pool 4 'vmimages' replicated size 3 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 64 pgp_num 64 last_change 242 flags
> hashpspool stripe_width 0
> pool 5 'vmimage-backups' replicated size 3 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 512 pgp_num 512 last_change 777 flags
> hashpspool stripe_width 0
> 
> 
> After adding 3 more OSDs, I see data is being replicated to the new OSDs
> :) and the near full OSD warning is gone.
> 
> recovery some hours ago:
> 
> 
> >>  recovery 389973/3096070 objects degraded (12.596%)
> >>  recovery 1258984/3096070 objects misplaced (40.664%)
> 
> recovery now:
> 
>             recovery 8917/3217724 objects degraded (0.277%)
>             recovery 1120479/3217724 objects misplaced (34.822%)
> 
> 
> 
> 
> 
> On Thu, Sep 1, 2016 at 4:13 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > Hello,
> >
> > On Thu, 1 Sep 2016 14:00:53 +0200 Ishmael Tsoaela wrote:
> >
> >> More questions, and I hope you don't mind:
> >>
> >>
> >>
> >> My understanding is that if I have 3 hosts with 5 OSDs each and 1 host
> >> goes down, Ceph should not replicate to the OSDs that are down.
> >>
> > How could it replicate to something that is down?
> >
> >> Only when the host comes back up will replication commence, right?
> >>
> > Depends on your configuration.
> >
> >> If only 1 OSD out of 5 comes up, then only data meant for that OSD
> >> should be copied to it? If so, then why do the OSDs get full if they
> >> were not full before the OSDs went down?
> >>
> > Definitely not.
> >
> >>
> > You need to understand how CRUSH maps, rules and replication work.
> >
> > By default, pools on Hammer and later have a replication size
> > of 3 and CRUSH picks OSDs based on a host failure domain, so that's why
> > you need at least 3 hosts with those default settings.
> >
> > So with these defaults Ceph would indeed have done nothing in a 3 node
> > cluster if one node had gone down.
> > It needs to put replicas on different nodes, but only 2 are available.
> >
> > However, given what happened to your cluster, your pools most likely
> > have a replication size of 2.
> > Check with
> > ceph osd dump | grep "replicated size"
> >
> > In that case Ceph will try to recover and restore 2 replicas (original and
> > copy), resulting in what you're seeing.
> >
> > Christian
> >
> >>
> >> On Thu, Sep 1, 2016 at 1:29 PM, Ishmael Tsoaela <ishmaelt3@xxxxxxxxx> wrote:
> >> > Thank you again.
> >> >
> >> > I will add 3 more OSDs today and leave them untouched, maybe over the weekend.
> >> >
> >> > On Thu, Sep 1, 2016 at 1:16 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >> >>
> >> >> Hello,
> >> >>
> >> >> On Thu, 1 Sep 2016 11:20:33 +0200 Ishmael Tsoaela wrote:
> >> >>
> >> >>> thanks for the response
> >> >>>
> >> >>>
> >> >>>
> >> >>> > You really will want to spend more time reading documentation and this ML,
> >> >>> > as well as using google to (re-)search things.
> >> >>>
> >> >>>
> >> >>>  I did do some reading on the errors but cannot understand why they do
> >> >>> not clear even after so long.
> >> >>>
> >> >>> > In your previous mail you already mentioned a 92% full OSD; that,
> >> >>> > combined with the various "full" warnings, should have impressed on you
> >> >>> > the need to address this issue.
> >> >>>
> >> >>> > When your nodes all rebooted, did everything come back up?
> >> >>>
> >> >>> One host with 5 OSDs was down and came up later.
> >> >>>
> >> >>> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
> >> >>> time?
> >> >>>
> >> >>> about 7 hours
> >> >>>
> >> >> OK, so in those 7 hours (with 1/3rd of your cluster down), Ceph tried to
> >> >> restore redundancy, but did not have enough space to do so and got itself
> >> >> stuck in a corner.
> >> >>
> >> >> Lesson here is:
> >> >> a) have enough space to cover the loss of one node (rack, etc) or
> >> >> b) set "mon_osd_down_out_subtree_limit = host" in your case, so that you
> >> >> can recover a failed node before re-balancing starts.
> >> >>
> >> >> Of course b) assumes that you have 24/7 monitoring and access to your
> >> >> cluster, so that restoring a failed node is likely faster than
> >> >> re-balancing the data.
> >> >>
> >> >>
> >> >>> True
> >> >>>
> >> >>> > Bad, Ceph wants to place data onto these 2 PGs, but their OSDs are too
> >> >>> > full for that.
> >> >>> > And until something changes it will be stuck there.
> >> >>> > Your best bet is to add more OSDs, since you seem to be short on space
> >> >>> > anyway. Or delete unneeded data.
> >> >>> > Given your level of experience, I'd advise against playing with weights
> >> >>> > and the respective "full" configuration options.
> >> >>>
> >> >>> I did reweight some OSDs, but everything is back to normal. No changes
> >> >>> to the "full" config options.
> >> >>>
> >> >>> I deleted about 900G this morning and prepared 3 OSDs, should I add them now?
> >> >>>
> >> >> More OSDs will both make things less likely to get full again and give the
> >> >> nearfull OSDs a place to move data to.
> >> >>
> >> >> However they will also cause more data movement, so if your cluster is
> >> >> busy, maybe do that during the night or weekend.
> >> >>
> >> >>> > Are these numbers and the recovery io below still changing, moving along?
> >> >>>
> >> >>> original email:
> >> >>>
> >> >>> >             recovery 493335/3099981 objects degraded (15.914%)
> >> >>> >             recovery 1377464/3099981 objects misplaced (44.435%)
> >> >>>
> >> >>>
> >> >>> current email:
> >> >>>
> >> >>>
> >> >>>  recovery 389973/3096070 objects degraded (12.596%)
> >> >>>  recovery 1258984/3096070 objects misplaced (40.664%)
> >> >>>
> >> >>>
> >> >> So there is progress, it may recover by itself after all.
> >> >>
> >> >> Looking at your "df" output only 7 OSDs seem to be nearfull now, is that
> >> >> correct?
> >> >>
> >> >> If so definitely progress, it's just taking a lot of time to recover.
> >> >>
> >> >> If the progress should stop before the cluster can get healthy again,
> >> >> write another mail with "ceph -s" and so forth for us to peruse.
> >> >>
> >> >> Christian
> >> >>
> >> >>> > Just to confirm, that's all the 15 OSDs your cluster ever had?
> >> >>>
> >> >>> yes
> >> >>>
> >> >>>
> >> >>> > Output from "ceph osd df" and "ceph osd tree" please.
> >> >>>
> >> >>> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL %USE  VAR  PGS
> >> >>>  3 0.90868  1.00000   930G   232G  698G 24.96 0.40 105
> >> >>>  5 0.90868  1.00000   930G   139G  791G 14.99 0.24 139
> >> >>>  6 0.90868  1.00000   930G 61830M  870G  6.49 0.10 138
> >> >>>  0 0.90868  1.00000   930G   304G  625G 32.76 0.53 128
> >> >>>  2 0.90868  1.00000   930G 24253M  906G  2.55 0.04 130
> >> >>>  1 0.90868  1.00000   930G   793G  137G 85.22 1.37 162
> >> >>>  4 0.90868  1.00000   930G   790G  140G 84.91 1.36 160
> >> >>>  7 0.90868  1.00000   930G   803G  127G 86.34 1.39 144
> >> >>> 10 0.90868  1.00000   930G   792G  138G 85.16 1.37 145
> >> >>> 13 0.90868  1.00000   930G   811G  119G 87.17 1.40 163
> >> >>> 15 0.90869  1.00000   930G   794G  136G 85.37 1.37 157
> >> >>> 16 0.90869  1.00000   930G   757G  172G 81.45 1.31 159
> >> >>> 17 0.90868  1.00000   930G   800G  129G 86.06 1.38 144
> >> >>> 18 0.90869  1.00000   930G   786G  144G 84.47 1.36 166
> >> >>> 19 0.90868  1.00000   930G   793G  137G 85.26 1.37 160
> >> >>>               TOTAL 13958G  8683G 5274G 62.21
> >> >>> MIN/MAX VAR: 0.04/1.40  STDDEV: 33.10
> >> >>>
> >> >>>
> >> >>>
> >> >>> ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >> >>> -1 13.63019 root default
> >> >>> -2  4.54338     host nodeB
> >> >>>  3  0.90868         osd.3       up  1.00000          1.00000
> >> >>>  5  0.90868         osd.5       up  1.00000          1.00000
> >> >>>  6  0.90868         osd.6       up  1.00000          1.00000
> >> >>>  0  0.90868         osd.0       up  1.00000          1.00000
> >> >>>  2  0.90868         osd.2       up  1.00000          1.00000
> >> >>> -3  4.54338     host nodeC
> >> >>>  1  0.90868         osd.1       up  1.00000          1.00000
> >> >>>  4  0.90868         osd.4       up  1.00000          1.00000
> >> >>>  7  0.90868         osd.7       up  1.00000          1.00000
> >> >>> 10  0.90868         osd.10      up  1.00000          1.00000
> >> >>> 13  0.90868         osd.13      up  1.00000          1.00000
> >> >>> -6  4.54343     host nodeD
> >> >>> 15  0.90869         osd.15      up  1.00000          1.00000
> >> >>> 16  0.90869         osd.16      up  1.00000          1.00000
> >> >>> 17  0.90868         osd.17      up  1.00000          1.00000
> >> >>> 18  0.90869         osd.18      up  1.00000          1.00000
> >> >>> 19  0.90868         osd.19      up  1.00000          1.00000
> >> >>>
> >> >>> On Thu, Sep 1, 2016 at 10:56 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >> >>> >
> >> >>> >
> >> >>> > Hello,
> >> >>> >
> >> >>> > On Thu, 1 Sep 2016 10:18:39 +0200 Ishmael Tsoaela wrote:
> >> >>> >
> >> >>> > > Hi All,
> >> >>> > >
> >> >>> > > Can someone please decipher these errors for me? All nodes in my cluster
> >> >>> > > rebooted on Monday, and the warning has not gone away.
> >> >>> > >
> >> >>> > You really will want to spend more time reading documentation and this ML,
> >> >>> > as well as using google to (re-)search things.
> >> >>> > Like searching for "backfill_toofull", "near full", etc.
> >> >>> >
> >> >>> >
> >> >>> > > Will the warning ever clear?
> >> >>> > >
> >> >>> > Unlikely.
> >> >>> >
> >> >>> > In your previous mail you already mentioned a 92% full OSD; that,
> >> >>> > combined with the various "full" warnings, should have impressed on you
> >> >>> > the need to address this issue.
> >> >>> >
> >> >>> > When your nodes all rebooted, did everything come back up?
> >> >>> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
> >> >>> > time?
> >> >>> > My guess is that some nodes/OSDs were restarted a lot later than others.
> >> >>> >
> >> >>> > See inline:
> >> >>> > >
> >> >>> > >   cluster df3f96d8-3889-4baa-8b27-cc2839141425
> >> >>> > >      health HEALTH_WARN
> >> >>> > >             2 pgs backfill_toofull
> >> >>> > Bad, Ceph wants to place data onto these 2 PGs, but their OSDs are too
> >> >>> > full for that.
> >> >>> > And until something changes it will be stuck there.
> >> >>> >
> >> >>> > Your best bet is to add more OSDs, since you seem to be short on space
> >> >>> > anyway. Or delete unneeded data.
> >> >>> > Given your level of experience, I'd advise against playing with weights
> >> >>> > and the respective "full" configuration options.
> >> >>> >
> >> >>> > >             532 pgs backfill_wait
> >> >>> > >             3 pgs backfilling
> >> >>> > >             330 pgs degraded
> >> >>> > >             537 pgs stuck unclean
> >> >>> > >             330 pgs undersized
> >> >>> > >             recovery 493335/3099981 objects degraded (15.914%)
> >> >>> > >             recovery 1377464/3099981 objects misplaced (44.435%)
> >> >>> > Are these numbers and the recovery io below still changing, moving along?
> >> >>> >
> >> >>> > >             8 near full osd(s)
> >> >>> > 8 out of 15, definitely needs more OSDs.
> >> >>> > Output from "ceph osd df" and "ceph osd tree" please.
> >> >>> >
> >> >>> > >      monmap e7: 3 mons at {Monitors}
> >> >>> > >             election epoch 118, quorum 0,1,2 nodeB,nodeC,nodeD
> >> >>> > >      osdmap e3922: 15 osds: 15 up, 15 in; 537 remapped pgs
> >> >>> >
> >> >>> > Just to confirm, that's all the 15 OSDs your cluster ever had?
> >> >>> >
> >> >>> > Christian
> >> >>> >
> >> >>> > >             flags sortbitwise
> >> >>> > >       pgmap v2431741: 640 pgs, 3 pools, 3338 GB data, 864 kobjects
> >> >>> > >             8242 GB used, 5715 GB / 13958 GB avail
> >> >>> > >             493335/3099981 objects degraded (15.914%)
> >> >>> > >             1377464/3099981 objects misplaced (44.435%)
> >> >>> > >                  327 active+undersized+degraded+remapped+wait_backfill
> >> >>> > >                  205 active+remapped+wait_backfill
> >> >>> > >                  103 active+clean
> >> >>> > >                    3 active+undersized+degraded+remapped+backfilling
> >> >>> > >                    2 active+remapped+backfill_toofull
> >> >>> > > recovery io 367 MB/s, 96 objects/s
> >> >>> > >   client io 5699 B/s rd, 23749 B/s wr, 2 op/s rd, 12 op/s wr
> >> >>> >
> >> >>> >
> >> >>> > --
> >> >>> > Christian Balzer        Network/Systems Engineer
> >> >>> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> >> >>> > http://www.gol.com/
> >> >>>
> >>
> 

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com