I did configure the following during my initial setup:

osd pool default size = 3

root@nodeC:/mnt/vmimages# ceph osd dump | grep "replicated size"
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 217 flags hashpspool stripe_width 0
pool 4 'vmimages' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 242 flags hashpspool stripe_width 0
pool 5 'vmimage-backups' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 777 flags hashpspool stripe_width 0

After adding 3 more OSDs, I see data is being replicated to the new OSDs :) and the near full OSD warning is gone.

recovery some hours ago:
>> recovery 389973/3096070 objects degraded (12.596%)
>> recovery 1258984/3096070 objects misplaced (40.664%)

recovery now:
recovery 8917/3217724 objects degraded (0.277%)
recovery 1120479/3217724 objects misplaced (34.822%)

On Thu, Sep 1, 2016 at 4:13 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> Hello,
>
> On Thu, 1 Sep 2016 14:00:53 +0200 Ishmael Tsoaela wrote:
>
>> more questions and I hope you don't mind:
>>
>> My understanding is that if I have 3 hosts with 5 OSDs each and 1 host
>> goes down, Ceph should not replicate to the OSDs that are down.
>>
> How could it replicate to something that is down?
>
>> When the host comes up, only then will the replication commence, right?
>>
> Depends on your configuration.
>
>> If only 1 OSD out of 5 comes up, then only data meant for that OSD
>> should be copied to it? If so, then why do PGs get full if they
>> were not full before the OSD went down?
>>
> Definitely not.
>
> You need to understand how CRUSH maps, rules and replication work.
>
> By default, pools with Hammer and higher will have a replication size
> of 3 and CRUSH picks OSDs based on a host failure domain, so that's why
> you need at least 3 hosts with those default settings.
>
> So with these defaults Ceph would indeed have done nothing in a 3 node
> cluster if one node had gone down.
> It needs to put replicas on different nodes, but only 2 are available.
>
> However, given what happened to your cluster, it is obvious that your pools
> most likely have a replication size of 2.
> Check with
> ceph osd dump | grep "replicated size"
>
> In that case Ceph will try to recover and restore 2 replicas (original and
> copy), resulting in what you're seeing.
>
> Christian
>
>> On Thu, Sep 1, 2016 at 1:29 PM, Ishmael Tsoaela <ishmaelt3@xxxxxxxxx> wrote:
>> > Thank you again.
>> >
>> > I will add 3 more OSDs today and leave the cluster untouched, maybe over the weekend.
>> >
>> > On Thu, Sep 1, 2016 at 1:16 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>> >>
>> >> Hello,
>> >>
>> >> On Thu, 1 Sep 2016 11:20:33 +0200 Ishmael Tsoaela wrote:
>> >>
>> >>> thanks for the response
>> >>>
>> >>> > You really will want to spend more time reading documentation and this ML,
>> >>> > as well as using google to (re-)search things.
>> >>>
>> >>> I did do some reading on the errors but cannot understand why they do
>> >>> not clear even after so long.
>> >>>
>> >>> > In your previous mail you already mentioned a 92% full OSD; that should,
>> >>> > combined with the various "full" warnings, have impressed on you the need
>> >>> > to address this issue.
>> >>>
>> >>> > When your nodes all rebooted, did everything come back up?
>> >>>
>> >>> One host with 5 OSDs was down and came up later.
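
Going back to the replication settings at the top of this mail: a minimal sketch of how the same per-pool settings can be read back directly (the pool name 'vmimages' is taken from my dump above; "ceph osd crush rule dump" with no argument lists all rules, so the failure domain can be confirmed without guessing a rule name):

  # how many copies the pool keeps, and how many must be available to serve I/O
  ceph osd pool get vmimages size
  ceph osd pool get vmimages min_size

  # list the CRUSH rules; the chooseleaf step shows the failure domain (host by default)
  ceph osd crush rule dump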
>> >>>
>> >>> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
>> >>> > time?
>> >>>
>> >>> about 7 hours
>> >>>
>> >> OK, so in those 7 hours (with 1/3rd of your cluster down), Ceph tried to
>> >> restore redundancy, but did not have enough space to do so and got itself
>> >> stuck in a corner.
>> >>
>> >> Lesson here is:
>> >> a) have enough space to cover the loss of one node (rack, etc.), or
>> >> b) set "mon_osd_down_out_subtree_limit = host" in your case, so that you
>> >> can recover a failed node before re-balancing starts.
>> >>
>> >> Of course b) assumes that you have 24/7 monitoring and access to your
>> >> cluster, so that restoring a failed node is likely faster than
>> >> re-balancing the data.
>> >>
>> >>> True
>> >>>
>> >>> > Bad, Ceph wants to place data onto these 2 PGs, but their OSDs are too
>> >>> > full for that.
>> >>> > And until something changes it will be stuck there.
>> >>> > Your best bet is to add more OSDs, since you seem to be short on space
>> >>> > anyway. Or delete unneeded data.
>> >>> > Given your level of experience, I'd advise against playing with weights
>> >>> > and the respective "full" configuration options.
>> >>>
>> >>> I did reweight some OSDs but everything is back to normal. No config
>> >>> changes to the "full" options.
>> >>>
>> >>> I deleted about 900G this morning and prepared 3 OSDs; should I add them now?
>> >>>
>> >> More OSDs will both make things less likely to get full again and give the
>> >> nearfull OSDs a place to move data to.
>> >>
>> >> However, they will also cause more data movement, so if your cluster is
>> >> busy, maybe do that during the night or weekend.
>> >>
>> >>> > Are these numbers and the recovery io below still changing, moving along?
>> >>>
>> >>> original email:
>> >>>
>> >>> > recovery 493335/3099981 objects degraded (15.914%)
>> >>> > recovery 1377464/3099981 objects misplaced (44.435%)
>> >>>
>> >>> current email:
>> >>>
>> >>> recovery 389973/3096070 objects degraded (12.596%)
>> >>> recovery 1258984/3096070 objects misplaced (40.664%)
>> >>>
>> >> So there is progress, it may recover by itself after all.
>> >>
>> >> Looking at your "df" output, only 7 OSDs seem to be nearfull now, is that
>> >> correct?
>> >>
>> >> If so, definitely progress, it's just taking a lot of time to recover.
>> >>
>> >> If the progress should stop before the cluster can get healthy again,
>> >> write another mail with "ceph -s" and so forth for us to peruse.
>> >>
>> >> Christian
>> >>
>> >>> > Just to confirm, that's all the 15 OSDs your cluster ever had?
>> >>>
>> >>> yes
>> >>>
>> >>> > Output from "ceph osd df" and "ceph osd tree" please.
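
Regarding Christian's option (b) above, a quick sketch of where that setting would go; the [mon] section placement and the runtime injectargs form are my assumptions, the option name is the one he gave. The "ceph osd df" and "ceph osd tree" output he asked for follows below.

  # ceph.conf on the monitor nodes, so a whole failed host is not marked
  # "out" and re-balanced automatically:
  #   [mon]
  #   mon osd down out subtree limit = host
  #
  # or, tentatively, push it into the running monitors:
  ceph tell mon.* injectargs '--mon_osd_down_out_subtree_limit=host'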
>> >>>
>> >>> ID WEIGHT  REWEEIGHT SIZE   USE     AVAIL %USE  VAR  PGS
>> >>>  3 0.90868  1.00000  930G    232G   698G 24.96 0.40 105
>> >>>  5 0.90868  1.00000  930G    139G   791G 14.99 0.24 139
>> >>>  6 0.90868  1.00000  930G  61830M   870G  6.49 0.10 138
>> >>>  0 0.90868  1.00000  930G    304G   625G 32.76 0.53 128
>> >>>  2 0.90868  1.00000  930G  24253M   906G  2.55 0.04 130
>> >>>  1 0.90868  1.00000  930G    793G   137G 85.22 1.37 162
>> >>>  4 0.90868  1.00000  930G    790G   140G 84.91 1.36 160
>> >>>  7 0.90868  1.00000  930G    803G   127G 86.34 1.39 144
>> >>> 10 0.90868  1.00000  930G    792G   138G 85.16 1.37 145
>> >>> 13 0.90868  1.00000  930G    811G   119G 87.17 1.40 163
>> >>> 15 0.90869  1.00000  930G    794G   136G 85.37 1.37 157
>> >>> 16 0.90869  1.00000  930G    757G   172G 81.45 1.31 159
>> >>> 17 0.90868  1.00000  930G    800G   129G 86.06 1.38 144
>> >>> 18 0.90869  1.00000  930G    786G   144G 84.47 1.36 166
>> >>> 19 0.90868  1.00000  930G    793G   137G 85.26 1.37 160
>> >>>               TOTAL 13958G   8683G  5274G 62.21
>> >>> MIN/MAX VAR: 0.04/1.40  STDDEV: 33.10
>> >>>
>> >>> ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> >>> -1 13.63019 root default
>> >>> -2  4.54338     host nodeB
>> >>>  3  0.90868         osd.3       up  1.00000          1.00000
>> >>>  5  0.90868         osd.5       up  1.00000          1.00000
>> >>>  6  0.90868         osd.6       up  1.00000          1.00000
>> >>>  0  0.90868         osd.0       up  1.00000          1.00000
>> >>>  2  0.90868         osd.2       up  1.00000          1.00000
>> >>> -3  4.54338     host nodeC
>> >>>  1  0.90868         osd.1       up  1.00000          1.00000
>> >>>  4  0.90868         osd.4       up  1.00000          1.00000
>> >>>  7  0.90868         osd.7       up  1.00000          1.00000
>> >>> 10  0.90868         osd.10      up  1.00000          1.00000
>> >>> 13  0.90868         osd.13      up  1.00000          1.00000
>> >>> -6  4.54343     host nodeD
>> >>> 15  0.90869         osd.15      up  1.00000          1.00000
>> >>> 16  0.90869         osd.16      up  1.00000          1.00000
>> >>> 17  0.90868         osd.17      up  1.00000          1.00000
>> >>> 18  0.90869         osd.18      up  1.00000          1.00000
>> >>> 19  0.90868         osd.19      up  1.00000          1.00000
>> >>>
>> >>> On Thu, Sep 1, 2016 at 10:56 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>> >>> >
>> >>> > Hello,
>> >>> >
>> >>> > On Thu, 1 Sep 2016 10:18:39 +0200 Ishmael Tsoaela wrote:
>> >>> >
>> >>> > > Hi All,
>> >>> > >
>> >>> > > Can someone please decipher these errors for me? After all nodes rebooted in
>> >>> > > my cluster on Monday, the warning has not gone.
>> >>> > >
>> >>> > You really will want to spend more time reading documentation and this ML,
>> >>> > as well as using google to (re-)search things.
>> >>> > Like searching for "backfill_toofull", "near full", etc.
>> >>> >
>> >>> > > Will the warning ever clear?
>> >>> > >
>> >>> > Unlikely.
>> >>> >
>> >>> > In your previous mail you already mentioned a 92% full OSD; that should,
>> >>> > combined with the various "full" warnings, have impressed on you the need
>> >>> > to address this issue.
>> >>> >
>> >>> > When your nodes all rebooted, did everything come back up?
>> >>> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
>> >>> > time?
>> >>> > My guess is that some nodes/OSDs were restarted a lot later than others.
>> >>> >
>> >>> > See inline:
>> >>> > >
>> >>> > >     cluster df3f96d8-3889-4baa-8b27-cc2839141425
>> >>> > >      health HEALTH_WARN
>> >>> > >             2 pgs backfill_toofull
>> >>> > Bad, Ceph wants to place data onto these 2 PGs, but their OSDs are too
>> >>> > full for that.
>> >>> > And until something changes it will be stuck there.
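
For anyone else hitting backfill_toofull, a rough sketch of how the affected PGs and OSDs can be identified (all stock commands; the PG id 4.1f is only a placeholder, not one of my real PGs):

  # names the backfill_toofull PGs and the near full OSDs
  ceph health detail

  # cross-check per-OSD utilisation and which PGs are stuck
  ceph osd df
  ceph pg dump_stuck unclean

  # inspect a single reported PG (replace 4.1f with an id from health detail)
  ceph pg 4.1f query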
>> >>> >
>> >>> > Your best bet is to add more OSDs, since you seem to be short on space
>> >>> > anyway. Or delete unneeded data.
>> >>> > Given your level of experience, I'd advise against playing with weights
>> >>> > and the respective "full" configuration options.
>> >>> >
>> >>> > >             532 pgs backfill_wait
>> >>> > >             3 pgs backfilling
>> >>> > >             330 pgs degraded
>> >>> > >             537 pgs stuck unclean
>> >>> > >             330 pgs undersized
>> >>> > >             recovery 493335/3099981 objects degraded (15.914%)
>> >>> > >             recovery 1377464/3099981 objects misplaced (44.435%)
>> >>> > Are these numbers and the recovery io below still changing, moving along?
>> >>> >
>> >>> > >             8 near full osd(s)
>> >>> > 8 out of 15, definitely needs more OSDs.
>> >>> > Output from "ceph osd df" and "ceph osd tree" please.
>> >>> >
>> >>> > >      monmap e7: 3 mons at {Monitors}
>> >>> > >             election epoch 118, quorum 0,1,2 nodeB,nodeC,nodeD
>> >>> > >      osdmap e3922: 15 osds: 15 up, 15 in; 537 remapped pgs
>> >>> >
>> >>> > Just to confirm, that's all the 15 OSDs your cluster ever had?
>> >>> >
>> >>> > Christian
>> >>> >
>> >>> > >             flags sortbitwise
>> >>> > >       pgmap v2431741: 640 pgs, 3 pools, 3338 GB data, 864 kobjects
>> >>> > >             8242 GB used, 5715 GB / 13958 GB avail
>> >>> > >             493335/3099981 objects degraded (15.914%)
>> >>> > >             1377464/3099981 objects misplaced (44.435%)
>> >>> > >                  327 active+undersized+degraded+remapped+wait_backfill
>> >>> > >                  205 active+remapped+wait_backfill
>> >>> > >                  103 active+clean
>> >>> > >                    3 active+undersized+degraded+remapped+backfilling
>> >>> > >                    2 active+remapped+backfill_toofull
>> >>> > > recovery io 367 MB/s, 96 objects/s
>> >>> > >   client io 5699 B/s rd, 23749 B/s wr, 2 op/s rd, 12 op/s wr
>> >>> >
>> >>> >
>> >>> > --
>> >>> > Christian Balzer        Network/Systems Engineer
>> >>> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>> >>> > http://www.gol.com/
>> >>
>> >>
>> >> --
>> >> Christian Balzer        Network/Systems Engineer
>> >> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>> >> http://www.gol.com/
>> >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
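
One last note: the three new OSDs are still backfilling while clients are active, so this is the sort of throttling I am considering for daytime hours; the values are only illustrative and not something recommended in this thread:

  # reduce concurrent backfills and recovery ops per OSD at runtime
  ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'

  # watch the effect on recovery and client io
  ceph -s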