Hello,

On Thu, 1 Sep 2016 11:20:33 +0200 Ishmael Tsoaela wrote:

> thanks for the response
>
> > You really will want to spend more time reading documentation and this ML,
> > as well as using google to (re-)search things.
>
> I did do some reading on the errors but cannot understand why they do
> not clear even after so long.
>
> > In your previous mail you already mentioned a 92% full OSD, that should
> > combined with the various "full" warnings have impressed on you the need
> > to address this issue.
>
> > When your nodes all rebooted, did everything come back up?
>
> One host with 5 OSDs was down and came up later.
>
> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
> > time?
>
> about 7 hours
>
OK, so in those 7 hours (with 1/3rd of your cluster down), Ceph tried to
restore redundancy, but had not enough space to do so and got itself stuck
in a corner.

The lesson here is to either:
a) have enough space to cover the loss of one node (rack, etc.), or
b) set "mon_osd_down_out_subtree_limit = host" in your case, so that you
   can recover a failed node before re-balancing starts.
Of course b) assumes that you have 24/7 monitoring and access to your
cluster, so that restoring a failed node is likely faster than
re-balancing the data.

> True
>
> > Bad, Ceph wants to place data onto these 2 PGs, but their OSDs are too
> > full for that.
> > And until something changes it will be stuck there.
> > Your best bet is to add more OSDs, since you seem to be short on space
> > anyway. Or delete unneeded data.
> > Given your level of experience, I'd advise against playing with weights
> > and the respective "full" configuration options.
>
> I did reweight some OSDs but everything is back to normal. No config
> changes to the "full" options.
>
> I deleted about 900G this morning and prepared 3 OSDs, should I add them now?
>
More OSDs will both make things less likely to get full again and give the
nearfull OSDs a place to move data to.
However, they will also cause more data movement, so if your cluster is
busy, maybe do that during the night or over a weekend.

> > Are these numbers and the recovery io below still changing, moving along?
>
> original email:
>
> recovery 493335/3099981 objects degraded (15.914%)
> recovery 1377464/3099981 objects misplaced (44.435%)
>
> current email:
>
> recovery 389973/3096070 objects degraded (12.596%)
> recovery 1258984/3096070 objects misplaced (40.664%)
>
So there is progress; it may recover by itself after all.

Looking at your "df" output, only 7 OSDs seem to be nearfull now, is that
correct?
If so, that is definitely progress, it is just taking a lot of time to
recover.

If the progress should stop before the cluster can get healthy again,
write another mail with "ceph -s" and so forth for us to peruse.

Christian

> > Just to confirm, that's all the 15 OSDs your cluster ever had?
>
> yes
>
> > Output from "ceph osd df" and "ceph osd tree" please.
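(Purely as a sketch, not verified against your particular setup: "ceph osd
df tree" combines both of those views in one command, and "ceph health
detail" names the nearfull OSDs explicitly:

    ceph osd df tree
    ceph health detail

For option b) above, the setting goes into ceph.conf on the monitor nodes,
roughly:

    [mon]
    mon_osd_down_out_subtree_limit = host

And if you do add the new OSDs while the cluster is busy, backfill can be
throttled along the lines of:

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

Exact option names and defaults vary a bit between releases, so check them
against the documentation for your version first.)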
>
> ID WEIGHT  REWEIGHT SIZE    USE    AVAIL %USE  VAR  PGS
>  3 0.90868  1.00000  930G   232G   698G 24.96 0.40 105
>  5 0.90868  1.00000  930G   139G   791G 14.99 0.24 139
>  6 0.90868  1.00000  930G 61830M   870G  6.49 0.10 138
>  0 0.90868  1.00000  930G   304G   625G 32.76 0.53 128
>  2 0.90868  1.00000  930G 24253M   906G  2.55 0.04 130
>  1 0.90868  1.00000  930G   793G   137G 85.22 1.37 162
>  4 0.90868  1.00000  930G   790G   140G 84.91 1.36 160
>  7 0.90868  1.00000  930G   803G   127G 86.34 1.39 144
> 10 0.90868  1.00000  930G   792G   138G 85.16 1.37 145
> 13 0.90868  1.00000  930G   811G   119G 87.17 1.40 163
> 15 0.90869  1.00000  930G   794G   136G 85.37 1.37 157
> 16 0.90869  1.00000  930G   757G   172G 81.45 1.31 159
> 17 0.90868  1.00000  930G   800G   129G 86.06 1.38 144
> 18 0.90869  1.00000  930G   786G   144G 84.47 1.36 166
> 19 0.90868  1.00000  930G   793G   137G 85.26 1.37 160
>              TOTAL  13958G  8683G  5274G 62.21
> MIN/MAX VAR: 0.04/1.40  STDDEV: 33.10
>
>
> ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 13.63019 root default
> -2  4.54338     host nodeB
>  3  0.90868         osd.3        up  1.00000          1.00000
>  5  0.90868         osd.5        up  1.00000          1.00000
>  6  0.90868         osd.6        up  1.00000          1.00000
>  0  0.90868         osd.0        up  1.00000          1.00000
>  2  0.90868         osd.2        up  1.00000          1.00000
> -3  4.54338     host nodeC
>  1  0.90868         osd.1        up  1.00000          1.00000
>  4  0.90868         osd.4        up  1.00000          1.00000
>  7  0.90868         osd.7        up  1.00000          1.00000
> 10  0.90868         osd.10       up  1.00000          1.00000
> 13  0.90868         osd.13       up  1.00000          1.00000
> -6  4.54343     host nodeD
> 15  0.90869         osd.15       up  1.00000          1.00000
> 16  0.90869         osd.16       up  1.00000          1.00000
> 17  0.90868         osd.17       up  1.00000          1.00000
> 18  0.90869         osd.18       up  1.00000          1.00000
> 19  0.90868         osd.19       up  1.00000          1.00000
>
> On Thu, Sep 1, 2016 at 10:56 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > Hello,
> >
> > On Thu, 1 Sep 2016 10:18:39 +0200 Ishmael Tsoaela wrote:
> >
> > > Hi All,
> > >
> > > Can someone please decipher these errors for me? After all nodes rebooted
> > > in my cluster on Monday, the warning has not gone away.
> > >
> > You really will want to spend more time reading documentation and this ML,
> > as well as using google to (re-)search things.
> > Like searching for "backfill_toofull", "near full", etc.
> >
> > > Will the warning ever clear?
> > >
> > Unlikely.
> >
> > In your previous mail you already mentioned a 92% full OSD, that should
> > combined with the various "full" warnings have impressed on you the need
> > to address this issue.
> >
> > When your nodes all rebooted, did everything come back up?
> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
> > time?
> > My guess is that some nodes/OSDs were restarted a lot later than others.
> >
> > See inline:
> >
> > >     cluster df3f96d8-3889-4baa-8b27-cc2839141425
> > >      health HEALTH_WARN
> > >             2 pgs backfill_toofull
> > Bad, Ceph wants to place data onto these 2 PGs, but their OSDs are too
> > full for that.
> > And until something changes it will be stuck there.
> >
> > Your best bet is to add more OSDs, since you seem to be short on space
> > anyway. Or delete unneeded data.
> > Given your level of experience, I'd advise against playing with weights
> > and the respective "full" configuration options.
> >
> > >             532 pgs backfill_wait
> > >             3 pgs backfilling
> > >             330 pgs degraded
> > >             537 pgs stuck unclean
> > >             330 pgs undersized
> > >             recovery 493335/3099981 objects degraded (15.914%)
> > >             recovery 1377464/3099981 objects misplaced (44.435%)
> > Are these numbers and the recovery io below still changing, moving along?
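(Generic aside, nothing specific to your cluster: an easy way to watch those
recovery counters move without re-running commands by hand is

    ceph -w

or something like "watch -n 30 ceph -s"; either will show the degraded and
misplaced percentages ticking down while backfill makes progress.)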
> >
> > >             8 near full osd(s)
> >
> > 8 out of 15, definitely needs more OSDs.
> > Output from "ceph osd df" and "ceph osd tree" please.
> >
> > >      monmap e7: 3 mons at {Monitors}
> > >             election epoch 118, quorum 0,1,2 nodeB,nodeC,nodeD
> > >      osdmap e3922: 15 osds: 15 up, 15 in; 537 remapped pgs
> >
> > Just to confirm, that's all the 15 OSDs your cluster ever had?
> >
> > Christian
> >
> > >             flags sortbitwise
> > >       pgmap v2431741: 640 pgs, 3 pools, 3338 GB data, 864 kobjects
> > >             8242 GB used, 5715 GB / 13958 GB avail
> > >             493335/3099981 objects degraded (15.914%)
> > >             1377464/3099981 objects misplaced (44.435%)
> > >                  327 active+undersized+degraded+remapped+wait_backfill
> > >                  205 active+remapped+wait_backfill
> > >                  103 active+clean
> > >                    3 active+undersized+degraded+remapped+backfilling
> > >                    2 active+remapped+backfill_toofull
> > >   recovery io 367 MB/s, 96 objects/s
> > >   client io 5699 B/s rd, 23749 B/s wr, 2 op/s rd, 12 op/s wr
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com