Re: ceph warning


Hello,

On Thu, 1 Sep 2016 10:18:39 +0200 Ishmael Tsoaela wrote:

> Hi All,
> 
> Can someone please decipher these errors for me? After all the nodes in
> my cluster rebooted on Monday, the warning has not cleared.
>
You really will want to spend more time reading the documentation and this
ML, as well as using Google to (re-)search things.
For example, searching for "backfill_toofull", "near full", etc.


> Will the warning ever clear?
> 
Unlikely.

In your previous mail you already mentioned a 92% full OSD; that, combined
with the various "full" warnings, should have impressed on you the need to
address this issue.

When your nodes all rebooted, did everything come back up?
And if so (as the "15 osds: 15 up, 15 in" suggests), how far apart in
time?
My guess is that some nodes/OSDs were restarted a lot later than others.

See inline:
> 
>   cluster df3f96d8-3889-4baa-8b27-cc2839141425
>      health HEALTH_WARN
>             2 pgs backfill_toofull
Bad: Ceph wants to move data onto these 2 PGs, but their target OSDs are
too full for that.
Until something changes, they will stay stuck there.

Your best bet is to add more OSDs, since you seem to be short on space
anyway. Or delete unneeded data.
Given your level of experience, I'd advise against playing with weights
and the respective "full" configuration options.
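For reference, these are the options in question, shown with their
upstream defaults (Jewel-era names); raising them only buys time and risks
an OSD hitting hard-full, which stops writes:

```ini
[global]
; Defaults shown for reference only -- raising these masks the real
; problem (not enough capacity) rather than fixing it.
mon osd nearfull ratio = 0.85   ; triggers the "near full" warning
mon osd full ratio = 0.95       ; cluster writes stop at this point

[osd]
osd backfill full ratio = 0.85  ; above this, backfill_toofull appears
```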

>             532 pgs backfill_wait
>             3 pgs backfilling
>             330 pgs degraded
>             537 pgs stuck unclean
>             330 pgs undersized
>             recovery 493335/3099981 objects degraded (15.914%)
>             recovery 1377464/3099981 objects misplaced (44.435%)
Are these numbers and the recovery IO below still changing, moving along?
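As a sanity check, those two percentages are simply the affected-object
count divided by the total object copies shown in the same lines:

```python
# Recompute the degraded/misplaced percentages from the status output
# above (counts taken verbatim from the "ceph -s" excerpt).
total = 3099981
degraded = 493335
misplaced = 1377464

degraded_pct = round(100 * degraded / total, 3)
misplaced_pct = round(100 * misplaced / total, 3)
print(degraded_pct, misplaced_pct)  # 15.914 44.435, matching the report
```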

>             8 near full osd(s)
8 out of 15; this cluster definitely needs more OSDs.
Output from "ceph osd df" and "ceph osd tree", please.

>      monmap e7: 3 mons at {Monitors}
>             election epoch 118, quorum 0,1,2 nodeB,nodeC,nodeD
>      osdmap e3922: 15 osds: 15 up, 15 in; 537 remapped pgs

Just to confirm: are those all the OSDs (15) your cluster ever had?

Christian

>             flags sortbitwise
>       pgmap v2431741: 640 pgs, 3 pools, 3338 GB data, 864 kobjects
>             8242 GB used, 5715 GB / 13958 GB avail
>             493335/3099981 objects degraded (15.914%)
>             1377464/3099981 objects misplaced (44.435%)
>                  327 active+undersized+degraded+remapped+wait_backfill
>                  205 active+remapped+wait_backfill
>                  103 active+clean
>                    3 active+undersized+degraded+remapped+backfilling
>                    2 active+remapped+backfill_toofull
> recovery io 367 MB/s, 96 objects/s
>   client io 5699 B/s rd, 23749 B/s wr, 2 op/s rd, 12 op/s wr
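Note that the pgmap line above shows why "near full" can appear while the
cluster as a whole still looks roomy: the average utilisation is modest,
but the ratios are checked per OSD, so an uneven data distribution trips
the warnings long before the average does. A quick sketch with the
numbers from the status output:

```python
# Average utilisation from the pgmap line: 8242 GB used of 13958 GB total.
used_gb, total_gb = 8242, 13958
avg = used_gb / total_gb
print(f"average utilisation: {avg:.0%}")  # ~59% across the cluster

# Yet 8 of 15 OSDs are already at or past the default 85% nearfull
# ratio: fullness is evaluated per OSD, not cluster-wide, so skewed
# placement triggers warnings well before the average looks bad.
```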


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


