Thanks for the tips
A single OSD was indeed 95% full, and after removing it there is 24TB of usable space and everything is working again. :D
I hope no other OSD hits 95% during the backfilling as well.
It's a bit odd that with ~140 OSDs a single full one can take everything down with it.
I would understand it if, since 8/2 erasure coding spreads data over 10 disks, a full disk in that set meant the capacity of the other 9 couldn't be used.
But it seems the cluster can only use free capacity up to the level of the fullest OSD in the whole cluster.
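For reference, next time I'll probably check per-OSD utilization earlier and nudge data off the fullest one before it trips the full ratio. Something along these lines (the osd id 137 and the 0.95 weight are just placeholders, not recommendations):

# show per-OSD utilization to spot anything approaching the full ratio
ceph osd df

# lower the override weight of the fullest OSD so backfill moves data off it
ceph osd reweight 137 0.95

# or let Ceph pick the candidates automatically
ceph osd reweight-by-utilization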
On Thu, May 9, 2019 at 1:25 PM Paul Emmerich <paul.emmerich@xxxxxxxx> wrote:
One full OSD stops everything.
You can change what's considered 'full'; the default is 95%:
ceph osd set-full-ratio 0.95
Never let an OSD run 100% full; that will lead to lots of real
problems. 95% is a good default (it's not exact: some metadata might
not always be accounted for, or it might temporarily need more).
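For reference, you can check the currently configured thresholds, and the related nearfull/backfillfull ratios are adjusted the same way (the values below are just examples):

# show the configured full / backfillfull / nearfull ratios
ceph osd dump | grep -i ratio

# companion thresholds, set the same way as the full ratio
ceph osd set-backfillfull-ratio 0.90
ceph osd set-nearfull-ratio 0.85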
A quick and dirty work-around if only one OSD is full: take it down ;)
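Something along these lines, with osd id 42 as a placeholder (marking it out lets its PGs get remapped to other OSDs; the systemd unit name can vary by deployment):

# mark the full OSD out so its data is remapped elsewhere
ceph osd out 42

# and/or stop the daemon on its host
systemctl stop ceph-osd@42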
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Thu, May 9, 2019 at 2:08 PM Kári Bertilsson <karibertils@xxxxxxxxx> wrote:
>
> Hello
>
> I am running CephFS with 8/2 erasure coding. I had about 40TB usable free (110TB raw), one small disk crashed and I added 2x10TB disks. Now it's backfilling & recovering with 0B free and I can't read a single file from the file system...
>
> This happened with max backfills at 4, but I have increased max backfills to 128 to hopefully get this over with a little faster, since the system has been unusable for 12 hours anyway. Not sure yet if that was a good idea.
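> For reference, I bumped it at runtime roughly like this (the injectargs syntax is just what I had at hand; newer clusters may prefer ceph config set):
>
> # raise per-OSD backfill concurrency at runtime; 128 is aggressive
> ceph tell osd.* injectargs '--osd_max_backfills 128'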
>
> 131TB of raw space was somehow not enough to keep things running. Any tips to avoid this kind of scenario in the future?
>
> GLOBAL:
>     SIZE       AVAIL     RAW USED     %RAW USED
>     489TiB     131TiB    358TiB       73.17
> POOLS:
>     NAME                 ID     USED        %USED      MAX AVAIL     OBJECTS
>     ec82_pool            41     278TiB      100.00     0B            28549450
>     cephfs_metadata      42     174MiB      0.04       381GiB        666939
>     rbd                  51     99.3GiB     20.68      381GiB        25530
>
>   data:
>     pools:   3 pools, 704 pgs
>     objects: 29.24M objects, 278TiB
>     usage:   358TiB used, 131TiB / 489TiB avail
>     pgs:     1265432/287571907 objects degraded (0.440%)
>              12366014/287571907 objects misplaced (4.300%)
>              536 active+clean
>              137 active+remapped+backfilling
>              27  active+undersized+degraded+remapped+backfilling
>              4   active+remapped+backfill_toofull
>
>   io:
>     client:   64.0KiB/s wr, 0op/s rd, 7op/s wr
>     recovery: 1.17GiB/s, 113objects/s
>
> Is there anything I can do to restore reading? I can understand writing not working, but why is it blocking reading also? Any tips?
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com