Re: upmap balancer and consequences of osds briefly marked out

Hi Dylan,

The backfillfull_ratio, which defaults to 0.9, prevents backfilling
into an osd that is getting too full.
So the worst case is that some of your osds get up to 90% full, at
which point the upmap balancer should start putting things back into
place.
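To see where you stand relative to those thresholds, something like this
should work on a Luminous-or-later cluster (a sketch, not cluster-specific
advice; adjust the ratio to taste):

```shell
# Show the full / backfillfull / nearfull ratios stored in the OSDMap
ceph osd dump | grep ratio

# Adjust the backfillfull threshold if needed (value is a fraction);
# set-backfillfull-ratio is available from Luminous onward.
ceph osd set-backfillfull-ratio 0.90
```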

Also, check that your "mon osd down out subtree limit" is set
appropriately for your cluster. In our case, we set it to "host" -- we
don't want to automatically "out" all the osds from an entire host,
because this is normally something that we can quickly fix with a
manual intervention.
But I fear that wouldn't have helped in your case, because the
firewall issue probably downed a random subset of osds from several
hosts all at once.
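For reference, the subtree limit can be checked at runtime on a mon, and
persisted in ceph.conf (sketch only; the daemon socket name depends on your
mon id):

```shell
# Check the current value via the mon's admin socket:
ceph daemon mon.$(hostname -s) config get mon_osd_down_out_subtree_limit

# Persist it in ceph.conf under [mon]:
#   mon osd down out subtree limit = host
```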
We've also had this happen a couple of times, and now set "mon osd
down out interval = 3600" so that we have time to notice a network
outage and set noout on the cluster to prevent lots of rebalancing
carnage.
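Concretely, on a Luminous-era cluster that would look something like
(injectargs shown for a runtime change; put the option in ceph.conf to make
it permanent):

```shell
# Give yourself an hour before down OSDs are automatically marked out:
ceph tell mon.* injectargs '--mon-osd-down-out-interval 3600'

# During a known outage, stop automatic out-ing entirely:
ceph osd set noout

# ...and clear the flag once the OSDs are back up:
ceph osd unset noout
```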

Hope it helps,

Dan

On Fri, May 1, 2020 at 4:37 PM Dylan McCulloch <dmc@xxxxxxxxxxxxxx> wrote:
>
> Thanks Dan, that looks like a really neat method & script for a few use-cases. We've actually used several of the scripts in that repo over the years, so, many thanks for sharing.
>
> That method will definitely help in the scenario in which a set of unnecessary pg remaps have been triggered and can be caught early and reverted. I'm still a little concerned about the possibility of, for example, a brief network glitch occurring at night and then waking up to a full unbalanced cluster. Especially with NVMe clusters that can rapidly remap and rebalance (and for which we also have a greater impetus to squeeze out as much available capacity as possible with upmap due to cost per TB). It's just a risk I hadn't previously considered and was wondering if others have either run into it or felt any need to plan around it.
>
> Cheers,
> Dylan
>
>
> >From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
> >Sent: Friday, 1 May 2020 5:53 PM
> >To: Dylan McCulloch <dmc@xxxxxxxxxxxxxx>
> >Cc: ceph-users <ceph-users@xxxxxxx>
> >
> >Subject: Re:  upmap balancer and consequences of osds briefly marked out
> >
> >Hi,
> >
> >You're correct that all the relevant upmap entries are removed when an
> >OSD is marked out.
> >You can try to use this script which will recreate them and get the
> >cluster back to HEALTH_OK quickly:
> >https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
> >
> >Cheers, Dan
> >
> >
> >On Fri, May 1, 2020 at 9:36 AM Dylan McCulloch <dmc@xxxxxxxxxxxxxx> wrote:
> >>
> >> Hi all,
> >>
> >> We're using upmap balancer which has made a huge improvement in evenly distributing data on our osds and has provided a substantial increase in usable capacity.
> >>
> >> Currently on ceph version: 12.2.13 luminous
> >>
> >> We ran into a firewall issue recently which led to a large number of osds being briefly marked 'down' & 'out'. The osds came back 'up' & 'in' after about 25 mins and the cluster was fine but had to perform a significant amount of backfilling/recovery despite there being no end-user client I/O during that period.
> >>
> >> Presumably the large number of remapped pgs and backfills were due to pg_upmap_items being removed from the osdmap when osds were marked out and subsequently those pgs were redistributed using the default crush algorithm.
> >> As a result of the brief outage our cluster became significantly imbalanced again with several osds very close to full.
> >> Is there any reasonable mitigation for that scenario?
> >>
> >> The auto-balancer will not perform optimizations while there are degraded pgs, so it would only start reapplying pg upmap exceptions after initial recovery is complete (at which point capacity may be dangerously reduced).
> >> Similarly, as admins, we normally only apply changes when the cluster is in a healthy state, but if the same issue were to occur again would it be advisable to manually apply balancer plans while initial recovery is still taking place?
> >>
> >> I guess my concern from this experience is that making use of the capacity gained by using upmap balancer appears to carry some risk. i.e. it's possible for a brief outage to remove those space efficiencies relatively quickly and potentially result in full osds/cluster before the automatic balancer is able to resume and redistribute pgs using upmap.
> >>
> >> Curious whether others have any thoughts or experience regarding this.
> >>
> >> Cheers,
> >> Dylan
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx