On Tue, 15 Jul 2014, Andrija Panic wrote:
> Hi Sage, since this problem is tunables-related, do we need to expect
> the same behaviour when we do regular data rebalancing caused by
> adding new / removing OSDs? I guess not, but would like your
> confirmation. I'm already on optimal tunables, but I'm afraid to test
> this by e.g. shutting down 1 OSD.

When you shut down a single OSD it is a relatively small amount of data
that needs to move to do the recovery. The issue with the tunables is
just that a huge fraction of the data stored needs to move, and the
performance impact is much higher.

sage

> Thanks,
> Andrija
>
>
> On 14 July 2014 18:18, Sage Weil <sweil at redhat.com> wrote:
>
>       I've added some additional notes/warnings to the upgrade and
>       release notes:
>
>       https://github.com/ceph/ceph/commit/fc597e5e3473d7db6548405ce347ca7732832451
>
>       If there is somewhere else where you think a warning flag would
>       be useful, let me know!
>
>       Generally speaking, we want to be able to cope with huge data
>       rebalances without interrupting service. It's an ongoing process
>       of improving the recovery vs client prioritization, though, and
>       removing sources of overhead related to rebalancing... and it's
>       clearly not perfect yet. :/
>
>       sage
>
>
>       On Sun, 13 Jul 2014, Andrija Panic wrote:
>
> > Hi,
> > after the ceph upgrade (0.72.2 to 0.80.3) I issued
> > "ceph osd crush tunables optimal" and, after only a few minutes, I
> > added 2 more OSDs to the CEPH cluster...
> >
> > So these 2 changes were more or less done at the same time:
> > rebalancing because of tunables optimal, and rebalancing because of
> > adding the new OSDs...
> >
> > The result: all VMs living on CEPH storage went mad, with
> > effectively no disk access; blocked, so to speak.
> >
> > Since this rebalancing took 5h-6h, I had a bunch of VMs down for
> > that long...
> >
> > Did I do wrong by causing "2 rebalancings" to happen at the same
> > time?
> > Is this behaviour normal, to cause a great load on all VMs because
> > they are unable to access CEPH storage effectively?
> >
> > Thanks for any input...
> > --
> >
> > Andrija Panić
>
>
> --
>
> Andrija Panić
> --------------------------------------
> http://admintweets.com
> --------------------------------------
>
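[Editor's note: one practical way to soften the impact described in this
thread is to throttle backfill/recovery before triggering a large data
movement such as a tunables change. A minimal ceph.conf sketch using
Firefly-era option names; the values here are illustrative, not
recommendations:]

```ini
# ceph.conf -- limit how aggressively OSDs rebalance, so that client
# I/O keeps being serviced while a large fraction of the data moves
[osd]
osd max backfills = 1          ; concurrent backfills per OSD (default 10)
osd recovery max active = 1    ; concurrent recovery ops per OSD (default 15)
osd recovery op priority = 1   ; recovery priority vs client ops (default 10)
```

[The same settings can be injected into a running cluster without a
restart, e.g. `ceph tell osd.* injectargs '--osd-max-backfills 1
--osd-recovery-max-active 1'`, and relaxed again once the rebalance
completes.]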