Re: Fast growing monstore during large recovery

Dan Van Der Ster <daniel.vanderster@xxxxxxx> · Tue, 8 Nov 2016 16:52:43 +0000

> On 8 Nov 2016, at 16:26, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> 
> On Tue, Nov 8, 2016 at 7:13 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> Hi,
>> 
>> Last Friday evening I got a call from a customer who had set his tunables to 'optimal' after seeing a warning.
>> 
>> This 2,000 OSD (8PB) cluster was initially installed with Firefly and later upgraded to Hammer and then Jewel.
>> 
>> His change caused an 88% degradation in the cluster, which he left running for over 5 hours before the MON stores grew beyond 15GB and he called me.
>> 
>> I eventually reverted the change because an hour later we were at 26GB of MON store and only a few percent of additional recovery had been done.
>> 
>> We had 50% of space (80GB) left on the MON stores and I wasn't convinced we would make it without running out of space on the MONs (5x), so I fetched the old CRUSHMap from an OSDMap and injected it back in. A few hours later we were back to HEALTH_OK.
>> 
>> What I learned is that the MON stores can grow quite fast and are also heavy on disk I/O.
>> 
>> In this case the SSDs weren't the best (850 Pro, don't ask) and they couldn't keep up with all the changes. They are being swapped now for the Intel S3710 400GB and Samsung SM863 480GB (mixing vendors).
>> 
>> The main reasons for the large SSDs:
>> - Performance
>> - Enough space to store a very large MON database
>> 
>> Something to keep in mind with a large cluster. A big re-shuffle of data can lead to MON stores growing rather large.
> 
> Did you work out why they got so big? Does the pgtemp count and the
> increased OSDMap storage account for the extra space usage, or was
> there something else going on?

During a big reshuffle the PGs will stay degraded for ages, so the mons stop trimming until HEALTH_OK is restored.

In our big intervention to replace 3PB of servers with 6PB of new hardware, we did the migration rack-by-rack and waited for HEALTH_OK between each iteration so that the mons would trim.
IIRC, each rack would grow the mon leveldbs to ~35GB each.
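
The rack-by-rack pattern can be sketched roughly as below. This is only an illustration, not the tooling we actually used: migrate_in_batches, apply_batch and poll_seconds are hypothetical names, and the only real command assumed is `ceph health`, whose output starts with HEALTH_OK, HEALTH_WARN or HEALTH_ERR.

```python
import subprocess
import time


def cluster_health():
    """Return the first word printed by `ceph health`, e.g. HEALTH_OK."""
    out = subprocess.run(["ceph", "health"], capture_output=True,
                         text=True, check=True)
    return out.stdout.split()[0]


def migrate_in_batches(batches, apply_batch, get_health=cluster_health,
                       poll_seconds=60, sleep=time.sleep):
    """Apply one batch (e.g. one rack) at a time.

    After each batch, block until the cluster reports HEALTH_OK so the
    mons get a chance to trim before the next batch starts.
    apply_batch is a hypothetical callable supplied by the caller.
    """
    for batch in batches:
        apply_batch(batch)
        while get_health() != "HEALTH_OK":
            sleep(poll_seconds)
```

The health check and the sleep are passed in as parameters so the loop can be dry-run without a cluster. Note that it waits forever if the cluster never returns to HEALTH_OK, so a real script would want a timeout or an alert.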

Wido: thanks for this reminder. Basically there is currently no safe way to change tunables on a large cluster without a huge SSD on each mon. Any operation that would keep the cluster in a degraded state for more than a few days is out of the question.
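
For a rough feel of the headroom, here is a back-of-the-envelope sketch using Wido's figures. It assumes linear growth, reads "another hour later" as exactly one hour, and takes the 80GB free as measured at the 26GB point; none of that is guaranteed, and hours_until_full is just a hypothetical helper name.

```python
def hours_until_full(size_then_gb, size_now_gb, hours_between, free_gb):
    """Linear extrapolation of how long the free space lasts.

    Returns infinity if the store is not growing.
    """
    rate = (size_now_gb - size_then_gb) / hours_between  # GB per hour
    if rate <= 0:
        return float("inf")
    return free_gb / rate


# 15GB -> 26GB in one hour, with 80GB free on the mon disk
print(round(hours_until_full(15, 26, 1, 80), 1))  # 7.3
```

At that rate the mons would have run out of space well before a recovery that had only progressed a few percent could finish.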

-- Dan

_______________________________________________
Ceph-large mailing list
Ceph-large@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com