Re: Fast growing monstore during large recovery

We switched our mon stores to HDDs about a year ago and have been running that way since.  We haven't noticed any speed issues with HDDs on our mons compared to SSDs.


David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943


If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.


________________________________________
From: Ceph-large [ceph-large-bounces@xxxxxxxxxxxxxx] on behalf of Dan Van Der Ster [daniel.vanderster@xxxxxxx]
Sent: Tuesday, November 08, 2016 9:52 AM
To: Gregory Farnum
Cc: ceph-large@xxxxxxxxxxxxxx
Subject: Re: Fast growing monstore during large recovery

> On 8 Nov 2016, at 16:26, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Tue, Nov 8, 2016 at 7:13 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> Hi,
>>
>> Last Friday evening I got a call from a customer who had set his tunables to 'optimal' after seeing a warning.
>>
>> This 2,000-OSD (8PB) cluster was initially installed with Firefly and upgraded to Hammer and Jewel.
>>
>> His change caused an 88% degradation in the cluster, which he left running for over 5 hours, until the MON stores grew beyond 15GB and he called me.
>>
>> I eventually reverted the change, since another hour later we were at 26GB of MON store and only a few percent of additional recovery had completed.
>>
>> We had 50% of space (80GB) left on the MON stores and I wasn't convinced we would make it without running out of space on the MONs (5x), so I fetched the old CRUSHMap from an OSDMap and injected it back in. A few hours later we were back to HEALTH_OK.
>>
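For reference, the revert Wido describes boils down to something like the following sequence (an untested sketch; the epoch and file names are placeholders you would substitute yourself):

    # grab an OSDMap from an epoch before the tunables change
    $ ceph osd getmap <old-epoch> -o osdmap.old
    # extract the CRUSH map embedded in that OSDMap
    $ osdmaptool osdmap.old --export-crush crushmap.old
    # inject the old CRUSH map back into the live cluster
    $ ceph osd setcrushmap -i crushmap.old
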
>> What I learned is that the MON stores can grow quite fast and that they are also heavy on disk I/O.
>>
>> In this case the SSDs weren't the best (850 Pro, don't ask) and they couldn't keep up with all the changes. They are being swapped now for the Intel S3710 400GB and Samsung SM863 480GB (mixing vendors).
>>
>> The main reasons for the large SSDs:
>> - Performance
>> - Enough space to store a very large MON database
>>
>> Something to keep in mind with a large cluster. A big re-shuffle of data can lead to MON stores growing rather large.
>
> Did you work out why they got so big? Do the pg_temp entries and the
> increased OSDMap storage account for the extra space usage, or was
> there something else going on?

During a big reshuffle the PGs will stay degraded for ages, so the mons stop trimming until HEALTH_OK is restored.

In our big intervention to replace 3PB of servers with 6PB of new hardware, we did the migration rack by rack and waited for HEALTH_OK between each iteration so that the mons would trim.
IIRC, each rack would grow the mon leveldbs to ~35GB.
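
For anyone wanting to keep an eye on this during such an operation, a rough way to watch and partially reclaim space is something like the following (untested sketch). Note that compaction only squeezes out data the mons have already trimmed, so it won't shrink the store while the cluster is still degraded:

    # check the store size on each mon host (default data path)
    $ du -sh /var/lib/ceph/mon/*/store.db
    # ask a specific mon to compact its store (briefly stalls that mon)
    $ ceph tell mon.<id> compact

The "store is getting too big" health warning fires at mon_data_size_warn, which defaults to 15GB, so it may be no coincidence that 15GB was the point at which the customer picked up the phone.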

Wido: thanks for this reminder. Basically there is currently no safe way to change tunables on a large cluster without a huge SSD on each mon; any operation that would keep the cluster in a degraded state for more than a few days is out of the question.
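
One thing that can at least be done offline, before committing to such a change, is to estimate how much data would move: extract the current OSDMap, apply the new tunables to a copy of the CRUSH map with crushtool, and diff the resulting PG mappings. Roughly (an untested sketch, using chooseleaf_vary_r purely as an example tunable):

    $ ceph osd getmap -o om
    $ osdmaptool om --export-crush cm
    # apply the candidate tunable(s) to a copy of the CRUSH map
    $ crushtool -i cm --set-chooseleaf-vary-r 1 -o cm.new
    $ cp om om.new && osdmaptool om.new --import-crush cm.new
    # dump PG->OSD mappings for the old and new maps and compare
    $ osdmaptool om --test-map-pgs-dump > pgs.before
    $ osdmaptool om.new --test-map-pgs-dump > pgs.after
    $ diff pgs.before pgs.after | grep -c '^>'

That doesn't make the reshuffle itself any safer, but at least you know up front whether you are looking at hours or days of degraded PGs.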

-- Dan

_______________________________________________
Ceph-large mailing list
Ceph-large@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com
