Re: Fast growing monstore during large recovery

Wido den Hollander <wido@xxxxxxxx> · Tue, 8 Nov 2016 19:04:51 +0100 (CET)

> On 8 November 2016 at 16:26, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> 
> 
> On Tue, Nov 8, 2016 at 7:13 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> > Hi,
> >
> > Last Friday evening I got a call from a customer who had set his tunables to 'optimal' because he saw a warning.
> >
> > This 2,000 OSD (8PB) cluster was initially installed with Firefly and later upgraded to Hammer and then Jewel.
> >
> > His change caused an 88% degradation in the cluster, which he left running for over 5 hours before the MON stores grew beyond 15GB and he called me.
> >
> > I eventually reverted the change since another hour later we were at 26GB of MON store and only a few percent additional recovery had been done.
> >
> > We had 50% of space (80GB) left on the MON stores and I wasn't convinced we would make it without running out of space on the MONs (5x), so I fetched the old CRUSHMap from an OSDMap and injected it back in. A few hours later we were back to HEALTH_OK.
> >
> > What I learned is that the MON stores can grow quite fast, but are also heavy on disk I/O.
> >
> > In this case the SSDs weren't the best (850 Pro, don't ask) and they couldn't keep up with all the changes. They are being swapped now for the Intel S3710 400GB and Samsung SM863 480GB (mixing vendors).
> >
> > The main reasons for the large SSDs:
> > - Performance
> > - Enough space to store a very large MON database
> >
> > Something to keep in mind with a large cluster. A big re-shuffle of data can lead to MON stores growing rather large.
> 
> Did you work out why they got so big? Does the pgtemp count and the
> increased OSDMap storage account for the extra space usage, or was
> there something else going on?

It was the OSDMaps with the pgtemp entries in them. This cluster has roughly 40k PGs in total.

I still have this output from an OSD's status:

    {
        "cluster_fsid": "5ec1249f-7a87-4b4a-b360-2b5d17e24984",
        "osd_fsid": "bf1bc4a5-b1fc-456b-a198-69c8e9e526d8",
        "whoami": 1588,
        "state": "active",
        "oldest_map": 143967,
        "newest_map": 158091,
        "num_pgs": 65
    }

As you can see, it has 14124 OSDMaps in its store (newest_map minus oldest_map), so these are also in the MONs' stores.
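
For anyone who wants to check the same thing on their own OSDs, here is a minimal Python sketch of the arithmetic. It assumes the status JSON has the fields shown above (I believe this is what the OSD admin socket prints for `ceph daemon osd.<id> status`, but treat that as an assumption). The function name `osdmap_epochs_held` is made up for illustration.

```python
import json

# OSD status output as pasted above.
status_json = """
{
    "cluster_fsid": "5ec1249f-7a87-4b4a-b360-2b5d17e24984",
    "osd_fsid": "bf1bc4a5-b1fc-456b-a198-69c8e9e526d8",
    "whoami": 1588,
    "state": "active",
    "oldest_map": 143967,
    "newest_map": 158091,
    "num_pgs": 65
}
"""


def osdmap_epochs_held(status: dict) -> int:
    """Span of OSDMap epochs the OSD still keeps (newest_map - oldest_map)."""
    return status["newest_map"] - status["oldest_map"]


status = json.loads(status_json)
print(f"osd.{status['whoami']} holds about {osdmap_epochs_held(status)} OSDMaps")
```

Counting both endpoints the store holds one more epoch than the difference, which does not matter at this scale.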

After all PGs became active+clean again the MONs trimmed quickly.

Right now changing tunables on this system is rather dangerous. We need a lot of space on the MONs and probably also on the OSDs to store all the old maps.

@Zoltan: Not many more I/O issues than during any other recovery. Most PGs were in backfill_wait.

@Dan: Indeed. But for the sake of it we are installing larger and faster OSDs.

@David: Odd. We saw near 100% utilization of the SSDs on the MONs. The 850 Pros are slow, dead slow, with sync writes. That's why they are being replaced with the S3710s and SM863s.
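
If you want a quick feel for how a device handles sync writes, here is a rough Python sketch that times small writes, each followed by an fsync. This is only an illustration and not a substitute for a proper benchmark tool; the function name, block size, and write count are arbitrary choices of mine, and the result also depends on the filesystem and any caches in between.

```python
import os
import tempfile
import time


def sync_write_latency(directory: str, writes: int = 100, block_size: int = 4096) -> float:
    """Return the mean seconds per write when every write is followed by fsync."""
    block = b"\0" * block_size
    # Create a scratch file on the filesystem under test.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, block)
            os.fsync(fd)  # force the data to stable storage before continuing
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return elapsed / writes


if __name__ == "__main__":
    # Point this at a directory on the SSD you want to test.
    latency = sync_write_latency(tempfile.gettempdir())
    print(f"mean sync write latency: {latency * 1e6:.0f} us")
```

Run it against a directory on the MON's SSD and compare between drive models; a drive that is slow here will likely struggle under heavy MON store updates.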

Wido

> -Greg
> 
> >
> > Just wanted to share this.
> >
> > Wido
> > _______________________________________________
> > Ceph-large mailing list
> > Ceph-large@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com