Re: mon.mon01 store is getting too big! 18119 MB >= 15360 MB -- 94% avail

In particular, when using leveldb, stalls while reading or writing to
the store - typically, leveldb is compacting when this happens. This
leads to all sorts of timeouts being triggered, but the really annoying
one is the lease timeout, which tends to result in a flapping quorum.

Another symptom is monitors being unable to sync. Again, stalls on
leveldb lead to timeouts being triggered and the sync restarting.

Once upon a time, this *may* have also translated into large memory
consumption. A direct relation was never proven, though, and the
behaviour went away as Ceph became smarter and distros updated their
libraries.

My team suffered no small amount of pain due to persistent DB inflation, not just during topology churn.  RHCS 1.3.2 addressed that for us.  Before we applied that update I saw mon DBs grow as large as 54GB.

When measuring the size of /var/lib/ceph/mon/store.db, be careful not to blindly include the *.log or *LOG* files you may find there.  I set leveldb_log = /dev/null to suppress writing those, since they were confusing our metrics.

I also set mon_compact_on_start = true to compact each mon's leveldb at startup.  Anecdotally this was more effective than using ceph tell to compact during operation, as there was less contention.  It does mean, however, that one needs to be careful not to restart all the mons within a short span of time when the set of DBs across them is inflated.  It also seems that even after the mon log reports that compaction is complete, there is (as of Hammer) trimming still running silently in the background that impacts performance until it finishes, so one will see the store.db directory shrink further over time.
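
For reference, a minimal sketch of how those settings might look, assuming they live under the [mon] section of ceph.conf (the section placement and the mon ID below are my own placeholders), with the online compaction command shown for comparison:

    [mon]
        # route leveldb's own logging to /dev/null so the *LOG* files
        # don't inflate the apparent size of store.db
        leveldb log = /dev/null
        # compact each mon's leveldb when the daemon starts
        mon compact on start = true

    # online compaction of a single mon, for comparison; we saw more
    # contention this way than with compact-on-start
    ceph tell mon.<id> compact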

In my clusters of 450+ OSDs, 4GB is the arbitrary point above which I get worried.  Mind you, most of our mon DBs are still on (wince) LFF rotational drives, which doesn't help; I strongly advise faster storage for the DBs.  I found that the larger the DBs grow, the slower all mon operations become, which includes peering and especially backfill/recovery.  With a DB that large you may find that OSD loss or removal via attrition can cause significant client impact.
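
For what it's worth, a quick way to check the on-disk size while skipping the log files mentioned above (GNU du; the path assumes the default mon data dir layout):

    # per-mon store.db size, excluding leveldb's LOG files and any *.log
    du -sh --exclude='LOG*' --exclude='*.log' /var/lib/ceph/mon/*/store.db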

Inflation during recovery/backfill does still happen; it sounds as though the OP doubled the size of his/her cluster in one swoop, which is fairly drastic.  Early on with Dumpling I trebled the size of a cluster in one operation, and I still ache from the fallout.  A phased deployment will spread out the impact and allow the DBs to preen in between phases.

One approach is to add only one or a few drives per OSD server at a time, but in parallel across servers.  So if you were adding 10 servers of 12 OSDs each, think 6-12 steps of 10x1 or 10x2 OSDs.  That way the write workload is spread across 10 servers instead of funneling into just one, avoiding HBA saturation and the blocked requests that can result from it.

Adding the OSDs with 0 weight and using ceph osd crush reweight to bring them up in phases can also ease the process.  Early on we would allow each reweight to fully recover before the next step, but I've since found that peering is the biggest share of the impact, and that upweighting can proceed just as safely once peering from the previous adjustment has cleared.  This avoids moving some fraction of the data more than once.  With Jewel, backfill/recovery is improved to not shuffle data that doesn't really need to move, but with Hammer this decidedly helps avoid a bolus of slow requests as each new OSD comes up and peers.
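
Roughly what the zero-weight / phased reweight approach looks like with Hammer-era tooling; the OSD ID, hostname, and target weight of 3.64 below are placeholders of mine, not anything from the OP's cluster:

    # add the new OSD with zero CRUSH weight so it peers without taking
    # data (osd crush initial weight = 0 in ceph.conf does the same at
    # creation time)
    ceph osd crush add osd.120 0 host=node10

    # bring the weight up in steps, waiting only for peering to settle
    # (not full recovery) between adjustments; watch 'ceph -s' in between
    ceph osd crush reweight osd.120 1.0
    ceph osd crush reweight osd.120 2.0
    ceph osd crush reweight osd.120 3.64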

— Anthony

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
