On Wed, Jun 8, 2016 at 2:25 AM, Bartłomiej Święcki
<bartlomiej.swiecki@xxxxxxxxxxxx> wrote:
> Hi,
>
> I was recently trying to understand the growth of mon disk space usage
> during recovery in one of our clusters, and wanted to know whether we
> could reduce disk usage somehow or if we just have to prepare more
> space for our mons. The cluster is 0.94.6, just over 300 OSDs. Leveldb
> compaction does reduce space usage, but it quickly grows back to the
> previous level. What I found out is that most of the leveldb data is
> used by osdmap history.
>
> For each osdmap version leveldb contains both a full and an incremental
> entry, so I was wondering whether we really need to store full osdmaps
> for all versions. If we have incremental changes for every version
> anyway, wouldn't it be sufficient to keep only the first full version
> and then recover any later one by applying the incrementals?

Maybe not; we've gone back and forth on this, but I think we ended up
learning that reconstructing them was just annoying in terms of needing
to read all the extra keys.

> I was also trying to understand how Ceph figures out the range of
> osdmap versions to keep. After analyzing the code I thought the obvious
> answer was in PGMap::calc_min_last_epoch_clean(). In the case of our
> production cluster, the difference between the min and max clean epochs
> was around 30k during recovery, and the size of one full osdmap blob in
> leveldb is around 250k.

Yeah, there's not a lot that can be done about this directly. 30k maps
is an awful lot, though; you probably have other issues happening in
your OSDs (or monitors?).

> I also tried to test this on my dev cluster where I could run gdb
> (15 OSDs, 4 OSDs nearfull, and lots of misplaced objects). What I found
> out is that execution in OSDMonitor::get_trim_to() almost never jumped
> inside the first 'if'. mon->pgmon()->is_readable() returns false; I
> debugged it once and it was the result of false being returned by
> Paxos::is_lease_valid().

Okay, that's bad. If your lease isn't valid, then the monitors are
getting so bogged down that they're timing out the leases and
temporarily breaking quorum. You should figure out whether this is a
load issue, a result of clock skew, or something else.
-Greg
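
P.S. On the "reading all the extra keys" point: rebuilding a full map from
the oldest retained full copy plus incrementals would look roughly like the
sketch below. This is not the actual monitor code; the store interface, key
names, and helper types are all made up for illustration. The part that
makes it unattractive is the one read-decode-apply per intermediate epoch:

  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <string>

  // Hypothetical stand-ins for the mon's key/value store and for
  // OSDMap / OSDMap::Incremental; names and key formats are made up.
  struct MonStore {
    std::map<std::string, std::string> kv;  // key -> encoded blob
    std::string get(const std::string& key) const {
      auto it = kv.find(key);
      return it == kv.end() ? std::string() : it->second;
    }
  };

  struct FullMap {
    uint64_t epoch = 0;
    std::string state;  // placeholder for the decoded map contents
    void decode(const std::string& blob) { state = blob; }
    void apply_incremental(const std::string& inc) {
      state += "+" + inc;
      ++epoch;
    }
  };

  // Rebuild the full map at 'target' from the oldest retained full map
  // at 'base': one store read, decode, and apply per intermediate epoch,
  // so a 30k-epoch gap means ~30k key reads just to answer "give me full
  // map N".
  FullMap rebuild_full_map(const MonStore& store,
                           uint64_t base, uint64_t target) {
    FullMap m;
    m.decode(store.get("osdmap_full_" + std::to_string(base)));
    m.epoch = base;
    for (uint64_t e = base + 1; e <= target; ++e)
      m.apply_incremental(store.get("osdmap_" + std::to_string(e)));
    return m;
  }

  int main() {
    MonStore store;
    store.kv["osdmap_full_100"] = "full@100";
    for (uint64_t e = 101; e <= 105; ++e)
      store.kv["osdmap_" + std::to_string(e)] = "inc@" + std::to_string(e);
    FullMap m = rebuild_full_map(store, 100, 105);
    std::cout << "rebuilt epoch " << m.epoch << ": " << m.state << "\n";
  }

With a trim window of tens of thousands of epochs, that loop turns a single
"give me full map N" request into tens of thousands of leveldb reads.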
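
To put rough numbers on the trim range you measured, using your own figures
(about 30,000 epochs between the min and max clean epoch, and about 250 KB
per full osdmap blob):

  30,000 full maps x 250 KB per map ~= 7.5 GB

and that is for the full maps alone, before counting the incrementals
covering the same range. So the growth you're seeing is about what you'd
expect while the monitor can't trim that window.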
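
And for context on why an invalid lease points at load or clock skew: the
check in Paxos::is_lease_valid() is essentially a wall-clock comparison
against a lease-expiry time handed out by the leader, something like this
simplified sketch (made-up names, not the real code):

  #include <chrono>
  #include <iostream>

  // Simplified sketch of the kind of check Paxos::is_lease_valid()
  // performs; not the real code, and the names here are invented. The
  // essential point: a peon compares its local wall clock against a
  // lease-expiry timestamp granted by the leader.
  struct LeaseSketch {
    std::chrono::system_clock::time_point lease_expire;

    bool is_lease_valid() const {
      return std::chrono::system_clock::now() < lease_expire;
    }
  };

  int main() {
    LeaseSketch l;
    // Lease granted for 5 seconds from "now" on the leader's clock; if
    // this node's clock runs ahead by more than that, the lease already
    // looks expired here even though the leader is perfectly healthy.
    l.lease_expire = std::chrono::system_clock::now() +
                     std::chrono::seconds(5);
    std::cout << std::boolalpha << l.is_lease_valid() << "\n";
  }

Either the monitor falls far enough behind that it doesn't renew the lease
in time, or the clocks disagree enough that a fresh lease already looks
expired; both show up as is_lease_valid() returning false.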