On Wed, Jun 8, 2016 at 2:25 AM, Bartłomiej Święcki
<bartlomiej.swiecki@xxxxxxxxxxxx> wrote:
> Hi,
>
> I was recently trying to understand the growth of mon disk space usage
> during recovery in one of our clusters, and wanted to know whether we
> could reduce disk usage somehow or if we just have to prepare more
> space for our mons. The cluster is 0.94.6, just over 300 OSDs. Leveldb
> compaction does reduce space usage, but it quickly grows back to the
> previous level. What I found out is that most of the leveldb data is
> used by osdmap history.
>
> For each osdmap version leveldb contains both a full and an incremental
> entry, so I was wondering whether we really need to store full osdmaps
> for all versions. If we have incremental changes for every version
> anyway, wouldn't it be sufficient to keep only the first full version
> and then recover any later one by applying the incrementals?

Maybe not; we've gone back and forth on this, but I think we ended up
learning that reconstructing them was just annoying in terms of needing
to read all the extra keys.

> I was also trying to understand how Ceph figures out the range of
> osdmap versions to keep. After analyzing the code I thought the obvious
> answer was in PGMap::calc_min_last_epoch_clean(). In the case of our
> production cluster, the difference between the min and max clean epochs
> was around 30k during recovery, and the size of one full osdmap blob in
> leveldb is around 250k.

Yeah, there's not a lot that can be done about this directly. 30k maps
is an awful lot, though; you probably have other issues happening in
your OSDs (or monitors?).

> I also tried to test this on my dev cluster where I could run gdb
> (15 OSDs, 4 OSDs nearfull, and lots of misplaced objects). What I found
> out is that execution in OSDMonitor::get_trim_to() almost never jumped
> inside the first 'if'. mon->pgmon()->is_readable() returns false; I
> debugged it once and it was the result of false being returned by
> Paxos::is_lease_valid().

Okay, that's bad. If your lease isn't valid, then the monitors are
getting so bogged down that they're timing out the leases and
temporarily breaking quorum. You should figure out whether this is a
load issue, a result of clock skew, or something else.
-Greg
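
P.S. On the "reading all the extra keys" point: rebuilding a full map from
the oldest retained full copy plus incrementals would look roughly like the
sketch below. This is not the actual monitor code; the store interface, key
names, and helper types are all made up for illustration. The part that
makes it unattractive is the one read-decode-apply per intermediate epoch:

  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <string>

  // Hypothetical stand-ins for the mon's key/value store and for
  // OSDMap / OSDMap::Incremental; names and key formats are made up.
  struct MonStore {
    std::map<std::string, std::string> kv;  // key -> encoded blob
    std::string get(const std::string& key) const {
      auto it = kv.find(key);
      return it == kv.end() ? std::string() : it->second;
    }
  };

  struct FullMap {
    uint64_t epoch = 0;
    std::string state;  // placeholder for the decoded map contents
    void decode(const std::string& blob) { state = blob; }
    void apply_incremental(const std::string& inc) {
      state += "+" + inc;
      ++epoch;
    }
  };

  // Rebuild the full map at 'target' from the oldest retained full map
  // at 'base': one store read, decode, and apply per intermediate epoch,
  // so a 30k-epoch gap means ~30k key reads just to answer "give me full
  // map N".
  FullMap rebuild_full_map(const MonStore& store,
                           uint64_t base, uint64_t target) {
    FullMap m;
    m.decode(store.get("osdmap_full_" + std::to_string(base)));
    m.epoch = base;
    for (uint64_t e = base + 1; e <= target; ++e)
      m.apply_incremental(store.get("osdmap_" + std::to_string(e)));
    return m;
  }

  int main() {
    MonStore store;
    store.kv["osdmap_full_100"] = "full@100";
    for (uint64_t e = 101; e <= 105; ++e)
      store.kv["osdmap_" + std::to_string(e)] = "inc@" + std::to_string(e);
    FullMap m = rebuild_full_map(store, 100, 105);
    std::cout << "rebuilt epoch " << m.epoch << ": " << m.state << "\n";
  }

With a trim window of tens of thousands of epochs, that loop turns a single
"give me full map N" request into tens of thousands of leveldb reads.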
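
To put rough numbers on the trim range you measured, using your own figures
(about 30,000 epochs between the min and max clean epoch, and about 250 KB
per full osdmap blob):

  30,000 full maps x 250 KB per map ~= 7.5 GB

and that is for the full maps alone, before counting the incrementals
covering the same range. So the growth you're seeing is about what you'd
expect while the monitor can't trim that window.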
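
And for context on why an invalid lease points at load or clock skew: the
check in Paxos::is_lease_valid() is essentially a wall-clock comparison
against a lease-expiry time handed out by the leader, something like this
simplified sketch (made-up names, not the real code):

  #include <chrono>
  #include <iostream>

  // Simplified sketch of the kind of check Paxos::is_lease_valid()
  // performs; not the real code, and the names here are invented. The
  // essential point: a peon compares its local wall clock against a
  // lease-expiry timestamp granted by the leader.
  struct LeaseSketch {
    std::chrono::system_clock::time_point lease_expire;

    bool is_lease_valid() const {
      return std::chrono::system_clock::now() < lease_expire;
    }
  };

  int main() {
    LeaseSketch l;
    // Lease granted for 5 seconds from "now" on the leader's clock; if
    // this node's clock runs ahead by more than that, the lease already
    // looks expired here even though the leader is perfectly healthy.
    l.lease_expire = std::chrono::system_clock::now() +
                     std::chrono::seconds(5);
    std::cout << std::boolalpha << l.is_lease_valid() << "\n";
  }

Either the monitor falls far enough behind that it doesn't renew the lease
in time, or the clocks disagree enough that a fresh lease already looks
expired; both show up as is_lease_valid() returning false.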