On Thu, Mar 10, 2016 at 8:12 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Thu, 10 Mar 2016, Allen Samuels wrote:
>> > > Another important benefit of this is that the WAL code need only kick
>> > > in for operations that are less than 4K rather than the current 64K.
>> > > This is a big reduction in write-amplification for these operations
>> > > which should translate directly into improved throughput, especially
>> > > for such benchmark critical areas as random 4K writes...
>> > >
>> > > While not terribly useful on hybrid or HDD systems, the bitmap based
>> > > code has MUCH shorter CPU paths than does the current code. On an all
>> > > flash OSD system this will directly translate into more performance
>> > > (quantity unknown of course).
>> > >
>> > > While the memory consumption reduction is nice, for me the significant
>> > > performance improvement implied by virtual elimination of WAL is the
>> > > compelling factor.
>> >
>> > I wouldn't conflate the allocation size and freelist representation; we get the
>> > same avoidance of the WAL path with the current code by changing
>> > min_alloc_size to 4k. Making the freelist representation more efficient is
>> > important for small block sizes, but *any* memory-efficient strategy is fine
>> > (and even the current one is probably fine for most workloads). For
>> > example, we could keep an extent-based representation and page in regions
>> > instead...
>>
>> Perhaps I inartfully made my point. I just wanted to say that if you set
>> min_alloc_size to 4K you avoid the WAL stuff but you spend more memory;
>> in the worst case the memory consumption would be 16 * 320MB => 5GB per
>> TB of storage. While I agree that the true worst-case pattern requires a
>> pathological use-case, I am concerned that normal use-cases will still
>> consume unreasonably large amounts of memory -- leading to unreliable
>> systems.
>>
>> I believe that we simply don't want to be using that much memory for
>> this part of the system.
>> There are other tradeoffs (time, complexity,
>> latency, etc.) that could significantly reduce memory consumption. Let's
>> explore these.
>
> Agreed. :)
>
>> > The other thing to keep in mind is that the freelist is represented
>> > twice: once in memory in indexed form in StupidAllocator, and once in
>> > FreelistManager just to ensure it's persisted properly. In the second case, I
>> > suspect we could leverage a custom rocksdb merge operator to avoid that
>> > representation entirely so that adjacent extents are coalesced when the sst
>> > is generated (when writing to l0 or during compaction). I'm guessing that
>> > functionality isn't present in ZS though?
>>
>> Not at present. But since I control the developers, that is something
>> that could be added. Clearly it's possible to address the global
>> serialization of KV commits if a different mechanism was available for
>> representing the allocation lists in the KV store. Having some kind of
>> merging primitive allows that to be done. I was going to raise exactly
>> this issue yesterday, but tabled it. Let's continue this conversation.
>
> I haven't really done my homework with the existing rocksdb merge
> operators to confirm that they would work in this way. In particular, I
> think we want them to merge adjacent keys together (at least if we stick
> with the naive offset=length kv representation), and I'm afraid they might
> be intended to merge values with matching keys.

Yes, that's definitely my recollection of how those operators work. It's
a merge when sstables get put together; anything operating on adjacent
ranges would run into all kinds of problems across sstable boundaries!
-Greg

> In any case, though, my
> hope would be that we'd settle on a single strategy that would work across
> both ZS and rocksdb as it'd simplify our life a bit.
>
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
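
For readers following the thread, here is a rough sketch of what "coalescing adjacent extents" means for the naive offset=length representation discussed above. It is plain Python, not Ceph or rocksdb code; the class name ExtentFreelist and its methods are made up for illustration, and it assumes released extents never overlap an extent that is already free. The point it tries to show is that the coalescing step has to inspect the neighboring keys (the extent ending at our offset and the one starting at our end), whereas a merge that only combines values for a matching key never sees those neighbors.

```python
import bisect


class ExtentFreelist:
    """Toy offset -> length free map (hypothetical; not Ceph code)."""

    def __init__(self):
        self.offsets = []   # sorted start offsets of free extents
        self.lengths = {}   # start offset -> extent length

    def release(self, offset, length):
        """Mark [offset, offset + length) free, coalescing with neighbors.

        Assumes the range does not overlap an already-free extent.
        """
        i = bisect.bisect_left(self.offsets, offset)

        # Neighbor on the left: merge if it ends exactly where we start.
        if i > 0:
            prev = self.offsets[i - 1]
            if prev + self.lengths[prev] == offset:
                length += self.lengths.pop(prev)
                offset = prev
                del self.offsets[i - 1]
                i -= 1

        # Neighbor on the right: merge if it starts exactly where we end.
        if i < len(self.offsets) and self.offsets[i] == offset + length:
            nxt = self.offsets.pop(i)
            length += self.lengths.pop(nxt)

        self.offsets.insert(i, offset)
        self.lengths[offset] = length

    def extents(self):
        """Return the free extents as a sorted list of (offset, length)."""
        return [(o, self.lengths[o]) for o in self.offsets]


fl = ExtentFreelist()
fl.release(0, 4096)
fl.release(8192, 4096)
two_extents = fl.extents()      # two separate 4K extents, a 4K hole between
fl.release(4096, 4096)          # fills the hole, touching both neighbors
one_extent = fl.extents()       # a single extent covering all 12K
```

In this toy version both neighbors live in the same in-memory structure, so the lookup is trivial. The concern raised in the thread is that in an LSM store the two neighbors may sit in different sstables, which is where a key-adjacent merge gets awkward.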