Re: storing pg logs outside of rocksdb

On Wed, 28 Mar 2018, Mark Nelson wrote:
> The last time I tested universal compaction I saw mixed results with it.  I
> think the problem is that if we are inserting new keys and deleting old ones
> in L0 we've already lost.  We either need to keep those keys out of L0
> entirely, or perhaps we can mitigate some of the impact if we are simply updating
> existing keys.  Not sure on that last point, but it might be worth a try.
> 
> How about something like this:
> 
> In the RocksDB WAL, perform the compaction to L0 as usual, but provide a
> mechanism to flag certain entries as short lived (maybe even by prefix or
> column family).  After compacting the non-flagged entries in the buffer, remove
> the buffer from memory but leave it archived on disk and keep track of the
> remaining entries.  Once tombstones (or overwrites) for all non-compacted
> entries in the archived buffer have been encountered, flag the archived log for
> deletion.
> 
> That way we still interleave the writes into the log, but keep logs archived
> until all short lived data is tombstoned.
> 
> It seems rather difficult to write, but I think that's sort of the behavior we
> want.
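> 
> A rough sketch of the bookkeeping, in C++ (nothing here is existing
> rocksdb API; the struct and the hook are hypothetical):
> 
>   #include <cstdint>
>   #include <map>
>   #include <set>
>   #include <string>
> 
>   // One record per WAL segment we kept archived on disk.
>   struct ArchivedWal {
>     uint64_t file_number;        // on-disk WAL segment
>     std::set<std::string> live;  // flagged short-lived keys not yet dead
>   };
> 
>   std::map<uint64_t, ArchivedWal> archived;  // indexed by file_number
> 
>   // Hypothetical hook, called when a tombstone (or overwrite) for a
>   // flagged key is seen.
>   void note_key_dead(uint64_t file_number, const std::string& key) {
>     auto it = archived.find(file_number);
>     if (it == archived.end())
>       return;
>     it->second.live.erase(key);
>     if (it->second.live.empty())
>       archived.erase(it);  // every short-lived entry is gone; delete segment
>   }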

Agree.. if rocksdb did this, it *sounds* perfect.  The only issues I see
are (1) we'd need to periodically rewrite log entries that get really
old (i.e., idle pgs) in order to allow rocksdb to kill off the wal files
(or give it a second threshold so that if the keys live for way too
long it will eventually just put them in L0), and (2) it doesn't help if
the problem isn't the compaction itself but the CPU overhead of feeding
the data through rocksdb and having it get copied into memtables and
indexes and so on.

Given that we aren't limited by IOPS on SSD, I think the more modest 
approach of writing the pg log entries to independent log(s) (and spending 
an extra IO) is a better bet.
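
Roughly, the shape of that extra IO (everything here is hypothetical, just
a sketch):

  #include <string>
  #include <unistd.h>

  // Append the encoded pg log entry to its own append-only log (per pg,
  // or per group of pgs) and sync it, alongside the rocksdb commit for
  // the rest of the transaction.
  int append_pg_log_entry(int log_fd, const std::string& encoded_entry) {
    ssize_t r = ::write(log_fd, encoded_entry.data(), encoded_entry.size());
    if (r < 0 || (size_t)r != encoded_entry.size())
      return -1;
    return ::fdatasync(log_fd);  // the extra IO we're willing to spend
  }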

I don't see a realistic opportunity to do some magic to get the 
interleaved behavior without significant rocksdb changes, though.

sage


> 
> Mark
> 
> On 03/28/2018 11:34 AM, Varada Kari wrote:
> > Agree. I like the approaches. With the first approach, we could manage the
> > space as a virtual container and let it keep growing in case someone
> > wants to have a bigger trim window.
> > 
> > Wanted to check: instead of level compaction, what would be the impact of
> > universal compaction? We would consume more space, but we could keep all
> > of the entries in L0 files. For SSD backends we might observe some
> > respite on the write amplification, but there could be more space
> > amplification.
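> > 
> > For reference, the knobs I have in mind (these are real rocksdb options;
> > the values are made up):
> > 
> >   #include <rocksdb/options.h>
> > 
> >   rocksdb::Options opts;
> >   opts.compaction_style = rocksdb::kCompactionStyleUniversal;
> >   // Universal compaction trades space amp for lower write amp; these
> >   // knobs control how aggressively sorted runs are merged.
> >   opts.compaction_options_universal.size_ratio = 10;
> >   opts.compaction_options_universal.max_size_amplification_percent = 200;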
> > 
> > Varada
> > 
> > On Wed, Mar 28, 2018 at 7:01 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> > > 
> > > 
> > > On 03/28/2018 08:05 AM, Varada Kari wrote:
> > > > 
> > > > On Wed, Mar 28, 2018 at 11:41 AM, xiaoyan li <wisher2003@xxxxxxxxx>
> > > > wrote:
> > > > > 
> > > > > On Wed, Mar 28, 2018 at 9:43 AM, Josh Durgin <jdurgin@xxxxxxxxxx>
> > > > > wrote:
> > > > > > 
> > > > > > Hi Lisa, your presentation last week at Cephalocon was quite
> > > > > > convincing.
> > > > > > 
> > > > > > Recordings aren't available yet, so perhaps you can share your
> > > > > > slides.
> > > > > 
> > > > > Here are the slides:
> > > > > 
> > > > > https://drive.google.com/file/d/1WC0id77KWLNVllsEcJCgRgEQ-Xzvzqx8/view?usp=sharing
> > > > > > 
> > > > > > For those who weren't there, Lisa tested many configurations of
> > > > > > rocksdb
> > > > > > with bluestore to attempt to keep the pg log out of level 0 in
> > > > > > rocksdb,
> > > > > > and thus avoid a large source of write amplification.
> > > > > > 
> > > > > > None of these tunings were successful, so the conclusion was that
> > > > > > the pg
> > > > > > log ought to be stored outside of rocksdb.
> > > > > > 
> > > > > > Lisa, what are your thoughts on how to store the pg log?
> > > > > > 
> > > > > > For historical reference, it was moved into leveldb originally to
> > > > > > make
> > > > > > it easier to program against correctly [0], but the current PGLog
> > > > > > code
> > > > > > has grown too complex despite that.
> > > > > 
> > > > > I have wondered whether we can just put the pg log in standalone log
> > > > > files. Read performance is not critical, as they are only read when an
> > > > > OSD node recovers. That is, store other metadata in RocksDB and then
> > > > > store the pg log in standalone journal files (no transaction spanning
> > > > > the other metadata and the pg log). But then I noticed that we can't
> > > > > tell which OSD has the latest data if the 3 OSD nodes containing the
> > > > > same pgs fail during a write request: some OSDs may have the updated
> > > > > data, and other OSDs may have un-updated data, while none of them has
> > > > > the pg log entry appended. In this case, we would need to compare the
> > > > > full objects.
> > > > > 
> > > > We need an ordered set of pg log entries for recovery and peering.
> > > > If we store them as files, we need to remember in
> > > > some way which entries this osd contains and where to read them.
> > > > 
> > > > And if we keep them in BlueFS as you mentioned in the other
> > > > discussion, there will be more entries in the bluefs log, and the
> > > > problem of compaction shifts from rocksdb to BlueFS.
> > > > Otherwise we have to store them in the BlueStore block device, which
> > > > would consume more onodes and lead to
> > > > more fragmented space usage on the block device.
> > > 
> > > 
> > > There are two schemes that sort of seemed worth pursuing if we were to
> > > seriously consider moving this data outside of rocksdb:
> > > 
> > > 1) dedicated block space for per-pg ring buffers (rough sketch after this
> > > list)
> > > 
> > > 2) something vaguely like the rocksdb WAL but we don't keep entries in
> > > memory, we don't cap the number of buffers, and we keep in-memory
> > > references
> > > to each one and only delete old buffers once no PG references them
> > > anymore.
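> > > 
> > > For (1), a rough sketch of the layout (all hypothetical, just to make
> > > the idea concrete):
> > > 
> > >   #include <cstdint>
> > > 
> > >   // Preallocated region per pg on the block device, divided into
> > >   // fixed-size slots written round-robin by entry sequence number.
> > >   struct PgRingBuffer {
> > >     uint64_t base_offset;  // start of this pg's region
> > >     uint32_t slot_size;    // max encoded size of one pg log entry
> > >     uint32_t nslots;       // trim window, in entries
> > >   };
> > > 
> > >   // Byte offset where the entry with sequence number 'seq' lands.
> > >   uint64_t slot_offset(const PgRingBuffer& rb, uint64_t seq) {
> > >     return rb.base_offset + (seq % rb.nslots) * uint64_t(rb.slot_size);
> > >   }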
> > > 
> > > It seems to me that both schemes would avoid the write amp penalty we
> > > suffer from now.  (1) should be more space efficient but would more or
> > > less turn this all into random IO.  (2) would mean bouncing between the
> > > WAL and this external log, which might be nearly as bad as (1).  It also
> > > could mean extremely high space amp in unusual scenarios.
> > > 
> > > I think Josh's proposal from last time is worth thinking about too: Make
> > > per-pg ring buffers in rocksdb itself so that you are updating existing
> > > keys.
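> > > 
> > > i.e. something like this (the key layout is made up for illustration):
> > > 
> > >   #include <cstdint>
> > >   #include <string>
> > >   #include <rocksdb/db.h>
> > > 
> > >   // A fixed set of keys per pg: seq % ring_size reuses the same key
> > >   // once the ring wraps, so appends become overwrites of existing keys.
> > >   rocksdb::Status put_log_entry(rocksdb::DB* db, const std::string& pgid,
> > >                                 uint64_t seq, uint32_t ring_size,
> > >                                 const std::string& encoded_entry) {
> > >     std::string key = "pglog." + pgid + "." + std::to_string(seq % ring_size);
> > >     return db->Put(rocksdb::WriteOptions(), key, encoded_entry);
> > >   }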
> > > 
> > > Mark
> > > 
> > > 
> > > 
> > > > 
> > > > > Another method I am investigating is whether in RocksDB we can
> > > > > use FIFO compaction just for the pg log. That means we need to handle
> > > > > it per pg. This needs updates in RocksDB, and every pg log entry will
> > > > > be written twice at most (once to the RocksDB log, and once to level 0).
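> > > > > 
> > > > > Something like this for a dedicated pg log column family (the FIFO
> > > > > options are real rocksdb API; the size budget is made up):
> > > > > 
> > > > >   #include <rocksdb/options.h>
> > > > > 
> > > > >   rocksdb::ColumnFamilyOptions cf_opts;
> > > > >   cf_opts.compaction_style = rocksdb::kCompactionStyleFIFO;
> > > > >   // FIFO compaction keeps all files in L0 and simply drops the
> > > > >   // oldest files once the total size exceeds this budget.
> > > > >   cf_opts.compaction_options_fifo.max_table_files_size = 512ull << 20;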
> > > > > 
> > > > Ideally, if we could have one instance of rocksdb per PG and one column
> > > > family for the pglog entries, we could keep all the entries from ever
> > > > leaving the rocksdb L0s, but that comes with its own baggage of managing
> > > > multiple instances and a lot more resources; practically it might be a
> > > > difficult thing to handle.
> > > > FIFO might solve the problem, but write amplification comes to 2
> > > > again, as you mentioned. Sorry, I don't have any bright ideas to support
> > > > you. We need to explore some ideas about separating all the pg log
> > > > info into multiple column families of their own and always pinning them
> > > > to L0: either passing some extra info to always compact them to
> > > > L0 (never promoting), or keeping the trimming intervals tight enough to
> > > > keep them in L0. That can have more backfilling impact, but given the
> > > > improvements in recovery, should we calibrate that and find out the
> > > > cost of it compared to log-based recovery?
> > > > 
> > > > Varada
> > > > 
> > > > > Any suggestions?
> > > > > 
> > > > > > Josh
> > > > > > 
> > > > > > [0]
> > > > > > 
> > > > > > https://github.com/ceph/ceph/commit/1ef94200e9bce5e0f0ac5d1e563421a9d036c203
> > > > > 
> > > > > 
> > > > > 
> > > > > --
> > > > > Best wishes
> > > > > Lisa
> > > > 
> > > 
> > > 
> > 


