Re: storing pg logs outside of rocksdb

The last time I tested universal compaction I saw mixed results with it. I think the problem is that if we are inserting new keys and deleting old ones in L0 we've already lost. We either need to keep those keys out of L0 entirely, or perhaps we can mitigate some of the impact if we simply update existing keys. Not sure on that last point, but it might be worth a try.

How about something like this:

In the RocksDB WAL, perform the compaction to L0 as usual, but provide a mechanism to flag certain entries as short-lived (maybe even by prefix or column family). After compacting the non-flagged entries in the buffer, remove the buffer from memory but leave it archived on disk and keep track of the remaining entries. Once tombstones (or overwrites) for all non-compacted entries in the archived buffer have been encountered, flag the archived log for deletion.

That way we still interleave the writes into the log, but keep logs archived until all short-lived data is tombstoned.
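
Very roughly, the per-buffer bookkeeping might look something like the sketch below (names are made up and this isn't tied to the real rocksdb WAL code): each archived buffer keeps the set of short-lived keys that were not compacted, and is flagged for deletion once that set is emptied by tombstones or overwrites.

#include <set>
#include <string>

// Hypothetical bookkeeping for one archived WAL buffer: the buffer stays on
// disk until every short-lived entry it still pins has been tombstoned or
// overwritten, at which point it can be deleted.
struct ArchivedBuffer {
  std::string path;                      // archived log file on disk
  std::set<std::string> remaining_keys;  // short-lived keys not compacted to L0

  // Called when a tombstone or overwrite for 'key' is seen.
  // Returns true once the buffer no longer pins any live data.
  bool note_tombstone(const std::string& key) {
    remaining_keys.erase(key);
    return remaining_keys.empty();       // safe to unlink the archived log
  }
};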

It seems rather difficult to write, but I think that's sort of the behavior we want.

Mark

On 03/28/2018 11:34 AM, Varada Kari wrote:
Agreed. I like the approaches. With the first approach, we could manage the space as a virtual container and let it keep growing in case someone wants to have a bigger trim window.

Wanted to check: instead of level compaction, what would be the impact of universal compaction? We would consume more space, but we could keep all of the entries in L0 files. For SSD backends we might observe some respite on write amplification, but there could be more space amplification.
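
For reference, switching to universal compaction is just an options change on the rocksdb instance; a minimal C++ sketch (the path and the 200% size-amplification limit are placeholders):

#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Open a rocksdb instance with universal compaction instead of leveled
// compaction: lower write amplification, but up to ~2x space amplification
// with the default settings.
rocksdb::DB* open_universal(const std::string& path) {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.compaction_style = rocksdb::kCompactionStyleUniversal;
  opts.compaction_options_universal.max_size_amplification_percent = 200;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, path, &db);
  return s.ok() ? db : nullptr;
}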

Varada

On Wed, Mar 28, 2018 at 7:01 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:


On 03/28/2018 08:05 AM, Varada Kari wrote:

On Wed, Mar 28, 2018 at 11:41 AM, xiaoyan li <wisher2003@xxxxxxxxx> wrote:

On Wed, Mar 28, 2018 at 9:43 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:

Hi Lisa, your presentation last week at Cephalocon was quite convincing.

Recordings aren't available yet, so perhaps you can share your slides.

Here are the slides:

https://drive.google.com/file/d/1WC0id77KWLNVllsEcJCgRgEQ-Xzvzqx8/view?usp=sharing

For those who weren't there, Lisa tested many configurations of rocksdb
with bluestore to attempt to keep the pg log out of level 0 in rocksdb,
and thus avoid a large source of write amplification.

None of these tunings were successful, so the conclusion was that the pg
log ought to be stored outside of rocksdb.

Lisa, what are your thoughts on how to store the pg log?

For historical reference, it was moved into leveldb originally to make
it easier to program against correctly [0], but the current PGLog code
has grown too complex despite that.

I once wondered whether we could just put the pg log in standalone log
files. Read performance is not critical, as they are only read when an
OSD node recovers. That is, store the other metadata in RocksDB and
store the pg log in standalone journal files (no transaction spanning
the other metadata and the pg log). But then I noticed that we can't
tell which OSD has the latest data if all 3 OSD nodes containing the
same pgs fail during a write request. Some OSDs may have updated data
and other OSDs may have un-updated data, while none of them have the
pg log entry appended. In this case we would need to compare the full objects.

We need an ordered set of pg log entries for recovery and peering.
If we store them as files, we need to remember somehow which entries
this osd contains and where to read them.

And if we keep them in BlueFS as you mentioned in the other
discussion, there will be more entries in the BlueFS log and the
compaction problem shifts from rocksdb to BlueFS.
Otherwise we have to store them on the BlueStore block device, which
would consume more onodes but lead to more fragmented space usage on
the block device.


There are two schemes that sort of seemed worth pursuing if we were to
seriously consider moving this data outside of rocksdb:

1) dedicated block space for per-pg ring buffers

2) something vaguely like the rocksdb WAL, except we don't keep entries in
memory, we don't cap the number of buffers, and we keep in-memory references
to each one and only delete old buffers once no PG references them anymore.

It seems to me that both schemes would avoid the write amp penalty we suffer
from now.  1 should be more space efficient but would more or less turn this
all into random IO.  2 would mean bouncing between the WAL and this external
log, which might be nearly as bad as 1.  It also could mean extremely high
space amp in unusual scenarios.
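
To make scheme 2 a bit more concrete, a rough sketch of the reference tracking (the types and names are hypothetical, nothing like this exists yet): each external log buffer remembers which PGs still have untrimmed entries in it, and the buffer is only deleted once that set is empty.

#include <cstdint>
#include <map>
#include <set>

// Scheme 2 bookkeeping sketch: append-only log buffers outside rocksdb,
// reference-counted by PG. A buffer is deleted only when no PG still has
// live entries in it.
struct ExternalLogRefs {
  std::map<uint64_t, std::set<uint64_t>> refs;  // buffer id -> referencing PGs

  void add_entry(uint64_t buffer_id, uint64_t pg_id) {
    refs[buffer_id].insert(pg_id);
  }

  // Called when a PG trims past everything it wrote into 'buffer_id'.
  void release(uint64_t buffer_id, uint64_t pg_id) {
    auto it = refs.find(buffer_id);
    if (it == refs.end()) return;
    it->second.erase(pg_id);
    if (it->second.empty())
      refs.erase(it);  // no PG references this buffer: safe to delete on disk
  }
};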

I think Josh's proposal from last time is worth thinking about too: Make
per-pg ring buffers in rocksdb itself so that you are updating existing
keys.
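
A minimal sketch of what that could look like (the key format and ring size are assumptions for illustration): entry v for a given PG always lands in slot v % ring_size, so a new entry overwrites an existing key instead of inserting a new key and deleting an old one.

#include <cstdint>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Per-pg ring buffer stored as rocksdb keys: a fixed number of key slots per
// PG, so pg log appends become updates of existing keys.
static const uint64_t RING_SIZE = 3000;  // assumed per-pg log length

rocksdb::Status put_pglog_entry(rocksdb::DB* db, const std::string& pgid,
                                uint64_t version, const std::string& entry) {
  const uint64_t slot = version % RING_SIZE;
  const std::string key = "pglog." + pgid + "." + std::to_string(slot);
  return db->Put(rocksdb::WriteOptions(), key, entry);  // update in place
}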

Mark




Another method I am investigating is whether in RocksDB we can use
FIFO compaction just for the pg log. That means we would need to handle
it per pg. This needs an update in RocksDB, and every pg log entry would
be written twice at most (once to the RocksDB log, and once to level 0).
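
For reference, rocksdb's FIFO compaction style is also just an options change; a rough sketch (the 1 GiB cap is a placeholder). The per-pg handling is the missing piece: FIFO drops the oldest files wholesale rather than trimming each pg's log individually.

#include <rocksdb/db.h>
#include <rocksdb/options.h>

// FIFO compaction: data never moves past L0. Once the total size of the L0
// files exceeds the limit, the oldest files are dropped, so each entry is
// written at most twice (once to the WAL, once in an L0 flush).
rocksdb::Options make_fifo_options() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.compaction_style = rocksdb::kCompactionStyleFIFO;
  opts.compaction_options_fifo.max_table_files_size =
      1ull * 1024 * 1024 * 1024;  // placeholder: cap pg log data at 1 GiB
  return opts;
}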

Ideally we could have one instance of rocksdb per PG and one column
family for the pglog entries; that way the entries would never leave
the rocksdb L0s, but it comes with its own baggage of managing multiple
instances and their resources, and might be a difficult thing to handle
in practice.
FIFO might solve the problem, but write amplification comes to 2 again
as you mentioned. Sorry, I don't have any bright ideas to support you.
We should explore separating all the pg log info into multiple column
families of their own and always pinning them to L0: either pass some
extra info to always compact them to L0 (never promote), or keep the
trimming intervals tight enough that they stay in L0. That can have more
backfilling impact, but given the improvements in recovery, should we
calibrate that and find out the cost of it compared to log-based recovery?
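
One way to prototype the "pin them to L0" idea (an assumption, not existing rocksdb support for pinning: this simply disables automatic compaction for a dedicated pg-log column family and relies on tight trimming to keep the L0 file count bounded):

#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Dedicated pg-log column family whose files stay in L0 because automatic
// compaction is disabled for it; the OSD would have to trim aggressively so
// the number of L0 files (and read amplification) stays bounded.
rocksdb::ColumnFamilyHandle* create_pglog_cf(rocksdb::DB* db) {
  rocksdb::ColumnFamilyOptions cf_opts;
  cf_opts.disable_auto_compactions = true;       // never promote past L0
  cf_opts.write_buffer_size = 16 * 1024 * 1024;  // placeholder memtable size
  rocksdb::ColumnFamilyHandle* handle = nullptr;
  rocksdb::Status s = db->CreateColumnFamily(cf_opts, "pglog", &handle);
  return s.ok() ? handle : nullptr;
}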

Varada

Any suggestions?

Josh

[0]

https://github.com/ceph/ceph/commit/1ef94200e9bce5e0f0ac5d1e563421a9d036c203



--
Best wishes
Lisa


