Re: storing pg logs outside of rocksdb

Agree. I like both approaches. For the first approach, we could manage the
space as a virtual container and let it keep growing in case someone
wants a bigger trim window.

I also wanted to check: instead of level compaction, what would be the
impact of universal compaction? We would consume more space, but we could
keep all of the entries in L0 files. For SSD backends we might see some
relief on write amplification, though with more space amplification.
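
As a rough illustration (untested; the option names are from rocksdb's
options headers, the values are just placeholders), the kind of tuning I
have in mind would be:

  #include <rocksdb/options.h>

  rocksdb::Options opts;
  // universal compaction instead of level compaction
  opts.compaction_style = rocksdb::kCompactionStyleUniversal;
  // keep all data in L0 rather than pushing it down the levels
  opts.num_levels = 1;
  // tolerate more space amp in exchange for fewer rewrites
  opts.compaction_options_universal.max_size_amplification_percent = 400;
  opts.compaction_options_universal.size_ratio = 10;
  // let more L0 files accumulate before compaction kicks in
  opts.level0_file_num_compaction_trigger = 8;

Whether the extra space amplification is acceptable would need measuring
on the same workloads Lisa used.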

Varada

On Wed, Mar 28, 2018 at 7:01 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
>
> On 03/28/2018 08:05 AM, Varada Kari wrote:
>>
>> On Wed, Mar 28, 2018 at 11:41 AM, xiaoyan li <wisher2003@xxxxxxxxx> wrote:
>>>
>>> On Wed, Mar 28, 2018 at 9:43 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>>>>
>>>> Hi Lisa, your presentation last week at Cephalocon was quite convincing.
>>>>
>>>> Recordings aren't available yet, so perhaps you can share your slides.
>>>
>>> Here are the slides:
>>>
>>> https://drive.google.com/file/d/1WC0id77KWLNVllsEcJCgRgEQ-Xzvzqx8/view?usp=sharing
>>>>
>>>> For those who weren't there, Lisa tested many configurations of rocksdb
>>>> with bluestore to attempt to keep the pg log out of level 0 in rocksdb,
>>>> and thus avoid a large source of write amplification.
>>>>
>>>> None of these tunings were successful, so the conclusion was that the pg
>>>> log ought to be stored outside of rocksdb.
>>>>
>>>> Lisa, what are your thoughts on how to store the pg log?
>>>>
>>>> For historical reference, it was moved into leveldb originally to make
>>>> it easier to program against correctly [0], but the current PGLog code
>>>> has grown too complex despite that.
>>>
>>> I have wondered whether we could just put the pg log in standalone log
>>> files. Read performance is not critical, as the entries are only read
>>> when an OSD recovers. That is, store the other metadata in RocksDB and
>>> store the pg log in standalone journal files (with no shared
>>> transaction between the other metadata and the pg log). But then I
>>> noticed that we can't tell which OSD has the latest data if the 3 OSDs
>>> holding the same pg all fail during a write request: some OSDs may have
>>> updated data and others may have un-updated data, while none of them
>>> has the pg log entry appended yet. In that case we would need to
>>> compare the full objects.
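
(As a side note, purely for illustration: a standalone journal record
would presumably need to carry at least the version information that
peering compares, something like the sketch below. The field names are
made up, not the actual pg_log_entry_t encoding.)

  #include <cstdint>
  #include <string>

  // Hypothetical record layout for a standalone per-pg journal file.
  struct PgLogJournalRecord {
    uint64_t epoch;       // map epoch of the write
    uint64_t version;     // per-pg sequence number (eversion-style)
    uint32_t op;          // modify / delete / clone ...
    std::string object;   // object the entry refers to
    std::string payload;  // encoded mod description / attrs
  };
  // Peering would compare (epoch, version) across replicas; if no
  // surviving replica managed to append the record, that comparison is
  // impossible and we are back to comparing full objects.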
>>>
>> We need an ordered set of pg log entries for recovery and peering.
>> If we store them as files, we need to remember somehow which entries
>> this osd contains and where to read them.
>>
>> And if we keep them in BlueFS as you mentioned in the other discussion,
>> there will be more entries in the BlueFS log and the compaction problem
>> shifts from RocksDB to BlueFS. Otherwise we have to store them on the
>> BlueStore block device, which would consume more onodes and lead to
>> more fragmented space usage on the block device.
>
>
> There are two schemes that seemed worth pursuing if we were to seriously
> consider moving this data outside of rocksdb:
>
> 1) dedicated block space for per-pg ring buffers
>
> 2) something vaguely like the rocksdb WAL but we don't keep entries in
> memory, we don't cap the number of buffers, and we keep in-memory references
> to each one and only delete old buffers once no PG references them anymore.
>
> It seems to me that both schemes would avoid the write amp penalty we suffer
> from now.  1 should be more space efficient but more or less turn this all
> into random IO.  2 would mean bouncing between the WAL and this external log
> which might be nearly as bad as 1.  It also could mean extremely high space
> amp in unusual scenarios.
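
(To make sure I follow scheme 1, it would look roughly like a fixed
region per pg on the block device written as a ring; this is purely
illustrative, not existing BlueStore code.)

  #include <cstdint>

  // Illustrative only: fixed-size on-disk ring buffer per pg.
  struct PgLogRing {
    uint64_t region_offset;  // start of this pg's reserved block region
    uint64_t region_len;     // e.g. trim window * max entry size
    uint64_t head;           // offset (within region) of next append
    uint64_t tail;           // offset of the oldest untrimmed entry
  };

  // Appends wrap around; trimming just advances tail with no rewrite,
  // but each pg writes into its own region, so the device sees random IO.
  uint64_t next_append_offset(const PgLogRing& r, uint64_t entry_len) {
    uint64_t pos = (r.head + entry_len <= r.region_len) ? r.head : 0;
    return r.region_offset + pos;
  }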
>
> I think Josh's proposal from last time is worth thinking about too: Make
> per-pg ring buffers in rocksdb itself so that you are updating existing
> keys.
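
(If I read Josh's idea right, the key space per pg stays bounded, so new
entries overwrite the slots that trimmed entries used to occupy instead
of producing an ever-growing stream of deletes. A very rough sketch, with
a made-up key scheme:)

  #include <cstdint>
  #include <cstdio>
  #include <string>

  // Sketch of a per-pg ring buffer expressed as rocksdb keys.
  std::string pglog_ring_key(const std::string& pgid,
                             uint64_t seq,
                             uint64_t ring_slots) {
    // seq % ring_slots bounds the key space: slot N is simply
    // overwritten when the log wraps, so no tombstones accumulate.
    char slot[17];
    std::snprintf(slot, sizeof(slot), "%016llx",
                  static_cast<unsigned long long>(seq % ring_slots));
    return "pglog." + pgid + "." + slot;
  }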
>
> Mark
>
>
>
>>
>>> Another method I am investigating is whether we can use RocksDB's FIFO
>>> compaction just for the pg log. That means we need to handle it per pg.
>>> This needs changes in RocksDB, and every pg log entry will be written
>>> at most twice (once to the RocksDB log, and once to level 0).
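
(For reference, FIFO compaction in rocksdb is configured per column
family, along these lines; the size cap is only illustrative:)

  #include <rocksdb/options.h>

  rocksdb::ColumnFamilyOptions pglog_cf;
  pglog_cf.compaction_style = rocksdb::kCompactionStyleFIFO;
  // FIFO keeps everything in L0 and drops the oldest files once the
  // total size exceeds this cap, so trimming has to be tuned so that
  // still-needed entries are never the ones dropped.
  pglog_cf.compaction_options_fifo.max_table_files_size =
      512ull * 1024 * 1024;  // illustrative cap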
>>>
>> Ideally, if we could have one instance of rocksdb per PG and one column
>> family just for the pglog entries, we could keep all the entries from
>> ever leaving the rocksdb L0s, but that comes with its own baggage of
>> managing multiple instances and a lot more resources. Practically it
>> might be a difficult thing to handle.
>> FIFO might solve the problem, but write amplification comes to 2 again
>> as you mentioned. Sorry, I don't have any bright ideas to offer; we
>> need to explore separating all of the pg log info into its own column
>> families and always pinning it to L0, either by passing some extra
>> info so those families are only ever compacted within L0 (never
>> promoted), or by keeping the trimming intervals tight enough that the
>> entries stay in L0. That can have more backfilling impact, but given
>> the improvements in recovery, should we calibrate that and find out
>> its cost compared to log-based recovery?
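
A rough sketch of the single-instance variant, i.e. one dedicated
"pglog" column family with its own compaction settings while the main
data keeps level compaction (names illustrative, error handling omitted):

  #include <rocksdb/db.h>

  // Sketch: give the pg log its own column family so its compaction
  // behaviour can differ from the main data column family.
  rocksdb::ColumnFamilyHandle* make_pglog_cf(rocksdb::DB* db) {
    rocksdb::ColumnFamilyOptions pglog_cf;
    pglog_cf.compaction_style = rocksdb::kCompactionStyleUniversal;
    pglog_cf.num_levels = 1;  // keep entries in L0, never promote them
    rocksdb::ColumnFamilyHandle* handle = nullptr;
    rocksdb::Status s = db->CreateColumnFamily(pglog_cf, "pglog", &handle);
    return s.ok() ? handle : nullptr;
  }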
>>
>> Varada
>>
>>> Any suggestions?
>>>
>>>> Josh
>>>>
>>>> [0]
>>>>
>>>> https://github.com/ceph/ceph/commit/1ef94200e9bce5e0f0ac5d1e563421a9d036c203
>>>
>>>
>>>
>>> --
>>> Best wishes
>>> Lisa
>>
>
>


