Re: storing pg logs outside of rocksdb

On Wed, Mar 28, 2018 at 11:41 AM, xiaoyan li <wisher2003@xxxxxxxxx> wrote:
> On Wed, Mar 28, 2018 at 9:43 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>> Hi Lisa, your presentation last week at Cephalocon was quite convincing.
>>
>> Recordings aren't available yet, so perhaps you can share your slides.
>
> Here are the slides:
> https://drive.google.com/file/d/1WC0id77KWLNVllsEcJCgRgEQ-Xzvzqx8/view?usp=sharing
>>
>> For those who weren't there, Lisa tested many configurations of rocksdb
>> with bluestore to attempt to keep the pg log out of level 0 in rocksdb,
>> and thus avoid a large source of write amplification.
>>
>> None of these tunings were successful, so the conclusion was that the pg
>> log ought to be stored outside of rocksdb.
>>
>> Lisa, what are your thoughts on how to store the pg log?
>>
>> For historical reference, it was moved into leveldb originally to make
>> it easier to program against correctly [0], but the current PGLog code
>> has grown too complex despite that.
> I once wondered whether we could just put the pg log in standalone log
> files. Read performance is not critical, since they are only read when an
> OSD node recovers. That is, store the other metadata in RocksDB and
> store the pg log in standalone journal files (no transaction spanning the
> other metadata and the pg log). But then I noticed that we can't tell
> which OSD has the latest data if the 3 OSD nodes containing the same pgs
> all fail during a write request. Some OSDs may have the updated data and
> others may have the un-updated data, while none of them has the pg log
> entry appended yet. In that case we would have to compare the full objects.
>
We need an ordered set of pg log entries for recovery and peering.
If we store them as files, we need to remember somehow which
entries this OSD contains and where to read them.

And if we keep them in BlueFS as you mentioned in the other
discussion, there will be more entries in the BlueFS log, and the
compaction problem just shifts from RocksDB to BlueFS.
Otherwise we have to store them on the BlueStore block device, which
would consume more onodes and lead to
more fragmented space usage on the block device.
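
To make that bookkeeping concrete, here is an illustrative sketch only (the
type and field names are made up, not existing Ceph code): if pg log entries
lived in standalone files, each OSD would at minimum need a small per-pg index
recording which version range it holds and where to read it, and that index
itself would still have to be persisted consistently with the data write.

  // Hypothetical sketch, not existing Ceph types: minimal per-pg
  // bookkeeping for pg log entries stored in standalone files, so
  // peering could still compare ordered log ranges per OSD.
  #include <cstdint>
  #include <map>
  #include <string>

  struct pg_log_extent_t {
    std::string file;            // standalone log file holding this pg's entries
    uint64_t offset = 0;         // byte offset of the first entry in the file
    uint64_t length = 0;         // bytes of contiguous, ordered entries
    uint64_t first_version = 0;  // oldest log version covered by this extent
    uint64_t last_version = 0;   // newest log version covered by this extent
  };

  // pg id -> extents keyed by first_version; this map itself still has to
  // be persisted and updated atomically with the data write, which is
  // exactly what rocksdb's transaction gives us today.
  using pg_log_index_t =
      std::map<uint64_t, std::map<uint64_t, pg_log_extent_t>>;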

> Another method I am investigating is whether in RocksDB we can use
> FIFO compaction just for the pg log. That means we need to handle it
> per pg. This would need changes in RocksDB, and every pg log entry
> would be written twice at most (once to the RocksDB log, and once to level 0).
>
Ideally, if we could have one instance of rocksdb per PG and one column
family for the pglog entries, we could keep all the entries from ever
leaving the rocksdb L0s, but that comes with its own baggage of managing
multiple instances and much more resources; practically it might be a
difficult thing to handle.
FIFO might solve the problem, but write amplification comes back to 2
again, as you mentioned. Sorry, I don't have any bright ideas to support
you. We need to explore some ideas about separating all the pg log
info into column families of its own and always pinning them
to L0: either passing some extra info to always compact them within
L0 (never promote), or keeping the trimming intervals tight enough that
they stay in L0. That could have more backfilling impact, but given the
improvements in recovery, should we calibrate that and find out its
cost compared to log-based recovery?
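
Just to make the column-family idea a bit more concrete, a rough sketch
against the stock RocksDB C++ API (not a worked-out proposal; the path and
sizes are arbitrary): a dedicated "pglog" column family opened with FIFO
compaction keeps all its SST files in L0 and never promotes them, and RocksDB
drops the oldest files once a total-size threshold is exceeded, so trimming
semantics would have to be mapped onto that threshold.

  #include <rocksdb/db.h>
  #include <rocksdb/options.h>
  #include <vector>

  int main() {
    rocksdb::Options opts;
    opts.create_if_missing = true;
    opts.create_missing_column_families = true;

    // Separate column family for pg log entries using FIFO compaction,
    // so its data never leaves L0; RocksDB deletes the oldest SST files
    // once the total size exceeds max_table_files_size.
    rocksdb::ColumnFamilyOptions pglog_opts;
    pglog_opts.compaction_style = rocksdb::kCompactionStyleFIFO;
    pglog_opts.compaction_options_fifo.max_table_files_size =
        256ull * 1024 * 1024;  // arbitrary illustrative cap

    std::vector<rocksdb::ColumnFamilyDescriptor> cfs = {
      {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
      {"pglog", pglog_opts},
    };
    std::vector<rocksdb::ColumnFamilyHandle*> handles;
    rocksdb::DB* db = nullptr;
    rocksdb::Status s =
        rocksdb::DB::Open(opts, "/tmp/pglog-test", cfs, &handles, &db);
    if (!s.ok()) return 1;

    // Each write still goes to the WAL first and then to an L0 file at
    // flush time, i.e. the write amplification of ~2 mentioned above.
    s = db->Put(rocksdb::WriteOptions(), handles[1], "pg1.0/000123", "entry");

    for (auto* h : handles) db->DestroyColumnFamilyHandle(h);
    delete db;
    return s.ok() ? 0 : 1;
  }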

Varada

> Any suggestions?
>
>>
>> Josh
>>
>> [0]
>> https://github.com/ceph/ceph/commit/1ef94200e9bce5e0f0ac5d1e563421a9d036c203
>
>
>
> --
> Best wishes
> Lisa