Appreciate your explanation. I get what you mean. I will think about it and
get back to you after doing more investigation.

Javen

> This is the key piece that will determine whether rocksdb (or something
> similar) is required. POSIX doesn't give you sorted enumeration of
> files. In order to provide that with FileStore, we used a horrible
> hashing scheme that dynamically broke directories into smaller
> subdirectories once they got big, and organized things by a hash prefix
> (enumeration is in hash order). That meant a mess of directories with
> bounded size (so that there were a bounded number of entries to read and
> then sort in memory before returning a sorted result), which was
> inefficient, and it meant that as the number of objects grew you'd have
> this periodic rehash work that had to be done that further slowed things
> down. This, combined with the inability to group an arbitrary number of
> file operations (writes, unlinks, renames, setxattrs, etc.) into an
> atomic transaction, was FileStore's downfall. I think the zfs libs give
> you the transactions you need, but you *also* need to get sorted
> enumeration (with a sort order you define) or else you'll have all the
> ugliness of the FileStore indexes.
>
>> 4. create a new metaslab class to store the CEPH journal.
>> 5. align the CEPH journal and the ZFS transaction.
>>
>> Actually we've talked about the possibility of building RocksDB::Env on
>> top of the zfs libraries. It must align the ZIL (ZFS intent log) and the
>> RocksDB WAL. Otherwise, there is still the same problem as with XFS and
>> RocksDB.
>>
>> ZFS is a tree-style, log-structured-like file system: once a leaf block
>> is updated, the modification is propagated from the leaf up to the root
>> of the tree. To batch writes and reduce the number of disk writes, ZFS
>> persists modifications to disk in 5-second transaction groups. Only when
>> an fsync/sync write arrives in the middle of those 5 seconds does ZFS
>> persist the change to the ZIL.
>> I remember RocksDB does a sync after appending a log record, so if we
>> cannot align the ZIL and the WAL, the log write goes to the ZIL first,
>> then the ZIL is applied to the log file, and finally RocksDB updates the
>> sst files. It's almost the same problem as with XFS, if my understanding
>> is correct.
>
> If you implement rocksdb::Env, you'll see the rocksdb WAL writes and the
> fsync calls come down. You can store those however you'd like... as
> "files" or perhaps directly in the ZIL.
>
> The way we do this in BlueFS is that for an initial warm-up period, we
> append to a WAL log file, and have to do both the log write *and* a
> journal write to update the file size. Once we've written out enough
> logs, though, we start recycling the same logs (and disk blocks) and just
> overwrite the previously allocated space. The rocksdb log replay is now
> smart enough to determine when it's reached the end of the new content
> and is now seeing (old) garbage, and stop.
>
> Whether it makes sense to do something similar in zfs-land I'm not sure.
> Presumably the ZIL itself is doing something similar (sequence numbers
> and crcs on log entries in a circular buffer), but the rocksdb log
> lifecycle probably doesn't match the ZIL...
>
> sage
>
>> In my mind, aligning the ZIL and the WAL needs more modifications in
>> RocksDB.
>>
>> Thanks
>> Javen
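For concreteness, here is a minimal sketch (not Ceph's or BlueFS's actual code) of the rocksdb::WritableFile path Sage describes above, with WAL appends and syncs routed to a ZIL-style log handle. The rocksdb side is abbreviated from rocksdb/env.h and details vary between rocksdb versions; ZilLogHandle and its append()/commit() methods are hypothetical placeholders, not real libzpool/ZIL API.

// Sketch only: a WAL-oriented WritableFile for a hypothetical zfs-backed
// rocksdb::Env.  ZilLogHandle is a made-up stand-in for whatever ZIL/itx
// style log object the zfs libraries would expose to userspace.
#include <memory>
#include <utility>
#include "rocksdb/env.h"

struct ZilLogHandle {                       // hypothetical zfs-side object
  rocksdb::Status append(const char*, size_t) {
    return rocksdb::Status::OK();           // would queue a log record here
  }
  rocksdb::Status commit() {
    return rocksdb::Status::OK();           // would wait for durability here
  }
};

class ZfsWalFile : public rocksdb::WritableFile {
 public:
  explicit ZfsWalFile(std::unique_ptr<ZilLogHandle> log)
      : log_(std::move(log)) {}

  // rocksdb appends each WAL record here...
  rocksdb::Status Append(const rocksdb::Slice& data) override {
    return log_->append(data.data(), data.size());
  }

  // ...and calls Sync()/Fsync() when it needs those records durable.  This
  // is the point where the WAL could ride the ZIL instead of triggering a
  // separate file write plus a journal update for the new file size.
  rocksdb::Status Sync() override { return log_->commit(); }
  rocksdb::Status Fsync() override { return log_->commit(); }

  rocksdb::Status Flush() override { return rocksdb::Status::OK(); }
  rocksdb::Status Close() override { return log_->commit(); }

 private:
  std::unique_ptr<ZilLogHandle> log_;
};

If the Env handed rocksdb a file like this only for its .log files, the WAL could follow the ZIL's own durability path while sst files go through ordinary buffered objects.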
>>
>> On 2016-01-07 22:37, peng.hse wrote:
>>> Hi Sage,
>>>
>>> thanks for your quick response. Javen and I, who were once zfs
>>> developers, are currently focusing on how to leverage some of the zfs
>>> ideas to improve the ceph backend performance in userspace.
>>>
>>> Based on your encouraging reply, we have come up with 2 schemes to
>>> continue our future work.
>>>
>>> 1. Scheme one: use an entirely new FS to replace rocksdb+bluefs. The FS
>>> itself handles the mapping of oid -> fs-object (a kind of zfs dnode)
>>> and the corresponding attrs used by ceph, despite the implementation
>>> challenges you mentioned about the in-order enumeration of objects
>>> during backfill, scrub, etc. (the same situation we also confronted in
>>> zfs, where the ZAP features helped us a lot).
>>> From a performance and architecture point of view it looks clearer and
>>> cleaner. Would you suggest we give it a try?
>>>
>>> 2. Scheme two: as you suspected at the end of your reply, we just
>>> temporarily implement a simple version of the FS which leverages
>>> libzpool ideas to plug in underneath rocksdb, as your bluefs did.
>>>
>>> We would appreciate your insightful reply.
>>>
>>> Thanks
>>>
>>> On 2016-01-07 21:19, Sage Weil wrote:
>>>> On Thu, 7 Jan 2016, Javen Wu wrote:
>>>>> Hi Sage,
>>>>>
>>>>> Sorry to bother you. I am not sure if it is appropriate to send email
>>>>> to you directly, but I cannot find any useful information to address
>>>>> my confusion on the Internet. Hope you can help me.
>>>>>
>>>>> Occasionally, I heard that you are going to start BlueFS to eliminate
>>>>> the redundancy between the XFS journal and the RocksDB WAL. I am a
>>>>> little confused. Is BlueFS only there to host RocksDB for BlueStore,
>>>>> or is it an alternative to BlueStore?
>>>>>
>>>>> I am a newcomer to CEPH, so I am not sure my understanding of
>>>>> BlueStore is correct. BlueStore in my mind is as below.
>>>>>
>>>>>     BlueStore
>>>>>     =========
>>>>>      RocksDB
>>>>> +-----------+ +-----------+
>>>>> |   onode   | |           |
>>>>> |    WAL    | |           |
>>>>> |   omap    | |           |
>>>>> +-----------+ |   bdev    |
>>>>> |           | |           |
>>>>> |    XFS    | |           |
>>>>> |           | |           |
>>>>> +-----------+ +-----------+
>>>>
>>>> This is the picture before BlueFS enters the picture.
>>>>
>>>>> I am curious whether BlueFS is able to host RocksDB; it is actually
>>>>> already a "filesystem" which has to maintain blockmap-like metadata
>>>>> on its own WITHOUT the help of RocksDB.
>>>>
>>>> Right. BlueFS is a really simple "file system" that is *just*
>>>> complicated enough to implement the rocksdb::Env interface, which is
>>>> what rocksdb needs to store its log and sst files. The after picture
>>>> looks like
>>>>
>>>>  +--------------------+
>>>>  |     bluestore      |
>>>>  +----------+         |
>>>>  |  rocksdb |         |
>>>>  +----------+         |
>>>>  |  bluefs  |         |
>>>>  +----------+---------+
>>>>  |    block device    |
>>>>  +--------------------+
>>>>
>>>>> The reason we care about the intention and the design target of
>>>>> BlueFS is that I had a discussion with my partner Peng.Hse about an
>>>>> idea to introduce a new ObjectStore using the ZFS library. I know
>>>>> CEPH supports ZFS as a FileStore backend already, but we had a
>>>>> different, immature idea: use libzpool to implement a new ObjectStore
>>>>> for CEPH entirely in userspace, without the SPL and ZOL kernel
>>>>> modules, so that we can align CEPH transactions and zfs transactions
>>>>> in order to avoid the double write for the CEPH journal.
>>>>> The ZFS core part, libzpool (DMU, metaslab, etc.), offers a dnode
>>>>> object store, and it is platform (kernel/user) independent. Another
>>>>> benefit of the idea is that we can extend our metadata without
>>>>> bothering any DBStore.
>>>>>
>>>>> Frankly, we are not sure if our idea is realistic so far, but when I
>>>>> heard of BlueFS, I thought we needed to know the BlueFS design goal.
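To make the "align CEPH transactions and zfs transactions" idea above concrete, here is a rough sketch of grouping one object write and one attribute update into a single DMU transaction through libzpool. The dmu_tx_*/dmu_write/zap_update calls are the standard DMU interfaces, but their exact signatures differ between ZFS versions, and the example assumes the libzpool userspace build environment; the objset handle, object numbers, and attribute name are made-up placeholders.

// Sketch: apply "write len bytes at offset 0 of object data_obj, update one
// attribute on ZAP object attr_obj" as a single DMU transaction, so the
// whole Ceph transaction commits with one zfs txg (or via the ZIL for a
// sync write) and no separate Ceph journal write is needed.
extern "C" {
#include <sys/dmu.h>
#include <sys/zap.h>
#include <sys/txg.h>
}

int apply_ceph_transaction(objset_t *os, uint64_t data_obj, uint64_t attr_obj,
                           const void *buf, uint64_t len)
{
        dmu_tx_t *tx = dmu_tx_create(os);

        /* Declare up front everything the Ceph transaction will touch, so
         * the DMU can reserve space and assign it all to one txg. */
        dmu_tx_hold_write(tx, data_obj, 0, (int)len);
        dmu_tx_hold_zap(tx, attr_obj, B_TRUE, "ceph.example_attr");

        int err = dmu_tx_assign(tx, TXG_WAIT);
        if (err != 0) {
                dmu_tx_abort(tx);
                return err;
        }

        /* Both updates land in the same txg: either all of them become
         * durable at the next txg sync, or none do. */
        dmu_write(os, data_obj, 0, len, buf, tx);
        uint64_t val = 1;
        err = zap_update(os, attr_obj, "ceph.example_attr",
                         sizeof (val), 1, &val, tx);

        dmu_tx_commit(tx);
        return err;
}

Because every operation is declared against the same tx and assigned to one txg, the Ceph transaction becomes atomic at the DMU level, which is what would allow the separate Ceph journal write to be dropped.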
>>>> I think it makes a lot of sense, but there are a few challenges. One
>>>> reason we use rocksdb (or a similar kv store) is that we need in-order
>>>> enumeration of objects in order to do collection listing (needed for
>>>> backfill, scrub, and omap). You'll need something similar on top of
>>>> zfs.
>>>>
>>>> I suspect the simplest path would be to also implement the rocksdb::Env
>>>> interface on top of the zfs libraries. See BlueRocksEnv.{cc,h} to see
>>>> the interface that has to be implemented...
>>>>
>>>> sage
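To make the in-order enumeration point concrete: the usual way to get "a sort order you define" out of a kv store is to flatten the object identity into a key whose byte-wise order matches the desired enumeration order, so listing becomes an ordered range scan. The sketch below uses std::map as a stand-in for a rocksdb iterator (or whatever sorted structure a zfs-based store would provide); the fields and widths are simplified placeholders, not Ceph's actual ghobject key encoding.

// Sketch: encode (pool, hash, name) so that plain lexicographic order of
// the key string is exactly the enumeration order collection listing needs.
// Fixed-width big-endian hex for the numeric fields makes byte order equal
// numeric order.  This is a simplified stand-in, not BlueStore's key format.
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

static std::string encode_key(uint64_t pool, uint32_t hash,
                              const std::string &name)
{
  char prefix[32];
  std::snprintf(prefix, sizeof(prefix), "%016llx.%08x.",
                static_cast<unsigned long long>(pool),
                static_cast<unsigned>(hash));
  return std::string(prefix) + name;
}

int main()
{
  // Any ordered structure now enumerates objects by pool, then hash, then
  // name -- no per-directory in-memory sort and no periodic rehash/split
  // work as with the old FileStore directory indexes.
  std::map<std::string, std::string> objects;
  objects[encode_key(1, 0x1a2b3c4dU, "rbd_data.100")] = "onode...";
  objects[encode_key(1, 0x0000ffffU, "rbd_data.205")] = "onode...";
  objects[encode_key(2, 0x00000001U, "rbd_header.7")] = "onode...";

  for (const auto &kv : objects)
    std::printf("%s\n", kv.first.c_str());
  return 0;
}

Collection listing for a PG then becomes a bounded range scan starting at the encoded lower bound of that PG's hash range, resumable from any cursor key.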