Re: Is BlueFS an alternative of BlueStore?

Hi Sage,

Peng and I investigated the PG backfill and scrub code per your guidance.
Below are the results of our further investigation.

Please forgive the long email :-(

ZFS library + ObjectStore
=========================

I think I now understand what you meant by "collection sorted
enumeration". "Sorted enumeration" actually implies two things:

1. All objects in the collection have a defined sort order.
2. Given an object, it is easy to tell whether it falls within a range.

Obviously, the most efficient way is NOT to sort the objects of a collection
after retrieving the list of objects from the backend. It is better for the
entries to be stored on the backend in the expected order.
That's why RocksDB is a key piece of BlueStore.
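
To make the two requirements concrete, here is a minimal sketch, assuming a toy in-memory collection that keeps names in sorted order the way an ordered KV store would. The class and method names (SortedCollection, list_from, in_range) are hypothetical and are not Ceph or RocksDB APIs.

```python
import bisect


class SortedCollection:
    """Toy collection that keeps object names sorted (hypothetical, not Ceph code)."""

    def __init__(self):
        self._keys = []  # always kept in sorted order, like keys in an ordered KV store

    def add(self, name):
        # Insert at the sorted position so no sort is needed at listing time.
        i = bisect.bisect_left(self._keys, name)
        if i == len(self._keys) or self._keys[i] != name:
            self._keys.insert(i, name)

    def list_from(self, start, max_entries):
        # Requirement 1: resume a sorted enumeration at `start` (inclusive).
        i = bisect.bisect_left(self._keys, start)
        return self._keys[i:i + max_entries]

    @staticmethod
    def in_range(name, begin, end):
        # Requirement 2: half-open range check [begin, end) using only the sort order.
        return begin <= name < end
```

Because entries are stored in order, a listing can restart from any key without reading and sorting the whole collection.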

We tried hard to map the ZFS ZAP to a CEPH collection. Here is the scheme
we had in mind:

ZAP is the ZFS Attribute Processor, an object type that describes a
key-value set. ZFS uses it heavily for metadata; a directory is one example.
Most importantly, the entries in a ZAP do have an "ORDER". The ZAP
hashes the "key" to a 64-bit integer and adds a 32-bit CD (collision
differentiator) to index and store the KV entries. The CD is managed by the ZAP
itself to resolve hash collisions and is persisted in the ZAP entry descriptor.
(There is a more detailed explanation of ZAP at the end of this mail.)

In theory, we can use ZAP to achieve "sorted enumeration".
First, we can retrieve a sorted list of KVs (objects) from the ZAP.
Second, the hash can be calculated from the key name (object name), and the
CD can be retrieved from the on-disk ZAP entry associated with the object.
Taking the hash and CD together, the order can be determined.
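
A minimal sketch of that ordering, assuming a toy in-memory model: ToyZap and toy_hash64 are hypothetical names, the hash is a CRC32 stand-in rather than the real ZAP hash, and handing out the lowest unused CD on a collision is an assumption of this sketch.

```python
import zlib


def toy_hash64(name):
    # Stand-in hash, NOT the real ZAP hash: CRC32 placed in the upper 32 bits.
    return (zlib.crc32(name.encode()) & 0xFFFFFFFF) << 32


class ToyZap:
    """Toy model of ZAP entries ordered by (64-bit hash, CD)."""

    def __init__(self, hash_fn=toy_hash64):
        self._hash_fn = hash_fn
        self._entries = {}  # name -> (hash, cd), the persisted sort key

    def add(self, name):
        if name in self._entries:
            return
        h = self._hash_fn(name)
        # Assumption: a colliding entry gets the lowest CD not yet used for this hash.
        used = {cd for (eh, cd) in self._entries.values() if eh == h}
        cd = 0
        while cd in used:
            cd += 1
        self._entries[name] = (h, cd)

    def sort_key(self, name):
        # The hash is recomputable from the name; the CD must be read back from the entry.
        return self._entries[name]

    def sorted_names(self):
        # Enumeration order is (hash, CD), not the lexical order of the names.
        return sorted(self._entries, key=self._entries.get)
```

Note that the sort key of an object cannot be derived from its name alone, because the CD depends on which colliding entries were inserted earlier.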

However, we didn't find an elegant way to implement the idea for CEPH. If we
leverage the ZFS libraries to implement a new ObjectStore, the change cannot
be confined to the ObjectStore layer, since hobject, ghobject and the
comparison logic would have to be redefined based on the ZFS "ZAP entry hash + CD",
which is beyond the scope of ObjectStore alone. The comparison logic
is spread across ReplicatedPG etc.

In addition, we have another question about BlueStore that is relevant to our
idea. Does BlueStore consider "batch writes"?
Like BlueStore, ZFS never modifies data in place. A ZFS transaction
addresses not only metadata/data consistency but also "batch writes".
Batching significantly reduces the number of disk writes, so a ZFS transaction
persists data to disk on a 5-second period. I saw that FileStore persists data
immediately, even where filesystem semantics would not require a sync().
If we align the ZFS transaction and the CEPH ObjectStore transaction, we
either delay persisting data to the backend until the 5-second transaction
commit, or persist data to the ZIL immediately before updating the real
backend. The latter choice is still a double write. Would it be a problem if
we delay persisting data and reply to the client only once the data is persisted?
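
The trade-off between the two choices can be sketched as below. This is only a toy model under stated assumptions: ToyTxg is a hypothetical name, a whole batch is counted as a single device write, and the periodic 5-second timer is replaced by an explicit commit() call.

```python
class ToyTxg:
    """Toy model of batched transaction commit with an intent log for sync writes."""

    def __init__(self):
        self.pending = {}       # dirty data buffered in memory until commit
        self.intent_log = []    # records persisted immediately for sync writes
        self.backend = {}       # the "real" on-disk state
        self.device_writes = 0  # simplification: one count per log record or batch

    def write(self, key, value, sync=False):
        self.pending[key] = value
        if sync:
            # Choice 2: persist to the intent log now. The same data is written
            # again at commit time, which is the double write.
            self.intent_log.append((key, value))
            self.device_writes += 1

    def commit(self):
        # Choice 1: everything buffered since the last commit goes out as one batch.
        if self.pending:
            self.backend.update(self.pending)
            self.device_writes += 1
            self.pending.clear()
        # Log records are no longer needed once the batch is durable.
        self.intent_log.clear()
```

With only asynchronous writes, any number of updates costs one batched write per commit, but nothing is durable (and no client can be acknowledged) until commit() runs.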

We look forward to your advice: is it worthwhile for us to continue with the
proposal (leveraging the ZFS libraries to implement a new ObjectStore)?

ZFS Library + RocksDB
=====================
We also evaluated the possibility of using the ZFS libraries to host
RocksDB. I think it is very hard to do. The reasons are:

1. The ZIL reclaims blocks after a log trim and allocates blocks when a new
log record is added, which means there is no BlueFS-like "warm-up
phase."

2. RocksDB does sync writes for the WAL. RocksDB then synchronously flushes
the memtable to a backend file before trimming the WAL. ZFS does not like sync
operations since it tries to batch writes and commit data every 5 seconds. ZFS
trims the ZIL once the transaction is committed. So the life cycle of the ZIL
does not match that of the RocksDB WAL. Changing that would require a huge
change in RocksDB that cannot be confined to rocksdb::Env.

Overall, nothing is impossible in the engineer's world, but whether the
effort is worthwhile should be considered carefully ;-)


ZAP description:
================

ZAP hashes the attribute name (key) to a 64-bit integer.
The CD (collision differentiator) distinguishes entries whose hashes collide;
it is managed by ZAP and persisted on the backend.

So the 64-bit hash + CD uniquely identifies an attribute in the ZAP object.
ZAP inserts/indexes the KVs in (hash + CD) order.

n + m + k = 64 bits
n bits decide the pointer table bucket,
m bits decide which zap leaf block,
k bits decide the entry in the leaf bucket.
CD is the collision differentiator.

+---------------------+
|ZAP object descriptor|
+---------------------+
         |
         |  n bit of prefix of 64-bit hash index into bucket of ptbl
         V
pointer table
 ___________
| zap leaf  |
|___________|           zap leaf           zap leaf
| zap leaf  |        ____________        ____________
|___________|        |   next   |        |   next   |
| zap leaf  |------->|__________|------> |__________|
|___________|        | hash tbl |        | hash tbl |
|    ...    |        |__________|        |__________|
                          |                   |
                          | entry hash tbl    | entry hash tbl
                     _____V_____          ____V_____
                     |__________|        |__________|
                     |__________|        |__________|
                     |__________|        |__________|
                     |__________|        |__________|
           ----------|__________|        |__________|
           |
           |
           |
           |
        ___V______        __________        __________
       |entry next|----> |entry next|----> |entry next|
       |__________|      |__________|      |__________|
       |__ hash___|      |___hash___|      |___hash___|
       |    CD    |      |    CD    |      |    CD    |
       |__________|      |__________|      |__________|
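
The n + m + k split can be sketched as follows, assuming the three fields are taken from the most significant bits downward (the diagram says the n bits are a prefix of the hash; the placement of m and k is an assumption of this sketch). split_hash is a hypothetical helper, not a libzpool function.

```python
def split_hash(h, n, m, k):
    """Split a 64-bit hash into (pointer-table index, leaf selector, entry selector)."""
    assert n + m + k == 64
    assert 0 <= h < (1 << 64)
    ptbl_index = h >> (64 - n)              # top n bits: pointer table bucket
    leaf_bits = (h >> k) & ((1 << m) - 1)   # next m bits: which zap leaf block
    entry_bits = h & ((1 << k) - 1)         # low k bits: entry in the leaf bucket
    return ptbl_index, leaf_bits, entry_bits
```

For example, with n = 8, m = 8 and k = 48, the hash 0xABCD000000000012 splits into 0xAB, 0xCD and 0x12.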


Thanks
Javen & Peng


On Thu, 7 Jan 2016, Javen Wu wrote:
Thanks Sage for your reply.

I am not sure I understand the challenges you mentioned about backfill/scrub.
I will investigate the code and let you know if we can overcome the
challenge by simple means.
Our rough ideas for ZFSStore are:
1. Encapsulate the dnode object as an onode and add onode attributes.
2. Use a ZAP object as a collection. (A ZFS directory uses a ZAP object.)
3. Enumerating the entries in a ZAP object lists the objects in a collection.
This is the key piece that will determine whether rocksdb (or something
similar) is required.  POSIX doesn't give you sorted enumeration of
files.  In order to provide that with FileStore, we used a horrible
hashing scheme that dynamically broke directories into
smaller subdirectories once they got big, and organized things by a hash
prefix (enumeration is in hash order).  That meant a mess of directories
with bounded size (so that there were a bounded number of entries to read
and then sort in memory before returning a sorted result), which was
inefficient, and it meant that as the number of objects grew you'd have
this periodic rehash work that had to be done that further slowed things
down.  This, combined with the inability to group an arbitrary
number of file operations (writes, unlinks, renames, setxattrs, etc.) into
an atomic transaction was FileStore's downfall.  I think the zfs libs give
you the transactions you need, but you *also* need to get sorted
enumeration (with a sort order you define) or else you'll have all the
ugliness of the FileStore indexes.

4. Create a new metaslab class to store the CEPH journal.
5. Align the CEPH journal and the ZFS transaction.

Actually, we've talked about the possibility of building rocksdb::Env on top
of the zfs libraries. It must align the ZIL (ZFS intent log) and the RocksDB
WAL. Otherwise, there is still the same problem as with XFS and RocksDB.

ZFS is a tree-style, log-structured-like file system: once a leaf block is
updated, the modification is propagated from the leaf to the root of the tree.
To batch writes and reduce the number of disk writes, ZFS persists
modifications to disk in 5-second transactions. Only when an fsync/sync write
arrives in the middle of the 5 seconds does ZFS persist the journal to the ZIL.
I remember that RocksDB does a sync after adding a log record, so if
we cannot align the ZIL and the WAL, the log write would be written to the ZIL
first and then applied from the ZIL to the log file, and finally RocksDB
updates the sst file. It's almost the same problem as XFS, if my understanding
is correct.
If you implement rocksdb::Env, you'll see the rocksdb WAL writes and the
fsync calls come down.  You can store those however you'd like... as
"files" or perhaps directly in the ZIL.

The way we do this in BlueFS is that for an initial warm-up period, we
append to a WAL log file, and have to do both the log write *and* a
journal write to update the file size.  Once we've written out enough
logs, though, we start recycling the same logs (and disk blocks) and just
overwrite the previously allocated space.  The rocksdb log replay is now
smart enough to determine when it's reached the end of the new content and
is now seeing (old) garbage and stop.
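
How a replay can tell new content from old garbage in a recycled log can be sketched as below. This is a toy model under stated assumptions, not RocksDB's actual record format: each fixed slot carries a CRC and a log generation number, and make_record and replay are hypothetical helpers.

```python
import zlib


def make_record(gen, payload):
    # Record layout (assumed): 4-byte CRC32 | 4-byte generation | payload.
    body = gen.to_bytes(4, "little") + payload
    return zlib.crc32(body).to_bytes(4, "little") + body


def replay(slots, gen):
    """Return payloads of the current generation, stopping at the first stale slot."""
    out = []
    for rec in slots:
        if rec is None or len(rec) < 8:
            break  # never-written or truncated slot
        crc = int.from_bytes(rec[:4], "little")
        body = rec[4:]
        if zlib.crc32(body) != crc:
            break  # torn or corrupt write
        if int.from_bytes(body[:4], "little") != gen:
            break  # valid record, but left over from an earlier use of this space
        out.append(body[4:])
    return out
```

Since the old records still have valid CRCs after the space is recycled, the CRC alone is not enough; the generation number is what marks them as stale in this sketch.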

Whether it makes sense to do something similar in zfs-land I'm not sure.
Presumably the ZIL itself is doing something similar (sequence numbers and
crcs on log entries in a circular buffer) but the rocksdb log
lifecycle probably doesn't match the ZIL...

sage

In my mind, aligning the ZIL and the WAL needs more modifications in RocksDB.

Thanks
Javen


On 2016-01-07 22:37, peng.hse wrote:
Hi Sage,

thanks for your quick response. Javen and I, once zfs developers, are
currently focusing on how to leverage some of the zfs ideas to improve the
ceph backend performance in userspace.

Based on your encouraging reply, we have come up with 2 schemes to continue our
future work:

1. Scheme one: use an entirely new FS to replace rocksdb+bluefs. The FS
   itself handles the mapping of oid->fs-object (a kind of zfs dnode) and the
   corresponding attrs used by ceph. Despite the implementation challenges you
   mentioned about the in-order enumeration of objects during backfill, scrub,
   etc. (we confronted the same situation in zfs, where the ZAP features helped
   us a lot), from a performance or architecture point of view it looks clearer
   and cleaner. Would you suggest we give it a try?

2. Scheme two: as you suspected at the end of your reply, we just temporarily
   implement a simple version of the FS that leverages libzpool ideas to plug
   in underneath rocksdb, as your bluefs did.

We appreciate your insightful reply.

Thanks



On 2016-01-07 21:19, Sage Weil wrote:
On Thu, 7 Jan 2016, Javen Wu wrote:
Hi Sage,

Sorry to bother you. I am not sure if it is appropriate to send email to you
directly, but I cannot find any useful information on the Internet to address
my confusion. I hope you can help me.

I happened to hear that you are going to start BlueFS to eliminate the
redundancy between the XFS journal and the RocksDB WAL. I am a little confused.
Is BlueFS only meant to host RocksDB for BlueStore, or is it an
alternative to BlueStore?

I am a newcomer to CEPH, and I am not sure my understanding of BlueStore is
correct. BlueStore in my mind is as below.

               BlueStore
               =========
     RocksDB
+-----------+          +-----------+
|   onode   |          |           |
|    WAL    |          |           |
|   omap    |          |           |
+-----------+          |   bdev    |
|           |          |           |
|   XFS     |          |           |
|           |          |           |
+-----------+          +-----------+
This is the picture before BlueFS enters the picture.

I am curious: if BlueFS is able to host RocksDB, it is actually already a
"filesystem" which has to maintain blockmap-like metadata on its own,
WITHOUT the help of RocksDB.
Right.  BlueFS is a really simple "file system" that is *just* complicated
enough to implement the rocksdb::Env interface, which is what rocksdb
needs to store its log and sst files.  The after picture looks like

   +--------------------+
   |     bluestore      |
   +----------+         |
   | rocksdb  |         |
   +----------+         |
   |  bluefs  |         |
   +----------+---------+
   |    block device    |
   +--------------------+

The reason we care about the intention and design target of BlueFS is that I
had a discussion with my partner Peng.Hse about an idea to introduce a new
ObjectStore using the ZFS library. I know CEPH already supports ZFS as a
FileStore backend, but we had a different, immature idea: use libzpool to
implement a new ObjectStore for CEPH entirely in userspace, without the SPL and
ZOL kernel modules, so that we can align the CEPH transaction and the zfs
transaction in order to avoid the double write for the CEPH journal.
The ZFS core part, libzpool (DMU, metaslab etc.), offers a dnode object store
and is platform (kernel/user) independent. Another benefit of the idea is that
we can extend our metadata without bothering with any DBStore.

Frankly, we are not sure whether our idea is realistic so far, but when I
heard of BlueFS, I thought we needed to know the BlueFS design goal.
I think it makes a lot of sense, but there are a few challenges.  One
reason we use rocksdb (or a similar kv store) is that we need in-order
enumeration of objects in order to do collection listing (needed for
backfill, scrub, and omap).  You'll need something similar on top of zfs.

I suspect the simplest path would be to also implement the rocksdb::Env
interface on top of the zfs libraries.  See BlueRocksEnv.{cc,h} to see the
interface that has to be implemented...

sage


