Re: Blueprint: Add LevelDB support to ceph cluster backend store

2013-7-31, 2:01, Sage Weil <sage@xxxxxxxxxxx> wrote:

> Hi Haomai,
> 
> On Wed, 31 Jul 2013, Haomai Wang wrote:
>> Every node of a Ceph cluster has a backend filesystem such as btrfs,
>> xfs, or ext4 that provides storage for data objects, whose locations
>> are determined by the CRUSH algorithm. There should exist an abstract
>> interface sitting between the OSD and the backend store, allowing
>> different backend store implementations. Currently, we only have a
>> general POSIX interface. LevelDB is a fast key-value storage library
>> written at Google that provides an ordered mapping from string keys to
>> string values. We could implement a LevelDB backend that supports the
>> basic operations corresponding to the POSIX operations. A LevelDB
>> driver would let the gateway communicate with LevelDB to store objects
>> on a per-node basis.
>> 
>> 
>> A LevelDB driver is attractive to people who have a special use case
>> such as a write-heavy system. If we can abstract a general interface,
>> we can choose another DBM if it is more suitable, such as Kyoto
>> Cabinet or BDB. Furthermore, we can choose the backend store for each
>> OSD node, so we can have different OSD types for special purposes.
>> 
>> Expected Results: Objects can be stored reliably in LevelDB. The IO
>> performance and recovery process should be comparable to the original
>> stores. For special cases, the LevelDB driver should have much better
>> performance than the local filesystem backend driver. Snapshots and
>> any other features you think of are optional.
> 
> I added a comment in the wiki, but I'll reply here.
> 
> Much of what you're talking about is already in place:
> 
> - There is an ObjectStore.h abstraction of the local storage.  The only
>   up-to-date implementation is FileStore, which uses a combination
>   of a local file system and leveldb, but other backends have been used
>   in the past, and new ones can easily be added.
> 
> - We currently use leveldb for the 'omap' component of rados objects.  
>   That is, each rados object has a bytestream portion (like a file), 
>   attrs (like extended attributes), and an omap (keys/values).  All or
>   none of those interfaces can be used for any given object, although
>   most users only use one interface at a time.  The main limitation here 
>   if you want to use leveldb only is that we still have an inode in the 
>   file system to represent each object, even when it contains only 
>   key/value pairs.
> 
> - The use of leveldb itself is also well abstracted by a KeyValueDB 
>   interface, so other key/value libraries could be swapped in for
>   it.  The other main component is a middle layer that wraps the kv
>   store to provide copy-on-write type semantics for each object's set of 
>   keys (to facilitate the snapshot functionality in rados/ceph).
> 
> If you have a workload that you want to be purely key/value based, it
> would be possible to write a much simpler ObjectStore implementation that 
> ignores or trivially implements the byte and attr portions of the object 
> in leveldb (or the KeyValueDB abstraction).  It would have very different 
> performance characteristics than what we're doing now, of course.  You 
> might also be interested in looking at the HyperLevelDB project, which is 
> a fork of leveldb that focuses on multithreading and compaction 
> performance.
I'm happy to hear it. 
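
To make the "much simpler ObjectStore" idea concrete, here is a rough
Python sketch of a key/value-only store. It is only an illustration and
not Ceph code: MemKV is a hypothetical in-memory stand-in for an ordered
store such as LevelDB, KVObjectStore and its method names are invented
here, and the key layout (<oid> NUL <part> NUL <name>) is an assumption
that requires object ids and names to contain no NUL bytes. Bytes, attrs
and omap all live in the same ordered keyspace, so no per-object inode
is needed.

```python
import bisect


class MemKV:
    """Hypothetical in-memory stand-in for an ordered key/value store."""

    def __init__(self):
        self._keys = []   # keys kept in sorted order
        self._vals = {}   # key -> value

    def put(self, key, value):
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def get(self, key, default=None):
        return self._vals.get(key, default)

    def scan(self, prefix):
        """Yield (key, value) pairs whose key starts with prefix, in order."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            key = self._keys[i]
            yield key, self._vals[key]
            i += 1


class KVObjectStore:
    """Hypothetical object store that keeps everything in the kv store."""

    def __init__(self, kv):
        self.kv = kv

    @staticmethod
    def _key(oid, part, name=""):
        # Assumed layout: <oid> NUL <part> NUL <name>
        return b"\x00".join([oid.encode(), part.encode(), name.encode()])

    def write(self, oid, data):
        # Trivial byte-stream support: the whole object is one value.
        self.kv.put(self._key(oid, "data"), data)

    def read(self, oid):
        return self.kv.get(self._key(oid, "data"), b"")

    def setattr(self, oid, name, value):
        self.kv.put(self._key(oid, "attr", name), value)

    def getattr(self, oid, name):
        return self.kv.get(self._key(oid, "attr", name))

    def omap_set(self, oid, key, value):
        self.kv.put(self._key(oid, "omap", key), value)

    def omap_get(self, oid, key):
        return self.kv.get(self._key(oid, "omap", key))

    def omap_keys(self, oid):
        # One ordered prefix scan returns this object's omap keys, sorted.
        prefix = self._key(oid, "omap")
        return [k[len(prefix):].decode() for k, _ in self.kv.scan(prefix)]
```

A store built this way would be used as KVObjectStore(MemKV()), then
write(), setattr() and omap_set() per object. A real backend would also
need transactions, partial writes and the copy-on-write layer mentioned
above, none of which this sketch attempts.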

I think there may be one thing you left out.  If we abstract a unified
interface, or several different ones, different pools can be used in different
situations. For example, two LevelDB-backed OSD nodes could form a distributed
k/v store, while three Btrfs OSD nodes serve a traditional use case.
This would give users much more flexibility.
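
As a toy illustration of that example (not how Ceph does placement), the
Python sketch below tags each OSD with a backend type and restricts each
pool to OSDs of one type. The Osd class, the backend labels and the pool
names are all hypothetical; in a real cluster this kind of restriction
would presumably be expressed through CRUSH rules rather than a filter
like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Osd:
    osd_id: int
    backend: str  # hypothetical label, e.g. "leveldb" or "btrfs"


def osds_for_pool(osds, pool_backend):
    """Return the OSDs whose backend matches what the pool asks for."""
    return [osd for osd in osds if osd.backend == pool_backend]


# Two LevelDB-backed OSDs and three Btrfs-backed OSDs, as in the example.
cluster = [
    Osd(0, "leveldb"),
    Osd(1, "leveldb"),
    Osd(2, "btrfs"),
    Osd(3, "btrfs"),
    Osd(4, "btrfs"),
]

# Hypothetical pools, each pinned to one backend type.
pools = {"kvpool": "leveldb", "filepool": "btrfs"}

placement = {
    name: [osd.osd_id for osd in osds_for_pool(cluster, backend)]
    for name, backend in pools.items()
}
```

Here placement maps each pool name to the ids of the OSDs it may use, so
the k/v pool only ever lands on the LevelDB-backed nodes.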
> 
> We've heard from other people who are interested in wiring different 
> key/value backends into the OSD, so any work to make it easier to do that 
> would be great!
> 
> sage

Best regards,
Wheats


