On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
> +1. Nowadays K-V DBs care more about very small key/value pairs, say
> several bytes to a few KB, but in the SSD case we only care about 4KB
> or 8KB. In this way, NVMKV is a good design, and it seems some SSD
> vendors are also trying to build this kind of interface; we have an
> NVM-L library, but it is still under development.

Do you have an NVMKV link?  I see a paper and a stale github repo.. not
sure if I'm looking at the right thing.

My concern with using a key/value interface for the object data is that
you end up with lots of key/value pairs (e.g., $inode_$offset =
$4kb_of_data) that are pretty inefficient to store and (depending on
the implementation) tend to break alignment.  I don't think these
interfaces are targeted toward block-sized/aligned payloads.  Storing
just the metadata (block allocation map) w/ the kv api and storing the
data directly on a block/page interface makes more sense to me.

sage

> > -----Original Message-----
> > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of James (Fei) Liu-SSI
> > Sent: Tuesday, October 20, 2015 6:21 AM
> > To: Sage Weil; Somnath Roy
> > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > Subject: RE: newstore direction
> >
> > Hi Sage and Somnath,
> > In my humble opinion, there is another, more aggressive solution than
> > a key/value store on a raw block device as the objectstore backend: a
> > key/value SSD with transaction support would be ideal to solve these
> > issues.  First, it is a raw SSD device.  Second, it provides a
> > key/value interface directly from the SSD.  Third, it can provide
> > transaction support, so consistency is guaranteed by the hardware.
> > It pretty much satisfies all of the objectstore's needs without any
> > extra overhead, since there is no extra layer between the device and
> > the objectstore.
> > Either way, I strongly support having Ceph's own data format instead
> > of relying on a filesystem.
> >
> > Regards,
> > James
> >
> > -----Original Message-----
> > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 1:55 PM
> > To: Somnath Roy
> > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > Subject: RE: newstore direction
> >
> > On Mon, 19 Oct 2015, Somnath Roy wrote:
> > > Sage,
> > > I fully support that.  If we want to saturate SSDs, we need to get
> > > rid of this filesystem overhead (which I am in the process of
> > > measuring).  Also, it would be good if we could eliminate the
> > > dependency on the k/v dbs (for storing allocators and all).  The
> > > reason is the unknown write amps they cause.
> >
> > My hope is to keep this behind the KeyValueDB interface (and/or
> > change it as appropriate) so that other backends can be easily
> > swapped in (e.g. a btree-based one for high-end flash).
> >
> > sage
> >
> > > Thanks & Regards
> > > Somnath
> > >
> > > -----Original Message-----
> > > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> > > Sent: Monday, October 19, 2015 12:49 PM
> > > To: ceph-devel@xxxxxxxxxxxxxxx
> > > Subject: newstore direction
> > >
> > > The current design is based on two simple ideas:
> > >
> > > 1) a key/value interface is a better way to manage all of our
> > > internal metadata (object metadata, attrs, layout, collection
> > > membership, write-ahead logging, overlay data, etc.)
> > >
> > > 2) a file system is well suited for storing object data (as files).
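To make this split concrete before the critique that follows, here is a
minimal, hypothetical C++ sketch (not the actual NewStore code: a std::map
stands in for the rocksdb-backed kv store, and all key names and helper
functions are invented).  Metadata goes through a transactional kv
interface, while the payload is written to a file and fsync'd before the
kv commit:

  // Toy stand-in for the transactional kv backend; keys and values are
  // plain strings in a std::map, committed in one call.
  #include <cstdio>
  #include <map>
  #include <string>
  #include <utility>
  #include <vector>
  #include <unistd.h>   // fsync (POSIX)

  struct FakeKV {
    std::map<std::string, std::string> db;
    void submit(const std::vector<std::pair<std::string, std::string>>& txn) {
      for (const auto& kv : txn)
        db[kv.first] = kv.second;
    }
  };

  // Hypothetical object write under the current design: payload -> file,
  // metadata -> kv transaction.  Note the ordering: data write, fsync
  // (which also drags in the fs journal), then the kv commit.
  void write_object(FakeKV& kv, const std::string& oid,
                    const std::string& data) {
    std::string fname = oid + ".data";               // one file per object
    if (FILE* f = std::fopen(fname.c_str(), "wb")) {
      std::fwrite(data.data(), 1, data.size(), f);   // IO #1: the data itself
      std::fflush(f);
      fsync(fileno(f));                              // IO #2: fs journal commit
      std::fclose(f);
    }
    kv.submit({                                      // IO #3: kv txn commit
        {"meta/" + oid + "/size", std::to_string(data.size())},
        {"meta/" + oid + "/file", fname},
    });
  }

  int main() {
    FakeKV kv;
    write_object(kv, "object_0001", std::string(4096, 'x'));
    return 0;
  }

Even in this toy form there are three separate waits on stable storage
(data write, fsync, kv commit), which is exactly the pattern questioned
next.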
> > > So far 1 is working out well, but I'm questioning the wisdom of #2.
> > > A few things:
> > >
> > > - We currently write the data to the file, fsync, then commit the kv
> > > transaction.  That's at least 3 IOs: one for the data, one for the
> > > fs journal, one for the kv txn to commit (at least once my rocksdb
> > > changes land... the kv commit is currently 2-3).  So two layers are
> > > managing metadata here: the fs managing the file metadata (with its
> > > own journal) and the kv backend (with its journal).
> > >
> > > - On read we have to open files by name, which means traversing the
> > > fs namespace.  Newstore tries to keep it as flat and simple as
> > > possible, but at a minimum it is a couple of btree lookups.  We'd
> > > love to use open by handle (which would reduce this to 1 btree
> > > traversal), but running the daemon as ceph and not root makes that
> > > hard...
> > >
> > > - ...and file systems insist on updating mtime on writes, even when
> > > it is an overwrite with no allocation changes.  (We don't care about
> > > mtime.)  O_NOCMTIME patches exist but it is hard to get these past
> > > the kernel brainfreeze.
> > >
> > > - XFS is (probably) never going to give us data checksums, which we
> > > want desperately.
> > >
> > > But what's the alternative?  My thought is to just bite the bullet
> > > and consume a raw block device directly.  Write an allocator,
> > > hopefully keep it pretty simple, and manage it in the kv store along
> > > with all of our other metadata.
> > >
> > > Wins:
> > >
> > > - 2 IOs for most writes: one to write the data to unused space in
> > > the block device, one to commit our transaction (vs 4+ before).  For
> > > overwrites, we'd have one IO to do our write-ahead log (kv journal),
> > > then do the overwrite async (vs 4+ before).
> > >
> > > - No concern about mtime getting in the way.
> > >
> > > - Faster reads (no fs lookup).
> > >
> > > - Similarly sized metadata for most objects.  If we assume most
> > > objects are not fragmented, then the metadata to store the block
> > > offsets is about the same size as the metadata to store the
> > > filenames we have now.
> > >
> > > Problems:
> > >
> > > - We have to size the kv backend storage (probably still an XFS
> > > partition) vs the block storage.  Maybe we do this anyway (put
> > > metadata on SSD!) so it won't matter.  But what happens when we are
> > > storing gobs of rgw index data or cephfs metadata?  Suddenly we are
> > > pulling storage out of a different pool and those aren't currently
> > > fungible.
> > >
> > > - We have to write and maintain an allocator.  I'm still optimistic
> > > this can be reasonably simple, especially for the flash case (where
> > > fragmentation isn't such an issue as long as our blocks are
> > > reasonably sized).  For disk we may need to be moderately clever.
> > >
> > > - We'll need an fsck to ensure our internal metadata is consistent.
> > > The good news is it'll just need to validate what we have stored in
> > > the kv store.
> > >
> > > Other thoughts:
> > >
> > > - We might want to consider whether dm-thin or bcache or other block
> > > layers might help us with elasticity of file vs block areas.
> > >
> > > - Rocksdb can push colder data to a second directory, so we could
> > > have a fast ssd primary area (for the wal and most metadata) and a
> > > second hdd directory for stuff it has to push off.  Then have a
> > > conservative amount of file space on the hdd.  If our block space
> > > fills up, use the existing file mechanism to put data there too.
> > > (But then we have to maintain both the current kv + file approach
> > > and not go all-in on kv + block.)
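The "write an allocator" piece is small enough to sketch.  What follows
is a hedged illustration only (not NewStore code; the class name, the
first-fit policy, and the in-memory-only free list are invented for this
example): a free-extent map over the raw device, where the extents handed
out would be recorded in the same kv transaction as the rest of the
object metadata.

  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <optional>

  struct Extent { uint64_t offset, length; };

  // Minimal first-fit extent allocator.  free_ maps extent offset -> length;
  // a real implementation would persist this map (and per-object extent
  // lists) as kv entries next to the rest of the metadata, and would
  // coalesce neighbouring extents on release.
  class SimpleAllocator {
    std::map<uint64_t, uint64_t> free_;
    uint64_t block_size_;
  public:
    SimpleAllocator(uint64_t device_size, uint64_t block_size)
        : block_size_(block_size) { free_[0] = device_size; }

    std::optional<Extent> allocate(uint64_t want) {
      uint64_t need = (want + block_size_ - 1) / block_size_ * block_size_;
      for (auto it = free_.begin(); it != free_.end(); ++it) {
        if (it->second < need) continue;
        Extent e{it->first, need};
        uint64_t rem_off = it->first + need;
        uint64_t rem_len = it->second - need;
        free_.erase(it);
        if (rem_len) free_[rem_off] = rem_len;
        return e;                    // caller records the extent in its kv txn
      }
      return std::nullopt;           // no contiguous run big enough
    }

    void release(const Extent& e) {  // no coalescing, for brevity
      free_[e.offset] = e.length;
    }
  };

  int main() {
    SimpleAllocator alloc(1ull << 30, 4096);    // 1 GB device, 4 KB blocks
    if (auto e = alloc.allocate(10000))         // rounds up to 12288 bytes
      std::cout << "offset=" << e->offset << " length=" << e->length << "\n";
    return 0;
  }

With a layout like this, an object's block map is a short extent list in
a single kv value rather than one key per 4KB block, which is the layout
concern raised at the top of this thread.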
> > >
> > > Thoughts?
> > > sage
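On Sage's point earlier in the thread about staying behind the KeyValueDB
interface so that rocksdb, a btree-based store for high-end flash, or
something NVMKV-like could be swapped in, here is a rough sketch of what
such a narrow backend boundary can look like.  The method names are
invented for illustration and are not Ceph's actual KeyValueDB API:

  #include <map>
  #include <string>
  #include <utility>
  #include <vector>

  // A transaction is just a batch of sets and deletes to apply atomically.
  struct KVTransaction {
    std::vector<std::pair<std::string, std::string>> sets;
    std::vector<std::string> deletes;
    void set(const std::string& k, const std::string& v) { sets.push_back({k, v}); }
    void rm(const std::string& k) { deletes.push_back(k); }
  };

  // The narrow boundary: an implementation (rocksdb, a btree store, a kv
  // SSD) only needs atomic batch commit and point lookup.
  class KVBackend {
  public:
    virtual ~KVBackend() = default;
    virtual int submit(const KVTransaction& t) = 0;
    virtual bool get(const std::string& k, std::string* v) = 0;
  };

  // Trivial in-memory implementation, just to show the shape of a backend.
  class MapBackend : public KVBackend {
    std::map<std::string, std::string> db_;
  public:
    int submit(const KVTransaction& t) override {
      for (const auto& kv : t.sets) db_[kv.first] = kv.second;
      for (const auto& k : t.deletes) db_.erase(k);
      return 0;
    }
    bool get(const std::string& k, std::string* v) override {
      auto it = db_.find(k);
      if (it == db_.end()) return false;
      *v = it->second;
      return true;
    }
  };

  int main() {
    MapBackend db;                      // swap in a different KVBackend here
    KVTransaction t;
    t.set("meta/object_0001/size", "4096");
    db.submit(t);
    std::string v;
    return db.get("meta/object_0001/size", &v) ? 0 : 1;
  }

Anything that can apply a batch atomically and answer point lookups can
sit behind this boundary, which is what makes the backend swappable.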