We mostly assumed that sort-of transactional file systems, perhaps hosted
in user space, were the most tractable trajectory.  I have seen newstore
and keyvalue store as essentially congruent approaches using database
primitives (and I am interested in what you make of Russell Sears's work).
I'm skeptical of any hope of keeping things "simple."  As Martin noted
downthread, most systems I have seen (filers, ZFS) make use of a fast,
durable commit log and then flex out... something else.

--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309

----- Original Message -----
> From: "Sage Weil" <sweil@xxxxxxxxxx>
> To: "John Spray" <jspray@xxxxxxxxxx>
> Cc: "Ceph Development" <ceph-devel@xxxxxxxxxxxxxxx>
> Sent: Tuesday, October 20, 2015 4:00:23 PM
> Subject: Re: newstore direction
>
> On Tue, 20 Oct 2015, John Spray wrote:
> > On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > > - We have to size the kv backend storage (probably still an XFS
> > > partition) vs the block storage.  Maybe we do this anyway (put
> > > metadata on SSD!) so it won't matter.  But what happens when we are
> > > storing gobs of rgw index data or cephfs metadata?  Suddenly we are
> > > pulling storage out of a different pool and those aren't currently
> > > fungible.
> >
> > This is the concerning bit for me -- the other parts one "just" has to
> > get the code right, but this problem could linger and be something we
> > have to keep explaining to users indefinitely.  It reminds me of cases
> > in other systems where users had to make an educated guess about inode
> > size up front, depending on whether you're expecting to efficiently
> > store a lot of xattrs.
> >
> > In practice it's rare for users to make these kinds of decisions well
> > up-front: it really needs to be adjustable later, ideally
> > automatically.  That could be pretty straightforward if the KV part
> > was stored directly on block storage, instead of having XFS in the
> > mix.  I'm not quite up with the state of the art in this area: are
> > there any reasonable alternatives for the KV part that would consume
> > some defined range of a block device from userspace, instead of
> > sitting on top of a filesystem?
>
> I agree: this is my primary concern with the raw block approach.
>
> There are some KV alternatives that could consume block, but the problem
> would be similar: we need to dynamically size up or down the kv portion
> of the device.
>
> I see two basic options:
>
> 1) Wire into the Env abstraction in rocksdb to provide something just
> smart enough to let rocksdb work.  It isn't much: named files (not that
> many--we could easily keep the file table in ram), always written
> sequentially, to be read later with random access.  All of the code is
> written around abstractions of SequentialFileWriter so that everything
> posix is neatly hidden in env_posix (and there are various other env
> implementations for in-memory mock tests etc.).
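
For concreteness, the guts of 1) really are small: a file table kept in
RAM plus sequential writes and random reads against the device.  Below is
a rough sketch of just that core in plain POSIX terms, leaving out the
actual rocksdb::Env/EnvWrapper virtuals it would plug into; the class
name, the bump allocator, and the one-extent-per-file assumption are made
up for illustration and untested.

// Sketch only: in-RAM file table mapping rocksdb's named files onto
// extents of a raw block device.  Everything here (BlockDevFiles, the
// bump allocator, one extent per file) is hypothetical.
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

struct Extent {
  uint64_t offset;   // where the file starts on the device
  uint64_t length;   // bytes written so far
  uint64_t reserved; // bytes reserved for it
};

class BlockDevFiles {
 public:
  explicit BlockDevFiles(const std::string& dev) {
    // a real implementation would want O_DIRECT plus aligned buffers
    fd_ = ::open(dev.c_str(), O_RDWR);
    if (fd_ < 0)
      throw std::runtime_error("open " + dev + " failed");
  }
  ~BlockDevFiles() { ::close(fd_); }

  // "Create" a named file: reserve the next free region of the device.
  void create(const std::string& name, uint64_t reserve_bytes) {
    files_[name] = Extent{next_free_, 0, reserve_bytes};
    next_free_ += reserve_bytes;
  }

  // rocksdb only ever appends; this is what WritableFile::Append maps to.
  void append(const std::string& name, const char* data, uint64_t len) {
    Extent& e = files_.at(name);
    if (e.length + len > e.reserved)
      throw std::runtime_error("out of reserved space");
    if (::pwrite(fd_, data, len, e.offset + e.length) != (ssize_t)len)
      throw std::runtime_error("pwrite failed");
    e.length += len;
  }

  // ...and this is roughly what RandomAccessFile::Read maps to.
  int64_t read(const std::string& name, uint64_t off,
               char* buf, uint64_t len) const {
    const Extent& e = files_.at(name);
    if (off >= e.length)
      return 0;
    uint64_t n = std::min(len, e.length - off);
    return ::pread(fd_, buf, n, e.offset + off);
  }

  void sync() { ::fsync(fd_); }  // data only; the file table itself
                                 // would also need persisting somewhere

 private:
  int fd_ = -1;
  uint64_t next_free_ = 0;              // trivial bump allocator
  std::map<std::string, Extent> files_; // the whole "file system"
};

An actual shim would subclass rocksdb::EnvWrapper, hand handles like these
back from NewSequentialFile()/NewRandomAccessFile()/NewWritableFile(), and
persist the file table itself (say, in a small superblock on the device)
so it survives a restart.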

> 2) Use something like dm-thin to sit between the raw block device and
> XFS (for rocksdb) and the block device consumed by newstore.  As long
> as XFS doesn't fragment horrifically (it shouldn't, given we *always*
> write ~4MB files in their entirety) we can fstrim and size down the fs
> portion.  If we similarly make newstore's allocator stick to large
> blocks only we would be able to size down the block portion as well.
> Typical dm-thin block sizes seem to range from 64KB to 512KB, which
> seems reasonable enough to me.  In fact, we could likely just size the
> fs volume at something conservatively large (like 90%) and rely on
> -o discard or periodic fstrim to keep its actual utilization in check.
>
> sage
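
On the periodic fstrim in 2): if we wanted the OSD to drive it itself
rather than leaning on cron and fstrim(8), it boils down to the FITRIM
ioctl.  Rough sketch below; the mount point is a placeholder and the
error handling is minimal.

// Sketch: an in-process equivalent of "fstrim <mountpoint>" so space
// XFS frees can be handed back to dm-thin (and thus to the newstore
// block portion).  The path in main() is hypothetical.
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   // FITRIM, struct fstrim_range
#include <cstdint>
#include <cstdio>

static int trim_fs(const char* mountpoint) {
  int fd = ::open(mountpoint, O_RDONLY | O_DIRECTORY);
  if (fd < 0) {
    perror("open");
    return -1;
  }
  struct fstrim_range range;
  range.start = 0;
  range.len = UINT64_MAX;  // trim the whole filesystem
  range.minlen = 0;        // let the fs choose a minimum extent size
  int r = ::ioctl(fd, FITRIM, &range);
  if (r < 0)
    perror("FITRIM");
  else
    // on success the kernel updates range.len to the bytes it discarded
    printf("trimmed %llu bytes\n", (unsigned long long)range.len);
  ::close(fd);
  return r;
}

int main() {
  // hypothetical mount point for the rocksdb volume
  return trim_fs("/var/lib/ceph/osd/ceph-0/db") == 0 ? 0 : 1;
}

The -o discard route gets the same effect without a scheduled pass, at
the cost of issuing discards in the write path as blocks are freed.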