Re: newstore direction

Sage Weil <sweil@xxxxxxxxxx> · Tue, 20 Oct 2015 13:00:23 -0700 (PDT)

On Tue, 20 Oct 2015, John Spray wrote:
> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> >  - We have to size the kv backend storage (probably still an XFS
> > partition) vs the block storage.  Maybe we do this anyway (put metadata on
> > SSD!) so it won't matter.  But what happens when we are storing gobs of
> > rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> > a different pool and those aren't currently fungible.
> 
> This is the concerning bit for me -- the other parts one "just" has to
> get the code right, but this problem could linger and be something we
> have to keep explaining to users indefinitely.  It reminds me of cases
> in other systems where users had to make an educated guess about inode
> size up front, depending on whether you're expecting to efficiently
> store a lot of xattrs.
> 
> In practice it's rare for users to make these kinds of decisions well
> up-front: it really needs to be adjustable later, ideally
> automatically.  That could be pretty straightforward if the KV part
> was stored directly on block storage, instead of having XFS in the
> mix.  I'm not quite up with the state of the art in this area: are
> there any reasonable alternatives for the KV part that would consume
> some defined range of a block device from userspace, instead of
> sitting on top of a filesystem?

I agree: this is my primary concern with the raw block approach.

There are some KV alternatives that could consume block, but the problem 
would be similar: we need to dynamically size up or down the kv portion of 
the device.

I see two basic options:

1) Wire into the Env abstraction in rocksdb to provide something just 
smart enough to let rocksdb work.  It isn't much: named files (not that 
many--we could easily keep the file table in ram), always written 
sequentially, to be read later with random access.  All of the rocksdb 
code is written around abstractions like SequentialFileWriter, so 
everything posix is neatly hidden in env_posix (and there are various 
other env implementations for in-memory mock tests, etc.).
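
To make "just smart enough" a bit more concrete, here is a rough sketch 
in Python of the kind of layer option 1 would need.  This is not 
rocksdb code and not a real Env implementation: TinyBlockFS, the 
extent size, and the bytearray standing in for the raw device are all 
made up for illustration.  It only shows the three properties named 
above: named files, an in-RAM file table, append-only writes with 
random-access reads.

```python
class TinyBlockFS:
    """Toy model: named, append-only files stored as extents on a flat device."""

    def __init__(self, device_size, extent_size=4096):
        self.dev = bytearray(device_size)   # stand-in for the raw block device
        self.extent_size = extent_size
        # Free extent numbers; a real allocator would be smarter than a list.
        self.free = list(range(device_size // extent_size))
        # The whole file table lives in RAM: name -> extents and byte size.
        self.files = {}

    def create(self, name):
        if name in self.files:
            raise FileExistsError(name)
        self.files[name] = {"extents": [], "size": 0}

    def append(self, name, data):
        """Sequential write: data always lands at the current end of file."""
        f = self.files[name]
        pos = 0
        while pos < len(data):
            off = f["size"] % self.extent_size
            if off == 0:
                # File is empty or its last extent is full: allocate another.
                if not self.free:
                    raise OSError("no space left on device")
                f["extents"].append(self.free.pop(0))
            ext = f["extents"][-1]
            n = min(self.extent_size - off, len(data) - pos)
            start = ext * self.extent_size + off
            self.dev[start:start + n] = data[pos:pos + n]
            f["size"] += n
            pos += n

    def read(self, name, offset, length):
        """Random-access read; short reads at end of file, like pread()."""
        f = self.files[name]
        end = min(offset + length, f["size"])
        out = bytearray()
        while offset < end:
            ext = f["extents"][offset // self.extent_size]
            off = offset % self.extent_size
            n = min(self.extent_size - off, end - offset)
            start = ext * self.extent_size + off
            out += self.dev[start:start + n]
            offset += n
        return bytes(out)

    def delete(self, name):
        """Return the file's extents to the free list."""
        self.free.extend(self.files.pop(name)["extents"])


fs = TinyBlockFS(device_size=64 * 1024)
fs.create("000001.sst")
fs.append("000001.sst", b"a" * 5000)    # spills into a second extent
fs.append("000001.sst", b"b" * 100)
print(fs.read("000001.sst", 4990, 20))  # read across the a/b boundary
```

Because freed extents go straight back to one shared free list, the 
space used by the kv files and the space available to everything else 
on the device would be fungible in a design like this, which is the 
point of the exercise.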

2) Use something like dm-thin to sit between the raw block device and 
both XFS (for rocksdb) and the block device consumed by newstore.  As 
long as XFS doesn't fragment horrifically (it shouldn't, given we *always* 
write ~4MB files in their entirety) we can fstrim and size down the fs 
portion.  If we similarly make newstore's allocator stick to large blocks 
only, we would be able to size down the block portion as well.  Typical 
dm-thin block 
sizes seem to range from 64KB to 512KB, which seems reasonable enough to 
me.  In fact, we could likely just size the fs volume at something 
conservatively large (like 90%) and rely on -o discard or periodic fstrim 
to keep its actual utilization in check.
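
A toy model of why this makes the two portions fungible, again in 
Python.  It is only a sketch of the accounting, not of dm-thin itself: 
ThinPool and the volume names "kv" and "data" are invented here, and 
it assumes every write and discard is aligned to whole pool blocks 
(which is what sticking to large allocations is meant to buy us).

```python
class ThinPool:
    """Toy accounting model of two thin volumes sharing one pool of blocks."""

    def __init__(self, total_blocks):
        self.total = total_blocks
        self.mapped = set()   # (volume, virtual block) pairs backed by the pool

    def write(self, volume, vblock):
        """First write to a virtual block consumes one block from the pool."""
        key = (volume, vblock)
        if key not in self.mapped:
            if len(self.mapped) >= self.total:
                raise OSError("thin pool exhausted")
            self.mapped.add(key)

    def discard(self, volume, vblock):
        """A whole-block discard (fstrim / -o discard) returns it to the pool."""
        self.mapped.discard((volume, vblock))

    def used(self, volume):
        return sum(1 for v, _ in self.mapped if v == volume)

    def free(self):
        return self.total - len(self.mapped)


pool = ThinPool(total_blocks=100)

# "kv" is the XFS volume holding rocksdb; "data" is newstore's block volume.
for b in range(30):
    pool.write("kv", b)
for b in range(60):
    pool.write("data", b)
print("free after initial load:", pool.free())

# rocksdb compacts and deletes files, then the fs is trimmed.
for b in range(20):
    pool.discard("kv", b)
print("free after trim:", pool.free())

# The space given back by the kv side is now usable by the data side.
for b in range(60, 90):
    pool.write("data", b)
print("kv:", pool.used("kv"), "data:", pool.used("data"), "free:", pool.free())
```

Neither volume has a fixed share here; the only hard limit is the pool 
itself, so the virtual size of each volume can be set conservatively 
large.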

sage