On Tue, Oct 20, 2015 at 12:44 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Tue, 20 Oct 2015, Ric Wheeler wrote:
>> The big problem with consuming block devices directly is that you ultimately
>> end up recreating most of the features that you had in the file system. Even
>> enterprise databases like Oracle and DB2 have been migrating away from running
>> on raw block devices in favor of file systems over time. In effect, you are
>> looking at making a simple on-disk file system, which is always easier to start
>> than it is to get back to a stable, production-ready state.
>
> This was why we abandoned ebofs ~4 years ago... btrfs had arrived and had
> everything we were implementing and more: mainly, copy on write and data
> checksums. But in practice the fact that it's general purpose means it
> targets very different workloads and APIs than what we need.

Try 7 years since ebofs...

That's one of my concerns, though. You ditched ebofs once already because it
had metastasized into an entire FS and had reached the limits of its
maintainability. What makes you think a second time through would work
better? :/

On Mon, Oct 19, 2015 at 12:49 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> - 2 IOs for most: one to write the data to unused space in the block
> device, one to commit our transaction (vs 4+ before). For overwrites,
> we'd have one io to do our write-ahead log (kv journal), then do
> the overwrite async (vs 4+ before).

I can't work this one out. If you're doing one write for the data and one for
the kv journal (which is on another filesystem), how does the commit sequence
work out to only 2 IOs instead of the same 3 we already have? Or are you
planning to ditch the LevelDB/RocksDB store for our journaling and just use
something within the block layer?

If we do want to go down this road, we shouldn't need to write an allocator
from scratch. I don't remember exactly which one it was, but we've read/seen
at least a few storage papers where people reused an existing allocator (the
ext2 one, I think?), and somebody managed to get it running in userspace.

Of course, then we also need to figure out how to get checksums on the block
data, since if we're going to put in the effort to reimplement this much of
the stack, we'd better get our full data integrity guarantees along with it!
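To make that a bit more concrete, here's the rough shape I'm picturing for the
userspace allocator piece, with a checksum carried per extent. Completely
untested sketch, and every name in it is made up for illustration; nothing
like this exists in the tree today:

#include <stdint.h>
#include <map>

struct Extent {
  uint64_t offset;   // byte offset on the raw block device
  uint64_t length;   // length in bytes
  uint32_t csum;     // e.g. crc32c of the data written into this extent
};

// Dumb first-fit allocator over an in-memory free map (offset -> length).
// An ext2/3-style bitmap or a buddy allocator could hide behind the same
// interface; the point is only that it lives in userspace and persists its
// state through the same transaction commit as everything else.
class ExtentAllocator {
 public:
  explicit ExtentAllocator(uint64_t device_size) {
    free_[0] = device_size;             // start with one big free extent
  }

  // Carve 'len' bytes out of the first free extent that can hold them.
  bool allocate(uint64_t len, Extent *out) {
    for (std::map<uint64_t,uint64_t>::iterator p = free_.begin();
         p != free_.end(); ++p) {
      if (p->second >= len) {
        out->offset = p->first;
        out->length = len;
        out->csum = 0;                  // filled in once the data is written
        uint64_t new_off = p->first + len;
        uint64_t remainder = p->second - len;
        free_.erase(p);
        if (remainder)
          free_[new_off] = remainder;
        return true;
      }
    }
    return false;  // out of space; a real allocator would coalesce and retry
  }

  // Hand an extent back (no coalescing of neighbours in this sketch).
  void release(const Extent &e) {
    free_[e.offset] = e.length;
  }

 private:
  std::map<uint64_t,uint64_t> free_;    // offset -> length of free space
};

Whatever existing allocator we end up borrowing could sit behind an interface
about that small, and the csum field is where the data integrity story would
hang off whichever checksum we settle on.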
On Tue, Oct 20, 2015 at 1:00 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Tue, 20 Oct 2015, John Spray wrote:
>> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> > - We have to size the kv backend storage (probably still an XFS
>> > partition) vs the block storage. Maybe we do this anyway (put metadata on
>> > SSD!) so it won't matter. But what happens when we are storing gobs of
>> > rgw index data or cephfs metadata? Suddenly we are pulling storage out of
>> > a different pool and those aren't currently fungible.
>>
>> This is the concerning bit for me -- the other parts one "just" has to
>> get the code right, but this problem could linger and be something we
>> have to keep explaining to users indefinitely. It reminds me of cases
>> in other systems where users had to make an educated guess about inode
>> size up front, depending on whether you're expecting to efficiently
>> store a lot of xattrs.
>>
>> In practice it's rare for users to make these kinds of decisions well
>> up-front: it really needs to be adjustable later, ideally
>> automatically. That could be pretty straightforward if the KV part
>> was stored directly on block storage, instead of having XFS in the
>> mix. I'm not quite up with the state of the art in this area: are
>> there any reasonable alternatives for the KV part that would consume
>> some defined range of a block device from userspace, instead of
>> sitting on top of a filesystem?
>
> I agree: this is my primary concern with the raw block approach.
>
> There are some KV alternatives that could consume block, but the problem
> would be similar: we need to dynamically size up or down the kv portion of
> the device.
>
> I see two basic options:
>
> 1) Wire into the Env abstraction in rocksdb to provide something just
> smart enough to let rocksdb work. It isn't much: named files (not that
> many--we could easily keep the file table in ram), always written
> sequentially, to be read later with random access. All of the code is
> written around abstractions of SequentialFileWriter so that everything
> posix is neatly hidden in env_posix (and there are various other env
> implementations for in-memory mock tests etc.).

This seems like the obviously correct move to me? Except that we might want
to keep the rocksdb store on flash instead of hard drives, which means maybe
we do want some unified storage system that can handle multiple physical
storage devices as a single piece of storage space. (Not that any of those
exist in "almost done" hell, or that we're going through requirements
expansion or anything!)
-Greg
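P.S. For option 1, the skeleton below is roughly the shape I'd expect. It's
declarations only and completely untested; BlockEnv, BlockDevice, and
ExtentAllocator are names I just made up, while Env/EnvWrapper and friends
are the real rocksdb abstraction Sage is describing. The idea is to wrap the
stock Env, keep the file table in RAM, and reroute only the handful of entry
points rocksdb actually exercises onto extents of the raw device:

#include <stdint.h>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

#include "rocksdb/env.h"

class BlockDevice;      // hypothetical: pread/pwrite on the raw device
class ExtentAllocator;  // hypothetical: hands out free extents

// EnvWrapper delegates everything to the wrapped Env by default, so only
// the file-shaped operations rocksdb uses for its WAL, SSTs and MANIFEST
// need to be overridden here.
class BlockEnv : public rocksdb::EnvWrapper {
 public:
  BlockEnv(rocksdb::Env *base, BlockDevice *dev, ExtentAllocator *alloc)
    : rocksdb::EnvWrapper(base), dev_(dev), alloc_(alloc) {}

  // Always written sequentially: append into freshly allocated extents.
  rocksdb::Status NewWritableFile(
      const std::string &fname,
      std::unique_ptr<rocksdb::WritableFile> *result,
      const rocksdb::EnvOptions &options) override;

  // Read back later, sequentially (log replay) or with random access (SSTs).
  rocksdb::Status NewSequentialFile(
      const std::string &fname,
      std::unique_ptr<rocksdb::SequentialFile> *result,
      const rocksdb::EnvOptions &options) override;
  rocksdb::Status NewRandomAccessFile(
      const std::string &fname,
      std::unique_ptr<rocksdb::RandomAccessFile> *result,
      const rocksdb::EnvOptions &options) override;

  rocksdb::Status DeleteFile(const std::string &fname) override;
  rocksdb::Status GetFileSize(const std::string &fname,
                              uint64_t *size) override;

 private:
  // "Not that many" files, so a flat in-RAM table seems plenty; it would be
  // persisted through our own transaction commit rather than through posix.
  struct FileInfo {
    std::vector<std::pair<uint64_t, uint64_t> > extents;  // offset, length
    uint64_t size;
  };
  std::map<std::string, FileInfo> files_;
  BlockDevice *dev_;
  ExtentAllocator *alloc_;
};

Everything else (threads, clock, file locking) just falls through to the
wrapped posix Env, which is what would keep this option as small as Sage says.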