On Wed, 21 Oct 2015, Ric Wheeler wrote:
> On 10/21/2015 04:22 AM, Orit Wasserman wrote:
> > On Tue, 2015-10-20 at 14:31 -0400, Ric Wheeler wrote:
> > > On 10/19/2015 03:49 PM, Sage Weil wrote:
> > > > The current design is based on two simple ideas:

> > > >  1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.)

> > > >  2) a file system is well suited for storing object data (as files).

> > > > So far 1 is working out well, but I'm questioning the wisdom of #2. A few things:

> > > >  - We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3). So two parties are managing metadata here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal).

> > > If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation, of course, but the order of issuing those fsync()'s can effectively make some of them no-ops.

> > > >  - On read we have to open files by name, which means traversing the fs namespace. Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple of btree lookups. We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard...

> > > This seems like a pretty low hurdle to overcome.

> > > >  - ...and file systems insist on updating mtime on writes, even when it is an overwrite with no allocation changes. (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze.

> > > Are you using O_DIRECT? Seems like there should be some enterprisey database tricks that we can use here.

> > > >  - XFS is (probably) never going to give us data checksums, which we want desperately.

> > > What is the goal of having the file system do the checksums? How strong do they need to be and what size are the chunks?

> > > If you update this on each IO, this will certainly generate more IO (each write will possibly generate at least one other write to update that new checksum).

> > > > But what's the alternative? My thought is to just bite the bullet and consume a raw block device directly. Write an allocator, hopefully keep it pretty simple, and manage it in the kv store along with all of our other metadata.

> > > The big problem with consuming block devices directly is that you ultimately end up recreating most of the features that you had in the file system. Even enterprise databases like Oracle and DB2 have been migrating away from running on raw block devices in favor of file systems over time. In effect, you are looking at making a simple on-disk file system, which is always easier to start than it is to get to a stable, production-ready state.

> > The best performance is still on block device (SAN). File systems simplify operational tasks, which is worth the performance penalty for a database; I think in a storage system this is not the case. In many cases they can use their own file system that is tailored for the database.

> You will have to trust me on this as the Red Hat person who spoke to pretty much all of our key customers about local file systems and storage - customers have all migrated over to using normal file systems under Oracle/DB2. Typically, they use XFS or ext4. I don't know of any non-standard file systems in use, and have only seen one account running on a raw block store in 8 years :)

> If you have a pre-allocated file and write using O_DIRECT, your IO path is identical in terms of IOs sent to the device.

...except it's not. Preallocating the file gives you contiguous space, but you still have to mark the extent written (not zero/prealloc). The only way to get an identical IO pattern is to *pre-write* zeros (or whatever) to the file... which is hours on modern HDDs. Ted asked for a way to force prealloc to expose preexisting disk bits a couple of years back at LSF and it was shot down for security reasons (and rightly so, IMO).

If you're going down this path, you already have a "file system" in user space sitting on top of the preallocated file, and you could just as easily use the block device directly. If you're not, then you're writing smaller files (e.g., megabytes), and will be paying the price to write to the {xfs,ext4} journal to update allocation and inode metadata. And that's what we're trying to avoid...
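To make that concrete, here is roughly what the preallocate + O_DIRECT path looks like (a sketch only, not newstore code; the path and sizes are made up). Even with the space reserved up front the extents are unwritten, so the first overwrite of each range still leaves metadata for the fdatasync to push through the fs journal:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for O_DIRECT and fallocate()
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>

int main() {
  const size_t len = 1 << 20;                       // 1MB write, O_DIRECT aligned
  int fd = ::open("/var/lib/osd0/obj123",           // made-up object file path
                  O_CREAT | O_RDWR | O_DIRECT, 0644);
  if (fd < 0)
    return 1;

  // Reserve 4MB up front: the space is contiguous, but still marked unwritten.
  if (::fallocate(fd, 0, 0, 4 << 20) < 0)
    return 1;

  void *buf = nullptr;
  if (::posix_memalign(&buf, 4096, len))
    return 1;
  memset(buf, 0xab, len);

  if (::pwrite(fd, buf, len, 0) != (ssize_t)len)    // the data IO
    return 1;
  if (::fdatasync(fd) < 0)                          // plus a journal IO to persist
    return 1;                                       // the unwritten->written flip

  ::close(fd);
  free(buf);
  return 0;
}

If the file had been pre-written with real zeros there would be nothing left for that fdatasync to journal, which is exactly the pre-write tradeoff above.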
> If we are causing additional IOs, then we really need to spend some time talking to the local file system gurus about this in detail. I can help with that conversation.

Happy to sync up with Eric or Dave, but I really don't think the fs is doing anything wrong here. It's just not the right fit.

> > This won't be a file system, but just an allocator, which is a very small part of a file system.

> That is always the intention, and then we wake up a few years into the project with something that looks and smells like a file system as we slowly bring in just one more small thing at a time.

Probably, yes. But it will be exactly the small things that *we* need.

> > The benefits are not just in reducing the number of IO operations we perform; we are also removing the file system stack overhead, which will reduce our latency and make it more predictable. Removing this layer will give us more control and allow us other optimizations we cannot do today.

> I strongly disagree here - we can get that optimal number of IOs if we use the file system APIs developed over the years to support enterprise databases. And we can have that today without having to re-write allocation routines and checkers.

It will take years and years to get data crcs and the types of IO hints that we want in XFS (if we ever get them--my guess is we won't, as it's not worth the rearchitecting that is required). We can be much more agile this way. Yes, it's an additional burden, but it's also necessary to get the performance we need to be competitive: POSIX does not provide the atomicity/consistency that we require, and there is no way to unify our transaction commit IOs with the underlying FS journals, or to get around the fact that the fs is maintaining an independent data structure (inode) for our per-object metadata record, with yet another intervening data structure (directories and dentries) that we have 0 use for.

It's not that the fs isn't doing what it does really well; it's that it's doing the wrong things for our use case.
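For reference, the ordering we are stuck with today looks roughly like this (a sketch, not the actual newstore code; the function, keys, and paths are stand-ins). Note the two independent durability points: the fdatasync pushes the file's own metadata through the fs journal, and then the sync WriteBatch pushes our metadata through the rocksdb log, and neither commit knows about the other:

// Rough sketch of today's two-commit write path (not newstore code;
// write_object, the keys, and the values are made up for illustration).
#include <fcntl.h>
#include <unistd.h>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

bool write_object(rocksdb::DB *db, int fd,
                  const char *data, size_t len, off_t off,
                  const std::string &onode_key, const std::string &onode_val) {
  // 1) the data IO into the object file
  if (::pwrite(fd, data, len, off) != (ssize_t)len)
    return false;

  // 2) commit #1: the fs journal (size/extent/mtime updates we don't even want)
  if (::fdatasync(fd) < 0)
    return false;

  // 3) commit #2: the kv journal (the metadata we actually care about)
  rocksdb::WriteBatch batch;
  batch.Put(onode_key, onode_val);
  rocksdb::WriteOptions wo;
  wo.sync = true;                      // forces a rocksdb WAL write and sync
  return db->Write(wo, &batch).ok();
}

The whole point of managing allocation ourselves is to collapse steps 2 and 3 into a single commit.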
> > I think this is more acute when taking SSD (and even faster technologies) into account.

> XFS and ext4 both support DAX, so we can effectively do direct writes to persistent memory (no block IO required). Most of the work over the past few years in the IO stack has been around driving IOPs at insanely high rates on top of the whole stack (file system layer included) and we have really good results.

Yes. But ironically much of that hard work is around maintaining the existing functionality of the stack while reducing its overhead. If you avoid a layer of the stack entirely it's a moot issue. Obviously the block layer work will still be important for us, but the fs bits won't matter. And in order to capture any of these benefits the code that is driving the IO from userspace also has to be equally efficient anyway, so it's not like using a file system here gets you anything for free.

> > > In addition to the technical hurdles, there are also production worries, like how long will it take for distros to pick up formal support? How do we test it properly?

> > This should be userspace only; I don't think we need it in the kernel (we will need root access for opening the device). For users that don't have root access we can use one big file and run the same allocator inside it. It can be good for testing too.

> > As someone who has already been part of such a move more than once (for example at Exanet), I can say that the performance gain is very impressive, and after the change we could remove many workarounds, which simplified the code.

> > As the API should be small, the testing effort is reasonable. We do need to test it well, as a bug in the allocator has really bad consequences.

> > We won't be able to match (or exceed) our competitors' performance without making this effort ...

> > Orit

> I don't agree that we will see a performance win if we use the file system properly. Certainly, you can measure a slow path through a file system and then show an improvement with a new, user-space block access, but that is not a long-term path to success.

I've been doing this long enough that I'm pretty confident I'm not measuring the slow path. And yes, there are some things we could do to improve the situation, but the complexity required is similar to avoiding the fs altogether, and the end result will still be far from optimal.

For example: we need to do an overwrite of an existing object that is atomic with respect to a larger ceph transaction (we're updating a bunch of other metadata at the same time, possibly overwriting or appending to multiple files, etc.). XFS and ext4 aren't cow file systems, so plugging into the transaction infrastructure isn't really an option (and even after several years of trying to do it with btrfs it proved to be impractical). So: we do write-ahead journaling. That's okay (even great) for small io (the database we're tracking our metadata in is log-structured anyway), but if the overwrite is large it's pretty inefficient.
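Concretely, the write-ahead path for an overwrite ends up looking something like the sketch below (made-up keys and function name, not the actual newstore wal code): the payload goes into the kv log first and only then into the file, so a large overwrite gets written twice.

// Sketch of write-ahead journaling an overwrite (keys and helper are
// hypothetical).  The payload is committed into the kv store first,
// then applied to the file, so the data crosses the device twice.
#include <fcntl.h>
#include <unistd.h>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

bool overwrite_via_wal(rocksdb::DB *db, int fd,
                       const std::string &data, off_t off,
                       const std::string &onode_key, const std::string &onode_val) {
  const std::string wal_key = "wal.obj123.0";   // hypothetical WAL record key

  // 1) commit the overwrite (payload + metadata) atomically into the kv store
  rocksdb::WriteBatch batch;
  batch.Put(wal_key, data);                     // first copy of the data
  batch.Put(onode_key, onode_val);
  rocksdb::WriteOptions wo;
  wo.sync = true;
  if (!db->Write(wo, &batch).ok())
    return false;

  // 2) apply it to the object file (second copy of the data)
  if (::pwrite(fd, data.data(), data.size(), off) != (ssize_t)data.size())
    return false;
  if (::fdatasync(fd) < 0)
    return false;

  // 3) retire the WAL record once the apply is durable
  rocksdb::WriteBatch cleanup;
  cleanup.Delete(wal_key);
  rocksdb::WriteOptions wo2;
  return db->Write(wo2, &cleanup).ok();
}

The cleanup can be batched lazily; the point is simply that the payload is written twice.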
Assuming I have a 4MB XFS file, how do I do an atomic 1MB overwrite? Maybe we write to a new file, fsync that, and use the defrag ioctl to swap extents. But then we're creating extraneous inodes, forcing additional fsyncs, and relying on weakly tested functionality that is much more likely to lead to nasty surprises for users (for example, see our use of the xfs extsize ioctl in firefly and the data corruption it causes on 3.2 kernels). It would be an extremely delicate solution that relies on very careful ordering of fs ioctls and syscalls to ensure both data safety and performance... and even then it wouldn't be optimal.

If we manage allocation ourselves this problem is trivial: write to an unallocated extent, fua/flush, commit the transaction.

The allocators in general-purpose file systems have to cope with a huge spectrum of workloads, and they do admirably well given the challenge. Ours will need to cope with a vastly simpler set of constraints. And most importantly, it will be tied into the same transaction commit mechanism as everything else, which means it will not require additional IOs to maintain its metadata. And the metadata we do manage will be exactly the metadata we need, nothing more and nothing less.

sage