Re: newstore direction

Sage Weil <sweil@xxxxxxxxxx> · Thu, 22 Oct 2015 05:50:15 -0700 (PDT)

On Wed, 21 Oct 2015, Ric Wheeler wrote:
> You will have to trust me on this as the Red Hat person who spoke to pretty
> much all of our key customers about local file systems and storage - customers
> all have migrated over to using normal file systems under Oracle/DB2.
> Typically, they use XFS or ext4.  I don't know of any non-standard file
> systems and only have seen one account running on a raw block store in 8 years
> :)
> 
> If you have a pre-allocated file and write using O_DIRECT, your IO path is
> identical in terms of IO's sent to the device.
> 
> If we are causing additional IO's, then we really need to spend some time
> talking to the local file system gurus about this in detail.  I can help with
> that conversation.

If the file is truly preallocated (that is, prewritten with zeros... 
fallocate doesn't help here because the extents are marked unwritten), then 
sure: there is very little change in the data path.
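
To make the prewritten-versus-fallocate distinction concrete, here is a minimal sketch in Python. The helper names, the 1 MiB chunk size, and the file mode are made up for illustration and are not from any Ceph code. The first helper only reserves space, which on XFS or ext4 normally leaves the new extents flagged unwritten; the second actually writes zeros so later overwrites land on already-written blocks.

```python
import os

CHUNK = 1 << 20  # illustrative choice: write zeros 1 MiB at a time


def reserve_with_fallocate(path, size):
    """Reserve space only. On XFS/ext4 the new extents are normally
    flagged unwritten, so the first real write to each block still has
    to update file system metadata."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)


def prewrite_with_zeros(path, size):
    """Actually write zeros, so every block is allocated and marked
    written before the application starts overwriting it."""
    zeros = bytes(CHUNK)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        remaining = size
        while remaining > 0:
            # os.write may write fewer bytes than requested; count what landed.
            written = os.write(fd, zeros[:min(CHUNK, remaining)])
            remaining -= written
        os.fsync(fd)  # make sure the zeros reach the device
    finally:
        os.close(fd)
```

Both helpers leave a file of the requested size; the difference is only in how the file system has the extents marked, which is what matters for the overwrite path.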

But at that point, what is the point?  This only works if you have one (or 
a few) huge files and the user space app already has all the complexity of 
a filesystem-like thing (with its own internal journal, allocators, 
garbage collection, etc.).  Do they just do this to ease administrative 
tasks like backup?

This is the fundamental tradeoff:

1) We have a file per object.  We fsync like crazy and the fact that 
there are two independent layers journaling and managing different types 
of consistency penalizes us.

1b) We get clever and start using obscure and/or custom ioctls in the file 
system to work around what it is used to: we swap extents to avoid 
write-ahead (see Christoph's patch), O_NOMTIME, unprivileged 
open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc.

2) We preallocate huge files and write a user-space object system that 
lives within it (pretending the file is a block device).  The file system 
rarely gets in the way (assuming the file is prewritten and we don't do 
anything stupid).  But it doesn't give us anything a block device 
wouldn't, and it doesn't save us any complexity in our code.
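
As a rough illustration of options 1 and 2, here is a toy sketch in Python. Both classes, their names, and the bump allocator are hypothetical simplifications, not how FileStore or NewStore are implemented. Option 1 leans on the file system for naming and allocation and pays for it with an fsync on the file and on the directory; option 2 treats one big file as a flat address space and has to carry its own index and allocator (and, in a real system, its own journal).

```python
import os


class FilePerObject:
    """Option 1 (toy version): one file per object, durable via fsync."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, name, data):
        path = os.path.join(self.root, name)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # flush file data and inode
        dfd = os.open(self.root, os.O_RDONLY)
        try:
            os.fsync(dfd)  # flush the new directory entry as well
        finally:
            os.close(dfd)

    def get(self, name):
        with open(os.path.join(self.root, name), "rb") as f:
            return f.read()


class SingleFileStore:
    """Option 2 (toy version): objects live at offsets in one big file."""

    def __init__(self, path, size):
        self.size = size
        self.next_free = 0  # trivial bump allocator; never frees space
        self.index = {}     # name -> (offset, length); in memory only here
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        # Assumption for brevity: ftruncate only sets the size. A real
        # store would prewrite the file with zeros first.
        os.ftruncate(self.fd, size)

    def put(self, name, data):
        if self.next_free + len(data) > self.size:
            raise OSError("store full")
        off = self.next_free
        written = 0
        while written < len(data):  # handle short writes
            written += os.pwrite(self.fd, data[written:], off + written)
        os.fsync(self.fd)
        self.index[name] = (off, len(data))
        self.next_free = off + len(data)

    def get(self, name):
        off, length = self.index[name]
        return os.pread(self.fd, length, off)

    def close(self):
        os.close(self.fd)
```

Note that the second class uses nothing from the file system beyond pread/pwrite/fsync at offsets, which is exactly what a raw block device would offer.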

At the end of the day, 1 and 1b are always going to be slower than 2.  
And although 1b performs a bit better than 1, it has similar (user-space) 
complexity to 2.  On the other hand, if you step back and view the 
entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex 
than 2... and yet still slower.  Given we ultimately have to support both 
(both as an upstream and as a distro), that's not very attractive.

Also note that every time we have strayed from the beaten path (1) to 
anything mildly exotic (1b) we have been bitten by obscure file system 
bugs.  And that's assuming we get everything we need upstream... which is 
probably a year's endeavour.

Don't get me wrong: I'm all for making changes to file systems to better 
support systems like Ceph.  Things like O_NOCMTIME and O_ATOMIC make a 
huge amount of sense for a ton of different systems.  But our situation is 
a bit different: we always own the entire device (and often the server), 
so there is no need to share with other users or apps (and when you do, 
you just use the existing FileStore backend).  And as you know performance 
is a huge pain point.  We are already handicapped by virtue of being 
distributed and strongly consistent; we can't afford to give away more to 
a storage layer that isn't providing us much (or the right) value.

And I'm tired of half measures.  I want the OSD to be as fast as we can 
make it given the architectural constraints (RADOS consistency and 
ordering semantics).  This is truly low-hanging fruit: it's modular, 
self-contained, pluggable, and this will be my third time around this 
particular block.

sage