Re: newstore direction

Since the changes that moved the pg log and the pg info into the pg
object space, I think it's now the case that any transaction submitted
to the objectstore updates a disjoint range of objects determined by
the sequencer.  It might be easier to exploit that parallelism if we
control allocation and allocation-related metadata.  We could split
the store into N pieces which partition the pg space (plus one
additional piece for the meta sequencer?), with one rocksdb instance
for each.  Space could then be parcelled out in large chunks (so
global allocation decisions are infrequent) and managed more finely
within each partition.  The main challenge would be avoiding internal
fragmentation of those chunks, but at least defragmentation can be
managed on a per-partition basis.  Such parallelism is probably
necessary to exploit the full throughput of some SSDs.
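
A rough sketch of what that routing could look like (the names, the
fixed shard count, and the chunk bookkeeping here are illustrative;
the real thing would hang a rocksdb handle and an allocator off each
partition):

  // Sketch: route each sequencer (pg) to one of N partitions, each with
  // its own rocksdb instance and its own fine-grained allocator; space
  // arrives in large chunks from an infrequent global allocation step.
  #include <cstdint>
  #include <cstdio>
  #include <functional>
  #include <string>
  #include <vector>

  struct Partition {
    // rocksdb::DB *db;            // one rocksdb instance per partition
    uint64_t chunk_base = 0;       // large chunk handed out globally
    uint64_t chunk_used = 0;       // fine-grained allocation within it
  };

  class ShardedStore {
    std::vector<Partition> parts;
  public:
    explicit ShardedStore(size_t n) : parts(n) {}

    // All transactions from one sequencer land in the same partition, so
    // partitions never contend on allocation metadata.
    Partition& partition_for(const std::string& sequencer) {
      return parts[std::hash<std::string>{}(sequencer) % parts.size()];
    }
  };

  int main() {
    ShardedStore store(8);
    std::printf("pg 1.2f -> partition %p\n",
                static_cast<void*>(&store.partition_for("1.2f")));
  }
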
-Sam

On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI
<james.liu@xxxxxxxxxxxxxxx> wrote:
> Hi Sage and other fellow cephers,
>   I truly share the pain with you all about filesystems while I am working on the objectstore to improve performance. As mentioned, there is nothing wrong with filesystems as such; it is just that Ceph, as one use case, needs more support than filesystems will provide in the near future, for whatever reasons.
>
>    There are many emerging techniques that can help improve OSD performance.  A user-space driver (DPDK from Intel) is one of them: it gives you not only the storage allocator but also thread scheduling support, CPU affinity, NUMA awareness, and polling, which might fundamentally change the performance of the objectstore.  It should not be hard to improve CPU utilization by 3x~5x, achieve higher IOPS, etc.
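>
> A minimal illustration of the CPU-affinity-plus-polling part (plain
> pthreads here rather than DPDK; the core number and the empty loop body
> are just placeholders):
>
>   #include <atomic>
>   #include <pthread.h>
>   #include <sched.h>
>   #include <thread>
>
>   // Pin the calling thread to one core; together with a polling loop
>   // this avoids context switches and keeps caches warm, which is what
>   // DPDK-style runtimes do.
>   static void pin_to_core(int core) {
>     cpu_set_t set;
>     CPU_ZERO(&set);
>     CPU_SET(core, &set);
>     pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
>   }
>
>   int main() {
>     std::atomic<bool> stop{false};
>     std::thread worker([&] {
>       pin_to_core(1);            // dedicate core 1 to this worker
>       while (!stop.load(std::memory_order_acquire)) {
>         // busy-poll for completions here: no sleeps, no syscalls,
>         // at the cost of burning the core
>       }
>     });
>     stop.store(true, std::memory_order_release);
>     worker.join();
>   }
>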
>     I totally agree that the goal of filestore is to give enough support for filesystems via solution 1, 1b, or 2. In my humble opinion, the design goal of the new objectstore should focus on giving the best performance for the OSD with new techniques. These two goals do not conflict with each other; they simply serve different purposes, making Ceph not only more stable but also better.
>
>   Scylla, mentioned by Orit, is a good example.
>
>   Thanks all.
>
>   Regards,
>   James
>
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> Sent: Thursday, October 22, 2015 5:50 AM
> To: Ric Wheeler
> Cc: Orit Wasserman; ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: newstore direction
>
> On Wed, 21 Oct 2015, Ric Wheeler wrote:
>> You will have to trust me on this as the Red Hat person who spoke to
>> pretty much all of our key customers about local file systems and
>> storage - customers have all migrated over to using normal file
>> systems under Oracle/DB2.  Typically, they use XFS or ext4.  I don't
>> know of anyone on a non-standard file system, and I have only seen
>> one account running on a raw block store in 8 years :)
>>
>> If you have a pre-allocated file and write using O_DIRECT, your IO
>> path is identical in terms of the IOs sent to the device.
>>
>> If we are causing additional IOs, then we really need to spend some
>> time talking to the local file system gurus about this in detail.  I
>> can help with that conversation.
>
> If the file is truly preallocated (that is, prewritten with zeros...
> fallocate doesn't help here because the extents are marked unwritten), then
> sure: there is very little change in the data path.
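>
> To make that distinction concrete (sketch only; the file name and sizes
> are arbitrary): fallocate() reserves the space but leaves the extents
> unwritten, so the first O_DIRECT write to each extent still triggers a
> metadata update to convert it, whereas prewriting zeros does not:
>
>   #include <fcntl.h>
>   #include <unistd.h>
>   #include <cstdlib>
>   #include <cstring>
>
>   int main() {
>     const off_t len = 1 << 20;
>     int fd = open("prealloc.bin", O_CREAT | O_WRONLY | O_DIRECT, 0644);
>
>     // (a) not enough on its own: extents allocated but marked unwritten
>     fallocate(fd, 0, 0, len);
>
>     // (b) truly prewritten: later O_DIRECT writes become pure overwrites
>     void *buf;
>     posix_memalign(&buf, 4096, 4096);
>     memset(buf, 0, 4096);
>     for (off_t off = 0; off < len; off += 4096)
>       pwrite(fd, buf, 4096, off);
>     fsync(fd);
>     free(buf);
>     close(fd);
>   }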
>
> But at that point, what is the point?  This only works if you have one (or a few) huge files and the user space app already has all the complexity of a filesystem-like thing (with its own internal journal, allocators, garbage collection, etc.).  Do they just do this to ease administrative tasks like backup?
>
>
> This is the fundamental tradeoff:
>
> 1) We have a file per object.  We fsync like crazy and the fact that there are two independent layers journaling and managing different types of consistency penalizes us.
>
> 1b) We get clever and start using obscure and/or custom ioctls in the file system to work around the behavior it normally imposes on us: we swap extents to avoid write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, batch fsync, O_ATOMIC, the setext ioctl, etc.
>
> 2) We preallocate a huge file (or a few) and write a user-space object system that lives within it (pretending the file is a block device).  The file system rarely gets in the way (assuming the file is prewritten and we don't do anything stupid).  But it doesn't give us anything a block device wouldn't, and it doesn't save us any complexity in our code.
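>
> For concreteness, a toy version of (2) is something like the following
> (purely illustrative: fixed-size slots, no durability or crash safety,
> and the class and file layout are made up):
>
>   #include <fcntl.h>
>   #include <unistd.h>
>   #include <cstdint>
>   #include <map>
>   #include <string>
>
>   // One big preallocated file; the application does its own allocation
>   // and keeps its own index, so the filesystem only ever sees
>   // overwrites of already-written blocks.
>   class FileBackedStore {
>     int fd;
>     uint64_t slot_size;
>     uint64_t next_slot = 0;
>     std::map<std::string, uint64_t> index;  // object -> slot (user-space metadata)
>   public:
>     FileBackedStore(const char *path, uint64_t slot)
>       : fd(open(path, O_RDWR | O_DIRECT)), slot_size(slot) {}
>
>     // "Allocating" is just our own bookkeeping; no fs metadata changes.
>     void write_object(const std::string &name, const void *buf, size_t len) {
>       auto it = index.find(name);
>       uint64_t slot = (it == index.end()) ? next_slot++ : it->second;
>       index[name] = slot;
>       pwrite(fd, buf, len, slot * slot_size);  // buf/len/offset must be 4K-aligned for O_DIRECT
>     }
>   };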
>
> At the end of the day, 1 and 1b are always going to be slower than 2.
> And although 1b performs a bit better than 1, it has similar (user-space) complexity to 2.  On the other hand, if you step back and view the entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex than 2... and yet still slower.  Given we ultimately have to support both (both as an upstream and as a distro), that's not very attractive.
>
> Also note that every time we have strayed off the beaten path (1) to anything mildly exotic (1b) we have been bitten by obscure file system bugs.  And that's assuming we get everything we need upstream... which is probably a year's endeavour.
>
> Don't get me wrong: I'm all for making changes to file systems to better support systems like Ceph.  Things like O_NOCMTIME and O_ATOMIC make a huge amount of sense for a ton of different systems.  But our situation is a bit different: we always own the entire device (and often the server), so there is no need to share with other users or apps (and when you do, you just use the existing FileStore backend).  And as you know, performance is a huge pain point.  We are already handicapped by virtue of being distributed and strongly consistent; we can't afford to give away more to a storage layer that isn't providing us much (or the right) value.
>
> And I'm tired of half measures.  I want the OSD to be as fast as we can make it given the architectural constraints (RADOS consistency and ordering semantics).  This is truly low-hanging fruit: it's modular, self-contained, pluggable, and this will be my third time around this particular block.
>
> sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


