Hi Michael,

Thanks for posting this. We don't have specific workload information, but we did want to mention some of the experimental Cephfs development we (Cohortfs) have been doing, in case it might be of interest to others in the community.

One of the projects we've undertaken is to implement pnfs-metastripe (a proposal for scale-out metadata in NFSv4) on Cephfs. In doing that, we've essentially been evolving a metastripe-flavored version of Cephfs, building on previous work to provide first-class lookup-by-inode# support (more below).

Our current codebase has a number of changes. In support of metastripe, we've augmented directory fragmentation with the concept of stripes, each of which can be locked and modified independently. To permit parallel updates on stripes, clients take "stripe caps" in place of a single capset on directories. We've also extended the Ceph cap model to support in-place state updates, as well as invalidates.

We also have a group of changes intended to increase MDS workload independence, including more independent caching. There are many cases where a Ceph MDS needs to get a cache replica of an object from its auth MDS. Most are needed to satisfy a client request (like a rename from one MDS to another). Many others, however, are necessitated by the reliance on full paths to locate objects: every cache object must have cache replicas of all of its parent objects in order to make these traversals possible. These extra cache replicas have a cost in memory, lock latency, and messaging overhead that limits scalability.

All of these overheads are essentially side effects of Ceph's method of storing inodes with their primary dentry. We're attacking this by storing inodes in a separate container, which is itself striped across MDS nodes to enable lookup-by-ino with a simple placement function.
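To make the idea concrete, here's a minimal sketch (in Python for brevity) of the kind of placement function we mean: map an inode number to a stripe of the inode container, and a stripe to its authoritative MDS, with no parent-path traversal. The names, constants, and modulo layout here are purely illustrative, not our actual code:

```python
# Illustrative sketch only -- not Cohortfs/Cephfs code.
# An inode container striped across MDS nodes lets any client or MDS
# locate an inode's authority directly from the inode number.

STRIPE_COUNT = 256   # stripes in the inode container (stripe count >> mds count)
MDS_COUNT = 8        # metadata servers in the cluster

def stripe_of(ino):
    """Map an inode number to a stripe of the inode container."""
    return ino % STRIPE_COUNT

def mds_of(stripe):
    """Map a stripe to the MDS currently authoritative for it.

    A simple modulo here; in practice this would be a small table so
    that failover or rebalancing can reassign a stripe to another MDS.
    """
    return stripe % MDS_COUNT

def lookup_by_ino(ino):
    """Resolve an inode to (stripe, authoritative MDS) without any
    path traversal or replication of parent directory objects."""
    s = stripe_of(ino)
    return s, mds_of(s)
```

The point of the indirection through stripes is that the stripe-to-MDS mapping can change (for load balancing or failover) without changing where any inode lives in the container.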
Obviously, this last design change is a big one, which trades away some of Cephfs' inlining properties for parallel performance and better NFS tuning.

We have other client and MDS work planned and/or in progress: client concurrency work (in progress), MDS concurrency work (planned), MDS cache management changes (planned), and client cache management changes (in progress). We're looking to add the ability to journal inode updates as deltas, in order to compress the journals and speed up replay. Further down the line, we'd like to create a journal for each stripe of the inode container (where stripe count >> mds count), rather than tying journals to an individual MDS. This would facilitate load balancing and failover by allowing any MDS to become authoritative for a stripe of inodes by replaying that stripe's journal.

One of our main goals is for a plurality of Cohortfs (and Cephfs) file systems to coexist in a Ceph cluster, in separate or unified namespaces.

In sum, our Cohortfs version of Cephfs makes some tradeoffs that we expect to perform better on some workloads, and perhaps worse on others, but some of the work we've done may also be useful to traditional Cephfs. We've been working entirely on our own so far, but this is open source work. We welcome feedback, and if there are others in the community interested in collaborating in these or related areas, you're welcome to join in.

Matt, Casey, Adam, Marcus

----- "Michael Sevilla" <mikesevilla3@xxxxxxxxx> wrote:
> Hi Ceph community,
>
> I’d like to get a feel for some of the problems that CephFS users are
> encountering with single MDS deployments. There were requests for
> stable distributed metadata/MDS services [1], and I’m guessing it’s
> because your workloads exhibit many, many metadata operations. Some of
> you mentioned opening many files in a directory for checkpointing,
> recursive stats on a directory, etc.
> [2] and I’d like more details, such as:
> - workloads/applications that stress the MDS service that would cause
>   you to call for multi-MDS support
> - use cases for the Ceph file system (I’m not really too interested in
>   users using CephFS to host VMs, since many of these use cases are
>   migrating to RBD)
>
> I’m just trying to get an idea of what’s out there and the problems
> CephFS users encounter as a result of a bottlenecked MDS (single node
> or cluster).
>
> Thanks!
>
> Michael
>
> [1] CephFS MDS Status Discussion,
> http://ceph.com/dev-notes/cephfs-mds-status-discussion/
> [2] CephFS First Product Release Discussion,
> http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13524
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Matt Benjamin
CohortFS, LLC.
206 South Fifth Ave. Suite 150
Ann Arbor, MI 48104
http://cohortfs.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309