Hi Michael,

Thanks for posting this. We don't have specific workload information, but we did want to mention some of the experimental Cephfs development we (Cohortfs) have been doing, in case it might be of interest to others in the community.

One of the projects we've undertaken is to implement pnfs-metastripe (a proposal for scale-out metadata in NFSv4) on Cephfs. In doing that, we've essentially been evolving a metastripe-flavored version of Cephfs, building on previous work to provide first-class lookup-by-inode# support (more below).

Our current codebase has a number of changes. In support of metastripe, we've augmented directory fragmentation with the concept of stripes, each of which can be locked and modified independently. To permit parallel updates on stripes, clients take "stripe caps" in place of a single capset on directories. We've also extended the Ceph cap model to support in-place state updates, as well as invalidates.

We also have a group of changes intended to increase MDS workload independence, including more independent caching. There are many cases where a Ceph MDS needs to get a cache replica of an object from its auth MDS. Most are needed to satisfy a client request (like a rename from one MDS to another). Many others, however, are necessitated by the reliance on full paths to locate objects: every cache object must have cache replicas of all of its parent objects in order to make these traversals possible. These extra cache replicas have a cost in memory, lock latency, and messaging overhead that limits scalability.

All of these overheads are essentially side effects of Ceph's method of storing inodes with their primary dentry. We're attacking this by storing inodes in a separate container, which is itself striped across MDS nodes to enable lookup-by-ino with a simple placement function.
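To make the idea concrete, here's a minimal sketch (in Python for brevity) of the kind of placement function we mean: map an inode number to a stripe of the inode container, and a stripe to its authoritative MDS, with no parent-path traversal. The names, constants, and modulo layout here are purely illustrative, not our actual code:

```python
# Illustrative sketch only -- not Cohortfs/Cephfs code.
# An inode container striped across MDS nodes lets any client or MDS
# locate an inode's authority directly from the inode number.

STRIPE_COUNT = 256   # stripes in the inode container (stripe count >> mds count)
MDS_COUNT = 8        # metadata servers in the cluster

def stripe_of(ino):
    """Map an inode number to a stripe of the inode container."""
    return ino % STRIPE_COUNT

def mds_of(stripe):
    """Map a stripe to the MDS currently authoritative for it.

    A simple modulo here; in practice this would be a small table so
    that failover or rebalancing can reassign a stripe to another MDS.
    """
    return stripe % MDS_COUNT

def lookup_by_ino(ino):
    """Resolve an inode to (stripe, authoritative MDS) without any
    path traversal or replication of parent directory objects."""
    s = stripe_of(ino)
    return s, mds_of(s)
```

The point of the indirection through stripes is that the stripe-to-MDS mapping can change (for load balancing or failover) without changing where any inode lives in the container.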
Obviously, this last design change is a big one, which trades away some of Cephfs' inlining properties for parallel performance and better NFS tuning.

We have other client and MDS work planned and/or in progress: client concurrency work (in progress), MDS concurrency work (planned), MDS cache management changes (planned), and client cache management changes (in progress). We're looking to add the ability to journal inode updates as deltas, in order to compress the journals and speed up replay. Further down the line, we'd like to create a journal for each stripe of the inode container (where stripe count >> mds count), rather than tying journals to an individual MDS. This would facilitate load balancing and failover by allowing any MDS to become authoritative for a stripe of inodes by replaying that stripe's journal.

One of our main goals is for a plurality of Cohortfs (and Cephfs) file systems to coexist in a Ceph cluster, in separate or unified namespaces.

In sum, our Cohortfs version of Cephfs makes some tradeoffs that we expect to perform better on some workloads, and perhaps worse on others, but some of the work we've done may also be useful to traditional Cephfs. We've been working entirely on our own so far, but this is open source work. We welcome feedback, and if there are others in the community interested in collaborating in these or related areas, you're welcome to join in.

Matt, Casey, Adam, Marcus

----- "Michael Sevilla" <mikesevilla3@xxxxxxxxx> wrote:
> Hi Ceph community,
>
> I’d like to get a feel for some of the problems that CephFS users are
> encountering with single MDS deployments. There were requests for
> stable distributed metadata/MDS services [1], and I’m guessing it’s
> because your workloads exhibit many, many metadata operations. Some of
> you mentioned opening many files in a directory for checkpointing,
> recursive stats on a directory, etc.
> [2] and I’d like more details, such as:
> - workloads/applications that stress the MDS service that would cause
>   you to call for multi-MDS support
> - use cases for the Ceph file system (I’m not really too interested in
>   users using CephFS to host VMs, since many of these use cases are
>   migrating to RBD)
>
> I’m just trying to get an idea of what’s out there and the problems
> CephFS users encounter as a result of a bottlenecked MDS (single node
> or cluster).
>
> Thanks!
>
> Michael
>
> [1] CephFS MDS Status Discussion,
> http://ceph.com/dev-notes/cephfs-mds-status-discussion/
> [2] CephFS First Product Release Discussion,
> http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13524
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Matt Benjamin
CohortFS, LLC.
206 South Fifth Ave. Suite 150
Ann Arbor, MI 48104
http://cohortfs.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309