We've been doing a lot of work on CephFS over the past few months. This is an update on the current state of things as of Giant.

What we've been working on:

* better MDS/CephFS health reports to the monitor
* an MDS journal dump/repair tool
* many kernel and ceph-fuse/libcephfs client bug fixes
* file size recovery improvements
* client session management fixes (and tests)
* admin socket commands for diagnosis and admin intervention
* many bug fixes

We started using CephFS to back the teuthology (QA) infrastructure in the lab about three months ago. We fixed a bunch of stuff over the first month or two (several kernel bugs, a few MDS bugs) and have had no problems for the last month or so. We're currently running 0.86 (the Giant release candidate) with a single MDS and ~70 OSDs. Clients are running a 3.16 kernel plus several fixes that went into 3.17.

With Giant, we are at a point where we would ask that everyone try things out for any non-production workloads. We are very interested in feedback around stability, usability, feature gaps, and performance. We recommend:

* A single active MDS. You can run any number of standby MDSes, but we are not focusing on multi-MDS bugs just yet (and our existing multi-MDS test suite is already hitting several).
* No snapshots. These are disabled by default and require a scary admin command to enable. Although they mostly work, there are several known issues that we haven't addressed, and they complicate things immensely. Please avoid them for now.
* Either the kernel client (kernel 3.17 or later) or the userspace clients (ceph-fuse or libcephfs); both are in good working order. (A quick mount example for each is included below.)

The key missing feature right now is fsck (both check and repair). This is *the* development focus for Hammer.

Here's a more detailed rundown of the status of various features:

* multi-MDS: implemented. limited test coverage. several known issues. use only for non-production workloads and expect some stability issues that could lead to data loss.
* snapshots: implemented. limited test coverage. several known issues. use only for non-production workloads and expect some stability issues that could lead to data loss.
* hard links: stable. no known issues, but test coverage is somewhat limited (we don't test creating huge link farms).
* direct I/O: implemented and tested for the kernel client. no special support needed for ceph-fuse (the kernel FUSE driver handles it).
* xattrs: implemented, stable, tested. no known issues (for both kernel and userspace clients).
* ACLs: implemented and tested for the kernel client. not implemented for ceph-fuse.
* file locking (fcntl, flock): supported and tested for the kernel client, though test coverage is limited and there is one known minor kernel issue with a fix pending. implementation in progress for ceph-fuse/libcephfs.
* kernel fscache support: implemented. no test coverage. used in production by Adfin.
* Hadoop bindings: implemented. limited test coverage. a few known issues.
* Samba VFS integration: implemented. limited test coverage.
* Ganesha NFS integration: implemented. no test coverage.
* kernel NFS reexport: implemented. limited test coverage. no known issues.

Anybody who has experienced bugs in the past should be excited by:

* new MDS admin socket commands to look at pending operations and client session states. (Check them out with "ceph daemon mds.a help"!) These will make diagnosing, debugging, and even fixing issues a lot simpler.
* cephfs-journal-tool, which can manipulate MDS journal state without resorting to difficult exports/imports and hexedit.

A rough example of both follows below.
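To give a flavor of what those look like, here is a rough sketch ("mds.a" is just an example daemon name, and the exact sub-command names may vary a bit between versions):

  # list everything the MDS admin socket supports
  ceph daemon mds.a help

  # a couple of the new diagnostic commands
  ceph daemon mds.a session ls
  ceph daemon mds.a dump_ops_in_flight

  # inspect or back up the MDS journal (normally run while the MDS is offline)
  cephfs-journal-tool journal inspect
  cephfs-journal-tool journal export backup.bin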
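To actually try things out per the client recommendations above, mounting looks roughly like this (the monitor address, secret file, and mount point are placeholders; adjust for your own auth setup):

  # kernel client (3.17 or later)
  mount -t ceph mon-host:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

  # ceph-fuse userspace client
  ceph-fuse -m mon-host:6789 /mnt/cephfs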
Thanks!
sage