Thanks for all your hard work on CephFS. This progress is very exciting to hear about. I am constantly amazed at the amount of work that gets done in Ceph in so short an amount of time.
On Mon, Apr 20, 2015 at 6:26 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
We’ve been hard at work on CephFS over the last year since Firefly was released, and with Hammer coming out it seemed like a good time to go over some of the big developments users will find interesting. Much of this is cribbed from John’s Linux Vault talk (http://events.linuxfoundation.org/sites/events/files/slides/CephFS-Vault.pdf), in addition to the release notes (http://ceph.com/docs/master/release-notes/).
===========================================================================
New Filesystem features & improvements:
ceph-fuse has gained support for fcntl and flock locking. (Yan, Zheng) This has been in the kernel for a while but nobody had done the work to implement tracking structures and wire it up in userspace.
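For anyone who wants to check the new locking from userspace, here is a minimal sketch using Python's standard fcntl module. The helper names (try_exclusive_lock, release_lock) are made up for illustration, and the usage section runs against a local temporary file; to exercise ceph-fuse you would point it at a file on your own mount instead.

```python
import errno
import fcntl
import os
import tempfile


def try_exclusive_lock(path):
    """Open path and try a non-blocking exclusive flock on it.

    Returns the open file object on success, or None if another open
    file description already holds a conflicting lock.
    """
    f = open(path, "a+")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        f.close()
        if exc.errno in (errno.EAGAIN, errno.EACCES):
            return None
        raise
    return f


def release_lock(f):
    """Drop the flock and close the file."""
    fcntl.flock(f, fcntl.LOCK_UN)
    f.close()


if __name__ == "__main__":
    # Hypothetical usage on a local temp file; substitute a path on a
    # ceph-fuse mount to test the client itself.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    holder = try_exclusive_lock(path)
    print("second locker refused:", try_exclusive_lock(path) is None)
    release_lock(holder)
    os.remove(path)
```

flock locks belong to the open file description, which is why a second open() of the same file is refused even inside one process.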
ceph-fuse has gained support for soft quotas, enforced on the client side. (Yunchuan Wen) The Ubuntu Kylin guys worked on this for quite a while and we thank them for their work and their patience. You can now specify soft quotas on a directory and ceph-fuse will behave as you’d expect from that.
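As a rough sketch of what setting a quota might look like from Python: the extended attribute names below (ceph.quota.max_bytes and ceph.quota.max_files) come from the Ceph quota documentation rather than from this mail, so treat them as an assumption and check the docs for your release. The function names are illustrative only.

```python
import os

# Assumption: attribute names as described in the Ceph quota docs.
QUOTA_BYTES_XATTR = "ceph.quota.max_bytes"
QUOTA_FILES_XATTR = "ceph.quota.max_files"


def quota_xattrs(max_bytes=None, max_files=None):
    """Return {xattr name: value as bytes} for the limits requested."""
    attrs = {}
    for name, limit in ((QUOTA_BYTES_XATTR, max_bytes),
                        (QUOTA_FILES_XATTR, max_files)):
        if limit is None:
            continue
        if limit < 0:
            raise ValueError("quota limits must be non-negative")
        attrs[name] = str(limit).encode("ascii")
    return attrs


def set_quota(directory, max_bytes=None, max_files=None):
    """Apply the limits to a directory on a ceph-fuse mount (Linux only)."""
    for name, value in quota_xattrs(max_bytes, max_files).items():
        os.setxattr(directory, name, value)
```

For example, set_quota("/mnt/cephfs/projects", max_bytes=10 * 2**30) would ask for a 10 GiB limit on a hypothetical directory; the mount point is a placeholder.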
Hadoop support has been generally improved and updated. (Noah Watkins, Huamin Chen) It now works against the 2.0 API, the tests we run in our lab are more sophisticated, and it’s a lot friendlier to install with Maven and other Java tools. Noah’s still doing work on this to make it as turnkey as possible, but soon you’ll just need to drop a single JAR on the system (this will include the libcephfs stuff, so you don’t even need to worry about those packages and compatibility!) and change a few config options.
ceph-fuse and CephFS as a whole now have much-improved full space handling. If you run out of space at the RADOS layer you will get ENOSPC errors in the client (instead of it retrying indefinitely), and these errors (and others) are now propagated out to fsync and fclose calls.
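Since errors now surface at fsync and close, applications need to check both calls to notice a full cluster. A minimal sketch of that pattern in Python follows; durable_write is a hypothetical helper name, and the usage section writes to a local temporary directory rather than a real CephFS mount.

```python
import errno
import os
import tempfile


def durable_write(path, data):
    """Write data to path, checking write, fsync, and close for errors.

    Returns None on success, or the errno of the first failure
    (for example errno.ENOSPC when the backing store is full).
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    err = None
    try:
        remaining = memoryview(data)
        while len(remaining) > 0:
            written = os.write(fd, remaining)
            remaining = remaining[written:]
        os.fsync(fd)  # deferred write errors can be reported here
    except OSError as exc:
        err = exc.errno
    try:
        os.close(fd)  # ...and here, so close needs checking too
    except OSError as exc:
        if err is None:
            err = exc.errno
    return err


if __name__ == "__main__":
    # Hypothetical usage; substitute a path on a CephFS mount.
    target = os.path.join(tempfile.mkdtemp(), "example.dat")
    result = durable_write(target, b"hello")
    if result == errno.ENOSPC:
        print("cluster is full")
    elif result is not None:
        print("write failed:", os.strerror(result))
    else:
        print("data is on stable storage")
```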
We are now much more consistent in our handling of timestamps. Previously we attempted to take the time from whichever process was responsible for making a change, which could be either a client or the MDS. But this was troublesome if their times weren’t synced — made worse by trying not to let the time move backwards — and some applications which relied on sharing mtime and ctime values as versions (Hadoop and rsync both did this in certain configurations) were unhappy. We now use a timestamp provided by the client for all operations, which has been more stable.
Certain internal data structures are now much more scalable on a per-client level. We had issues when certain “MDSTables” got too large, but John Spray sorted them out.
The reconnect phase, when an MDS is restarted or dies and the clients have to connect to a different daemon, has been made much faster in the typical case. (Yan, Zheng)
===========================================================================
Administrator features & improvements:
The MDS has gained an OpTracker, with functionality similar to that in the OSD. You can dump in-flight requests and notably slow ones from the recent past. The changes to enable this also made working with many code paths a lot easier.
We’ve changed how you create and manage CephFS file systems in a cluster. (John Spray) The “data” and “metadata” pools are no longer created by default, and management is done via monitor commands that start with “ceph fs” (e.g., “ceph fs new”). These have been designed with future extensions in mind, but for now they mostly replicate existing features with more consistency and improved repeatability/idempotency.
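Because the pools are no longer created for you, setting up a filesystem is now a short sequence of commands. The sketch below only builds and prints them. It assumes the "ceph osd pool create <pool> <pg_num>" and "ceph fs new <fs_name> <metadata_pool> <data_pool>" forms from the Ceph documentation; the pool names, filesystem name, and pg_num of 64 are placeholders, not recommendations.

```python
import shlex
import subprocess


def fs_create_commands(fs_name, metadata_pool, data_pool, pg_num=64):
    """Return the argv lists needed to create two pools and a filesystem."""
    return [
        ["ceph", "osd", "pool", "create", data_pool, str(pg_num)],
        ["ceph", "osd", "pool", "create", metadata_pool, str(pg_num)],
        # Note the argument order: metadata pool first, then data pool.
        ["ceph", "fs", "new", fs_name, metadata_pool, data_pool],
    ]


def run_all(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Placeholder names; dry run only, nothing is sent to a cluster.
    run_all(fs_create_commands("cephfs", "cephfs_metadata", "cephfs_data"))
```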
The MDS now reports on a variety of health metrics to the monitor, joining the existing OSD and monitor health reports. These include information on misbehaving clients and MDS data structures. (John Spray)
The MDS admin socket now includes a bunch of new commands. You can examine and evict client sessions, plus do things around filesystem repair (see below).
The MDS now gathers metadata from the clients about who they are and shares that with users via a variety of helpful interfaces and warning messages. (John Spray)
===========================================================================
Recovery tools
We have a new MDS journal format and a new cephfs-journal-tool. (John Spray) This eliminates the days of needing to hexedit a journal dump in order to let your MDS start back up — you can inspect the journal state (human-readable or json, great for our testing!) and make changes on a per-event level. It also includes the ability to scan through hopelessly broken journals and parse out whatever data is available for flushing to the backing RADOS objects.
Similarly, there’s a cephfs-table-tool for working with the SessionTable, InoTable, and SnapTable. (John Spray)
We’ve added new “scrub_path” and “flush_path” commands to the admin socket. These are fairly limited right now but will check that both directories and files are self-consistent. They’re a building block for the "forward scrub" and fsck features that I’ve been working on, and include a lot of code-level work to enable those.
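As a rough illustration of driving these from a script, the sketch below builds admin socket invocations using the general "ceph daemon <type.id> <command>" form. The MDS name "a" and the path are placeholders, and I'm assuming each command takes a single path argument; check the MDS admin socket help output on your own system before relying on it.

```python
import shlex
import subprocess


def mds_admin_command(mds_id, command, *args):
    """Build the argv for an MDS admin socket command on the local host."""
    return ["ceph", "daemon", "mds." + mds_id, command, *args]


def scrub_then_flush(mds_id, path, dry_run=True):
    """Scrub a path and then flush it; print only when dry_run is True."""
    commands = [
        mds_admin_command(mds_id, "scrub_path", path),
        mds_admin_command(mds_id, "flush_path", path),
    ]
    for argv in commands:
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    # Placeholder MDS name and path; dry run only.
    scrub_then_flush("a", "/some/dir")
```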
===========================================================================
Performance improvements
Both the kernel and userspace clients are a lot more efficient with some of their “capability” and directory content handling. This lets them serve a lot more out of local cache, a lot more often, than they were able to previously. This is particularly noticeable in workloads where a single client “owned” a directory but another client periodically peeked in on it.
There are also a bunch of extra improvements in this area that have gone in since Hammer and will be released in Infernalis. ;)
The MDS journaling code has been split into a separate thread. (Yan, Zheng) This has increased maximum throughput a fair bit and is the first major improvement enabled by John’s work to start breaking down the big MDS lock. (We still have a big MDS lock, but it no longer covers the journal or the Objecter. Setting up the interfaces to make that manageable should make future lock sharding and changes a lot simpler than they would have been previously.)
===========================================================================
Developer & test improvements
In addition to a slightly expanded set of black-box tests, we are now testing FS behaviors to make sure everything behaves as expected in specific scenarios (failure and otherwise). This is largely thanks to John, but we’re doing more with it in general as we add features that can be tested this way.
As alluded to in previous sections, we’ve done a lot of work that makes the MDS codebase a lot easier to work with. Interfaces, if not exactly bright and shining, are a lot cleaner than they used to be. Locking is a lot more explicit and easier to reason about in many places. There are fewer special paths for specific kinds of operations, and a lot more shared paths that everything goes through — which means we have more invariants we can assume on every operation.
===========================================================================
Notable bug reductions
Although we continue to leave snapshots disabled by default and don’t recommend multi-MDS systems, both of these have been *dramatically* improved by Zheng’s hard work. Our multimds suite now passes almost all of the existing tests, whereas it previously failed most of them (http://pulpito.ceph.com/?suite=multimds). Our snapshot tests pass reliably, using snapshots is no longer a shortcut to breaking your system, and bugs are less likely to leave your entire filesystem inaccessible.
There’s a lot more I haven’t discussed above, like how the entire stack is a lot more tolerant of failures elsewhere than it used to be and so bugs are less likely to make your entire filesystem inaccessible. But those are some of the biggest features and improvements that users are likely to notice or might have been waiting on before they decided to test it out. It’s nice to reflect occasionally — I knew we were getting a lot done, but this list is much longer than I’d initially thought it would be!
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com