> On 12 Apr 2016, at 20:00, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Tue, 12 Apr 2016, Jan Schermer wrote:
>> I'd like to raise these points, then
>>
>> 1) some people (like me) will never ever use XFS if they have a choice
>> given no choice, we will not use something that depends on XFS
>>
>> 2) choice is always good
>
> Okay!
>
>> 3) doesn't majority of Ceph users only care about RBD?
>
> Probably that's true now. We shouldn't recommend something that prevents
> them from adding RGW to an existing cluster in the future, though.
>
>> (Angry rant coming)
>> Even our last performance testing of Ceph (Infernalis) showed abysmal
>> performance. The most damning sign is the consumption of CPU time at
>> unprecedented rate. Was it faster than Dumpling? Slightly, but it ate
>> more CPU also, so in effect it was not really "faster".
>>
>> It would make *some* sense to only support ZFS or BTRFS because you can
>> offload things like clones/snapshots and consistency to the filesystem -
>> which would make the architecture much simpler and everything much
>> faster. Instead you insist on XFS and reimplement everything in
>> software. I always dismissed this because CPU time was usually cheap,
>> but in practice it simply doesn't work. You duplicate things that
>> filesystems had solved for years now (namely crash consistency - though
>> we have seen that fail as well), instead of letting them do their work
>> and stripping the IO path to the bare necessity and letting someone
>> smarter and faster handle that.
>>
>> IMO, if Ceph was moving in the right direction there would be no
>> "supported filesystem" debate, instead we'd be free to choose whatever
>> is there that provides the guarantees we need from filesystem (which is
>> usually every filesystem in the kernel) and Ceph would simply distribute
>> our IO around with CRUSH.
>>
>> Right now CRUSH (and in effect what it allows us to do with data) is
>> _the_ reason people use Ceph, as there simply wasn't much else to use
>> for distributed storage. This isn't true anymore and the alternatives
>> are orders of magnitude faster and smaller.
>
> This touched on pretty much every reason why we are ditching file
> systems entirely and moving toward BlueStore.

Nooooooooooooooo!

>
> Local kernel file systems maintain their own internal consistency, but
> they only provide what consistency promises the POSIX interface
> does--which is almost nothing.

... which is exactly what everyone expects
... which is everything any app needs

> That's why every complicated data
> structure (e.g., database) stored on a file system ever includes its own
> journal.

... see?

> In our case, what POSIX provides isn't enough. We can't even
> update a file and its xattr atomically, let alone the much more
> complicated transitions we need to do.

... have you thought that maybe xattrs weren't meant to be abused this
way? Filesystems usually aren't designed to be performant key=value
stores.
btw at least i_version should be atomic?

And I still feel (ironically) that you don't understand what journals and
commits/flushes are for if you make this argument...

Btw I think at least i_version xattr could be atomic.

> We could "wing it" and hope for
> the best, then do an expensive crawl and rsync of data on recovery, but we
> chose very early on not to do that. If you want a system that "just"
> layers over an existing filesystem, you can try Gluster (although note
> that they have a different sort of pain with the ordering of xattr
> updates, and are moving toward a model that looks more like Ceph's backend
> in their next version).

True, which is why we dismissed it.

>
> Offloading stuff to the file system doesn't save you CPU--it just makes
> someone else responsible.
> What does save you CPU is avoiding the
> complexity you don't need (i.e., half of what the kernel file system is
> doing, and everything we have to do to work around an ill-suited
> interface) and instead implementing exactly the set of features that we
> need to get the job done.

In theory you are right.
In practice in-kernel filesystems are fast, and fuse filesystems are slow.
Ceph is like that - slow. And you want to be fast by writing more code :)

>
> FileStore is slow, mostly because of the above, but also because it is an
> old and not-very-enlightened design. BlueStore is roughly 2x faster in
> early testing.

... which is still literally orders of magnitude slower than a filesystem.
I dug into bluestore and how you want to implement it, and from what I
understood you are reimplementing what the filesystem journal does...
It makes sense it will be 2x faster if you avoid the double-journalling,
but I'd be very much surprised if it helped with CPU usage one bit - I
certainly don't see my filesystems consuming a significant amount of CPU
time on any of my machines, and I seriously doubt you're going to do that
better, sorry.

>
> Finally, remember you *are* completely free to run Ceph on whatever file
> system you want--and many do. We just aren't going to test them all for
> you and promise they will all work. Remember that we have hit different
> bugs in every single one we've tried. It's not as simple as saying they
> just have to "provide the guarantees we need" given the complexity of the
> interface, and almost every time we've tried to use "supported" APIs that
> are remotely unusual (fallocate, zeroing extents... even xattrs) we've
> hit bugs or undocumented limits and idiosyncrasies on one fs or another.

This can be a valid point, those are features people either don't use, or
use quite differently. But just because you can stress the filesystems
until they break doesn't mean you should go write a new one.
What makes you think you will do a better job than all the people who
made xfs/ext4/...?

Anyway, I don't know how else to debunk the "insufficient guarantees in
POSIX filesystem transactions" myth that you insist on fixing, so I guess
I'll have to wait until you rewrite everything up to the drive firmware to
appreciate it :)

Jan

P.S. A joke for you

How many syscalls does it take for Ceph to write "lightbulb" to the disk?
10 000

ha ha?

>
> Cheers-
> sage
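
For readers following the exchange above about updating a file and its xattr atomically, and about applications carrying their own journal, here is a minimal sketch of the general write-ahead idea. It is not Ceph's FileStore or BlueStore code. Everything in it is an assumption made for illustration: the function names, the `journal` file, and the `object.attr` sidecar file that stands in for an xattr so the example runs on any filesystem.

```python
import json
import os
import tempfile


def _durable_write(path, payload):
    """Write bytes to path and fsync the file (directory fsync omitted for brevity)."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())


def apply_record(dirpath, record, crash_after_data=False):
    """Apply both halves of an update; return False if a crash was simulated."""
    entry = json.loads(record)
    _durable_write(os.path.join(dirpath, "object"), entry["data"].encode())
    if crash_after_data:
        return False  # hypothetical power loss between the two writes
    _durable_write(os.path.join(dirpath, "object.attr"), entry["attr"].encode())
    return True


def journaled_update(dirpath, data, attr, crash_after_data=False):
    """Update the data file and its attribute as one logical transaction."""
    journal = os.path.join(dirpath, "journal")
    record = json.dumps({"data": data, "attr": attr}).encode()
    # 1. Make the intent durable before touching either target.
    _durable_write(journal + ".tmp", record)
    os.replace(journal + ".tmp", journal)
    # 2. Apply both halves; a crash in here leaves the journal behind.
    if not apply_record(dirpath, record, crash_after_data):
        return
    # 3. Retire the journal entry once both halves are written.
    os.remove(journal)


def replay(dirpath):
    """On startup, re-apply any journal entry that was not retired."""
    journal = os.path.join(dirpath, "journal")
    if os.path.exists(journal):
        with open(journal, "rb") as f:
            apply_record(dirpath, f.read())
        os.remove(journal)


def read_state(dirpath):
    """Return (data, attr) as currently stored on disk."""
    with open(os.path.join(dirpath, "object")) as f:
        data = f.read()
    with open(os.path.join(dirpath, "object.attr")) as f:
        attr = f.read()
    return data, attr


def demo():
    """Run one clean update, one torn update, then recover via replay."""
    d = tempfile.mkdtemp()
    journaled_update(d, "v1", "version=1")
    journaled_update(d, "v2", "version=2", crash_after_data=True)
    torn = read_state(d)
    replay(d)
    return torn, read_state(d)
```

Calling `demo()` returns the torn state seen after the simulated crash (new data, old attribute) followed by the state after `replay()`. The sketch also shows the cost both sides are arguing about: every logical update is written twice, once to the journal and once in place.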