Don't laugh. I am experimenting with Ceph in an enthusiast, small-office/home-office setting. Yes, this is not the conventional use case, but I think Ceph almost is, almost could be, suited for this.

Do I need to explain why? These kinds of people (i.e. me) already run RAID, and maybe CIFS/Samba or NFS, on half-a-dozen or more machines. And RAID is OK, I guess, as long as your mirrors are in sync. But if you have three disks in a mirror, do they vote on who has the best data? Not mdraid, no. And fsck takes hours on a multi-terabyte system with old disks. There's no CRC on data, just metadata, so maybe there's silent corruption in the background. Who knows. And being the admin for CIFS/NFS in a small network is not exactly a fulfilling experience.

So, hey, you know what? Perhaps CephFS can replace these. Perhaps CephFS is better. Perhaps CephFS is ready for this. I mean, I like ext4fs, but only because btrfs hates me. I'm loyal. But I'm here now. Seriously, I think that, with just a little bit of polishing and automation, Ceph could be deployed in the small-office/home-office setting. Don't laugh. This could happen.

Sadly, I am seeing a tiny bit of data corruption. I said don't laugh.

I'm resource-constrained, server- and disk-constrained, and time-unconstrained. So I thought I'd bootstrap up from nothing. Manually.

First, one server, with two OSDs on two partitions on two different disks. Works great! For the record, this should be no worse than RAID mirroring on one host, and so is entirely acceptable in the home/small-office environment. I move (not copy, but move) a few hundred GB of data I can afford to lose. Of course there are "degraded+undersized" warnings, but that's hardly a surprise.

I add a second server, and a third OSD. The initial system is running Ceph "reef", because it's running Debian testing, which packages reef. The second system has Ceph "pacific", because it's Debian stable, and that's what Debian stable currently packages.
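(For anyone following along at home: a mixed-release cluster like this can be sanity-checked from the CLI. A sketch, nothing more; these need a running cluster:)

```shell
# Confirm which daemons are running which release (reef vs. pacific here).
ceph versions

# Overall health, including the degraded/undersized PG counts.
ceph -s
ceph health detail   # names the specific PGs behind each warning
```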
(Yes, pacific is past EOL. But it's in Debian stable, so.... The next version of Debian stable won't show up until next summer.) I thought reef and pacific might not talk to one another, but they do, and happily so. Excellent! +1 Ceph!

The system starts rebalancing onto the third OSD. After running all night, it gets to two-thirds or three-quarters clean, with maybe a quarter degraded+undersized. Surprising at first, because I had size/min_size set to 3/2 and I had 3 OSDs, so ... ??? Later I understood that CRUSH wants three *hosts*, but whatever. I experiment with changing size/min_size to 2/1 and the warnings go away! Going back to 3/2, and they reappear.

I am somewhat unhappy: before switching to 2/1, I saw that most PGs were on most OSDs. Good! Switching to 2/1 made many of them disappear (!) and switching back to 3/2 did not make them reappear. The OSDs seem mildly underpopulated, and seem to remain that way. Huh. Who am I to say, but I would have liked to see one copy of *every* PG on *every* OSD, because that's what size=3 should mean, right? Even if there are only two servers? Or should I select a different CRUSH rule? (I have not played with this yet.) I would sleep better if all three disks had a copy, because that would be an actual improvement over bog-standard RAID mirroring.

Anyway, I add a 4th OSD on the second server (so that there are 4 total) and I add a third server with zero OSDs (so that ceph-mon can form a proper quorum). By now I have over a TB of data on there (that I can afford to lose).

So that's the backstory. Here's the problem.

Last night, I move (not copy, but move) a 33 GB directory onto CephFS. This morning I see:

mv: cannot overwrite directory '/dest/.snap' with non-directory '/source/.snap'

wtf? I cd to /dest/.snap and indeed it is a directory. And it's empty. The source .snap is a file, 736 bytes in size. The directory timestamp and the source-file timestamp are identical. I attempt a hand repair.
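(An aside, for anyone wanting to reproduce the size/min_size experiment above: the flips amount to commands along these lines. The pool name `cephfs_data` is my assumption; check `ceph osd pool ls` for the real names. And if the goal is a copy on every disk despite having only two hosts, I believe the knob is a CRUSH rule whose failure domain is `osd` rather than the default `host`; I have not tried it.)

```shell
ceph osd pool set cephfs_data size 2      # with min_size 1, the warnings vanish
ceph osd pool set cephfs_data min_size 1
ceph osd pool set cephfs_data size 3      # back to 3/2, the warnings return
ceph osd pool set cephfs_data min_size 2
ceph pg dump pgs_brief                    # see which OSDs each PG actually landed on

# Untested by me: replicate across OSDs instead of hosts, so that size=3
# can be satisfied on two servers.
ceph osd crush rule create-replicated rep-by-osd default osd
ceph osd pool set cephfs_data crush_rule rep-by-osd
```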
I cd to the parent dir and do `rmdir .snap`, and rmdir says "no such directory". wtf? So then `ls -la`, and indeed, there is no such directory. But I can still cd into it! Huh. I cannot rmdir it, I cannot cp or mv it, because "it doesn't exist". If I cd into it, I cannot create any files in it; I get "permission denied", even as root. So it's a directory that won't show in the listing of the parent dir; I can cd to it, but I cannot put anything in it.

This, to me, says "CephFS bug". Other culprits could have been: (a) a bad disk, or (b) corruption during the copy, i.e. bad fs state during the copy, due to (b1) a bad kernel, (b2) bad SATA traffic, (b3) a corrupt kernel page table, buffer, block cache, etc., or (b4) bad data on the source disk. But all this is in the past: I can look at the source file: it's there, it's readable, it has valid contents. (It's ASCII text. It looks just fine.) Clearly some transient error perturbed things. However, I now have a CephFS with a directory that I cannot remove, that I cannot place anything into, and whose name I cannot reuse for a file. So that's not good.

Sometime after I hit send on this email, Ceph will automatically run scrub and/or deep scrub, and maybe this problem will fix itself. If there's something I should do to preserve this broken-ness for posterity debugging, tell me now, before it's too late.

While I'm on this topic: the failed `mv` means that the source was still there, and so I was able to run `diff -rq --no-dereference` on the original source and the copy. I discovered that there were 5 files that were borked. They all had the same timestamps as the originals, but the content was all zeros (as reported by od). I found that I could repair these by hand-copying the originals over them, and the result is fine.

Yes, this is a frankenstein system. Yes, it has three mons and three mgrs and three MDSs running on three machines, but OSDs on only two of them. I'll add more OSDs, maybe tomorrow.
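(Lesson learned, maybe: `mv` onto a filesystem you don't yet trust should really be copy, verify, then delete. A minimal sketch of what I wish I had done; `safer_mv` is my own made-up helper, not a real tool:)

```shell
# Copy a directory tree, verify it byte-for-byte, and delete the source
# only if the verification passes. Keeps the source on any mismatch.
safer_mv() {
    src="$1"; dst="$2"                       # dst must not already exist
    cp -a -- "$src" "$dst" || return 1
    # Same diff I ran above: recursive, quiet, don't follow symlinks.
    diff -rq --no-dereference -- "$src" "$dst" || return 1
    rm -rf -- "$src"
}
```

Usage: `safer_mv /source/dir /cephfs/dir`.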
But meanwhile, I have a system I can lose sleep over. The five files that were corrupted were corrupted *silently*; I would never have known, if not for the weird-dir bug prompting me to go do the diff and find the corruption. This was on a copy of 33 GB containing 412K files. I'd previously moved over a TB of files. Are any of those damaged? God knows. I was sloppy: I did not keep the originals, I did not do a diff. I trusted CephFS to do the right thing, and it seems it didn't. That trust is shattered.

How often does ext4fs do something like this? I don't know. I've used ext2/3/4fs for 25 years on maybe a hundred different machines, and never lost data (that wasn't user error, i.e. `rm -r *` in the wrong directory). This is with 25 years of consumer-grade, low-quality PATA/SATA and sometimes SCSI gear, plugged into consumer-grade boxes cobbled from parts bought on Newegg or maybe refurbished from Dell. (Yeah, I also get to use huge high-end machines, but this is "for example".) If there was data loss during those decades, I'm not aware of it. Was there any? The laws of probability say "yes". Once upon a time, I managed to corrupt an ext2fs disk so that e2fsck could not fix it. Yes, *hit happens. But I've been using Ceph for barely a week, and it's 2024, so it should be mature, and I'm not thrilled that I hit a bug so early and so easily. WTF.

I did read the paper "File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution" https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf which is an awesome paper, by the way; kudos to the authors. It does have one misrepresentation, though. It says, paraphrasing: "it typically takes ten years for a file system to mature, and we here at Ceph/BlueStore did it in only two (glow)". That paper was published in 2019, it's now 2024, and off-the-shelf CephFS is clearly and blatantly buggy. Ceph is still awesome, but you cannot blame this bug on crappy hardware or a frankenstein sysconfig.
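(For the TB I moved without keeping originals, the best I can do now is hunt for the corruption signature: files of nonzero length whose content is all zero bytes. A rough sketch; note it will also flag files that legitimately contain only NULs, so treat hits as suspects, not convictions:)

```shell
# Print regular files under $1 that are nonzero length but contain
# nothing except NUL bytes -- the same signature od showed above.
find_zeroed() {
    find "$1" -type f -size +0c | while IFS= read -r f; do
        # tr deletes every NUL; if not even one byte survives,
        # the file was all zeros.
        if [ -z "$(tr -d '\0' < "$f" | head -c 1)" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```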
Either the corner cases work, or they don't, and the authors of other filesystems take a decade or two to get through all of those corner cases. It would seem that CephFS has not evaded this iron law.

Sorry for ending on such a downer, but hey... I'm one of those perfectionists who wreck things for the good-enough people. So it goes.

-- Linas

--
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx