I haven’t gone through all the details, since you seem to know you’ve done some odd stuff, but the “.snap” issue is because you’ve run into a CephFS feature which I recently discovered is embarrassingly under-documented: https://docs.ceph.com/en/reef/dev/cephfs-snapshots

So that’s a special fake directory used to take snapshots using mkdir, and trying to do other things with it will result in errors. You can expose it under a different name by setting client configuration: “snapdirname” is a mount option with kernel clients, or “client_snapdir” with userspace clients (ceph-fuse).
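For example, something like this (the mount point, client name, and the ".snapshots" name here are just placeholders):

    # kernel client: expose the snapshot directory under a different name
    mount -t ceph :/ /mnt/cephfs -o name=admin,snapdirname=.snapshots

    # ceph-fuse: the equivalent client option, e.g. in the client's ceph.conf
    [client]
        client_snapdir = .snapshots

Either way the snapshot directory keeps behaving the same; only the name it is exposed under changes.
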
-Greg

On Tue, Nov 26, 2024 at 5:12 PM Linas Vepstas <linasvepstas@xxxxxxxxx> wrote:

> Don't laugh. I am experimenting with Ceph in an enthusiast, small-office, home-office setting. Yes, this is not the conventional use case, but I think Ceph almost is, or almost could be, used for this. Do I need to explain why? These kinds of people (i.e. me) already run RAID. And maybe CIFS/Samba or NFS. On half-a-dozen or more machines.
>
> And RAID is OK, I guess, as long as your mirrors are in sync. And if you have three disks in a mirror, do they vote as to who has the best data? Not mdraid, no. And fsck takes hours on a multi-terabyte system with old disks. There's no CRC on data, just metadata. Maybe there's silent corruption in the background. Who knows.
>
> And being admin for CIFS/NFS in a small network is not exactly a fulfilling experience. So, hey, you know what? Perhaps CephFS can replace these. Perhaps CephFS is better. Perhaps CephFS is ready for this. I mean, I like ext4fs, but only because btrfs hates me. I'm loyal. But I'm here now.
>
> Seriously, I think that, with just a little bit of polishing and automation, Ceph could be deployed in the small-office/home-office setting. Don't laugh. This could happen.
>
> Sadly, I am seeing a tiny bit of data corruption.
>
> I said don't laugh. I'm resource-constrained, server- and disk-constrained, and time-unconstrained. So I thought I'd bootstrap up from nothing. Manually. First, one server, with two OSDs from two partitions on two different disks. Works great! For the record, this should be no worse than RAID mirroring on one host, and so is 100% entirely acceptable in the home/small-office environment. I move (not copy, but move) a few hundred GB of data I can afford to lose. Of course there are "degraded+undersized" warnings, but hardly a surprise.
>
> I add a second server, and a third OSD. The initial system is running Ceph "reef", because it's running Debian testing, which has reef in it. The second system has Ceph "pacific", because it's Debian stable, and that's what Debian stable currently packages. (Yes, pacific is past EOL. But it's on Debian stable, so....) The next version of Debian stable won't show up till next summer. I thought reef and pacific might not talk to one another, but they do, and happily so. Excellent! +1 ceph!
>
> The system starts rebalancing onto the third OSD. After running all night, it gets to two-thirds or three-quarters clean, and maybe one-quarter degraded+undersized. Surprising at first, because I had size/min_size set to 3/2 and I had 3 OSDs, so... ??? Later I understood that CRUSH wants three hosts, but whatever. I experiment with changing size/min_size to 2/1 and the warnings go away! Going back to 3/2 makes them reappear. I am somewhat unhappy: before switching to 2/1, I saw most PGs were on most OSDs. Good! Switching to 2/1 made many of them disappear (!) and switching back to 3/2 did not make them reappear. The OSDs seem mildly underpopulated, and seem to remain that way.
>
> Huh. Who am I to say, but I would have liked to have seen one copy of *every* PG on *every* OSD, because that's what size=3 should mean, right? Even if there are only two servers? Or should I select a different CRUSH rule? (I have not played with this yet.) I would sleep better if all three disks had a copy, because that would be an actual improvement over bog-standard RAID mirroring.
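For what it's worth, the default replicated rule places each copy on a different host, so with size=3 and only two hosts the third copy has nowhere to go. If you want a copy on every OSD regardless of host, you can switch the pools to a rule whose failure domain is the OSD; a rough sketch, assuming the usual cephfs_data/cephfs_metadata pool names (substitute your own):

    ceph osd crush rule create-replicated replicated-by-osd default osd
    ceph osd pool set cephfs_data crush_rule replicated-by-osd
    ceph osd pool set cephfs_metadata crush_rule replicated-by-osd

The trade-off is that two replicas of a PG may then land on the same box, so you lose host-level redundancy until a third host gets OSDs.
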
> Anyway, I add a fourth OSD on the second server (so that there are 4 total), and I add a third server with zero OSDs (so that ceph-mon can form a proper quorum). And by now I have over a TB of data on there (that I can afford to lose).
>
> So that's the backstory. Here's the problem.
>
> Last night, I move (not copy, but move) a 33 GB directory onto CephFS. This morning I see:
>
> mv: cannot overwrite directory '/dest/.snap' with non-directory '/source/.snap'
>
> wtf? I cd to /dest/.snap and indeed it is a directory. And it's empty. The source .snap is a file, 736 bytes in size. The directory timestamp and the source-file timestamp are identical. I attempt a hand repair. I cd to the parent dir, run `rmdir .snap`, and rmdir says "no such directory". wtf? So then `ls -la`, and indeed, there is no such directory. But I can still cd into it! Huh. I cannot rmdir it, I cannot cp or mv it, because "it doesn't exist". If I cd into it, I cannot create any files in it; I get "permission denied", even as root. So it's a directory, but it won't show in the listing of the parent dir. I can cd to it, but I cannot put anything in it.
>
> This, to me, says "CephFS bug". Other culprits could have been: (a) a bad disk, or (b) corruption during the copy, i.e. bad fs state during the copy due to (b1) a bad kernel, (b2) bad SATA traffic, (b3) a corrupt kernel page table, buffer, block cache, etc., or (b4) bad data on the source disk. But all this is in the past: I can look at the source file: it's there, it's readable, it has valid contents (it's ASCII text; it looks just fine). Clearly some transient error perturbed things. However, I now have a CephFS with a directory that I cannot remove, that I cannot place anything into, and whose name I cannot reuse for a file. So that's not good.
>
> Sometime after I hit send on this email, Ceph will automatically run scrub and/or deep scrub, and maybe this problem will fix itself. If there's something I should do to preserve this brokenness for posterity and future debugging, tell me now before it's too late.
>
> While I'm on this topic: the failed `mv` means that the source was still there, and so I was able to run `diff -rq --no-dereference` on the original source and the copy. I discovered that there were 5 files that were borked. They all had the same timestamps as the originals, but the content was all zeros (as reported by od). I found that I could repair these by hand-copying the originals over them, and the result is fine.
>
> Yes, this is a Frankenstein system. Yes, it has three mons, three mgrs, and three MDS daemons running on three machines, but OSDs on only two of them. I'll add more OSDs, maybe tomorrow.
>
> But meanwhile, I have a system I can lose sleep over. The five files that were corrupted were corrupted *silently*; I would never have known if not for the weird-dir bug allowing me to go do the diff and find the corruption. This was on a copy of 33 GB containing 412K files. I'd previously moved over a TB of files. Are any of these damaged? God knows. I was sloppy: I did not keep the originals, I did not do a diff. I trusted CephFS to do the right thing, and it seems it didn't. That trust is shattered.
>
> How often does ext4fs do something like this? I don't know. I've used ext2/3/4fs for 25 years on maybe a hundred different machines, and never lost data (that wasn't user error, i.e. `rm -r *` in the wrong directory). This is with 25 years of consumer-grade, low-quality PATA/SATA and sometimes SCSI gear plugged into consumer-grade boxes cobbled from parts bought on Newegg or maybe refurbished from Dell. (Yeah, I also get to use huge high-end machines, but this is "for example".) If there was data loss during those decades, I'm not aware of it. Was there any? The laws of probability say "yes". Once upon a time, I managed to corrupt an ext2fs disk so that e2fsck could not fix it. Yes, *hit happens. But I've been using Ceph for barely a week, and it's 2024, so it should be mature, and I'm not thrilled that I hit a bug so early and so easily. WTF.
>
> I did read the paper https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf "File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution", which is an awesome paper, by the way; kudos to the authors. It does have one misrepresentation, though. It says, paraphrasing: "it typically takes ten years for a file system to mature, and we here at CephFS/BlueStore did it in only two (glow)". That paper was published in 2019 and it's now 2024, and off-the-shelf CephFS is clearly and blatantly buggy. Ceph is still awesome, but you cannot blame this bug on crappy hardware or a Frankenstein sysconfig. Either the corner cases work, or they don't, and the authors of other filesystems take a decade or two to get through all of those corner cases. It would seem that CephFS has not evaded this iron law.
>
> Sorry for ending on such a downer, but hey... I'm one of those perfectionists who wreck things for the good-enough people. So it goes.
>
> -- Linas
>
> --
> Patrick: Are they laughing at us?
> SpongeBob: No, Patrick, they are laughing next to us.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx