Re: CephFS empty files in a Frankenstein system

I haven’t gone through all the details since you seem to know you’ve done
some odd stuff, but the “.snap” issue is because you’ve run into a CephFS
feature which I recently discovered is embarrassingly under-documented:
https://docs.ceph.com/en/reef/dev/cephfs-snapshots

So ".snap" is a special fake directory used to take snapshots by running mkdir
inside it, and trying to do other things with it will result in errors.

You can expose it under a different name via client configuration:
"snapdirname" is a mount option for the kernel client, and "client_snapdir"
is the equivalent option for userspace/ceph-fuse.
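For example (off the top of my head; the monitor address, mount point, and
chosen name below are just placeholders):

    # kernel client: expose the snapshot dir as ".snapshots" instead of ".snap"
    mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,snapdirname=.snapshots

    # ceph-fuse: same idea via the client option
    ceph-fuse /mnt/cephfs --client_snapdir=.snapshots

Snapshot creation is still just mkdir inside that directory, whatever you
name it.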
-Greg

On Tue, Nov 26, 2024 at 5:12 PM Linas Vepstas <linasvepstas@xxxxxxxxx>
wrote:

> Don't laugh.  I am experimenting with Ceph in an enthusiast,
> small-office, home-office setting. Yes, this is not the conventional
> use case, but I think Ceph almost is, or almost could be, suited for this.
> Do I need to explain why? These kinds of people (i.e. me) already run
> RAID. And maybe CIFS/Samba or NFS. On half-a-dozen or more machines.
>
> And RAID is OK, I guess, as long as your mirrors are in sync. And if
> you have three disks in a mirror, do they vote as to who has the best
> data? Not mdraid, no. And fsck takes hours on a multi-terabyte system
> with old disks. There's no CRC on data, just metadata. Maybe there's
> silent corruption in the background. Who knows.
>
> And being admin for CIFS/NFS in a small network is not exactly a
> fulfilling experience. So, hey, you know what? Perhaps CephFS can
> replace these. Perhaps CephFS is better. Perhaps CephFS is ready for
> this. I mean, I like ext4fs, but only because btrfs hates me. I'm
> loyal. But I'm here now.
>
> Seriously, I think that, with just a little bit of polishing and
> automation, Ceph could be deployed in the small-office/home-office
> setting. Don't laugh. This could happen.
>
> Sadly, I am seeing a tiny bit of data corruption.
>
> I said don't laugh. I'm resource constrained, server and disk
> constrained, and time-unconstrained.  So I thought I'd bootstrap up
> from nothing. Manually. First, one server, with two OSD's from two
> partitions on two different disks. Works great! For the record, this
> should be no worse than RAID mirroring on one host, and so is 100%
> entirely acceptable in the home/small-office environment. I move (not
> copy, but move) a few hundred GB of data I can afford to lose. Of course
> there are "degraded+undersized" warnings, but that's hardly a
> surprise.
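>
> (For reference, a sketch of the kind of ceph-volume invocation used to put
> an OSD directly on a partition; the device names here are just examples:
>
>   ceph-volume lvm create --data /dev/sda2
>   ceph-volume lvm create --data /dev/sdb2
>
> ceph-volume wraps each partition in LVM, registers the OSD, and starts it.)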
>
> I add a second server, and a 3rd OSD.  The initial system is running
> Ceph "reef", because it's running Debian testing which has reef in it.
> The second system has ceph "pacific", because it's Debian stable, and
> that's what Debian stable currently packages. (Yes, pacific is past
> EOL. But it's on Debian stable, so...) The next version of Debian stable
> won't show up till next summer. I thought reef and pacific might not
> talk to one another, but they do, and happily so. Excellent! +1 ceph!
>
> The system starts rebalancing onto the third OSD.  After running all
> night, it gets to two-thirds or three-quarters clean, and maybe one-quarter
> degraded+undersized. Surprising at first, because I had size/min_size
> set to 3/2 and I had 3 OSD's, so... ??? Later I understood that CRUSH
> wants three hosts, but whatever. I experiment with changing
> size/min_size to 2/1 and the warnings go away! Going back to 3/2 and
> they reappear. I am somewhat unhappy: before switching to 2/1, I saw
> most PG's were on most OSD's. Good! Switching to 2/1 made many of them
> disappear (!) and switching back to 3/2 did not make them reappear.
> The OSD's seem mildly underpopulated, and seem to remain that way.
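>
> For reference, the knob-twiddling above was just pool-level settings, along
> these lines ("cephfs_data" standing in for whatever the data pool is called):
>
>   ceph osd pool set cephfs_data size 2
>   ceph osd pool set cephfs_data min_size 1
>   # ...and back again:
>   ceph osd pool set cephfs_data size 3
>   ceph osd pool set cephfs_data min_size 2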
>
> Huh. Who am I to say, but I would have liked to have seen one copy of
> *every* PG on *every* OSD, because that's what size=3 should mean,
> right? Even if there are only two servers? Or should I select a
> different CRUSH rule? (I have not played with this yet.) I would
> sleep better if all three disks had a copy, because that would be an
> actual improvement over bog-standard raid mirroring.
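>
> (If the answer turns out to be the CRUSH rule, I'm guessing something like
> the following would do it; the rule and pool names here are made up:
>
>   # replicate across OSDs rather than across hosts
>   ceph osd crush rule create-replicated rep-by-osd default osd
>   ceph osd pool set cephfs_data crush_rule rep-by-osd
>
> at the cost of giving up host-level failure isolation, which seems fair
> while there are fewer hosts than replicas.)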
>
> Anyway, I add a 4th OSD on the second server (so that there are 4
> total) and I add a third server with zero OSD's (so that ceph-mon can
> create a proper quorum.) And by now I have over a TB of data on there
> (that I can afford to lose).
>
> So that's the backstory. Here's the problem.
>
> Last night, I move (not copy, but move) a 33 GB directory onto CephFS.
> This morning I see:
>
> mv: cannot overwrite directory '/dest/.snap' with non-directory
> '/source/.snap'
>
> wtf? I cd to /dest/.snap and indeed it is a directory. And it's empty.
> The source .snap is a file, 736 bytes in size.  Directory timestamp
> and source-file timestamp are identical. I attempt a hand repair.  I
> cd to the parent dir, do rmdir .snap and rmdir says "no such
> directory". wtf? So then `ls -la` and indeed, there is no such
> directory. But I can still cd into it! Huh. I cannot rmdir it, I
> cannot cp or mv it, because "it doesn't exist". If I cd into it, I
> cannot create any files in it, I get "permission denied", even as
> root. So it's a directory that won't show in the listing of the parent
> dir. I can cd to it, but I cannot put anything in it.
>
> This, to me, says "CephFS bug".  Other culprits could have been: (a)
> bad disk, (b) corruption during copy, i.e. bad fs state during copy due
> to (b1) bad kernel (b2) bad SATA traffic (b3) corrupt kernel page
> table, buffer, block cache, etc.  (b4) bad data on the source disk.
> But all this is in the past: I can look at the source file: it's
> there, it's readable, it has valid contents (it's ASCII text; it
> looks just fine). Clearly some transient error perturbed things.
> However, I now have a CephFS with a directory that I cannot remove,
> and I cannot place anything into it, and I cannot create a file with
> the same name as the directory.  So that's not good.
>
> Sometime after I hit send on this email, ceph will automatically run
> scrub and/or deep scrub, and maybe this problem will fix itself. If
> there's something I should do to preserve this broken-ness for future
> debugging, tell me now before it's too late.
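>
> (Is the MDS forward scrub the relevant knob here? I.e. something along the
> lines of the following, where the fs name and path are placeholders:
>
>   ceph tell mds.cephfs:0 damage ls
>   ceph tell mds.cephfs:0 scrub start /the/parent/dir recursive
>
> I'll hold off on running anything until someone says whether that would
> destroy the evidence.)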
>
> While I'm on this topic: The failed `mv` means that the source was
> still there, and so I was able to run `diff -rq --no-dereference` on
> the original source and the copy. I discovered that there were 5 files
> that were borked. They all had the same timestamps as the original,
> but the content was all zeros (as reported by od). I found that I could
> repair these by hand-copying the original over them, and the result is
> fine.
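>
> Next time I'll checksum before deleting the source; something like this
> sketch (the paths are placeholders):
>
>   cd /source && find . -type f -print0 | xargs -0 sha256sum > /tmp/src.sha256
>   cd /dest && sha256sum --quiet -c /tmp/src.sha256
>
> so that anything printed is a file whose copy differs from the original.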
>
> Yes, this is a Frankenstein system. Yes, it has three mons, three mgrs,
> and three mds daemons running on three machines, but OSD's only on two of
> them. I'll add more OSD's maybe tomorrow.
>
> But meanwhile, I have a system I can lose sleep over. The five files
> that were corrupted were corrupted *silently*; I would never have
> known if not for the weird-dir bug allowing me to go do the diff and
> find the corruption. This was on a copy of 33 GB containing 412K
> files. I'd previously moved over a TB of files. Are any of these
> damaged? God knows. I was sloppy, I did not keep the originals, I did
> not do a diff. I trusted CephFS to do the right thing, and it seems it
> didn't. That trust is shattered.
>
> How often does ext4fs do something like this? I don't know. I've used
> ext2/3/4fs for 25 years on maybe a hundred different machines, and
> never lost data (that wasn't user error, i.e. rm -r * in the wrong
> directory). This is with 25 years of consumer-grade, low-quality
> pata/sata and sometimes scsi gear plugged into consumer-grade boxes
> cobbled from parts bought on newegg or maybe refurbished from Dell.
> (Yeah, I also get to use huge high-end machines, but this is "for
> example".) If there was data loss during these decades, I'm not aware of
> it. Was there any? Laws of probability say "yes". Once upon a time, I
> managed to corrupt an ext2fs disk so that e2fsck could not fix it.
> Yes, *hit happens.  But I've been using Ceph for barely a week, and
> it's 2024 so it should be mature, so I'm not thrilled that I hit a bug
> so early and so easily. WTF.
>
> I did read the paper
> https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf  "File Systems
> Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph
> Evolution" which is an awesome paper, by the way, kudos to the
> authors. It does have one misrepresentation, though. It says,
> paraphrasing: "it typically takes ten years for a file system to
> mature, and we here at CephFS/bluestore did it in only two (glow)."
> That paper was published in 2019 and it's now 2024, and off-the-shelf
> CephFS is clearly and blatantly buggy. Ceph is still awesome, but you
> cannot blame this bug on crappy hardware or a Frankenstein sysconfig.
> Either the corner cases work, or they don't, and the authors of other
> filesystems take a decade or two to get through all of those corner
> cases. It would seem that CephFS has not evaded this iron law.
>
> Sorry for ending on such a downer, but hey... I'm one of those
> perfectionists who wreck things for the good-enough people. So it
> goes.
>
> -- Linas
>
> --
> Patrick: Are they laughing at us?
> Sponge Bob: No, Patrick, they are laughing next to us.
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



