CephFS empty files in a Frankenstein system

Don't laugh.  I am experimenting with Ceph in an enthusiast,
small-office, home-office setting. Yes, this is not the conventional
use case, but I think Ceph almost is, or almost could be, suited for this.
Do I need to explain why? These kinds of people (i.e. me) already run
RAID. And maybe CIFS/Samba or NFS. On half-a-dozen or more machines.

And RAID is OK, I guess, as long as your mirrors are in sync. And if
you have three disks in a mirror, do they vote as to who has the best
data? Not mdraid, no. And fsck takes hours on a multi-terabyte system
with old disks. There's no CRC on data, just metadata. Maybe there's
silent corruption in the background. Who knows.

And being admin for CIFS/NFS in a small network is not exactly a
fulfilling experience. So, hey, you know what? Perhaps CephFS can
replace these. Perhaps CephFS is better. Perhaps CephFS is ready for
this. I mean, I like ext4fs, but only because btrfs hates me. I'm
loyal. But I'm here now.

Seriously, I think that, with just a little bit of polishing and
automation, Ceph could be deployed in the small-office/home-office
setting. Don't laugh. This could happen.

Sadly, I am seeing a tiny bit of data corruption.

I said don't laugh. I'm resource constrained, server and disk
constrained, and time-unconstrained.  So I thought I'd bootstrap up
from nothing. Manually. First, one server, with two OSD's from two
partitions on two different disks. Works great! For the record, this
should be no worse than RAID mirroring on one host, and so is 100%
entirely acceptable in the home/small-office environment. I move (not
copy, but move) a few hundred GB of data I can afford to lose. Of course
there are degraded+undersized warnings, but that's hardly a surprise.
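
For anyone who wants to reproduce this, the OSD-creation step is
nothing exotic; roughly the standard ceph-volume incantation, one per
partition (device names below are placeholders, not my actual layout):

    # create one OSD per data partition, on two different disks
    ceph-volume lvm create --data /dev/sdb2
    ceph-volume lvm create --data /dev/sdc2

    # sanity check: both OSDs show up under the one host
    ceph osd tree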

I add a second server, and a third OSD. The initial system is running
Ceph "reef", because it's running Debian testing, which ships reef.
The second system has Ceph "pacific", because it's Debian stable, and
pacific is what Debian stable currently packages. (Yes, pacific is past
EOL. But it's in Debian stable, so....) The next version of Debian stable
won't show up until next summer. I thought reef and pacific might not
talk to one another, but they do, and happily so. Excellent! +1 Ceph!
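
If anyone wants to check a mixed-release cluster of their own,
something like this shows the mix:

    # per-daemon release breakdown (reef vs. pacific in my case)
    ceph versions

    # feature bits advertised by daemons and connected clients
    ceph features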

The system starts rebalancing onto the third OSD. After running all
night, it gets to two-thirds or three-quarters clean, with maybe a
quarter degraded+undersized. Surprising at first, because I had
size/min_size set to 3/2 and I had three OSD's, so... ??? Later I
understood that the default CRUSH rule wants three *hosts*, but
whatever. I experiment with changing size/min_size to 2/1 and the
warnings go away! Going back to 3/2 makes them reappear. I am somewhat
unhappy: before switching to 2/1, I saw most PG's were on most OSD's.
Good! Switching to 2/1 made many of them disappear (!) and switching
back to 3/2 did not make them reappear. The OSD's seem mildly
underpopulated, and seem to remain that way.
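
For concreteness, the knob-twiddling above was nothing more exotic
than the usual pool settings; "cephfs_data" below is a stand-in for
whatever your CephFS data pool is called:

    # the experiment: drop to 2/1 ...
    ceph osd pool set cephfs_data size 2
    ceph osd pool set cephfs_data min_size 1

    # ... and back to 3/2
    ceph osd pool set cephfs_data size 3
    ceph osd pool set cephfs_data min_size 2

    # watch where the PGs actually land afterwards
    ceph osd df
    ceph pg stat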

Huh. Who am I to say, but I would have liked to see one copy of
*every* PG on *every* OSD, because that's what size=3 should mean,
right? Even if there are only two servers? Or should I select a
different CRUSH rule? (I have not played with this yet.) I would
sleep better if all three disks had a copy, because that would be an
actual improvement over bog-standard RAID mirroring.
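
What I have in mind, but have not actually tried, is a replicated
CRUSH rule whose failure domain is the OSD rather than the host, which
(if I read the docs right) would let size=3 be satisfied on two hosts.
Rule and pool names below are placeholders:

    # replicated rule that spreads copies across OSDs, not hosts
    ceph osd crush rule create-replicated replicated-osd default osd

    # point the CephFS pools at the new rule
    ceph osd pool set cephfs_data crush_rule replicated-osd
    ceph osd pool set cephfs_metadata crush_rule replicated-osd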

Anyway, I add a fourth OSD on the second server (so that there are
four total) and I add a third server with zero OSD's (so that ceph-mon
can form a proper quorum). And by now I have over a TB of data on there
(that I can afford to lose).
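
For completeness, the sanity check after each of these changes is
nothing more than:

    ceph -s              # overall health and PG summary
    ceph osd tree        # which OSDs live on which host
    ceph quorum_status   # are all three mons in quorum?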

So that's the backstory. Here's the problem.

Last night, I move (not copy, but move) a 33 GB directory onto CephFS.
This morning I see:

mv: cannot overwrite directory '/dest/.snap' with non-directory '/source/.snap'

wtf? I cd to /dest/.snap and indeed it is a directory. And it's empty.
The source .snap is a file, 736 bytes in size.  Directory timestamp
and source-file timestamp are identical. I attempt a hand repair.  I
cd to the parent dir, do rmdir .snap and rmdir says "no such
directory". wtf? So then `ls -la` and indeed, there is no such
directory. But I can still cd into it! Huh. I cannot rmdir it, I
cannot cp or mv it, because "it doesn't exist". If I cd into it, I
cannot create any files in it, I get "permission denied", even as
root. So it's a directory, it won't show in the listing of the parent
dir. I can cd to it, but I cannot put anything in it.
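
For the record, the only other client-side poking I can think of is
stat plus the CephFS virtual xattrs (assuming I have the attribute
names right; they are documented, but I have not leaned on them before):

    stat /dest/.snap

    # recursive entry and byte counts for the parent dir, as the MDS sees them
    getfattr -n ceph.dir.rentries /dest
    getfattr -n ceph.dir.rbytes /dest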

This, to me, says "CephFS bug". Other culprits could have been: (a) a
bad disk, or (b) corruption during the copy, i.e. bad fs state during
the copy due to (b1) a bad kernel, (b2) bad SATA traffic, (b3) a
corrupt kernel page table, buffer, block cache, etc., or (b4) bad data
on the source disk. But all this is in the past: I can look at the
source file: it's there, it's readable, it has valid contents. (It's
ASCII text; it looks just fine.) Clearly some transient error perturbed
things. However, I now have a CephFS with a directory that I cannot
remove, I cannot place anything into it, and I cannot create a file
with the same name as the directory. So that's not good.

Sometime after I hit send on this email, Ceph will automatically run a
scrub and/or deep scrub, and maybe this problem will fix itself. If
there's something I should do to preserve this brokenness for posterity
and future debugging, tell me now before it's too late.
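
My guess at what is worth capturing before any scrub touches things,
based on skimming the docs ("<fsname>" is whatever the filesystem is
called):

    ceph fs status
    ceph health detail

    # MDS-side damage table, if it has noticed anything
    ceph tell mds.<fsname>:0 damage ls

    # the eventual repair attempt -- deliberately NOT run yet:
    # ceph tell mds.<fsname>:0 scrub start / recursive,repair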

While I'm on this topic: the failed `mv` means that the source was
still there, and so I was able to run `diff -rq --no-dereference` on
the original source and the copy. I discovered that five files were
borked. They all had the same timestamps as the originals, but the
contents were all zeros (as reported by od). I found that I could
repair these by hand-copying the originals over them, and the result is
fine.
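
In case anyone wants to run the same check, the comparison amounts to
this (the paths are placeholders for my source and destination trees):

    # report files that differ between the preserved original and the CephFS copy
    diff -rq --no-dereference /srv/original /mnt/cephfs/copy

    # flag non-empty files on the CephFS side that contain nothing but zero bytes
    find /mnt/cephfs/copy -type f -size +0c -print0 |
    while IFS= read -r -d '' f; do
        od -An -tx1 "$f" | grep -q '[1-9a-f]' || echo "all zeros: $f"
    done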

Yes, this is a Frankenstein system. Yes, it has three mons, three mgrs
and three MDS daemons running on three machines, but OSD's on only two
of them. I'll add more OSD's maybe tomorrow.

But meanwhile, I have a system I can lose sleep over. The five files
that were corrupted were corrupted *silently*; I would never have
known if not for the weird-directory bug prompting me to go do the diff
and find the corruption. This was on a copy of 33 GB containing 412K
files. I'd previously moved over a TB of files. Are any of these
damaged? God knows. I was sloppy: I did not keep the originals, I did
not do a diff. I trusted CephFS to do the right thing, and it seems it
didn't. That trust is shattered.

How often does ext4fs do something like this? I don't know. I've used
ext2/3/4fs for 25 years on maybe a hundred different machines, and
never lost data (that wasn't user error, i.e. `rm -r *` in the wrong
directory). This is with 25 years of consumer-grade, low-quality
PATA/SATA and sometimes SCSI gear plugged into consumer-grade boxes
cobbled from parts bought on Newegg or maybe refurbished from Dell.
(Yeah, I also get to use huge high-end machines, but this is "for
example".) If there was data loss during those decades, I'm not aware
of it. Was there any? The laws of probability say "yes". Once upon a
time, I managed to corrupt an ext2fs disk so badly that e2fsck could
not fix it. Yes, *hit happens. But I've been using Ceph for barely a
week, and it's 2024, so Ceph should be mature by now, and I'm not
thrilled that I hit a bug so early and so easily. WTF.

I did read the paper
https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf "File Systems
Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph
Evolution", which is an awesome paper, by the way; kudos to the
authors. It does have one misrepresentation, though. It says,
paraphrasing: "it typically takes ten years for a file system to
mature, and we here at CephFS/BlueStore did it in only two (glow)."
That paper was published in 2019, it's now 2024, and off-the-shelf
CephFS is clearly and blatantly buggy. Ceph is still awesome, but you
cannot blame this bug on crappy hardware or a Frankenstein sysconfig.
Either the corner cases work or they don't, and the authors of other
filesystems take a decade or two to get through all of those corner
cases. It would seem that CephFS has not evaded this iron law.

Sorry for ending on such a downer, but hey... I'm one of those
perfectionists that wreck things for the good-enough people. So it
goes.

-- Linas

-- 
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.


