I was thinking a little bit about OSD scrub and I wanted to share my thoughts with you guys. OSD scrub is the process of reconciling the files stored on one instance of cosd with those stored on another. Each file represents an object in the ObjectStore.

So why would files differ between cosds? As I see it, there are three cases:

Case #1: The hard disk that the FileStore is reading from could be dying. In my experience, dying hard disks tend to experience long delays when reading from the filesystem. Occasionally you will be unable to read some files at all, and you'll get EIO instead. When a hard disk is dying, all you want to do is get your data off it as soon as possible. You don't want to bother trying to "fix" the files on the disk. That disk is toast.

Case #2: The filesystem could be slightly corrupt. This can happen, for example, if the computer lost power without properly unmounting its filesystems. The usual remedy in these cases is to run fsck. ext3 and ext4 have working fsck commands that can usually fix what ails your filesystem. Unfortunately, the fixed filesystem may lack some of the data you used to have; I have often seen files truncated to 0 length after running ext3's fsck. Although btrfs does not have an fsck that can fix errors, it does have data checksumming, which means the chances of a random disk error going undetected are basically zero.

Case #3: There could be a bug in btrfs, ext4, or Ceph that has caused the data to diverge between different cosd nodes.

So the question is really: what should our OSD scrub code do?

In case #1, I don't really want to repair the disk. I just want to become aware of the problem, so I can send someone out to replace the failing hard disk.

In case #3, I probably don't want to repair things either. If I am testing code, a repair mechanism that hides Ceph bugs just makes it harder to write a good test.
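To make the "detect, don't repair" idea concrete, here's a rough sketch of what a report-only scrub pass could look like. This is purely illustrative, not the actual cosd scrub code: the function names, the flat object-directory layout, and the use of sha1 digests for comparison are all my own assumptions.

```python
# Hypothetical sketch of a report-only scrub pass (not real cosd code).
# The layout (one object per file in a directory) and digest choice
# are illustrative assumptions.
import errno
import hashlib
import os

def scrub_object_dir(path):
    """Hash every object file under `path`, recording EIO separately.

    Returns (digests, io_errors): a map of filename -> sha1 hexdigest,
    and a list of files that returned EIO -- a likely sign of a dying
    disk (case #1), i.e. something to page the sysadmin about, not
    something to "repair" in place.
    """
    digests = {}
    io_errors = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        try:
            with open(full, 'rb') as f:
                digests[name] = hashlib.sha1(f.read()).hexdigest()
        except OSError as e:
            if e.errno == errno.EIO:
                io_errors.append(name)  # dying disk: report, don't fix
            else:
                raise
    return digests, io_errors

def compare_replicas(local, remote):
    """Diff two digest maps from different cosds; report, don't fix.

    A mismatch or missing object here could be case #2 (corrupt fs)
    or case #3 (a software bug) -- either way, just note it.
    """
    problems = []
    for name in sorted(set(local) | set(remote)):
        if local.get(name) != remote.get(name):
            problems.append(name)
    return problems
```

The point of the sketch is only that the scrub pass never writes anything: it hands the sysadmin a list of suspect objects (and a separate list of EIO victims) and stops there.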
If I am in production, any mechanism that causes bugs to generate a bunch of I/O will probably lead to unhappiness. If there's a btrfs bug on your node, doing a bunch of I/O on that btrfs partition will probably just make things worse.

So we're left with case #2, where the filesystem is slightly corrupt. The question you really have to ask yourself here is: which is faster, running fsck on the corrupt filesystem, or reformatting and letting Ceph synchronize everything back over? Also keep in mind that, again, btrfs doesn't have fsck. I suspect that its checksum failures will surface as EIO, which looks a lot like case #1 to the hapless sysadmin.

Here's what I think:

* Repairing FileStores might be a lot less useful than we originally thought, and notifying the sysadmin about file errors might be a lot more useful. We should have a configuration option for scrub that makes it just note errors, not initiate repair.

* If we absolutely must repair a FileStore, but only have 2x replication, we might consider the FileStore whose underlying FS has been fsck'ed most recently to be the non-authoritative one.

What do you think?

cheers,
Colin