Re: Formatting of backing device

Alex Elsayed <eternaleye@xxxxxxxxx> · Thu, 16 Feb 2012 12:50:58 -0800

On Thu, Feb 16, 2012 at 12:33 PM, Piergiorgio Sartor
<piergiorgio.sartor@xxxxxxxx> wrote:
> Hi Alex,
>
>> Oh sure, the cache is persistent. But device discovery order is undefined, and
>> if the backing device is no different from one without a cache and writeback
>> caching is enabled the kernel has no *possible* way to know that a caching
>> device is needed or even exists. So it mounts it, but it doesn't have any of the
>> data in the writeback cache meaning it thinks the filesystem is corrupted.
>> Depending on the filesystem and exactly what is missing, it may run some
>> in-kernel recovery code that alters the disk. You just lost your data.
>
> nonono, I believe I wrote that the kernel
> should *first* look for caching devices
> and later for the others...
>
> The formatting thing is, clearly, a much
> standard approach, for the current kernel
> architecture, but nothing forbids to have
> a hierarchical search of devices.
> This could be done, for example, by assigning
> different classes to each device type, to
> be scanned in a specific order.
>
> In this scope (not bcache, but device discovery)
> it is already a problem a layered software RAID
> with metadata 1.0 together with 1.2 (or 1.1).
> Where the first lies at the end and the second
> at the beginning of the HDDs, making it difficult
> (but not impossible) to find out which is the
> outer and which is the inner one.

The difference is that for MD devices, both types
of metadata are on the same block device. You're
prioritizing which *type of metadata* is checked
for first in that case. For bcache, you'd have to
scan /dev/sdz before /dev/sda if sdz is the cache
and sda is the backing device. Now consider a
few things:

1.) SCSI/SATA devices may be probed in parallel

2.) udev gets events when each device is probed,
*not* after all devices have been probed

3.) The cache device may not even be attached
to the system at the time

4.) Even in the MD case, there is still *some*
change to the backing device: some sort of data
there that says "hey, there's more." A totally
unchanged backing device won't do that. Even if
the extra metadata doesn't invalidate the other
metadata, it still tells the kernel that what it
found is not enough. Think of it as invalidating
it at the logical rather than the physical level.

3 and 4 are the really critical ones. If the cable
that connects the SSD to the computer is flaky,
and it never gets probed, and there is *no*
metadata on the backing device, there is
*exactly* zero information available to the kernel
to inform it that a caching device ever existed at all.
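
To make the contrast concrete, here is a rough
sketch in Python. It is not kernel or udev code;
the device dicts, the "bcache_superblock" field and
handle_probe() are all made up for illustration.
It only models the decision available at the moment
a single device is probed.

```python
# Hypothetical model, not kernel or udev code: each device is a dict
# holding whatever metadata a probe could read from it.

def handle_probe(device, caches_seen):
    """Decide what to do with one device at the moment it is probed."""
    kind = device.get("bcache_superblock")  # None if the device is unformatted
    if kind == "cache":
        caches_seen.add(device["name"])
        return "register cache"
    if kind == "backing":
        # The superblock says a cache is required, so wait for it.
        if device["cache"] in caches_seen:
            return "attach to cache"
        return "hold until cache appears"
    # No bcache metadata: indistinguishable from an ordinary disk.
    return "mount as plain filesystem"


formatted = {"name": "sda", "bcache_superblock": "backing", "cache": "sdz"}
unformatted = {"name": "sda"}

# The cache (sdz) has not been probed yet, or never will be.
print(handle_probe(formatted, set()))    # hold until cache appears
print(handle_probe(unformatted, set()))  # mount as plain filesystem
```

With the formatted backing device the probe can
refuse to expose the filesystem. With the
unformatted one there is nothing to branch on.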

Also, you say that the cache must be scanned
before the backing device - but how do you know
it's a cache or a backing device until you've probed it?
You could delay sending any uevents until all
devices are probed, except there are some devices
that take 30-second timeouts and fail, or iSCSI, or
devices that get plugged in at runtime, or...

And since you can't do that, you have a
chicken-and-egg problem. You can't probe the backing
device before the cache, but you don't know which
is the cache until you probe it. And there may be
more than one of each. You can have one cache
and 200 backing devices, in theory. Want to take
the odds that the cache gets probed first at random?
Because the kernel doesn't have enough information
for it to be anything other than random.
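
For a feel of those odds, a quick Python
calculation. It assumes every probe order is
equally likely, which is a simplification of mine
and not a statement about how any real bus behaves;
cache_first_odds() is a made-up helper.

```python
from fractions import Fraction
from math import comb


def cache_first_odds(n_caches, n_backing):
    """Chance that every cache is probed before any backing device,
    assuming all probe orders are equally likely (an assumption)."""
    # Only the relative order of caches vs. backing devices matters, and
    # exactly one of the C(n, k) possible arrangements puts all caches first.
    return Fraction(1, comb(n_caches + n_backing, n_caches))


print(cache_first_odds(1, 200))  # 1/201
```

So with one cache and 200 backing devices, the
ordering works out by luck about once in 201 boots.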