Re: [PATCH] bcache: recover data from backing device when read request hit clean

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 23/11/2017 11:09 PM, Eddie Chapman wrote:
> On 22/11/17 05:13, Michael Lyle wrote:
>> Reviewed-by: Michael Lyle <mlyle@xxxxxxxx>
> 
> <snip>
> 
> I have a little dilemma now related to this patch and I wonder if I
> could ask you guys for your opinion?
> 

Hi Eddie,

As author of  "only permit to recovery read error" patch, let me try to
response your question.

> I have been hit and am being hit by the bug this patch fixes every few
> days at the moment, on writeback bcache resources. When it hits it
> causes a failure in the layer above, and this leads to other problems
> that can and has led to loss of data (not in bcache, just that the read
> failure causes a chain of events in upper layer, separate issue, won't
> clutter this mail with that). So, I am very keen to resolve this
> urgently in my environment.
> 
> I am running stable 4.9, and I have cherry picked several recent bcache
> patches, including "bcache: only permit to recovery read error
> when cache device is clean", which I understand causes this issue.
> 
> I have a dilemma now whether I should apply the new patch in this
> thread, which should fix the issue, or revert "only permit to recovery
> read error" patch.
> 

I suggest you to take one of the 2 actions,
- apply "bcache: recover data from backing device when read request hit
clean" patch, or
- revert "only permit to recovery read error" patch, and add the
modification as "bcache: recover data from backing device when read
request hit clean" does.

> The dilemma is because I don't like the idea of reverting the "only
> permit to recovery read error" patch since it appears to fix an
> important problem.  Or am I overestimating the problem (maybe that's why
> the original patch has not been backported to stable kernels, maybe the
> original issue quite rare)?
> 

"only permit to recovery read error" patch fixes a real data corruption
issue with broken cache device (SSD), which was reported from production
environment. If you don't have all these 2 patches, your data may be
corrupted in silence when SSD is broken.

> But I am also wary of applying the new patch as Michael only just added
> to his tree so hasn't had much baking time.
> 

The new one "bcache: recover data from backing device when read request
hit clean" can be treated as an improvement of my patch. Because when I
wrote that patch, I didn't notice metadata and data failure on cache
device were in different code path, I have to only check ds->has_dirty.
Rui points out the metadata failure is in another code pathm and buggy
too. He posts another patch to fix it, and you may find it in
linux-bcache mailing list.

When I wrote the "only permit to recovery read error" patch, I didn't
notice the read race existed neither, until Junhui and Rui noticed me.
The read race only happens on clean data (because its bucket is clean
and re-allocated to different write request), so the new patch permits
data recovery from backing device if a read failed on clean data.

The result for this new fix is, less I/O error may returns to upper
layer code, especially the bogus I/O error generated by btree read race.

> I wondered what is the opinion of you guys which of these two I should
> choose?

If you only do patching, you need to have both of them. If you only want
one patch, you may rebase this patch to 4.9 kernel and ignore the first one.

> 
> (BTW, please don't say "just don't cherry pick onto 4.9". I am aware of
> the risks of cherry picking, as discussed recently on this list. I am
> happy with the level of risk and I am being very careful in what I pick.
> In this case I have not been "burnt" by cherry picking since I would
> have hit the same issue if I would have just upgraded to 4.14 which
> contains the original "only permit to recovery read error" patch.)
> 
> ( N.B. I understand that any opinion/advice I receive is just that. I
> take complete responsibility for what I decide :-) )

One more comment, when you apply both patches, I/O failure on broken
cache device (SSD) is still risky. As Rui points out in his patch,
"The read request might meet error when searching the btree, but the
error was not handled in cache_lookup(), and this kind of metadata
failure will not go into cached_dev_read_error(), finally, the upper
layer will receive bi_status=0."

Rui's patch "bcache: return IOERR to upper layer when read request meet
metadata error" is also added inti Mike's tree, if you may consider
testing it, that will be very helpful to us.

Thank you for asking great question :-)

-- 
Coly Li
--
To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Linux ARM Kernel]     [Linux Filesystem Development]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux