On 2017/7/12 上午1:42, Eric Wheeler wrote: > On Tue, 11 Jul 2017, Coly Li wrote: >> On 2017/7/11 上午5:46, Kai Krakow wrote: >>> Am Mon, 10 Jul 2017 19:18:28 +0800 >>> schrieb Coly Li <colyli@xxxxxxx>: >>> >>>> If a read bio to cache device gets failed, bcache will try to >>>> recovery it by forward the read bio to backing device. If backing >>>> device responses read request successfully then the bio contains data >>>> from backing device will be returned to uppper layer. >>>> >>>> The recovery effort in cached_dev_read_error() is not correct, and >>>> there is report that corrupted data may returned when a dirty cache >>>> device goes offline during reading I/O. >>>> >>>> For writeback cache mode, before dirty data are wrote back to backing >>>> device, data blocks on backing device are not updated and consistent. >>>> If a dirty cache device dropped and a read bio gets failed, bcache >>>> will return its stale version from backing device. This is mistaken >>>> behavior that applications don't expected, especially for data base >>>> workload. >>>> >>>> This patch fixes the issue by only permit recoverable I/O when cached >>>> device is in writethough mode, and s->recoverable is set. For other >>>> cache mode, recovery I/O failure by reading backing device does not >>>> make sense, bache just simply returns -EIO immediately. > > Since bcache keeps the dirty map in memory, can you allow the passthrough > to the backing device if the block is clean, but -EIO otherwise? Hi Eric, This is a nice idea, I also discussed with other people around me. A reason not to choose it is, for data base applications in clustering environment, it is not very important for dirty data lost on disconnected from one server, because the work load can be switched to other server and replay inside the cluster. But it is important to detect the disconnected cache device failure as early as possible, so the work load can be switched with less dirty data lost. Finally I choose simply returning an -EIO to I/O requester, to make application to know the cache not available earlier and not to introduce more inconsistent data on backing device. Coly > >>> >>> Shouldn't write-around mode be okay for the same reason, i.e. there's >>> no stale version on disk? >>> >>> So the patch should read "mode != CACHE_MODE_WRITEBACK" (not sure about >>> the constant), that would also match your textual description. >>> >> >> For write-around mode, there is only backing device as bdev target, it >> does not have such inconsistent data issue. >> >> And indeed, this is a try-best effort. I mean, if a cached device is >> changed from writeback mode to writethrough mode, then the cache device >> is offline with dirty data, data corruption will still be probably to >> happen. Anyway, this patch fixes most of cases when people do not change >> cache mode in runtime. >> >> Thanks. >> >> Coly >> >>> >>>> Reported-by: Arne Wolf <awolf@xxxxxxxxxx> >>>> Signed-off-by: Coly Li <colyli@xxxxxxx> >>>> --- >>>> drivers/md/bcache/request.c | 5 ++++- >>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c >>>> index 019b3df9f1c6..6edacac9b00d 100644 >>>> --- a/drivers/md/bcache/request.c >>>> +++ b/drivers/md/bcache/request.c >>>> @@ -702,8 +702,11 @@ static void cached_dev_read_error(struct closure >>>> *cl) { >>>> struct search *s = container_of(cl, struct search, cl); >>>> struct bio *bio = &s->bio.bio; >>>> + struct cached_dev *dc = container_of(s->d, struct >>>> cached_dev, disk); >>>> + unsigned mode = cache_mode(dc, NULL); >>>> >>>> - if (s->recoverable) { >>>> + if (s->recoverable && >>>> + (mode == CACHE_MODE_WRITETHROUGH)) { >>>> /* Retry from the backing device: */ >>>> trace_bcache_read_retry(s->orig_bio); >>>> >>> >>> >>> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in >> the body of a message to majordomo@xxxxxxxxxxxxxxx >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-bcache" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html