On 17/11/17 13:22, Coly Li wrote:
On 17/11/2017 8:57 PM, Eddie Chapman wrote:
On 17/11/17 10:20, Rui Hua wrote:
Hi, Stefan
2017-11-17 16:28 GMT+08:00 Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx>:
I'm getting the same xfs error message under high load. Does this
patch fix it?
Did you apply the patch "bcache: only permit to recovery read error
when cache device is clean"?
If you did, this patch may fix it. You should also check
/sys/fs/bcache/XXX/internal/cache_read_races in your environment;
it should be non-zero when you get that error message.
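For anyone with many cache sets, the check above can be scripted. This
is only a rough sketch, not part of any patch: it assumes the counters
live at /sys/fs/bcache/<set>/internal/cache_read_races as described
above, and the function name read_race_counts and its sysfs_root
parameter are made up here for illustration.

```python
from pathlib import Path


def read_race_counts(sysfs_root="/sys/fs/bcache"):
    """Map each cache set directory name to its cache_read_races value.

    sysfs_root is a hypothetical parameter so the function can be
    pointed at a test directory instead of the real sysfs tree.
    """
    counts = {}
    pattern = "*/internal/cache_read_races"
    for counter in sorted(Path(sysfs_root).glob(pattern)):
        try:
            value = int(counter.read_text().strip())
        except (OSError, ValueError):
            # Skip counters that are unreadable or not a plain integer.
            continue
        # counter is <root>/<set>/internal/cache_read_races
        counts[counter.parent.parent.name] = value
    return counts


if __name__ == "__main__":
    for name, races in read_race_counts().items():
        note = "  <-- non-zero" if races else ""
        print(f"{name}: cache_read_races={races}{note}")
```

A non-zero value next to a cache set that also showed the error in the
upper layer would match what is described above; a zero value would
suggest looking elsewhere.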
Hi all,
I have 3 servers running a very recent 4.9 stable release, with several
recent bcache patches cherry picked, including V4 of "bcache: only
permit to recovery read error when cache device is clean".
In the 3 weeks since I started using these cherry-picks I've
experienced a very small number of isolated read errors in the layer
above bcache, on all 3 servers.
On one of the servers, 2 out of the 6 bcache resources have a value of 1
in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on these same
2 bcache resources that a read error has occurred in the upper layer.
The other 4 bcache resources have 0 in cache_read_races and I haven't
had any read errors on the layers above them.
On another server, I have 1 bcache resource out of 10 with a value of 5
in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on that
bcache resource that a read error occurred on one occasion. The other 9
bcache resources have 0 in cache_read_races, and no read errors have
occurred on the layers above any of them.
On the 3rd server where some read errors occurred, I cannot verify if
there were positive values in cache_read_races as I moved the data from
there onto other storage, and shut down the bcache resources where the
errors occurred.
If I can provide any other info which might help with this issue, please
let me know.
Hi Eddie,
This is very informative, thank you so much :-)
Coly Li
Hi Coly,
You are most welcome. Another interesting piece of information, though
it may be unrelated or a coincidence: for the bcache resources where
the errors occurred, the underlying backing device was a RAID adapter
that is quite a lot slower than the (different) underlying physical
storage on the other bcache resources that do not have read races. Up
to now I had suspected a driver issue with this RAID adapter as the
cause of the read errors, so I started gradually retiring the adapter
on these servers over the last 3 weeks. Anyway, in light of this issue
coming up here, I'm wondering whether this is significant, possibly
suggesting that read races are more likely to occur if the backing
storage is quite slow. Or maybe not.
Eddie