Hi all,
I've noticed similar oddities with our servers - not quite the same,
but close enough that I won't open a new thread:
We're running kernel 3.18.8 on server machines delivering SAN and NAS
resources. Back-end storage is an MD-RAID6 (7 WD Red 1TB 2.5"), cache is
on an MD-RAID1 (2 SSD TOSHIBA PX02SMF020 200GB). Because we live-migrated
from 128GB SSDs to the Toshibas, the cache size is still 128 GB.
On one of the servers, I noticed excessive I/O load reported by our
monitoring tools. Having read this thread, I tried to get the amount
of dirty data down (cache_mode set to "writeback", writeback_percent
to 0, writeback_rate to 10000 and then monitoring
writeback_rate_debug), but unlike on the other server, the amount of
dirty data would not go below 186M. Unlike in the original report,
bcache_writeback wasn't at 100% CPU; its usage varied, but it was
always at the top of all running processes. I/O wait was unusually
high compared to the amount of data written.
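For anyone wanting to repeat the steps above, here is a minimal sketch in
Python. It assumes the device is bcache0, so that the attributes live under
/sys/block/bcache0/bcache (adjust for your setup), and that
writeback_rate_debug contains a line of the form "dirty: 186.0M". The helper
names are made up for illustration, and writing the tunables needs root.

```python
import re
from pathlib import Path

# Assumption: attributes of the first bcache device; adjust as needed.
BCACHE_DIR = Path("/sys/block/bcache0/bcache")

# Binary multipliers for the size suffixes expected in the debug output.
_UNITS = {"": 1, "k": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Turn a human-readable size such as '186.0M' into bytes."""
    match = re.fullmatch(r"\s*([0-9]+(?:\.[0-9]+)?)\s*([kMGT]?)\s*", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def apply_flush_settings(bcache_dir=BCACHE_DIR, rate=10000):
    """Write the three tunables mentioned above (values from this mail)."""
    bcache_dir = Path(bcache_dir)
    (bcache_dir / "cache_mode").write_text("writeback\n")
    (bcache_dir / "writeback_percent").write_text("0\n")
    (bcache_dir / "writeback_rate").write_text(f"{rate}\n")


def read_dirty_bytes(bcache_dir=BCACHE_DIR):
    """Return the value of the 'dirty:' line in writeback_rate_debug."""
    text = (Path(bcache_dir) / "writeback_rate_debug").read_text()
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "dirty":
            return parse_size(value)
    raise ValueError("no 'dirty:' line found")
```

Calling apply_flush_settings() once and then read_dirty_bytes() every few
seconds should show whether the dirty amount keeps falling or gets stuck at
some lower limit, as it did here.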
I let the system rest overnight, only to find the next day that the
amount would not go below 197M, so that "bad spot" had moved. The load
on this server had increased - looking at the stats, it seemed like
writeback was trying to write data all the time but, for whatever
reason, failing (which would match the lower limit on dirty data).
Fearing the worst, I set the cache mode to writearound to disable
further caching (the amount of dirty data still wouldn't drop below
that lower limit), stopped the clients for this server and rebooted.
Luckily, the server came up without a problem, and *I could now get
the amount of dirty data down to zero*. I switched back to writeback,
with writeback_percent set to 0 and a fixed writeback_rate.
So in our case, it looks like something got borked in bcache's run-time
state, rather than on disk (read: SSD cache content).
Regards,
Jens
PS: We're still facing random reboots (of unknown cause), which may
correlate with bcache's "amount dirty" being near the limit set by
writeback_percent. I'm currently trying to work around this by
running with writeback_percent set to zero and using a writeback_rate
that lets the cache clean up over the day... so far, so good, but it's
too early to tell for sure. Since switching to the new SSDs, the
reboot rate has gone down to about once a week, and I made this change
only two days ago.