Hi Vojtech,
Quoting Vojtech Pavlik <vojtech@xxxxxxxx>:
> This matches my situation and behavior exactly. Including the fact that
> the backing device is a md raid.
> [...]
> I've sent an one-liner patch to the mailing list a few moments ago that
> fixes the issue.
I've added your patch to our kernel. Just for your reference, the
system is running openSUSE 13.1, with the kernel from
--- cut here ---
Information for package kernel-source:
--------------------------------------
Repository: conecenter
Name: kernel-source
Version: 3.18.8-5.1
Arch: noarch
Vendor: obs://build.opensuse.org/home:conecenter
Installed: Yes
Status: up-to-date
Installed Size: 471.5 MiB
Summary: The Linux Kernel Sources
Description:
Linux kernel sources with many fixes and improvements.
Source Timestamp: 2015-02-06 23:35:46 +0200
GIT Revision: ec2a744f14f988690583c04bd910145cd5a1f3c9
GIT Branch: stable
--- cut here ---
and the patches
bcache001.eml:Subject: [PATCH] bcache: [BUG] clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing device
bcache002.eml:Subject: [PATCH] bcache: fix a livelock in btree lock
bcache003.eml:Subject: [PATCH] bcache: unregister reboot notifier when bcache fails to register a block device
bcache004.eml:Subject: [PATCH] fix a leak in bch_cached_dev_run()
bcache005.eml:Subject: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.
I can confirm that running with writeback_percent set to zero now works
much more smoothly (or "at all", under certain circumstances).
PS: We're still facing random reboots (of unknown cause), which may
correlate with bcache's "amount dirty" being near the limit set by
writeback_percent.
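To check whether the reboots really correlate with the amount of dirty data, the two values could be sampled periodically and written to a log that is synced to disk after every sample. The following is only a minimal sketch: it assumes the backing device exposes the writeback_percent and dirty_data attributes in a sysfs directory such as /sys/block/bcache0/bcache, and the function names and log path are hypothetical.

```python
import os
import time


def read_attr(sysfs_dir, name):
    """Return the stripped contents of one sysfs attribute file."""
    with open(os.path.join(sysfs_dir, name)) as f:
        return f.read().strip()


def log_sample(sysfs_dir, log_path, now=None):
    """Append one timestamped sample to log_path and force it to disk."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    line = "%s writeback_percent=%s dirty_data=%s\n" % (
        stamp,
        read_attr(sysfs_dir, "writeback_percent"),
        read_attr(sysfs_dir, "dirty_data"),
    )
    with open(log_path, "a") as log:
        log.write(line)
        log.flush()
        # fsync so the last sample is not lost if the machine reboots
        os.fsync(log.fileno())
    return line
```

Calling log_sample("/sys/block/bcache0/bcache", "/var/log/bcache-dirty.log") every few seconds from a loop or a timer would leave the last known values on disk after an unexpected reboot. The log should live on a device that is not behind bcache itself.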
For a test, after a few hours running the latest patch, I switched
from writeback_percent==0 to writeback_percent==1, and had a full
kernel crash within an hour! Luckily, I still had a console open on
the machine, so I could for the first time see a hint (but not much
more) of what is going on:
--- cut here ---
Message from syslogd@san02 at Sep 7 14:56:15 ...
kernel:[74182.424659] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffffa001a815
Message from syslogd@san02 at Sep 7 14:56:15 ...
kernel:[74182.424659]
Message from syslogd@san02 at Sep 7 14:56:15 ...
kernel:[74182.474050] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
--- cut here ---
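The reported address ffffffffa001a815 lies just above the relocation range printed in the message, which on x86_64 usually means it belongs to a loadable module rather than to the core kernel image. A rough way to narrow it down is to compare it against the load addresses listed in /proc/modules (readable as root). The sketch below assumes the usual six-column format of that file and treats the size column as the extent of the module, which is only an approximation. The sample listing is made up, and load addresses from a later boot may differ from those at the time of the crash.

```python
def parse_proc_modules(text):
    """Parse /proc/modules text into (name, base_address, size) tuples."""
    modules = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip lines that do not have the expected columns
        modules.append((fields[0], int(fields[5], 16), int(fields[1])))
    return modules


def find_module(addr, modules):
    """Return (name, offset) for the module containing addr, or None."""
    for name, base, size in modules:
        if base <= addr < base + size:
            return name, addr - base
    return None


# Relocation range of the core kernel image, as printed in the panic message.
RELOC_START = 0xffffffff80000000
RELOC_END = 0xffffffff9fffffff

addr = 0xffffffffa001a815  # address reported by the stack protector
in_core_kernel = RELOC_START <= addr <= RELOC_END

# Hypothetical /proc/modules excerpt; names and addresses are invented.
sample = """\
example_mod 225280 3 - Live 0xffffffffa0008000
other_mod 86016 1 - Live 0xffffffffa0060000
"""
hit = find_module(addr, parse_proc_modules(sample))
print("in core kernel image:", in_core_kernel)
print("module and offset:", hit)
```

With a real listing from the affected machine, the returned offset could then be resolved to a function using the module's symbol table, assuming the module was loaded at the same address.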
Since there's no stack trace, this leaves much room for speculation. But
at least I now have an idea where the reboots (and two other "full
stops") might stem from: stack corruption. I have run
scripts/checkstack.pl on the bcache module and found no excessive
stack use, but checking for memset() and memcpy() in bcache's code
gave a number of hits - I'll have to have a look at them, one by one,
and hope to find my way around.
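The search for memset() and memcpy() call sites can be automated so that each hit comes with its file name and line number for the one-by-one review. This is only an illustrative sketch: the regular expression is a simple textual match (it will also flag calls inside comments or strings), the function names are hypothetical, and the source directory passed to scan_tree() is an assumption about where the bcache sources live in the kernel tree.

```python
import os
import re

# Matches a call to memset or memcpy, allowing whitespace before the parenthesis.
CALL_RE = re.compile(r"\b(memset|memcpy)\s*\(")


def find_calls(source, filename="<source>"):
    """Return (filename, line_number, stripped_line) for each matching line."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if CALL_RE.search(line):
            hits.append((filename, number, line.strip()))
    return hits


def scan_tree(root):
    """Collect memset/memcpy hits from all .c and .h files below root."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if name.endswith((".c", ".h")):
                path = os.path.join(dirpath, name)
                with open(path, errors="replace") as f:
                    hits.extend(find_calls(f.read(), path))
    return hits
```

For example, scan_tree("drivers/md/bcache") run from the top of the kernel source tree would list the candidate lines. Calls that write into on-stack buffers with a computed length are the ones most worth a close look.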
I'll give my servers at least two weeks to run with your patch and
writeback_percent==0 to see if we're hit by reboots with that code as
well. If not, I'll take that as an indicator that the implementation
of the "PID regulator" may need a closer look.
Kent, do you remember having fixed anything that might explain this
stack corruption behavior, in code later than what's included in
kernel 3.18.8?
Regards,
Jens