Re: Bcache stuck at writeback of a key, consuming 100% CPU, not possible to detach

Hi Vojtech,

Quoting Vojtech Pavlik <vojtech@xxxxxxxx>:
This matches my situation and behavior exactly. Including the fact that
the backing device is a md raid.
[...]
I've sent an one-liner patch to the mailing list a few moments ago that
fixes the issue.

I've added your patch to our kernel. For reference, the system is running openSUSE 13.1 with the kernel from

--- cut here ---
Information for package kernel-source:
--------------------------------------
Repository: conecenter
Name: kernel-source
Version: 3.18.8-5.1
Arch: noarch
Vendor: obs://build.opensuse.org/home:conecenter
Installed: Yes
Status: up-to-date
Installed Size: 471.5 MiB
Summary: The Linux Kernel Sources
Description:
Linux kernel sources with many fixes and improvements.


Source Timestamp: 2015-02-06 23:35:46 +0200
GIT Revision: ec2a744f14f988690583c04bd910145cd5a1f3c9
GIT Branch: stable
--- cut here ---

and the patches

bcache001.eml:Subject: [PATCH] bcache: [BUG] clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing device
bcache002.eml:Subject: [PATCH] bcache: fix a livelock in btree lock
bcache003.eml:Subject: [PATCH] bcache: unregister reboot notifier when bcache fails to register a block device
bcache004.eml:Subject: [PATCH] fix a leak in bch_cached_dev_run()
bcache005.eml:Subject: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.

I can confirm that running with writeback_percent set to zero now works much more smoothly (or "at all", under certain circumstances).

PS: We're still facing random reboots (of unknown cause), which may
correlate with bcache's "amount dirty" being near the limit set by
writeback_percent.
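
To check whether the reboots really line up with the dirty amount approaching the writeback_percent limit, both values could be sampled periodically and logged to another host. A minimal Python sketch, assuming the device is called bcache0 (a placeholder) and that the attributes are exposed under /sys/block/<dev>/bcache/ as described in the bcache documentation; the function names are made up for illustration:

```python
import os
import time


def read_bcache_attr(name, device="bcache0", sysfs_root="/sys/block"):
    """Return the stripped contents of one bcache sysfs attribute.

    The device name "bcache0" is an assumption; adjust it to the local setup.
    """
    path = os.path.join(sysfs_root, device, "bcache", name)
    with open(path) as f:
        return f.read().strip()


def sample(device="bcache0", sysfs_root="/sys/block"):
    """Take one timestamped sample of the writeback-related attributes."""
    return {
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "writeback_percent": read_bcache_attr("writeback_percent", device, sysfs_root),
        "dirty_data": read_bcache_attr("dirty_data", device, sysfs_root),
    }


def format_sample(s):
    """Render a sample as a single log line."""
    return "{time} writeback_percent={writeback_percent} dirty_data={dirty_data}".format(**s)


def log_samples(count, interval=10.0, device="bcache0", sysfs_root="/sys/block"):
    """Print `count` samples, one every `interval` seconds."""
    for i in range(count):
        print(format_sample(sample(device, sysfs_root)), flush=True)
        if i + 1 < count:
            time.sleep(interval)
```

Calling log_samples(360) would cover roughly one hour; piping the output over ssh to a second machine keeps the last line readable after a crash.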

As a test, after a few hours of running the latest patch, I switched from writeback_percent==0 to writeback_percent==1 and had a full kernel crash within an hour! Luckily, I still had a console open on the machine, so for the first time I could see a hint (but not much more) of what is going on:

--- cut here ---
Message from syslogd@san02 at Sep  7 14:56:15 ...
kernel:[74182.424659] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffffa001a815

Message from syslogd@san02 at Sep  7 14:56:15 ...
 kernel:[74182.424659]

Message from syslogd@san02 at Sep  7 14:56:15 ...
kernel:[74182.474050] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
--- cut here ---
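
The address ffffffffa001a815 lies just above the relocation range printed in the panic, which on x86_64 is normally where modules are mapped, so it may belong to a module rather than the core kernel. If a copy of /proc/kallsyms (read as root) is available from a boot where the modules were loaded at the same addresses, which is not guaranteed, the address could be resolved to the nearest symbol. A minimal Python sketch; the function names are made up for illustration:

```python
import bisect


def parse_kallsyms(text):
    """Parse /proc/kallsyms-style lines into a sorted list of (addr, name, module)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        addr = int(parts[0], 16)
        # Module symbols carry a fourth field such as "[bcache]".
        module = parts[3].strip("[]") if len(parts) > 3 else ""
        symbols.append((addr, parts[2], module))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Return (name, module, offset) of the nearest symbol at or below addr, or None."""
    addrs = [s[0] for s in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    base, name, module = symbols[i]
    return name, module, addr - base
```

Usage would be along the lines of resolve(parse_kallsyms(open("/proc/kallsyms").read()), 0xffffffffa001a815). The result only names the closest preceding symbol; it does not prove that the address falls inside that function.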

Since there's no stack trace, this leaves much room for speculation. But at least I now have an idea where the reboots (and two other "full stops") might stem from: stack corruption. I have run scripts/checkstack.pl on the bcache module and found no excessive stack use, but checking for memset() and memcpy() in bcache's code gave a number of hits. I'll have to look at them one by one and hope to find my way around.
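
For the record, the memset()/memcpy() search can be reproduced with something like the following. This is only a sketch of a plain text search, assuming the bcache sources live in drivers/md/bcache of the kernel tree; it is not what I actually ran, and it says nothing about whether a given call is safe:

```python
import os
import re

# Match direct calls only; names such as my_memcpy( are skipped because
# "\b" does not match between "_" and "m".
CALL_RE = re.compile(r"\b(memset|memcpy)\s*\(")


def find_calls(src_dir):
    """Return (path, line number, stripped line) for each memset/memcpy call site."""
    hits = []
    for dirpath, _dirs, files in os.walk(src_dir):
        for fname in sorted(files):
            if not fname.endswith((".c", ".h")):
                continue
            path = os.path.join(dirpath, fname)
            with open(path, errors="replace") as f:
                for lineno, line in enumerate(f, 1):
                    if CALL_RE.search(line):
                        hits.append((path, lineno, line.strip()))
    return hits
```

Running find_calls("drivers/md/bcache") from the top of the kernel tree yields the list of call sites to review by hand, with attention to destinations that are local (on-stack) buffers.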

I'll give my servers at least two weeks to run with your patch and writeback_percent==0 to see if we're hit by reboots with that code as well. If not, I'll take that as an indicator that the implementation of the "PID regulator" may need a closer look.

Kent, do you remember having fixed anything that might explain this stack corruption behavior, in code later than what's included in kernel 3.18.8?

Regards,
Jens
