Re: Bcache stuck at writeback of a key, consuming 100% CPU, not possible to detach

Vojtech Pavlik <vojtech@xxxxxxxx> · Mon, 7 Sep 2015 17:52:17 +0200

On Mon, Sep 07, 2015 at 05:13:18PM +0200, Jens-U. Mozdzen wrote:

> and the patches
> 
> bcache001.eml:Subject: [PATCH] bcache: [BUG] clear
> BCACHE_DEV_UNLINK_DONE flag when attaching a backing device
> bcache002.eml:Subject: [PATCH] bcache: fix a livelock in btree lock
> bcache003.eml:Subject: [PATCH] bcache: unregister reboot notifier
> when bcache fails to register a block device
> bcache004.eml:Subject: [PATCH] fix a leak in bch_cached_dev_run()
> bcache005.eml:Subject: [PATCH] bcache: Fix writeback_thread never
> writing back incomplete stripes.
> 
> I can confirm that running with writeback_percent to zero now works
> much smoother (or "at all", for certain circumstances).

I'm glad to hear that.

> >>PS: We're still facing random reboots (of unknown cause), which may
> >>correlate with bcache's "amount dirty" being near the limit set by
> >>writeback_percent.
> 
> For a test, after a few hours running the latest patch, I switched
> from writeback_percent==0 to writeback_percent==1, and had a full
> kernel crash within an hour! Luckily, I still had a console open on
> the machine, so I could for the first time see a hint (but not much
> more) of what is going on:

I'm running the openSUSE most recent stable kernel, available here:

	http://download.opensuse.org/repositories/Kernel:/stable/standard/

It's currently at 4.2.0 and contains all of the above patches. I've
seen crashes in __find_stripe a couple times a few months apart on older
kernels, but these aren't likely related to bcache. Similar to this:

	https://bugzilla.kernel.org/show_bug.cgi?id=100321

But except for these, the system has been running stable (at
writeback_percent=40 the last few months), so I would bet on a different
source of your crashes than bcache.

> --- cut here ---
> Message from syslogd@san02 at Sep  7 14:56:15 ...
>  kernel:[74182.424659] Kernel panic - not syncing: stack-protector:
> Kernel stack is corrupted in: ffffffffa001a815
> 
> Message from syslogd@san02 at Sep  7 14:56:15 ...
>  kernel:[74182.424659]
> 
> Message from syslogd@san02 at Sep  7 14:56:15 ...
>  kernel:[74182.474050] Kernel Offset: 0x0 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
> --- cut here ---

Maybe you could set up a serial console? That way you'd be able to catch
all the kernel messages.

> Since there's no stack trace, this lets much room for speculation.
> But at least I now have an idea where the reboots (and two other
> "full stops") might stem from: stack corruption. I have run
> scripts/checkstack.pl on the bcache module and found no excessive
> stack use, but checking for memset() and memcpy() in bcache's code
> gave a number of hits - I'll have to have a look at them, one by
> one, and hope to find my way around.
> 
> I'll give my servers at least two weeks to run with your patch and
> writeback_percent==0 to see if we're hit by reboots with that code
> as well. If not, I'll take that as an indicator that the
> implementation of the "PID regulator" may need a closer look.
>
> Kent, do you remember having fixed anything that might explain this
> stack corruption behavior, in code later than what's included in
> kernel 3.18.8?

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html