Re: Kernel BUG at dm-cache-policy-mq.c

On Tue, Mar 21 2017 at  9:02am -0400,
Stanislas Oger <stanislas.oger@xxxxxxxxx> wrote:

> Hi,
> 
> We currently encounter a critical issue on a Proxmox cluster we
> operate, which seems to be triggered by a bug in dm-cache ("kernel
> BUG at drivers/md/dm-cache-policy-mq.c:1079!", see syslog below).
> 
> 
> 1/ Context
> 
> The Proxmox cluster uses a 4.4 kernel; the VM storage is a DRBD9
> cluster on top of LVM with SSD caching. The underlying disks are on
> a MegaRAID hardware RAID.
> The problem started occurring after we installed a VM (a mail server)
> that performs many disk reads on many small files (~ 1 million),
> taking a read lock with flock on each read. With the VM fully running,
> the IO wait of the system is less than 1%.
> 
> 
> 2/ The problem
> 
> Randomly, without any warning signs, syslog reports a bug in
> dm-cache-policy-mq.c (see below). A few minutes later all write
> operations block indefinitely. A few minutes after the node stops
> performing write operations, the other DRBD9 nodes stop writing too.
> At this point the whole cluster is down. Reads can be done as usual,
> but write operations block indefinitely.
> 
> The only way we have found to recover from this situation is to
> perform a hard reboot of the failing node. As soon as the failing
> node is down, the other nodes resume normal activity. When the
> failing node is up again, DRBD9 resynchronizes the disks and the
> cluster resumes normal activity, as if nothing had happened.
> 
> The bug occurred with both 4.4.35 and 4.4.40 kernels, with a
> frequency of about once every 10 days.

How large is your cache? (size of slow and fast device?)

Have you tried the smq policy?  mq is no longer maintained; it has been
removed and made an alias of smq (see commit 9ed84698fdda ("dm cache:
make the 'mq' policy an alias for 'smq'")).
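
If it helps with answering both questions above, here is a rough sketch
that pulls the origin size, cache size and active policy name out of one
cache line of `dmsetup status` output. The field layout is assumed from
the kernel's device-mapper cache documentation (cache.txt), the function
name parse_cache_status is made up for illustration, and the SAMPLE line
is invented, not taken from a real system.

```python
SECTOR_BYTES = 512  # device-mapper reports sizes in 512-byte sectors

# Invented example line; real output comes from `dmsetup status <device>`.
SAMPLE = ("0 209715200 cache 8 1023/4096 128 16384/16384 "
          "1000 2000 3000 4000 5 6 7 1 writeback "
          "2 migration_threshold 2048 mq 0 rw")


def parse_cache_status(line):
    """Parse one dm-cache status line (field layout assumed from cache.txt)."""
    tokens = line.split()
    at = tokens.index("cache")            # target type; start and length precede it
    origin_sectors = int(tokens[at - 1])  # table length == size of the slow device
    fields = tokens[at + 1:]

    block_sectors = int(fields[2])        # cache block size, in sectors
    used_blocks, total_blocks = (int(n) for n in fields[3].split("/"))

    pos = 11                              # index of <#features>
    pos += 1 + int(fields[pos])           # skip the feature count and the features
    pos += 1 + int(fields[pos])           # skip the core arg count and the core args
    policy = fields[pos]                  # e.g. "mq" or "smq"

    return {
        "origin_bytes": origin_sectors * SECTOR_BYTES,
        "cache_bytes": total_blocks * block_sectors * SECTOR_BYTES,
        "used_blocks": used_blocks,
        "total_blocks": total_blocks,
        "policy": policy,
    }


if __name__ == "__main__":
    print(parse_cache_status(SAMPLE))
```

A line prefixed with the device name (as printed when `dmsetup status`
is run without arguments) should parse the same way, since the sketch
locates the fields relative to the "cache" target keyword.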

It should be noted that dm-cache is changing significantly in 4.12
(already staged in linux-next), see:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.12

The new smq code doesn't have the BUG_ON() in question.
