On Thu 16-06-16 16:42:58, Nikola Pajkovsky wrote:
> Jan Kara <jack@xxxxxxx> writes:
>
> > On Fri 10-06-16 07:52:56, Nikola Pajkovsky wrote:
> >> Jan Kara <jack@xxxxxxx> writes:
> >> > On Thu 09-06-16 09:23:29, Nikola Pajkovsky wrote:
> >> >> Holger Hoffstätte <holger@xxxxxxxxxxxxxxxxxxxxxx> writes:
> >> >>
> >> >> > On Wed, 08 Jun 2016 14:56:31 +0200, Jan Kara wrote:
> >> >> > (snip)
> >> >> >> Attached patch fixes the issue for me. I'll submit it once a full xfstests
> >> >> >> run finishes for it (which may take a while as our server room is currently
> >> >> >> moving to a different place).
> >> >> >>
> >> >> >> Honza
> >> >> >> --
> >> >> >> Jan Kara <jack@xxxxxxxx>
> >> >> >> SUSE Labs, CR
> >> >> >> From 3a120841a5d9a6c42bf196389467e9e663cf1cf8 Mon Sep 17 00:00:00 2001
> >> >> >> From: Jan Kara <jack@xxxxxxx>
> >> >> >> Date: Wed, 8 Jun 2016 10:01:45 +0200
> >> >> >> Subject: [PATCH] ext4: Fix deadlock during page writeback
> >> >> >>
> >> >> >> Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
> >> >> >> deadlock in ext4_writepages() which was previously much harder to hit.
> >> >> >> After this commit xfstest generic/130 reproduces the deadlock on small
> >> >> >> filesystems.
> >> >> >
> >> >> > Since you marked this for -stable, just a heads-up that the previous patch
> >> >> > for the data exposure was rejected from -stable (see [1]) because it
> >> >> > has the mismatching "!IS_NOQUOTA(inode) &&" line, which didn't exist
> >> >> > until 4.6. I removed it locally but Greg probably wants an official patch.
> >> >> >
> >> >> > So both this and the previous patch need to be submitted.
> >> >> >
> >> >> > [1] http://permalink.gmane.org/gmane.linux.kernel.stable/18074{4,5,6}
> >> >>
> >> >> I'm just wondering if the Jan's patch is not related to blocked
> >> >> processes in following trace. It very hard to hit it and I don't have
> >> >> any reproducer.
> >> >
> >> > This looks like a different issue.
> >> > Does the machine recover itself or is it
> >> > a hard hang and you have to press a reset button?
> >>
> >> The machine is a bit bigger than I have pretended. It has 18 vCPUs and
> >> 160 GB of RAM, and the machine has a dedicated mount point only for
> >> PostgreSQL data.
> >>
> >> Nevertheless, I was always able to ssh to the machine, so the machine
> >> itself was not in a hard hang and ext4 mostly recovers by itself (it took
> >> 30 min). But I have seen situations where every process that touches the
> >> ext4 filesystem goes immediately to D state and does not recover even
> >> after an hour.
> >
> > If such a situation happens, can you run 'echo w >/proc/sysrq-trigger' to
> > dump stuck processes and also run 'iostat -x 1' for a while to see how much
> > IO is happening in the system? That should tell us more.
>
>
> A link to the 'echo w >/proc/sysrq-trigger' output is here, because it's
> a bit too big to mail.
>
> http://expirebox.com/download/68c26e396feb8c9abb0485f857ccea3a.html

Can you upload it again please? I only got around to looking at the file
today and it is already deleted. Thanks!

> I was running iotop and there was roughly ~20 KB/s of write traffic.
>
> What was a bit more interesting was looking at
>
> cat /proc/vmstat | egrep "nr_dirty|nr_writeback"
>
> nr_dirty was around 240 and was slowly counting up, but nr_writeback was
> at ~8800 and was stuck for 120s.

Hum, interesting. This would suggest that IO completion got stuck for some
reason. We'll hopefully see more from the stacktraces.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
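
For anyone who wants to capture the nr_dirty / nr_writeback observation above with timestamps next time the stall happens, here is a minimal sketch in Python. It is not part of the thread; the function names (parse_vmstat, writeback_stalled, watch) and the defaults are made up for illustration. It assumes only that /proc/vmstat has one "name value" pair per line. Unlike the egrep pattern above, it matches the two counter names exactly, so related counters such as nr_dirty_threshold are not picked up.

```python
import time


def parse_vmstat(text, keys=("nr_dirty", "nr_writeback")):
    """Return {counter: value} for the exact counter names in `keys`."""
    wanted = set(keys)
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in wanted:
            counters[parts[0]] = int(parts[1])
    return counters


def writeback_stalled(samples, min_pages=1):
    """True if nr_writeback held one value >= min_pages across all samples."""
    values = [s.get("nr_writeback", 0) for s in samples]
    return len(values) > 1 and values[0] >= min_pages and len(set(values)) == 1


def watch(path="/proc/vmstat", interval=1.0, count=120):
    """Print one timestamped sample per interval and return all samples."""
    samples = []
    for _ in range(count):
        with open(path) as f:
            sample = parse_vmstat(f.read())
        samples.append(sample)
        print(time.strftime("%H:%M:%S"), sample)
        time.sleep(interval)
    return samples


if __name__ == "__main__":
    # Hypothetical defaults: 120 one-second samples, matching the 120s above.
    if writeback_stalled(watch()):
        print("nr_writeback did not change during the sampling window")
```

A flat nr_writeback over the whole window only says that the counter did not move; it does not say why, so it complements the sysrq-w dump and 'iostat -x 1' output rather than replacing them.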