Re: Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable "kernel BUG at fs/jbd2/commit.c:534" from Postfix on ext4

Hello again!

I know it's been ages, but I finally got some time to get that patch
tested out and try additional debugging.

On Sep 01, 2011, at 11:17, Jan Kara wrote:
> On Tue 30-08-11 19:26:22, Moffett, Kyle D wrote:
>> On Aug 30, 2011, at 18:12, Jan Kara wrote:
>>>> I can still trigger it on my VM snapshot very easily, so if you have anything
>>>> you think I should test I would be very happy to give it a shot.
>>> 
>>> OK, so in the meantime I found a bug in data=journal code which could be
>>> related to your problem. It is fixed by commit
>>> 2d859db3e4a82a365572592d57624a5f996ed0ec which is in 3.1-rc1. Have you
>>> tried that or newer kernel as well?
>>> 
>>> If the problem still is not fixed, I can provide some debugging patch to
>>> you. We spoke with Josef Bacik how errors like yours could happen so I have
>>> some places to watch...
>> 
>> I have not tried anything more recent; I'm actually a bit reluctant to move
>> away from the Debian squeeze official kernels since I do need the security
>> updates.
>> 
>> I took a quick look and I can't find that function in 2.6.32, so I assume it
>> would be a rather nontrivial back-port.  It looks like the relevant code
>> used to be in ext4_clear_inode somewhere?
> It's not that hard - untested patch attached.

So this applied mostly cleanly (with one minor context-only conflict in
the 2.6.32.17 patch), but unfortunately it didn't resolve the problem.
Just as a sanity check, I upgraded to the Debian 3.1.0-1-amd64 kernel
(based on kernel version 3.1.1), and the problem still occurs there too
(additional info at the end of the email).

Looking at the issue again, I don't think it has anything to do with
file deletion at all.

Specifically, there are a grand total of 4 files in that filesystem
(alongside an empty "lost+found" directory):
  master.lock
  prng_exch
  smtpd_scache.db
  smtp_scache.db

As far as I can tell, none of those is ever deleted during normal
operation.

The crash occurs very quickly after starting postfix.  It connects to
the external email server (using TLS) and begins to flush queued mail.

At that point, the "tlsmgr" daemon tries to update the "smtp_scache.db"
file, which is a Berkeley DB about 40k in size.  Somewhere in there,
the Berkeley DB does an fdatasync().
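
For anyone who wants to poke at that access pattern without Postfix in
the loop, here is a minimal sketch of it: overwrite a roughly 40k file
in place and call fdatasync() after each write.  This is only my guess
at the shape of the I/O, not what Berkeley DB actually does internally,
and I have not confirmed that it triggers the BUG.  The function name,
the default size and the round count are all made up for illustration.

```python
import os
import tempfile


def rewrite_and_sync(path, size=40 * 1024, rounds=3):
    """Overwrite the first `size` bytes of `path` in place, calling
    fdatasync() after each round. Returns the final file size.

    Hypothetical helper: it only approximates "update a small database
    file, then fdatasync()"; it is not Berkeley DB's real write pattern.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        for i in range(rounds):
            os.lseek(fd, 0, os.SEEK_SET)
            payload = bytes([i % 256]) * size
            written = 0
            # os.write() may write fewer bytes than requested, so loop.
            while written < size:
                written += os.write(fd, payload[written:])
            # Flush file data (and the metadata needed to read it back).
            os.fdatasync(fd)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Scratch directory for the demo; point this at the affected ext4
    # mount instead if you want to exercise that filesystem.
    with tempfile.TemporaryDirectory() as scratch:
        print(rewrite_and_sync(os.path.join(scratch, "scache-test.db")))
```

Running it against a scratch file on the affected filesystem (mounted
the same way as the Postfix data directory) would at least show whether
a plain overwrite-plus-fdatasync() is enough, or whether something more
specific to tlsmgr is involved.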

The "fdatasync()" apparently triggers the bad behavior from the "jbd2"
thread, which then oopses in fs/jbd2/commit.c:485 (which appears to be
the same BUG_ON() as before).

The stack looks something like this:
  jbd2_journal_commit_transaction+0x4ea/0x1053 [jbd2]
  kjournald2+0xc0/0x20a [jbd2]
  add_wait_queue+0x3c/0x3c
  commit_timeout+0x5/0x5 [jbd2]
  kthread+0x76/0x7e
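
In case the notation is unfamiliar: each frame reads as
symbol+0xOFFSET/0xSIZE, optionally followed by [module], where the
offset is how far into the function the return address sits and the
size is the total length of the function in bytes.  A small sketch for
pulling those fields apart follows; the helper name and the regular
expression are mine, not part of any kernel tooling.

```python
import re

# symbol+0xOFFSET/0xSIZE, optionally followed by " [module]"
FRAME_RE = re.compile(
    r"^\s*(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x(?P<offset>[0-9a-fA-F]+)"
    r"/0x(?P<size>[0-9a-fA-F]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?\s*$"
)


def parse_frame(line):
    """Split one stack-trace frame into symbol, offset, size and module.

    Hypothetical helper for illustration; raises ValueError when the
    line does not look like a frame.
    """
    m = FRAME_RE.match(line)
    if m is None:
        raise ValueError("not a stack frame: %r" % line)
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "size": int(m.group("size"), 16),
        "module": m.group("module"),  # None for built-in symbols
    }


if __name__ == "__main__":
    print(parse_frame("  kjournald2+0xc0/0x20a [jbd2]"))
```

With the offset in hand, the faulting location can be matched against a
disassembly of the module built from the same kernel source.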

Cheers,
Kyle Moffett

--
Curious about my work on the Debian powerpcspe port?
I'm keeping a blog here: http://pureperl.blogspot.com/

