jbd2: J_ASSERT(journal->j_free > 1) in jbd2_journal_next_log_block()

Azat Khuzhin <a3at.mail@xxxxxxxxx> · Wed, 17 May 2017 13:58:40 +0300

Hi everyone!

Recently I got next trace on one of machines (backtrace can be not in correct
order since it captured via netconsole):

  [7932653.070316] ------------[ cut here ]------------
  [7932653.070359] kernel BUG at fs/jbd2/journal.c:759!
  [7932653.070392] invalid opcode: 0000 [#1] SMP
  [7932653.070424] Modules linked in: ...
  [7932653.071108] CPU: 0 PID: 356 Comm: jbd2/md0-8 Not tainted 4.8.1 #1
  [7932653.071164] Hardware name: Supermicro Super Server/X10DRW-i, BIOS 1.1 08/13/2015
  [7932653.071222] task: ffff881ff1babfc0 task.stack: ffff881fef890000
   [<ffffffffa0454728>] jbd2_journal_next_log_block+0x68/0x70 [jbd2]
  [7932653.071258] RIP: 0010:[<ffffffffa0454728>]
  [7932653.071338] RSP: 0018:ffff881fef893cc8  EFLAGS: 00010246
  [7932653.071374] RAX: 0000000000000001 RBX: ffff883fed5a8800 RCX: 00000000000007c4
  [7932653.071432] RDX: 00000000000000ff RSI: ffff881fef893d68 RDI: ffff883fed5a8824
  [7932653.071489] RBP: ffff881fef893d68 R08: 0000000000000000 R09: 0000000000000000
  [7932653.071546] R10: 0000000000000001 R11: 0000000002408840 R12: ffff8822a92aad00
  [7932653.071603] R13: ffff882dfc7b083c R14: ffff883fed5a8800 R15: ffff883fecb5ad00
  [7932653.071659] FS:  0000000000000000(0000) GS:ffff881fffc00000(0000) knlGS:0000000000000000
  [7932653.071718] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [7932653.071753] CR2: 00007f2d363fc730 CR3: 0000000001806000 CR4: 00000000001406f0
  [7932653.071809] Stack:
   0000000800000018
   ffffffffa044d7e5
   ffff881fef893e60
  [7932653.071843]  ffff881274fca7f8
   ffff883fecb5ad9c
   ffff881fe4e26f00
   001c2ecad4ef6dba
  [7932653.071913]  ffff881ff1babfc0
   ffff882dfc7b083c
   ffff883ff10c3000
   000007c400000000
  [7932653.071982]  ffff881ff1babfc0
  [7932653.072051] Call Trace:
  [7932653.072082]  [<ffffffffa044d7e5>] ? jbd2_journal_commit_transaction+0xa65/0x18a0 [jbd2]
  [7932653.072143]  [<ffffffffa04530f2>] ? kjournald2+0xb2/0x240 [jbd2]
  [7932653.072182]  [<ffffffff81088e90>] ? wake_atomic_t_function+0x50/0x50
  [7932653.072220]  [<ffffffffa0453040>] ? commit_timeout+0x10/0x10 [jbd2]
  [7932653.072258]  [<ffffffff8106e17d>] ? kthread+0xbd/0xe0
  [7932653.072294]  [<ffffffff814638cf>] ? ret_from_fork+0x1f/0x40
  [7932653.072330]  [<ffffffff8106e0c0>] ? kthread_worker_fn+0x160/0x160
  ...
  [<ffffffffa0454728>] jbd2_journal_next_log_block+0x68/0x70 [jbd2]
  [7932653.072732] RIP
  [7932653.072793]  RSP <ffff881fef893cc8>
  [7932653.073307] ---[ end trace 3f9eca026a57b5c2 ]---

And a lot of WARN_ON(blk_needs_flush_plug(tsk)) after:

  [7932653.115857] WARNING: CPU: 0 PID: 356 at kernel/exit.c:737 do_exit+0x50/0xad0
  ...
  [7932653.123101] Call Trace:
  [7932653.123188]  [<ffffffff8122fd58>] ? dump_stack+0x46/0x5e
  [7932653.123280]  [<ffffffff8105312e>] ? __warn+0xbe/0xe0
  [7932653.123371]  [<ffffffff810557b0>] ? do_exit+0x50/0xad0
  [7932653.123462]  [<ffffffff81464a37>] ? rewind_stack_do_exit+0x17/0x20
  [7932653.123556]  [<ffffffff8106e0c0>] ? kthread_worker_fn+0x160/0x160
  [7932653.123648] ---[ end trace 3f9eca026a57b5c4 ]---
  [7932653.123737] Fixing recursive fault but reboot is needed!
  [7932653.123830] BUG: unable to handle kernel

I looked through the code on and on, mailing-lists ([1] and [2]), fixed bugs
(seems that [3] cannot make this), but couldn't find anything that can be
relevant.

And there is a comment in jbd2_journal_commit_transaction():
  /*
   * start_this_handle() uses t_outstanding_credits to determine
   * the free space in the log, but this counter is changed
   * by jbd2_journal_next_log_block() also.
   */
  atomic_dec(&commit_transaction->t_outstanding_credits);

And it is there over decades (since "fs/jbd"), but nothing in
jbd2_journal_next_log_block() touches commit_transaction, maybe it is about
"j_free" (since jbd2_log_space_left() uses t_outstanding_credits and j_free).

[1]: https://lkml.org/lkml/2005/1/19/144
[2]: https://lkml.org/lkml/2008/1/11/235
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=559cce698eaf4ccecb2213b2519ea3a0413e5155

Right now I don't have any reasons to think that this is the hardware
problem, but it is SAMSUNG MZ7KM960HAHP-0Z005 (SSD).

Also this is not reproducible, I got it once (a few days ago, and it works well
before for a year or so), and since than I tried to reproduce it without any
luck (with kvm and without).

It will be great if you can give me some hints in debugging this.

Thanks,
  Azat.