On Tue, 03 Sep 2013 19:18:04 +0400, Vyacheslav Dubeyko wrote:
> Hi Ryusuke,
>
> On Tue, 2013-09-03 at 03:13 +0900, Ryusuke Konishi wrote:
>> On Mon, 02 Sep 2013 17:27:44 +0400, Vyacheslav Dubeyko wrote:
>> > From: Vyacheslav Dubeyko <slava@xxxxxxxxxxx>
>> > Subject: [PATCH] [CRITICAL] nilfs2: fix issue with race condition of competition between segments for dirty blocks
>>
>> Before considering whether to apply this patch, please confirm that the
>> suspected issue you pointed out, "dirty buffer_heads are added in
>> several lists of payload_buffers", is actually happening.
>>
>> The easiest way to do this is to test the following patch:
>>
>> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> index bd88a74..cab38db 100644
>> --- a/fs/nilfs2/segment.c
>> +++ b/fs/nilfs2/segment.c
>> @@ -668,6 +668,7 @@ static size_t nilfs_lookup_dirty_data_buffers(struct inode *inode,
>>  			if (!buffer_dirty(bh))
>>  				continue;
>>  			get_bh(bh);
>> +			BUG_ON(!list_empty(&bh->b_assoc_buffers));
>>  			list_add_tail(&bh->b_assoc_buffers, listp);
>>  			ndirties++;
>>  			if (unlikely(ndirties >= nlimit)) {
>> @@ -701,6 +702,7 @@ static void nilfs_lookup_dirty_node_buffers(struct inode *inode,
>>  		do {
>>  			if (buffer_dirty(bh)) {
>>  				get_bh(bh);
>> +				BUG_ON(!list_empty(&bh->b_assoc_buffers));
>>  				list_add_tail(&bh->b_assoc_buffers,
>> 					      listp);
>>  			}
>>
>
> Yes, I can confirm the issue. I applied your patch, and after
> reproducing the issue I got the following output in my syslog:
>
> [ 424.605561] ------------[ cut here ]------------
> [ 424.605576] kernel BUG at fs/nilfs2/segment.c:672!
> [ 424.605581] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 424.605589] Modules linked in: rfcomm bnep bluetooth parport_pc ppdev snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_pcm xfs snd_seq_midi kvm_amd snd_rawmidi kvm snd_seq_midi_event snd_seq r8712u(C) snd_timer snd_seq_device snd sp5100_tco microcode soundcore k10temp i2c_piix4 lp snd_page_alloc parport mac_hid video ext2 sky2 btrfs raid6_pq xor zlib_deflate libcrc32c
> [ 424.605650] CPU: 0 PID: 358 Comm: segctord Tainted: G C 3.10.0-rc5+ #116
> [ 424.605654] Hardware name: /EDGE-FT1M1 E450, BIOS 4.6.4 11/15/2011
> [ 424.605660] task: ffff880208ad22a0 ti: ffff8802085e0000 task.ti: ffff8802085e0000
> [ 424.605663] RIP: 0010:[<ffffffff812bff53>] [<ffffffff812bff53>] nilfs_lookup_dirty_data_buffers.isra.22+0x203/0x270
> [ 424.605678] RSP: 0000:ffff8802085e1b08 EFLAGS: 00010293
> [ 424.605682] RAX: ffff880219589618 RBX: ffff8802085e1c08 RCX: ffff88022dfb6000
> [ 424.605685] RDX: ffff880219589660 RSI: ffff880219589618 RDI: ffff88022dff8aa8
> [ 424.605689] RBP: ffff8802085e1be8 R08: 4600000000000000 R09: a8001c7123000000
> [ 424.605692] R10: 57ffcd8ee41c48c0 R11: 0000000000000000 R12: 0000000000000000
> [ 424.605695] R13: ffffea00071c48c0 R14: 0000000000000000 R15: 0000000000000800
> [ 424.605699] FS: 00002aaaaaae3780(0000) GS:ffff880225a00000(0000) knlGS:0000000000000000
> [ 424.605703] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 424.605706] CR2: 00007fff2945bff8 CR3: 00000002186f9000 CR4: 00000000000007f0
> [ 424.605710] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 424.605713] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 424.605716] Stack:
> [ 424.605719]  ffff8802085e1b78 ffff880219177508 ffff8802191773cc ffffffffffffffff
> [ 424.605728]  ffff88021a637958 000000000000000e 0000000000000000 ffffea00071c48c0
> [ 424.605735]  ffffea00071c4880 ffffea00071c4840 ffffea00071c4800 ffffea00071dd7c0
> [ 424.605742] Call Trace:
> [ 424.605753]  [<ffffffff812c0050>] nilfs_segctor_scan_file+0x90/0x190
> [ 424.605760]  [<ffffffff812c0736>] nilfs_segctor_do_construct+0x596/0x1ba0
> [ 424.605767]  [<ffffffff812c2053>] nilfs_segctor_construct+0x183/0x2a0
> [ 424.605774]  [<ffffffff812c22b6>] nilfs_segctor_thread+0x146/0x370
> [ 424.605781]  [<ffffffff812c2170>] ? nilfs_segctor_construct+0x2a0/0x2a0
> [ 424.605788]  [<ffffffff8106bc9d>] kthread+0xed/0x100
> [ 424.605794]  [<ffffffff8106bbb0>] ? flush_kthread_worker+0x190/0x190
> [ 424.605802]  [<ffffffff816ef2dc>] ret_from_fork+0x7c/0xb0
> [ 424.605808]  [<ffffffff8106bbb0>] ? flush_kthread_worker+0x190/0x190
> [ 424.605811] Code: ef 8b 08 d3 e6 48 63 f6 e8 2b 3d ef ff e9 19 ff ff ff 85 c0 74 ac 66 90 eb 9c 8b 95 48 ff ff ff 85 d2 75 09 e8 6f 48 42 00 eb ad <0f> 0b 48 8d bd 48 ff ff ff e8 1f 29 e7 ff e8 5a 48 42 00 eb 98
> [ 424.605886] RIP  [<ffffffff812bff53>] nilfs_lookup_dirty_data_buffers.isra.22+0x203/0x270
> [ 424.605893]  RSP <ffff8802085e1b08>
> [ 424.605900] ---[ end trace 0061903595fb4cff ]---

Thank you for testing the patch.

Ok, let's narrow down the root cause of this issue.

>> You can rewrite the above BUG_ON() check with the following debug code
>> if you want to get detailed information on the buffer:
>>
>> 	if (!list_empty(&bh->b_assoc_buffers)) {
>> 		/* dump information on bh */
>> 	}
>>
>> As a workaround, we can avoid double registration to different payload
>> lists by testing list_empty(&bh->b_assoc_buffers) instead of adding a
>> BH_async_write flag.
>>
>
> Ok. I agree that we need to work out another fix if you think my
> approach is not clear enough.
>
>> However, I think we should first narrow down the root cause of why
>> the issue "dirty buffer_heads are added in several lists of
>> payload_buffers" happens.
>>
>> We designed the log writer so that the same buffer is never registered
>> to two different payload buffer lists at the same time. If that
>> invariant is not satisfied, it's a bug.
>>
>> What is the root cause of the "double registration" of a buffer head,
>> do you think?
>>
>
> My understanding of the cause of the issue is as follows.
>
> First of all, I am talking about the full-segment case, not about
> partial segments. As far as I can see, everything works fine in the
> partial-segment case, because the processing of the partial segments
> of one full segment takes place within a single iteration of the loop
> in nilfs_segctor_do_construct().
>
> So, if we have a file whose dirty block count is greater than the full
> segment size, that dirty file will be processed over several iterations
> of the loop in nilfs_segctor_do_construct(). We also need to take into
> account the asynchronous nature of the processing of submitted bio
> requests in the block layer. The search for dirty blocks is done in
> nilfs_lookup_dirty_data_buffers() and nilfs_lookup_dirty_node_buffers()
> by looking up dirty memory pages in the pagecache and selecting the
> dirty buffers found there. The dirty blocks found are added to the
> current segbuf. Then the dirty blocks are prepared for the write
> operation and write requests are submitted to the block layer. When the
> block layer finishes processing a bio, it calls nilfs_end_bio_write().
> The segctor thread can complete a write request in
> nilfs_segctor_complete_write() only after nilfs_end_bio_write() has
> been called, and buffers are actually marked clean in
> nilfs_segctor_complete_write(). But if we submitted the dirty blocks
> of file A in a previous iteration and try to process the dirty blocks
> of file A again in the next iteration, without the
> nilfs_end_bio_write()/nilfs_segctor_complete_write() pair having run
> for the previous iteration, then we get the "double registration" of a
> buffer head (because we find the pages under asynchronous write as
> dirty again).

Usually pages in nilfs are not marked dirty while the log writer is
writing; ns_segctor_sem is used for this exclusion control. But we have
an exception -- nilfs_set_page_dirty().
I now suspect the root cause is that nilfs_set_page_dirty() can
asynchronously mark pages dirty on mmapped files.

Can you confirm this assumption by comparing the addresses of the
following two page structures?

1. The address of the page structure passed to nilfs_set_page_dirty().
2. bh->b_page of the buffer head captured by the BUG_ON checks above.

Regards,
Ryusuke Konishi

> Maybe I don't clearly understand the conditions of this issue or the
> NILFS2 driver's architecture, but I can't see another solution for
> this issue yet. I suppose we need to discuss the issue in more detail.
> What do you think?
>
> You said that the log writer design should prevent a buffer from being
> registered in several payload buffer lists. Could you describe this in
> more detail?
>
> With the best regards,
> Vyacheslav Dubeyko.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html