> > On Wed 18-11-15 22:36:51, Jan Kara wrote: > > On Tue 17-11-15 13:47:50, Namjae Jeon wrote: > > > > On Mon 09-11-15 14:21:11, Namjae Jeon wrote: > > > > > > > > > > > > On Wed, Oct 21, 2015 at 06:44:10PM +0900, Namjae Jeon wrote: > > > > > > > > Interestingly we're not seeing these memory leaks on the truncate > > > > > > > > path, so I suspect the issue is in how collapse range is clearing > > > > > > > > pages from the page cache, especially pages that were freshly written > > > > > > > > to the journal by the commit but which hadn't yet been writtten to > > > > > > > > disk and then marked as complete so we can allow the relevant > > > > > > > > transaction to be checkpointed. (Although we're not leaking the > > > > > > > > journal head structures, but only the buffer heads, so the story most > > > > > > > > be a bit more complicated than that.) > > > > > > > > > > > > > > Okay, Thanks for sharing your view and points !! > > > > > > > > > > > > > > Currently I can reproduce memory leak issue without collase/insert/zero range. > > > > > > > conditions like the following.(collase/insert/zero range are disable with -I -C -z option > and > > > > add -y > > > > > > option instead of -W) > > > > > > > 1. small size parition(1GB) > > > > > > > 2. run fsx with these options "./fsx -N 30000 -o 128000 -l 500000 -r 4096 -t 512 -w 512 > -Z -R > > > > -y - > > > > > > I -C -z testfile" > > > > > > > And same result with generic/091 is showing (buffer_head leak) > > > > > > > > > > > > > > So I am starting to find root-cause base on your points. > > > > > > > I will share the result or the patch. > > > > > > > > > > > > Thanks, that's very interesting data point. So this makes it appear > > > > > > that the problem *is* probably with how we deal with checkpointing > > > > > > buffers after the pages get discarded using either a truncate or a > > > > > > collapse_range, since the 'y' option causes a lot fsync's, and hence > > > > > > commits, some of which are happening after a truncate command. > > > > > > > > > > > > Thanks for a taking a look at this. I really appreciate it. > > > > > > > > > > > > Cheers, > > > > > > > > > > > > > > > Hi Ted, > > > > > > > > > > Could you review this patch? > > > > > > > > > > Thanks! > > > > > --------------------------------------------------------------------------------- > > > > > Subject: [PATCH] jbd2: try to free buffers from truncated page after > > > > > checkpoint > > > > > > > > > > when ext4 is mounted in data=journal mode, and truncate operation > > > > > such as settatr(size), collopse, insert and zero range are used, there are > > > > > are many truncated pages with NULL page->mapping. Such truncated pages > > > > > pile up quickly due to truncate_pagecache on data pages associated with journal. > > > > > As page->mapping is NULL for such truncated pages, they are not freed > > > > > by drop cache(3) or umount. As a result, MemFree in /proc/meminfo decreases > > > > > quickly and active buffer_head slab objects grow in /proc/slabinfo. > > > > > This patch attempts to free buffers from such pages at the end of jbd2 > > > > > checkpoint, if pages do not have any busy buffers and NULL mapping. > > > > > > > > Hum, why such pages didn't get freed by release_buffer_page() call > > > > happening when processing transaction's forget list? Because the idea is > > > > that such pages should be discarded at that point... > > > Hi Jan, > > > > > > When I checked this code, release_buffer_page is not called > > > when buffer_jbddirty is true. Such buffers with JBD_Dirty are added > > > to new checkpoint. > > > > > > if (buffer_jbddirty(bh)) { > > > JBUFFER_TRACE(jh, "add to new checkpointing trans"); > > > __jbd2_journal_insert_checkpoint(jh, commit_transaction); > > > if (is_journal_aborted(journal)) > > > clear_buffer_jbddirty(bh); > > > } else { > > > J_ASSERT_BH(bh, !buffer_dirty(bh)); > > > /* > > > * The buffer on BJ_Forget list and not jbddirty means > > > * it has been freed by this transaction and hence it > > > * could not have been reallocated until this > > > * transaction has committed. *BUT* it could be > > > * reallocated once we have written all the data to > > > * disk and before we process the buffer on BJ_Forget > > > * list. > > > */ > > > if (!jh->b_next_transaction) > > > try_to_free = 1; > > > } > > > JBUFFER_TRACE(jh, "refile or unfile buffer"); > > > __jbd2_journal_refile_buffer(jh); > > > jbd_unlock_bh_state(bh); > > > > > > Next buffer was unfiled by __jbd2_journal_refile_buffer, JBD_Dirty cleared and > > > BH_Dirty was set in same function. Later it must have been written back as > > > BH_Dirty was cleared. > > > And ext4_wait_for_tail_page_commit-> __ext4_journalled_invalidatepage -> > > > journal_unmap_buffer zaps the buffer: > > > > > > if (!buffer_dirty(bh)) { > > > /* bdflush has written it. We can drop it now */ > > > goto zap_buffer; > > > } > > > > > > next, truncate_pagecache is called, which clears the page mapping. > > > Eventually, remove checkpoint is called, but such page with NULL mapping was > > > not freed. So, I had added release_buffer_page at the end of remove checkpoint > > > to attempt to free such free buffer pages. Please let me know your opinion. > > > > OK, thanks for the detailed analysis. But when the buffer gets truncated, > > jbd2_journal_invalidatepage() either removes the buffer from the > > transaction (obviously didn't happen here) or it sets buffer_freed and > > buffer_jbddirty should get cleared when processing the BJ_Forget list. So > > why that didn't happen? Can you have a look into what > > jbd2_journal_invalidatepage() did to buffer in that page? > > > > Generally truncated buffers should not enter checkpoint list since writing > > them is just a waste of disk bandwidth... > > I thought a bit more about this and my guess is that > jbd2_journal_invalidatepage() fails to invalidate the partial page and > returns EBUSY. However we should see warnings about that from > ext4_journalled_invalidatepage(). Can you see them? This would perfectly > match Ted's observation that the problem happens only for fallocate > operations but not for truncate which does the dance with > ext4_wait_for_tail_page_commit() but fallocate operations don't do it... > > Now if I'm right and this is indeed the path we are hitting, we need to put > more thought into how we handle truncation of partial pages in data=journal > mode. Actually, I can quickly create memleak with simple testcase like: while(1000) { Random seek(max 500M) -> write 1M -> fsync ->truncate(random size) } As there is significant memleak in this situation also, I started focusing on this simple testcase only. When use this testcase, I never saw EBUSY path in jbd2_journal_invalidatepage() -> journal_unmap_buffer(). On the other hand(), for memleak pages jbd2_journal_invalidatepage() -> journal_unmap_buffer() was almost bailing out every time from if (!buffer_dirty(bh)) { /* bdflush has written it. We can drop it now */ goto zap_buffer; } As per debugging, NULL mapping page or buffer is created in below scenario: Write(or Kjournald) -> jbd2_journal_commit_transaction -> BH_JBDDirty buffer is added to forget list. -> Buffer is added to new checkpoint, additional reference count is taken on b_jcount which keeps buffer busy until remove checkpoint. -> Buffer is unfiled, removed from forget list, BH_JBDDirty is cleared and BH_Dirty is set, it is exposed to VM. As BH_Dirty is set, buffer is written by ext4_writepage by flusher. ext4_writepage finally clears BH_Dirty. Now, ftruncate is called, almost every time code is returned from below path: Ftruncate -> jbd2_journal_invalidatepage -> journal_unmap_buffer if (!buffer_dirty(bh)) { /* bdflush has written it. We can drop it now */ goto zap_buffer; } page was not released by jbd2_journal_invalidatepage, mainly because b_jcount was 2 on entry of jbd2_journal_invalidatepage (addional reference by insert checkpoint above), buffer state was sane. At exit time jbd2_journal_invalidatepage, b_jcount is still 1. Next, truncate_pagecache is called which makes page mapping NULL. Later, __jbd2_journal_remove_checkpoint is called, it makes b_jcount zero and remove BH_JBD from buffer. Such pages with NULL mapping pages are not freed. Previous patch fixed this leak if try release_page_buffer at the end of checkpoint. > > BTW: The ext4_force_commit() call in fallocate operations is definitely > racy as pages can get dirtied and journalled in the running transaction > anytime while we wait for transaction commit (which I suspect is happening > in your testcase)... So that needs some more thought as well. Need to check more on this. I can reproduce memleak without fsync (i.e. ext4_force_commit), though it is less. Thanks! > > Honza > -- > Jan Kara <jack@xxxxxxxx> > SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html