On 28 June 2015 at 03:39, Hin-Tak Leung <hintak.leung@xxxxxxxxx> wrote: > From: Hin-Tak Leung <htl10@xxxxxxxxxxxxxxxxxxxxx> > > Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and > hfs_bnode_find() for finding or creating pages corresponding to an inode) are > immediately kmap()'ed and used (both read and write) and kunmap()'ed, and > should not be page_cache_release()'ed until hfs_bnode_free(). > > This patch fixes a problem I first saw in July 2012: merely running "du" on a > large hfsplus-mounted directory a few times on a reasonably loaded system would > get the hfsplus driver all confused and complaining about B-tree > inconsistencies, and generates a "BUG: Bad page state". Most recently, I can > generate this problem on up-to-date Fedora 22 with shipped kernel 4.0.5, by > running "du /" (="/" + "/home" + "/mnt" + other smaller mounts) and "du /mnt" > simultaneously on two windows, where /mnt is a lightly-used QEMU VM image of > the full Mac OS X 10.9: > > $ df -i / /home /mnt > Filesystem Inodes IUsed IFree IUse% Mounted on > /dev/mapper/fedora-root 3276800 551665 2725135 17% / > /dev/mapper/fedora-home 52879360 716221 52163139 2% /home > /dev/nbd0p2 4294967295 1387818 4293579477 1% /mnt > > After applying the patch, I was able to run "du /" (60+ times) and > "du /mnt" (150+ times) continuously and simultaneously for 6+ hours. > > There are many reports of the hfsplus driver getting confused under load and > generating "BUG: Bad page state" or other similar issues over the years. [1] > > The unpatched code [2] has always been wrong since it entered the kernel tree. > The only reason why it gets away with it is that the kmap/memcpy/kunmap follow > very quickly after the page_cache_release() so the kernel has not had a chance > to reuse the memory for something else, most of the time. > > The current RW driver appears to have followed the design and development of > the earlier read-only hfsplus driver [3], where-by version 0.1 (Dec 2001) had a > B-tree node-centric approach to read_cache_page()/page_cache_release() per > bnode_get()/bnode_put(), migrating towards version 0.2 (June 2002) of caching > and releasing pages per inode extents. When the current RW code first entered > the kernel [2] in 2005, there was an REF_PAGES conditional (and "//" commented > out code) to switch between B-node centric paging to inode-centric paging. > There was a mistake with the direction of one of the REF_PAGES conditionals in > __hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the > read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were > removed, but a page_cache_release() was mistakenly left in (propagating the > "REF_PAGES <-> !REF_PAGE" mistake), and the commented-out page_cache_release() > in bnode_release() (which should be spanned by !REF_PAGES) was never enabled. > > References: > [1]: > Michael Fox, Apr 2013 > http://www.spinics.net/lists/linux-fsdevel/msg63807.html > ("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'") > > Sasha Levin, Feb 2015 > http://lkml.org/lkml/2015/2/20/85 ("use after free") > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814 > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887 > https://bugzilla.kernel.org/show_bug.cgi?id=42342 > https://bugzilla.kernel.org/show_bug.cgi?id=63841 > https://bugzilla.kernel.org/show_bug.cgi?id=78761 > > [2]: > http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\ > fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db > commit d1081202f1d0ee35ab0beb490da4b65d4bc763db > Author: Andrew Morton <akpm@xxxxxxxx> > Date: Wed Feb 25 16:17:36 2004 -0800 > > [PATCH] HFS rewrite > > http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\ > fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd > > commit 91556682e0bf004d98a529bf829d339abb98bbbd > Author: Andrew Morton <akpm@xxxxxxxx> > Date: Wed Feb 25 16:17:48 2004 -0800 > > [PATCH] HFS+ support > > [3]: > http://sourceforge.net/projects/linux-hfsplus/ > > http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/ > http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/ > > http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\ > fs/hfsplus/bnode.c?r1=1.4&r2=1.5 > > Date: Thu Jun 6 09:45:14 2002 +0000 > Use buffer cache instead of page cache in bnode.c. Cache inode extents. > > [4]: > http://git.kernel.org/cgit/linux/kernel/git/\ > stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d > > commit a5e3985fa014029eb6795664c704953720cc7f7d > Author: Roman Zippel <zippel@xxxxxxxxxxxxxx> > Date: Tue Sep 6 15:18:47 2005 -0700 > > [PATCH] hfs: remove debug code > > Signed-off-by: Hin-Tak Leung <htl10@xxxxxxxxxxxxxxxxxxxxx> > Signed-off-by: Sergei Antonov <saproj@xxxxxxxxx> > Reviewed-by: Anton Altaparmakov <anton@xxxxxxxxxx> > Reported-by: Sasha Levin <sasha.levin@xxxxxxxxxx> > Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx> > Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx> > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > Cc: Vyacheslav Dubeyko <slava@xxxxxxxxxxx> > Cc: Sougata Santra <sougata@xxxxxxxxxx> > --- > fs/hfs/bnode.c | 9 ++++----- > fs/hfsplus/bnode.c | 3 --- > 2 files changed, 4 insertions(+), 8 deletions(-) > > diff --git a/fs/hfs/bnode.c b/fs/hfs/bnode.c > index d3fa6bd..221719e 100644 > --- a/fs/hfs/bnode.c > +++ b/fs/hfs/bnode.c > @@ -288,7 +288,6 @@ static struct hfs_bnode *__hfs_bnode_create(struct hfs_btree *tree, u32 cnid) > page_cache_release(page); > goto fail; > } > - page_cache_release(page); > node->page[i] = page; > } > > @@ -398,11 +397,11 @@ node_error: > > void hfs_bnode_free(struct hfs_bnode *node) > { > - //int i; > + int i; > > - //for (i = 0; i < node->tree->pages_per_bnode; i++) > - // if (node->page[i]) > - // page_cache_release(node->page[i]); > + for (i = 0; i < node->tree->pages_per_bnode; i++) > + if (node->page[i]) > + page_cache_release(node->page[i]); > kfree(node); > } > > diff --git a/fs/hfsplus/bnode.c b/fs/hfsplus/bnode.c > index 759708f..6392466 100644 > --- a/fs/hfsplus/bnode.c > +++ b/fs/hfsplus/bnode.c > @@ -454,7 +454,6 @@ static struct hfs_bnode *__hfs_bnode_create(struct hfs_btree *tree, u32 cnid) > page_cache_release(page); > goto fail; > } > - page_cache_release(page); > node->page[i] = page; > } > > @@ -566,13 +565,11 @@ node_error: > > void hfs_bnode_free(struct hfs_bnode *node) > { > -#if 0 > int i; > > for (i = 0; i < node->tree->pages_per_bnode; i++) > if (node->page[i]) > page_cache_release(node->page[i]); > -#endif > kfree(node); > } > > -- > 2.4.3 If I fix something else in hfsplus in the future, will you again submit a combined hfsplus+hfs patch? I would prefer separation. Hoped to receive your "Tested-by:" for my "hfsplus: release bnode pages after use, not before" and then submit a V2 of it with a longer description. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html