Hi Andrew,

> On 10 Jun 2015, at 01:15, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Mon, 8 Jun 2015 18:50:00 +0200 Sergei Antonov <saproj@xxxxxxxxx> wrote:
>>>> You are basically saying you don't understand it. Too bad, because the
>>>> bug is very simple. It is the "use after free" type of bug, and it can
>>>> be illustrated by this:
>>>> (1) void *ptr = malloc(...);
>>>> (2) free(ptr);
>>>> (3) memcpy(..., ptr, 1);
>>>> Guess which two of these three lines are executed in the wrong order.
>>>>
>>>> My patch is about the same type of bug, but with memory page mapping.
>>>> The driver currently accesses pages that may be unavailable or may
>>>> contain different data. The problem is more likely to occur when
>>>> memory is a limited resource. I reproduced it while running a
>>>> memory-hungry program.
>>>
>>> I worried not about myself but about potential readers of the
>>> description of the fix. The description is completely obscure, and it
>>> needs to describe the fix in a clear and descriptive manner. This is my
>>> request. Please describe the fix in a clear way.
>>
>> The description is just right.
>
> Yes, I too would like to hear much more about your thinking on this,
> and a detailed description of the bug and how the patch fixes it.
>
> The code is distressingly undocumented and has been that way since
> Roman Zippel's original commit in 2004.
>
> From the looks of it, that loop in __hfs_bnode_create() is simply doing
> readahead and is designed as a performance optimisation. The pages are
> pulled into pagecache in the expectation that they will soon be
> accessed. What your patch does is to instead pin the pages in
> pagecache until the bnode is freed. If we're going to do this then we
> need to be very careful about worst-case scenarios: we could even run
> the machine out of memory.
>
> If I'm correct, and this is just readahead, then the bug lies elsewhere:
> if other parts of hfsplus are assuming that this memory is in pagecache
> then that's an error - that code (where is it?) should instead be
> performing a pagecache lookup and, if the page isn't present, read it
> from disk again.
>
> But for others to be able to review and understand this change and
> suggest alternatives, we'll need a much, much better changelog!

Given I had just looked at the code...

__hfs_bnode_create() is not doing read-ahead at all from my reading of
the code. What it does is gather the needed pages that are immediately
processed, i.e. hfs_bnode_find() does:

- call __hfs_bnode_create(), which gathers the pages into the array of
  pages @node->page;

- kmap() some page from the @node->page array, read some information
  from the kmapped page, then kunmap() it again (this is actually insane
  - either all the pages should have been kmapped in __hfs_bnode_create()
  or it should be using kmap_atomic()!).

Here is the actual code:

	desc = (struct hfs_bnode_desc *)(kmap(node->page[0]) +
					 node->page_offset);
	node->prev = be32_to_cpu(desc->prev);
	node->next = be32_to_cpu(desc->next);
	node->num_recs = be16_to_cpu(desc->num_recs);
	node->type = desc->type;
	node->height = desc->height;
	kunmap(node->page[0]);

That actually makes some sense (though kmap_atomic() would still be much
better!) but then this follows:

	off = hfs_bnode_read_u16(node, rec_off);

and (omitting lines for clarity):

	for (i = 1; i <= node->num_recs; off = next_off, i++) {
		...
		next_off = hfs_bnode_read_u16(node, rec_off);
		...
		key_size = hfs_bnode_read_u16(node, off) + 2;
		...
	}

All those hfs_bnode_read_u16() calls are simply "kmap, copy u16 from the
kmapped page, kunmap", so again a crazy thing to be doing - either use
kmap_atomic(), or probably better still, just kmap() all the pages in the
@node->page array up front, at the time they are all read in...
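For reference, each of those reads boils down to roughly the following.
This is a simplified sketch from my reading of fs/hfsplus/bnode.c (it
ignores reads that straddle a page boundary), so treat it as an
illustration of the access pattern rather than the verbatim code:

	/* Copy @len bytes starting at node offset @off into @buf. */
	void hfs_bnode_read(struct hfs_bnode *node, void *buf, int off, int len)
	{
		struct page *page;

		off += node->page_offset;
		page = node->page[off >> PAGE_CACHE_SHIFT];
		off &= ~PAGE_CACHE_MASK;

		/* A full kmap()/kunmap() round trip for every read. */
		memcpy(buf, kmap(page) + off, len);
		kunmap(page);
	}

	u16 hfs_bnode_read_u16(struct hfs_bnode *node, int off)
	{
		__be16 data;

		/* Each u16 goes through the generic byte-copy helper above. */
		hfs_bnode_read(node, &data, off, 2);
		return be16_to_cpu(data);
	}

So every u16 fetched from the node costs a kmap()/kunmap() pair, which
is why mapping each page once (or using kmap_atomic()) would already be
an improvement.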
If you consider __hfs_bnode_create() to be doing readahead, then all
those calls to hfs_bnode_read_u16() would have to do a
read_mapping_page() and incur the overhead of a radix tree lookup for
each u16 read. That would not be very good for performance/CPU usage.

Also note there is no danger of running out of RAM, as the largest
allowed B-tree node size in HFS+ according to Apple's own specification
is 32kiB, i.e. with a 4k page size that is only 8 pages maximum -
readpages functions deal with significantly more pages at a time without
worrying about running out of RAM.

But yes, I agree with you that HFS+ is an undocumented mess, but at
least Sergei is putting some effort into making it better!

Best regards,

	Anton

-- 
Anton Altaparmakov <anton at tuxera.com> (replace at with @)
Lead in File System Development, Tuxera Inc., http://www.tuxera.com/
Linux NTFS maintainer