Re: [PATCH] hfsplus: release bnode pages after use, not before

Anton Altaparmakov <anton@xxxxxxxxxx> · Tue, 9 Jun 2015 23:23:05 +0000

btw. XFS does something very similar except they actually go and vm_map_ram() the pages into a contiguous virtual memory region so they can just do normal accesses into the linear buffer instead of having to mess about with the fact the node is spread across multiple pages like HFS+ is doing and they use submit_bio() directly instead of read_mapping_page() but the idea is much the same...  See fs/xfs/xfs_buf.c...

Best regards,

	Anton

> On 10 Jun 2015, at 02:08, Anton Altaparmakov <anton@xxxxxxxxxx> wrote:
> 
> Hi Andrew,
> 
> Forgot to reply to one point you made:
> 
>> On 10 Jun 2015, at 01:15, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>> Yes, I too would like to hear much more about your thinking on this,
>> and a detailed description of the bug and how the patch fixes it.
> 
> Perhaps the patch description is lacking but here is what the current code does:
> 
> struct page *page = read_mapping_page();
> page_cache_release(page);
> u8 *kaddr = kmap(page);
> memcpy(..., kaddr, ...);
> kunmap(page);
> 
> Now in what world is that a valid thing to do?  When the page_cache_release() happens the page is no longer allocated and the kmap() is referencing not-in-use memory and so is the memcpy() and so is the kunmap().
> 
> The only reason the code gets away with it is that the kmap/memcpy/kunmap follow very quickly after the page_cache_release() so the kernel has not had a chance to reuse the memory for something else.
> 
> Sergei said that he got a problem when he was running memory intensive processes at same time so the kernel was thrashing/evicting/resuing page cache pages at high speed and then obviously the kmap() actually mapped a different page to the original that was page_cache_release()d and thus the memcpy() effectively copied random data which was then considered corrupt by the verification code and thus the entire B tree node was considered corrupt and in Sergei's case the volume thus failed to mount.
> 
> And his patch changes the above to this instead:
> 
> struct page *page = read_mapping_page();
> u8 *kaddr = kmap(page);
> memcpy(..., kaddr, ...);
> kunmap(page);
> page_cache_release(page);
> 
> Which is the correct sequence of events.
> 
> Although perhaps there should also be a mark_page_accessed(page); thrown in there, too, before the page_cache_release() in the expectation that the B tree node is likely to be used again?
> 
> Best regards,
> 
> 	Anton
-- 
Anton Altaparmakov <anton at tuxera.com> (replace at with @)
Lead in File System Development, Tuxera Inc., http://www.tuxera.com/
Linux NTFS maintainer

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html