On Tue, 9 Jun 2015 23:08:48 +0000 Anton Altaparmakov <anton@xxxxxxxxxx> wrote:

> Hi Andrew,
>
> Forgot to reply to one point you made:
>
> > On 10 Jun 2015, at 01:15, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> > Yes, I too would like to hear much more about your thinking on this,
> > and a detailed description of the bug and how the patch fixes it.
>
> Perhaps the patch description is lacking, but here is what the current code does:
>
>	struct page *page = read_mapping_page();
>	page_cache_release(page);
>	u8 *kaddr = kmap(page);
>	memcpy(..., kaddr, ...);
>	kunmap(page);
>
> Now in what world is that a valid thing to do? Once the page_cache_release()
> happens, the page is no longer pinned, so the kmap() references memory that
> is not in use, and so do the memcpy() and the kunmap().
>
> The only reason the code gets away with it is that the kmap/memcpy/kunmap
> follow so quickly after the page_cache_release() that the kernel has not had
> a chance to reuse the memory for something else.
>
> Sergei said that he hit the problem when running memory-intensive processes
> at the same time, so the kernel was thrashing/evicting/reusing page cache
> pages at high speed. The kmap() then mapped a different page from the one
> that was page_cache_release()d, so the memcpy() effectively copied random
> data, which the verification code considered corrupt. The entire B-tree
> node was therefore treated as corrupt and, in Sergei's case, the volume
> failed to mount.
>
> And his patch changes the above to this instead:
>
>	struct page *page = read_mapping_page();
>	u8 *kaddr = kmap(page);
>	memcpy(..., kaddr, ...);
>	kunmap(page);
>	page_cache_release(page);
>
> Which is the correct sequence of events.

OK, pinning 8 pages for the duration of hfs_bnode_find() sounds reasonable.
This is a painful way to write a changelog :(

> Although perhaps there should also be a mark_page_accessed(page) thrown
> in there, too, before the page_cache_release(), in the expectation that
> the B-tree node is likely to be used again?

Probably.

Also, using read_mapping_page() is quite inefficient: it's a synchronous
read. Putting a single call to read_cache_pages() before the loop would be
sufficient to get all that IO done in a single lump. But first we fix the
bug.
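For reference, folding the mark_page_accessed() suggestion into the fixed sequence would look something like this (a sketch against the 2015-era kernel APIs, untested; the mapping/index/offset variables are illustrative, not from the actual hfs_bnode code):

```c
	struct page *page = read_mapping_page(mapping, idx, NULL);
	if (IS_ERR(page))
		return PTR_ERR(page);
	kaddr = kmap(page);
	memcpy(buf, kaddr + off, len);
	kunmap(page);
	mark_page_accessed(page);	/* node is likely to be read again soon */
	page_cache_release(page);
```

mark_page_accessed() nudges the page up the LRU, so a hot B-tree node is less likely to be the next eviction victim once its reference is dropped.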