On Mon, Aug 27, 2012 at 08:26:07AM +1000, Dave Chinner wrote:
> On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote:
> > On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote:
> > > On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote:
> > > > On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote:
> > > > > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
> > > > > > "HWPOISON: report sticky EIO for poisoned file" still has a corner case
> > > > > > where we have possibilities of data lost. This is because in this fix
> > > > > > AS_HWPOISON is cleared when the inode cache is dropped.
> > > ....
> > > > > > --- v3.6-rc1.orig/mm/truncate.c
> > > > > > +++ v3.6-rc1/mm/truncate.c
> > > > > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize)
> > > > > >
> > > > > >  	oldsize = inode->i_size;
> > > > > >  	i_size_write(inode, newsize);
> > > > > > +	if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize))
> > > > > > +		mapping_clear_hwpoison(inode->i_mapping);
> > > > >
> > > > > So only a truncate to zero size will clear the poison flag?
> > > >
> > > > Yes, this is because we only know if the file is affected by hwpoison,
> > > > but not where the hwpoisoned page is in the file. We could remember it,
> > > > but I did not do it for simplicity.
> > >
> > > Surely the page has flags on it to say it is poisoned? That is,
> > > after truncating the page cache, if the address space is poisoned,
> > > then you can do a pass across the mapping tree checking if there are
> > > any poisoned pages left? Or perhaps adding a new mapping tree tag so
> > > that the poisoned status is automatically determined by the presence
> > > of the poisoned page in the mapping tree?
> >
> > The answer for the first question is yes. And for the second question,
> > I don't think it's easy because the mapping tree has no reference to
> > the error page (I explain more about this below, please see also it,)
> > and it can cost too much to search poisoned pages over page cache in
> > each request.
>
> Which is my point about a radix tree tag - that's very efficient.
>
> > And for the third question, I think we could do this, but to do it
> > we need an additional space (8 bytes) in struct radix_tree_node.
> > Considering that this new tag is not used so frequently, so I'm not
> > sure that it's worth the cost.
>
> A radix tree node is currently 560 bytes on x86_64, packed 7 to a
> page. i.e. using 3920 bytes. We can add another 8 bytes to it
> without increasing memory usage at all. So, no cost at all.

OK.
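To make the tag idea concrete, here is the kind of thing I have in mind
(an untested sketch against v3.6-rc1; the tag name and helpers are
provisional, and it assumes the poisoned page, or some placeholder entry,
is left in the mapping tree so there is a slot for the tag to live on):

/* include/linux/fs.h: provisional new tag; needs RADIX_TREE_MAX_TAGS
 * bumped from 3 to 4, which is where the extra 8 bytes per
 * radix_tree_node come from. */
#define PAGECACHE_TAG_HWPOISON	3

/* mm/memory-failure.c: tag the poisoned slot instead of setting a
 * sticky AS_HWPOISON flag on the whole mapping. */
static void hwpoison_tag_page(struct address_space *mapping,
			      struct page *page)
{
	spin_lock_irq(&mapping->tree_lock);
	radix_tree_tag_set(&mapping->page_tree, page->index,
			   PAGECACHE_TAG_HWPOISON);
	spin_unlock_irq(&mapping->tree_lock);
}

/* read/write/fsync paths: "is any poisoned page left?" becomes a cheap
 * root-tag test, and truncating the poisoned range clears it
 * automatically when the slot is deleted. */
static inline int mapping_hwpoisoned(struct address_space *mapping)
{
	return radix_tree_tagged(&mapping->page_tree, PAGECACHE_TAG_HWPOISON);
}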
> > > > > What happens if it is the last page in the mapping that is poisoned,
> > > > > and we truncate that away? Shouldn't that clear the poisoned bit?
> > > >
> > > > When we handle the hwpoisoned inode, the error page should already
> > > > be removed from pagecache, so the remaining pages are not related
> > > > to the error and we need not care about them when we consider bit
> > > > clearing.
> > >
> > > Sorry, I don't follow. What removed the page from the page cache?
> > > The truncate_page_cache() call that follows the above code hunk is
> > > what does that, so I don't see how it can already be removed from
> > > the page cache....
> >
> > Memory error handler (memory_failure() in mm/memory-failure.c) has
> > removed the error page from the page cache.
> > And truncate_page_cache() that follows this hunk removes all pages
> > belonging to the page cache of the poisoned file (the error page
> > itself is not among them.)
> >
> > Let me explain more to clarify my whole scenario. If a memory error
> > hits a dirty pagecache page, the kernel works like below:
> >
> > 1. handles an MCE interrupt (logging MCE events,)
> > 2. calls the memory error handler (doing 3 to 6,)
> > 3. sets the PageHWPoison flag on the error page,
> > 4. unmaps all mappings to processes' virtual addresses,
>
> So nothing in userspace sees the bad page after this.
>
> > 5. sets AS_HWPOISON on mappings to which the error page belongs,
> > 6. invalidates the error page (unlinks it from the LRU list and
> >    removes it from pagecache,)
> > (memory error handler finished)
>
> Ok, so the moment a memory error is handled, the page has been
> removed from the inode's mapping, and it will never be seen by
> applications again. It's a transient error....
>
> > 7. later accesses to the file return -EIO,
> > 8. AS_HWPOISON is cleared when the file is removed or completely
> >    truncated.
>
> .... so why do we have to keep an EIO on the inode forever?
>
> If the page is not dirty, then just tossing it from the cache (as
> is already done) and rereading it from disk next time it is accessed
> removes the need for any error to be reported at all. It's
> effectively a transient error at this point, and as such no errors
> should be visible from userspace.
>
> If the page is dirty, then it needs to be treated just like any
> other failed page write - the page is invalidated and the address
> space is marked with AS_EIO, and that is reported to the next
> operation that waits on IO on that file (i.e. fsync).
>
> If you have a second application that reads the files and depends
> on a guarantee of good data, then the first step in that process is
> that the application that writes them needs to use fsync to check the
> data was written correctly. That ensures that you only have clean
> pages in the cache before the writer closes the file, and any h/w
> error then devolves to the above transient clean page invalidation
> case.

Thank you for the detailed explanation. And yes, I understand that is
the ideal, but many applications choose not to do that for performance
reasons. So I think it's helpful if we can reliably report the error to
such applications.
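(Just to be explicit about what we are asking of applications, the
write-side check being discussed is the usual pattern below; this is an
illustrative userspace snippet only, not part of the patch. Applications
that skip the fsync()/close() checks are exactly the ones that never see
the EIO once the mapping's error is dropped.)

/* Illustrative only: the fsync-based integrity check described above. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void write_checked(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0 || write(fd, buf, len) != (ssize_t)len) {
		perror("open/write");
		exit(EXIT_FAILURE);
	}
	/*
	 * fsync() waits for writeback of the dirty pages, so a writeback
	 * failure (AS_EIO, including the dirty-page hwpoison case above)
	 * is reported here as -EIO before the writer declares the data good.
	 */
	if (fsync(fd) < 0 || close(fd) < 0) {
		perror("fsync/close");
		exit(EXIT_FAILURE);
	}
}

int main(void)
{
	static const char msg[] = "data a second reader depends on\n";

	write_checked("/tmp/example-data", msg, sizeof(msg) - 1);
	return 0;
}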
> Hence I fail to see why this type of IO error needs to be sticky.
> The error on the mapping is transient - it is gone as soon as the
> page is removed from the mapping. Hence the error can be dropped as
> soon as it is reported to userspace because the mapping is now error
> free.

It's error free only for the applications which do an fsync check on
each write, but not for the applications which don't. I think the
penalty for the latter (silently losing dirty data and getting wrong
results) is too big to consider this a reasonable trade-off.

> > You may think it strange that the condition of clearing AS_HWPOISON
> > is checked with file granularity.
>
> I don't think it is strange, I think it is *wrong*.

OK, we can move to the new tag approach, and we can avoid this.

> > This is because currently userspace
> > applications know the memory errors only with file granularity for
> > simplicity, when they access via read(), write() and/or fsync().
>
> Trying to report this error to every potential future access,
> regardless of whether the error still exists or the access is to the
> poisoned range, with magical sticky errors on inode address spaces
> and hacks to the memory reclaim subsystems smells to me like a really
> bad hack to work around applications that use bad data integrity
> practices.
>
> As such, I think you probably need to rethink the approach you are
> taking to handling this error. The error is transient w.r.t. the
> mapping and page cache, and needs to be addressed consistently
> compared to other transient IO errors that are reported through the
> mapping....

Agreed. We can handle this error without a controversial flag on the
mapping by using the new pagecache tag approach. I'll try that.

Thanks,
Naoya