On Thu, Sep 15, 2011 at 10:47:48AM -0500, Shawn Bohrer wrote:
> Thanks Christoph,
>
> On Thu, Sep 15, 2011 at 10:55:57AM -0400, Christoph Hellwig wrote:
> > On Thu, Sep 15, 2011 at 09:47:55AM -0500, Shawn Bohrer wrote:
> > > I've got a workload that is latency sensitive that writes data to a
> > > memory mapped file on XFS.  With the 3.0 kernel I'm seeing stalls of
> > > up to 100ms that occur during writeback that we did not see with older
> > > kernels.  I've traced the stalls and it looks like they are blocking
> > > on wait_on_page_writeback() introduced in
> > > d76ee18a8551e33ad7dbd55cac38bc7b094f3abb "fs: block_page_mkwrite
> > > should wait for writeback to finish"
> > >
> > > Reading the commit description doesn't really explain to me why this
> > > change was needed.
> >
> > It is there to avoid pages being modified while they are under
> > writeback, which defeats various checksumming like DIF/DIX, the iscsi
> > CRCs, or even just the RAID parity calculations.  All of these either
> > failed before, or had to work around it by copying all data that was
> > written.
>
> I'm assuming you mean software RAID here?  We do have a hardware RAID

Yes.

> controller.  Also for anything that was working around this issue
> before by copying the data, are those workarounds still in place?

I suspect iscsi and md-raid5 are still making shadow copies of data blocks
before writing them out.  However, there was no previous workaround for
DIF/DIX errors -- this ("*_page_mkwrite should wait...") patch series _is_
the fix for DIF/DIX.

I recall that we rejected the shadow buffer approach for DIF/DIX because
allocating new pages is expensive if we do it for each disk write in
anticipation of future page writes...

> > If you don't use any of these you can remove the call and things
> > will work like they did before.
>
> I may do this for now.
>
> In the longer term is there any chance this could be made better?  I'm
> not an expert here so my suggestions may be naive.
> Could a mechanism
> be made to check if the page needs to be checksummed and only block in

...however, one could replace that wait_on_page_writeback with some sort of
call that would duplicate the page, update each program's page table to
point to the new page, and then somehow reap the page that's under IO when
the IO completes.  That might also be complicated to implement, I don't
know.  If there aren't any free pages, then this scheme (and the one I
mentioned in the previous paragraph) will block a thread while the system
tries to reclaim some pages.

I think we also talked about a block device flag to signal that the device
requires stable page writes, which would let us turn off the waits on
devices that don't care.  That at least could defer this discussion until
you encounter one of these devices that wants stable page writes.

I'm curious, is this program writing to the mmap region while another
program is trying to fsync/fdatasync/sync dirty pages to disk?  Is that how
you noticed the jittery latency?  We'd figured that not many programs would
notice the latency unless there was something that was causing a lot of
dirty page writes concurrent to something else dirtying a lot of pages.
Clearly we failed in your case.  Sorry. :/

That said, imagine if we revert to the pre-3.0 mechanism (or add that flag):
if we start transferring page A to the disk for writing and your program
comes in and changes A to A' before that transfer completes, then the disk
will see a data blob that is partly A and partly A', and the proportions of
A/A' are ill-defined.  I agree that ~100ms latency is not good, however. :(
What are your program's mmap write latency requirements?

> that case?  Or perhaps some mount option, madvise() flag or other hint
> from user-mode to disable this, or hint that I'm going to be touching
> that page again soon?
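
To make the A/A' hazard above concrete, here is a toy userspace model in
Python.  It is only a sketch and not kernel code: write_page(), scribble(),
the 4096-byte page size, the split of the transfer into two halves, and the
use of CRC32 as a stand-in for a DIF/DIX-style guard tag are all assumptions
made for illustration.  It shows a checksum taken at submit time failing when
the page is dirtied mid-transfer, and passing when the I/O path works from a
private snapshot (the shadow-copy idea mentioned earlier).

```python
import zlib

PAGE_SIZE = 4096  # assumed page size, for illustration only


def write_page(page, modify_during_io=None, stable=False):
    """Simulate one page writeback.

    Returns True if the checksum verified on the "device" side matches
    the checksum computed when the I/O was submitted.
    """
    # With stable=True the I/O path reads from a private snapshot, so
    # later changes to the live page cannot leak into the transfer.
    source = bytes(page) if stable else page

    # Guard checksum computed at submit time (stand-in for a guard tag).
    guard = zlib.crc32(source)

    half = PAGE_SIZE // 2
    # First half of the page goes out to the device.
    received = bytearray(source[:half])

    # The application dirties the page while the transfer is in flight.
    if modify_during_io is not None:
        modify_during_io(page)

    # Second half goes out; without a snapshot it may now hold A' data.
    received += source[half:]

    # Device-side verification of the data it actually received.
    return zlib.crc32(received) == guard


def scribble(page):
    """Hypothetical application write: flip the last byte of the page."""
    page[PAGE_SIZE - 1] ^= 0xFF


if __name__ == "__main__":
    for stable in (False, True):
        page = bytearray(b"A" * PAGE_SIZE)
        ok = write_page(page, modify_during_io=scribble, stable=stable)
        print("stable=%s checksum_ok=%s" % (stable, ok))
```

In this model the unprotected write sends a blob that is half A and half A',
so verification fails; the snapshot variant verifies but pays for a copy on
every write, which is the cost objection raised above.  Blocking the writer
until writeback finishes avoids the copy at the price of the stall.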
--D