On Fri 28-10-11 16:37:03, Andy Lutomirski wrote:
> On Tue, Oct 25, 2011 at 5:26 AM, Jan Kara <jack@xxxxxxx> wrote:
> >> - Why are we calling file_update_time at all? Presumably we also
> >> update the time when the page is written back (if not, that sounds
> >> like a bug, since the contents may be changed after something saw the
> >> mtime update), and, if so, why bother updating it on the first write?
> >> Anything that relies on this behavior is, I think, unreliable, because
> >> the page could be made writable arbitrarily early by another program
> >> that changes nothing.
> > We don't update the timestamp when the page is written back. I believe
> > this is mostly because we don't know whether the data has been changed
> > by a write syscall, which already updated the timestamp, or by mmap.
> > That is also the reason why we update the timestamp at page fault time.
> >
> > The reason why file_update_time() blocks for you is probably that it
> > needs access to the buffer where the inode is stored on disk, and
> > because a transaction including this buffer is committing at the
> > moment, your thread has to wait until the transaction commit finishes.
> > This is mostly a problem specific to how ext4 works, so e.g. xfs
> > shouldn't have it.
> >
> > Generally I believe attempts to achieve RT-like latencies when writing
> > to a filesystem are rather hopeless. How hopeless depends on the load
> > on the filesystem (e.g., in your case of a mostly idle filesystem I
> > can imagine some tweaks could reduce your latencies to an acceptable
> > level, but once the disk gets loaded you'll be screwed). So I'd suggest
> > that having the RT thread just store the log in memory (or write to a
> > pipe) and having another non-RT thread write the data to disk would be
> > a much more robust design.
> Windows seems to do pretty well at this, and I think it should be fixable
> on Linux too. "All" that needs to be done is to remove the pte_wrprotect
> from page_mkclean_one. The fallout from that might be unpleasant, though
> it would probably speed up a number of workloads.
Well, Linux's mm pretty much depends on the pte_wrprotect(), so that's
unlikely to go away in the foreseeable future. The reason is that we need
to reliably account the number of dirty pages so that we can throttle
processes that dirty too much memory, and also protect against the system
running into out-of-memory situations when too many pages are dirty (and
thus hard to reclaim). Thus we create clean pages write-protected; when
they are first written to, we account them as dirtied and unprotect them.
When pages are cleaned by writeback, we decrement the number of dirty
pages accordingly and write-protect them again (a small userspace sketch
of the effect visible from your side is at the end of this mail).

> Adding a whole separate process just to copy data from memory to disk
> sounds a bit like a hack -- that's what mmap + mlock would do if it
> worked better.
Well, mlock only guarantees you cannot hit a major fault when accessing
the page, and we keep that promise - you only hit a minor fault. But I
agree that for your use case this is impractical. I can see it as
theoretically feasible for writeback to skip mlocked pages, which would
help your case, but practically I do not see how to implement that
efficiently (just skipping a dirty page when we find it's mlocked seems
like a way to waste CPU needlessly).

> Incidentally, pipes are no good. I haven't root-caused it yet, but both
> reading from and writing to pipes, even with O_NONBLOCK, can block.
Interesting. I imagine they could block on memory allocation, but I guess
you don't put that much pressure on your system. So it might be
interesting to know where else they block...
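
If it helps, here is a minimal sketch (just an illustration I put
together, not your test case) that marks both ends of a pipe O_NONBLOCK
and then fills the pipe buffer; once the buffer is full, write(2) should
fail with EAGAIN rather than block, so any blocking you observe would
have to come from elsewhere:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char buf[4096];
        ssize_t n;
        int fds[2];

        if (pipe(fds) < 0)
                return 1;
        /* Mark both ends non-blocking. */
        fcntl(fds[0], F_SETFL, fcntl(fds[0], F_GETFL) | O_NONBLOCK);
        fcntl(fds[1], F_SETFL, fcntl(fds[1], F_GETFL) | O_NONBLOCK);

        memset(buf, 'x', sizeof(buf));
        /* Fill the pipe buffer; the final write must not block. */
        while ((n = write(fds[1], buf, sizeof(buf))) > 0)
                ;
        if (n < 0 && errno == EAGAIN)
                printf("pipe full, write returned EAGAIN as expected\n");
        else
                perror("write");
        return 0;
}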
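
And coming back to the write-protect / mtime discussion above, here is
the sketch I promised of the effect visible from userspace (the file name
and sizes are arbitrary): the first store into a shared mapping takes the
write-protect fault, and that is where the kernel accounts the page dirty
and calls file_update_time() - not at writeback time:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        struct stat st;
        time_t before;
        char *p;
        int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return 1;
        if (ftruncate(fd, 4096) < 0)
                return 1;
        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        fstat(fd, &st);
        before = st.st_mtime;
        sleep(1);              /* make an mtime change observable */

        p[0] = 1;              /* first store: write-protect fault; the
                                  kernel dirties the page and updates the
                                  timestamp here */
        fstat(fd, &st);
        printf("mtime %s at fault time\n",
               st.st_mtime != before ? "updated" : "unchanged");

        munmap(p, 4096);
        close(fd);
        return 0;
}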
								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR