On Fri, Apr 01, 2011 at 09:14:25AM +1100, Dave Chinner wrote:

[..]
> > An fsync has two basic parts
> >
> > 1) write the file data pages
> > 2a) flush data=ordered in reiserfs/ext34
> > 2b) do the real transaction commit
> >
> >
> > We can do part one in parallel across any number of writers. For part
> > two, there is only one running transaction. If the FS is smart, the
> > commit will only force down the transaction that last modified the
> > file. 50 procs running fsync may only need to trigger one commit.
>
> Right. However the real issue here, I think, is that the IO comes
> from a thread not associated with writeback nor is in any way cgroup
> aware. IOWs, getting the right context to each block being written
> back will be complex and filesystem specific.
>
> The other thing that concerns me is how metadata IO is accounted and
> throttled. Doing stuff like creating lots of small files will
> generate as much or more metadata IO than data IO, and none of that
> will be associated with a cgroup. Indeed, in XFS metadata doesn't
> even use the pagecache anymore, and it's written back by a thread
> (soon to be a workqueue) deep inside XFS's journalling subsystem, so
> it's pretty much impossible to associate that IO with any specific
> cgroup.
>
> What happens to that IO? Blocking it arbitrarily can have the same
> effect as blocking transaction completion - it can cause the
> filesystem to completely stop....

Dave,

As of today, the cgroup/context of an IO is decided from the context of
the thread that submits it. So any IO submitted by kernel threads
(flusher, kjournald, workqueue threads) goes to the root group, which
should remain unthrottled. (It is not a good idea to put throttling
rules on the root group.)

Any metadata operation happening in the context of a process will
still be subject to throttling (is there any?). If that's a concern, can
the filesystem mark those bios (REQ_META?) so that the throttling logic
can let them pass through?
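To make the rule above concrete, here is a rough toy model in Python. It
is not kernel code. Task, Bio, io_group() and should_throttle() are all
hypothetical names made up for illustration, and the "meta" field only
stands in for a REQ_META-style marking that the filesystem would set. It
shows the group being taken from the submitting thread, kernel threads
landing in the root group, and marked metadata bypassing throttling.

```python
from dataclasses import dataclass
from typing import Set

ROOT_GROUP = "/"  # root cgroup, assumed to carry no throttling rules


@dataclass
class Task:
    """Hypothetical stand-in for the thread that submits the IO."""
    name: str
    cgroup: str
    is_kernel_thread: bool = False


@dataclass
class Bio:
    """Hypothetical stand-in for a bio; 'meta' models a REQ_META-style flag."""
    submitter: Task
    meta: bool = False


def io_group(bio: Bio) -> str:
    """Return the group charged for the IO, based on the submitting thread."""
    if bio.submitter.is_kernel_thread:
        # flusher, kjournald, workqueue threads: accounted to the root group
        return ROOT_GROUP
    return bio.submitter.cgroup


def should_throttle(bio: Bio, limited_groups: Set[str]) -> bool:
    """Throttle only non-metadata IO whose group has limits configured."""
    if bio.meta:
        # Proposed behaviour: metadata marked by the filesystem passes through
        return False
    return io_group(bio) in limited_groups
```

With limits set only on a group such as "/test1", a bio submitted by a
flusher thread is never throttled, while a plain data bio from a process
in "/test1" is, unless it carries the metadata marking.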

Determining the cgroup/context from the submitting process has the
problem that writeback IO is not throttled, and we are looking for a way
to control buffered writes as well. If we start determining the cgroup
from information stored in page_cgroup, then we are more likely to run
into priority inversion issues (a filesystem in ordered mode flushing
data before committing metadata changes). So should we throttle buffered
writes when the page cache is being dirtied, and not when these writes
are being written back to the device?

Thanks
Vivek