Excerpts from Dave Chinner's message of 2011-03-31 18:14:25 -0400:
> On Thu, Mar 31, 2011 at 10:34:03AM -0400, Chris Mason wrote:
> > Excerpts from Vivek Goyal's message of 2011-03-31 10:16:37 -0400:
> > > On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote:
> > >
> > > [..]
> > > > > It should not happen that flusher
> > > > > thread gets blocked somewhere (trying to get request descriptors on
> > > > > request queue)
> > > >
> > > > A major design principle of the bdi-flusher threads is that they
> > > > are supposed to block when the request queue gets full - that's how
> > > > we got rid of all the congestion garbage from the writeback
> > > > stack.
> > >
> > > Instead of blocking flusher threads, can they voluntarily stop submitting
> > > more IO when they realize too much IO is already in progress?  We already
> > > keep stats of how much IO is under writeback on the bdi (BDI_WRITEBACK),
> > > and the flusher thread could use that.
> >
> > We could, but the difficult part is keeping the hardware saturated as
> > requests complete.  The voluntarily stopping part is pretty much the
> > same thing the congestion code was trying to do.
>
> And it was the bit that was causing most problems.  IMO, we don't want to
> go back to that single threaded mechanism, especially as we have
> no shortage of cores and threads available...

Getting rid of the congestion code was my favorite part of the per-bdi
work.

>
> > > > There are plans to move the bdi-flusher threads to work queues, and
> > > > once that is done all your concerns about blocking and parallelism
> > > > are pretty much gone because it's trivial to have multiple writeback
> > > > works in progress at once on the same bdi with that infrastructure.
> > >
> > > Will this essentially not nullify the advantage of IO-less throttling?
> > > I thought that we did not want to have multiple threads doing writeback
> > > at the same time, to reduce seeks and achieve better throughput.
> >
> > Work queues alone are probably not appropriate, at least for spinning
> > storage.  It will introduce seeks into what would have been
> > sequential writes.  I had to make the btrfs worker thread pools after
> > having a lot of trouble cramming writeback into work queues.
>
> That was before the cmwq infrastructure, right?  cmwq changes the
> behaviour of workqueues in such a way that they can simply be
> thought of as having a thread pool of a specific size....
>
> As a strict translation of the existing one flusher thread per bdi,
> then only allowing one work at a time to be issued (i.e. workqueue
> concurrency of 1) would give the same behaviour without having all
> the thread management issues.  i.e. regardless of the writeback
> parallelism mechanism we have the same issue of managing writeback
> to minimise seeking.  cmwq just makes the implementation far simpler,
> IMO.
>
> As to whether that causes seeks or not, that depends on how we are
> driving the concurrent works/threads.  If we drive a concurrent work
> per dirty cgroup that needs writing back, then we achieve the
> concurrency needed to make the IO scheduler appropriately throttle
> the IO.  For the case of no cgroups, we still only have a single
> writeback work in progress at a time and behaviour is no different
> to the current setup.  Hence I don't see any particular problem with
> using workqueues to achieve the necessary writeback parallelism that
> cgroup aware throttling requires....
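
(If I'm reading the cmwq code right, the knob you're describing is
basically the max_active argument to alloc_workqueue().  Here's a rough,
untested sketch of the strict one-work-at-a-time translation; the names
are made up for illustration and this is not the real flusher code:)

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

/* made-up names, illustration only */
static struct workqueue_struct *bdi_wb_wq;
static struct work_struct bdi_wb_work;

static void example_wb_workfn(struct work_struct *work)
{
	/* push dirty pages for one bdi (or, later, one dirty cgroup) */
}

static int __init example_wb_init(void)
{
	/*
	 * max_active == 1: only one writeback work runs at a time,
	 * matching today's single flusher thread per bdi.  Raising
	 * max_active is what would let a work per dirty cgroup run
	 * concurrently.
	 */
	bdi_wb_wq = alloc_workqueue("bdi_wb", WQ_MEM_RECLAIM, 1);
	if (!bdi_wb_wq)
		return -ENOMEM;

	INIT_WORK(&bdi_wb_work, example_wb_workfn);
	queue_work(bdi_wb_wq, &bdi_wb_work);
	return 0;
}
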
Yes, as long as we aren't trying to shotgun-style spread the inodes
across a bunch of threads, it should work well enough.  The trick will
just be making sure we don't end up with a lot of inode interleaving in
the delalloc allocations.

>
> > > > > or it tries to dispatch too much IO from an inode which
> > > > > primarily contains pages from low prio cgroup and high prio cgroup
> > > > > task does not get enough pages dispatched to device hence not getting
> > > > > any prio over low prio group.
> > > >
> > > > That's a writeback scheduling issue independent of how we throttle,
> > > > and something we don't do at all right now.  Our only decision on
> > > > what to write back is based on how long ago the inode was dirtied.
> > > > You need to completely rework the dirty inode tracking if you want
> > > > to efficiently prioritise writeback between different groups.
> > > >
> > > > Given that filesystems don't all use the VFS dirty inode tracking
> > > > infrastructure and specific filesystems have different ideas of the
> > > > order of writeback, you've got a really difficult problem there.
> > > > e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity
> > > > purposes which will completely screw any sort of prioritised
> > > > writeback.  Remember the ext3 "fsync = global sync" latency problems?
> > >
> > > Ok, so if one issues an fsync when the filesystem is mounted in
> > > "data=ordered" mode, we will flush all the writes to disk before
> > > committing metadata.
> > >
> > > I have no knowledge of filesystem code, so here comes a stupid question.
> > > Do multiple fsyncs get completely serialized, or can they progress in
> > > parallel?  IOW, if an fsync is in progress and we slow down the writeback
> > > of that inode's pages, can other fsyncs still make progress without
> > > getting stuck behind the previous fsync?
> >
> > An fsync has two basic parts
> >
> > 1) write the file data pages
> > 2a) flush data=ordered in reiserfs/ext34
> > 2b) do the real transaction commit
> >
> >
> > We can do part one in parallel across any number of writers.  For part
> > two, there is only one running transaction.  If the FS is smart, the
> > commit will only force down the transaction that last modified the
> > file.  50 procs running fsync may only need to trigger one commit.
>
> Right.  However the real issue here, I think, is that the IO comes
> from a thread not associated with writeback nor is in any way cgroup
> aware.  IOWs, getting the right context to each block being written
> back will be complex and filesystem specific.

The ext3 style data=ordered requires that we give the same amount of
bandwidth to all of the data=ordered IO during the commit.  Otherwise we
end up making the commit wait for some poor page in the data=ordered
list, and that slows everyone down.  ick.

>
> The other thing that concerns me is how metadata IO is accounted and
> throttled.  Doing stuff like creating lots of small files will
> generate as much or more metadata IO than data IO, and none of that
> will be associated with a cgroup.  Indeed, in XFS metadata doesn't
> even use the pagecache anymore, and it's written back by a thread
> (soon to be a workqueue) deep inside XFS's journalling subsystem, so
> it's pretty much impossible to associate that IO with any specific
> cgroup.
>
> What happens to that IO?  Blocking it arbitrarily can have the same
> effect as blocking transaction completion - it can cause the
> filesystem to completely stop....

ick again, it's the same problem as the data=ordered stuff exactly.
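
(Just to spell out in code why that ordered flush hurts, here's a
hand-wavy pseudo-fsync showing the two parts I described above.  It is
not lifted from any real filesystem; example_commit_transaction() is a
stand-in for whatever the journal does, and datasync handling is
ignored:)

#include <linux/fs.h>

/* stand-in for the filesystem's journal commit path, not a real API */
static int example_commit_transaction(struct super_block *sb)
{
	return 0;
}

static int example_fsync(struct file *file, int datasync)
{
	struct inode *inode = file->f_mapping->host;
	int ret;

	/*
	 * Part 1: write and wait on the file's dirty data pages.
	 * Any number of fsyncs can be in this stage in parallel.
	 */
	ret = filemap_write_and_wait(file->f_mapping);
	if (ret)
		return ret;

	/*
	 * Part 2: commit the transaction that last touched the inode.
	 * There is only one running transaction, so 50 procs doing
	 * fsync may all ride on a single commit.  In data=ordered,
	 * that commit first has to flush every ordered data page,
	 * which is where one throttled page can stall everyone.
	 */
	return example_commit_transaction(inode->i_sb);
}
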
-chris