On Tue 19-04-11 10:30:22, Vivek Goyal wrote:
> On Tue, Apr 19, 2011 at 10:33:39AM +1000, Dave Chinner wrote:
> > If you want to throttle journal operations, then we probably need to
> > throttle metadata operations that commit to the journal, not the
> > journal IO itself. The journal is a shared global resource that all
> > cgroups use, so throttling journal IO inappropriately will affect
> > the performance of all cgroups, not just the one that is "hogging"
> > it.
>
> Agreed.
>
> >
> > In XFS, you could probably do this at the transaction reservation
> > stage where log space is reserved. We know everything about the
> > transaction at this point in time, and we throttle here already when
> > the journal is full. Adding cgroup transaction limits to this point
> > would be the place to do it, but the control parameter for it would
> > be very XFS specific (i.e. number of transactions/s). Concurrency is
> > not an issue - the XFS transaction subsystem is only limited in
> > concurrency by the space available in the journal for reservations
> > (hundred to thousands of concurrent transactions).
>
> Instead of transaction per second, can we implement some kind of upper
> limit of pending transactions per cgroup. And that limit does not have
> to be user tunable to begin with. The effective transactions/sec rate
> will automatically be determined by IO throttling rate of the cgroup
> at the end nodes.
>
> I think effectively what we need is that the notion of parallel
> transactions so that transactions of one cgroup can make progress
> independent of transactions of other cgroup. So if a process does
> an fsync and it is throttled then it should block transaction of
> only that cgroup and not other cgroups.
>
> You mentioned that concurrency is not an issue in XFS and hundreds of
> thousands of concurrent transactions can progress depending on log space
> available. If that's the case, I think to begin with we might not have
> to do anything at all.
> Processes can still get blocked but as long as
> we have enough log space, this might not be a frequent event. I will
> do some testing with XFS and see can I livelock the system with very
> low IO limits.
>
> >
> > FWIW, this would even allow per-bdi-flusher thread transaction
> > throttling parameters to be set, so writeback triggered metadata IO
> > could possibly be limited as well.
>
> How does writeback trigger metadata IO?
  Because by writing data, you may need to do block allocation or mark
blocks as written on disk, or similar changes to metadata...

> In the first step I was looking to not throttle meta data IO as that
> will require even more changes in file system layer. I was thinking
> that if we provide throttling only for data and do changes in filesystems
> so that concurrent transactions can exist and make progress and file
> system IO does not serialize behind slow throttled cgroup.
  Yes, I think not throttling metadata is a good start.

> This leads to weaker isolation but at least we don't run into livelocking
> or filesystem scalability issues. Once that's resolved, we can handle the
> case of throttling meta data IO also.
>
> In fact if metadata is dependent on data (in ordered mode) and if we are
> throttling data, then we automatically throttle metadata for select cases.
>
> >
> > I'm not sure whether this is possible with other filesystems, and
> > ext3/4 would still have the issue of ordered writeback causing much
> > more writeback than expected at times (e.g. fsync), but I suspect
> > there is nothing that can really be done about this.
>
> Can't this be modified so that multiple per cgroup transactions can make
> progress. So if one fsync is blocked, then processes in other cgroup
> should still be able to do IO using a separate transaction and be able
> to commit it.
  Not really. Ext3/4 always has a single running transaction and all
metadata updates from all threads are recorded in it.
When the transaction grows large/old enough, we commit it and start a new
transaction. The fact that there is always just one running transaction is
heavily used in the journaling code, so it would need a serious rewrite of
JBD2...

								Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
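
As a rough illustration of the single-running-transaction point above, here
is a toy user-space model in Python. It is not JBD2 code: ToyJournal,
add_update, fsync and the max_updates limit are hypothetical names invented
for this sketch, and a size limit alone stands in for the "large/old enough"
commit trigger. It only tries to show why an fsync from one cgroup ends up
committing updates that belong to other cgroups.

```python
class ToyJournal:
    """Toy model: exactly one running transaction shared by all cgroups."""

    def __init__(self, max_updates=4):
        self.max_updates = max_updates  # assumed stand-in for "large enough"
        self.running = []               # (cgroup, update) pairs, one shared transaction
        self.committed = []             # list of committed transactions

    def add_update(self, cgroup, update):
        # Every metadata update joins the single running transaction,
        # no matter which cgroup issued it.
        self.running.append((cgroup, update))
        if len(self.running) >= self.max_updates:
            self.commit()

    def commit(self):
        # Commit the running transaction as a whole and start a new one.
        if self.running:
            self.committed.append(self.running)
            self.running = []

    def fsync(self, cgroup):
        # If the caller has updates in the running transaction, the whole
        # transaction must be committed. Return every cgroup whose updates
        # were committed as a side effect.
        if not any(cg == cgroup for cg, _ in self.running):
            return set()
        affected = {cg for cg, _ in self.running}
        self.commit()
        return affected


journal = ToyJournal(max_updates=10)
journal.add_update("A", "allocate block")
journal.add_update("B", "update inode")
print(sorted(journal.fsync("A")))  # prints ['A', 'B']
```

In this model, throttling the commit triggered by cgroup A's fsync would
also delay cgroup B's updates, because both sit in the same transaction.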