Peter Staubach wrote:
> Trond Myklebust wrote:
>> On Tue, 2009-06-02 at 14:37 -0400, Peter Staubach wrote:
>>> Trond Myklebust wrote:
>>>> So, how about doing this by modifying balance_dirty_pages() instead? Limiting pages on a per-inode basis isn't going to solve the common problem of 'ls -l' performance, where you have to stat a whole bunch of files, all of which may be dirty. To deal with that case, you really need an absolute limit on the number of dirty pages.
>>>>
>>>> Currently, we have only relative limits: a given bdi is allowed a maximum percentage value of the total write back cache size... We could add a 'max_pages' field, that specifies an absolute limit at which the vfs should start writeback.
>>>
>>> Interesting thought. From a high level, it sounds like a good strategy. The details start to get a little troubling to me though.
>>>
>>> First thing that strikes me is that this may result in suboptimal WRITE requests being issued over the wire. If the page quota is filled with many pages from one file and just a few from another due to timing, we may end up issuing small over the wire WRITE requests for the one file, even during normal operations.
>>
>> balance_dirty_pages() will currently call writeback_inodes() to actually flush out the pages. The latter will again check the super block dirty list to determine candidate files; it doesn't favour the particular file on which we called balance_dirty_pages_ratelimited().
>
> It doesn't favor any files. It runs on all of them. Without some more clever smarts, we end up with small over the wire writes, which are to be avoided if at all possible.
>
>> That said, balance_dirty_pages_ratelimited() does take the mapping as an argument. You could, therefore, in theory have it make decisions on a per-mapping basis.
>
> I will have to think about this more. Could you elaborate on what you were thinking that we might be able to do?
>
>>> We don't want to flush pages in the page cache until an entire wsize'd transfer can be constructed for the specific file. Thus, it seems to me that we still need to track the number of dirty pages per file.
>>>
>>> We also need to know that those pages are contiguous in the file. We can determine, heuristically, whether the pages are contiguous in the file or not by tracking the access pattern. For random access, we can assume that the pages are not contiguous and we can assume that they are contiguous for sequential access. This isn't perfect and can be fooled, but should hold for most applications which access files sequentially.
>>>
>>> Also, we don't want to proactively flush the cache if the application is doing random access. The application may come back to the page and we could get away with a single WRITE instead of multiple WRITE requests for the same page. With sequential access, we can generally know that it is safe to proactively flush pages because the application won't be accessing them again. Once again, this heuristic is not foolproof, but holds most of the time.
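(As an aside, to make the heuristic quoted above a little more concrete, the sketch below shows the sort of per-file bookkeeping being described. The structure, field, and function names are invented purely for illustration; this is a sketch of the idea, not the proposed patch.)

#include <linux/types.h>

/*
 * Hypothetical per-file write-pattern state.  None of these names
 * exist in the kernel; they only illustrate the heuristic described
 * above and are not the proposed patch.
 */
struct file_write_pattern {
	pgoff_t		next_index;	/* page index expected if writes are sequential */
	unsigned long	ndirty;		/* dirty pages attributed to this file */
	unsigned int	seq_hits;	/* writes that landed at next_index */
	unsigned int	rand_hits;	/* writes that landed anywhere else */
};

/* Record one newly dirtied page and classify the access pattern. */
static void track_dirty_page(struct file_write_pattern *wp, pgoff_t index)
{
	if (index == wp->next_index)
		wp->seq_hits++;
	else
		wp->rand_hits++;
	wp->next_index = index + 1;
	wp->ndirty++;
}

/*
 * Flush proactively only when the pattern looks sequential and at
 * least a full wsize worth of pages has built up, so that the
 * resulting over-the-wire WRITEs are full sized.  Random writers are
 * left alone in case the application dirties the same pages again.
 */
static bool should_flush_file(const struct file_write_pattern *wp,
			      unsigned long pages_per_wsize)
{
	return wp->seq_hits > 4 * wp->rand_hits &&
	       wp->ndirty >= pages_per_wsize;
}

(The pages_per_wsize threshold is exactly the point made above: don't flush until a full wsize'd transfer can be built, so the over-the-wire WRITEs stay full sized.)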
>> I'm not sure I follow you here. Why is the random access case any different to the sequential access case? Random writes are obviously a pain to deal with since you cannot predict access patterns. However, AFAICS if we want to provide a faster generic stat(), then we need to deal with random writes too: a gigabyte of data will take even longer to flush out when it is in the form of non-contiguous writes.
>
> I think that access patterns are important because we can't solve the ls performance problem at the expense of ruining all other performance. During normal operations, i.e. without ls running in the directory, performance should be as close to what exists today as possible, or even better. I think that folks running a database in a file would probably not be happy with a tradeoff that makes ls on the database files run faster while making the applications which update the database run slower. We have been busy trying to convince people to run databases on top of file systems instead of raw partitions and this would hurt.
>
> It would be nice to provide a faster generic stat(). However, I don't easily see how to do this and it is not clear that we actually have to do this. We do need a faster stat() on files that are being sequentially written to. We have customer bugzillas and reports on this already. The people who tend to run applications which use random access on files tend to be those who care more about the performance of those applications than someone running ls.
>
>>> For the ls case, we really want to manage the page cache on a per-directory basis. I don't think that this is going to happen. The only directions to go from there are more coarse, per-bdi, or less coarse, per-file.
>>
>> Ugh. No...
>
> Ugh, indeed. :-)
>
>>> If we go the per-bdi approach, then we would need to stop all modifications to the page cache for that particular bdi during the duration of the ls processing. Otherwise, as we stat 1 file at a time, the other files still needing to be stat'd would just refill the page cache with dirty pages. We could solve this by setting the max_pages limit to be a reasonable number to flush per file, but then that would be too small a limit for the entire file system.
>>
>> True, but if you have applications writing to all the files in your directory, then 'ls -l' performance is likely to suck anyway. Even if you do have per-file limits, those write-backs to the other files will be competing for RPC slots with the write-backs from the file that is being stat()ed.
>
> There is going to be a cost to running ls in a directory which contains files which are being actively written to. I don't see how to avoid this, given the architecture of maintaining file times by the server and the semantics required. We can strive to limit the cost though. I think that we can limit the cost without affecting normal operations.
>
>>> So, I don't see how to get around managing the page cache on a per-file basis, at least to some extent, in order to manage the amount of dirty data that must be flushed.
>>>
>>> It does seem like the right way to do this is via a combination of per-bdi and per-file support, but I am not sure that we have the right information at the right levels to achieve this now.
>>>
>>> Thanx...
>>>
>>> ps
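(Again as an aside, combining the two levels might look roughly like the sketch below. The bdi-wide and per-file limit fields are hypothetical and do not correspond to anything in the current kernel; the real balance_dirty_pages() knows nothing about them.)

#include <linux/types.h>

/*
 * Illustrative only: these limit fields are made up and do not exist
 * in the current kernel.  This sketches the decision, not a patch to
 * balance_dirty_pages().
 */
struct dirty_limits {
	unsigned long bdi_max_pages;	/* absolute cap for the whole bdi, 0 = unlimited */
	unsigned long file_max_pages;	/* cap on any one file's dirty pages, 0 = unlimited */
};

/*
 * Return true when writeback should be started: either the bdi as a
 * whole has crossed its absolute limit, or a single file has
 * accumulated more dirty pages than it is allowed to hold.
 */
static bool over_dirty_limits(const struct dirty_limits *lim,
			      unsigned long bdi_dirty_pages,
			      unsigned long file_dirty_pages)
{
	if (lim->bdi_max_pages && bdi_dirty_pages > lim->bdi_max_pages)
		return true;
	if (lim->file_max_pages && file_dirty_pages > lim->file_max_pages)
		return true;
	return false;
}

(The per-file cap is what would keep a single file from soaking up the whole bdi quota and forcing small writes for everything else.)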
>> In the long run, I'd like to see us merge something like the fstatat() patches that were hacked together at the LSF'09 conference. If applications can actually tell the NFS client that they don't care about a/c/mtime accuracy, then we can avoid this whole flushing nonsense altogether. It would suffice to teach 'ls' to start using the AT_NO_TIMES flag that we defined...
>
> Is someone pursuing those patches which were hacked together at LSF'09? Or, is there a specification or some sort of design document for the work?
>
> This would help in some cases. Using ls without any arguments that require file times could certainly be made to run lickety split. However, we would still be stuck with the crowd that needed to know file sizes or file times for their own nefarious reasons. It would be easy to dismiss these folks, until one becomes a paying support customer with enough clout to demand that something be done. I, myself, even, fall into this situation often enough that I wish for the situation to be addressed.
>
> Thanx...
>
> ps

Hi. I still need to move this along. The proposed patch does make things better and, as far as I can tell from my monitoring, does not seem to make things worse from the user's perspective. It does restore some of the support and complexity that was previously removed, but I don't see how to solve the problem without it.

What do you think?

Thanx...

ps
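For reference, the userspace side of the LSF'09 fstatat() idea would presumably look something like the sketch below. AT_NO_TIMES is not a flag that exists in any released kernel (a current kernel would reject the unknown flag with EINVAL), so the definition and the call here are purely illustrative of the idea rather than working code.

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * AT_NO_TIMES is NOT a real flag; it was only a hack at LSF'09 and the
 * value here is made up.  A current kernel would reject it with EINVAL.
 */
#ifndef AT_NO_TIMES
#define AT_NO_TIMES	0x10000
#endif

int main(int argc, char **argv)
{
	struct stat st;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <path>\n", argv[0]);
		return 1;
	}

	/*
	 * The caller declares that it does not need accurate a/c/mtime,
	 * so the NFS client would not have to flush dirty pages before
	 * answering.
	 */
	if (fstatat(AT_FDCWD, argv[1], &st, AT_NO_TIMES) != 0) {
		perror("fstatat");
		return 1;
	}

	printf("%s: mode 0%o\n", argv[1], (unsigned int)st.st_mode);
	return 0;
}

An ls invocation that does not need times could pass the flag for every entry; ls -l would still need accurate sizes and times and would still pay the flushing cost discussed above.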