On 07/17/2012 11:11 PM, Miklos Szeredi wrote:
> Pavel Emelyanov <xemul@xxxxxxxxxxxxx> writes:

Miklos, sorry for the late response. Please find the answers inline.

>> On 07/13/2012 08:57 PM, Miklos Szeredi wrote:
>>> Pavel Emelyanov <xemul@xxxxxxxxxxxxx> writes:
>>>
>>>> Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
>>>> counter is high enough. This prevents us from having too many dirty
>>>> pages on fuse, thus giving the userspace part of it a chance to write
>>>> stuff properly.
>>>>
>>>> Note, that the existing balance logic is per-bdi, i.e. if the fuse
>>>> user task gets stuck in the function this means, that it either
>>>> writes to the mountpoint it serves (but it can deadlock even without
>>>> the writeback) or it is writing to some _other_ dirty bdi and in the
>>>> latter case someone else will free the memory for it.
>>>
>>> This is not just about deadlocking. Unprivileged fuse filesystems
>>> should not impact the operation of other filesystems. I.e. a fuse
>>> filesystem which is not making progress writing out pages shouldn't cause
>>> a write on an unrelated filesystem to block.
>>>
>>> I believe this patch breaks that promise.
>>
>> Hm... I believe it does not, and here is why.
>>
>> When a task writes to some bdi the balance_dirty_pages will evaluate the
>> amount of time to block this task for based on this bdi's dirty set counters.
>> The global stats are only used to a) check whether this decision should be
>> made at all
>
> Okay, maybe I'm blind but if this is true, then how is
> balance_dirty_pages() supposed to ensure that the per-bdi limit is not
> exceeded?

The balance_dirty_pages logic is _very_ roughly the following:

Let this_bdi be the bdi the current task is writing to
Let D be the total amount of dirty and writeback memory
      (and WRITEBACK_TEMP after this patch)
Let L be the limit of dirty memory (L = ram_size * ratio)
Let d be the amount of dirty and writeback memory on this_bdi
And let l be the limit of dirty memory on this_bdi

With that the balancer logic looks like

	while (1) {
		if (D < L)
			return;

		start_background_writeback(this_bdi);

		if (d < l)
			return;

		timeout = get_sleep_timeout(d, l, D, L);
		schedule_timeout(timeout);
	}

The d and l are calculated out of D and L using the proportion of
this_bdi IO completions to the global ones (with more complexity, but
still). Thus, since we throttle tasks looking at d and l only, we cannot
affect all the bdis in the system by live-locking a single one of them.

Accounting for WRITEBACK_TEMP is required since D should become high
when there are lots of pages in flight in FUSE. Otherwise
balance_dirty_pages will not limit a task writing to a fuse mount.

>> and b) evaluate the dirty "fraction" of a bdi. That said, even
>> if we stop the fuse daemon (I actually did this) other filesystems won't
>> lock. The global counter would be high, yes, but the dirty set fraction of
>> non-fuse bdi would be low thus allowing others to progress.
>
> That makes some sense, but it looks to me that FUSE, NFS and friends
> want a stricter dirty balancing logic that looks at the bdi thresholds
> even if the global limits are not exceeded.

Probably, but I did a very straightforward test: I just stopped the
fuse daemon and started writing to a fuse file. After some time the
writing task was locked in balance_dirty_pages, since the fuse daemon
didn't ack the writeback. At the same time I tried to write to other
bdis (disks and nfs) and none of them was locked, all the writes
succeeded. After I let the fuse daemon run again the fuse-writer
unlocked and went on writing.

Do you have some trickier scenario in mind?
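
To make the argument about D, L, d and l easier to poke at, here is a
minimal userspace sketch in C of one pass of the balancer loop described
in this mail. It is only a toy model under stated assumptions:
global_dirty() and writer_must_sleep() are made-up names, not kernel
functions, the page counts are chosen by the caller, and the way the
kernel derives d and l from IO completion proportions is not modelled at
all.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: the global counter D from the mail.
 * count_temp == true models the patched behaviour, where pages
 * accounted as writeback_temp (in flight in FUSE) are included.
 */
unsigned long global_dirty(unsigned long dirty, unsigned long writeback,
			   unsigned long writeback_temp, bool count_temp)
{
	unsigned long D = dirty + writeback;

	if (count_temp)
		D += writeback_temp;
	return D;
}

/*
 * Hypothetical helper: one pass of the loop from the mail.
 * Returns true when the writer would go to sleep, false when it is
 * allowed to continue. D and L are the global amount and limit,
 * d and l the amount and limit of the bdi being written to.
 */
bool writer_must_sleep(unsigned long D, unsigned long L,
		       unsigned long d, unsigned long l)
{
	if (D < L)
		return false;	/* global limit not reached, nobody waits */

	/* the real loop would start background writeback here */

	if (d < l)
		return false;	/* this bdi is below its own limit */

	return true;		/* this bdi is over its limit, throttle */
}
```

With invented numbers such as L = 1000, a stuck fuse bdi at d = 900 of
l = 400 and a disk bdi at d = 50 of l = 600: if 900 pages sit in
writeback_temp and are not counted, D = 150 stays below L and
writer_must_sleep() never throttles the fuse writer. Once they are
counted, D = 1050, the fuse writer sleeps and the disk writer still
passes the d < l check.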

> Thanks,
> Miklos

Thanks,
Pavel