Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)

On Thu, Mar 17, 2011 at 10:32 AM, Jan Kara <jack@xxxxxxx> wrote:
> On Thu 17-03-11 08:46:23, Curt Wohlgemuth wrote:
>> On Tue, Mar 8, 2011 at 2:31 PM, Jan Kara <jack@xxxxxxx> wrote:
>> The design of IO-less foreground throttling of writeback in the context of
>> memory cgroups is being discussed in the memcg patch threads (e.g.,
>> "[PATCH v6 0/9] memcg: per cgroup dirty page accounting"), but I've got
>> another concern as well.  And that's how restricting per-BDI writeback to a
>> single task will affect proposed changes for tracking and accounting of
>> buffered writes to the IO scheduler ("[RFC] [PATCH 0/6] Provide cgroup
>> isolation for buffered writes", https://lkml.org/lkml/2011/3/8/332 ).
>>
>> It seems totally reasonable that reducing competition for write requests to
>> a BDI -- by using the flusher thread to "handle" foreground writeout --
>> would increase throughput to that device.  At Google, we experimented with
>> this in a hacked-up fashion several months ago (FG task would enqueue a work
>> item and sleep for some period of time, wake up and see if it was below the
>> dirty limit; see the rough sketch below), and found that we were indeed
>> getting better throughput.
>>
>> But if one of the goals is to provide some sort of disk isolation based on
>> cgroup parameters, then having at most one stream of write requests
>> effectively neuters the IO scheduler.  We saw that in practice, which led to
>> abandoning our attempt at "IO-less throttling."
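
For reference, that hacked-up experiment amounted to roughly the loop below.
This is only an illustrative sketch: queue_flusher_work() and
over_bdi_dirty_limit() are made-up placeholder names, not the actual code we
ran.

/*
 * Sketch of the "IO-less" foreground throttling experiment described
 * above: the dirtying task never submits writeback itself, it just
 * kicks the per-BDI flusher thread and polls the dirty limit.  The
 * helpers below are hypothetical placeholders.
 */
static void fg_throttle_sketch(struct backing_dev_info *bdi)
{
	/* Ask the flusher thread to do the actual writeback for us. */
	queue_flusher_work(bdi);			/* hypothetical */

	/* Sleep and re-check instead of issuing IO from this task. */
	while (over_bdi_dirty_limit(bdi))		/* hypothetical */
		schedule_timeout_interruptible(HZ / 10);
}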

>  Let me check if I understand: The problem you have with one flusher
> thread is that when the written pages all belong to a single memcg, there
> is nothing the IO scheduler can prioritize, right?

Correct.  Well, perhaps.  Given that the memory cgroups and the IO
cgroups may not overlap, it's possible that write requests from a
single memcg might be targeted to multiple IO cgroups, in which case
scheduling priorities can still be maintained.  Of course, the reverse
is also possible: multiple memcgs might map onto a single IO cgroup.

The point is just that no matter how many memcgs the flusher thread is
working on behalf of, it produces only a single stream of requests,
which is *likely* attributed to a single IO cgroup, and hence there's
nothing to prioritize.
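
To make that concrete: buffered writeback is attributed to the task that
submits it, so everything the flusher thread sends down lands in the
flusher's own IO context.  Very roughly (alloc_writeback_bio() and the
attribution helpers are hypothetical, not actual block/CFQ code):

/*
 * Why a single flusher stream defeats per-cgroup IO scheduling: the
 * request is charged to 'current', i.e. the flusher thread, not to
 * the task (or memcg) that dirtied the page.
 */
static void flusher_submit_sketch(struct page *page)
{
	struct bio *bio = alloc_writeback_bio(page);	/* hypothetical */

	/*
	 * Attribution follows the submitting task, so with one flusher
	 * per BDI every buffered write ends up in the same scheduling
	 * group regardless of which memcg dirtied the page.
	 */
	bio_set_io_cgroup(bio, io_cgroup_of(current));	/* hypothetical */
	submit_bio(WRITE, bio);
}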

>> One possible solution would be to put some of the disk isolation smarts into
>> the writeback path, so the flusher thread could choose inodes with this as a
>> criteria, but this seems ugly on its face, and makes my head hurt.
>  Well, I think it could be implemented in a reasonable way but then you
> still miss reads and direct IO from the mix, so the isolation will be poor.

Um, not really, would it?  Presumably there are separate tasks
(directly) issuing simultaneous requests for reads and DIO writes;
these should interact just fine with writes from the single flusher
thread.

> But maybe we could propagate the information from the IO scheduler to the
> flusher thread? If the IO scheduler sees a memcg has run out of its limit,
> it could hint to the flusher thread that it should switch to an inode from
> a different memcg. But still, the details get nasty as I think about them
> (how to pick the next memcg, how to pick inodes, ...). Essentially, we'd
> have to do with flusher threads what old pdflush did when handling
> congested devices. Ugh.

Yeah, plus what I said above, that memcgs and IO cgroups aren't
necessarily the same cgroups.
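
If it helps the discussion, the feedback Jan describes might look something
like the stub below.  All of the names are hypothetical, and the hard parts
(picking the next memcg, picking inodes) are exactly what it glosses over:

/*
 * Hypothetical sketch of IO-scheduler -> flusher feedback: when a
 * memcg exhausts its IO share, mark it so the flusher skips its
 * inodes for a while.  None of these names exist in the kernel.
 */
static void iosched_memcg_over_limit(struct backing_dev_info *bdi,
				     struct mem_cgroup *memcg)
{
	struct memcg_wb_state *s = memcg_wb_state(bdi, memcg);	/* hypothetical */

	/* Hint to the flusher: pick inodes from a different memcg next. */
	set_bit(MEMCG_WB_SKIP, &s->flags);
	wake_up_flusher_thread(bdi);				/* hypothetical */
}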

>> Otherwise, I'm having trouble thinking of a way to do effective isolation in
>> the IO scheduler without having competing threads -- for different cgroups --
>> making write requests for buffered data.  Perhaps the best we could do would
>> be to enable IO-less throttling in writeback as a config option?

>  Well, nothing prevents us from choosing to do foreground writeback
> throttling for memcgs and IO-less throttling without them, but as
> Christoph writes, this doesn't seem very compelling either... I'll let
> this brew in my head for some time and maybe something will come of it.

I agree with Christoph too; I mainly wanted to get the issue out
there, and will be thinking on it more as well.

Thanks,
Curt
--

