Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)

On Thu, Mar 31, 2011 at 7:16 AM, Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
> On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote:
>
> [..]
>> > It should not happen that the flusher
>> > thread gets blocked somewhere (e.g. trying to get request descriptors on
>> > the request queue)
>>
>> A major design principle of the bdi-flusher threads is that they
>> are supposed to block when the request queue gets full - that's how
>> we got rid of all the congestion garbage from the writeback
>> stack.
>
> Instead of blocking flusher threads, can they voluntarily stop submitting
> more IO when they realize too much IO is in progress? We already keep
> stats of how much IO is under writeback on the bdi (BDI_WRITEBACK), and
> the flusher thread could use that.
>
> Jens mentioned the idea of getting rid of this request accounting at the
> request queue level and moving it somewhere up the stack, say to the bdi level.
>
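
To make the idea above concrete, here is a rough sketch (not existing kernel
code) of a flusher checking the bdi's BDI_WRITEBACK counter and backing off
instead of blocking; the threshold value and the helper name are made up for
illustration:

#include <linux/backing-dev.h>

/* Placeholder threshold: pages allowed under writeback per bdi. */
#define FLUSHER_WB_THRESHOLD	1024

/*
 * Hypothetical helper: returns true when this bdi already has "enough"
 * IO in flight, so the flusher should stop submitting more pages and
 * requeue its work instead of blocking on request descriptors.
 */
static bool flusher_should_back_off(struct backing_dev_info *bdi)
{
	return bdi_stat(bdi, BDI_WRITEBACK) > FLUSHER_WB_THRESHOLD;
}

The writeback loop could call something like this before each chunk of work
and requeue itself rather than going to sleep waiting for request descriptors.
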
>>
>> There are plans to move the bdi-flusher threads to work queues, and
>> once that is done all your concerns about blocking and parallelism
>> are pretty much gone because it's trivial to have multiple writeback
>> works in progress at once on the same bdi with that infrastructure.
>
> Won't this essentially nullify the advantage of IO-less throttling?
> I thought that we did not want to have multiple threads doing writeback
> at the same time, to reduce the number of seeks and achieve better throughput.
>
> Now with this, I am assuming that multiple writeback works can be in
> progress at once. Maybe we can limit writeback work to one per group, so that
> in the global context only one work will be active.
>
>>
>> > or it tries to dispatch too much IO from an inode which
>> > primarily contains pages from a low-prio cgroup, so the high-prio cgroup's
>> > task does not get enough pages dispatched to the device and hence does not
>> > get any priority over the low-prio group.
>>
>> That's a writeback scheduling issue independent of how we throttle,
>> and something we don't do at all right now. Our only decision on
>> what to write back is based on how long ago the inode was dirtied.
>> You need to completely rework the dirty inode tracking if you want
>> to efficiently prioritise writeback between different groups.
>>
>> Given that filesystems don't all use the VFS dirty inode tracking
>> infrastructure and specific filesystems have different ideas of the
>> order of writeback, you've got a really difficult problem there.
>> e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity
>> purposes which will completely screw any sort of prioritised
>> writeback. Remember the ext3 "fsync = global sync" latency problems?
>
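
For readers not familiar with the current code, here is a simplified
illustration (a sketch, not a copy of fs/fs-writeback.c) of that time-based
selection: inodes sit on the bdi's b_dirty list in roughly dirtied_when
order, and only those dirtied before a cutoff get queued for writeback.
Nothing in this path knows about cgroups:

#include <linux/fs.h>
#include <linux/backing-dev.h>
#include <linux/jiffies.h>

static void queue_expired_inodes(struct bdi_writeback *wb,
				 unsigned long older_than_this)
{
	while (!list_empty(&wb->b_dirty)) {
		/* oldest inodes sit at the tail of b_dirty */
		struct inode *inode = list_entry(wb->b_dirty.prev,
						 struct inode, i_wb_list);

		if (time_after(inode->dirtied_when, older_than_this))
			break;	/* everything remaining was dirtied later */
		list_move(&inode->i_wb_list, &wb->b_io);
	}
}
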
> Ok, so if one issues an fsync when the filesystem is mounted in "data=ordered"
> mode, we will flush all the data writes to disk before committing metadata.
>
> I have no knowledge of filesystem code, so here comes a stupid question.
> Do multiple fsyncs get completely serialized, or can they make progress in
> parallel? IOW, if an fsync is in progress and we slow down the writeback
> of that inode's pages, can other fsyncs still make progress without
> getting stuck behind the previous fsync?
>
> For me, knowing this is also important in another context, that of absolute IO
> throttling.
>
> - If an fsync is in progress and gets throttled at the device, what impact
>   does it have on other filesystem operations? What gets serialized behind it?
>
> [..]
>> > and also try to select inodes intelligently (cgroup aware manner).
>>
>> Such selection algorithms would need to be able to handle hundreds
>> of thousands of newly dirtied inodes per second so sorting and
>> selecting them efficiently will be a major issue...
>
> There was a proposal for the memory cgroup to maintain a per-memcg, per-bdi
> structure which would keep a list of inodes that need writeback
> from that cgroup.

FYI, I have patches which implement this per memcg per bdi dirty inode
list.  I want to debug a few issues before posting an RFC series.  But
it is getting close.

> So any cgroup looking for writeback would queue this structure on the
> bdi, and flusher threads could walk through this list and figure out
> which memory cgroups, and which inodes within each memory cgroup, need to
> be written back.

The way these memcg-writeback patches are currently implemented is
that when a memcg is over its background dirty limit, it is queued on a
global over_bg_limit list and the bdi flusher is woken up.  There
is no context (memcg or otherwise) given to the bdi flusher.  After
the bdi flusher checks system-wide background limits, it uses the
over_bg_limit list to find (and rotate) an over-limit memcg.  Using
that memcg, the per-memcg, per-bdi dirty inode list is then walked to
find inode pages to write back.  Once the memcg's dirty memory usage
drops below its threshold, the memcg is removed from the global
over_bg_limit list.
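
For illustration only, and not the actual patch series, the structures
involved might look roughly like this (the field and struct names here are
guesses; only over_bg_limit comes from the description above):

#include <linux/list.h>
#include <linux/backing-dev.h>

/* One of these per (memcg, bdi) pair: the memcg's dirty inodes on that bdi. */
struct memcg_bdi_dirty {
	struct backing_dev_info	*bdi;
	struct list_head	dirty_inodes;	/* inodes this memcg dirtied on this bdi */
	struct list_head	memcg_node;	/* entry in the memcg's per-bdi list */
};

/* Per-memcg writeback state. */
struct memcg_writeback {
	struct list_head	per_bdi_dirty;	/* list of memcg_bdi_dirty */
	struct list_head	over_bg_node;	/* entry on the global over_bg_limit list */
};

/*
 * Global list of memcgs over their background dirty limit; the bdi
 * flusher rotates through it when system-wide limits are not exceeded,
 * and a memcg is removed once its dirty usage falls back under its
 * threshold.
 */
static LIST_HEAD(over_bg_limit);

In this sketch the flusher would take the next memcg off over_bg_limit, find
the memcg_bdi_dirty entry matching the bdi being flushed, and write back
inodes from dirty_inodes until the memcg drops under its threshold.
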

> Thanks
> Vivek

