On Wed, Jun 29, 2011 at 11:00 AM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> On Wed, Jun 29, 2011 at 10:26:33AM -0700, Curt Wohlgemuth wrote:
>> The semantics did change between 2.6.34 and 2.6.35, though. When the
>> work item queue was introduced in 2.6.32, the semantics changed from
>> what you describe above to what was present through 2.6.34:
>> writeback_inodes_sb() would enqueue a work item and return. Your
>> commit 83ba7b07 ("writeback: simplify the write back thread queue")
>> added the wait_for_completion() call, putting the semantics back to
>> where they were pre-2.6.32.
>
> Yes. The kernels in between had those nasty writeback vs. umount races
> that we could trigger quite often.

My apologies for not being aware of these races (I tried to search the
mailing lists) -- how did they manifest themselves? I don't see anything
about them in your commit message for 83ba7b07. I do see that we no
longer take s_umount for sys_sync (only in sync_filesystem(), e.g., for
sys_syncfs)...

>> Though there is one additional change between the old way (pre-2.6.32)
>> and today: with the old kernel, the pdflush thread would operate
>> concurrently with the first (and second?) sync path through writeback.
>> Today, of course, they're *all* serialized. So a call to sys_sync()
>> will really enqueue 3 work items -- one from wakeup_flusher_threads(),
>> one from writeback_inodes_sb(), and one from sync_inodes_sb().
>
> Yes. And having WB_SYNC_NONE items from both wakeup_flusher_threads
> and writeback_inodes_sb is plain stupid. The initial conversion to the
> new sync_filesystem scheme had removed the wakeup_flusher_threads
> call, but that caused a huge regression in some benchmark.
>
> As mentioned before, Wu was working on some code to introduce tagging
> so that the WB_SYNC_ALL call won't start writing pages dirtied after
> the sync call, which should help with your issue. Although to
> completely solve it we really need to get down to just two passes.

I can definitely see how tagging pages at the start of sync would be
helpful, and going from three passes to two seems like a no-brainer.
But what's the real point of doing even two passes -- one SYNC_NONE
followed by one SYNC_ALL? Is it to try to get through as many inodes as
possible before we potentially lock up waiting on an inode on an
unavailable device?

Thanks,
Curt
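
For anyone following along, here is a minimal sketch of the sys_sync()
path under discussion, simplified from the 2.6.35-era fs/sync.c. The
function names match the kernel; the body is abbreviated from memory,
and quota sync, block device flushing, and laptop_mode handling are
omitted:

    /* Simplified sketch of the 2.6.35-era sys_sync() path. */
    SYSCALL_DEFINE0(sync)
    {
            /* Pass 1: kick every bdi flusher thread -- this enqueues a
             * WB_SYNC_NONE work item per backing device and returns
             * without waiting. */
            wakeup_flusher_threads(0);

            /* Pass 2: for each superblock, writeback_inodes_sb()
             * enqueues another WB_SYNC_NONE item and, since commit
             * 83ba7b07, waits for its completion before returning. */
            sync_filesystems(0);

            /* Pass 3: for each superblock, sync_inodes_sb() enqueues a
             * WB_SYNC_ALL item and waits until every inode has been
             * written out. */
            sync_filesystems(1);

            return 0;
    }

This is where the "three work items" count above comes from: passes 1
and 2 both queue WB_SYNC_NONE work, and pass 3 queues the WB_SYNC_ALL
work that actually guarantees completion.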
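And for the tagging Christoph mentions: if I'm reading the tree right,
Wu's idea landed in 2.6.38 as tag_pages_for_writeback() plus a new
PAGECACHE_TAG_TOWRITE radix-tree tag, used from write_cache_pages() in
mm/page-writeback.c. A rough sketch of the relevant logic (abbreviated;
the lookup loop and cyclic-writeback details are elided):

    /* Before a data-integrity pass starts, retag every page that is
     * currently DIRTY as TOWRITE, then have the pass visit only
     * TOWRITE pages. Pages dirtied after the sync began keep only the
     * DIRTY tag, so the WB_SYNC_ALL pass never picks them up and
     * cannot be livelocked by a concurrent dirtier. */
    int write_cache_pages(struct address_space *mapping,
                          struct writeback_control *wbc,
                          writepage_t writepage, void *data)
    {
            pgoff_t index = 0;      /* simplified: whole mapping */
            pgoff_t end = -1;
            int tag;

            if (wbc->sync_mode == WB_SYNC_ALL) {
                    tag_pages_for_writeback(mapping, index, end);
                    tag = PAGECACHE_TAG_TOWRITE;
            } else {
                    tag = PAGECACHE_TAG_DIRTY;
            }

            /* ... pagevec_lookup_tag(..., tag, ...) loop calls
             * writepage() on each page found ... */
            return 0;
    }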