More ext3 fileserver woes ...

neilb@cse.unsw.edu.au (Neil Brown) · Wed, 12 Jun 2002 09:42:31 +1000 (EST)

On Thursday June 6, akpm@zip.com.au wrote:
> Neil,
> 
> I think this is a better fix...

Thanks.  This does look better in that it is more localised and only
affects the observed problem.

Though I really liked the idea of refile_buffer calling
set_buffer_flushtime.  It is the best way to make sure the invariant
of "dirty list always sorted" is maintained.
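
To illustrate the invariant, here is a minimal user-space sketch in C.
It is not kernel code: struct buf, add_tail, add_sorted and
list_is_sorted are hypothetical names made up for this example, and a
plain singly linked list stands in for the real dirty list.  The
assumption is simply that whatever walks the list wants buffers in
ascending flushtime order, so a buffer appended at the tail with an
older flushtime ends up out of order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dirty buffer; not the kernel's buffer_head. */
struct buf {
    unsigned long flushtime;   /* time at which the buffer is due for writeout */
    struct buf *next;
};

/* Append at the tail without looking at flushtime.  If b carries an
 * older flushtime than the current tail, the list is no longer sorted. */
static void add_tail(struct buf **head, struct buf *b)
{
    struct buf **pp = head;

    assert(b != NULL);
    while (*pp)
        pp = &(*pp)->next;
    b->next = NULL;
    *pp = b;
}

/* Insert b so that the list stays in ascending flushtime order. */
static void add_sorted(struct buf **head, struct buf *b)
{
    struct buf **pp = head;

    assert(b != NULL);
    while (*pp && (*pp)->flushtime <= b->flushtime)
        pp = &(*pp)->next;
    b->next = *pp;
    *pp = b;
}

/* Return 1 if every buffer's flushtime is <= its successor's, else 0. */
static int list_is_sorted(const struct buf *head)
{
    for (; head && head->next; head = head->next)
        if (head->flushtime > head->next->flushtime)
            return 0;
    return 1;
}
```

Adding buffers with flushtimes 100, 300, 200 through add_sorted leaves
list_is_sorted returning 1, while the same sequence through add_tail
leaves it returning 0.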

This actually raises the question:  why doesn't ext3 call
mark_buffer_dirty in __journal_unfile_buffer?  That would seem to be
the "right" thing to do, and would avoid this whole problem.

... but on trying it, it doesn't actually work, at least not
completely:
Jun 12 09:18:44 elfman kernel: buffer on 0 has age 530
Jun 12 09:18:44 elfman kernel: buffer on 0 has age 546
Jun 12 09:18:44 elfman kernel: buffer on 0 has age 561

Buffers are still on the dirty list out of order.  This patch only
fixes __journal_refile_buffer and doesn't fix any calls to
__journal_unfile_buffer, which I think is the real culprit.

But to continue my story of woes.....

I left the kernel without this patch running over my extended weekend
with "sync" running every minute.
This worked OK until Tuesday afternoon (and I got back on
Wednesday...).
At different times on Tuesday afternoon, all three of my fileservers
locked up.
I don't have many details, just a "ps axgl" listing.  I suspect some
sort of deadlock happening between sync and kjournald...

I will soon be running with the first patch and no syncing, and will
see how that goes.

NeilBrown