Re: [PATCH 1/2 linux-next] Revert "ufs: fix deadlocks introduced by sb mutex merge"

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Sat, 20 Jun 2015 00:07:39 +0100

On Wed, Jun 17, 2015 at 09:31:16PM +0100, Al Viro wrote:
> On Wed, Jun 17, 2015 at 10:57:15AM +0200, Jan Kara wrote:
> > > Joy...  Folks, is anybody actively maintaining fs/ufs these days?
> > 
> > Looking into the changelog there wasn't anyone seriously looking into UFS
> > for at least 5-6 years... Fabian did some cleanups recently but they were
> > mostly cosmetic. So I don't think it's really maintained. OTOH clearly
> > there are some users since occasionally someone comes with a bug report.
> > Welcome to the world where we have ~50 on disk / network filesystems :-|
> 
> FWIW, I've a partial rewrite of that area (completely untested, etc.)
> in vfs.git#ufs; it's still in progress, but hopefully it'll be done
> by tomorrow.  Basically, it switches the sucker to locking scheme
> similar to ext2 (modulo seqlock instead of rwlock for protection of
> block pointers, having to be more careful due to 64bit assignments not
> being atomic and being unable to fall back to ->truncate_mutex in
> case of race on the read side) and starts cleaning the hell out of
> the truncate (and eventually create side of get_block as well).
> 
> As far as I can see, the root of the problems with that sucker is the
> misplaced handling of tail-packing logics.  That had ballooned into
> a lot of convoluted codepaths ;-/  I'm not blaming Evgeniy - it *is*
> painful to massage into saner shape and the damn thing kept missing
> cleanups that were done to similar filesystems due to that...
> 
> One thing I'm somewhat unhappy about in what Evgeniy had done is dealing with
> UFS vs. UFS2 differences at such a low level.  Sure, the way we did endianness
> handling in there provoked exactly that, but it might have been better to
> go the other way round; see e.g. fs/minix/itree*.c for opposite approach.
> 
> All this stuff is currently completely untested; I'll be using generic
> parts of xfstests for beating it up, but any assistance would be welcome.
> Note: all testing I'll be realistically able to do will be for FreeBSD
> UFS variants - I don't have Solaris left on any boxen, to start with...

FWIW, I've got to one bloody unpleasant part of UFS we'd never managed to
get right and it seems that I have a potential solution.

Problem: UFS has larger-than-page allocation units.  And when we allocate
a block in a hole, we *must* zero the whole thing out.  The trouble being,
we notice that inside writepage(), where we are already holding a lock on
our page and where we can't lock other pages without risking a deadlock.

What's more, as soon as we'd inserted a reference to that block into inode,
we are open to readpage on another page backed by the same block coming
and seeing a normal mapped area.  So UFS block allocator ends up zeroing
them out via _buffer_ cache, waiting for writes to finish before returning
the block to caller.  It doesn't wait for metadata blocks (they are accessed
via buffer cache), but for data we end up with zeroes being written to disk,
no matter what.

That works, but it obviously hurts the performance.  Badly.  Writes into
a hole are relatively rare, but the same thing hurts normal sequential
writes to the end of file, which is not rare at all.

How about the following approach:
	* Let allocation of data block at the EOF return the sucker without
writing zeroes out.
	* Let ->write_iter() check if it's about to create a hole at the
(current) EOF.  In such case start with writing zeroes (in normal fashion,
via ->begin_write()/memset()/->end_write() through the page cache) from the
EOF to block boundary (or beginning of the area we were going to write to,
whichever's closer).  Then proceed in normal fashion.  Incidentally, it will
deal with tail unpacking for free.
	* Do the same upon extending truncate.

Does anybody see holes in that?  readpage() crossing the EOF isn't a problem -
it will not even attempt to map past the current ->i_size and will zero the
page tail out just fine.  We get junk past the EOF on disk, but we zero it
out (or overwrite with our data) before increasing ->i_size.

Dealing with sparse files is also possible (taking allocations from
->writepage() to ->page_mkwrite(), where we are not holding a page lock
and thus can grab all affected pages), but it would be bigger, trickier
and benefit only a relatively rare case.

Comments?

PS: UFS tail unpacking is broken - in some cases it can be tricked into
trying to extend an indirect block, of all things, and that's not the
only fishy case in there.  I think I've fixed that crap in the local tree...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in