Re: UFS s_maxbytes bogosity

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Fri, 9 Jun 2017 18:34:38 +0100

On Fri, Jun 09, 2017 at 04:35:26AM +0100, Al Viro wrote:
> On Thu, Jun 08, 2017 at 05:11:39PM -0700, Richard Narron wrote:
> 
> > Test results don't look pretty on FreeBSD.  (I will also test OpenBSD and
> > NetBSD.)
> 
> OK, here's the cumulative diff so far - easy-to-backport parts only; that'll
> be split into 6 commits (plus whatever else gets added).   It really needs
> beating...

FWIW, so far it seems to survive xfstest generic/*, modulo simulated power
loss - I'm running it without -o sync and we don't have UFS2 journalling
support, so that's to be expected...  Tons of tests don't run due to lack
of various (mis)features, so it's not _that_ much, and there's nothing
that would try to deliberately hit UFS-specific interesting cases.
xattrs and acls can be supported reasonably easily, so can quota.
O_DIRECT is a real bitch for fragment reallocation handling - no idea
how painful that would be.

UFS2 journal support is probably a lot more massive work than I'm willing
to go into.

Another bug I see there is recovery after failing copy from userland in
write() on append-only file.  We have allocated blocks already, so we
might need to truncate the damn things.  However, ufs_truncate_blocks()
will see IS_APPEND(inode) and bail out, leaving garbage at the end of the
file.  Not that hard to fix - these checks are simply not needed in the
ufs_write_failed() case.
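
A toy model of that failure, for illustration only.  This is not the
kernel code: ToyInode, truncate_blocks(), failed_write() and the
check_flags switch are all hypothetical names, and sizes are tracked in
bytes rather than fragments.  It only shows why the append-only bail-out
leaves allocated space past EOF when the cleanup goes through the same
checks as an ordinary truncate.

```python
from dataclasses import dataclass


@dataclass
class ToyInode:
    size: int          # logical file size (i_size), in bytes
    allocated: int     # bytes covered by allocated blocks
    append_only: bool  # stands in for IS_APPEND(inode)


def truncate_blocks(inode, check_flags=True):
    """Free everything past inode.size.

    With check_flags=True this mimics the bail-out on append-only
    inodes described above; with check_flags=False it models a cleanup
    path that skips those checks (an assumption about the fix).
    """
    if check_flags and inode.append_only:
        return False  # bail out, nothing is freed
    inode.allocated = min(inode.allocated, inode.size)
    return True


def failed_write(inode, nbytes, check_flags):
    """Allocate for an append, have the copy from userland fail, clean up.

    Returns the number of allocated bytes left past EOF afterwards.
    """
    inode.allocated = max(inode.allocated, inode.size + nbytes)
    # The copy failed, so inode.size is never updated.
    truncate_blocks(inode, check_flags=check_flags)
    return inode.allocated - inode.size
```

With the checks in place an append-only file keeps the whole failed
allocation past EOF; with check_flags=False the leftover is zero.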

I'm not happy with the way tail unpacking is done - we *probably* manage
to avoid deadlocks, but the proof is a whole lot more subtle than I'd like,
assuming it is correct in the first place.  And we have a nasty trap
caused by the way balloc works: when doing reallocation on a failed
attempt to extend the tail in place we do have logic that tries to put the
new copy into an empty block if the filesystem is not too fragmented, but
the *first* allocation has nothing of that sort going on.  So if you
have a block with 7 fragments in it in each cylinder group (just create
a bunch of 28Kb files in different directories), any attempt to write
more than 4K into a new file will *always* go like this:
	* for the first page, allocate a 4Kb fragment.  That has a good
chance of going into that almost full block - all 
	* for the next page, notice that we need to expand that tail
and can't do that in place.  Now the anti-fragmentation heuristics
kick in and we pick two fragments in an empty block.  And copy the one
we'd just written into the new place.
	* next 6 pages go extending the tail we'd got.  However, on
the next page the whole thing repeats.

	FreeBSD avoids that mess by doing bigger allocations - in the
same scenario it would've gone in 32Kb steps rather than 4Kb ones.
Looks like we need a different ->write_iter() there; generic one is
bloody painful in that respect...
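
A rough simulation of the pattern above, as a sketch under stated
assumptions rather than a model of the real balloc code: 4Kb fragments,
8 fragments per block, and the worst case where the first small
allocation for each new block always lands in a 1-fragment hole.
simulate() and its constants are hypothetical names.  It counts how many
fragments get copied by tail reallocation for a given allocation step.

```python
FRAGS_PER_BLOCK = 8  # 32Kb block / 4Kb fragment, as in the scenario above
HOLE_FRAGS = 1       # free fragments left in each almost-full block


def simulate(total_frags, step):
    """Write total_frags fragments, allocating `step` fragments at a time.

    Returns the number of fragments copied by tail reallocation.
    """
    copied = 0
    done = 0
    while done < total_frags:
        # Start of a new file block.
        in_block = min(FRAGS_PER_BLOCK, total_frags - done)
        tail = 0
        in_hole = False
        while tail < in_block:
            grow = min(step, in_block - tail)
            if tail == 0:
                # First allocation: no anti-fragmentation logic, so a
                # request that fits into the hole is assumed to go there.
                in_hole = grow <= HOLE_FRAGS
            elif in_hole:
                # Can't extend in place: move the tail to an empty block,
                # copying what has been written so far.
                copied += tail
                in_hole = False
            tail += grow
        done += in_block
    return copied
```

In this model simulate(16, 1) copies one fragment per 32Kb of file data,
while simulate(16, 8), i.e. 32Kb steps, copies nothing.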