Re: [PATCH v3 2/2] 9p: use i_lock to protect update of inode->i_blocks under 32-bit

Dominique Martinet <asmadeus@xxxxxxxxxxxxx> · Thu, 24 Jan 2019 15:45:39 +0100

Hou Tao wrote on Thu, Jan 24, 2019:
> > hmm, part of me wants to abuse v9fs_i_size_write to also update i_blocks
> > as I think we'd usually want to keep both in sync.We can do that for
> > v9fs_stat2inode(), but for v9fs_stat2inode_dotl()
>
> the updates of i_block & i_size are controlled by different flags, so
> separated updates are needed (maybe also add an extra flags ?).

Is there any case where i_block and i_size are not updated together?
I would be happy with an extra flag if required, I just do not see where
right now (admitedly didn't look for long, will have a second look
tomorrow)

> > In v9fs_file_write_iter we do not update i_blocks directly but
> > inode_add_bytes basically does the same thing... Speaking of which it's
> > assuming inode->i_bytes is set correctly initially; I _think_ we're
> > supposed to call inode_set_bytes() instead of writing i_blocks directly
> > in stat2inode?
> > Sorry, I'm not too familiar with this code and there doesn't seem to be
> > many users of inode_set_bytes so it's a bit puzzling to me.
>
> In my understanding, now only disk quota uses i_bytes, so for 9p there is not so
> much different between updating i_blocks directly and using inode_set_bytes(),
> but maybe using inode_set_bytes() will be appropriate.

I also think only quota uses i_bytes directly, but 9p also uses it
indirectly through inode_add_bytes() - i_blocks is incremented by 1
every 512 bytes and i_bytes is reduced to its last few bits (basically
%512)

> > I'm not using cache mode much but I have some very weird behaviour
> > currently if I mount something with cache=fscache :
> >  - if I create a new file from 9p and stat on the same client, a size of
> >  0-511 has block = 0 then it's just always off by 1

> Is this not the expected result ? v9fs_write_end() will call
> inode_add_bytes(inode,511) if we write 511 bytes to the new file and
> i_blocks will be 0.

That is the obvious result from the current code, yes - but it
definitely is wrong.
i_blocks should be 1 from byte 1 onwards until 512, and bumped to 2 at
513. From a filesystem point of view, a block is already "consumed" from
the first byte onward.
(This isn't what inode_set_bytes does..)

In practice, this actually is fairly important - tools like tar are
optimized such that if a file has no blocks it is considered sparse and
isn't read at all, so this could lead to data loss.
Interestingly I cannot reproduce with a recent gnu tar but I am sure I
have seen the behaviour (some lustre bug[1] talks about this as well and
the code is still present[2]); we should not report a st_blocks of 0 is
there is data -- v9fs_stat2inode (not v9fs_stat2inode_dotl) also rounds
up i_blocks so this is the right behaviour, we're just not doing it when
creating new files apparently.

[1] https://jira.whamcloud.com/browse/LU-3864
[2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c#n273

If this optimization didn't exist and given i_bytes existence it does
seem more correct to do like inode_set_bytes (e.g. blocks = bytes / 512
and bytes = bytes % 512) but this really strikes me as odd; this might
be why hardly anyone uses this...

> >  - if I create a file outside of 9p and stat it, for example starting
> >  with 400 bytes, it starts with 8 blocks then the number of blocks
> >  increase by 1 everytime 512 bytes are written regardless of initial
> >  size.
>
> The initial i_blocks comes from the underlying filesystem, so I think 8 blocks
> just mean the block size of the underlying filesystem is 4KB. If another 512 bytes
> are written into the file, inode_add_bytes() will increase i_blocks by one instead of
> fetching i_blocks from server, though i_blocks in server side will
> still be 8.

Ah, I had only looked at v9fs_stat2inode() which creates the value from
i_size, not v9fs_stat2inode_dotl().
v9fs_stat2inode_dotl() does take the value from the RPC, and then 8 is
indeed correct -- it will just be very painful to keep synchronized with
the underlying fs; we have no ways of knowing when the remote filesystem
decides to allocate new blocks.
If for example the write iter only writes 0 and the underlying fs is
optimized st_blocks might not increase at all.

> Maybe we should call v9fs_invalidate_inode_attr() in v9fs_write_end() ?

Hmm, That will probably add quite a lot of RPCs and make the cache
option rather useless...
Honestly as far as I'm concerned the value of i_blocks is mostly
informative (it's used by tools like du to give a size estimate, and
that's about it?) - with the exception of the 0 value I was talking
earlier.

I'd advocate to never display the remote st_blocks but do like
v9fs_stat2inode() and always just set blocks appropriately from the
size (i.e. i_blocks = (i_size + 512 - 1) >> 9 and i_bytes = 0), but I'd
like some opinions first.
I'll send a mail to fsdevel asking vfs folks for advices tomorrow; let's
put this patch on hold for a few days.

Thanks,
-- 
Dominique