Re: [PATCH] exfat: fix file not locking when writing zeros in exfat_file_mmap()

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Wed, 24 Jan 2024 14:02:31 +0000

On Wed, Jan 24, 2024 at 10:05:15AM +0000, Yuezhang.Mo@xxxxxxxx wrote:
> From: Matthew Wilcox <willy@xxxxxxxxxxxxx> 
> Sent: Wednesday, January 24, 2024 1:21 PM
> To: Mo, Yuezhang <Yuezhang.Mo@xxxxxxxx>
> Subject: Re: [PATCH] exfat: fix file not locking when writing zeros in exfat_file_mmap()
> > On Wed, Jan 24, 2024 at 05:00:37AM +0000, mailto:Yuezhang.Mo@xxxxxxxx wrote:
> > > inode->i_rwsem should be locked when writing file. But the lock
> > > is missing when writing zeros to the file in exfat_file_mmap().
> > 
> > This is actually very weird behaviour in exfat.  This kind of "I must
> > manipulate the on-disc layout" is not generally done in mmap(), it's
> > done in ->page_mkwrite() or even delayed until we actually do writeback.
> > Why does exfat do this?
> 
> In exfat, "valid_size" describes how far into the data stream user data has been
> written and "size" describes the file size.  Return zeros if read "valid_size"~"size".

Yes, it's like a very weak implementation of sparse files.  There can
only be one zero range and it must be at the end (as opposed to most
unix derived filesystems which allow arbitrary zero ranges anywhere in
the file).

> For example,
> 
> (1) xfs_io -t -f -c "pwrite -S 0x59 0 1024" $filename
>      - Write 0x59 to 0~1023
>      - both "size" and "valid_size" are 1024

Yes.

> (2) xfs_io -t -f -c "truncate 4K" $filename
>      - "valid_size" is still 1024
>      - "size" is changed to 4096
>      - 1024~4095 is not zeroed
>      - return zeros if read 1024~4095

Yes.

> (3) xfs_io -t -f -c "mmap -rw 0 3072" -c  "mwrite -S 0x5a 2048 512" $filename
>      (3.1) "mmap -rw 0 3072"
>               -  write zeros to 1024~3071
>               -  "valid_size" is changed to 3072
>               - "size" is still 4096

That's not what the code says you do.  Is this from a trace or were you
making up an example?

        loff_t start = ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
        loff_t end = min_t(loff_t, i_size_read(inode),
                        start + vma->vm_end - vma->vm_start);
                ret = exfat_file_zeroed_range(file, ei->valid_size, end);

vm_end - vm_start will be 4kB because Linux rounds to PAGE_SIZE even if
you ask to map 3072 bytes.

In any case, most filesystems do not do this, and I don't understand why
exfat does.  Just because a file is mmaped writable doesn't mean that
we're necessarily going to write to it.  The right thing to do here is
defer the changing of ->valid_size until you have already written back
the new data.

>      (3.2) "mwrite -S 0x5a 2048 512"
>              - write 0x5a to 2048~2559
>              - "valid_size" is still 3072
>              -  "size" is still 4096
> 
> To avoid 1024~2047 is not zeroed and no need to update "valid_size" in (3.2),
> I zero 1024~3071 in (3.1).
> 
> If you have a better solution, welcome to contribute to  exfat or share your
> solution in detail.

Update ->valid_size in the writeback path.  If I'm reading
exfat_get_block() correctly, you already correctly zero pages in the page
cache that are read after ->valid_size, so there is very little work for
you to do.

Oh!  I just figured out why you probably do it this way.

(1) xfs_io -t -f -c "pwrite -S 0x59 0 1024" $filename
(2) xfs_io -t -f -c "truncate 4T" $filename
(3) xfs_io -t -f -c "mmap -rw 3T 4096" -c  "mwrite -S 0x5a 3T 512" $filename

Now (at whatever point you're delaying writing zeroes to) you have to
write 3TB of zeroes.  And it's probably better to do that at mmap time
than at page fault time, so you can still return an error.  It's a bit
weird to return ENOSPC from mmap, but here we are.

It'd be nice to have a comment to explain this.  Also, it seems to me
that I can write a script that floods the kernel log with:

                        exfat_err(inode->i_sb,
                                  "mmap: fail to zero from %llu to %llu(%d)",
                                  start, end, ret);

That error message should probably be taken out entirely (maybe use a
tracepoint if you really want some kind of logging).