On Wed, Jan 24, 2024 at 10:05:15AM +0000, Yuezhang.Mo@xxxxxxxx wrote: > From: Matthew Wilcox <willy@xxxxxxxxxxxxx> > Sent: Wednesday, January 24, 2024 1:21 PM > To: Mo, Yuezhang <Yuezhang.Mo@xxxxxxxx> > Subject: Re: [PATCH] exfat: fix file not locking when writing zeros in exfat_file_mmap() > > On Wed, Jan 24, 2024 at 05:00:37AM +0000, mailto:Yuezhang.Mo@xxxxxxxx wrote: > > > inode->i_rwsem should be locked when writing file. But the lock > > > is missing when writing zeros to the file in exfat_file_mmap(). > > > > This is actually very weird behaviour in exfat. This kind of "I must > > manipulate the on-disc layout" is not generally done in mmap(), it's > > done in ->page_mkwrite() or even delayed until we actually do writeback. > > Why does exfat do this? > > In exfat, "valid_size" describes how far into the data stream user data has been > written and "size" describes the file size. Return zeros if read "valid_size"~"size". Yes, it's like a very weak implementation of sparse files. There can only be one zero range and it must be at the end (as opposed to most unix derived filesystems which allow arbitrary zero ranges anywhere in the file). > For example, > > (1) xfs_io -t -f -c "pwrite -S 0x59 0 1024" $filename > - Write 0x59 to 0~1023 > - both "size" and "valid_size" are 1024 Yes. > (2) xfs_io -t -f -c "truncate 4K" $filename > - "valid_size" is still 1024 > - "size" is changed to 4096 > - 1024~4095 is not zeroed > - return zeros if read 1024~4095 Yes. > (3) xfs_io -t -f -c "mmap -rw 0 3072" -c "mwrite -S 0x5a 2048 512" $filename > (3.1) "mmap -rw 0 3072" > - write zeros to 1024~3071 > - "valid_size" is changed to 3072 > - "size" is still 4096 That's not what the code says you do. Is this from a trace or were you making up an example? loff_t start = ((loff_t)vma->vm_pgoff << PAGE_SHIFT); loff_t end = min_t(loff_t, i_size_read(inode), start + vma->vm_end - vma->vm_start); ret = exfat_file_zeroed_range(file, ei->valid_size, end); vm_end - vm_start will be 4kB because Linux rounds to PAGE_SIZE even if you ask to map 3072 bytes. In any case, most filesystems do not do this, and I don't understand why exfat does. Just because a file is mmaped writable doesn't mean that we're necessarily going to write to it. The right thing to do here is defer the changing of ->valid_size until you have already written back the new data. > (3.2) "mwrite -S 0x5a 2048 512" > - write 0x5a to 2048~2559 > - "valid_size" is still 3072 > - "size" is still 4096 > > To avoid 1024~2047 is not zeroed and no need to update "valid_size" in (3.2), > I zero 1024~3071 in (3.1). > > If you have a better solution, welcome to contribute to exfat or share your > solution in detail. Update ->valid_size in the writeback path. If I'm reading exfat_get_block() correctly, you already correctly zero pages in the page cache that are read after ->valid_size, so there is very little work for you to do. Oh! I just figured out why you probably do it this way. (1) xfs_io -t -f -c "pwrite -S 0x59 0 1024" $filename (2) xfs_io -t -f -c "truncate 4T" $filename (3) xfs_io -t -f -c "mmap -rw 3T 4096" -c "mwrite -S 0x5a 3T 512" $filename Now (at whatever point you're delaying writing zeroes to) you have to write 3TB of zeroes. And it's probably better to do that at mmap time than at page fault time, so you can still return an error. It's a bit weird to return ENOSPC from mmap, but here we are. It'd be nice to have a comment to explain this. Also, it seems to me that I can write a script that floods the kernel log with: exfat_err(inode->i_sb, "mmap: fail to zero from %llu to %llu(%d)", start, end, ret); That error message should probably be taken out entirely (maybe use a tracepoint if you really want some kind of logging).