On Sun, Apr 16, 2023 at 03:07:33PM +0100, Matthew Wilcox wrote:
> On Sat, Apr 15, 2023 at 10:26:42PM -0700, Luis Chamberlain wrote:
> > On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote:
> > > I don't think we should be overriding the aops, and if we narrow
> > > the scope of large folio support in blockdev to only supporting
> > > folio_size == LBA size, it becomes much more feasible.
> > 
> > I'm trying to think of the possible use cases where folio_size !=
> > LBA size and I cannot immediately think of any.  Yes, there are
> > cases where a filesystem may use a different block size for, say,
> > metadata than for data, but I believe that is a side issue, ie,
> > reads/writes for small metadata would have to be accepted.  At
> > least for NVMe we have metadata size as part of the LBA format,
> > but from what I understand no Linux filesystem yet uses that.
> 
> NVMe metadata is per-block metadata -- a CRC or similar.  Filesystem
> metadata is things like directories, inode tables, free space bitmaps,
> etc.
> 
> > struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
> > 		bool retry)
> > {
> [...]
> > 	head = NULL;
> > 	offset = PAGE_SIZE;
> > 	while ((offset -= size) >= 0) {
> > 
> > I see now what you say about the buffer head being of the block
> > size, bh->b_size = size above.
> 
> Yes, just changing that to 'offset = page_size(page);' will do the
> trick.
> 
> > > sb_bread() is used by most filesystems, and the buffer cache
> > > aliases into the page cache.
> > 
> > I see, thanks.  I checked what xfs does and its xfs_readsb() uses
> > its own xfs_buf_read_uncached().  It ends up calling
> > xfs_buf_submit(), and xfs_buf_ioapply_map() does its own
> > submit_bio().  So I'm curious why they did that.
> 
> IRIX didn't have an sb_bread() ;-)
> 
> > > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1
> > > seek=N', I can overwrite the superblock.  Do we want filesystems
> > > to see that kind of vandalism, or do we want the mounted
> > > filesystem to have its own copy of the data and overwrite what
> > > userspace wrote the next time it updates the superblock?
> > 
> > Oh, what happens today?
> 
> Depends on the filesystem, I think?  Not really sure, to be honest.

The filesystem driver sees the vandalism, and can very well crash as a
result [1].  In that case it was corrupted journal contents being
replayed, but the same thing would happen if you wrote a malicious
userspace program to set the metadata_csum feature flag in the ondisk
superblock after mounting the fs.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=82201#c4

I've tried to prevent people from writing to mounted block devices in
the past, but did not succeed.  If you try to prevent programs from
opening such devices with O_RDWR/O_WRONLY, you then break the lvm
tools, which require that ability even though they don't actually
write anything to the block device.  If you make the block device
write_iter function fail, then old e2fsprogs breaks and you get
shouted at for breaking userspace.

Hence I decided to let security researchers find these bugs and
control the design discussion via CVE.  That's not correct and it's
not smart, but it preserves some of my sanity.

--D
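
For completeness, here is a minimal sketch of what the
alloc_page_buffers() loop quoted above might look like with the
page_size() change suggested there.  It trims the memcg setup and
failure handling of the real fs/buffer.c function and is illustrative
only, not a tested patch:

	/*
	 * Sketch only: the real alloc_page_buffers() also pins the memcg
	 * and cleans up on allocation failure; this just shows the
	 * one-line change from PAGE_SIZE to page_size(page), so a
	 * compound page gets one buffer_head per block across its whole
	 * size.
	 */
	head = NULL;
	offset = page_size(page);	/* was: offset = PAGE_SIZE; */
	while ((offset -= size) >= 0) {
		bh = alloc_buffer_head(gfp);
		if (!bh)
			goto no_grow;

		bh->b_this_page = head;
		bh->b_blocknr = -1;
		head = bh;

		bh->b_size = size;	/* each bh still spans one block */

		/* attach the bh to its (possibly compound) page */
		set_bh_page(bh, page, offset);
	}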