On Aug 3, 2019, at 2:24 PM, Arnd Bergmann <arnd@xxxxxxxx> wrote: > > On Sat, Aug 3, 2019 at 6:03 PM Theodore Y. Ts'o <tytso@xxxxxxx> wrote: >> >> On Sat, Aug 03, 2019 at 11:30:22AM +0200, Arnd Bergmann wrote: >>> >>> I see in the ext4 code that we always try to expand i_extra_size >>> to s_want_extra_isize in ext4_mark_inode_dirty(), and that >>> s_want_extra_isize is always at least s_min_extra_isize, so >>> we constantly try to expand the inode to fit. >> >> Yes, we *try*. But we may not succeed. There may actually be a >> problem here if the cause is due to there simply is no space in the >> external xattr block, so we might try and try every time we try to >> modify that inode, and it would be a performance mess. If it's due to >> there being no room in the current transaction, then it's highly >> likely it will succeed the next time. >> >>> Did older versions of ext4 or ext3 ignore s_min_extra_isize >>> when creating inodes despite >>> EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE, >>> or is there another possibility I'm missing? >> >> s_min_extra_isize could get changed in order to make room for some new >> file system feature --- such as extended timestamps. > > Ok, that explains it. I assumed s_min_extra_isize was meant to > not be modifiable, and did not find a way to change it using the > kernel or tune2fs, but now I can see that debugfs can set it. > >> If you want to pretend that file systems never get upgraded, then life >> is much simpler. The general approach is that for less-sophisticated >> customers (e.g., most people running enterprise distros) file system >> upgrades are not a thing. But for sophisticated users, we do try to >> make thing work for people who are aware of the risks / caveats / >> rough edges. Google won't have been able to upgrade thousands and >> thousands of servers in data centers all over the world if we limited >> ourselves to Red Hat's support restrictions. Backup / reformat / >> restore really isn't a practical rollout strategy for many exabytes of >> file systems. >> >> It sounds like your safety checks / warnings are mostly targeted at >> low-information customers, no? > > Yes, that seems like a reasonable compromise: just warn based > on s_min_extra_isize, and assume that anyone who used debugfs > to set s_min_extra_isize to a higher value from an ext3 file system > during the migration to ext4 was aware of the risks already. > > That leaves the question of what we should set the s_time_gran > and s_time_max to on a superblock with s_min_extra_isize<16 > and s_want_extra_isize>=16. > > If we base it on s_min_extra_isize, we never try to set a timestamp > later than 2038 and so will never fail, but anyone with a grandfathered > s_min_extra_isize from ext3 won't be able to set extended > timestamps on any files any more. Based on s_want_extra_isize > we would keep the current behavior, but could add a custom > warning in the ext4 code about the small s_min_extra_isize > indicating a theoretical problem. I think it makes the most sense to always try to set timestamps on inodes that have enough space for them. The chance of running into a filesystem with 256-byte inode size but *no* space in the inode to store an extended timestamp, but is *also* being modified by a new kernel after 2038 is vanishingly small. This would require formatting the filesystem with non-default mke2fs for ext3, using the filesystem and storing enough xattrs on inodes that there isn't space for 12 bytes of extra isize, and using it for 30+ years without upgrading to use ext4 (which will also try to expand the inode to store the nsec timestamps) and then modifying the inode after 2038. Rather than printing a warning at mount time (which may be confusing to users for a problem they may never see), it makes sense to only print such a warning in the vanishingly small case that someone actually tries to modify the inode timestamp but it doesn't fit, rather than on the theoretical case that may never happen. Cheers, Andreas
Attachment:
signature.asc
Description: Message signed with OpenPGP