On Tue, Feb 02, 2021 at 03:26:35PM -0500, Gabriel Krisman Bertazi wrote:
> Thanks for the explanation.  That makes sense to me.  For corruptions
> where it is impossible to map to a mountpoint, I thought they could be
> considered global filesystem errors, being exposed only to someone
> watching the entire filesystem (like FAN_MARK_FILESYSTEM).

At least for ext4, there are only 3 ext4_error_*() call sites that we
could map to a subtree without having to make changes to the call
points:

    % grep -i ext4_error_file\( fs/ext4/*.c | wc -l
    3
    % grep -i ext4_error_inode\( fs/ext4/*.c | wc -l
    79
    % grep -i ext4_error\( fs/ext4/*.c | wc -l
    42

So in practice, unless we want to make a lot of changes to ext4, most
of them will be global file system errors....

> But, as you mentioned regarding the google use case, the entire idea of
> watching a subtree is a bit beyond the scope of my use-case, and was
> only added given the feedback on the previous proposal of this feature.
> While nice to have, I don't have the need to watch different mountpoints
> for errors, only the entire filesystem.

I suspect that for most use cases, the most interesting thing is the
first error.  We already record this in the ext4 superblock, because
unfortunately, I can't guarantee that system administrators have
correctly configured their system logs, so when handling upstream bug
reports, I can just ask them to run dumpe2fs -h on the file system:

    FS Error count:           2
    First error time:         Tue Feb  2 16:27:42 2021
    First error function:     ext4_lookup
    First error line #:       1704
    First error inode #:      12
    First error err:          EFSCORRUPTED
    Last error time:          Tue Feb  2 16:27:59 2021
    Last error function:      ext4_lookup
    Last error line #:        1704
    Last error inode #:       12
    Last error err:           EFSCORRUPTED

So it's not just the Google case.  I'd argue that for most system
administrators, one of the most useful things to know is when the file
system was first found to be corrupted, so they can try correlating
file system corruption with, say, reports of I/O errors, OOM kills,
etc.  This can also be useful for correlating the start of file system
problems with problems at the application layer --- say, MongoDB,
MySQL, etc.

The reason why a notification system is useful is that if you are
using some kind of database high-availability replication system, and
problems are detected in the file system of the primary MySQL server,
you'd want to have the system fail over to the secondary MySQL server.
Sure, you *could* do this by polling the superblock, but that's not
the most efficient way to do things.

> There was a concern raised against my original submission which did
> superblock watching, due to the fact that a superblock is an internal
> kernel structure that must not be visible to the userspace.  It was
> suggested we should be speaking in terms of paths and mountpoint.  But,
> considering the existence of FAN_MARK_FILESYSTEM, I think it is the exact
> same object I was looking for when proposing the watch_sb syscall.

Sure, using the owner of the root directory (the mountpoint) might be
one way of doing things.  I think CAP_SYS_ADMIN is also fine, since
for many of the use cases, such as shutting down the file system so
the database server can fail over to the secondary service, you're
going to need root anyway.  (Especially for ext4, the vast majority of
the errors are going to be FAN_MARK_FILESYSTEM anyway.)

> The main use case is, as you said, corruption detection with enough
> information to allow us to trigger automatic recovery and data rebuilding
> tools.  I understand now I can drop most of the debug info, as you
> mentioned.  In this sense, the feature looks more like netoops.

Things like the bad block number and the more structured information
are nice to have.  Using the structured information might also be a
more efficient way to get the information to userspace.  So one could
imagine using, say, a 32-character error text code, say,
"ext4_bad_dir_block_csum", followed by a 64-bit inode number, a 64-bit
containing directory inode number, a 64-bit block number, 8 bits of
filename length, the first 7 characters of the filename, and the last
8 characters of the filename.  That's 72 bytes, which is quite
compact.  Adding something like 16 bytes of function name and 2 bytes
of line number would bring it to 90 bytes.
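To make the arithmetic concrete, here is a rough sketch of what such a
fixed-size record could look like as a C structure.  The struct and
field names are purely illustrative placeholders, not from any posted
patch:

    /*
     * Rough sketch only; field names are hypothetical.  Offsets are
     * shown in the comments.
     */
    #include <stdint.h>

    struct fs_error_record {
            char      err_code[32];  /* e.g. "ext4_bad_dir_block_csum" (0)  */
            uint64_t  inode;         /* inode the error refers to     (32) */
            uint64_t  dir_inode;     /* containing directory inode    (40) */
            uint64_t  block;         /* bad block number              (48) */
            uint8_t   name_len;      /* length of the file name       (56) */
            char      name_head[7];  /* first 7 bytes of the name     (57) */
            char      name_tail[8];  /* last 8 bytes of the name      (64) */
    };                               /* 72 bytes total                     */

    /* Optional extension with the reporting location: */
    struct fs_error_record_ext {
            struct fs_error_record rec;  /* 72 bytes                       */
            char      func[16];      /* reporting function name       (72) */
            uint16_t  line;          /* reporting line number         (88) */
    };                               /* 90 bytes of payload; note the
                                        compiler pads sizeof() to 96 unless
                                        the struct is declared packed      */

Either way, a fixed layout like this stays within the 96-or-80-byte
budget mentioned below.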
> But there are other uses that interest us, like pattern analysis of
> error locations, such that we can detect and perhaps predict errors.
> One idea that was mentioned is if an error occurs frequently enough
> in a specific function, there might be a bug in that function.  This is
> one of the reasons we are pushing to include function:line in the error
> message.

I agree.  But if this is causing too much resistance, we can just
encode some of the error information as a file system specific text or
byte string.  Even if we are restricted to, say, 96 or 80 bytes,
there's a lot we can do, even if we can't do notification
continuations.

Cheers,

					- Ted