On Sat, Jan 15, 2022 at 12:16 AM Khazhy Kumykov <khazhy@xxxxxxxxxx> wrote:
>
> On Fri, Jan 14, 2022 at 12:17 PM Gabriel Krisman Bertazi
> <krisman@xxxxxxxxxxxxx> wrote:
> >
> > Amir Goldstein <amir73il@xxxxxxxxx> writes:
> >
> > >> > But the question remains, what is so special about shmem that
> > >> > your use case requires fsnotify events to handle ENOSPC?
> > >> >
> > >> > Many systems are deployed on thin provisioned storage these days,
> > >> > and monitoring the state of the storage to alert the administrator
> > >> > before storage gets full (be it filesystem inodes or blocks or
> > >> > thinp space) is crucial to many systems.
> > >> >
> > >> > Since the ENOSPC event that you are proposing is asynchronous
> > >> > anyway, what is the problem with polling statfs() and meminfo?
> > >>
> > >> Amir,
> > >>
> > >> I spoke a bit with Khazhy (in CC) about the problems with polling the
> > >> existing APIs, like statfs. He has been using a previous version of
> > >> this code in production to monitor machines for a while now. Khazhy,
> > >> feel free to pitch in with more details.
> > >>
> > >> Firstly, I don't want to treat shmem as a special case. The original
> > >> patch implemented support only for tmpfs, because it was a fs-specific
> > >> solution, but I think this would be useful for any other (non-pseudo)
> > >> file system in the kernel.
> > >>
> > >> The use case is similar to the use case I brought up for FAN_FS_ERROR.
> > >> A sysadmin monitoring a fleet of machines wants to be notified when a
> > >> service failed because of lack of space, without having to trust the
> > >> failed application to properly report the error.
> > >>
> > >> Polling statfs is prone to missing the ENOSPC occurrence if the error
> > >> is ephemeral from a monitoring tool's point of view. Say the
> > >> application is writing a large file, hits ENOSPC and, as a recovery
> > >> mechanism, removes the partial file. If that happens, a daemon might
> > >> miss the chance to observe the lack of space in statfs. Doing it
> > >> through fsnotify, on the other hand, always catches the condition and
> > >> allows a monitoring tool/sysadmin to take corrective action.
> > >>
> > >> > I guess one difference is that it is harder to predict the page
> > >> > allocation failure that causes ENOSPC in shmem, but IIUC, your
> > >> > patch does not report an fsevent in that case, only on an
> > >> > inode/block accounting error.
> > >> > Or maybe I did not understand it correctly?
> > >>
> > >> Correct. But we cannot predict the ENOSPC unless we know the
> > >> application. I'm looking for a way for a sysadmin to not have to rely
> > >> on the application caring about the file system size.
> > >>
> > >
> > > In the real world, ENOSPC can often be anticipated way ahead of time,
> > > and sysadmins are practically required to take action when storage
> > > space is low. Getting near a 90% full filesystem is not healthy on
> > > many traditional disk filesystems; it causes suboptimal performance
> > > and in many cases (especially on cow filesystems) may lead to
> > > filesystem corruption.
> > >
> > > All that said, yes, *sometimes* ENOSPC cannot be anticipated,
> > > but EIO can never be anticipated, so why are we talking about ENOSPC?
> > > Focusing on ENOSPC seems too specific for the purpose of adding
> > > fsnotify monitoring for filesystem ephemeral errors.
> > >
> > > The problem with fsnotify events for ephemeral filesystem errors
> > > is that there can be a *lot* of them, compared to filesystem
> > > corruption errors, which would usually put the filesystem in an
> > > "emergency" state and stop the events from flooding.
> > > For that reason I still think that a polling API for fs ephemeral
> > > errors is a better idea.
> > >
> > > w.r.t. ephemeral errors on writeback, we already have syncfs() as
> > > a means to provide a publish/subscribe API for monitoring
> > > applications, to check if there was any error since the last check,
> > > but we do not have an API that provides this information without the
> > > added cost of performing syncfs().
> > >
> > > IMO, a proper solution would look something like this:
> > >
> > >     /* per-sb errseq_t for reporting writeback errors via syncfs */
> > >     errseq_t s_wb_err;
> > > +   /* per-sb errseq_t for reporting vfs errors via fstatfs */
> > > +   errseq_t s_vfs_err;
> > >
> >
> > I think making it a polling API wouldn't be a problem for our use case;
> > as long as it is kept as an always-increasing counter, we should be
> > able to detect changes and not miss events.
> >
> > The problem with the proposal, in my opinion, is the lack of
> > differentiation of the errors. We want to be able to tell apart an EIO
> > from an ENOSPC, and it might not be clear from the other fields in
> > fstatfs what has happened.
>
> These tmpfs are associated with a containerized task (and often aren't
> very large), so it isn't possible to predict a full filesystem.
>
> For our use case, what makes this "shmem specific" is that we want to
> differentiate between a user getting ENOSPC due to an insufficiently
> provisioned fs size, vs. due to running out of memory in a container -
> both of which return ENOSPC to the process. Mixing EIO into the same
> signal ("there were errors, ever") hides this information. So what we
> were looking for was some sort of way of signaling that the user saw
> ENOSPC, and a way for tmpfs to signal "why". Our first approach was
> just to keep a counter of how many times this situation (ENOSPC due to
> max size) occurred, but this seemed too niche an interface. Using the
> fanotify interface seems like a good candidate, and has the additional
> (potential) benefit that a similar notification can be used for any
> error on any type of fs, though as you mentioned the volume of
> potential errors is much higher, so perhaps sticking to polling is the
> way to go.
>
> To my understanding, errseq_t stores a counter for one type of error,
> and if we see multiple types of errors, we'll overwrite the errno (so
> EIO followed by ENOSPC, or vice versa, would result in missing info).
>
> >
> > Also, I suspect Google might care about what inode triggered the
> > error. If I understand correctly their use case, that would allow them
> > to trace back the origin of the issue. Either way, wouldn't it be
> > useful for applications in general to be able to know what specific
> > writeback failed?
>
> For our usage, just knowing that an error occurred should be good
> enough, and if that simplifies things let's omit the extra
> functionality.

Yes, it simplifies things a lot.

The use case sounds very shmem specific, and unless there are other users
that need fsnotify ENOSPC events with inode information, I would stay
away from this.

I suggest that you export a counter of shmem ENOSPC errors of both kinds
(page allocation and block accounting) via procfs/sysfs and poll for the
counter that you care about.
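Roughly something like the below (a completely untested sketch; the field
names, the to_sbinfo() helper and the sysfs path are all made up for
illustration, and the wiring of a per-mount kobject is left out):

    /* hypothetical per-sb counters in struct shmem_sb_info */
    atomic64_t enospc_acct;     /* ENOSPC from block/inode accounting */
    atomic64_t enospc_alloc;    /* ENOSPC from failed page allocation */

    /*
     * Bump the relevant counter at the failure sites, e.g. wherever
     * block accounting fails and shmem is about to return ENOSPC:
     */
    atomic64_inc(&sbinfo->enospc_acct);

    /* sysfs read side, e.g. a per-mount /sys/fs/tmpfs/<id>/enospc_acct */
    static ssize_t enospc_acct_show(struct kobject *kobj,
                                    struct kobj_attribute *attr, char *buf)
    {
            /* to_sbinfo() is hypothetical kobject-to-sb_info glue */
            struct shmem_sb_info *sbinfo = to_sbinfo(kobj);

            return sysfs_emit(buf, "%llu\n",
                              (u64)atomic64_read(&sbinfo->enospc_acct));
    }

A monitoring daemon then only needs to remember the last value it read and
alert when the counter grows, which also avoids the "ephemeral error"
problem that statfs() polling has.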
If you want to follow precedents, there is /sys/fs/ext4/sda3/errors_count
and friends.

Thanks,
Amir
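P.S. For completeness, the polling side stays trivial. A minimal sketch of
the daemon loop, assuming the hypothetical sysfs file from the sketch
above (the path is made up):

    #include <stdio.h>
    #include <unistd.h>

    /* hypothetical counter file, matching the kernel-side sketch */
    #define COUNTER "/sys/fs/tmpfs/0/enospc_acct"

    int main(void)
    {
            unsigned long long last = 0, cur;
            FILE *f;

            for (;;) {
                    f = fopen(COUNTER, "r");
                    if (f) {
                            /* alert only when the counter has grown */
                            if (fscanf(f, "%llu", &cur) == 1 && cur > last) {
                                    fprintf(stderr,
                                            "tmpfs hit ENOSPC %llu time(s)\n",
                                            cur - last);
                                    last = cur;
                            }
                            fclose(f);
                    }
                    sleep(10);      /* polling interval */
            }
    }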