Re: [PATCH 0/2] shmem: Notify user space when file system is full

On Sat, Jan 15, 2022 at 12:16 AM Khazhy Kumykov <khazhy@xxxxxxxxxx> wrote:
>
> On Fri, Jan 14, 2022 at 12:17 PM Gabriel Krisman Bertazi
> <krisman@xxxxxxxxxxxxx> wrote:
> >
> > Amir Goldstein <amir73il@xxxxxxxxx> writes:
> >
> > >> > But the question remains, what is so special about shmem that
> > >> > your use case requires fsnotify events to handle ENOSPC?
> > >> >
> > >> > Many systems are deployed on thin provisioned storage these days
> > >> > and monitoring the state of the storage to alert the administrator before
> > >> > storage gets full (be it filesystem inodes or blocks or thinp space)
> > >> > is crucial to many systems.
> > >> >
> > >> > Since the ENOSPC event that you are proposing is asynchronous
> > >> > anyway, what is the problem with polling statfs() and meminfo?
> > >>
> > >> Amir,
> > >>
> > >> I spoke a bit with Khazhy (in CC) about the problems with polling the
> > >> existing APIs, like statfs.  He has been using a previous version of
> > >> this code in production to monitor machines for a while now.  Khazhy,
> > >> feel free to pitch in with more details.
> > >>
> > >> Firstly, I don't want to treat shmem as a special case.  The original
> > >> patch implemented support only for tmpfs, because it was an fs-specific
> > >> solution, but I think this would be useful for any other (non-pseudo)
> > >> file system in the kernel.
> > >>
> > >> The use case is similar to the use case I brought up for FAN_FS_ERROR.
> > >> A sysadmin monitoring a fleet of machines wants to be notified when a
> > >> service failed because of lack of space, without having to trust the
> > >> failed application to properly report the error.
> > >>
> > >> Polling statfs is prone to missing the ENOSPC occurrence if the error is
> > >> ephemeral from a monitoring tool's point of view. Say the application is
> > >> writing a large file, hits ENOSPC and, as a recovery mechanism, removes
> > >> the partial file.  If that happens, a daemon might miss the chance to
> > >> observe the lack of space in statfs.  Doing it through fsnotify, on the
> > >> other hand, always catches the condition and allows a monitoring
> > >> tool/sysadmin to take corrective action.
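> > >>
> > >> As a rough illustration (hypothetical monitor code, not part of this
> > >> patch), a poller only sees the condition while it lasts:
> > >>
> > >>         #include <sys/statvfs.h>
> > >>
> > >>         /* Returns 1 if the fs looks full at this very instant. */
> > >>         static int fs_looks_full(const char *mnt)
> > >>         {
> > >>                 struct statvfs st;
> > >>
> > >>                 if (statvfs(mnt, &st) != 0)
> > >>                         return 0;
> > >>                 /*
> > >>                  * If the writer hit ENOSPC and already removed its
> > >>                  * partial file, f_bavail is back above zero by the
> > >>                  * time we sample it, and the occurrence is lost.
> > >>                  */
> > >>                 return st.f_bavail == 0;
> > >>         }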
> > >>
> > >> > I guess one difference is that it is harder to predict the page allocation
> > >> > failure that causes ENOSPC in shmem, but IIUC, your patch does not report
> > >> > an fsevent in that case, only on inode/block accounting errors.
> > >> > Or maybe I did not understand it correctly?
> > >>
> > >> Correct.  But we cannot predict the ENOSPC unless we know the
> > >> application.  I'm looking for a way for a sysadmin to not have to rely
> > >> on the application caring about the file system size.
> > >>
> > >
> > > In the real world, ENOSPC can often be anticipated way ahead of time
> > > and sysadmins are practically required to take action when storage space is low.
> > > Letting a filesystem get near 90% full is not healthy on many traditional
> > > disk filesystems: it causes suboptimal performance and in many cases
> > > (especially CoW filesystems) may lead to filesystem corruption.
> > >
> > > All that said, yes, *sometimes* ENOSPC cannot be anticipated,
> > > but EIO can never be anticipated, so why are we talking about ENOSPC?
> > > Focusing on ENOSPC seems too specific for the purpose of adding fsnotify
> > > monitoring for filesystem ephemeral errors.
> > >
> > > The problem with fsnotify events for ephemeral filesystem errors
> > > is that there can be a *lot* of them, compared to filesystem corruption
> > > errors, which would usually put the filesystem in an "emergency" state
> > > and stop the events from flooding.
> > > For that reason I still think that a polling API for fs ephemeral errors
> > > is a better idea.
> > >
> > > w.r.t. ephemeral errors on writeback, we already have syncfs() as
> > > a means to provide a publish/subscribe API for monitoring applications
> > > to check if there was any error since the last check, but we do not
> > > have an API that provides this information without the added cost of
> > > performing a syncfs().
> > >
> > > IMO, a proper solution would look something like this:
> > >
> > >          /* per-sb errseq_t for reporting writeback errors via syncfs */
> > >          errseq_t s_wb_err;
> > > +        /* per-sb errseq_t for reporting vfs errors via fstatfs */
> > > +        errseq_t s_vfs_err;
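> > >
> > > A sketch of how that could be used (the hook points are hypothetical;
> > > the helpers are the existing ones from lib/errseq.c):
> > >
> > >          /* at an ephemeral error site: */
> > >          errseq_set(&sb->s_vfs_err, -ENOSPC);
> > >
> > >          /* in the fstatfs() path, against a cursor the caller
> > >           * sampled earlier with errseq_sample():
> > >           */
> > >          if (errseq_check_and_advance(&sb->s_vfs_err, &cursor))
> > >                  /* a vfs error occurred since the last check */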
> > >
> >
> > I think making it a polling API wouldn't be a problem for our use case.
> > As long as it is kept as an always-increasing counter, we should be able
> > to detect changes and not miss events.
> >
> > The problem with the proposal, in my opinion, is the lack of
> > differentiation of the errors.  We want to be able to tell apart an EIO
> > from an ENOSPC, and it might not be clear from the other fields in
> > fstatfs what has happened.
>
> These tmpfs instances are associated with a containerized task (and often
> aren't very large), so it isn't possible to predict when the filesystem
> will fill up.
>
> For our use case, what makes this "shmem specific" is that we want to
> differentiate between a user getting ENOSPC due to an insufficiently
> provisioned fs size vs. due to running out of memory in a container -
> both of which return ENOSPC to the process. Mixing EIO into the same
> signal ("there were errors, ever") hides this information. So what we
> were looking for was some sort of way of signaling that the user saw
> ENOSPC, and a way for tmpfs to signal "why". Our first approach was
> just to keep a counter of how many times this situation (ENOSPC due to
> max size) occurred, but this seemed too niche an interface. Using the
> fanotify interface seems like a good candidate, and has the additional
> (potential) benefit that a similar notification can be used for any
> error on any type of fs, though as you mentioned the volume of
> potential errors is much higher, so perhaps sticking to polling is the
> way to go.
>
> To my understanding, errseq_t stores a counter plus a single errno,
> and if we see multiple types of errors, the errno gets overwritten (so
> EIO followed by ENOSPC, or vice versa, would result in missing info).
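>
> (Rough illustration, assuming the proposed s_vfs_err field:
>
>         errseq_set(&sb->s_vfs_err, -EIO);    /* latest errno: EIO */
>         errseq_set(&sb->s_vfs_err, -ENOSPC); /* EIO is overwritten */
>
> the sequence counter still advances, but only the last errno survives.)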
>
> >
> > Also, I suspect Google might care about what inode triggered the error.
> > If I understand their use case correctly, that would allow them to trace
> > back the origin of the issue.  Either way, wouldn't it be useful for
> > applications in general to be able to know what specific writeback failed?
>
> For our usage, just knowing that an error occurred should be good
> enough, and if that simplifies things let's omit the extra
> functionality.

Yes, it simplifies things a lot.
The use case sounds very shmem-specific, and unless there are other
users that need fsnotify ENOSPC events with inode information,
I would stay away from this.

I suggest that you export a counter of shmem ENOSPC errors of
both kinds (page allocation and block accounting) via procfs/sysfs
and poll the counter that you care about.
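
Something along these lines, as a rough sketch (the names and hook
points are made up):

        /* hypothetical counters, bumped at the two shmem failure
         * sites that return ENOSPC:
         */
        static atomic64_t shmem_enospc_pages;  /* page allocation failed */
        static atomic64_t shmem_enospc_blocks; /* block accounting failed */

        static ssize_t enospc_blocks_show(struct kobject *kobj,
                                          struct kobj_attribute *attr,
                                          char *buf)
        {
                return sysfs_emit(buf, "%lld\n", (long long)
                                  atomic64_read(&shmem_enospc_blocks));
        }
        static struct kobj_attribute enospc_blocks_attr =
                __ATTR_RO(enospc_blocks);

plus a matching attribute for the page allocation counter, and an
atomic64_inc() at each ENOSPC return site.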

If you want to follow precedents, there is /sys/fs/ext4/sda3/errors_count
and friends.
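
Since such a counter only ever increases, a simple poller cannot miss
an occurrence, only notice it late. For example (userspace sketch,
reusing the ext4 path above):

        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                long long prev = -1, cur;

                for (;;) {
                        FILE *f = fopen("/sys/fs/ext4/sda3/errors_count",
                                        "r");

                        if (f) {
                                if (fscanf(f, "%lld", &cur) == 1) {
                                        if (prev >= 0 && cur > prev)
                                                fprintf(stderr,
                                                        "errors_count: %lld\n",
                                                        cur);
                                        prev = cur;
                                }
                                fclose(f);
                        }
                        sleep(5);
                }
        }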

Thanks,
Amir.
