On 15/9/23 11:06, Amir Goldstein wrote:
On Fri, Sep 15, 2023 at 4:20 AM Ian Kent <raven@xxxxxxxxxx> wrote:
On 14/9/23 14:47, Amir Goldstein wrote:
On Wed, Sep 13, 2023 at 6:22 PM Miklos Szeredi <mszeredi@xxxxxxxxxx> wrote:
Implement the mount querying syscalls agreed on at LSF/MM 2023. This is an
RFC with just x86_64 syscalls.
Excepting notification this should allow full replacement for
parsing /proc/self/mountinfo.
Since you mentioned notifications, I will add that the plan discussed
at LSF/MM was, once we have an API to query mount stats and children,
implement fanotify events for:
mount [mntuid] was un/mounted at [parent mntuid],[dirfid+name]
As with other fanotify events, the self mntuid and dirfid+name
information can be omitted and without it, multiple un/mount events
from the same parent mntuid will be merged, allowing userspace
to periodically listmnt() only those mntuids whose child mounts have changed,
with little risk of event queue overflow.
The possible monitoring scopes would be the entire mount namespace
of the monitoring program or watching a single mount for changes in
its child mounts. The latter is similar to an inotify directory children watch,
where the watches need to be set recursively, with all the weight on
userspace to avoid races.
It's been my belief that the existing notification mechanisms don't
quite fully satisfy the needs of users of these calls (i.e. the need
I found when implementing David's original calls into systemd).
Specifically the ability to process a batch of notifications at once.
Admittedly the notifications mechanism that David originally implemented
didn't fully implement what I found I needed but it did provide for a
settable queue length and getting a batch of notifications at a time.
Am I mistaken in my belief?
I am not sure I understand the question.
fanotify has an event queue (16K events by default), but it can
also use unlimited size.
With a limited size queue, event queue overflow generates an
overflow event.
Event listeners can read a batch of events, depending on
the size of the buffer that they provide.
When multiple events with the same information are queued,
for example "something was un/mounted over parent mntuid 100",
fanotify will merge all those events in the queue and the
event listeners will get only one such event in the batch.
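
To illustrate the queue behaviour described above, here is a toy
user-space model in C. It is only a sketch and not the fanotify
implementation: the names toy_event, toy_queue, toy_queue_push and
toy_queue_read are hypothetical, the queue limit is shrunk to 4 so that
overflow is easy to trigger, and overflow is modelled as a flag rather
than as a queued overflow event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event: "something was un/mounted under parent_id". */
struct toy_event {
    uint64_t parent_id;
};

#define TOY_QUEUE_MAX 4 /* illustrative limit, far below any real default */

struct toy_queue {
    struct toy_event ev[TOY_QUEUE_MAX];
    size_t len;
    bool overflowed; /* stands in for a queued overflow event */
};

/*
 * Queue one event. An event carrying the same information as one that
 * is already queued is merged (dropped). A new event that does not fit
 * sets the overflow flag instead of being queued.
 */
void toy_queue_push(struct toy_queue *q, uint64_t parent_id)
{
    assert(q != NULL);
    for (size_t i = 0; i < q->len; i++) {
        if (q->ev[i].parent_id == parent_id)
            return; /* merged with the queued event */
    }
    if (q->len == TOY_QUEUE_MAX) {
        q->overflowed = true;
        return;
    }
    q->ev[q->len++].parent_id = parent_id;
}

/*
 * Copy up to cap events into out, oldest first, and remove them from
 * the queue. The batch size depends only on the buffer the caller
 * provides. The overflow flag is left for the caller to check and clear.
 */
size_t toy_queue_read(struct toy_queue *q, struct toy_event *out, size_t cap)
{
    assert(q != NULL && out != NULL);
    size_t n = q->len < cap ? q->len : cap;

    for (size_t i = 0; i < n; i++)
        out[i] = q->ev[i];
    for (size_t i = n; i < q->len; i++)
        q->ev[i - n] = q->ev[i];
    q->len -= n;
    return n;
}
```

In this model, pushing a thousand events for parent 100 leaves a single
queued event, so a listener that rescans the children of parent 100 does
that once per batch rather than once per mount.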
Don't misunderstand me, it would be great for the existing notification
mechanisms to support these system calls, I just have a specific use case
in mind that I think is important, at least to me.
Please explain the use case and your belief about existing fanotify
limitations. I did not understand it.
Yes, it's not obvious, I'll try and explain it more clearly.
I did some work to enable systemd to use the original fsinfo() call
and the notifications system David had written.
My use case was perhaps unrealistic but I have seen real world reports
with similar symptoms and autofs usage can behave like this usage at
times as well so it's not entirely manufactured. The use case is basically
when there are a large number of mounts occurring for a sustained amount
of time.
Anyway, systemd processes get notified when there is mount activity and
it then reads the mount table to update its state. I observed there are
usually 3 separate systemd processes monitoring mount table changes and,
under the above load, they use around 80-85% of a CPU each.
Thing is systemd is actually pretty good at processing notifications so
when there is sustained mount activity and the fsinfo() call is used, the
load changes from processing the table to processing notifications. The
load goes down to a bit over 40% for each process.
But if you can batch those notifications, like introduce a high water
mark (yes I know this is not at all simple and I'm by no means suggesting
this is all that needs to be done), to get a bunch of these notifications
at once, the throughput increases quite a bit. In my initial testing, adding
a delay of 10 or 20 milliseconds before fetching the queue of notifications
and processing them reduced CPU usage to around 8% per process.
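
The delay-then-drain experiment above can be sketched roughly as
follows. This is not the code I tested with systemd; it is a minimal
illustration that assumes notifications arrive as fixed-size 64-bit
records on a readable file descriptor (a pipe is enough to try it), and
drain_batched and settle_ms are hypothetical names.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Block until fd is readable, wait settle_ms so that more notifications
 * can accumulate, then fetch everything that fits in out with a single
 * read(). Returns the number of records read, or -1 on error.
 */
ssize_t drain_batched(int fd, uint64_t *out, size_t cap, int settle_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    assert(out != NULL && cap > 0);

    /* Sleep until the first notification shows up. */
    if (poll(&pfd, 1, -1) < 0)
        return -1;

    /* poll() with no descriptors is used here as a millisecond sleep. */
    if (settle_ms > 0)
        poll(NULL, 0, settle_ms);

    /* One read picks up the whole batch queued during the delay. */
    ssize_t n = read(fd, out, cap * sizeof(*out));
    if (n < 0)
        return -1;
    return n / (ssize_t)sizeof(*out);
}
```

A caller would loop on drain_batched(fd, buf, cap, 10) and handle each
returned batch in one pass, so the per-wakeup cost is paid once per
batch instead of once per notification.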
What I'm saying is I've found that system calls to get the information
directly aren't all that's needed to improve the scalability.
Ian