On Thu, Mar 23, 2023 at 1:18 PM Bernd Schubert <bernd.schubert@xxxxxxxxxxx> wrote:
>
> On 3/21/23 10:35, Amir Goldstein wrote:
> > On Tue, Mar 21, 2023 at 3:11 AM Bernd Schubert <bschubert@xxxxxxx> wrote:
> >>
> >> This adds support for uring communication between kernel and
> >> userspace daemon using the IORING_OP_URING_CMD opcode. The basic
> >> approach was taken from ublk. The patches are in RFC state -
> >> I'm not sure about all decisions and some questions are marked
> >> with XXX.
> >>
> >> The userspace side has to send ioctl(s) to configure ring queue(s),
> >> and it has the choice to configure exactly one ring or one
> >> ring per core. If there are use cases, we can also consider
> >> allowing a different number of rings - the ioctl configuration
> >> option is rather generic (number of queues).
> >>
> >> Right now a queue lock is taken for any ring entry state change,
> >> mostly to correctly handle unmount/daemon-stop. In fact,
> >> correctly stopping the ring took most of the development
> >> time - new corner cases kept coming up.
> >> I have run dozens of xfstest cycles; in earlier versions
> >> I once saw a warning about the ring start_stop mutex being
> >> in the wrong state - probably another stop issue,
> >> but I have not been able to track it down yet.
> >> Regarding the queue lock - I still need to do profiling, but
> >> my assumption is that it should not matter for the
> >> one-ring-per-core configuration. For the single-ring config
> >> option lock contention might come up, but I see this
> >> configuration mostly for development only.
> >> Adding more complexity and protecting ring entries with
> >> their own locks can be done later.
> >>
> >> Current code also keeps the fuse request allocation; initially
> >> I only had that for background requests when the ring queue
> >> didn't have free entries anymore. The allocation is done
> >> to reduce initial complexity, especially also for ring stop.
> >> The allocation-free mode can be added back later.
> >>
> >> Right now the ring queue of the submitting core is always used;
> >> especially for page-cached background requests
> >> we might consider later to also enqueue on other core queues
> >> (when these are not busy, of course).
> >>
> >> Splice/zero-copy is not supported yet, all requests go
> >> through the shared memory queue entry buffer. I'm also
> >> following the splice and ublk/zc copy discussions and will
> >> look into these options in the next days/weeks.
> >> To have that buffer allocated on the right numa node,
> >> a vmalloc is done per ring queue, on the numa node the
> >> userspace daemon side asks for.
> >> My assumption is that the mmap offset parameter will be
> >> part of a debate and I'm curious what others think about
> >> that approach.
> >>
> >> Benchmarking and tuning are on my agenda for the next
> >> days. For now I only have xfstest results - most longer
> >> running tests were running at about 2x, but somehow when
> >> I cleaned up the patches for submission I lost that.
> >> My development VM/kernel has all sanitizers enabled -
> >> hard to profile what happened. Performance
> >> results with profiling will be submitted in a few days.
> >
> > When posting those benchmarks and with future RFC postings,
> > it would be useful for people reading this introduction for the
> > first time to explicitly state the motivation of your work, which
> > can only be inferred from the mention of "benchmarks".
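
For readers coming to this thread cold, the ublk-style flow the cover
letter describes looks roughly like the sketch below from the daemon's
side. Everything FUSE-specific in it - FUSE_URING_REQ_FETCH,
struct fuse_ring_entry_cmd, the queue fd - is a made-up placeholder
rather than the RFC's actual ABI; only the liburing calls and the
generic SQE fields are real.

	/*
	 * Hypothetical daemon-side sketch of the ublk-style uring_cmd flow.
	 * FUSE_URING_REQ_FETCH, struct fuse_ring_entry_cmd and the queue fd
	 * semantics are illustrative assumptions, not the RFC's real ABI.
	 */
	#include <liburing.h>
	#include <string.h>

	#define FUSE_URING_REQ_FETCH 1          /* assumed command opcode */

	struct fuse_ring_entry_cmd {            /* assumed per-entry payload */
		unsigned int qid;               /* ring queue index */
		unsigned int tag;               /* entry index within the queue */
	};

	/* Hand one ring entry to the kernel; the SQE completes when a
	 * FUSE request has been placed into that entry's shared buffer. */
	static int fetch_request(struct io_uring *ring, int queue_fd,
				 struct fuse_ring_entry_cmd *ent)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		if (!sqe)
			return -1;

		memset(sqe, 0, sizeof(*sqe));
		sqe->opcode = IORING_OP_URING_CMD;
		sqe->fd     = queue_fd;                 /* fuse queue/device fd */
		sqe->cmd_op = FUSE_URING_REQ_FETCH;
		/* inline command area; payloads larger than 16 bytes would
		 * additionally need IORING_SETUP_SQE128 */
		memcpy(sqe->cmd, ent, sizeof(*ent));
		io_uring_sqe_set_data(sqe, ent);

		return 0;
	}

In the real series the daemon would additionally mmap() the per-queue
shared buffer and issue the queue-configuration ioctl mentioned above
before handing entries to the kernel; the sketch only shows the
per-entry submission.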
> >
> > I think it would also be useful to link to prior work (ZUFS, fuse2)
> > and mention the current FUSE performance issues related to
> > context switches and cache line bouncing that were discussed in
> > those threads.
>
> Oh yes sure, entirely forgot to mention the motivation. Will do in the
> next patch round. You don't have these links by any chance?

I know that I have this thread which you are on:
https://lore.kernel.org/linux-fsdevel/CAJfpegtjEoE7H8tayLaQHG9fRSBiVuaspnmPr2oQiOZXVB1+7g@xxxxxxxxxxxxxx/

> there were several zufs threads, but I don't remember discussions about
> cache line bouncing - maybe I had missed it. I can try to read through
> the old threads, in case you don't have it.

Miklos talked about it somewhere...

> Our own motivation for the ring basically comes from atomic-open
> benchmarks, which gave totally confusing benchmark results in
> multi-threaded libfuse mode - fewer requests caused lower IOPS -
> while switching to single-threaded mode then gave the expected IOPS
> increase. Part of it was due to a libfuse issue - persistent thread
> destruction/creation due to min_idle_threads - but another part can
> be explained by thread switching alone. When I added (slight)
> spinning in fuse_dev_do_read(), the hard/impossible part was to avoid
> letting multiple threads spin - even with a single-threaded
> application creating/deleting files (like bonnie++), multiple libfuse
> threads started to spin for no good reason. So spinning resulted in
> much improved latency, but high cpu usage, because multiple threads
> were spinning. I will add those explanations to the next patch set.

That would be great.

Thanks,
Amir.
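
As a side note on the spinning experiment above: the "only one thread
may spin" constraint can be pictured with a small generic sketch. This
is not the actual fuse_dev_do_read() change - the atomic flag, the spin
bound and the request_ready() callback are all illustrative
assumptions - but it shows the shape of the problem: only the thread
that wins the flag burns CPU polling, everyone else should fall back to
a blocking wait immediately.

	/*
	 * Generic illustration of bounded spinning with at most one spinner.
	 * Not the FUSE kernel code; names, the spin bound and request_ready()
	 * are illustrative assumptions.
	 */
	#include <stdatomic.h>
	#include <stdbool.h>

	static atomic_flag spinner_active = ATOMIC_FLAG_INIT;

	/* Returns true if a request showed up while spinning. */
	static bool try_spin_for_request(bool (*request_ready)(void))
	{
		bool found = false;

		/* Allow only one thread at a time to burn CPU here. */
		if (atomic_flag_test_and_set(&spinner_active))
			return false;   /* someone else is already spinning */

		for (int i = 0; i < 10000 && !found; i++)
			found = request_ready();

		atomic_flag_clear(&spinner_active);
		return found;
	}

	/*
	 * Callers that get false back should fall back to a blocking wait
	 * (in the kernel, a wait_event-style primitive).
	 */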