On 6/11/24 10:20, Miklos Szeredi wrote:
> On Wed, 29 May 2024 at 20:01, Bernd Schubert <bschubert@xxxxxxx> wrote:
>>
>> From: Bernd Schubert <bschubert@xxxxxxx>
>>
>> This adds support for uring communication between kernel and
>> userspace daemon using the opcode IORING_OP_URING_CMD. The basic
>> approach was taken from ublk. The patches are in RFC state,
>> some major changes are still to be expected.
>
> Thank you very much for tackling this. I think this is an important
> feature and one that could potentially have a significant effect on
> fuse performance, which is something many people would love to see.
>
> I'm thinking about the architecture and there are some questions:
>
> Have you tried just plain IORING_OP_READV / IORING_OP_WRITEV? That
> would just be the async part, without the mapped buffer. I suspect
> most of the performance advantage comes from this and the per-CPU
> queue, not from the mapped buffer, yet most of the complexity seems
> to be related to the mapped buffer.

I didn't try it, because IORING_OP_URING_CMD seems to be exactly made
for our use case.

Firstly, although I didn't have time to look into it yet, with the
current approach it should be rather simple to switch to another ring,
as Kent has suggested.

Secondly, with IORING_OP_URING_CMD we already have only a single
command to submit a result and fetch the next request - half of the
system calls.

Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just the approach of
https://github.com/uroni/fuseuring? I.e. it would hook into the
existing fuse code and just change from read()/write() of /dev/fuse to
io-uring of /dev/fuse, with the disadvantage of zero control over
which ring/queue and which ring entry handles the request.

Thirdly, initially I had even removed the allocation of 'struct
fuse_req' and directly allocated these on available ring entries, i.e.
the application thread was writing to the mmap'ed ring buffer.
I just removed that code for now, as it introduced additional
complexity with an unknown performance outcome. If it should turn out
to be helpful, we could add it back later. I don't think we have that
flexibility with IORING_OP_READV/IORING_OP_WRITEV.

And then, personally I do not see the mmap complexity - it is just
very convenient to write/read to/from the ring buffer from any kernel
thread. Currently not supported by io-uring, but I think we could even
avoid any kind of system call and let the application poll for
results. Similar to what IORING_SETUP_SQPOLL does, but without the
requirement of another kernel thread.

Most of the complexity and issues I got came from the requirement of
io_uring to complete requests with io_uring_cmd_done(). In RFCv1 you
had annotated the async shutdown method - that was indeed really
painful and resulted in a never ending number of shutdown races. Once
I removed that and also stopped using a bitmap (I don't even know why
I used a bitmap in the first place in RFCv1 and not lists as in
RFCv2), shutdown became manageable. If there were a way to tell
io-uring that kernel fuse is done and to let it complete by itself
whatever was not completed yet, that would be of great help. Although
from my point of view, that could be done once this series is merged.
Or we decide to switch to Kent's new ring buffer and might not have
that problem at all...

> Maybe there's an advantage in using an atomic op for WRITEV + READV,
> but I'm not quite seeing it yet, since there's no syscall overhead
> for separate ops.

Here I get confused, could you please explain?

Current fuse has a read + write pattern - a read() system call to
fetch a fuse request, a write() call to submit the result and then
read() again to fetch the next request. If userspace has to send
IORING_OP_READV to fetch a request and complete it with
IORING_OP_WRITEV, wouldn't it go through the existing code path with
two operations?
Well, maybe userspace could submit with IOSQE_IO_LINK, but that sounds
like it would need to send two ring entries, i.e. memory and
processing overhead? And then there would be no way to further
optimize and do fuse_req allocation on the ring (if there are free
entries). And probably also no way that we could ever let the
application work in the SQPOLL way, because the application thread
does not have the right to read from the fuse-server buffer? I.e. what
I mean is that IORING_OP_URING_CMD gives better flexibility.

Btw, another issue that is avoided with the new ring-request layout is
compatibility and alignment. The fuse header is always in a 4K
section, and the request data follows then. I.e. extending request
sizes does not impose compatibility issues any more, and also for
passthrough and similar, O_DIRECT can be passed through to the backend
file system. Although these issues probably need to be solved in the
current fuse protocol as well.

> What's the reason for separate async and sync request queues?

To have credits for IO operations. For an overlay file system it might
not matter, because it might get stuck with another system call in the
underlying file system anyway. But for something that is not system
call bound and that has control, async and sync requests can be
handled with a priority given by userspace.

As I had written in the introduction mail, I'm currently working on
different IO sizes per ring queue - it gets even more fine grained,
with the advantage of reduced memory usage per queue when the queue
has entries for many small requests and a few large ones. The next
step here would be to add credits for reads/writes (or to reserve
credits for meta operations) in the sync queue, so that meta
operations can always go through. If there should be async meta
operations (through application io-uring requests?), the same would
need to be done for the async queue as well.

Last but not least, with the separation there is no global async queue
anymore - no global lock and cache issues.
>
>> Avoiding cache line bouncing / numa systems was discussed
>> between Amir and Miklos before and Miklos had posted
>> part of the private discussion here
>> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@xxxxxxxxxxxxxx/
>>
>> This cache line bouncing should be addressed by these patches
>> as well.
>
> Why do you think this is solved?

I _guess_ that writing to the mmap'ed buffer and processing requests
on the same cpu core should make it possible to keep things in the cpu
cache. I did not verify that with perf, though.

>> I had also noticed waitq wake-up latencies in fuse before
>> https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@xxxxxxxxxxx/T/
>>
>> This spinning approach helped with performance (>40% improvement
>> for file creates), but due to random server side thread/core
>> utilization spinning cannot be well controlled in /dev/fuse mode.
>> With fuse-over-io-uring requests are handled on the same core
>> (sync requests) or on core+1 (large async requests) and performance
>> improvements are achieved without spinning.
>
> I feel this should be a scheduler decision, but selecting the
> queue needs to be based on that decision. Maybe the scheduler people
> can help out with this.

For sync requests, getting the scheduler involved is what is
responsible for making fuse really slow. It schedules on random cores
that are in sleep states, and additionally frequency scaling does not
go up. We really need to stay on the same core here, as that core is
submitting the result, is already running (i.e. not sleeping) and has
the data in its cache. All benchmark results with sync requests point
that out.

For async requests, the above still applies, but basically one thread
is writing/reading and the other thread handles/provides the data.
Random switching of cores is then still not good. At best, and to be
tested, one would submit rather large chunks to other cores.
What is indeed to be discussed (and I think annotated in the
corresponding patch description) is whether there is a way to figure
out if the other core is already busy. But then the scheduler does not
know what it is busy with - are these existing fuse requests or
something else? That part is really hard, and I don't think it makes
sense to discuss it right now before the main part is merged. IMHO, it
is better to add a config flag for the current cpu+1 scheduling, with
an annotation that this setting might go away in the future.

Thanks,
Bernd