On Fri, Jul 10, 2020 at 7:39 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>
> On 7/10/20 7:10 AM, Christoph Hellwig wrote:
> > On Fri, Jul 10, 2020 at 12:35:43AM +0530, Kanchan Joshi wrote:
> >> Append required special treatment (conversion for sector to bytes) for io_uring.
> >> And we were planning a user-space wrapper to abstract that.
> >>
> >> But good part (as it seems now) was: append result went along with cflags at
> >> virtually no additional cost. And uring code changes became super clean/minimal
> >> with further revisions.
> >> While indirect-offset requires doing allocation/mgmt in application,
> >> io-uring submission and in completion path (which seems trickier), and
> >> those CQE flags still get written user-space and serve no purpose for
> >> append-write.
> >
> > I have to say that storing the results in the CQE generally make
> > so much more sense.  I wonder if we need a per-fd "large CQE" flag
> > that adds two extra u64s to the CQE, and some ops just require this
> > version.
>
> I have been pondering the same thing, we could make certain ops consume
> two CQEs if it makes sense. It's a bit ugly on the app side with two
> different CQEs for a request, though. We can't just treat it as a large
> CQE, as they might not be sequential if we happen to wrap. But maybe
> it's not too bad.

I did some work on the two-CQE scheme for zone-append. The first CQE is
the same as before, while the second CQE does not carry res/flags and
instead holds a 64-bit result that reports the append location.
It would look like this -

 struct io_uring_cqe {
 	__u64	user_data;	/* sqe->data submission passed back */
-	__s32	res;		/* result code for this event */
-	__u32	flags;
+	union {
+		struct {
+			__s32	res;	/* result code for this event */
+			__u32	flags;
+		};
+		__u64	append_res;	/* only used for append, in secondary cqe */
+	};

And the kernel will produce two CQEs for an append completion -

 static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags)
 {
-	struct io_uring_cqe *cqe;
+	struct io_uring_cqe *cqe, *cqe2 = NULL;
-	cqe = io_get_cqring(ctx);
+	if (unlikely(req->flags & REQ_F_ZONE_APPEND))
+		/* obtain two CQEs for append. NULL if two CQEs are not available */
+		cqe = io_get_two_cqring(ctx, &cqe2);
+	else
+		cqe = io_get_cqring(ctx);
+
 	if (likely(cqe)) {
 		WRITE_ONCE(cqe->user_data, req->user_data);
 		WRITE_ONCE(cqe->res, res);
 		WRITE_ONCE(cqe->flags, cflags);
+		/* update secondary cqe for zone-append */
+		if (req->flags & REQ_F_ZONE_APPEND) {
+			WRITE_ONCE(cqe2->append_res,
+				(u64)req->append_offset << SECTOR_SHIFT);
+			WRITE_ONCE(cqe2->user_data, req->user_data);
+		}
 		mutex_unlock(&ctx->uring_lock);

This seems to go fine in the kernel.
But the application will see a few differences, such as:

- When it submits N appends and decides to wait for all completions, it
needs to specify min_complete as 2*N (or at least 2N-1).
Two appends will produce 4 completion events, and if the application
decides to wait for both, it must specify 4 (or 3).

io_uring_enter(unsigned int fd, unsigned int to_submit,
               unsigned int min_complete, unsigned int flags,
               sigset_t *sig);

- Completion-processing sequence for a mixed workload (a few reads + a few
appends, on the same ring).
Currently there is a one-to-one relationship. The application looks at N
CQE entries and treats each as a distinct IO completion; a for loop does
the work.
With the two-CQE scheme, the application has to pick out, from a batch of
completions, the ones for reads (one CQE) and the ones for appends (two
CQEs), so the flow gets somewhat non-linear.
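To make the second point a bit more concrete, here is a rough userspace sketch of what the reap loop might turn into. It is only an illustration and does not run against any kernel: struct cqe_sketch mirrors the proposed layout, and the UD_APPEND tag in user_data, struct completion, min_complete_for() and reap() are all made-up names. It also assumes the two CQEs of an append land in consecutive ring slots (modulo wrap), which is what obtaining both at once would suggest but is not something the patch above guarantees.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace mirror of the proposed CQE layout (hypothetical, not a merged ABI). */
struct cqe_sketch {
	uint64_t user_data;
	union {
		struct {
			int32_t  res;
			uint32_t flags;
		};
		uint64_t append_res;	/* valid only in the secondary CQE of an append */
	};
};

/* One logical completion, as the application would like to see it. */
struct completion {
	uint64_t user_data;
	int32_t  res;
	uint32_t flags;
	int      is_append;
	uint64_t append_off;	/* byte offset, valid only when is_append is set */
};

/* Assumed app-side convention: top bit of user_data tags an append request. */
#define UD_APPEND (1ULL << 63)

/* min_complete needed to wait for every event: reads give 1 CQE, appends 2. */
static unsigned min_complete_for(unsigned nr_reads, unsigned nr_appends)
{
	return nr_reads + 2 * nr_appends;
}

/*
 * Walk 'avail' CQEs starting at 'head' in a ring of 'entries' slots
 * ('entries' must be a power of two). Logical completions are written to
 * 'out', which needs room for 'avail' items. Returns the number of CQEs
 * consumed and stores the number of logical completions in *nr_out.
 * If only the first CQE of an append is visible, the loop stops there
 * and leaves that CQE unconsumed.
 */
static unsigned reap(const struct cqe_sketch *ring, unsigned entries,
		     unsigned head, unsigned avail,
		     struct completion *out, unsigned *nr_out)
{
	unsigned mask = entries - 1, used = 0, n = 0;

	assert(entries != 0 && (entries & mask) == 0);

	while (used < avail) {
		const struct cqe_sketch *c = &ring[(head + used) & mask];
		struct completion *o = &out[n];

		o->user_data = c->user_data;
		o->res = c->res;
		o->flags = c->flags;
		o->is_append = (c->user_data & UD_APPEND) != 0;
		o->append_off = 0;

		if (o->is_append) {
			if (avail - used < 2)
				break;	/* secondary CQE not posted yet */
			/* the secondary CQE may sit at slot 0 after a wrap */
			o->append_off = ring[(head + used + 1) & mask].append_res;
			used += 2;
		} else {
			used += 1;
		}
		n++;
	}

	*nr_out = n;
	return used;
}
```

The caller would advance the CQ head by the returned count rather than by the number of logical completions, which is where the one-to-one assumption breaks.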

Perhaps this is not too bad, but I felt it should be put here upfront.

--
Kanchan Joshi