Re: [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 9/21/22 13:18, Stefan Metzmacher wrote:

Hi Pavel,

If network sends anything it should return how many bytes
it queued for sending, otherwise there would be duplicated
packets / data on the other endpoint in userspace, and I
don't think any driver / lower layer would keep memory
after returning an error.

As I'm also working on a socket driver for smbdirect,
I already thought about how I could hook into
IORING_OP_SEND[MSG]_ZC, and for sendmsg I'd have
a loop sending individual fragments, which have a reference,
but if I find a connection drop after the first one, I'd
return ECONNRESET or EPIPE in order to get faster recovery
instead of announcing a short write to the caller.

I doesn't sound right for me, but neither I know samba to
really have an opinion. In any case, I see how it may be
more robust if we always try to push a notification cqe.
Will you send a patch?

You mean the IORING_OP_SEND_ZC should always
issue a NOTIF cqe, one it passed the io_sendzc_prep stage?

Currently we add F_MORE iff cqe->res >= 0, but I want to
decouple them so users don't try to infer one from another.

In theory, it's already a subset of this, but it sounds
like a good idea to always emit notifications (when we can)
to stop users making assumptions about it. And should also
be cleaner.

If we would take my 5/5 we could also have a different
strategy to check decide if MORE/NOTIF is needed.
If notif->cqe.res is still 0 and io_notif_flush drops
the last reference we could go without MORE/NOTIF at all.
In all other cases we'd either set MORE/NOTIF at the end
of io_sendzc of in the fail hook.

I had a similar optimisation, i.e. when io_notif_flush() in
the submission path is dropping the last ref, but killed it
as it was completely useless, I haven't hit this path even
once even with UDP, not to mention TCP.

If I remember correctly I hit it all the time on loopback,
but I'd have to recheck.

Yeah, I meant real network in particular. I've seen it with
virtual interfaces, but for instance loopback is not interesting
as it doesn't support zerocopy in the first place. You was
probably testing with a modified skb_orphan_frags_rx().

In any case, I was looking on a bit different problem, but
it should look much cleaner using the same approach, see
branch [1], and patch [3] for sendzc in particular.

[1] https://github.com/isilence/linux.git partial-fail
[2] https://github.com/isilence/linux/tree/io_uring/partial-fail
[3] https://github.com/isilence/linux/commit/acb4f9bf869e1c2542849e11d992a63d95f2b894

     const struct io_op_def *def = &io_op_defs[req->opcode];

     req_set_fail(req);
     io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED));
     if (def->fail)
         def->fail(req);
     io_req_complete_post(req);

Will loose req->cqe.flags, but the fail hook in general looks like a good idea.

I just don't like those sporadic changes all across core io_uring
code also adding some overhead.

And don't we care about the other failure cases where req->cqe.flags gets overwritten?

We don't usually carry them around ->issue handler boundaries,
e.g. directly do io_post_aux_cqe(res, IORING_CQE_F_MORE);

IORING_CQE_F_BUFFER is a bit more trickier, but there is
special handling for this one and it wouldn't fit "set cflags
in advance" logic anyway.

iow, ->fail callback sounds good enough for now, we'll change
it later if needed.

The fail hook should re-add the MORE flag?

Never set CQE_SKIP for notifications and add MORE flag in the
fail hook, sounds good.


So I'll try to do the following changes:

1. take your ->fail() patches

2. once io_sendzc_prep() is over always trigger MORE/NOFIF cqes
    (But the documentation should still make it clear that
     userspace have to cope with just a single cqe (without MORE)
     for both successs and failure, so that we can improve things later)

I've sent a liburing man patch, should be good enough.

3. Can I change the cqe.res of the NOTIF cqe to be 0xffffffff ?
    That would indicate to userspace that we don't give any information
    if zero copy was actually used or not. This would present someone
    from relying on cqe.res == 0 and we can improve it by providing
    more useful values in future.

I don't get the difference, someone just as easily may rely on
0xf..ff. What does it gives us? 0 usually suits better as default.

Are you ok with that plan for 6.0?

It's late 1-2 are better to be 6.1 + stable. It doesn't break uapi,
so it's fine. It's not even strictly necessary to back port it
(but still a good idea to do it).

--
Pavel Begunkov



[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux