On 20.12.2022 11:33, Stefano Garzarella wrote: > On Tue, Dec 20, 2022 at 07:14:27AM +0000, Arseniy Krasnov wrote: >> On 19.12.2022 18:41, Stefano Garzarella wrote: >> >> Hello! >> >>> Hi Arseniy, >>> >>> On Sat, Dec 17, 2022 at 8:42 PM Arseniy Krasnov <AVKrasnov@xxxxxxxxxxxxxx> wrote: >>>> >>>> Hello, >>>> >>>> seems I found strange thing(may be a bug) where sender('tx' later) and >>>> receiver('rx' later) could stuck forever. Potential fix is in the first >>>> patch, second patch contains reproducer, based on vsock test suite. >>>> Reproducer is simple: tx just sends data to rx by 'write() syscall, rx >>>> dequeues it using 'read()' syscall and uses 'poll()' for waiting. I run >>>> server in host and client in guest. >>>> >>>> rx side params: >>>> 1) SO_VM_SOCKETS_BUFFER_SIZE is 256Kb(e.g. default). >>>> 2) SO_RCVLOWAT is 128Kb. >>>> >>>> What happens in the reproducer step by step: >>>> >>> >>> I put the values of the variables involved to facilitate understanding: >>> >>> RX: buf_alloc = 256 KB; fwd_cnt = 0; last_fwd_cnt = 0; >>> free_space = buf_alloc - (fwd_cnt - last_fwd_cnt) = 256 KB >>> >>> The credit update is sent if >>> free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE [64 KB] >>> >>>> 1) tx tries to send 256Kb + 1 byte (in a single 'write()') >>>> 2) tx sends 256Kb, data reaches rx (rx_bytes == 256Kb) >>>> 3) tx waits for space in 'write()' to send last 1 byte >>>> 4) rx does poll(), (rx_bytes >= rcvlowat) 256Kb >= 128Kb, POLLIN is set >>>> 5) rx reads 64Kb, credit update is not sent due to * >>> >>> RX: buf_alloc = 256 KB; fwd_cnt = 64 KB; last_fwd_cnt = 0; >>> free_space = 192 KB >>> >>>> 6) rx does poll(), (rx_bytes >= rcvlowat) 192Kb >= 128Kb, POLLIN is set >>>> 7) rx reads 64Kb, credit update is not sent due to * >>> >>> RX: buf_alloc = 256 KB; fwd_cnt = 128 KB; last_fwd_cnt = 0; >>> free_space = 128 KB >>> >>>> 8) rx does poll(), (rx_bytes >= rcvlowat) 128Kb >= 128Kb, POLLIN is set >>>> 9) rx reads 64Kb, credit update is not sent due to * >>> >>> Right, (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) is still false. >>> >>> RX: buf_alloc = 256 KB; fwd_cnt = 196 KB; last_fwd_cnt = 0; >>> free_space = 64 KB >>> >>>> 10) rx does poll(), (rx_bytes < rcvlowat) 64Kb < 128Kb, rx waits in poll() >>> >>> I agree that the TX is stuck because we are not sending the credit >>> update, but also if RX sends the credit update at step 9, RX won't be >>> woken up at step 10, right? >> >> Yes, RX will sleep, but TX will wake up and as we inform TX how much >> free space we have, now there are two cases for TX: >> 1) send "small" rest of data(e.g. without blocking again), leave 'write()' >> and continue execution. RX still waits in 'poll()'. Later TX will >> send enough data to wake up RX. >> 2) send "big" rest of data - if rest is too big to leave 'write()' and TX >> will wait again for the free space - it will be able to send enough data >> to wake up RX as we compared 'rx_bytes' with rcvlowat value in RX. > > Right, so I'd update the test to behave like this. Sorry, You mean vsock_test? To cover TX waiting for free space at RX, thus checking this kernel patch logic? > And I'd explain better the problem we are going to fix in the commit message. Ok > >>> >>>> >>>> * is optimization in 'virtio_transport_stream_do_dequeue()' which >>>> sends OP_CREDIT_UPDATE only when we have not too much space - >>>> less than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE. >>>> >>>> Now tx side waits for space inside write() and rx waits in poll() for >>>> 'rx_bytes' to reach SO_RCVLOWAT value. Both sides will wait forever. I >>>> think, possible fix is to send credit update not only when we have too >>>> small space, but also when number of bytes in receive queue is smaller >>>> than SO_RCVLOWAT thus not enough to wake up sleeping reader. I'm not >>>> sure about correctness of this idea, but anyway - I think that problem >>>> above exists. What do You think? >>> >>> I'm not sure, I have to think more about it, but if RX reads less than >>> SO_RCVLOWAT, I expect it's normal to get to a case of stuck. >>> >>> In this case we are only unstucking TX, but even if it sends that single >>> byte, RX is still stuck and not consuming it, so it was useless to wake >>> up TX if RX won't consume it anyway, right? >> >> 1) I think it is not useless, because we inform(not just wake up) TX that >> there is free space at RX side - as i mentioned above. >> 2) Anyway i think that this situation is a little bit strange: TX thinks that >> there is no free space at RX and waits for it, but there is free space at RX! >> At the same time, RX waits in poll() forever - it is ready to get new portion >> of data to return POLLIN, but TX "thinks" exactly opposite thing - RX is full >> of data. Of course, if there will be just stalls in TX data handling - it will >> be ok - just performance degradation, but TX stucks forever. > > We did it to avoid a lot of credit update messages. Yes, i see > Anyway I think here the main point is why RX is setting SO_RCVLOWAT to 128 KB and then reads only half of it? > > So I think if the users set SO_RCVLOWAT to a value and then RX reads less then it, is expected to get stuck. That a really interesting question, I've found nothing about this case in Google(not sure for 100%) or POSIX. But, i can modify reproducer: it sets SO_RCVLOWAT to 128Kb BEFORE entering its last poll where it will stuck. In this case behaviour looks more legal: it uses default SO_RCVLOWAT of 1, read 64Kb each time. Finally it sets SO_RCVLOWAT to 128Kb(and imagine that it prepares 128Kb 'read()' buffer) and enters poll() - we will get same effect: TX will wait for space, RX waits in 'poll()'. > > Anyway, since the change will not impact the default behaviour (SO_RCVLOWAT = 1) we can merge this patch, but IMHO we need to explain the case better and improve the test. I see, of course I'm not sure about this change, just want to ask someone who knows this code better > >> >>> >>> If RX woke up (e.g. SO_RCVLOWAT = 64KB) and read the remaining 64KB, >>> then it would still send the credit update even without this patch and >>> TX will send the 1 byte. >> >> But how RX will wake up in this case? E.g. it calls poll() without timeout, >> connection is established, RX ignores signal > > RX will wake up because SO_RCVLOWAT is 64KB and there are 64 KB in the buffer. Then RX will read it and send the credit update to TX because > free_space is 0. IIUC, i'm talking about 10 steps above, e.g. RX will never wake up, because TX is waiting for space. > > Thanks, > Stefano > Thanks, Arseniy