Hi Mike,

>>>> On 6/8/22 9:16 AM, Dmitriy Bogdanov wrote:
>>>>> On 6/7/22 10:55 AM, Mike Christie wrote:
>>>>>> On 6/7/22 8:19 AM, Dmitry Bogdanov wrote:
>>>>>>> In function iscsi_data_xmit (TX worker) there is walking through the
>>>>>>> queue of new SCSI commands that is replenished in parallel. And only
>>>>>>> after that queue gets emptied will the function start sending pending
>>>>>>> DataOut PDUs. That leads to the DataOut timer timing out on the target
>>>>>>> side and to connection reinstatement.
>>>>>>>
>>>>>>> This patch swaps walking through the new commands queue and the
>>>>>>> pending DataOut queue, to give preference to ongoing commands over
>>>>>>> new ones.
>>>>>>>
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>>  		task = list_entry(conn->cmdqueue.next, struct iscsi_task,
>>>>>>> @@ -1594,28 +1616,10 @@ static int iscsi_data_xmit(struct iscsi_conn *conn)
>>>>>>>  		 */
>>>>>>>  		if (!list_empty(&conn->mgmtqueue))
>>>>>>>  			goto check_mgmt;
>>>>>>> +		if (!list_empty(&conn->requeue))
>>>>>>> +			goto check_requeue;
>>>>>>
>>>>>>
>>>>>> Hey, I've been posting a similar patch:
>>>>>>
>>>>>> https://urldefense.com/v3/__https://www.spinics.net/lists/linux-scsi/msg156939.html__;!!ACWV5N9M2RV99hQ!LHLghPLuyBZadpsGme03-HBoowa8sNiZYMKxKoz5E_BNu-M9-BiuNV_JS9kFxhnumNfhrxuR7qVdIaOH5X7iTfMO$
>>>>>>
>>>>>> A problem I hit is a possible perf regression, so I tried to allow us
>>>>>> to start up a burst of cmds in parallel. It's pretty simple: we allow
>>>>>> up to a queue's worth of cmds to start. It doesn't try to check that
>>>>>> all cmds are from the same queue or anything fancy, to keep the code
>>>>>> simple. Mostly just assuming users might try to bunch cmds together
>>>>>> during submission or they might hit the queue plugging code.
>>>>>>
>>>>>> What do you think?
>>>>>
>>>>> Oh yeah, what about a modparam batch_limit? It's between 0 and
>>>>> cmd_per_lun. 0 would check after every transmission like above.
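(If I understand the idea right, such a knob would look roughly like the
sketch below. It is untested and only illustrative - the name batch_limit,
its default and where the counter lives are my assumptions, not your patch:)

static unsigned int batch_limit = 32;
module_param(batch_limit, uint, 0644);
MODULE_PARM_DESC(batch_limit,
	"New cmds to start before servicing pending DataOuts (0 = check after every cmd)");

static int iscsi_data_xmit(struct iscsi_conn *conn)
{
	unsigned int batched = 0;
	...
	while (!list_empty(&conn->cmdqueue)) {
		/* send the next new SCSI cmd PDU */
		...
		if (!list_empty(&conn->mgmtqueue))
			goto check_mgmt;
		/* after a burst of new cmds, go service pending DataOuts */
		if (!list_empty(&conn->requeue) && ++batched >= batch_limit)
			goto check_requeue;
	}
	...
}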
>>>>
>>>> Did you really face a perf regression? I could not imagine how that is
>>>> possible.
>>>> A DataOut PDU carries data too, so throughput cannot be decreased by
>>>> sending DataOut PDUs.
>>>
>>> We can agree that queue plugging and batching improve throughput, right?
>>> The app or block layer may try to batch commands. It could be with
>>> something like fio's batch args, or you hit the block layer queue
>>> plugging.
>>
>> I agree that those features definitely give a throughput improvement on
>> local devices on serial interfaces like SATA1. Since SATA2 (Native Command
>> Queuing), devices can reorder incoming commands to provide the best
>> throughput. SCSI, I guess, has had a similar feature from the very
>> beginning.
>> But on remote devices (iSCSI and FC) it is not 100% - the initiator's
>> command order may be reordered by the nature of the network protocol
>> itself. I mean 1 PDU vs R2T+DataOut PDUs, PDU resending due to CRC errors,
>> or something like that.
>
> I think we are talking about slightly different things. You are coming up
> with different possible scenarios where it doesn't work. I get them. You
> are correct for those cases. I'm not talking about those cases. I'm
> talking about the specific case I described.
>
> I'm saying we have targets where we use backends that get improved
> performance when they get batched cmds. When the network is ok, and the
> user's app is batching cmds, they come from the app down to the target's
> backend device as a batch. My concern is that with your patch we will no
> longer get that behavior.
>
> The reason is that the target and initiator can do:
>
> 1. initiator sends scsi cmd pdu1
> 2. target sends R2T
> 3. initiator sees R2T and hits the goto. Sends data.
> 4. target reads in data. Sends cmd to block layer.
> 5. initiator sends scsi cmd pdu2
> 6. target sends R2T
> 7. initiator reads in R2T. Sends data.
> 8. target reads in data and sends cmd to block layer.
>
> The problem here could be that between 4 and 8 the block layer has already
> run the queue and sent that one cmd to the real device, because we have
> that extra delay now. So when I implemented the fix in the same way as you
> and ran iostat, I would see lower aqu-sz, for example.
>
> With the current code we might not have that extra delay between 4 - 8, so
> I would see:
>
> 1. initiator sends scsi cmd pdu1
> 2. initiator sends scsi cmd pdu2
> 3. initiator sends scsi cmd pdu3
> 4. target reads in all those cmd pdus
> 5. target sends R2Ts for each cmd
> 6. initiator reads in all the R2Ts
> 7. initiator sends data for cmds 1 - 3
> 8. target reads in data and sends cmd1 to block layer
> 9. target reads in data and sends cmd2 to block layer
> 10. target reads in data and sends cmd3 to block layer
> 11. block layer/iosched hits the unplug limit and then sends those cmds to
>     the real device
>
> In this case the time between 8 - 9 is small, since we are just reading
> from the socket, building structs and passing the cmds down to the block
> layer. Here, when I'm watching the backend device with iostat, I see
> higher aqu-sz values.
>
> Note that you might not be seeing this with LIO, for example, because we
> plug/unplug on every cmd.
>
> When the network is congested, the lun is shared, we are doing retries,
> the machine gets bogged down, etc, then yeah, we might not get the above
> behavior. The user knows this and it's expected. It's not expected that
> they just update the kernel and get a perf drop in the normal case.
>
> How do we not give these users a regression while fixing the bug?

Now I've got your concern. Due to the change of the DataOut pattern, in some
specific cases there will be a throughput regression. I agree that this
should be addressed in my patch.

>>
>>> With the current code we can end up sending all cmds to the target in a
>>> way the target can send them to the real device batched. For example, we
>>> send off the initial N scsi command PDUs in one run of iscsi_data_xmit.
>>> The target reads them in, and sends off N R2Ts. We are able to read N
>>> R2Ts in the same call. And again we are able to send the needed data for
>>> them in one call of iscsi_data_xmit. The target is able to read in the
>>> data and send off the WRITEs to the physical device in a batch.
>>>
>>> With your patch, we can end up not batching them like the app/block
>>> layer intended. For example, we now call iscsi_data_xmit and, in the
>>> cmdqueue loop, we've sent N - M scsi cmd PDUs, then see that we've got
>>> an incoming R2T to handle. So we goto check_requeue. We send the needed
>>> data. The target then starts to send the cmd to the physical device. If
>>> we have read in multiple R2Ts then we will continue the requeue loop.
>>> And so we might be able to send the data fast enough that the target can
>>> then send those commands to the physical device. But we've now broken up
>>> the batching the upper layers sent to us and we were doing before.
>>
>> In my head everything is vice-versa :)
>
> :)
>
>> The current code breaks batching by sending new commands instead of
>> completing the transmission of current commands that could form their own
>> batch.
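(To illustrate what I mean: trimmed down, the TX worker today is structured
roughly like this - the real code has more details, this is just the ordering
that matters here:)

static int iscsi_data_xmit(struct iscsi_conn *conn)
{
	...
	/*
	 * New SCSI cmd PDUs. This queue is refilled in parallel, so with a
	 * steady stream of new cmds we may stay in this loop for a long time.
	 */
	while (!list_empty(&conn->cmdqueue)) {
		/* send one new cmd PDU */
		...
		if (!list_empty(&conn->mgmtqueue))
			goto check_mgmt;
	}

	/*
	 * Only reached once cmdqueue is empty: DataOut PDUs (R2T answers)
	 * for commands that are already running.
	 */
	while (!list_empty(&conn->requeue)) {
		...
	}
	...
}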
> We are not talking about the same thing. I'm talking specifically about
> what a user would do today to boost perf for their app where they have a
> single app using the LUN. Single LUN per session. The app uses something
> like fio's batch args. You have it send down 32 cmds. When those complete,
> it sends down 32 again. We are not getting a constant stream or mix of
> cmds from different sources.
>
> We are not talking about multiple users using the same LUN or even the
> same session.
>
>> With my patch the batches will be received (in a best-effort manner) on
>> the target port in the same order as the block layer sends them to the
>> iSCSI layer. BTW, other target ports may receive other commands in the
>> meantime and the real device will receive a broken batch anyway. The
>> initiator side cannot guarantee batching on the real device.
>
> The user does this. The user has set things up so they are using the
> device for their one app. The app is sending down these commands as a
> batch. No other initiators are using the device.
>
> It's *not*, say, a shared LUN example where we have N VMs on a LUN all
> doing their own IO. The user does not expect the same perf for that type
> of case as they do when they have specifically tuned their app and device
> use like I mention.
>
>>
>> Let's imagine that the block layer submits two batches of big commands
>> and then submits single small commands; the command queue will be
>
> That's not what I'm talking about. We are not getting a mix like this.
>
>> "ABCD EFGH" + "S" + "S" + "S" + "S"
>> and this is what will be sent out:
>> current code (worst case):   A1B1C1D1 E1F1G1H1 A2 S B2 S C2D2 E2 S F2 S G2H2 (breaks batch)
>> current code (best case):    A1B1C1D1 E1F1G1H1 SSSS A2B2C2D2 E2F2G2H2 (reorder)
>> current code (bug addressed): A1B1C1D1 E1F1G1H1 SS...S (connection fail)
>> current code (impossible):   A1B1C1D1 E1F1G1H1 A2B2C2D2 E2F2G2H2 SSSS (in order)
>> with my patch (best case):   A1B1C1D1 E1F1G1H1 A2B2C2D2 E2F2G2H2 SSSS (in order)
>> with my patch (your case):   A1B1C1D1 E1 A2 F1 B2 G1H1 C2 E2 D2 F2G2H2 SSSS (still in order)
>> with my patch (worst case):  A1B1C1D1 E1F1G1H1 A2 S B2 S C2D2 E2 S F2 S G2H2 (breaks batch)
>>
>> My better best case (command order as the block layer submits) will never
>> happen in the current code.
>> If "S" comes in parallel, the worst cases are the same: both may break a
>> batch with new commands. But if the "S" commands are already in the
>> queue, my patch will (most likely) produce the best case instead of the
>> worst case.
>>
>>>> The only thing is latency. But that is not an easy question.
>>>
>>> Agree, latency is important, and that's why I was saying we can make it
>>> a config option. Users can continue to try and batch their cmds and we
>>> don't break them. We also fix the bug in that we don't get stuck in the
>>> cmdqueue loop always taking in new cmds.
>>
>>>> IMHO, a system should strive to reduce the maximum value of the latency
>>>> almost without impacting the minimum value (prefer current commands)
>>>> instead of reducing the minimum value of the latency to the detriment
>>>> of the maximum value (prefer new commands).
>>>>
>>>> Any preference of new commands over current ones looks like an io
>>>> scheduler
>>>
>>> I can see your point of view where you see it as preferring new cmds vs
>>> existing ones. It's probably due to my patch not hooking into commit_rqs
>>> and trying to figure out the batching exactly. It's more of a simple
>>> estimate.
>>>
>>> However, that's not what I'm talking about. I'm talking about the case
>>> where the block layer / iosched has sent us these commands as a batch.
>>> We are now more likely to break that up.
>>
>> No, my patch does not break batching any more than the current code does;
>> instead it decreases the probability of breaking the batch currently
>> being transmitted, by trying to transmit in the same order as the
>> commands come from the block layer.
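(Roughly, with my patch the worker becomes the following - again trimmed to
just the ordering that matters: pending DataOuts are serviced first and are
re-checked after every new command, so ongoing commands are finished in
submission order before more new ones are started:)

check_requeue:
	/* DataOut PDUs (R2T answers) for already running cmds go first */
	while (!list_empty(&conn->requeue)) {
		...
	}

	/* then new SCSI cmd PDUs */
	while (!list_empty(&conn->cmdqueue)) {
		...
		if (!list_empty(&conn->mgmtqueue))
			goto check_mgmt;
		/* go back to the DataOuts as soon as any are pending */
		if (!list_empty(&conn->requeue))
			goto check_requeue;
	}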
>>
>> It's a very complicated question - how to schedule "new vs old" commands.
>> It's not just a counter. Even with a counter there are a lot of
>> questions: should it begin from the cmdqueue loop or from
>> iscsi_data_xmit, why does it have a static limit, is it per LUN or not,
>> and so on and so on.
>
> What about a compromise? Instead of doing while (!list_empty), we switch
> it to list_for_each and move the requeue handling to in front of the
> cmdqueue handling.

That will not help, because libiscsi is not adapted for batching: it calls
iscsi_data_xmit for each queued command. So the first few commands of a
batch will each be the only command in the cmdqueue list and will be
followed by their DataOuts, simply because the DataOuts are sent first in
the next iscsi_data_xmit call. Probably you saw a regression exactly because
of that.

> We will know that the cmds on the cmdqueue at the time iscsi_data_xmit is
> run were sent to us from the block layer at the same time, so I get what I
> need. The bug is fixed.

I think the better solution would be to support the commit_rqs callback and
call iscsi_data_xmit there. Then the cmdqueue list will have all the
commands from a batch, and I would add some check of whether the current
command is from the same batch as the previous one to decide when to go to
sending the DataOuts. Sounds good?

BR,
Dmitry.
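P.S. A rough sketch of what I mean, just to have something concrete to
discuss. It is untested and the details are my assumptions: commit_rqs in
scsi_host_template is the existing block-layer batching callback, but how we
look up the connection and kick the TX worker from iscsi_tcp (and the
batch-boundary check inside iscsi_data_xmit itself) is only guessed here:

/* iscsi_tcp.c, hypothetical wiring */
static void iscsi_sw_tcp_commit_rqs(struct Scsi_Host *shost, u16 hwq)
{
	struct iscsi_sw_tcp_host *tcp_sw_host = iscsi_host_priv(shost);
	struct iscsi_conn *conn = tcp_sw_host->session->leadconn;

	/*
	 * queuecommand has already put every command of the batch on
	 * conn->cmdqueue (and would stop scheduling the work itself);
	 * kick the TX worker once for the whole batch. The queue-work
	 * helper would have to be exposed from libiscsi for this.
	 */
	iscsi_conn_queue_work(conn);
}

static struct scsi_host_template iscsi_sw_tcp_sht = {
	...
	.commit_rqs	= iscsi_sw_tcp_commit_rqs,
	...
};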