These patches contain a re-write of the iser code. Re-sending after fixing NOP and bi-dir commands issues. A separate transport lld (named "iser") is defined. New header iser.h is created. File iscsi_rdma.c is replaced by iser.c File iser_text.c contains the iscsi-text processing code replicated from iscsid.c. This is done because the functions there are not general enough, and rely on specifics of iscsi/tcp structs. This file will hopefully be removed after the code unification. Additional portions of code are copied from iscsid.c and changed appropriately. These changes do not always represent code duplication, but rather simplify or optimize the flow for iSER. This code fixes an occasional data corruption that happens with the current version. There are some unhandled error cases and rare conditions, left until there is a solid common iscsi framework to rely upon. Most of them are marked with ToDo comments. The code is fairly RDMA-transport agnostic (ib/iwarp). The code implies RDMA-only mode of work. This means the "first burst" including immediate data should be disabled, so that the entire data transfer is performed using RDMA. This is achieved by setting: target->session_param[ISCSI_PARAM_INITIAL_R2T_EN].val = 1; target->session_param[ISCSI_PARAM_IMM_DATA_EN].val = 0; in iser_target_create(). This mode is perhaps the most suitable one for iser in the majority of work scenarios. The only concern (performanc-wise) may arise about relatively small WRITE ios, which enjoy theoretically lower latencies using IB SEND instead of RDMA-RD. Implementing such "Immediate Data" mode (and any other mode that implies non-empty FirstBurst) is meanwhile precluded because it would lead to multiple buffers per iSER task (e.g. ImmediateData buffer received with the command PDU, and the rest of the data retrieved using RDMA-RD). The code introduces some preparations for handling such general scenarios, but as tgt has no framework for multi-buffer commands, these extra code segments are either commented out or conditioned upon events that should never take place. All such places have a ToDo comment over them. Apart of the known issues, there are feature regressions to be addressed in the next patches. Most notably AHS and bi-dir handling. This patch-set differs from the previously submitted RFC patches: + converted all functions in iser_text.c to receive iscsi_conn * and iser_pdu * for requets and response instead of two tasks. This may serve as an example for changes in iscsi text functions (in iscsid.c), where iscsi_pdu should be passed instead. iser_pdu differs from iscsi_pdu by having a pointer to BHS (received with the entire data, while iscsi_pdu contains the BHS struct, which is filled by a socket read operation), plus additional iser-specific fields. + fixed chap invocation, this is a temporary shortcut, done by allocating an iscsi_conn structure, copying the conn header common to iser and iscsi, then faking iscsi-tcp buffer-related fields, then caling auth function, then copying back the resulting size and also the entire iscsi_conn header (because its state fields have been changed). + bidirectional commands support: buffer allocation mechanism was fitted to handle both directions in a single command; AHS processing has been added. + fixed various issues related to connection state and login state. For example when the conn. state is inappropriate, ignore received data, stop posting rx buffers etc. In order to be able to discriminate between some fine-grained states, necessary for iser, three new states: STATE_INIT, STATE_START and STATE_READY are added. + upon tx completion check that it is full feature and then repost or if it is still login, then a state swapping function (which previously was lost in porting from tcp) is called: iser_login_tx_complete() (equivalent to cmnd_finish() in iscsid.c) + some fields in iser-related structures renamed for easier discrimination from tcp and automatic find/replace. + fixed a bug in disconnection flow with live i/o: Because READ i/o initiates RDMA-WR and then SEND, chained one to another, when the connection is broken, all tasks should be freed but only when both of the above IB ops get flushed. + fixed logout req. processing: send all pending tx pdus, logout rsp. being the last one, then set EXIT state. + fixed a bug related to possible reverse order between the conn-established event and the first data pdu (login). This "race" is an IB "feature", to be resolved in an application-specific way. The implemented solution is updating conn state and saving the login task if necessary to delay login-rx execution. + Merged flush and error handling + fixed QP async events handling: now processing all possible events and reacting to all relevant events, print out their info neatly (dev name, event type etc. using the rdma layer functions) + Because of a bug in the initiator InitiatorRDSL is sent 256KB irrespective of the iscsi.MaxRDSL setting in the connection config file. This caused allocation of 256K for tx buffer, one per task causing ~100MB extra allocation per connection. Now all tasks have one pdu (not req & rsp), and the pdu accepted from rx is reused when sending responses. Actually we only send sense data which is quite small, so there is no need to allocate huge tx buffers. In login phase 2 tasks are used, one for request, another for response. Both tasks contain a PDU with 8K data buffer. In full featured phase a special text_tx task is allocated with the pdu data according to the negotiated value. It is not not posted and used as response when a text pdu is received. + fixed the iser_sched_consume_cq() path, previously it sometimes (very rarely) failed to consume all completions until the CQ was empty. + fixed a bug: tasks were not marked with "in_scsi" flag before being submitted to the scsi target. When their connection is broken, they should not be "completed" or freed against the scsi target. + added sending NOP-IN. While not mandatory in iscsi, it is quite important in iSER (specifically in IB case). When a connection is broken, IB does not automatically generate error events if there is no operation to fail. Thus it is quite possible to have stale connections which will never clear, if the target has nothing to send (obviously, nothing will be received from now on). Sending NOP-IN periodically guarantees that such connections will be removed, as the send op will fail with an unrecoverable error. NOPs are sent using the timer-based work scheduler. Sending NOP-IN is enabled by default. To disable, invoke tgtd with: "--iser nop=off". + fixed conn. refcnt accounting for IB errors. If post_receive or post_send fail, we should traverse the request list and discriminate between successful and failed operations. Because the list is single-linked, the number of posted requests is passed to the corresponding functions. + added iser_conn_getn() that accepts an increment parameter, to save calls and spare debug messages when posting multiple ib requests, which increment refcnt by a value greater than 1 + added storing of initiator and target hostnames/ips in iser conn + eliminated all exit(1) calls in iscsi/iser_ib.c + enclosed conditions with likely()/unlikely() to minimize rare events overhead in data-path + targets are marked with "rdma" flag. When a TCP is being established to a RDMA target, it is being declined, and vice versa. -- To unsubscribe from this list: send the line "unsubscribe stgt" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html