On Fri, 2008-07-11 at 20:28 -0700, Nicholas A. Bellinger wrote:
> On Fri, 2008-07-11 at 22:41 +0400, Vladislav Bolkhovitin wrote:
> > Nicholas A. Bellinger wrote:
> > >>>> And this is a real showstopper for making LIO-Core the default and the only SCSI target framework. SCST is SCSI-centric,
> > >>> Well, one needs to understand that the LIO-Core subsystem API is more than a SCSI target framework. It's a generic method of accessing any possible storage object of the storage stack, and having said engine handle the hardware restrictions (be they physical or virtual) for the underlying storage object. It can run as a SCSI engine to real (or emulated) SCSI hardware from linux/drivers/scsi, but the real strength is that it sits above the SCSI/BLOCK/FILE layers and uses a single codepath for all underlying storage objects. For example, in the lio-core-2.6.git tree I chose the location linux/drivers/lio-core, because LIO-Core uses 'struct file' from fs/, 'struct block_device' from block/ and 'struct scsi_device' from drivers/scsi.
> > >> SCST and iSCSI-SCST, basically, do the same things, except iSCSI MC/S and related, + something more, like 1-to-many pass-through and scst_user, which need big chunks of code, correct? And they are together about 2 times smaller:
> > > Yes, something much more. A complete implementation of traditional iSCSI/TCP (known as RFC-3720), iSCSI/SCTP (which will be important in the future), and IPv6 (also important) is a significant amount of logic. When I say a 'complete implementation' I mean:
> > > I) Active-Active connection layer recovery (known as ErrorRecoveryLevel=2). (We are going to use the same code for iSER for inter-nexus OS independent (eg: below the SCSI Initiator level) recovery.
> > > Again, the important part here is that recovery and outstanding task migration happen transparently to the host OS SCSI subsystem. This means (at least with iSCSI and iSER): not having to register multiple LUNs and depend (at least completely) on SCSI WWN information, and OS dependent SCSI level multipath.
> > > II) MC/S for multiplexing (same as I), as well as being able to multiplex across multiple cards and subnets (using TCP; SCTP has multi-homing). Also being able to bring iSCSI connections up/down on the fly, until we all have iSCSI/SCTP, is very important too.
> > > III) Every possible combination of RFC-3720 defined parameter keys (and provide the apparatus to prove it). And yes, anyone can do this today against their own Target. I created core-iscsi-dv specifically for testing LIO-Target <-> LIO-Core back in 2005. Core-iSCSI-DV is the _ONLY_ _PUBLIC_ RFC-3720 domain validation tool that will actually demonstrate, using ANY data integrity tool, complete domain validation of user defined keys. Please have a look at:
> > >
> > > http://linux-iscsi.org/index.php/Core-iscsi-dv
> > > http://www.linux-iscsi.org/files/core-iscsi-dv/README
> > >
> > > Any traditional iSCSI target mode implementation + Storage Engine + Subsystem Plugin that thinks it's ready to go into the kernel will have to pass at LEAST the 8k test loop iterations, the simplest being: HeaderDigest, DataDigest, MaxRecvDataSegmentLength (512 -> 262144, in 512 byte increments).
> > > Core-iSCSI-DV is also a great indication of the stability and data integrity of the hardware/software of an iSCSI Target + Engine, especially when you have multiple core-iscsi-dv nodes hitting multiple VHACS clouds on physical machines within the cluster. I have never run IET against core-iscsi-dv personally, and I don't think Ming or Ross has either.
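[Editor's note: the parameter sweep described above can be pictured with a minimal shell sketch. This is a hypothetical illustration of the key space only — it does not use core-iscsi-dv's actual CLI, and a real run would drive an initiator login plus a data integrity pass for each combination.]

```shell
# Hypothetical sketch of the key space a core-iscsi-dv style sweep covers;
# loop bounds come from the text above (512 -> 262144 in 512 byte steps),
# digest values are the two standard RFC-3720 settings.
for hdr in None CRC32C; do
  for data in None CRC32C; do
    mrdsl=512
    while [ "$mrdsl" -le 262144 ]; do
      echo "HeaderDigest=$hdr DataDigest=$data MaxRecvDataSegmentLength=$mrdsl"
      mrdsl=$((mrdsl + 512))
    done
  done
done | wc -l    # 2 x 2 x 512 = 2048 combinations for these three keys alone
```

That is 2048 points for these three keys alone; presumably the additional negotiable keys account for the larger loop counts mentioned above.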
> > Ming or Ross, would you like to make a comment on this, considering, after all, it is your work..?

hot water here ;) I never ran that test on IET, and probably nobody has. If someone actually ran the test and found a failing case, I believe there are people who would want to fix it. Why not both of you write/reuse some test scripts to test the most advanced/fastest target and let the numbers talk?

> > > So until SOMEONE actually does this first, I think that iSCSI-SCST is more of an experiment for your own devel than a strong contender for Linux/iSCSI Target Mode.
> > There are big doubts among storage experts whether features I and II are needed at all; see, e.g., http://lkml.org/lkml/2008/2/5/331.
> Well, jgarzik is both a NETWORKING and STORAGE (he was a networking guy first, mind you) expert!
> > I also tend to agree that for block storage in practice MC/S is not needed or, at least, definitely isn't worth the effort, because:
> Trying to argue against MC/S (or against any other major part of RFC-3720, including ERL=2) is saying that Linux/iSCSI should be BEHIND what the greatest minds in the IETF have produced (and learned) from iSCSI. Considering so many people are interested in seeing Linux/iSCSI be the best and most complete implementation possible, surely one would not be foolish enough to try to debate that Linux should be BEHIND what others have figured out, be it with RFCs or running code.
> Also, you should understand that MC/S is about more than just moving data I/O across multiple TCP connections; it's about being able to bring those paths up/down on the fly without having to actually STOP/PAUSE anything. Then you add the ERL=2 pixie dust, which, you should understand, is the result of over a decade of work creating RFC-3720 within the IETF IPS TWG.
> What you have is a fabric that does not STOP/PAUSE from an OS INDEPENDENT LEVEL (below the OS dependent SCSI subsystem layer) perspective, on every possible T/I node, big and small, open or closed platform. Even as we move towards more logic in the network layer (a la Stream Control Transmission Protocol), we will still benefit from RFC-3720 as the years roll on. Quite a powerful thing..
> > 1. It is useless for sync untagged operation (regular reads in most cases over a single stream), when there is always only one command being executed at any time, because of the command connection allegiance, which forbids transferring data for a command over multiple connections.
> This is a very Parallel SCSI centric way of looking at the design of SAM, since SAM allows the transport fabric to enforce its own ordering rules (it does offer some of its own SCSI level ones, of course). Obviously each fabric (PSCSI, FC, SAS, iSCSI) is very different from the bus phase perspective. But, if you look back into the history of iSCSI, you will see that an asymmetric design with separate CONTROL/DATA TCP connections was considered originally BEFORE the Command Sequence Number (CmdSN) ordering algorithm was adopted that allows both SINGLE and MULTIPLE TCP connections to move both CONTROL/DATA packets across an iSCSI Nexus.
> Using MC/S with a modern iSCSI implementation to take advantage of lots of cores and hardware threads is something that allows one to multiplex across multiple vendors' NIC ports, with the least possible overhead, in an OS INDEPENDENT manner. Keep in mind that you can do the allocation and RX of WRITE data OOO, but the actual *EXECUTION* down via the subsystem API (which is what LIO-Target <-> LIO-Core does, in a generic way) MUST BE in the same order as the CDBs came from the iSCSI Initiator port. This is the only requirement of the iSCSI CmdSN ordering rules wrt the SCSI Architecture Model.
> > 2.
> > The only advantage it has over traditional OS multi-pathing is keeping command execution order, but in practice at the moment there is no demand for this feature, because all OSes I know of don't rely on command order to protect data integrity. They use other techniques, like queue draining. A good target should itself be able to schedule incoming commands for execution in the order that is correct from a performance POV, and not rely for that on the order in which the commands came from the initiators.
> Ok, you are completely missing the point of MC/S and ERL=2. Notice how it works in both iSCSI *AND* iSER (even across DDP fabrics!). I discussed the significant benefit of ERL=2 in numerous previous threads, but they can all be neatly summarized in:
>
> http://linux-iscsi.org/builds/user/nab/Inter.vs.OuterNexus.Multiplexing.pdf
>
> Internexus Multiplexing is DESIGNED to work with OS dependent multipath transparently, and as a matter of fact, it complements it quite well, in an OS independent method. It's completely up to the admin to determine the benefit and configure the knobs.
> So, the bit: "We should not implement this important part of the RFC just because I want some code in the kernel" is not going to get your design very far.
> > On the other hand, device bonding also preserves command execution order, but doesn't suffer from the connection allegiance limitation of MC/S, so it can boost performance even for sync untagged operations. Plus, it's pretty simple, easy to use and doesn't need any additional code.
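[Editor's note: the CmdSN rule under debate here — allocation and RX of WRITE data may happen out of order across MC/S connections, but execution must follow the order in which commands left the initiator port — can be sketched in a few lines. This is a hypothetical illustration; the names `CmdSNReorderQueue` and `receive` are invented, not taken from LIO or SCST.]

```python
class CmdSNReorderQueue:
    """Release commands to the execution layer strictly in CmdSN order,
    even though PDUs may arrive out of order on different connections."""

    def __init__(self, exp_cmd_sn=0):
        self.exp_cmd_sn = exp_cmd_sn   # next CmdSN the session expects
        self.pending = {}              # CmdSN -> command, held until in order

    def receive(self, cmd_sn, command):
        """Accept a command from any connection of the nexus; return the
        list of commands now eligible for in-order execution."""
        self.pending[cmd_sn] = command
        ready = []
        # Drain while the next expected CmdSN has arrived.
        while self.exp_cmd_sn in self.pending:
            ready.append(self.pending.pop(self.exp_cmd_sn))
            self.exp_cmd_sn += 1
        return ready

q = CmdSNReorderQueue()
print(q.receive(1, "WRITE B"))   # CmdSN 0 not yet seen: nothing released
print(q.receive(0, "READ A"))    # releases CmdSN 0, then the held CmdSN 1
```

Note this only sketches ordered *execution*; it says nothing about where each command's data phases run, which is exactly the connection allegiance point argued above.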
> > I don't have the exact numbers of an MC/S vs bonding performance comparison (mostly because open-iscsi doesn't support MC/S, but I am very curious to see them), but I have a very strong suspicion that on modern OSes, which do TCP frame reordering in a zero-copy manner, there shouldn't be much performance difference between MC/S and bonding in the maximum possible throughput, but bonding should outperform MC/S a lot in the case of sync untagged operations.
> Simple case here for you to get your feet wet with MC/S. Try doing bonding across 4x GB/sec ports on a 2x socket, 2x core x86_64 and compare MC/S vs. OS dependent network bonding and see what you find. There are about two iSCSI initiators for two OSes that implement MC/S and LIO-Target <-> LIO-Target. Anyone interested in the CPU overhead on this setup between MC/S and Link Layer bonding across 2x 2x 1 Gb/sec port chips on 4 core x86_64..?
> > Anyway, I think features I and II, if added, would increase the iSCSI-SCST kernel side code by not more than 5K lines, because most of the code is already there; the most important part which is missing is fixes for locking problems, which almost never add a lot of code.
> You can think whatever you want. Why don't you have a look at lio-core-2.6.git and see how big they are for yourself.
> > Relating to Core-iSCSI-DV, I'm sure iSCSI-SCST will pass it without problems among the required set of iSCSI features, although there are still some limitations, derived from IET, for instance support for multi-PDU commands in discovery sessions, which isn't implemented. But for adding optional iSCSI features to iSCSI-SCST there should be good *practical* reasons, which at the moment don't exist. And unused features are bad features, because they overcomplicate the code and make its maintenance harder for no gain.
> Again, you can think whatever you want.
> But since you did not implement the majority of the iSCSI-SCST code yourself (or implement your own iSCSI Initiator in parallel with your own iSCSI Target), I do not believe you are in a position to say. Any IET devs want to comment on this..?
> > So, current SCST+iSCSI-SCST 36K lines + 5K new lines = 41K lines, which is still a lot less than LIO's 63K lines. I downloaded the cleaned-up lio-core-2.6.git tree and:
> Blindly comparing lines of code with no context is usually dumb. But, since that is what you seem to be stuck on, how about this:
>
> LIO 63k +
> SCST (minus iSCSI) ??k +
> iSER from STGT ??k ==
>
> For the complete LIO-Core engine on fabrics, which includes what Rafiu from Openfiler has been so kind to call LIO-Target, "arguably the most feature complete and mature implementation out there (on any platform)"
> > $ find lio-core-2.6/drivers/lio-core -type f -name "*.[ch]"|xargs wc
> >   57064  156617 1548344 total
> >
> > Still much bigger.
> > > Obviously not. Also, what I was talking about there was the strength and flexibility of the LIO-Core design (it even ran on the Playstation 2 at one point, http://linux-iscsi.org/index.php/Playstation2/iSCSI; when the MIPS r5900 boots a modern v2.6, then we will do it again with LIO :-)
> > SCST and the target drivers have been successfully run on PPC and Sparc64, so I don't see any reason why they can't be run on the Playstation 2 as well.
> Oh it can, can it..? Does your engine's memory allocation algorithm provide a SINGLE method for allocating linked list scatterlists containing page links of ANY (not just PAGE_SIZE) size, handled generically across both internal and preregistered memory allocation cases, or coming from, say, a software RNIC moving DDP packets for iSCSI, in a single code path..?
> And then it needs to be able to go down to the PS2-Linux PATA driver, which does not show up under the SCSI subsystem, mind you.
> Surely you understand that because the MIPS r5900 is a non cache coherent architecture, you simply cannot allocate out multiple page contiguous scatterlists for your I/Os and simply expect it to work when we are sending blocks down to the 32-bit MIPS r3000 IOP..?
> > >>>> - Pass-through mode (PSCSI) also provides a non-enforced 1-to-1 relationship, as it used to be in STGT (now STGT support for pass-through mode seems to be removed), which isn't mentioned anywhere.
> > >>> Please be more specific about what you mean here. Also, note that because PSCSI is an LIO-Core subsystem plugin, LIO-Core handles the limitations of the storage object through the LIO-Core subsystem API. This means that things like (received initiator CDB sectors > LIO-Core storage object max_sectors) are handled generically by LIO-Core, using a single set of algorithms for all I/O interaction with Linux storage systems. These algorithms are also the same for DIFFERENT types of transport fabrics, both those that expect LIO-Core to allocate memory, OR where hardware will have preallocated memory and possible restrictions from the CPU/BUS architecture (take non-cache coherent MIPS, for example) on how the memory gets DMA'ed or PIO'ed down to the packet's intended storage object.
> > >> See here:
> > >> http://www.mail-archive.com/linux-scsi@xxxxxxxxxxxxxxx/msg06911.html
> > > <nod>
> > >>>> - There is some confusion in the code in the function and variable names between persistent and SAM-2 reservations.
> > >>> Well, that would be because persistent reservations are not emulated generally for all of the subsystem plugins just yet. Obviously with LIO-Core/PSCSI, if the underlying hardware supports it, it will work.
> > >> What you did (passing reservation commands directly to devices and nothing more) will work only with a single initiator per device, where reservations in the majority of cases are not needed at all.
> > > I know; like I said, implementing Persistent Reservations for stuff besides real SCSI hardware with LIO-Core/PSCSI is a TODO item. Note that the VHACS cloud (see below) will need this for DRBD objects at some point.
> > The problem is that persistent reservations don't work for multiple initiators even for real SCSI hardware with LIO-Core/PSCSI, and I clearly described why in the referenced e-mail. Nicholas, why don't you want to see it?
> Why don't you provide a reference in the code to where you think the problem is, and/or a problem case using Linux iSCSI Initiator VMs to demonstrate the bug..?
> > >>>>> The more in-fighting between the leaders in our community, the less the community benefits.
> > >>>> Sure. If my note hurts you, I can remove it. But you should also remove from your presentation and the summary paper those psychological arguments, to not confuse people.
> > >>> It's not about removing, it is about updating the page to better reflect the bigger picture, so folks coming to the site can get the latest information from the last update.
> > >> Your suggestions?
> > > I would consider helping with this at some point, but as you can see, I am extremely busy ATM. I have looked at SCST quite a bit over the years, but I am not the one making a public comparison page, at least not yet. :-) So until then, at least explain how there are 3 projects on your page, with the updated 10,000 ft overviews, and maybe even add some links to LIO-Target and a bit about the VHACS cloud. I would be willing to include info about SCST in the Linux-iSCSI.org wiki.
> > > Also, please feel free to open an account and start adding stuff about SCST yourself to the site.
> > > For Linux-iSCSI.org and VHACS (which is really where everything is going now), please have a look at:
> > >
> > > http://linux-iscsi.org/index.php/VHACS-VM
> > > http://linux-iscsi.org/index.php/VHACS
> > >
> > > Btw, the VHACS and LIO-Core design will allow for other fabrics to be used inside our cloud, and between other virtualized client setups which speak the wire protocol presented by the server side of the VHACS cloud.
> > > Many thanks for your most valuable of time,
> New v0.8.15 VHACS-VM images are online, btw. Keep checking the site for more details.
>
> Many thanks for your most valuable of time,
>
> --nab

-- 
Ming Zhang
@#$%^ purging memory... (*!%
http://blackmagic02881.wordpress.com/
http://www.linkedin.com/in/blackmagic02881
--------------------------------------------
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html