On Wed, Oct 11, 2017 at 12:31 PM, Samuel Soulard <samuel.soulard@xxxxxxxxx> wrote:
> Hi to all,
>
> What if you're using an iSCSI gateway based on LIO and krbd (that is, an
> RBD block device mapped on the iSCSI gateway and published through LIO)?
> The LIO target portal (virtual IP) would fail over to another node. This
> would theoretically provide support for PGRs, since LIO does support
> SPC-3. Granted, it is not distributed and is limited to a single node's
> throughput, but it would achieve the high availability that some
> environments require.

Yes, LIO technically supports PGRs, but that state is not distributed to
the other nodes. If pacemaker initiates a target failover to another node,
the PGR state would be lost after the migration (unless I am missing
something, like a resource agent that attempts to preserve the PGRs). For
initiator-initiated failover (e.g. the target is alive but the initiator
cannot reach it), the PGR data likewise won't be available after the
initiator fails over to another port.

> Of course, multiple target portals would be awesome, since the available
> throughput could then scale linearly, but since that isn't here right
> now, this would at least provide an alternative.

It would definitely be great to go active/active, but there are concerns
about data-corrupting edge conditions when using MPIO, because it relies
on client-side failure timers that are not coordinated with the target.
For example: an initiator writes to sector X down path A, but the IO is
delayed at the path A target (i.e. the target- and initiator-side timeout
timers are not in sync); MPIO fails over to path B, quickly retries the
write to sector X, and then performs a second write to sector X. There is
now a possibility that path A will eventually unblock and overwrite the
new value in sector X with the old value. The safe way to handle that
would be to set the initiator-side IO timeouts to such high values that
higher-level subsystems mark the MPIO path as failed should a failure
actually occur. The iSCSI MC/S (multiple connections per session) protocol
would address these concerns, since in theory path B could discover that
the retried IO was actually a retry; alas, it's available in neither the
Linux Open-iSCSI initiator nor the ESX iSCSI initiator.
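To make that ordering concrete, here is a toy sketch of the race in plain
Python (purely illustrative: nothing here is real initiator or target
code, and all of the names are made up):

    #!/usr/bin/env python3
    # Toy model of the MPIO failover race described above.
    import threading
    import time

    disk = {"X": "v0"}          # one "sector" and its initial contents
    lock = threading.Lock()     # stands in for the backing store

    def write(path, sector, value, delay=0.0):
        """A write that may stall in flight (a blocked path A target)."""
        time.sleep(delay)
        with lock:
            disk[sector] = value
            print(f"path {path}: wrote {value!r} to sector {sector}")

    # 1. The initiator issues a write down path A; it stalls in flight.
    a = threading.Thread(target=write, args=("A", "X", "old"),
                         kwargs={"delay": 2.0})
    a.start()

    # 2. The initiator-side timer expires first, so MPIO fails over to
    #    path B, retries the IO, then issues a second, newer write.
    write("B", "X", "old")      # retry of the timed-out IO
    write("B", "X", "new")      # a brand-new write to the same sector

    # 3. The stalled path A write finally lands, clobbering the new data.
    a.join()
    print(f"final contents of sector X: {disk['X']!r}")   # -> 'old'

Because neither the path B target nor the initiator has any way to know
that the path A write is still in flight, the stale value wins.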
> On Wed, Oct 11, 2017 at 12:26 PM, David Disseldorp <ddiss@xxxxxxx> wrote:
>>
>> Hi Jason,
>>
>> Thanks for the detailed write-up...
>>
>> On Wed, 11 Oct 2017 08:57:46 -0400, Jason Dillaman wrote:
>>
>> > On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López
>> > <jorpilo@xxxxxxxxx> wrote:
>> >
>> > > As far as I am able to understand, there are two ways of setting up
>> > > iSCSI for Ceph:
>> > >
>> > > 1- using the kernel (lrbd), only available on SUSE, CentOS,
>> > > Fedora...
>> >
>> > The target_core_rbd approach is only utilized by SUSE (and its
>> > derivatives like PetaSAN) as far as I know. This was the initial
>> > approach for Red Hat-derived kernels as well, until the upstream
>> > kernel maintainers indicated that they really do not want a
>> > specialized target backend just for krbd. The next attempt was to
>> > re-use the existing target_core_iblock to interface with krbd via the
>> > kernel's block layer, but that hit similar upstream walls when trying
>> > to get support for SCSI command passthrough to the block layer.
>> >
>> > > 2- using userspace (tcmu, ceph-iscsi-config, ceph-iscsi-cli)
>> >
>> > The TCMU approach is what upstream and Red Hat-derived kernels will
>> > support going forward.
>>
>> SUSE is also in the process of migrating to the upstream tcmu approach,
>> for the reasons that you gave in (1).
>>
>> ...
>>
>> > The TCMU approach also does not currently support SCSI persistent
>> > group reservations (needed for Windows clustering), because that
>> > support isn't available in the upstream kernel. The SUSE kernel has
>> > an approach that utilizes two round trips to the OSDs for each IO to
>> > simulate PGR support. Earlier this summer, I believe, SUSE started to
>> > look into how to get generic PGR support merged into the upstream
>> > kernel, using corosync/dlm to synchronize the state between multiple
>> > nodes in the target. I am not sure of the current state of that work,
>> > but it would benefit all LIO targets when complete.
>>
>> Zhu Lingshan (cc'ed) worked on a prototype for tcmu PR support. IIUC,
>> whether DLM or the underlying Ceph cluster gets used for PR state
>> storage is still under consideration.
>>
>> Cheers, David

-- 
Jason
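P.S. Purely as an illustration of the "store the PR state in the
underlying Ceph cluster" idea David mentions (this is not Zhu's prototype,
and the object and key names below are invented), persisting a reservation
registration in a RADOS object's omap with python-rados could look
something like this, so that a newly promoted target portal could recover
it after a failover:

    #!/usr/bin/env python3
    # Sketch: keep SCSI PR registrations in a RADOS object's omap so they
    # survive a target failover. Object/key/pool names are hypothetical.
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("rbd")    # pool holding the exported image

    PR_OBJ = "pr_state.mydisk"           # hypothetical per-LUN state object

    def register_key(initiator_iqn, key):
        # Record an initiator's reservation key cluster-wide.
        with rados.WriteOpCtx() as op:
            ioctx.set_omap(op, (initiator_iqn,), (key,))
            ioctx.operate_write_op(op, PR_OBJ)

    def read_keys():
        # A newly promoted target can recover every registration.
        with rados.ReadOpCtx() as op:
            it, _ = ioctx.get_omap_vals(op, "", "", 100)
            ioctx.operate_read_op(op, PR_OBJ)
            return dict(it)

    register_key("iqn.1994-05.com.redhat:client1", b"0xabc123")
    print(read_keys())
    ioctx.close()
    cluster.shutdown()

The hard part, of course, is not storing the state but fencing: making
sure a blocked peer target cannot act on a stale reservation, which is
why corosync/dlm keeps coming up in these discussions.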