Re: Ceph-ISCSI

Hmmm, if you fail over the identity of the LIO configuration, including the PGRs (I believe they are files on disk), this would work, no?  That is, two iSCSI gateways sharing storage that holds the LIO configuration and the PGR data.

Also, when you said it "fails over to another port", did you mean a port on another iSCSI gateway?  I believe LIO with multiple target portal IPs on the same node (for path redundancy) works with PGRs.

In my scenario, if my assumptions are correct, you would only have one iSCSI gateway available, reachable through two target portal IPs (for data path redundancy).  If this first iSCSI gateway fails, both target portal IPs fail over to the standby node, which has access to the PGR data on shared storage.
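Roughly what I have in mind, as a simplified sketch (the file name, gateway
names and take_over() helper below are just made up to illustrate the idea,
not real LIO or pacemaker tooling):

    import json
    from pathlib import Path

    # Stand-in for a file on storage both gateways can reach; in the real
    # setup this would live on the shared storage, not the local directory.
    SHARED_PGR_STATE = Path("lio_pgr_state.json")

    class Gateway:
        def __init__(self, name):
            self.name = name
            self.portal_ips = []   # target portal IPs currently served
            self.pgr = {}          # in-memory PGR registrations

        def register_pr(self, initiator, key):
            # Record the registration and persist it to shared storage.
            self.pgr[initiator] = key
            SHARED_PGR_STATE.write_text(json.dumps(self.pgr))

        def take_over(self, portal_ips):
            # Standby assumes the portal IPs and reloads the PGR state.
            self.portal_ips = portal_ips
            if SHARED_PGR_STATE.exists():
                self.pgr = json.loads(SHARED_PGR_STATE.read_text())

    # Active gateway serves both portal IPs; an initiator registers a key.
    active = Gateway("gw1")
    active.take_over(["192.168.1.10", "192.168.1.11"])
    active.register_pr("iqn.1991-05.com.microsoft:node1", 0x1234)

    # gw1 dies; the standby inherits both portal IPs plus the persisted PGRs.
    standby = Gateway("gw2")
    standby.take_over(active.portal_ips)
    print(standby.pgr)   # reservations survive because they were on shared storage

In other words, as long as the reservation data lives somewhere both nodes
can read, the standby should be able to present the same PGR state after
taking over the portal IPs (assuming the failover tooling actually reloads
it, which is the part I'm unsure about).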


Sam

On Wed, Oct 11, 2017 at 12:52 PM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
On Wed, Oct 11, 2017 at 12:31 PM, Samuel Soulard
<samuel.soulard@xxxxxxxxx> wrote:
> Hi to all,
>
> What if you're using an ISCSI gateway based on LIO and KRBD (that is, RBD
> block device mounted on the ISCSI gateway and published through LIO).  The
> LIO target portal (virtual IP) would fail over to another node.  This would
> theoretically provide support for PGRs since LIO does support SPC-3.
> Granted, it is not distributed and is limited to a single node's throughput,
> but this would achieve the high availability required by some environments.

Yes, LIO technically supports PGR but it's not distributed to other
nodes. If you have a pacemaker-initiated target failover to another
node, the PGR state would be lost / missing after migration (unless I
am missing something like a resource agent that attempts to preserve
the PGRs). For initiator-initiated failover (e.g. a target is alive
but the initiator cannot reach it), after it fails over to another
port the PGR data won't be available.

> Of course, multiple active target portals would be awesome since the
> available throughput would scale linearly, but since that isn't here right
> now, this would at least provide an alternative.

It would definitely be great to go active/active, but there are
concerns about data-corrupting edge conditions when using MPIO, since
it relies on client-side failure timers that are not coordinated with
the target.

For example, suppose an initiator writes to sector X down path A and
that write is delayed at the path A target (i.e. the target and
initiator timeout timers are not in sync). If MPIO fails over to path
B, quickly replays the write to sector X, and then performs a second
write to sector X, there is a possibility that path A will eventually
unblock and overwrite the new value in sector X with the old value.
The safe way to handle that would require setting the initiator-side
IO timeouts to such high values as to cause higher-level subsystems to
mark the MPIO path as failed should a failure actually occur.
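To make that timeline concrete, here is a toy sketch (plain Python, nothing
iSCSI-specific, and the timestamps are invented): applying the writes in the
order they actually reach the disk leaves the stale value in place.

    # Each event: (time it hits the disk, path, value written to sector X)
    # The original write of "v1" down path A is delayed well past the
    # MPIO failover timeout.
    events = [
        (35.0, "A", "v1"),   # original write, stuck behind a path A delay
        (31.0, "B", "v1"),   # MPIO failed over at t=30 and replayed the write
        (32.0, "B", "v2"),   # second write to the same sector via path B
    ]

    sector_x = None
    for t, path, value in sorted(events):   # order in which they reach the disk
        sector_x = value
        print(f"t={t:5.1f}  path {path} writes {value!r}")

    # Final state is "v1": the delayed path A write clobbered the newer "v2".
    print("final value in sector X:", sector_x)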

The iSCSI MC/S (multiple connections per session) feature would address
these concerns, since in theory path B could recognize that the replayed
IO was actually a retry, but alas it's not available in either the Linux
Open-iSCSI or the ESX iSCSI initiators.
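Very roughly, the reason a shared command sequence space helps looks like
this (an illustrative sketch only; the real CmdSN and connection-allegiance
rules in the iSCSI spec are considerably more involved):

    # Hypothetical target-side view of one session with two connections
    # (paths). Because the command sequence numbers are shared across the
    # connections, the target can tell a replayed command from a new one.
    completed_cmdsn = set()
    sector_x = None

    def handle_write(cmdsn, value):
        global sector_x
        if cmdsn in completed_cmdsn:
            # Same CmdSN seen before: this is a duplicate of an
            # already-applied write, so it must not clobber newer data.
            return "duplicate ignored"
        completed_cmdsn.add(cmdsn)
        sector_x = value
        return "applied"

    print(handle_write(100, "v1"))   # replayed write arrives first, via connection B
    print(handle_write(101, "v2"))   # second write via connection B
    print(handle_write(100, "v1"))   # delayed original finally arrives via connection A
    print("sector X:", sector_x)     # still "v2"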

> On Wed, Oct 11, 2017 at 12:26 PM, David Disseldorp <ddiss@xxxxxxx> wrote:
>>
>> Hi Jason,
>>
>> Thanks for the detailed write-up...
>>
>> On Wed, 11 Oct 2017 08:57:46 -0400, Jason Dillaman wrote:
>>
>> > On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López <jorpilo@xxxxxxxxx>
>> > wrote:
>> >
>> > > As far as I am able to understand, there are 2 ways of setting up
>> > > iSCSI for Ceph:
>> > >
>> > > 1- using the kernel (lrbd), only available on SUSE, CentOS, Fedora...
>> > >
>> >
>> > The target_core_rbd approach is only utilized by SUSE (and its
>> > derivatives
>> > like PetaSAN) as far as I know. This was the initial approach for Red
>> > Hat-derived kernels as well until the upstream kernel maintainers
>> > indicated
>> > that they really do not want a specialized target backend for just krbd.
>> > The next attempt was to re-use the existing target_core_iblock to
>> > interface
>> > with krbd via the kernel's block layer, but that hit similar upstream
>> > walls
>> > trying to get support for SCSI command passthrough to the block layer.
>> >
>> >
>> > > 2- using userspace (tcmu, ceph-iscsi-config, ceph-iscsi-cli)
>> > >
>> >
>> > The TCMU approach is what upstream and Red Hat-derived kernels will
>> > support
>> > going forward.
>>
>> SUSE is also in the process of migrating to the upstream tcmu approach,
>> for the reasons that you gave in (1).
>>
>> ...
>>
>> > The TCMU approach also does not currently support SCSI persistent
>> > reservation groups (needed for Windows clustering) because that support
>> > isn't available in the upstream kernel. The SUSE kernel has an approach
>> > that utilizes two round-trips to the OSDs for each IO to simulate PGR
>> > support. Earlier this summer I believe SUSE started to look into how to
>> > get
>> > generic PGR support merged into the upstream kernel using corosync/dlm
>> > to
>> > synchronize the states between multiple nodes in the target. I am not
>> > sure
>> > of the current state of that work, but it would benefit all LIO targets
>> > when complete.
>>
>> Zhu Lingshan (cc'ed) worked on a prototype for tcmu PR support. IIUC,
>> whether DLM or the underlying Ceph cluster gets used for PR state
>> storage is still under consideration.
>>
>> Cheers, David
>
>
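For what it's worth, regarding the PR-state storage question in the quoted
thread above: a very rough sketch of the "keep the reservation state in the
Ceph cluster" idea might look like the following. The object name, the JSON
record layout and the chosen pool are all hypothetical, it assumes the
python3-rados bindings and a reachable cluster, and the actual tcmu/DLM
prototype work David mentions is far more involved than this.

    import json
    import rados   # python3-rados bindings; assumes a reachable Ceph cluster

    # Hypothetical: keep the per-LUN persistent reservation state in one
    # small RADOS object so every gateway node can read and update it.
    POOL = "rbd"
    PR_OBJECT = "pr_state.lun0"

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(POOL)
        try:
            # Register a key for one initiator and record the reservation holder.
            state = {
                "registrations": {"iqn.1991-05.com.microsoft:node1": "0x1234"},
                "reservation": {"holder": "iqn.1991-05.com.microsoft:node1",
                                "type": "write-exclusive"},
            }
            ioctx.write_full(PR_OBJECT, json.dumps(state).encode())

            # Any other gateway node can now reconstruct the PR view of this LUN.
            print(json.loads(ioctx.read(PR_OBJECT)))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

A real implementation would of course need atomic updates of that state
(rather than a blind overwrite) and fencing of stale nodes, which is
presumably part of what the prototype work is sorting out.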



--
Jason

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
