On Wed, Oct 11, 2017 at 1:10 PM, Samuel Soulard <samuel.soulard@xxxxxxxxx> wrote:
> Hmmm, if you fail over the identity of the LIO configuration, including
> PGRs (I believe they are files on disk), this would work, no? Using 2
> iSCSI gateways which have shared storage to store the LIO configuration
> and PGR data.

Are you referring to the Active Persist Through Power Loss (APTPL)
support in LIO where it writes the PR metadata to
"/var/target/pr/aptpl_<wwn>"? I suppose that would work for a Pacemaker
failover if you had a shared file system mounted between all your
gateways *and* the initiator requests APTPL mode(?).

> Also, you said it "fails over to another port". Do you mean a port on
> another iSCSI gateway? I believe LIO with multiple target portal IPs on
> the same node for path redundancy works with PGRs.

Yes, I was referring to the case with multiple active iSCSI gateways,
which doesn't currently distribute PGRs to all gateways in the group.

> In my scenario, if my assumptions are correct, you would only have 1
> iSCSI gateway available through 2 target portal IPs (for data path
> redundancy). If this first iSCSI gateway fails, both target portal IPs
> fail over to the standby node with the PGR data that is available on
> shared storage.
>
>
> Sam
>
> On Wed, Oct 11, 2017 at 12:52 PM, Jason Dillaman <jdillama@xxxxxxxxxx>
> wrote:
>>
>> On Wed, Oct 11, 2017 at 12:31 PM, Samuel Soulard
>> <samuel.soulard@xxxxxxxxx> wrote:
>> > Hi all,
>> >
>> > What if you're using an iSCSI gateway based on LIO and krbd (that
>> > is, an RBD block device mapped on the iSCSI gateway and published
>> > through LIO)? The LIO target portal (virtual IP) would fail over to
>> > another node. This would theoretically provide support for PGRs,
>> > since LIO does support SPC-3. Granted, it is not distributed and is
>> > limited to the throughput of a single node, but this would achieve
>> > the high availability required by some environments.
>>
>> Yes, LIO technically supports PGR, but it's not distributed to other
>> nodes. If you have a Pacemaker-initiated target failover to another
>> node, the PGR state would be lost / missing after migration (unless I
>> am missing something like a resource agent that attempts to preserve
>> the PGRs). For initiator-initiated failover (e.g. a target is alive
>> but the initiator cannot reach it), after it fails over to another
>> port the PGR data won't be available.
>>
>> > Of course, multiple target portals would be awesome since the
>> > available throughput would be able to scale linearly, but since
>> > this isn't here right now, this would provide at least an
>> > alternative.
>>
>> It would definitely be great to go active/active, but there are
>> concerns about data-corrupting edge conditions when using MPIO, since
>> it relies on client-side failure timers that are not coordinated with
>> the target.
>>
>> For example, if an initiator writes to sector X down path A and there
>> is a delay to the path A target (i.e. the target and initiator timeout
>> timers are not in sync), and MPIO fails over to path B, quickly
>> performs the retried write to sector X, and then performs a second
>> write to sector X, there is a possibility that path A will eventually
>> unblock and overwrite the new value in sector X with the old value.
>> The safe way to handle that would require setting the initiator-side
>> IO timeouts to values high enough that higher-level subsystems would
>> mark the MPIO path as failed should a failure actually occur.
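To make that failure sequence concrete, below is a minimal Python
sketch of the race. The timings, sector values, and helper names are
all made up for illustration; it only demonstrates the ordering
problem, not any real initiator or target code:

    import threading
    import time

    disk = {"X": "v0"}            # one shared "sector" on the target
    lock = threading.Lock()

    def target_write(path, sector, value, delay=0.0):
        # Simulate an IO that reaches the target after `delay` seconds.
        time.sleep(delay)
        with lock:
            disk[sector] = value
            print("path %s wrote %s to sector %s" % (path, value, sector))

    # The initiator issues write(v1) down path A, but the IO stalls in
    # flight (the target and initiator timers are not in sync).
    stalled = threading.Thread(target=target_write,
                               args=("A", "X", "v1"),
                               kwargs={"delay": 0.5})
    stalled.start()

    # MPIO's client-side timer expires; it fails over to path B,
    # retries write(v1), and then the application writes a newer v2.
    target_write("B", "X", "v1")  # retried IO
    target_write("B", "X", "v2")  # subsequent write

    # The stalled path A IO finally unblocks and lands last.
    stalled.join()
    print("final value in sector X:", disk["X"])  # "v1" (stale), not "v2"

With uncoordinated timers the target has no way to tell that path A's
late arrival is stale, which is exactly the gap MCS-style retry
detection would close: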
>> The iSCSI MCS protocol would address these concerns, since in theory
>> path B could discover that the retried IO was actually a retry, but
>> alas it's not available in either the Linux Open-iSCSI or the ESX
>> iSCSI initiators.
>>
>> > On Wed, Oct 11, 2017 at 12:26 PM, David Disseldorp <ddiss@xxxxxxx>
>> > wrote:
>> >>
>> >> Hi Jason,
>> >>
>> >> Thanks for the detailed write-up...
>> >>
>> >> On Wed, 11 Oct 2017 08:57:46 -0400, Jason Dillaman wrote:
>> >>
>> >> > On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López
>> >> > <jorpilo@xxxxxxxxx> wrote:
>> >> >
>> >> > > As far as I am able to understand, there are 2 ways of setting
>> >> > > up iSCSI for Ceph:
>> >> > >
>> >> > > 1- using the kernel (lrbd), only available on SUSE, CentOS,
>> >> > > Fedora...
>> >> >
>> >> > The target_core_rbd approach is only utilized by SUSE (and its
>> >> > derivatives like PetaSAN) as far as I know. This was the initial
>> >> > approach for Red Hat-derived kernels as well, until the upstream
>> >> > kernel maintainers indicated that they really do not want a
>> >> > specialized target backend for just krbd. The next attempt was
>> >> > to re-use the existing target_core_iblock to interface with krbd
>> >> > via the kernel's block layer, but that hit similar upstream
>> >> > walls trying to get support for SCSI command passthrough to the
>> >> > block layer.
>> >> >
>> >> > > 2- using userspace (tcmu, ceph-iscsi-config, ceph-iscsi-cli)
>> >> >
>> >> > The TCMU approach is what upstream and Red Hat-derived kernels
>> >> > will support going forward.
>> >>
>> >> SUSE is also in the process of migrating to the upstream tcmu
>> >> approach, for the reasons that you gave in (1).
>> >>
>> >> ...
>> >>
>> >> > The TCMU approach also does not currently support SCSI
>> >> > persistent reservation groups (needed for Windows clustering)
>> >> > because that support isn't available in the upstream kernel. The
>> >> > SUSE kernel has an approach that utilizes two round-trips to the
>> >> > OSDs for each IO to simulate PGR support. Earlier this summer, I
>> >> > believe SUSE started to look into how to get generic PGR support
>> >> > merged into the upstream kernel, using corosync/dlm to
>> >> > synchronize the states between multiple nodes in the target. I
>> >> > am not sure of the current state of that work, but it would
>> >> > benefit all LIO targets when complete.
>> >>
>> >> Zhu Lingshan (cc'ed) worked on a prototype for tcmu PR support.
>> >> IIUC, whether DLM or the underlying Ceph cluster gets used for PR
>> >> state storage is still under consideration.
>> >>
>> >> Cheers, David
>> >
>>
>> --
>> Jason
>
>

--
Jason
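For what it's worth, here is a rough Python sketch of the
shared-storage PR persistence idea discussed above. The JSON layout,
file names, and WWN/IQN values are hypothetical (LIO's actual APTPL
metadata format under /var/target/pr is different); the point is only
that reservation state written by the active gateway can be reloaded
by whichever gateway takes over the target:

    import json
    import os

    # Stand-in for a directory shared between all gateways (e.g. a
    # clustered file system mounted at /var/target/pr). A local path
    # is used here so the sketch is runnable as-is.
    PR_DIR = "./pr-demo"

    def save_registrations(wwn, registrations):
        # Persist PR state for a LUN, keyed by its WWN.
        os.makedirs(PR_DIR, exist_ok=True)
        path = os.path.join(PR_DIR, "aptpl_%s.json" % wwn)
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(registrations, f)
            f.flush()
            os.fsync(f.fileno())  # survive power loss (the "PL" in APTPL)
        os.rename(tmp, path)      # atomic replace, never a torn file

    def load_registrations(wwn):
        # Reload PR state on the gateway that takes over the target.
        path = os.path.join(PR_DIR, "aptpl_%s.json" % wwn)
        try:
            with open(path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}             # nothing registered yet

    # A registration made while gateway 1 owns the target portal ...
    save_registrations("naa.60014055c1a2b3c4",
                       {"iqn.1998-01.com.vmware:esx1": "0x123abc"})
    # ... is visible to gateway 2 after Pacemaker moves the portal IP:
    print(load_registrations("naa.60014055c1a2b3c4"))

Note that this only covers the Pacemaker active/standby case; for
multiple active gateways the registrations (and the reservation
holder) would still need to be pushed to, and enforced by, every node,
which is what the corosync/dlm work mentioned above is about.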