Re: Ceph-ISCSI

As an aside, SCST's iSCSI target supports ALUA and does PGRs through the use of DLM. We have been using that with Solaris and Hyper-V initiators for RBD-backed storage, but we still have some ongoing issues with ALUA (probably our current configuration; we need to lab-test the later recommendations).



> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Jason Dillaman
> Sent: Thursday, 12 October 2017 5:04 AM
> To: Samuel Soulard <samuel.soulard@xxxxxxxxx>
> Cc: ceph-users <ceph-users@xxxxxxxx>; Zhu Lingshan <lszhu@xxxxxxxx>
> Subject: Re:  Ceph-ISCSI
>
> On Wed, Oct 11, 2017 at 1:10 PM, Samuel Soulard
> <samuel.soulard@xxxxxxxxx> wrote:
> > Hmmm, if you fail over the identity of the LIO configuration, including
> > PGRs (I believe they are files on disk), this would work, no? That is,
> > using two iSCSI gateways which have shared storage to store the LIO
> > configuration and PGR data.
>
> Are you referring to the Activate Persist Through Power Loss (APTPL) support
> in LIO where it writes the PR metadata to "/var/target/pr/aptpl_<wwn>"? I
> suppose that would work for a Pacemaker failover if you had a shared file
> system mounted between all your gateways *and* the initiator requests
> APTPL mode(?).
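
For what it's worth, below is a rough and untested sketch of the kind of
failover hook such a resource agent could use to carry that APTPL metadata
between gateways. The shared-mount path, the script itself and the
save/restore split are my assumptions for illustration, not something that
exists today:

#!/usr/bin/env python3
# Hypothetical Pacemaker failover hook (illustrative only, not a real OCF
# resource agent): copy LIO's APTPL persistent-reservation metadata to and
# from a shared mount so the standby gateway can restore PR state.
import glob
import os
import shutil
import sys

PR_DIR = "/var/target/pr"               # where LIO writes aptpl_<wwn> files
SHARED_DIR = "/mnt/shared/target-pr"    # assumed shared-filesystem mount

def save_pr_state():
    """Run on the active gateway (periodically or on demote)."""
    os.makedirs(SHARED_DIR, exist_ok=True)
    for path in glob.glob(os.path.join(PR_DIR, "aptpl_*")):
        shutil.copy2(path, SHARED_DIR)

def restore_pr_state():
    """Run on the standby gateway on promote, before starting the target."""
    os.makedirs(PR_DIR, exist_ok=True)
    for path in glob.glob(os.path.join(SHARED_DIR, "aptpl_*")):
        shutil.copy2(path, PR_DIR)

if __name__ == "__main__":
    {"save": save_pr_state, "restore": restore_pr_state}[sys.argv[1]]()

The idea would be that the agent calls it with "save" on the active node and
with "restore" on the node being promoted, before the target is started.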
>
> > Also, you said "fails over to another port"; do you mean a port on
> > another iSCSI gateway?  I believe LIO with multiple target portal IPs
> > on the same node for path redundancy works with PGRs.
>
> Yes, I was referring to the case with multiple active iSCSI gateways which
> doesn't currently distribute PGRs to all gateways in the group.
>
> > In my scenario, if my assumptions are correct, you would only have one
> > iSCSI gateway available through two target portal IPs (for data path
> > redundancy).  If this first iSCSI gateway fails, both target portal IPs
> > fail over to the standby node with the PGR data that is available on
> > shared storage.
> >
> >
> > Sam
> >
> > On Wed, Oct 11, 2017 at 12:52 PM, Jason Dillaman <jdillama@xxxxxxxxxx>
> > wrote:
> >>
> >> On Wed, Oct 11, 2017 at 12:31 PM, Samuel Soulard
> >> <samuel.soulard@xxxxxxxxx> wrote:
> >> > Hi to all,
> >> >
> >> > What if you're using an iSCSI gateway based on LIO and krbd (that
> >> > is, an RBD block device mapped on the iSCSI gateway and published
> >> > through LIO)?  The LIO target portal (virtual IP) would fail over
> >> > to another node.  This would theoretically provide support for
> >> > PGRs, since LIO does support SPC-3.  Granted, it is not distributed
> >> > and is limited to a single node's throughput, but this would
> >> > achieve the high availability required by some environments.
> >>
> >> Yes, LIO technically supports PGR but it's not distributed to other
> >> nodes. If you have a pacemaker-initiated target failover to another
> >> node, the PGR state would be lost / missing after migration (unless I
> >> am missing something like a resource agent that attempts to preserve
> >> the PGRs). For initiator-initiated failover (e.g. a target is alive
> >> but the initiator cannot reach it), after it fails over to another
> >> port the PGR data won't be available.
> >>
> >> > Of course, multiple target portals would be awesome, since the
> >> > available throughput could scale linearly, but since that isn't
> >> > here right now, this would at least provide an alternative.
> >>
> >> It would definitely be great to go active/active but there are
> >> concerns of data-corrupting edge conditions when using MPIO since it
> >> relies on client-side failure timers that are not coordinated with
> >> the target.
> >>
> >> For example, if an initiator writes to sector X down path A and there
> >> is a delay to the path A target (i.e. the target and initiator timeout
> >> timers are not in sync), and MPIO fails over to path B, quickly
> >> performs the write to sector X, and then performs a second write to
> >> sector X, there is a possibility that path A will eventually unblock
> >> and overwrite the new value in sector X with the old value. The safe
> >> way to handle that would require setting the initiator-side IO
> >> timeouts to such high values that higher-level subsystems would mark
> >> the MPIO path as failed should a failure actually occur.
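
To make that sequence concrete, here is a tiny, purely illustrative Python
timeline of the race; the names are mine and nothing here touches a real
initiator or target:

# Toy timeline of the edge case above (my illustration only). The "disk"
# dict stands in for the backing RBD image.
disk = {}

def arrive(path, sector, value):
    """Apply a write at the moment it actually reaches the target."""
    disk[sector] = value
    print("path %s wrote %r to sector %s -> %r" % (path, value, sector, disk))

# t0: the write of "old" to sector X goes down path A but gets stuck
#     somewhere before the target (path A is delayed, not dead).
delayed_write = ("A", "X", "old")

# t1: the initiator-side timer fires, MPIO fails over to path B, retries
#     the write, then issues a second, newer write to the same sector.
arrive("B", "X", "old")   # retry of the original write
arrive("B", "X", "new")   # second write carrying the latest data

# t2: path A finally unblocks and its stale request reaches the target,
#     silently reverting sector X from "new" back to "old".
arrive(*delayed_write)

assert disk["X"] == "old"   # the newer data has been lost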
> >>
> >> The iSCSI MC/S (multiple connections per session) feature would
> >> address these concerns, since in theory path B could discover that
> >> the retried IO was actually a retry, but alas it's not available in
> >> either the Linux Open-iSCSI or the ESX iSCSI initiator.
> >>
> >> > On Wed, Oct 11, 2017 at 12:26 PM, David Disseldorp <ddiss@xxxxxxx>
> >> > wrote:
> >> >>
> >> >> Hi Jason,
> >> >>
> >> >> Thanks for the detailed write-up...
> >> >>
> >> >> On Wed, 11 Oct 2017 08:57:46 -0400, Jason Dillaman wrote:
> >> >>
> >> >> > On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López
> >> >> > <jorpilo@xxxxxxxxx>
> >> >> > wrote:
> >> >> >
> >> >> > > As far as I am able to understand, there are two ways of setting
> >> >> > > up iSCSI for Ceph:
> >> >> > >
> >> >> > > 1- using the kernel (lrbd), only available on SUSE, CentOS, Fedora...
> >> >> > >
> >> >> >
> >> >> > The target_core_rbd approach is only utilized by SUSE (and its
> >> >> > derivatives like PetaSAN) as far as I know. This was the initial
> >> >> > approach for Red Hat-derived kernels as well until the upstream
> >> >> > kernel maintainers indicated that they really do not want a
> >> >> > specialized target backend for just krbd.
> >> >> > The next attempt was to re-use the existing target_core_iblock
> >> >> > to interface with krbd via the kernel's block layer, but that
> >> >> > hit similar upstream walls trying to get support for SCSI
> >> >> > command passthrough to the block layer.
> >> >> >
> >> >> >
> >> >> > > 2- using userspace (tcmu, ceph-iscsi-config, ceph-iscsi-cli)
> >> >> > >
> >> >> >
> >> >> > The TCMU approach is what upstream and Red Hat-derived kernels
> >> >> > will support going forward.
> >> >>
> >> >> SUSE is also in the process of migrating to the upstream tcmu
> >> >> approach, for the reasons that you gave in (1).
> >> >>
> >> >> ...
> >> >>
> >> >> > The TCMU approach also does not currently support SCSI
> >> >> > persistent reservation groups (needed for Windows clustering)
> >> >> > because that support isn't available in the upstream kernel. The
> >> >> > SUSE kernel has an approach that utilizes two round-trips to the
> >> >> > OSDs for each IO to simulate PGR support. Earlier this summer I
> >> >> > believe SUSE started to look into how to get generic PGR support
> >> >> > merged into the upstream kernel using corosync/dlm to
> >> >> > synchronize the states between multiple nodes in the target. I
> >> >> > am not sure of the current state of that work, but it would
> >> >> > benefit all LIO targets when complete.
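
As I read it, the per-IO scheme boils down to something like the following
hand-wavy userspace sketch, assuming a JSON PR-state object stored in RADOS
(the object name, layout and conflict check are my inventions; the real
work is in the SUSE kernel, and an upstream version would hinge on the
corosync/dlm piece mentioned above):

# Sketch of the "extra OSD round-trip per IO" idea: keep PR state in a
# shared RADOS object and consult it before forwarding each write.
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")            # assumed pool holding the state
PR_STATE_OBJ = "iscsi_pr_state.myimage"      # hypothetical per-image object

def write_allowed(initiator_iqn):
    """First round-trip: reject writes that conflict with a reservation."""
    try:
        state = json.loads(ioctx.read(PR_STATE_OBJ))
    except rados.ObjectNotFound:
        return True                          # no reservation recorded yet
    holder = state.get("reservation_holder")
    return holder is None or holder == initiator_iqn

def handle_write(initiator_iqn, submit_io):
    if not write_allowed(initiator_iqn):
        raise IOError("RESERVATION CONFLICT")
    submit_io()                              # second round-trip: the data IO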
> >> >>
> >> >> Zhu Lingshan (cc'ed) worked on a prototype for tcmu PR support.
> >> >> IIUC, whether DLM or the underlying Ceph cluster gets used for PR
> >> >> state storage is still under consideration.
> >> >>
> >> >> Cheers, David
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Jason
> >
> >
>
>
>
> --
> Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



