On Wed, Oct 11, 2017 at 12:31 PM, Samuel Soulard <samuel.soulard@xxxxxxxxx> wrote:
> Hi to all,
>
> What if you're using an iSCSI gateway based on LIO and krbd (that is, an
> RBD block device mapped on the iSCSI gateway and published through LIO)?
> The LIO target portal (virtual IP) would fail over to another node. This
> would theoretically provide support for PGRs, since LIO does support
> SPC-3. Granted, it is not distributed and is limited to a single node's
> throughput, but it would achieve the high availability that some
> environments require.

Yes, LIO technically supports PGRs, but that state is not distributed to
the other nodes. If pacemaker initiates a target failover to another node,
the PGR state would be lost after the migration (unless I am missing
something, like a resource agent that attempts to preserve the PGRs). For
initiator-initiated failover (e.g. the target is alive but the initiator
cannot reach it), the PGR data likewise won't be available after the
initiator fails over to another port.

> Of course, multiple target portals would be awesome, since the available
> throughput could then scale linearly, but since that isn't here right
> now, this would at least provide an alternative.

It would definitely be great to go active/active, but there are concerns
about data-corrupting edge conditions when using MPIO, because it relies
on client-side failure timers that are not coordinated with the target.
For example: an initiator writes to sector X down path A, but the IO is
delayed at the path A target (i.e. the target- and initiator-side timeout
timers are not in sync); MPIO fails over to path B, quickly retries the
write to sector X, and then performs a second write to sector X. There is
now a possibility that path A will eventually unblock and overwrite the
new value in sector X with the old value. The safe way to handle that
would be to set the initiator-side IO timeouts to such high values that
higher-level subsystems mark the MPIO path as failed should a failure
actually occur. The iSCSI MC/S (multiple connections per session) protocol
would address these concerns, since in theory path B could discover that
the retried IO was actually a retry; alas, it's available in neither the
Linux Open-iSCSI initiator nor the ESX iSCSI initiator.
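To make that ordering concrete, here is a toy sketch of the race in plain
Python (purely illustrative: nothing here is real initiator or target
code, and all of the names are made up):

    #!/usr/bin/env python3
    # Toy model of the MPIO failover race described above.
    import threading
    import time

    disk = {"X": "v0"}          # one "sector" and its initial contents
    lock = threading.Lock()     # stands in for the backing store

    def write(path, sector, value, delay=0.0):
        """A write that may stall in flight (a blocked path A target)."""
        time.sleep(delay)
        with lock:
            disk[sector] = value
            print(f"path {path}: wrote {value!r} to sector {sector}")

    # 1. The initiator issues a write down path A; it stalls in flight.
    a = threading.Thread(target=write, args=("A", "X", "old"),
                         kwargs={"delay": 2.0})
    a.start()

    # 2. The initiator-side timer expires first, so MPIO fails over to
    #    path B, retries the IO, then issues a second, newer write.
    write("B", "X", "old")      # retry of the timed-out IO
    write("B", "X", "new")      # a brand-new write to the same sector

    # 3. The stalled path A write finally lands, clobbering the new data.
    a.join()
    print(f"final contents of sector X: {disk['X']!r}")   # -> 'old'

Because neither the path B target nor the initiator has any way to know
that the path A write is still in flight, the stale value wins.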
> On Wed, Oct 11, 2017 at 12:26 PM, David Disseldorp <ddiss@xxxxxxx> wrote:
>>
>> Hi Jason,
>>
>> Thanks for the detailed write-up...
>>
>> On Wed, 11 Oct 2017 08:57:46 -0400, Jason Dillaman wrote:
>>
>> > On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López
>> > <jorpilo@xxxxxxxxx> wrote:
>> >
>> > > As far as I am able to understand, there are two ways of setting up
>> > > iSCSI for Ceph:
>> > >
>> > > 1- using the kernel (lrbd), only available on SUSE, CentOS,
>> > > Fedora...
>> >
>> > The target_core_rbd approach is only utilized by SUSE (and its
>> > derivatives like PetaSAN) as far as I know. This was the initial
>> > approach for Red Hat-derived kernels as well, until the upstream
>> > kernel maintainers indicated that they really do not want a
>> > specialized target backend just for krbd. The next attempt was to
>> > re-use the existing target_core_iblock to interface with krbd via the
>> > kernel's block layer, but that hit similar upstream walls when trying
>> > to get support for SCSI command passthrough to the block layer.
>> >
>> > > 2- using userspace (tcmu, ceph-iscsi-config, ceph-iscsi-cli)
>> >
>> > The TCMU approach is what upstream and Red Hat-derived kernels will
>> > support going forward.
>>
>> SUSE is also in the process of migrating to the upstream tcmu approach,
>> for the reasons that you gave in (1).
>>
>> ...
>>
>> > The TCMU approach also does not currently support SCSI persistent
>> > group reservations (needed for Windows clustering), because that
>> > support isn't available in the upstream kernel. The SUSE kernel has
>> > an approach that utilizes two round trips to the OSDs for each IO to
>> > simulate PGR support. Earlier this summer, I believe, SUSE started to
>> > look into how to get generic PGR support merged into the upstream
>> > kernel, using corosync/dlm to synchronize the state between multiple
>> > nodes in the target. I am not sure of the current state of that work,
>> > but it would benefit all LIO targets when complete.
>>
>> Zhu Lingshan (cc'ed) worked on a prototype for tcmu PR support. IIUC,
>> whether DLM or the underlying Ceph cluster gets used for PR state
>> storage is still under consideration.
>>
>> Cheers, David

-- 
Jason
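P.S. Purely as an illustration of the "store the PR state in the
underlying Ceph cluster" idea David mentions (this is not Zhu's prototype,
and the object and key names below are invented), persisting a reservation
registration in a RADOS object's omap with python-rados could look
something like this, so that a newly promoted target portal could recover
it after a failover:

    #!/usr/bin/env python3
    # Sketch: keep SCSI PR registrations in a RADOS object's omap so they
    # survive a target failover. Object/key/pool names are hypothetical.
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("rbd")    # pool holding the exported image

    PR_OBJ = "pr_state.mydisk"           # hypothetical per-LUN state object

    def register_key(initiator_iqn, key):
        # Record an initiator's reservation key cluster-wide.
        with rados.WriteOpCtx() as op:
            ioctx.set_omap(op, (initiator_iqn,), (key,))
            ioctx.operate_write_op(op, PR_OBJ)

    def read_keys():
        # A newly promoted target can recover every registration.
        with rados.ReadOpCtx() as op:
            it, _ = ioctx.get_omap_vals(op, "", "", 100)
            ioctx.operate_read_op(op, PR_OBJ)
            return dict(it)

    register_key("iqn.1994-05.com.redhat:client1", b"0xabc123")
    print(read_keys())
    ioctx.close()
    cluster.shutdown()

The hard part, of course, is not storing the state but fencing: making
sure a blocked peer target cannot act on a stale reservation, which is
why corosync/dlm keeps coming up in these discussions.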