On Fri, 2012-04-20 at 12:09 +0100, Stefan Hajnoczi wrote:
> On Fri, Apr 20, 2012 at 8:46 AM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
> > On 20/04/2012 09:00, Nicholas A. Bellinger wrote:

<SNIP>

> > - no support for migration (there can be pending SCSI requests at
> >   migration time, that need to be restarted on the destination)
>
> Yes, and it hasn't been thought through by me at least ;-).  So
> migration is indeed a challenge that needs to be worked through.
>
> > - no support for non-raw images (fix: use NBD on a Unix socket? perhaps
> >   add an NBD backend to lio)
>
> For me this is the biggest issue with kernel-level storage for virtual
> machines.  We have NBD today, but it goes through the network stack
> using a limited protocol and probably can't do zero-copy.
>
> The most promising option I found was dm-userspace
> (http://wiki.xensource.com/xenwiki/DmUserspace), which implements a
> device-mapper target with an in-kernel MMU-like lookup mechanism that
> calls out to userspace when block addresses need to be translated.
> It's not anywhere near upstream and hasn't been pushed for several
> years.  On the plus side, we could also write a userspace
> implementation of this so that QEMU image formats continue to be
> portable to other host OSes without duplicating code.
>
> If tcm_vhost only works with raw images, then I don't see it as a
> realistic option given the effort it will require to complete and
> maintain.
>

There has been interest in the past in creating a TCM backend that
allows a userspace passthrough, but so far the code to do this has not
materialized.  There are pieces of logic from STGT that provide an
interface for doing something similar that still exist in the upstream
kernel.

Allowing different QEMU formats to be processed (in userspace) through a
hybrid TCM backend driver that fits into the existing HBA/DEV layout in
/sys/kernel/config/target/$HBA/$DEV/ is what would be required to really
do this properly.
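As a rough interim illustration of the "NBD on a Unix socket" idea above: qemu-nbd can already expose a non-raw image as a raw block device, which a TCM IBLOCK backstore could then consume via configfs.  This is only a sketch under assumptions; the image path, nbd device, and backstore name below are hypothetical, and this goes through the NBD layer rather than the hybrid userspace backend being discussed:

```
# Hypothetical sketch: expose a qcow2 image as /dev/nbd0, then wrap it
# in a TCM IBLOCK backstore (normally targetcli would drive configfs).
modprobe nbd
qemu-nbd --connect=/dev/nbd0 /var/lib/images/guest.qcow2

# Create and enable an IBLOCK backstore on top of the NBD device:
mkdir -p /sys/kernel/config/target/core/iblock_0/nbd_disk
echo "udev_path=/dev/nbd0" > \
    /sys/kernel/config/target/core/iblock_0/nbd_disk/control
echo 1 > /sys/kernel/config/target/core/iblock_0/nbd_disk/enable
```

The obvious downside, as noted above, is the extra copy through the NBD protocol rather than a zero-copy path.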
> >> In order for QEMU userspace to support this, Linux would need to expose
> >> a method to userspace for issuing DIF protected CDBs.  This userspace
> >> API currently does not exist AFAIK, so a kernel-level approach is
> >> currently the only option when it comes to supporting end-to-end block
> >> protection information originating from within Linux guests.
> >
> > I think it would be worthwhile to have this in userspace too.
> >
> >> (Note this is going to involve a virtio-scsi spec rev as well)
> >
> > Yes.  By the way, another possible modification could be to tell the
> > guest what its (initiator) WWPN is.
>
> Going back to ALUA, I'd like to understand ALUA multipathing a bit
> better.  I've never played with multipath, hence my questions:
>
> I have a SAN with multiple controllers and ALUA support - so ALUA
> multipathing is possible.  Now I want my KVM guests to take advantage
> of multipath themselves.  Since the LIO target virtualizes the SCSI
> bus (the host admin defines LUNs, target ports, and ACLs that do not
> have to map 1:1 to the SAN), we also have to implement ALUA in the
> virtio-scsi target.  The same would be true for QEMU SCSI emulation.
>

virtio-scsi (as a SCSI LLD in the guest) uses the scsi_dh_alua device
handler just like any other SCSI driver does (eg: ALUA is a fabric
independent feature).  That means there are no special requirements for
initiator LLDs to be able to use scsi_dh_alua, other than the target
supporting the ALUA primitives + NAA IEEE extended registered naming to
identify the backend device across multiple paths.

This also currently requires explicit multipathd.conf setup (in the
guest) if the target LUN's vendor/product strings do not match the
default supported ALUA array list in the upstream scsi_dh_alua.c code.

> How would we configure LIO's ALUA in this case?  We really want to
> reflect the port attributes (available/offline,
> optimized/non-optimized) that the external SAN fabric reports.  Is
> this supported by LIO?
>
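For the explicit multipathd.conf setup mentioned above, a guest-side device stanza would look roughly like the following.  This is a hedged sketch: the "LIO-ORG" vendor string matches what LIO-exported LUNs typically report, but the product string and the rest of the settings are illustrative and should be checked against the actual LUN's INQUIRY data:

```
devices {
        device {
                # vendor/product must match the INQUIRY strings of the
                # virtio-scsi LUN as seen inside the guest (illustrative):
                vendor                  "LIO-ORG"
                product                 "IBLOCK"
                # Group paths by ALUA priority and use the ALUA
                # prioritizer + hardware handler:
                path_grouping_policy    group_by_prio
                prio                    alua
                hardware_handler        "1 alua"
                path_checker            tur
                failback                immediate
        }
}
```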
Absolutely.  The ability to set the ALUA primary access state comes for
free with all fabric modules using TCM + virtual backends
(IBLOCK+FILEIO).  The ALUA status appears as attributes under each
endpoint LUN, under:

/sys/kernel/config/target/vhost/naa.60014050088ae39a/tpgt_1/lun/lun_0/alua_tg_pt_*

The 'alua_tg_pt_gp' attr is used to optionally set the fabric LUN's
ALUA target port group membership.  Each fabric target LUN is (by
default) associated with an alua_tg_pt_gp that is specific to the
exported device backend.  Each backend device can have any number of
ALUA tg_pt_gps that exist in a configfs group under
/sys/kernel/config/target/$HBA/$DEV/alua/$TG_PT_GP_NAME.

Here is a quick idea of how a 'default_tg_pt_gp' looks for an IBLOCK
device with multiple fabric exports (iscsi, loopback, vhost):

# head /sys/kernel/config/target/core/iblock_0/mpt_fusion/alua/default_tg_pt_gp/*
==> alua_access_state <==
0

==> alua_access_status <==
None

==> alua_access_type <==
Implict and Explict

==> alua_write_metadata <==
1

==> members <==
iSCSI/iqn.2003-01.org.linux-iscsi.debian-amd64.x8664:sn.6747a471775f/tpgt_1/lun_1
iSCSI/iqn.2003-01.org.linux-iscsi.debian-amd64.x8664:sn.1bc6fcb58f24/tpgt_1/lun_0
loopback/naa.6001405df1bafb29/tpgt_1/lun_0
vhost/naa.60014050088ae39a/tpgt_1/lun_0

==> nonop_delay_msecs <==
100

==> preferred <==
0

==> tg_pt_gp_id <==
0

==> trans_delay_msecs <==
0

All of an ALUA $TG_PT_GP_NAME's members (eg: the exported fabric LUNs)
are required to have the same ALUA primary access state, following
SPC-4 support for ALUA target port groups.  So when the ALUA primary
access state is changed at the backend level, it applies to all fabric
LUNs within the associated ALUA target port group.

There is also a secondary ALUA access state (offline) that can be set
using a generic fabric LUN ALUA attr.  This information is saved into
individual files that allow the active state to persist across target
power loss.
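To make the group mechanics above concrete, here is a hedged shell sketch of creating an extra target port group and flipping its primary access state via configfs.  The group name 'standby_tg_pt_gp' is hypothetical, and the numeric state values are the SPC-4 primary access state codes (0 = Active/Optimized, 1 = Active/NonOptimized, 2 = Standby) as accepted by alua_access_state:

```
# Create an additional ALUA target port group for the backend device
# (group name is hypothetical):
cd /sys/kernel/config/target/core/iblock_0/mpt_fusion/alua
mkdir standby_tg_pt_gp

# Give the new group a unique ID:
echo 1 > standby_tg_pt_gp/tg_pt_gp_id

# Move a fabric LUN into the group by writing the group name into the
# LUN's alua_tg_pt_gp attr:
echo standby_tg_pt_gp > \
    /sys/kernel/config/target/vhost/naa.60014050088ae39a/tpgt_1/lun/lun_0/alua_tg_pt_gp

# Change the group's primary access state to Standby; this applies to
# every fabric LUN that is a member of the group:
echo 2 > standby_tg_pt_gp/alua_access_state
```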
> Does it even make sense to pass the multipathing up into the guest?
> If we terminate it on the host using Linux's ALUA support, we can hide
> multipath entirely from the guest.  Do we lose an obvious advantage by
> terminating multipath in the host instead of the guest?
>

Being able to virtualize ALUA port access states at the host
(Preferred=1, Active/NonOptimized, Standby) provides a nice fabric
independent (and guest OS independent) method for managing path access
for virtio-scsi guest LUNs.

Being able to multiplex I/O to a single vhost-scsi LUN across multiple
vhost interrupt pairs is also likely to be difficult when terminating
multipath at the host level.

--nab
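For contrast, terminating multipath at the host would look roughly like the sketch below: the host's dm-multipath device (already assembled by multipathd from the SAN paths) simply backs a single IBLOCK backstore, and the guest sees one plain LUN with no ALUA state.  Device and backstore names are hypothetical:

```
# Host-side termination sketch (hypothetical names): multipathd has
# already built /dev/mapper/mpatha from the ALUA SAN paths.
mkdir -p /sys/kernel/config/target/core/iblock_1/san_lun
echo "udev_path=/dev/mapper/mpatha" > \
    /sys/kernel/config/target/core/iblock_1/san_lun/control
echo 1 > /sys/kernel/config/target/core/iblock_1/san_lun/enable
# The backstore is then exported to the guest as a single vhost LUN;
# path failover happens entirely in the host's device-mapper layer.
```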