Re: TCMU: pass through initiator name to user space, three proposals, which is better

Bart Van Assche <Bart.VanAssche@xxxxxxxxxxx> · Tue, 13 Jun 2017 15:22:28 +0000

On Tue, 2017-06-13 at 15:51 +0800, Zhu Lingshan wrote:
> On 06/12/2017 11:39 PM, Bart Van Assche wrote: 
> > On Mon, 2017-06-12 at 01:06 +0800, Zhu Lingshan wrote: 
> > > I am working on PR support for RBD in user space, aka, give tcmu-runner 
> > > rbd handler ability to handle Persistent Reservation operations. Now RBD 
> > > handler side works fine, I can capture the CDBs, analyze it, store/ read 
> > > keys, also other supportive codes done and works fine. 
> >
> > Hello Zhu,
> > 
> > Since handling PR in user space would add significant complexity (due to the 
> > new interfaces between kernel and user space that are required) and also has 
> > significant disadvantages (Mike Christie mentioned the overhead due to the 
> > reservation state check), can you explain why you think it would be useful to 
> > handle PR commands in user space instead of keeping all PR command processing 
> > in the kernel? 
>
> Please kindly correct me if I understand this incorrectly. If we need to setup
> more than one target server on the nodes using the same RBD image, I think we
> may need to store something on the RBD itself or the targets can not sync
> information. I am also trying to work out a simple and easy solution. 

Hello Zhu,

Setting up more than one SCSI target stack is not what I meant. What I meant is
to port the PR synchronization code from SCST to LIO. That code is well isolated
(source files scst/src/scst_{no_,}dlm.[ch]) and the interaction between that code
and the SCST core is limited. I expect that porting this code will require less
work than adding an interface to LIO to export PR information to user space and
to implement PR synchronization in user space.

> > If you are looking at this to support synchronizing the PR state across 
> > multiple nodes in a cluster I think handling PR in user space is wrong because 
> > that means the solution to synchronize the PR state across nodes will be 
> > limited to RBD and won't work for clusters that don't use RBD. 
> > 
> > Are you aware that reliable open source code exists for synchronizing the PR 
> > state across multiple cluster nodes? See also 
> > https://www.linuxplumbersconf.org/2015/ocw/system/presentations/2691/original/Using%20the%20DLM%20as%20a%20Distributed%20In-Memory%20Database.pdf 
>
> Yes, I have seen that last year, thanks for your papers. But I think maybe HA
> stack(dlm, corosync, pacemaker) is a little heavy, overkill for this feature?
> And if we try to set up another HA cluster over ceph cluster, would they conflict
> with each other? Like HA cluster may try to fence a node. 

The SCST DLM code does depend on Corosync, but that dependency is small. The
function scst_dlm_update_nodeids() iterates over the configfs entries created by
the DLM kernel driver to figure out the Corosync node IDs. That code should work
with any HA stack that supports the DLM kernel driver and creates node IDs under
/sys/kernel/config/dlm/cluster/comms. In other words, the SCST DLM code does not
require Pacemaker.
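
To illustrate the kind of lookup described above, here is a minimal user-space
sketch in Python. It is not the SCST code, which is kernel C, and the function
name read_dlm_node_ids() is made up for this example. It assumes that each
directory under /sys/kernel/config/dlm/cluster/comms holds a "nodeid" attribute
and a "local" attribute, which is my understanding of the DLM configfs layout;
please verify this against your kernel before relying on it.

```python
import os

# Directory named in the text above; populated by the HA stack via configfs.
DLM_COMMS_DIR = "/sys/kernel/config/dlm/cluster/comms"


def read_dlm_node_ids(comms_dir=DLM_COMMS_DIR):
    """Return {nodeid: is_local} for every comms entry found.

    Assumption: each entry is a directory with a 'nodeid' file and,
    optionally, a 'local' file containing '1' for the local node.
    """
    nodes = {}
    try:
        entries = sorted(os.listdir(comms_dir))
    except FileNotFoundError:
        # DLM is not configured on this host.
        return nodes

    for name in entries:
        entry = os.path.join(comms_dir, name)
        if not os.path.isdir(entry):
            continue
        try:
            with open(os.path.join(entry, "nodeid")) as f:
                nodeid = int(f.read().strip())
        except (OSError, ValueError):
            # Skip entries without a readable, numeric node ID.
            continue

        is_local = False
        try:
            with open(os.path.join(entry, "local")) as f:
                is_local = f.read().strip() == "1"
        except OSError:
            pass  # Treat a missing 'local' attribute as "not local".

        nodes[nodeid] = is_local
    return nodes


if __name__ == "__main__":
    for nodeid, is_local in read_dlm_node_ids().items():
        print(nodeid, "(local)" if is_local else "")
```

Nothing in this lookup involves Pacemaker; it only needs whatever component
populates the comms directory.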

Please let me know if you need more information.

Bart.