On 01/29/2018 08:03 AM, Michal Privoznik wrote: > On 01/26/2018 04:07 AM, John Ferlan wrote: >> >> >> On 01/18/2018 11:04 AM, Michal Privoznik wrote: >>> This is a definition that holds information on SCSI persistent >>> reservation settings. The XML part looks like this: >>> >>> <reservations enabled='yes' managed='no'> >>> <source type='unix' path='/path/to/qemu-pr-helper.sock' mode='client'/> >>> </reservations> >>> >>> If @managed is set to 'yes' then the <source/> is not parsed. >>> This design was agreed on here: >>> >>> https://www.redhat.com/archives/libvir-list/2017-November/msg01005.html >>> >>> Signed-off-by: Michal Privoznik <mprivozn@xxxxxxxxxx> >>> --- >>> docs/formatdomain.html.in | 25 +++- >>> docs/schemas/domaincommon.rng | 19 +-- >>> docs/schemas/storagecommon.rng | 34 +++++ >>> src/conf/domain_conf.c | 36 +++++ >>> src/libvirt_private.syms | 3 + >>> src/util/virstoragefile.c | 148 +++++++++++++++++++++ >>> src/util/virstoragefile.h | 15 +++ >>> .../disk-virtio-scsi-reservations-not-managed.xml | 40 ++++++ >>> .../disk-virtio-scsi-reservations.xml | 38 ++++++ >>> .../disk-virtio-scsi-reservations-not-managed.xml | 1 + >>> .../disk-virtio-scsi-reservations.xml | 1 + >>> tests/qemuxml2xmltest.c | 4 + >>> 12 files changed, 348 insertions(+), 16 deletions(-) >>> create mode 100644 tests/qemuxml2argvdata/disk-virtio-scsi-reservations-not-managed.xml >>> create mode 100644 tests/qemuxml2argvdata/disk-virtio-scsi-reservations.xml >>> create mode 120000 tests/qemuxml2xmloutdata/disk-virtio-scsi-reservations-not-managed.xml >>> create mode 120000 tests/qemuxml2xmloutdata/disk-virtio-scsi-reservations.xml >>> >> >> Before digging too deep into this... >> >> - I assume we're avoiding <disk> iSCSI mainly because those >> reservations would take place elsewhere, safe assumption? > > I believe so, but I'll let Paolo answer that. The way I understand > reservations is that qemu needs to issue 'privileged' SCSI commands and > thus for regular SCSI (which for purpose of this argument involves iSCSI > emulated by kernel) either qemu needs CAP_SYS_RAWIO or a helper process > to which it'll pass the FD and which will issue the 'privileged' SCSI > commands on qemu's behalf. > Just trying to cover all the bases... Disks, Hostdevs, Storage Pools, SCSI, iSCSI, NPIV, etc. A few different ways to figure LUNs. I think you're right regarding the privileged command - it's what I have a recollection of when dealing with PR before with the rawio and sgio properties. >> >> - What about using lun's from a storage pool (and what could become >> your favorite, NPIV devices ;-)) >> >> <disk type='volume' device='lun'> >> <driver name='qemu' type='raw'/> >> <source pool='sourcepool' volume='unit:0:4:0'/> >> <target dev='sda' bus='scsi'/> >> </disk> > > These should work too with my patches (not tested though - I don't have > any real SCSI machine). > I have access to the NPIV environment I use and can let you use it as well... So that part is not a problem. IIRC, this was a two pronged question, but I neglected to go back and be more explicit.. The second half of the question is when using a storage pool for your LUN, then how is the PR added? The <reservations> are only being processed from one very specific disk source type. For reference, see how many times encryption is added to various RNG definitions - including diskSourceVolume which is what I think you'd want to process a 'srcpool' definition. The virStorageTranslateDiskSourcePool handles the magic of generating the disk like structure used when building the command line. There's also a 'mode' attribute where if set to 'host' for an iSCSI LUN's will result in using the "local" filesystem path to the LUN rather than going through the iSCSI server. I would think for this case, then one would have to have the PR set up (if configured of course). >> >> - What about <hostdev>'s? >> >> <hostdev mode='subsystem' type='scsi'> >> >> but not iSCSI or vHost hostdev's. I think that creates the SCSI >> generic LUN, but it's been a while since I've thought about the >> terminology used for hostdev's... > > I think these don't need the feature since qemu can access the device > directly. > Well that'll make life a bit easier. IIRC, this creates a /dev/sgN on the guest. >> >> I also have this faint recollection of PR related to sgio filtering >> and perhaps even rawio, but dredging that back up could send me down the >> path of some "downstream only" type bz's. Although searching on just >> VIR_DOMAIN_DISK_DEVICE_LUN does bring up qemuSetUnprivSGIO. >> >> And finally... I assume there is one qemu-pr-manager (qemu.conf changes >> eventually)... Eventually there's magic that allows/adds per domain >> *and* per LUN some socket path. If libvirt provided it's generated via >> the domain temporary directory; however, what's not really clear is how >> that unmanaged path really works. Need a virtual whiteboard... > > So, in case of unmanaged path, here are the assumptions that my patches > are built on: > > 1) unmanaged helper process (UHP) is spawned by somebody else's than > libvirtd (hence unmanaged) - it doesn't have to be user, it can be > systemd for instance. > > 2) path to UHP's socket has to be labeled correctly - libvirt doesn't > touch that, because it knows nothing about usage scenario, whether sys > admin intended one UHP per whole host and thus configured label that > way, or it is spawned by mgmt app (or systemd, or whomever) per one > domain, or even disk. Therefore, we can do nothing more than shrug > shoulders and require users to label the socket correctly. Or use > managed helper. > > 3) in future, when UHP dies, libvirt will NOT spawn it again. It's > unmanaged after all. It's user/sysadmin responsibility to spawn it > again. > > Now, for the managed helper process (MHP) the assumptions are: > > 1) there's one MHP per domain (all SCSI disks in the domain share the > same MHP). > > 2) the MHP runs as root, but is placed into the same CGroup, mount > namespace as qemu process it serves > > 3) MHP is lives and dies with the domain it is associated with. > > The code might be complicated more than needed - it is prepared to have > one MHP per disk rather than domain (should we ever need it). Therefore > instead of storing one pid_t, we store them in a hash table where more > can be stored. > I know the design was agreed upon previously, but nothing like actual patches to open Pandora's Box again ;-). I might be off in the weeds here, but taking a step back and typing thoughts... If MHP is a QEMU related process - IOW, something QEMU creates, then why is it libvirt's responsibility to manage it? Why would/should libvirt be responsible for restarting it if it goes away? The death would be noticeable if we were notified that the socket closed or perhaps if QEMU sent an event when it noticed. If the daemon was having a problem, then wouldn't restarting it possibly cause a never ending fail/restart cycle? What/Who stops that? Perhaps the UHP model is how we should be treating things - that is "mostly" not our problem. Shouldn't the extent of libvirt's "management" be nothing more than using it and handling when the connection dies by some action defined for the device that's using it (e.g. domain shutdown, domain pause, device hot unplug)?. Subsequent hot-plug's would need to determine that the connection to the daemon can be made; otherwise, the attempt fails. That may need to be something checked after the device add. Defining the close action on a device by device basis allows the domain configurer to make a choice based on whether it's a boot/system/important disk/lun or just something being used for something. Another random, but related thought... QEMU has made this change to add this PR daemon to manage something and communicate with libvirt by some defined socket... Someone else says, great idea, let's copy that and libvirt will be able to manage it. After all imitation is a form of flattery. That leaves libvirt trying to fit this new thing into the existing model - either by liberally copying the same code and changing the names to protect the innocent or refactoring to generate common paths for separate technologies. Still in the long run what we have for the current mechanism is a socket by path. So if we consider that the common point why not have a way to define socket paths to external daemon's by path. IOW: Trying to make this generic enough that we can just have a "type" that would be able to label the type of socket (e.g. "PR") <disk>... <source>.. <socket type='unix' path='/path/to/socket.sock' mode='client|server' closeaction='shutdown|pause|unplug' [daemon='/path/to/daemon' args='list of daemon args']>NAME</socket> Where : optional daemon & args will be used to spin out a new daemon NAME is just some string, but must be UNIQUE amongs other sockets (we could store these in a hash table any number of sockets) The thought behind 'daemon' attribute is to avoid the qemu.conf changes... The args allows defining a way to start the daemon debug or however it's desired - it's just a string copied/passed to the daemon. I would think it's up to daemon code to make all the rules and manage everything including security, cgroups, acls, etc. etc. We don't care if it's a LUN, a Disk, an LV, a Block device, etc. If the daemon wants to somehow support migration, then have at it. We probably should be as hands off as possible. OK - rambled on long enough ;-) John FWIW: Coverity issues found in current series... Patch 9: * In qemuProcessSetupOnePRDaemon, @pidfile is NULL if jump to cleanup if prHelperName is not executable causing a unlink(NULL) which is bad. Also the @cfg is leaked. Patch 10: * In qemuDomainObjPrivateXMLParsePRs, virXPathNodeSet can return a -1 into an unsigned @n value, thus the subsequent for loop will take a while. -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list