From: Tianyu Lan <ltykernel@xxxxxxxxx> Sent: Monday, August 9, 2021 10:56 AM
>

Subject line tag should be "scsi: storvsc:"

> In Isolation VM, all shared memory with host needs to mark visible
> to host via hvcall. vmbus_establish_gpadl() has already done it for
> storvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_
> mpb_desc() still need to handle. Use DMA API to map/umap these

s/need to handle/needs to be handled/

> memory during sending/receiving packet and Hyper-V DMA ops callback
> will use swiotlb function to allocate bounce buffer and copy data
> from/to bounce buffer.
>
> Signed-off-by: Tianyu Lan <Tianyu.Lan@xxxxxxxxxxxxx>
> ---
>  drivers/scsi/storvsc_drv.c | 68 +++++++++++++++++++++++++++++++++++---
>  1 file changed, 63 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index 328bb961c281..78320719bdd8 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -21,6 +21,8 @@
>  #include <linux/device.h>
>  #include <linux/hyperv.h>
>  #include <linux/blkdev.h>
> +#include <linux/io.h>
> +#include <linux/dma-mapping.h>
>  #include <scsi/scsi.h>
>  #include <scsi/scsi_cmnd.h>
>  #include <scsi/scsi_host.h>
> @@ -427,6 +429,8 @@ struct storvsc_cmd_request {
>  	u32 payload_sz;
>
>  	struct vstor_packet vstor_packet;
> +	u32 hvpg_count;

This count is really the number of entries in the dma_range
array, right?  If so, perhaps "dma_range_count" would be
a better name so that it is more tightly associated.

> +	struct hv_dma_range *dma_range;
>  };
>
>
> @@ -509,6 +513,14 @@ struct storvsc_scan_work {
>  	u8 tgt_id;
>  };
>
> +#define storvsc_dma_map(dev, page, offset, size, dir) \
> +	dma_map_page(dev, page, offset, size, dir)
> +
> +#define storvsc_dma_unmap(dev, dma_range, dir) \
> +	dma_unmap_page(dev, dma_range.dma, \
> +		       dma_range.mapping_size, \
> +		       dir ? DMA_FROM_DEVICE : DMA_TO_DEVICE)
> +

Each of these macros is used only once.  IMHO, they don't
add a lot of value.
Just coding dma_map/unmap_page() inline would be fine
and eliminate these lines of code.

>  static void storvsc_device_scan(struct work_struct *work)
>  {
>  	struct storvsc_scan_work *wrk;
> @@ -1260,6 +1272,7 @@ static void storvsc_on_channel_callback(void *context)
>  	struct hv_device *device;
>  	struct storvsc_device *stor_device;
>  	struct Scsi_Host *shost;
> +	int i;
>
>  	if (channel->primary_channel != NULL)
>  		device = channel->primary_channel->device_obj;
> @@ -1314,6 +1327,15 @@ static void storvsc_on_channel_callback(void *context)
>  				request = (struct storvsc_cmd_request *)scsi_cmd_priv(scmnd);
>  			}
>
> +			if (request->dma_range) {
> +				for (i = 0; i < request->hvpg_count; i++)
> +					storvsc_dma_unmap(&device->device,
> +						request->dma_range[i],
> +						request->vstor_packet.vm_srb.data_in == READ_TYPE);

I think you can directly get the DMA direction as request->cmd->sc_data_direction.

> +
> +				kfree(request->dma_range);
> +			}
> +
>  			storvsc_on_receive(stor_device, packet, request);
>  			continue;
>  		}
> @@ -1810,7 +1832,9 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
>  		unsigned int hvpgoff, hvpfns_to_add;
>  		unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
>  		unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
> +		dma_addr_t dma;
>  		u64 hvpfn;
> +		u32 size;
>
>  		if (hvpg_count > MAX_PAGE_BUFFER_COUNT) {
>
> @@ -1824,6 +1848,13 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
>  		payload->range.len = length;
>  		payload->range.offset = offset_in_hvpg;
>
> +		cmd_request->dma_range = kcalloc(hvpg_count,
> +				 sizeof(*cmd_request->dma_range),
> +				 GFP_ATOMIC);

With this patch, it appears that storvsc_queuecommand() is always
doing bounce buffering, even when running in a non-isolated VM.
The dma_range is always allocated, and the inner loop below does
the dma mapping for every I/O page.
The corresponding code in storvsc_on_channel_callback() that
does the dma unmap allows for the dma_range to be NULL, but
that never happens.

> +		if (!cmd_request->dma_range) {
> +			ret = -ENOMEM;

The other memory allocation failure in this function returns
SCSI_MLQUEUE_DEVICE_BUSY.  It may be debatable as to whether
that's the best approach, but that's a topic for a different patch.
I would suggest being consistent and using the same return code
here.

> +			goto free_payload;
> +		}
>
>  		for (i = 0; sgl != NULL; sgl = sg_next(sgl)) {
>  			/*
> @@ -1847,9 +1878,29 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
>  			 * last sgl should be reached at the same time that
>  			 * the PFN array is filled.
>  			 */
> -			while (hvpfns_to_add--)
> -				payload->range.pfn_array[i++] = hvpfn++;
> +			while (hvpfns_to_add--) {
> +				size = min(HV_HYP_PAGE_SIZE - offset_in_hvpg,
> +					   (unsigned long)length);
> +				dma = storvsc_dma_map(&dev->device, pfn_to_page(hvpfn++),
> +						      offset_in_hvpg, size,
> +						      scmnd->sc_data_direction);
> +				if (dma_mapping_error(&dev->device, dma)) {
> +					ret = -ENOMEM;

The typical error from dma_map_page() will be running out of
bounce buffer memory.  This is a transient condition that should be
retried at the higher levels.  So make sure to return an error code
that indicates the I/O should be resubmitted.

> +					goto free_dma_range;
> +				}
> +
> +				if (offset_in_hvpg) {
> +					payload->range.offset = dma & ~HV_HYP_PAGE_MASK;
> +					offset_in_hvpg = 0;
> +				}

I'm not clear on why payload->range.offset needs to be set again.
Even after the dma mapping is done, doesn't the offset in the first
page have to be the same?  If it wasn't the same, Hyper-V wouldn't
be able to process the PFN list correctly.  In fact, couldn't the above
code just always set offset_in_hvpg = 0?
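
To make the offset question concrete, here is a rough userspace sketch
(not kernel code, and not tested against the driver) of how the inner
loop splits a single contiguous buffer into per-page mapping sizes.
The names hvpfn_up() and split_into_hvpages() are made up for this
illustration, and a 4 KiB Hyper-V page size is assumed.  Only the first
page starts mid-page; every later page starts at offset 0.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4 KiB Hyper-V page size, matching HV_HYP_PAGE_SIZE in the driver. */
#define HV_HYP_PAGE_SHIFT 12
#define HV_HYP_PAGE_SIZE  (1UL << HV_HYP_PAGE_SHIFT)

/* Hypothetical stand-in for HVPFN_UP(): bytes rounded up to whole pages. */
static unsigned int hvpfn_up(unsigned long bytes)
{
	return (unsigned int)((bytes + HV_HYP_PAGE_SIZE - 1) >> HV_HYP_PAGE_SHIFT);
}

/*
 * Hypothetical helper: record in sizes[] the number of bytes that fall in
 * each Hyper-V page of a buffer starting offset_in_hvpg bytes into its
 * first page.  Returns the page count, or 0 if sizes[] is too small.
 */
static unsigned int split_into_hvpages(unsigned long offset_in_hvpg,
				       unsigned long length,
				       uint32_t *sizes, unsigned int max)
{
	unsigned int n = hvpfn_up(offset_in_hvpg + length);
	unsigned int i;

	assert(offset_in_hvpg < HV_HYP_PAGE_SIZE);
	if (n > max)
		return 0;

	for (i = 0; i < n; i++) {
		unsigned long room = HV_HYP_PAGE_SIZE - offset_in_hvpg;
		unsigned long size = length < room ? length : room;

		sizes[i] = (uint32_t)size;
		length -= size;
		offset_in_hvpg = 0;	/* only the first page starts mid-page */
	}
	return n;
}
```

For example, a buffer that starts 512 bytes into a page and is 8192
bytes long spans three pages, with the offset mattering only for the
first one.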
> +
> +				cmd_request->dma_range[i].dma = dma;
> +				cmd_request->dma_range[i].mapping_size = size;
> +				payload->range.pfn_array[i++] = dma >> HV_HYP_PAGE_SHIFT;
> +				length -= size;
> +			}
>  		}
> +		cmd_request->hvpg_count = hvpg_count;

This line just saves the size of the dma_range array.  Could
it be moved up with the code that allocates the dma_range
array?  To me, it would make more sense to have all that
code together in one place.

>  	}

The whole approach here is to do dma remapping on each individual page
of the I/O buffer.  But wouldn't it be possible to use dma_map_sg() to map
each scatterlist entry as a unit?  Each scatterlist entry describes a range of
physically contiguous memory.  After dma_map_sg(), the resulting dma
address must also refer to a physically contiguous range in the swiotlb
bounce buffer memory.   So at the top of the "for" loop over the scatterlist
entries, do dma_map_sg() if we're in an isolated VM.  Then compute the
hvpfn value based on the dma address instead of sg_page().  But everything
else is the same, and the inner loop for populating the pfn_array is unmodified.
Furthermore, the dma_range array that you've added is not needed, since
scatterlist entries already have a dma_address field for saving the mapped
address, and dma_unmap_sg() uses that field.

One thing:  There's a maximum swiotlb mapping size, which I think works
out to be 256 Kbytes.  See swiotlb_max_mapping_size().  We need to make
sure that we don't get a scatterlist entry bigger than this size.  But I think
this already happens because you set the device->dma_mask field in
Patch 11 of this series.  __scsi_init_queue checks for this setting and
sets max_sectors to limit transfers to the max mapping size.
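
A rough userspace sketch of the idea (again not kernel code, and
untested against the driver): once an entry has been mapped as a unit,
the PFN list follows directly from its dma address.  The names
struct mapped_entry, fill_pfns() and ASSUMED_MAX_MAPPING are made up
for this illustration.  The 256 KiB limit is hard-coded as an
assumption; real code would use swiotlb_max_mapping_size() instead.

```c
#include <stdint.h>

/* Assumed 4 KiB Hyper-V page size, matching HV_HYP_PAGE_SIZE in the driver. */
#define HV_HYP_PAGE_SHIFT 12
#define HV_HYP_PAGE_SIZE  (1ULL << HV_HYP_PAGE_SHIFT)

/* Assumption: the 256 Kbyte swiotlb limit mentioned above, hard-coded. */
#define ASSUMED_MAX_MAPPING (256u * 1024u)

/* Hypothetical stand-in for a scatterlist entry after dma_map_sg(). */
struct mapped_entry {
	uint64_t dma_address;	/* start of the contiguous mapped range */
	uint32_t length;	/* bytes in the range */
};

/*
 * Hypothetical helper: fill pfn_array[] with consecutive Hyper-V PFNs
 * covering one mapped entry.  Returns the number of PFNs written, or 0
 * if the entry exceeds the assumed limit or pfn_array[] is too small.
 */
static unsigned int fill_pfns(const struct mapped_entry *e,
			      uint64_t *pfn_array, unsigned int max)
{
	uint64_t offset = e->dma_address & (HV_HYP_PAGE_SIZE - 1);
	uint64_t hvpfn = e->dma_address >> HV_HYP_PAGE_SHIFT;
	unsigned int n = (unsigned int)((offset + e->length +
					 HV_HYP_PAGE_SIZE - 1) >> HV_HYP_PAGE_SHIFT);
	unsigned int i;

	if (e->length > ASSUMED_MAX_MAPPING || n > max)
		return 0;

	/* The mapped range is contiguous, so the PFNs simply increment. */
	for (i = 0; i < n; i++)
		pfn_array[i] = hvpfn++;

	return n;
}
```

The inner loop is the same "pfn_array[i++] = hvpfn++" as before the
patch; only the starting hvpfn changes, which is why no per-page
bookkeeping array is needed in this sketch.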
>
>  	cmd_request->payload = payload;
> @@ -1860,13 +1911,20 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
>  	put_cpu();
>
>  	if (ret == -EAGAIN) {
> -		if (payload_sz > sizeof(cmd_request->mpb))
> -			kfree(payload);
>  		/* no more space */
> -		return SCSI_MLQUEUE_DEVICE_BUSY;
> +		ret = SCSI_MLQUEUE_DEVICE_BUSY;
> +		goto free_dma_range;
>  	}
>
>  	return 0;
> +
> +free_dma_range:
> +	kfree(cmd_request->dma_range);
> +
> +free_payload:
> +	if (payload_sz > sizeof(cmd_request->mpb))
> +		kfree(payload);
> +	return ret;
>  }
>
>  static struct scsi_host_template scsi_driver = {
> --
> 2.25.1