Hi,

Den 02.06.2020 19.05, skrev Felipe Balbi:
>
> Hi,
>
> I missed this completely until now.
> Noralf Trønnes <noralf@xxxxxxxxxxx> writes:
>> This adds the gadget side support for the Generic USB Display. It presents
>> a DRM display device as a USB Display configured through configfs.
>>
>> The display is implemented as a vendor type USB interface with one bulk
>> out endpoint. The protocol is implemented using control requests.
>> lz4 compressed framebuffer data/pixels are sent over the bulk endpoint.
>>
>> The DRM part of the gadget is placed in the DRM subsystem since it reaches
>> into the DRM internals.
>
> First and foremost, could this be done in userspace using the raw gadget
> or f_fs?
>

An uncompressed 1920x1080 RGB565 frame is ~4MB. All frames can be
compressed (lz4), even if only a little, and the received data is
decompressed into the framebuffer of the attached DRM device. AFAIU I
would need to be able to mmap the received bulk buffer if I were to do
this from userspace without killing performance. So it doesn't look like
I can use raw gadget or f_fs.

>> diff --git a/drivers/usb/gadget/function/f_gud_drm.c b/drivers/usb/gadget/function/f_gud_drm.c
>> new file mode 100644
>> index 000000000000..9a2d6bb9739f
>> --- /dev/null
>> +++ b/drivers/usb/gadget/function/f_gud_drm.c
>> @@ -0,0 +1,678 @@
>> +struct f_gud_drm {
>> +	struct usb_function func;
>> +	struct work_struct worker;
>
> why do you need a worker?
>

The gadget runs in interrupt context and I need to call into the DRM
subsystem, which can sleep.

>> +	size_t max_buffer_size;
>> +	void *ctrl_req_buf;
>> +
>> +	u8 interface_id;
>> +	struct usb_request *ctrl_req;
>> +
>> +	struct usb_ep *bulk_ep;
>> +	struct usb_request *bulk_req;
>
> single request? Don't you want to amortize the latency of
> queue->complete by using a series of requests?
>

I use only one request per update or partial update. I kmalloc the
biggest buffer I can get (default 4MB) and tell the host about this
size. If a frame doesn't fit, the host splits it up into partial
updates. I already support partial updates, so this is built in.
Userspace can tell the graphics driver which portion of the framebuffer
it has touched to avoid sending the full frame each time. Having one
contiguous buffer simplifies decompression.

There's a control request preceding the bulk request which tells the
device which area the update is for and whether it is compressed or
not. I did some testing to see if I should avoid the control request
overhead for split updates, but it turns out that the bottleneck is the
decompression. The control request takes just 400-500us, while the
total time from the control request until the buffer is decompressed is
50-100ms (depending on how well the frame compresses).

>> +	struct gud_drm_gadget *gdg;
>> +
>> +	spinlock_t lock; /* Protects the following members: */
>> +	bool ctrl_pending;
>> +	bool status_pending;
>> +	bool bulk_pending;
>> +	bool disable_pending;
>
> could this be a single u32 with #define'd flags? That would be atomic,
> right?
>

I have never grasped all the concurrency issues, but wouldn't I need
memory barriers as well?

>> +	u8 errno;
>
> a global per-function error? Why?
>

This is the result of the previously requested operation. The host will
request this value to see how it went.

I might switch to using a bulk endpoint for status following a
discussion with Alan Stern in the host driver thread in this patch
series. If so I might not need this.

>> +	u16 request;
>> +	u16 value;
>
> also why? Looks odd
>

These values contain the operation (in addition to the payload) that
the worker shall perform following a control-OUT request. control-IN
requests can be handled in interrupt context, so in that case the
payload is queued up immediately.

<snip>

>> +static void f_gud_drm_bulk_complete(struct usb_ep *ep, struct usb_request *req)
>> +{
>> +	struct f_gud_drm *fgd = req->context;
>> +	unsigned long flags;
>> +
>> +	if (req->status || req->actual != req->length)
>> +		return;
>
> so, if we complete with an erroneous status or a short packet, you'll
> ignore it?
>

Hmm yeah. When I wrote this I thought that the bottleneck was the USB
transfers, so I didn't want the host to slow down performance by
requesting this status. Now I know it's the decompression that takes
time, so I could actually do a status check and retry the frame if the
device detects an error.

>> +	spin_lock_irqsave(&fgd->lock, flags);
>> +	fgd->bulk_pending = true;
>> +	spin_unlock_irqrestore(&fgd->lock, flags);
>> +
>> +	queue_work(system_long_wq, &fgd->worker);
>
> Do you need to offset this to a worker?
>

Yes, it's long running (one operation can take >100ms) and it can sleep.

>> +static int f_gud_drm_ctrl_req_set_buffer(struct f_gud_drm *fgd, void *buf, size_t len)
>> +{
>> +	int ret;
>> +
>> +	if (len != sizeof(struct gud_drm_req_set_buffer))
>> +		return -EINVAL;
>> +
>> +	ret = gud_drm_gadget_set_buffer(fgd->gdg, buf);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	if (ret > fgd->max_buffer_size)
>> +		return -EOVERFLOW;
>> +
>> +	fgd->bulk_req->length = ret;
>> +
>> +	return usb_ep_queue(fgd->bulk_ep, fgd->bulk_req, GFP_KERNEL);
>> +}
>> +
>> +static void f_gud_drm_worker(struct work_struct *work)
>> +{
>> +	struct f_gud_drm *fgd = container_of(work, struct f_gud_drm, worker);
>> +	bool ctrl_pending, bulk_pending, disable_pending;
>> +	struct gud_drm_gadget *gdg = fgd->gdg;
>> +	unsigned long flags;
>> +	u16 request, value;
>> +	int ret;
>> +
>> +	spin_lock_irqsave(&fgd->lock, flags);
>> +	request = fgd->request;
>> +	value = fgd->value;
>> +	ctrl_pending = fgd->ctrl_pending;
>> +	bulk_pending = fgd->bulk_pending;
>> +	disable_pending = fgd->disable_pending;
>> +	spin_unlock_irqrestore(&fgd->lock, flags);
>> +
>> +	pr_debug("%s: bulk_pending=%u ctrl_pending=%u disable_pending=%u\n",
>> +		 __func__, bulk_pending, ctrl_pending, disable_pending);
>
> could we use dev_dbg() at least?
>

Sure.

Noralf.
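To make the timing numbers above concrete, here is a minimal sketch of the kind
of flush path being described: a control request says which rectangle the update
covers and whether the payload is lz4 compressed, and the received bulk buffer is
then either copied or decompressed into the framebuffer. The struct and function
names below are made up for illustration and are not the actual gud_drm symbols;
only LZ4_decompress_safe() is the real kernel lz4 API.

#include <linux/errno.h>
#include <linux/lz4.h>
#include <linux/string.h>
#include <linux/types.h>

/* Hypothetical description of one (partial) update, not the real protocol struct */
struct example_update {
	bool compressed;	/* payload on the bulk endpoint is lz4 compressed */
	size_t length;		/* bytes received on the bulk endpoint */
	size_t rect_size;	/* uncompressed size of the damaged rectangle */
};

/* Copy or decompress the received bulk buffer into the framebuffer rectangle */
static int example_flush(void *fb_rect, const void *bulk_buf,
			 const struct example_update *up)
{
	if (!up->compressed) {
		memcpy(fb_rect, bulk_buf, up->length);
		return 0;
	}

	/* LZ4_decompress_safe() returns the decompressed size, or < 0 on error */
	if (LZ4_decompress_safe(bulk_buf, fb_rect, up->length, up->rect_size) < 0)
		return -EIO;

	return 0;
}

In this picture the decompression call over a ~4MB frame is where the 50-100ms
goes, which is why the 400-500us control request overhead doesn't matter much.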
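On the #define'd flags question, a minimal sketch of what Felipe's suggestion
could look like, assuming the bool members are folded into one unsigned long and
manipulated with the atomic bitops. set_bit() and test_and_clear_bit() are atomic,
and test_and_clear_bit() is fully ordered because it returns a value, so in this
common "mark pending + queue_work" pattern extra explicit barriers are normally
not needed. The names below are made up; this is not the actual f_gud_drm code.

#include <linux/bitops.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

/* Hypothetical flag bits, one per former bool member */
#define EX_CTRL_PENDING		0
#define EX_STATUS_PENDING	1
#define EX_BULK_PENDING		2
#define EX_DISABLE_PENDING	3

struct example_func {
	unsigned long flags;		/* replaces the four bools */
	struct work_struct worker;
};

/* Bulk completion handler, runs in interrupt context */
static void example_bulk_complete(struct example_func *fgd)
{
	set_bit(EX_BULK_PENDING, &fgd->flags);	/* atomic, no spinlock needed */
	queue_work(system_long_wq, &fgd->worker);
}

/* Worker, runs in process context and may sleep */
static void example_worker(struct work_struct *work)
{
	struct example_func *fgd = container_of(work, struct example_func, worker);

	if (test_and_clear_bit(EX_BULK_PENDING, &fgd->flags)) {
		/* handle the completed bulk transfer */
	}
}

One reason to keep the spinlock anyway is that it also keeps request/value
consistent with ctrl_pending when the worker snapshots them, which independent
bit operations would not give you.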