Re: [PATCH v3 6/6] usb: gadget: function: Add Generic USB Display support

Noralf Trønnes <noralf@xxxxxxxxxxx> · Tue, 2 Jun 2020 21:14:59 +0200

Hi,

Den 02.06.2020 19.05, skrev Felipe Balbi:
> 
> Hi,
> 
> I missed this completely until now.
> Noralf Trønnes <noralf@xxxxxxxxxxx> writes:
>> This adds the gadget side support for the Generic USB Display. It presents
>> a DRM display device as a USB Display configured through configfs.
>>
>> The display is implemented as a vendor type USB interface with one bulk
>> out endpoint. The protocol is implemented using control requests.
>> lz4 compressed framebuffer data/pixels are sent over the bulk endpoint.
>>
>> The DRM part of the gadget is placed in the DRM subsystem since it reaches
>> into the DRM internals.
> 
> First and foremost, could this be done in userspace using the raw gadget
> or f_fs?
> 

An uncompressed 1920x1080-RGB565 frame is ~4MB. All frames can be
compressed (lz4) even if just a little, so this is decompressed into the
framebuffer of the attached DRM device. AFAIU I would need to be able to
mmap the received bulk buffer if I were to do this from userspace
without killing performance. So it doesn't look like I can use raw
gadget or f_fs.

>> diff --git a/drivers/usb/gadget/function/f_gud_drm.c b/drivers/usb/gadget/function/f_gud_drm.c
>> new file mode 100644
>> index 000000000000..9a2d6bb9739f
>> --- /dev/null
>> +++ b/drivers/usb/gadget/function/f_gud_drm.c
>> @@ -0,0 +1,678 @@
>> +struct f_gud_drm {
>> +	struct usb_function func;
>> +	struct work_struct worker;
> 
> why do you need a worker?
> 

The gadget runs in interrupt context and I need to call into the DRM
subsystem which can sleep.

>> +	size_t max_buffer_size;
>> +	void *ctrl_req_buf;
>> +
>> +	u8 interface_id;
>> +	struct usb_request *ctrl_req;
>> +
>> +	struct usb_ep *bulk_ep;
>> +	struct usb_request *bulk_req;
> 
> single request? Don't you want to amortize the latency of
> queue->complete by using a series of requests?
> 

I use only one request per update or partial update.
I kmalloc the biggest buffer I can get (default 4MB) and tell the host
about this size. If a frame doesn't fit, the host splits it up into
partial updates. I already support partial updates so this is built in.
Userspace can tell the graphics driver which portion of the framebuffer
it has touched to avoid sending the full frame each time.
Having one continous buffer simplifies decompression.

There's a control request preceding the bulk request which tells the
area the update is for and whether it is compressed or not.
I did some testing to see if I should avoid the control request overhead
for split updates, but it turns out that the bottleneck is the
decompression. The control request is just 400-500us, while the total
time from control request to buffer is decompressed is 50-100ms
(depending on how well the frame compresses).

>> +	struct gud_drm_gadget *gdg;
>> +
>> +	spinlock_t lock; /* Protects the following members: */
>> +	bool ctrl_pending;
>> +	bool status_pending;
>> +	bool bulk_pending;
>> +	bool disable_pending;
> 
> could this be a single u32 with #define'd flags? That would be atomic,
> right?
> 

I have never grasped all the concurrency issues, but wouldn't I need
memory barriers as well?

>> +	u8 errno;
> 
> a global per-function error? Why?
> 

This is the result of the previously request operation. The host will
request this value to see how it went. I might switch to using a bulk
endpoint for status following a discussion with Alan Stern in the host
driver thread in this patch series. If so I might not need this.

>> +	u16 request;
>> +	u16 value;
> 
> also why? Looks odd
> 

These values contains the operation (in addition to the payload) that
the worker shall perform following the control-OUT requests.

control-IN requests can run in interrupt context so in that case the
payload is queued up immediately.

<snip>

>> +static void f_gud_drm_bulk_complete(struct usb_ep *ep, struct usb_request *req)
>> +{
>> +	struct f_gud_drm *fgd = req->context;
>> +	unsigned long flags;
>> +
>> +	if (req->status || req->actual != req->length)
>> +		return;
> 
> so, if we complete with an erroneous status or a short packet, you'll
> ignore it?
> 

Hmm yeah. When I wrote this I thought that the bottleneck was the USB
transfers, so I didn't want the host to slow down performance by
requesting this status. Now I know it's the decompression that takes
time, so I could actually do a status check and retry the frame if the
device detects an error.

>> +	spin_lock_irqsave(&fgd->lock, flags);
>> +	fgd->bulk_pending = true;
>> +	spin_unlock_irqrestore(&fgd->lock, flags);
>> +
>> +	queue_work(system_long_wq, &fgd->worker);
> 
> Do you need to offset this to a worker?
> 

Yes, long running (one operation can be >100ms) and can sleep.

>> +static int f_gud_drm_ctrl_req_set_buffer(struct f_gud_drm *fgd, void *buf, size_t len)
>> +{
>> +	int ret;
>> +
>> +	if (len != sizeof(struct gud_drm_req_set_buffer))
>> +		return -EINVAL;
>> +
>> +	ret = gud_drm_gadget_set_buffer(fgd->gdg, buf);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	if (ret > fgd->max_buffer_size)
>> +		return -EOVERFLOW;
>> +
>> +	fgd->bulk_req->length = ret;
>> +
>> +	return usb_ep_queue(fgd->bulk_ep, fgd->bulk_req, GFP_KERNEL);
>> +}
>> +
>> +static void f_gud_drm_worker(struct work_struct *work)
>> +{
>> +	struct f_gud_drm *fgd = container_of(work, struct f_gud_drm, worker);
>> +	bool ctrl_pending, bulk_pending, disable_pending;
>> +	struct gud_drm_gadget *gdg = fgd->gdg;
>> +	unsigned long flags;
>> +	u16 request, value;
>> +	int ret;
>> +
>> +	spin_lock_irqsave(&fgd->lock, flags);
>> +	request = fgd->request;
>> +	value = fgd->value;
>> +	ctrl_pending = fgd->ctrl_pending;
>> +	bulk_pending = fgd->bulk_pending;
>> +	disable_pending = fgd->disable_pending;
>> +	spin_unlock_irqrestore(&fgd->lock, flags);
>> +
>> +	pr_debug("%s: bulk_pending=%u ctrl_pending=%u disable_pending=%u\n",
>> +		 __func__, bulk_pending, ctrl_pending, disable_pending);
> 
> could we use dev_dbg() at least?
> 

Sure.

Noralf.