On Wed, Feb 26, 2014 at 8:32 PM, Jassi Brar <jaswinder.singh@xxxxxxxxxx> wrote:
> On 26 February 2014 23:21, Srikanth Thokala <sthokal@xxxxxxxxxx> wrote:
>> On Mon, Feb 24, 2014 at 7:39 AM, Jassi Brar <jaswinder.singh@xxxxxxxxxx> wrote:
>>> On 21 February 2014 23:37, Srikanth Thokala <sthokal@xxxxxxxxxx> wrote:
>>>> On Thu, Feb 20, 2014 at 3:23 PM, Jassi Brar <jaswinder.singh@xxxxxxxxxx> wrote:
>>>>> On 20 February 2014 14:54, Srikanth Thokala <sthokal@xxxxxxxxxx> wrote:
>>>>>> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh@xxxxxxxxxx> wrote:
>>>>>>> On 18 February 2014 23:16, Srikanth Thokala <sthokal@xxxxxxxxxx> wrote:
>>>>>>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar <jaswinder.singh@xxxxxxxxxx> wrote:
>>>>>>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xxxxxxxxxx> wrote:
>>>>>>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh@xxxxxxxxxx> wrote:
>>>>>>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xxxxxxxxxx> wrote:
>>>>>>>>>>>> The current implementation of the interleaved DMA API supports
>>>>>>>>>>>> multiple frames only when the memory is contiguous, by
>>>>>>>>>>>> incrementing the src_start/dst_start members of the interleaved
>>>>>>>>>>>> template.
>>>>>>>>>>>>
>>>>>>>>>>>> But when the memory is non-contiguous, it restricts the slave
>>>>>>>>>>>> device from submitting multiple frames in a batch. This patch
>>>>>>>>>>>> handles the issue by allowing the slave device to send an array
>>>>>>>>>>>> of interleaved DMA templates, each having a different memory
>>>>>>>>>>>> location.
>>>>>>>>>>>>
>>>>>>>>>>> How fragmented could the memory be in your case? Is it inefficient
>>>>>>>>>>> to submit separate transfers for each segment/frame?
>>>>>>>>>>> It would help if you could give a typical example (chunk size and
>>>>>>>>>>> gap in bytes) of what you worry about.
>>>>>>>>>>
>>>>>>>>>> With the scatter-gather engine feature in the hardware, submitting
>>>>>>>>>> separate transfers for each frame looks inefficient. As an example,
>>>>>>>>>> our DMA engine supports up to 16 video frames, with each frame (a
>>>>>>>>>> typical video frame size) being contiguous in memory but the frames
>>>>>>>>>> scattered into different locations. We definitely could not submit
>>>>>>>>>> frame by frame, as the software overhead (HW interrupting for each
>>>>>>>>>> frame) would result in video lags.
>>>>>>>>>>
>>>>>>>>> IIUIC, it is 30fps and one DMA interrupt per frame ... it doesn't
>>>>>>>>> seem inefficient at all. Even poor-latency audio would generate a
>>>>>>>>> higher interrupt rate. So the "inefficiency concern" doesn't seem
>>>>>>>>> valid to me.
>>>>>>>>>
>>>>>>>>> Not that we shouldn't strive to reduce the interrupt rate further.
>>>>>>>>> Another option is to emulate the ring-buffer scheme of ALSA ...
>>>>>>>>> which should be possible since, for a session of video playback,
>>>>>>>>> the frame buffers' locations wouldn't change.
>>>>>>>>>
>>>>>>>>> Yet another option is to use the full potential of the
>>>>>>>>> interleaved-xfer API as such. It seems you confuse a 'video frame'
>>>>>>>>> with the interleaved-xfer API's 'frame'. They are different.
>>>>>>>>>
>>>>>>>>> Assuming one of your video frames is F bytes long and Gk is the gap
>>>>>>>>> in bytes between the end of frame [k] and the start of frame [k+1],
>>>>>>>>> with Gi != Gj for i != j, then in the context of the
>>>>>>>>> interleaved-xfer API you have just 1 frame of 16 chunks. Each chunk
>>>>>>>>> is F bytes and the inter-chunk gap (ICG) is Gk, where 0 <= k < 15.
>>>>>>>>> So for your use-case .....
>>>>>>>>>     dma_interleaved_template.numf = 1        /* just 1 frame */
>>>>>>>>>     dma_interleaved_template.frame_size = 16 /* containing 16 chunks */
>>>>>>>>>     ......  // other parameters
>>>>>>>>>
>>>>>>>>> You have 3 options to choose from, and all should work just as fine.
>>>>>>>>> Otherwise please state your problem in real numbers (video frames'
>>>>>>>>> size, count & gap in bytes).
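
(To make sure we mean the same layout, here is how I read that suggestion,
written out as a rough, untested sketch. The function name, the frame size,
the gap[] values and the mem-to-dev direction are made-up assumptions for
illustration, not our actual driver code:)

    #include <linux/dmaengine.h>
    #include <linux/slab.h>

    #define NR_VFRAMES 16                    /* video frames per batch */

    /*
     * Describe 16 equally sized, scattered video frames as ONE
     * interleaved-template frame of 16 chunks.  'first_frame' is the bus
     * address of video frame 0, and gap[k] is the hole (in bytes) between
     * the end of video frame k and the start of video frame k+1.
     */
    static struct dma_async_tx_descriptor *
    prep_vframe_batch(struct dma_chan *chan, dma_addr_t first_frame,
                      size_t vframe_bytes, const size_t gap[NR_VFRAMES])
    {
            struct dma_interleaved_template *xt;
            struct dma_async_tx_descriptor *tx;
            int k;

            xt = kzalloc(sizeof(*xt) + NR_VFRAMES * sizeof(xt->sgl[0]),
                         GFP_KERNEL);
            if (!xt)
                    return NULL;

            xt->dir = DMA_MEM_TO_DEV;        /* assumption: MM2S direction */
            xt->src_start = first_frame;     /* address of video frame 0   */
            xt->src_inc = true;
            xt->src_sgl = true;              /* apply per-chunk ICG on the source */
            xt->numf = 1;                    /* one DMA frame ...          */
            xt->frame_size = NR_VFRAMES;     /* ... made of 16 chunks      */
            /* device-side (dst) addressing is driver specific, left out here */

            for (k = 0; k < NR_VFRAMES; k++) {
                    xt->sgl[k].size = vframe_bytes;  /* one contiguous video frame  */
                    xt->sgl[k].icg = gap[k];         /* gap to the next video frame */
            }

            tx = dmaengine_prep_interleaved_dma(chan, xt, DMA_PREP_INTERRUPT);
            kfree(xt);       /* assuming the provider copies what it needs */
            return tx;
    }

So the whole 16-frame batch becomes a single descriptor and, if I am reading
it right, a single completion interrupt.
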
>>>>>>>>
>>>>>>>> Initially I interpreted the interleaved template the same way. But
>>>>>>>> Lars corrected me in the subsequent discussion, so let me put it
>>>>>>>> here briefly:
>>>>>>>>
>>>>>>>> In the interleaved template, each frame represents a line whose size
>>>>>>>> is denoted by chunk.size and whose stride is given by the icg.
>>>>>>>> 'numf' represents the number of frames, i.e. the number of lines.
>>>>>>>>
>>>>>>>> In the video frame context:
>>>>>>>>     chunk.size -> hsize
>>>>>>>>     chunk.icg  -> stride
>>>>>>>>     numf       -> vsize
>>>>>>>> and frame_size is always 1, as there is only one chunk in a line.
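
(In code, that interpretation describes a single hsize x vsize video frame
roughly as below. Again only an illustrative sketch with made-up names;
note that, strictly, the icg is the gap between lines, i.e. stride minus
hsize:)

    #include <linux/dmaengine.h>
    #include <linux/slab.h>

    /* One video frame: vsize lines of hsize bytes, 'stride' bytes apart. */
    static struct dma_async_tx_descriptor *
    prep_one_vframe(struct dma_chan *chan, dma_addr_t frame_addr,
                    size_t hsize, size_t vsize, size_t stride)
    {
            struct dma_interleaved_template *xt;
            struct dma_async_tx_descriptor *tx;

            xt = kzalloc(sizeof(*xt) + sizeof(xt->sgl[0]), GFP_KERNEL);
            if (!xt)
                    return NULL;

            xt->dir = DMA_MEM_TO_DEV;        /* assumption: MM2S direction */
            xt->src_start = frame_addr;
            xt->src_inc = true;
            xt->src_sgl = true;
            xt->numf = vsize;                /* number of lines    */
            xt->frame_size = 1;              /* one chunk per line */
            xt->sgl[0].size = hsize;         /* bytes per line     */
            xt->sgl[0].icg = stride - hsize; /* gap between lines  */

            tx = dmaengine_prep_interleaved_dma(chan, xt, DMA_PREP_INTERRUPT);
            kfree(xt);       /* assuming the provider copies what it needs */
            return tx;
    }

With this reading, one template describes exactly one video frame, which is
why a batch of scattered frames needs either one descriptor per frame or the
array-of-templates change being discussed here.
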
>>>>>>>
>>>>>>> But you said in your last post "with each frame (a typical video
>>>>>>> frame size) being contiguous in memory" ... which is not true from
>>>>>>> what you write above. Anyway, my first 2 suggestions still hold.
>>>>>>
>>>>>> Yes, each video frame is contiguous, and the frames can be scattered.
>>>>>>
>>>>> I assume that by a contiguous frame you mean as in a framebuffer, which
>>>>> is an array of bytes?
>>>>> If yes, then you should do as I suggested first: frame_size = 16 and
>>>>> numf = 1.
>>>>
>>>> I think I am confusing you. I would like to explain with an example.
>>>> Let's say each video frame is 4k in size, one starting at address
>>>> 0x10004000 (-0x10005000), another frame at 0x20002000 (-0x20003000),
>>>> and so on.
>>>>
>>> As I said, please don't confuse a video frame with a DMA frame ... in a
>>> video frame the stride is constant (zero or not), whereas in the DMA
>>> context the stride must be zero for the frame to be called contiguous.
>>>
>>>> So the frames are scattered in memory, and as the template doesn't
>>>> allow multiple src_start/dst_start values, we could not use a single
>>>> template to fill the HW descriptors (of frames). So I feel your
>>>> suggestion might not work if the frames are scattered.
>>>> Also, how could we get the 'vsize' value in your approach?
>>>>
>>> The client driver (video driver) should know the frame parameters.
>>> Practically you'll have to populate 16 transfer templates (frame
>>> attributes and locations won't change for a session) and submit them to
>>> be transferred _cyclically_.
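
(If I follow, from the client side that cyclic scheme would look roughly
like the sketch below: one descriptor per video frame, each with a
completion callback, re-queued as it finishes. The vframe_ring structure
and the function names are made up, and whether a template may be
re-prepared from callback context depends on the DMA provider, so please
treat this purely as an illustration:)

    #include <linux/dmaengine.h>
    #include <linux/errno.h>

    #define NR_VFRAMES 16

    /* Hypothetical per-session state: one template per video frame, filled
     * elsewhere with that frame's address and geometry. */
    struct vframe_ring {
            struct dma_chan *chan;
            struct dma_interleaved_template *xt[NR_VFRAMES];
            int next;                        /* frame to re-queue next */
    };

    static void vframe_done(void *param);

    /* Queue one video frame and ask to be notified when it has been sent. */
    static int queue_vframe(struct vframe_ring *ring, int idx)
    {
            struct dma_async_tx_descriptor *tx;

            tx = dmaengine_prep_interleaved_dma(ring->chan, ring->xt[idx],
                                                DMA_PREP_INTERRUPT);
            if (!tx)
                    return -ENOMEM;

            tx->callback = vframe_done;      /* per-frame notification */
            tx->callback_param = ring;
            dmaengine_submit(tx);
            return 0;
    }

    /* Completion callback: a frame has been sent, so re-queue it (or a
     * fresh buffer from the application) to keep the ring turning. */
    static void vframe_done(void *param)
    {
            struct vframe_ring *ring = param;

            queue_vframe(ring, ring->next);
            ring->next = (ring->next + 1) % NR_VFRAMES;
    }

    /* Session start: prime all 16 frames, then kick the engine once. */
    static void start_session(struct vframe_ring *ring)
    {
            int k;

            for (k = 0; k < NR_VFRAMES; k++)
                    queue_vframe(ring, k);
            ring->next = 0;
            dma_async_issue_pending(ring->chan);
    }
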
>>>
>>>> More importantly, we are overriding the semantics of the interleaved
>>>> template members.
>>>>
>>> Not at all. Interleaved-dma isn't meant only for constant stride/icg
>>> transfers. Rather, it's for identical frames with random strides.
>>>
>>>>
>>>>> If no, then it seems you are already doing the right thing ... the
>>>>> ring-buffer scheme. Please share some stats on how the current API is
>>>>> causing you overhead, because that is a very common case (many
>>>>> controllers support LLI) and you have 467ms (@30fps with a 16-frame
>>>>> ring-buffer) to queue in before you see any frame drop.
>>>>
>>>> As I mentioned earlier in the thread, our hardware has an SG engine
>>>> whereby we could send multiple frames in a batch. Using the original
>>>> implementation of the interleaved API, we have three options to
>>>> transfer.
>>>>
>>>> One is to send frame by frame to the hardware. We get an async_tx
>>>> descriptor for each frame and then submit it to the hardware,
>>>> triggering it to transfer this BD. We queue the next descriptor on the
>>>> pending queue, and whenever there is a completion interrupt we submit
>>>> the next BD on this queue to the hardware. In this implementation we
>>>> are not using the SG engine in the hardware efficiently, as we are
>>>> transferring frame by frame even though the HW allows us to transfer
>>>> multiple frames.
>>>>
>>> Sending one frame at a time will likely cause jitters. That shouldn't
>>> be done.
>>>
>>>> The second option is to queue the BDs up to the maximum number of
>>>> frames the HW is capable of transferring in a batch and then submit
>>>> them to the SG engine in the HW. With this approach I feel there will
>>>> be additional software overhead to track the maximum number of
>>>> transfers, and a few additional cycles to release the cookie of each
>>>> descriptor. Here, each descriptor represents a frame.
>>>>
>>> APIs are written for the gcd of h/w. We should optimize only for the
>>> majority, and here nothing is even gained by changing the API. What
>>> you want to 'optimize' has been there since ever; nobody considers
>>> that overhead. BTW, you don't even want to spend a few 'extra cycles',
>>> but what about every other platform that doesn't support this and will
>>> have to scan for such transfer requests?
>>>
>>>> The last option is the current implementation of the driver along with
>>>> this change in the API. It allows us to send an array of interleaved
>>>> templates, wherein we allocate a single async descriptor which handles
>>>> multiple frames (or segments) and just submit this descriptor to the
>>>> HW. Then we program the current pointer to the first frame and the
>>>> tail to the last frame, and the HW will complete this transfer. Here,
>>>> each descriptor represents multiple frames.
>>>>
>>>> My point here is that the driver should use hardware resources
>>>> efficiently, and I feel the driver will be inefficient if we don't use
>>>> them.
>>>>
>>> I am afraid you are getting carried away by the 'awesomeness' of your
>>> hardware. Ring buffers/cyclic transfers are meant for cases just like
>>> yours.
>>> Consider the following ... if you queue 16 frames and don't care to
>>> track them before all are transmitted, you'll have very high latency.
>>> Video seeks will take unacceptably long and give the impression of a
>>> slow system. Whereas if you get a callback for each frame rendered, you
>>> could update frames from the next one, thereby having a very quick
>>> response time.
>>
>> Not at all; at least in our hardware, when we submit transfers they are
>> most likely to be completed. There are a few errors, but most are
>> recoverable, and they don't stop the transfer unless there is a critical
>> error which needs a reset of the whole system. So, at least in my use
>> case, there will be no latency. I am not saying the hardware is great,
>> but that is how the IP is implemented.
>>
> I am not talking about the error cases.
> Apply your patch locally so that you queue 16 frames and don't get
> notified upon each frame 'rendered' ... now click on the 'seek bar' of
> the video player. See how slow it is to jump to play from the new
> location. Or, if the frames are to be encoded after the DMA transfer,
> see the 'padding' that would need to be done at the end. These concerns
> are common with the audio subsystem using ring buffers.
> Anyway, that is just FYI; I don't care how you implement your platform.
>
>> Regarding this API change, I had earlier explained my use case in my v2
>> thread. Lars and Vinod came up with this resolution to allow an array
>> of interleaved templates. I also feel this is a reasonable change for
>> the subsystem.
>>
> I don't see you getting any better performance from your hardware, and I
> certainly don't find your use case anything new (many DMA controllers
> support LLI and they work just fine as such). And I have already
> explained how you could do whatever you want without this change.
> So a polite NAK from me.

Ok. I will go with your suggestion of having frame_size = 16 and will send
a v4 dropping this change. Thanks for the valuable suggestions.

Srikanth