On 08/21/2015 06:00 PM, Jerome Glisse wrote: > On Fri, Aug 21, 2015 at 04:15:53PM +0200, Thomas Hellstrom wrote: >> On 08/21/2015 03:32 PM, Jerome Glisse wrote: >>> On Fri, Aug 21, 2015 at 07:25:07AM +0200, Thomas Hellstrom wrote: >>>> On 08/20/2015 10:34 PM, Jerome Glisse wrote: >>>>> On Thu, Aug 20, 2015 at 09:39:12PM +0200, Thomas Hellstrom wrote: >>>>>> On 08/20/2015 04:53 PM, Jerome Glisse wrote: >>>>>>> On Thu, Aug 20, 2015 at 08:48:23AM +0200, Thomas Hellstrom wrote: >>>>>>>> Hi, Tiago! >>>>>>>> >>>>>>>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote: >>>>>>>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory: >>>>>>>>> >>>>>>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.freedesktop.org_archives_dri-2Ddevel_2015-2DAugust_088376.html&d=BQIDAw&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=vpukPkBtpoNQp2IUKuFviOmPNYWVKmen3Jeeu55zmEA&m=EX3w8PM79N5qLeXyogeSPN9J5pIvBV6IKrhBjzwJDEM&s=S3NqF0kPNtBKQ3-WiffL2T9NFK96XBp56vkb3ujf-io&e= >>>>>>>> Hmm, for some reason it doesn't show up in my mail app, but I found it >>>>>>>> in the archives. An attempt to explain the situation from the vmwgfx >>>>>>>> perspective. >>>>>>>> >>>>>>>> The fact that the interface is generic means that people will start >>>>>>>> using it for the zero-copy case. There has been a couple of more or less >>>>>>>> hackish attempts to do this before, and if it's a _driver_ interface we >>>>>>>> don't need to be that careful but if it is a _generic_ interface we need >>>>>>>> to be very careful to make it fit *all* the hardware out there and that >>>>>>>> we make all potential users use the interface in a way that conforms >>>>>>>> with the interface specification. >>>>>>>> >>>>>>>> What will happen otherwise is that apps written for coherent fast >>>>>>>> hardware might, for example, ignore calling the SYNC api, just because >>>>>>>> the app writer only cared about his own hardware on which the app works >>>>>>>> fine. That would fail miserably if the same app was run on incoherent >>>>>>>> hardware, or the incoherent hardware driver maintainers would be forced >>>>>>>> to base an implementation on page-faults which would be very slow. >>>>>>>> >>>>>>>> So assume the following use case: An app updates a 10x10 area using the >>>>>>>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for >>>>>>>> texturing. On some hardware the dma-buf might be tiled in a very >>>>>>>> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only >>>>>>>> accessible using DMA. On vmwgfx the SYNC operation must carry out a >>>>>>>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU >>>>>>>> write and a DMA back again after the write, before GPU usage. On the >>>>>>>> tiled architecture the SYNC operation must untile before CPU access and >>>>>>>> probably tile again before GPU access. >>>>>>>> >>>>>>>> If we now have a one-dimensional SYNC api, in this particular case we'd >>>>>>>> either need to sync a far too large area (1600x10) or call SYNC 10 times >>>>>>>> before writing, and then again after writing. If the app forgot to call >>>>>>>> SYNC we must error. >>>>>>>> >>>>>>>> So to summarize, the fact that the interface is generic IMO means: >>>>>>>> >>>>>>>> 1) Any user must be able to make valid assumptions about the internal >>>>>>>> format of the dma-buf. (untiled, color format, stride etc.) >>>>>>>> 2) Any user *must* call SYNC before and after CPU access. On coherent >>>>>>>> architectures, the SYNC is a NULL operation anyway, and that should be >>>>>>>> documented somewhere so that maintainers of drivers of uncoherent >>>>>>>> architectures have somewhere to point their fingers. >>>>>>>> 3) Driver-specific implementations must be allowed to error (segfault) >>>>>>>> if SYNC has not been used. >>>>>>> I think here you are too lax, the common code must segfault or >>>>>>> error badly if SYNC has not been use in all cases even on cache >>>>>>> coherent arch. The device driver sync callback can still be a >>>>>>> no operation. But i think that we need to insist strongly on a >>>>>>> proper sync call being made (and we should forbid empty area >>>>>>> sync call). This would be the only way to make sure userspace >>>>>>> behave properly as otherwise we endup in the situation you were >>>>>>> describing above, where the app is design on a cache coherent >>>>>>> arch and works fine there but broke in subtle way on non cache >>>>>>> coherent arch and app developer is clueless of why. >>>>>>> >>>>>>> I just do not trust userspace. >>>>>> I agree and ideally i'd want it this way as well. The question is, is it >>>>>> possible? Can this be done without a fault() handler in the generic >>>>>> kernel code? >>>>>> >>>>>>>> 4) The use-case stated above clearly shows the benefit of a >>>>>>>> 2-dimensional sync interface (we want to sync the 10x10 region), but >>>>>>>> what if someone in the future wants to use this interface for a 3D >>>>>>>> texture? Will a 2D sync suffice? Can we make the SYNC interface >>>>>>>> extendable in a way that an enum sync_type member defines the layout of >>>>>>>> the argument, and initially we implement only 1d, 2d sync, leaving 3d >>>>>>>> for the future? >>>>>>>> >>>>>>>> Also, I agree there is probably no good way to generically implement an >>>>>>>> error if SYNC has not been called. That needs to be left as an option to >>>>>>>> drivers. >>>>>>> I think there is, just forbid any further use of the dma buffer, mark >>>>>>> it as invalid and printk a big error. Userspace app developer will >>>>>>> quickly see that something is wrong and looking at kernel log should >>>>>>> explain why. >>>>>> The problem is if someone calls mmap() and then decides to not access >>>>>> the buffer before munmap() or GPU usage. How do we recognize that case >>>>>> and separate it from a CPU access occured outside a sync? We could, as >>>>>> mentioned above have a fault() handler in the generic kernel code and >>>>>> make sure drivers don't populate PTEs until the first fault(). Another >>>>>> option would be to scan all PTEs for an "accessed" flag, but I'm not >>>>>> even sure all CPU architectures have such a flag? >>>>>> >>>>>> But there might be a simpler solution that I have overlooked? >>>>> All arch have a dirty flag, so you could check that, as all we >>>>> really care about is CPU write access. So all you need is clearing >>>>> the dirty bit after each successfull GPU command stream ioctl >>>>> and checking that no dirty bit is set in a region not cover by >>>>> a flush(). Note that the clear and check can happen in the same >>>>> function as part of buffer validation for the GPU command stream. >>>> Actually, I do think we care about reading as well, since reading >>>> without flushing anything written by the GPU will >>>> yield garbage on a non-coherent architecture. Instead we might check for >>>> the PAGE_ACCESSED bit. >>> I do not think the read access after GPU matter. If people think it does >>> then it is simple, at buffer validation in buffer mapping we invalidate >>> CPU page table mapping and the prepare access ioctl populate them after >>> checking that GPU is flush. >>> >>>> So the full version of this would keep track of all synced regions and >>>> at buffer validation time, or unmap time error if there >>>> is a page completely outside the union of all synced regions in any VMA >>>> belonging to the dma-buf address space, then unmark all accessed PTEs >>>> and invalidate TLBs accordingly, typically causing a global TLB flush. >>>> Yeah, we could probably put together a helper function that does this, >>>> but it will be expensive to run and the whole point of this API is to be >>>> able to improve performance. >>> No need for such complexity, the sync ioctl would go over the page table >>> and clear dirty bit for all page in range that is synchronized. So once >>> you get to the command ioctl you know that all dirty bit must be clear. >>> Note this is only a page table walk in all case. I do not consider this >>> to be that expensive. >>> >>> >> The expensive part is the TLB flushing after the PTE update, which >> typically needs to be performed on all cores. Although without a >> benchmark I'm not sure how expensive. It might be that it doesn't >> matter. After all, the vmwgfx- and I believe the qxl driver both use a >> similar approach for fbdev user-space access, but that has never been >> cosidered performance-critical... > There is no need to flush the TLB ! At least not for the write case. > Clearing the dirty bit is cache coherent. TLB are about caching page > table directory tree. To convince yourself look at page_mkclean_one() > in mm/rmap.c which, by the way, is almost what we want (remove the mmu > notifier part and the write protection part and force real CPU cache > flush if needed base on device flag). Hmm? Last time I visited this code, page_mkclean_one() was eventually calling flush_tlb_page() through ptep_clear_flush_notify(). I was under the impression that you always have to flush the TLB when you clear a PTE status bit, but typically not when you set it. Are you saying that that's not the case? The question as I see it is rather whether you want to flush TLB on a per-PTE basis or do it as a global flush after a batched PTE status bit clearing. /Thomas _______________________________________________ dri-devel mailing list dri-devel@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/dri-devel