On 06/08/2019 21:25, Alyssa Rosenzweig wrote:
>>> It's not obvious to me when it actually needs to be enabled. Besides
>>> the errata, it's only when... device_nr=1 for a compute-only job in
>>> kbase?
>>>
>>> I'm afraid I don't know nearly enough about how kbase plumbs CL to
>>> grok the significance...
>>
>> Figuring out the nr_core_groups was the complicated part of this as I
>> recall. Seems like we should at least figure out if we need (or will
>> need) PANFROST_JD_REQ_CORE_GRP_MASK added to the UAPI as well.
>
> I suspect this is something OpenCL/Vulkan specific. Hopefully Stephen
> can shine some light here :)

*switches torch on*...

Ok, this is actually a lot more complex than it first appears, so I'll have to start with a bit of background:

Mali Midgard GPUs have 2 "thread creators" per core: one for fragment threads and one for 'compute' threads (vertex work is considered compute in this context). Each core has a fixed number of threads (e.g. 256 for the early GPUs) which get divided between fragment and compute work - the split is effectively round-robin between the two thread creators, but I think there's some extra 'magic' in the hardware. The idea is that for graphics you can run fragment and vertex workloads at the same time on the core and make better use of the hardware (i.e. fragment threads are using the texturing hardware while vertex threads are using the ALUs).

However, two things stand in the way of this working nicely:

1. Core groups - a lovely design feature for hardware engineers, but a pain for software. Basically you can have multiple sets of cores. The cores in a set are coherent with each other, but they are not coherent between sets, because each core group has its own L2 cache. To complicate things even further the tiler notionally exists within core group 0, so it is only coherent with that core group.

This means that a vertex/tiler job chain has to be run entirely within core group 0 - or you will need to insert appropriate cache flushes. For fragment work you generally don't need coherency between threads, so this isn't a problem and you can run over all the cores in all groups. For compute (i.e. OpenCL) you probably care about coherency within a work group, but you may have several independent jobs that can run in parallel. In that case you can run some (coherent) work on core group 0, and some other (independent but coherent) work on core group 1.

2. Starvation. For compute work it's common to insert barriers requiring all threads to reach the same point in the shader before any thread can progress. If your workgroup size (i.e. the number of threads which synchronise on the barrier) equals the number of threads in the core, then all of the core's threads have to be allocated to compute before the barrier can complete. However, if the compute thread creator is competing with the fragment thread creator, this can leave compute threads idle waiting for fragment threads to complete. So running compute workloads containing barriers at the same time as fragment work on the same cores isn't a great idea.

</end of background>

kbase has several flags:

* BASE_JD_REQ_COHERENT_GROUP - the job chain must be run on a coherent set of cores, i.e. must be restricted to a single core group.

* BASE_JD_REQ_ONLY_COMPUTE - the job chain is compute jobs and may contain barriers.

* BASE_JD_REQ_SPECIFIC_COHERENT_GROUP - we care about being on a particular core group; device_nr is used to select which one (device_nr is otherwise ignored).
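To make that concrete, here is a minimal sketch of how I'd expect those flags to feed into job slot selection. This is plain C, not the actual kbase code: the flag values, the struct and the pick_job_slot() helper are all made up for illustration.

#include <stdbool.h>
#include <stdint.h>

/* Flag values are illustrative only - take the real ones from the kbase UAPI. */
#define BASE_JD_REQ_FS                      (1u << 0) /* fragment job chain */
#define BASE_JD_REQ_ONLY_COMPUTE            (1u << 1)
#define BASE_JD_REQ_SPECIFIC_COHERENT_GROUP (1u << 2)

struct job_chain {
	uint32_t flags;
	uint32_t device_nr;	/* which core group, if SPECIFIC_COHERENT_GROUP */
};

/*
 * Sketch of the slot selection described above:
 *  - fragment work goes to JS0;
 *  - vertex/tiler and "ordinary" compute go to JS1;
 *  - JS2 only comes into play when there are two core groups (T62x) and
 *    the job chain explicitly asks for core group 1, or when the
 *    BASE_HW_ISSUE_8987 workaround (mentioned below) has to keep vertex
 *    and compute apart.
 */
static unsigned int pick_job_slot(const struct job_chain *jc,
				  unsigned int num_core_groups,
				  bool hw_issue_8987)
{
	if (jc->flags & BASE_JD_REQ_FS)
		return 0;

	if (hw_issue_8987 && (jc->flags & BASE_JD_REQ_ONLY_COMPUTE))
		return 2;	/* keep compute off the vertex slot */

	if ((jc->flags & BASE_JD_REQ_ONLY_COMPUTE) &&
	    (jc->flags & BASE_JD_REQ_SPECIFIC_COHERENT_GROUP) &&
	    num_core_groups > 1 && jc->device_nr == 1)
		return 2;	/* core group 1, leaving JS1 free for core group 0 */

	return 1;
}

Note that BASE_JD_REQ_COHERENT_GROUP doesn't appear in this sketch: as described above it restricts which cores the chain may run on (the affinity), not which slot it is submitted to.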
In practice all this only really matters on the T62x GPU. All other GPUs have only one core group[1]. So it only really makes sense to use JS2 on the T62x, where you want to use both JS1 and JS2 to run two independent jobs: one on each core group. Of course kbase makes all this into a maze of twisty little passages, all alike! :)

Oh, and there is one hardware workaround (BASE_HW_ISSUE_8987) that uses JS2. It is there to avoid vertex and compute jobs landing on the same slot, affects T604 "dev15" only, and is needed because some state was not properly cleared between jobs.

Steve

[1] There might be multiple L2 caches in hardware, but they are coherent and are logically a single L2 (only one bit is set in L2_PRESENT).
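For completeness, [1] is also how a driver can tell the two cases apart. A trivial sketch (the helper name is mine; L2_PRESENT is the register value referred to above):

#include <stdint.h>

/*
 * Sketch only: the number of core groups is the number of bits set in
 * L2_PRESENT. On everything except T62x this is 1, even when there are
 * multiple physical L2 caches, because they behave as one logical L2.
 */
static unsigned int num_core_groups(uint64_t l2_present)
{
	unsigned int n = 0;

	for (; l2_present; l2_present &= l2_present - 1)
		n++;

	return n;
}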