Re: [RFC PATCH] drm/panfrost: Add support for mapping BOs on GPU page faults

Steven Price <steven.price@xxxxxxx> · Fri, 28 Jun 2019 11:34:30 +0100

On 27/06/2019 17:38, Rob Herring wrote:
> On Thu, Jun 27, 2019 at 4:57 AM Steven Price <steven.price@xxxxxxx> wrote:
>>
>> Sorry for the slow response, I've been on holiday for a few weeks.
> 
> Welcome back.

Thanks!

>>
>> On 20/06/2019 06:50, Tomeu Vizoso wrote:
>>> On Mon, 17 Jun 2019 at 16:56, Rob Herring <robh@xxxxxxxxxx> wrote:
>>>>
>>>> On Sun, Jun 16, 2019 at 11:15 PM Tomeu Vizoso
>>>> <tomeu.vizoso@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>> On Fri, 14 Jun 2019 at 23:22, Rob Herring <robh@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> On Wed, Jun 12, 2019 at 6:55 AM Tomeu Vizoso <tomeu@xxxxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> On Mon, 10 Jun 2019 at 19:06, Rob Herring <robh@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> The midgard/bifrost GPUs need to allocate GPU memory which is allocated
>>>>>>>> on GPU page faults and not pinned in memory. The vendor driver calls
>>>>>>>> this functionality GROW_ON_GPF.
>>>>>>>>
>>>>>>>> This implementation assumes that BOs allocated with the
>>>>>>>> PANFROST_BO_NOMAP flag are never mmapped or exported. Both of those may
>>>>>>>> actually work, but I'm unsure if there's some interaction there. It
>>>>>>>> would cause the whole object to be pinned in memory which would defeat
>>>>>>>> the point of this.
>>
>> Although in normal usage user space will never care about the contents
>> of growable memory it can be useful to be able to access it for
>> debugging (although not critical to have it working immediately). In
>> particular it allows submitting the jobs in a job chain separately.
>> Exporting I can't see a use-case for.
>>
>> So personally I'd prefer not using a "NOMAP" flag to mean "grow on fault".
> 
> NOMAP means 'no gpu map on alloc'. The CPU mapping part is just a
> limitation in the implementation which could be handled if needed.

Ah, well my confusion might be another indication it's not a great name ;)

> NOPIN? It's not really 'growing' either as the total/max size is
> fixed. Not sure if that's the same for kbase. Maybe faults happen to be
> sequential in addresses and it grows in that sense.

It depends what you understand by pinning. To me pinning means that the
memory cannot be swapped out - which isn't the API level feature (e.g.
we might introduce support for swapping when the GPU isn't using the
memory). In kbase we call it "growing" because the amount of memory
allocated can increase - and indeed it grows in a similar way to a stack
on a CPU.

> Maybe just saying what the buffer is used for (HEAP) would be best?

That seems like a good name to me. User space doesn't really care how
the kernel manages the memory - it just wants to communicate that this
is temporary heap memory for the GPU to use.

> Speaking of alloc flags, Alyssa also mentioned we need a way to align
> shader buffers. My suggestion there is an executable flag. That way we
> can also set pages to XN. Though maybe alignment requirements need to
> be explicit?

kbase mostly handles this with an executable flag, so yes that seems a
reasonable way of handling it. Note, however, that there are a bunch of
wacky optimisation ideas that have been considered that require
particular alignment constraints. In particular kbase ended up with
BASE_MEM_TILER_ALIGN_TOP[1] which is somewhat of a hack to specify the
odd alignment requirement without adding extra fields to the ioctl.

[1]
https://gitlab.freedesktop.org/panfrost/mali_kbase/blob/master/driver/product/kernel/drivers/gpu/arm/midgard/mali_base_kernel.h#L197

One other thing that I don't think is well supported in panfrost at the
moment is that some units don't actually store the full VA address. The
most notable one is the PC - this is either 32 bit or 24 bit depending
on the GPU (although kbase always assumes 24 bit). This means that the
shader code must be aligned to not cross a 24 bit boundary. kbase also
has BASE_MEM_GPU_VA_SAME_4GB_PAGE for the same idea but restricted to a
32 bit size.
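
A minimal sketch of that constraint, assuming the PC width is passed in
as a parameter. The helper name is hypothetical and this is not code
from kbase or panfrost; it only checks that the first and last byte of
a range share the same upper address bits:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from kbase or panfrost): returns true if the
 * range [va, va + size) stays inside one naturally aligned window of
 * 2^pc_bits bytes, so a PC that only stores pc_bits bits never wraps.
 */
static bool range_fits_pc_window(uint64_t va, uint64_t size,
				 unsigned int pc_bits)
{
	uint64_t last;

	/* An empty range, or a PC as wide as the VA, cannot cross. */
	if (size == 0 || pc_bits >= 64)
		return true;

	last = va + size - 1;
	if (last < va)	/* the range wraps around the address space */
		return false;

	/* Same upper bits at both ends means no boundary in between. */
	return (va >> pc_bits) == (last >> pc_bits);
}
```

With pc_bits = 24 this models the 24 bit case above, and pc_bits = 32
models the BASE_MEM_GPU_VA_SAME_4GB_PAGE style restriction.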

There's also a nasty limitation for executable memory - it can't start
(or end) on a 4GB boundary, see the code here which avoids picking those
addresses:

https://gitlab.freedesktop.org/panfrost/mali_kbase/blob/master/driver/product/kernel/drivers/gpu/arm/midgard/mali_kbase_mem.c#L279
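
As a rough illustration of the rule as described here (this is not the
kbase code linked above, and the function and macro names are made up),
a candidate executable range could be rejected like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Low 32 bits of an address; zero means it sits on a 4GB boundary. */
#define LOW_4G_MASK 0xffffffffULL

/*
 * Hypothetical check (not from kbase): an executable range is only
 * acceptable if neither its start nor its end address is a multiple
 * of 4GB. 'end' is the first byte past the range.
 */
static bool exec_range_avoids_4g(uint64_t start, uint64_t size)
{
	uint64_t end = start + size;

	/* Reject empty ranges and ranges that wrap the address space. */
	if (size == 0 || end < start)
		return false;

	return (start & LOW_4G_MASK) != 0 && (end & LOW_4G_MASK) != 0;
}
```

An allocator would presumably skip forward to the next candidate
address when this returns false rather than fail the allocation.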

Finally kbase has kbase_ioctl_mem_alias which allows creating aliases of
existing mappings with appropriate strides between them. This is an
optimisation for rendering to multiple render targets efficiently and is
only needed for some GPUs. But I think we can leave that one for now.

[...]
>> It would certainly seem reasonable that the contents of NOMAP memory can
>> be thrown away when the job chain has been completed. But, there is a
>> potential performance improvement by not immediately unmapping/freeing
>> the memory but leaving it in place on the assumption that a similar job
>> will be submitted later requiring roughly the same amount of memory.
>>
>> Arm's blob/kernel have various mechanisms for freeing memory either
>> after a period of being idle (in the blob) or when a shrinker is called
>> (in kbase). The idea is that the heap memory is grown once to whatever
>> the content needs and then the same buffer (or small set of buffers) is
>> reused repeatedly. kbase has a mechanism called "ephemeral memory" (or
>> evictable) which is memory which normally remains mapped on the GPU, but
>> under memory pressure it can be freed (and later faulted in with empty
>> pages if accessed again). A pinning mechanism is used to ensure that
>> this doesn't happen in the middle of a job chain which uses the buffer.
>> This mechanism is referred to as "JIT" (Just In Time allocation) in places.
> 
> That's a bit simpler than what I assumed JIT was.

Ah, well I have simplified it a bit in that description :)

There are effectively two features. Ephemeral memory is the DONT_NEED
flag which enables the memory to be freed under memory pressure when
it's not in use. JIT is then built on top of that and provides a
mechanism for the kernel to allocate ephemeral memory regions "just in
time" immediately before the jobs are sent to the GPU. This offloads the
decision about how many memory regions are needed to the kernel in the
hope that the kernel can dynamically choose the trade-off between
allocating lots of buffers (gives maximum flexibility in terms of job
scheduling) or saving memory by immediately running the fragment job so
the heap buffers can be recycled.

All I can say is that it's a locking nightmare (shrinkers can be called
in some very annoying contexts). It's also not clear that the kernel is in
a better position to make the memory/performance trade-off decision than
user space.

> So there's 2 different cases of memory not pinned on alloc. The first
> is the heap memory which is just faulted on demand (i.e. during jobs)
> and the 2nd is the JIT which is pinned some time between alloc and a
> job submit. Is that correct? Is that 2 different allocation flags or 1
> flag with 2 different ways to get pages pinned?

Yes that's correct. "Heap memory"[2] is just GROW_ON_GPF memory
allocated by user space. JIT memory is allocated by a 'soft-job'
(BASE_JD_REQ_SOFT_JIT_ALLOC) that user space inserts before the real GPU
jobs. This soft-job is responsible for allocating (or reusing) a buffer
(which is internally marked as GROW_ON_GPF) and ensuring it's pinned
(removing any DONT_NEED flag). After the GPU jobs have run there's
another soft-job (BASE_JD_REQ_SOFT_JIT_FREE) which will return the
buffer to a pool and set the DONT_NEED flag on it.

[2] We don't really have a term for this internally, it's just "growable
memory".

So both types are "grow on fault", the difference is that the
user-allocated "heap memory" will not be discarded or automatically
reused by the kernel, whereas JIT memory will be under the control of
the kernel after the soft-job frees it and so can be recycled/freed at
any time.
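
To make the difference concrete, here is a toy model of the two
lifecycles. All names are invented for illustration and none of this is
kbase or panfrost code; growing one page per fault and the single
dont_need flag are simplifying assumptions:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a grow-on-fault buffer; not a real driver structure. */
struct grow_buf {
	bool dont_need;		/* kernel may discard the backing pages */
	size_t backed_pages;	/* pages faulted in so far */
	size_t max_pages;	/* fixed upper bound on the buffer size */
};

/* GPU page fault: back one more page, up to the fixed maximum. */
static bool grow_on_fault(struct grow_buf *b)
{
	if (b->backed_pages >= b->max_pages)
		return false;
	b->backed_pages++;
	return true;
}

/* Models the JIT alloc soft-job: pin the buffer before the GPU jobs. */
static void jit_alloc(struct grow_buf *b)
{
	b->dont_need = false;
}

/* Models the JIT free soft-job: hand the buffer back to the kernel. */
static void jit_free(struct grow_buf *b)
{
	b->dont_need = true;
}

/* Models a shrinker pass: returns the number of pages reclaimed. */
static size_t shrinker_scan(struct grow_buf *b)
{
	size_t freed;

	if (!b->dont_need)
		return 0;
	freed = b->backed_pages;
	b->backed_pages = 0;
	return freed;
}
```

In this model a user-allocated heap buffer never has dont_need set, so
shrinker_scan() always leaves it alone, while a JIT buffer becomes
reclaimable as soon as jit_free() has run.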

>>> I could very well be missing something that is needed by Arm's blob
>>> and not by Panfrost atm, but I don't see in kbase any mechanism for
>>> the kernel to know when the GPU is done with a page, other than the
>>> job that mapped it having finished.
>>
>> Much of the memory management is done by the user space blob. The kernel
>> driver usually doesn't actually know what memory a job will access.
>> There are exceptions though, in particular: ephemeral memory (through
>> JIT) and imported memory.
> 
> Presumably that's a difference. We have a complete list of BOs for each job.

Yes - that's something that I've repeatedly wished the blob driver had.
However it was an early design decision that the driver wouldn't need to
track what memory regions would be used. This meant for the exceptions
there has to be explicit tracking of the regions, which unfortunately
means imported memory ends up being quite 'special'.

Steve

> Rob
> _______________________________________________
> dri-devel mailing list
> dri-devel@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 
