* Danilo Krummrich <dakr@xxxxxxxxxx> [230627 10:58]:
> Hi Liam,
>
> On 6/27/23 03:58, Liam R. Howlett wrote:
> > * Danilo Krummrich <dakr@xxxxxxxxxx> [230626 14:37]:
> > > On 6/26/23 16:52, Matthew Wilcox wrote:
> > > > On Mon, Jun 26, 2023 at 04:27:54PM +0200, Danilo Krummrich wrote:
> > > > > On 6/26/23 15:19, Matthew Wilcox wrote:
> > > > > > On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote:
> > > > > > > On the other hand, unless I miss something (and if so, please let me know), something is bogus with the API then.
> > > > > > >
> > > > > > > While the documentation of the Advanced API of the maple tree explicitly claims that the user of the API is responsible for locking, this should be limited to the bounds set by the maple tree implementation. That means the user must choose either the internal (spin-)lock or an external lock (which possibly goes away in the future) and acquire and release it according to the rules the maple tree enforces through lockdep checks.
> > > > > > >
> > > > > > > Let's say one picks the internal lock. How is one supposed to ensure the tree isn't modified using the internal lock with mas_preallocate()?
> > > > > > >
> > > > > > > Besides that, I think the documentation should definitely mention this limitation and give some guidance for the locking.
> > > > > > >
> > > > > > > Currently, from an API perspective, I can't see how anyone not familiar with the implementation details would be able to recognize this limitation.
> > > > > > >
> > > > > > > In terms of the GPUVA manager, unfortunately, it seems like I need to drop the maple tree and go back to using an rb-tree, since it seems there is no sane way of doing a worst-case pre-allocation that does not suffer from this limitation.
> > > > > >
> > > > > > I haven't been paying much attention here (too many other things going on), but something's wrong.
> > > > > >
> > > > > > First, you shouldn't need to preallocate. Preallocation is only there
> > > > >
> > > > > Unfortunately, I think we really have a case where we have to. Typically GPU mappings are created in a dma-fence signalling critical path and that is where such mappings need to be added to the maple tree. Hence, we can't do any sleeping allocations there.
> > > >
> > > > OK, so there are various ways to handle this, depending on what's appropriate for your case.
> > > >
> > > > The simplest is to use GFP_ATOMIC. Essentially, you're saying to the MM layer "This is too hard, let me tap into the emergency reserves". It's mildly frowned upon, so let's see if we can do better.
> > > >
> > > > If you know where the allocation needs to be stored, but want it to act as NULL until the time is right, you can store a ZERO entry. That will read as NULL until you store to it. A pure overwriting store will not cause any memory allocation since all the implementation has to do is change a pointer. The XArray wraps this up nicely behind an xa_reserve() API. As you're discovering, the Maple Tree API isn't fully baked yet.
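(For reference, that reserve-then-overwrite pattern looks roughly like the below with the XArray. Untested sketch; "va_xa", "index" and "mapping" are made-up names for illustration.)

	/* ioctl path, sleeping allowed: reserve the slot up front.
	 * A reserved entry reads back as NULL until it is stored to.
	 */
	if (xa_reserve(&va_xa, index, GFP_KERNEL))
		return -ENOMEM;

	/* dma-fence signalling path: the slot already exists, so this
	 * is a pure overwrite and should not need to allocate.
	 */
	xa_store(&va_xa, index, mapping, GFP_ATOMIC);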
> > >
> > > Unfortunately, GFP_ATOMIC seems to be the only option. I think storing entries in advance would not work. Typically userspace submits a job to the kernel issuing one or multiple requests to map and unmap memory in an ioctl. Such a job is then put into a queue and processed asynchronously in a dma-fence signalling critical section. Hence, at the time we'd store entries in advance we could have an arbitrary number of pending jobs potentially still messing with the same address space region.
> >
> > What I think you are saying is that you have a number of requests flooding in, which may overwrite the same areas, but are queued to be written later. These operations look to be kept in order according to the code in nouveau_job_submit[1]. Is this correct?
>
> That's all correct.
>
> (Although Nouveau isn't a good example in this case. Some aspects of it do and some aspects of it do not apply to the problem we're discussing here.)
>
> > So then, your issue isn't that you don't know where they will land, but that you don't know if the area you reserved is already split into other areas? For instance, before the range 5-10 is backed by whatever happens in the fence, it may have already become 5-6 & 8-10 by something that came after (from userspace) but hasn't been processed by the kernel yet, with something else living at 7? So you can't write 5-10 right away because you can't be sure 5-10 is going to exist once you reach the kernel fence code that stores the entry?
> >
> > Is my understanding of your issue correct?
>
> Yes, it is.
>
> However, the problem already starts while trying to reserve an area. In order to satisfy a user request, such a request is broken down into operations such as unmapping mappings which are in the way entirely, remapping mappings which intersect with the requested mapping, and finally mapping the requested mapping. The execution of such a sequence must appear atomic and hence be locked accordingly. When trying to reserve an area we'd need to take that lock. But since this lock would be used in the dma-fence signalling critical path as well, we'd not be allowed to do sleeping allocations while holding it.
>
> Potentially, this could be solved with a retry loop though. Drop the lock while allocating, take it again and check whether we still got enough nodes allocated. Analogous to what the maple tree does in mas_store_gfp(), I guess.
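That retry loop would look something like the below, I think. Very rough and untested; "mgr->lock" and enough_nodes() are made-up driver-side names, and the point is that the preallocation count may be computed against a stale tree, hence the re-check under the lock.

	/* assume "mas" is an MA_STATE set up over the target range */
	spin_lock(&mgr->lock);
	while (!enough_nodes(&mas)) {			/* hypothetical check */
		spin_unlock(&mgr->lock);
		if (mas_preallocate(&mas, GFP_KERNEL))	/* may sleep */
			return -ENOMEM;
		spin_lock(&mgr->lock);
		/* tree may have changed while unlocked, so loop */
	}
	/* ... queue the job; the fence path later consumes the nodes ... */
	spin_unlock(&mgr->lock);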
>
> > Oh, and I guess the queued requests would have to remain ordered between threads or whatever is on the other side? I mean, you can't have two threads firing different things into the kernel at the same region because I would think the results would be unpredictable?
>
> Once jobs are queued up in the kernel they remain ordered.
>
> However, user threads could concurrently push jobs to the kernel altering the same region of the address space - it just would not make any sense for userspace to do that.
>
> In general userspace is responsible for the semantics of the address space. The kernel basically just takes any (valid) request and makes it happen. It also ensures the waiting and signalling of fences which might be bound to certain jobs and obviously keeps track of the VA space to be able to clean things up once a client disappears.
>
> > Can these overlapping entries partially overlap one region and another? That is, can you have three in-flight writes that do something like: store 1-10, store 10-20, store 5-15?
>
> Absolutely, yes.
>
> > How stable of an output is needed? Does each kernel write need to be 100% correct, or is there a point where the userspace updates stop and only then does it need to be stable?
>
> It needs to be 100% correct all the time. The reason is that, as mentioned above, every job can carry in- and out-fences, such that userspace can order these jobs against the execution of shaders.

But each job is split into parts, so the fences surround these groups of operations? Since ordering is kept, you must reach a point before entering the fences at which you could call mas_preallocate() to ensure enough nodes exist to install the new mapping, and at which no other operations will be happening. I guess what you are saying is that each fence has more than one tree operation?

As long as you are not mapping more than a single range, this should be possible in a single write and thus a single preallocation. You can do this by not actually writing unmaps/remaps to the tree within the fence. Once the out-fence is reached, the operation looks atomic.

Reading your patch, it is not clear this is accurate for VM_BIND of asynchronous syncobjs. Is the fence spanning multiple syncobjs with various ranges to map? Or are these things the split-up tasks of unmap/remap, etc. that will eventually boil down to what appears to be a single write?

> This is also why there could be jobs queued up where all of them apply changes to the same region within the VA space, since there might be shader executions (or just memory copies) ordered right between them.
>
> - Danilo
>
> > > So, the only way to go seems to be to use mas_store_gfp() with GFP_ATOMIC directly in the fence signalling critical path. I guess mas_store_gfp() does not BUG_ON() if it can't get atomic pages?
> > >
> > > Also, I just saw that the tree is limited in its height (MAPLE_HEIGHT_MAX). Do you think it could be a sane alternative to pre-allocate with MAPLE_HEIGHT_MAX rather than to rely on atomic pages? Or maybe a compromise of pre-allocating just a couple of nodes and then relying on atomic pages for the rest?
> > >
> > > FYI, we're talking about on the order of hundreds of thousands of entries to be stored in the tree.
> >
> > Since you are not tracking gaps, you will get 16 entries per node. The maximum height is 31, so that would be 16^31; assuming a gap between each entry (the worst case), you can cut that in half. To ensure you can successfully allocate storage for a new entry, you'd need to allocate 30 * 3 + 1, or 91 nodes, which is 6 pages. That'll be highly wasteful, as almost all of these would be freed, and sometimes all of them.
> >
> > You estimate fewer than 1M entries; that would never go over 6 levels (8.3M entries in the worst case). 5 levels would get you 500K in the worst case, but realistically you'll be at 5 levels almost always. So, 5 * 3 + 2 = 17 nodes, or 2 pages (1 node over 1 page), assuming 4k pages.
> >
> > [1] https://lore.kernel.org/linux-mm/20230620004217.4700-8-dakr@xxxxxxxxxx/T/#Z2e.:..:20230620004217.4700-4-dakr::40redhat.com:1drivers:gpu:drm:drm_gem.c
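Spelling out the page math above (this assumes 256-byte maple nodes, i.e. 16 nodes per 4k page):

	worst case:  30 * 3 + 1 = 91 nodes; 91 * 256B = 23,296B -> 6 pages
	realistic:    5 * 3 + 2 = 17 nodes; 17 * 256B =  4,352B -> 2 pages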