Finally returning to this, thanks for the replies.
On 7/19/2023 2:02 AM, Christian König wrote:
Hi guys,
very sorry for the delayed response, this mail fell totally under
my radar without being noticed.
Am 17.07.23 um 19:24 schrieb Rodrigo Vivi:
On Thu, Jun 29, 2023 at 02:10:58PM -0700, Welty, Brian wrote:
Hi Christian / Thomas,
Wanted to ask if you have explored or thought about adding support in
TTM such that a ttm_bo could have more than one underlying backing
store segment (that is, to have a tree of ttm_resources)?
We already use something similar on amdgpu where basically the VRAM
resources are stitched together from multiple backing pages.
That is not exactly the same, but it comes close.
I tried searching for a while for this in amdgpu but wasn't able to find
it. Didn't see any signs in amdgpu_vram_mgr.c.
Can you point me to where this code lives? I wanted to review and
compare the approach...
We are considering supporting such BOs in the Intel Xe driver.
They are indeed the best one to give an opinion here.
I just have some dummy questions and comments below.
Some of the benefits:
* devices with page fault support can fault (and migrate) backing store
at finer granularity than the entire BO
We've considered that once as well and I even started hacking something
together, but the problem was that at least at that point it wasn't
doable because of limitations in the Linux memory management.
Basically the extended attributes used to control caching of pages were
only definable per VMA! So when one piece of the BO would have been in
uncached VRAM while another piece would be in cached system memory you
immediately ran into problems.
I think that issue is fixed by now, but I'm not 100% sure.
Okay, thanks for mentioning. I haven't come across such an issue so far...
In general I think it might be beneficial, but I'm not 100% sure if it's
worth the additional complexity.
Agreed. Well, up next is to put a small RFC together then...
Regards,
Christian.
What advantage does this bring, and to which workloads?
Is it a performance gain on huge BOs?
Replying to Rodrigo's comments for the rest here...
Yes, providing more rationale is needed. I'll see about beefing up
the description with the RFC patches...
Basically, all aspects of working with BO backing store can operate
on smaller granularity.
Including being able to support a BO which is larger than total VRAM.
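To make the idea concrete, here is a minimal user-space sketch, not TTM code. The names (struct seg_bo, struct bo_segment, seg_bo_init(), seg_bo_place()) are hypothetical and only model the concept: a BO is split into fixed-size segments, and each segment records its own memory region, so a BO larger than the free VRAM can still be partially resident.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names exist in TTM. */
enum mem_region { REGION_SYSTEM, REGION_VRAM };

struct bo_segment {
	uint64_t offset;        /* offset of the segment within the BO */
	uint64_t size;          /* segment size */
	enum mem_region region; /* where this segment currently lives */
};

#define MAX_SEGMENTS 16

struct seg_bo {
	uint64_t size;
	size_t num_segments;
	struct bo_segment segs[MAX_SEGMENTS];
};

/* Split the BO into fixed-size segments, all initially in system memory. */
static int seg_bo_init(struct seg_bo *bo, uint64_t size, uint64_t seg_size)
{
	uint64_t off = 0;

	assert(seg_size > 0);
	bo->size = size;
	bo->num_segments = 0;
	while (off < size) {
		struct bo_segment *seg;

		if (bo->num_segments == MAX_SEGMENTS)
			return -1; /* too many segments for this toy model */
		seg = &bo->segs[bo->num_segments++];
		seg->offset = off;
		seg->size = (size - off < seg_size) ? size - off : seg_size;
		seg->region = REGION_SYSTEM;
		off += seg->size;
	}
	return 0;
}

/* Move every segment that still fits into VRAM; return the amount placed. */
static uint64_t seg_bo_place(struct seg_bo *bo, uint64_t vram_free)
{
	uint64_t placed = 0;
	size_t i;

	for (i = 0; i < bo->num_segments; i++) {
		struct bo_segment *seg = &bo->segs[i];

		if (seg->size <= vram_free - placed) {
			seg->region = REGION_VRAM;
			placed += seg->size;
		}
	}
	return placed;
}
```

With a BO of size 10, a segment size of 4 and only 6 units of VRAM free, the first and last segments land in VRAM while the middle one stays in system memory. A real implementation would of course have to deal with migration, mappings and fences, which this sketch ignores.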
* BOs can support having multiple backing store segments, which can be
in different memory domains/regions
what locking challenges would this bring?
Intent would be to still have locking done at the BO level, and not
attempt to introduce finer grained locking.
is this more targeting gpu + cpu? or only for our multi-tile platforms?
and what's the advantage this is bringing to real use cases?
Right, it's able to be leveraged for both types of usage you mentioned.
So with both gpu + cpu accessing a BO, the portion of the BO they are
accessing can be placed locally.
And with an Xe gt0 + gt1 accessing a BO, we can place segments of it in
the tile local to the gt.
(probably the svm/hmm question below answers my questions, but...)
* BO eviction could operate on smaller granularity than entire BO
I believe all the previous doubts apply to this item as well...
Not sure what 'all the previous doubts' refers to...
Agree most of the value is lost if eviction is not updated to operate at
finer granularity. Will make sure to explore this.
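As a rough illustration of what finer-grained eviction could mean, here is a small user-space sketch. It is not TTM code: struct ev_segment and evict_segments() are hypothetical names, and the policy (evict in index order until enough space is freed) is just an assumption for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model; these names are illustrative, not TTM API. */
struct ev_segment {
	uint64_t size; /* segment size */
	int in_vram;   /* nonzero while the segment is resident in VRAM */
};

/*
 * Evict VRAM-resident segments one at a time until at least `needed`
 * has been freed or no resident segment remains.
 * Returns the amount actually freed.
 */
static uint64_t evict_segments(struct ev_segment *segs, size_t n,
			       uint64_t needed)
{
	uint64_t freed = 0;
	size_t i;

	assert(segs != NULL || n == 0);
	for (i = 0; i < n && freed < needed; i++) {
		if (!segs[i].in_vram)
			continue; /* already in system memory */
		segs[i].in_vram = 0;
		freed += segs[i].size;
	}
	return freed;
}
```

The point of the sketch is that the loop stops as soon as enough space is freed, so the remaining segments of the BO stay resident instead of the whole BO being moved out.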
Or is the thinking that workloads should use SVM/HMM instead of
GEM_CREATE if they want above benefits?
Are you open to seeing an RFC series that starts perhaps with just
extending ttm_bo_validate() to see how this might shape up?
IMHO an RFC always helps... a piece of code showing the idea usually
draws more attention from devs than asking in text. But more text
explaining the reasons behind it is also helpful, even with the RFC.
Will work up a small RFC and see where we go with this...
Thanks,
-Brian
Thanks,
Rodrigo.
-Brian