On 12/5/18 12:19 AM, Jerome Glisse wrote:
The above example is for migration. Here is an example of how the
topology is used today:
The application knows that the platform it is running on has 16
GPUs split into 2 groups of 8 GPUs each. The GPUs in each group can
access each other's memory over dedicated mesh links between
them: full speed, no traffic bottleneck.
The application splits its GPU computation in 2 so that each
partition runs on a group of interconnected GPUs, allowing
them to share the dataset.
With HMS:
The application can query the kernel to discover the topology of
the system it is running on and use it to partition and balance
its workload accordingly. The same application should now be able
to run on a new platform without having to be adapted to it.
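
To make the partitioning step above concrete, here is a minimal sketch of
what an application could do once it has the topology in hand. It assumes
the topology has already been obtained as a plain list of GPU-to-GPU links;
how HMS would actually expose that is not specified here, so
example_links(), gpu_groups() and partition_work() are hypothetical names
used only for illustration.

```python
from collections import defaultdict


def example_links():
    """Hypothetical topology: 16 GPUs, two full meshes of 8 GPUs each."""
    links = []
    for base in (0, 8):
        for a in range(base, base + 8):
            for b in range(a + 1, base + 8):
                links.append((a, b))
    return links


def gpu_groups(num_gpus, links):
    """Return groups of GPUs that are reachable from each other over links."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    groups = []
    for start in range(num_gpus):
        if start in seen:
            continue
        # Depth-first walk over the link graph starting from this GPU.
        seen.add(start)
        stack = [start]
        group = []
        while stack:
            gpu = stack.pop()
            group.append(gpu)
            for peer in neighbours[gpu]:
                if peer not in seen:
                    seen.add(peer)
                    stack.append(peer)
        groups.append(sorted(group))
    return groups


def partition_work(items, groups):
    """Split items into one contiguous share per group, sized by GPU count."""
    total = sum(len(group) for group in groups)
    shares = []
    start = 0
    for index, group in enumerate(groups):
        if index == len(groups) - 1:
            end = len(items)  # last group takes whatever remains
        else:
            end = start + len(items) * len(group) // total
        shares.append(items[start:end])
        start = end
    return shares
```

Because the groups are derived from the links rather than hard-coded, the
same code would produce a different split on a platform with a different
interconnect, which is the portability argument made above.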
Will the kernel ever be involved in decision making here? As with the
scheduler, will we ever want to control how these computation units get
scheduled onto GPU groups or GPUs?
This is kind of naive; I expect topology to be hard to use, but maybe
it is just me being pessimistic. In any case, today we have a
chicken-and-egg problem. We do not have a standard way to expose
topology, so programs that can leverage topology are only written for
HPC, where the platform stays the same for a few years. If we had a
standard way to expose the topology, then maybe we would see more
programs using it. At the very least we could convert existing users.
I am wondering whether we should consider HMAT as a subset of the ideas
mentioned in this thread and see whether we can first achieve HMAT
representation with your patch series?
-aneesh