On Thu, Dec 06, 2018 at 10:25:08AM -0800, Dave Hansen wrote:
> On 12/5/18 9:53 AM, Jerome Glisse wrote:
> > No so there is 2 kinds of applications:
> >     1) average one: i am using device {1, 3, 9} give me best memory for
> >        those devices
> ...
> >
> > For case 1 you can pre-parse stuff but this can be done by helper library
>
> How would that work?  Would each user/container/whatever do this once?
> Where would they keep the pre-parsed stuff?  How do they manage their
> cache if the topology changes?

Short answer: I don't expect a cache. I expect that each program will have
an init function that queries the topology and updates the application code
accordingly. This is what people do today: query all available devices,
decide which ones to use and how, create a context for each selected one,
and define a memory migration job/memory policy for each part of the
program, so that memory is migrated/has the proper policy in place when the
code that runs on some device is executed.

Long answer:

I cannot dictate how user folks write their programs, sadly :) I expect
that many applications will do it once during start-up. Then you will have
all those container folks or VM folks who will get pressure to react to
hotplug. For instance, if you upgrade your instance with your cloud
provider to have more GPUs or more TPUs ... it is likely to appear as a
hotplug from the VM/container point of view, and thus as a hotplug from the
application point of view. So far the demonstrations I have seen do that by
relaunching the application ... More on that through the live re-patching
issues below.

Oh, and I expect applications will crash if you hot-unplug anything they
are using (I believe this is what happens now in most APIs). Again, I
expect that some pressure from cloud users and providers will force
programmers to be a bit more reactive to this kind of event.

Live re-patching application code can be difficult, I am told.
Let's say you have:

void compute_serious0_stuff(accelerator_t *accelerator,
                            void *inputA, size_t sinputA,
                            void *inputB, size_t sinputB,
                            void *outputA, size_t soutputA)
{
    ...

    // Migrate inputA to the accelerator memory
    api_migrate_memory_to_accelerator(accelerator, inputA, sinputA);

    // The inputB buffer is fine in its default placement

    // The output is assumed to be an empty vma, i.e. no page allocated
    // yet, so set a policy to direct all allocations due to page faults
    // to use the accelerator memory
    api_set_memory_policy_to_accelerator(accelerator, outputA, soutputA);

    ...
    for_parallel<accelerator> (i = 0; i < THEYAREAMILLIONSITEMS; ++i) {
        // Do something serious
    }
    ...
}

void serious0_orchestrator(topology topology, void *inputA,
                           void *inputB, void *outputA)
{
    static accelerator_t **selected = NULL;
    static serious0_job_partition *partition;
    ...
    if (selected == NULL) {
        serious0_select_and_partition(topology, &selected, &partition,
                                      inputA, inputB, outputA);
    }
    ...
    for (i = 0; i < nselected; ++i) {
        ...
        compute_serious0_stuff(selected[i],
                               inputA + partition[i].inputA_offset,
                               partition[i].inputA_size,
                               inputB + partition[i].inputB_offset,
                               partition[i].inputB_size,
                               outputA + partition[i].outputA_offset,
                               partition[i].outputA_size);
        ...
    }
    ...
    for (i = 0; i < nselected; ++i) {
        accelerator_wait_finish(selected[i]);
    }
    ...
    // outputA is ready to be used by the next function in the program
}

If you start without a GPU/TPU, your for_parallel will use the CPU, with
the code the compiler emitted at build time. For GPU/TPU, at build time you
compile your for_parallel loop to some intermediate representation (a
virtual ISA); then at runtime, during application initialization, that
intermediate representation gets lowered to all the available GPUs/TPUs on
your system, and each for_parallel loop is patched to be turned into a call
to:

void dispatch_accelerator_function(accelerator_t *accelerator,
                                   void *function, ...)
{
}

So in the above example the for_parallel loop becomes:

dispatch_accelerator_function(accelerator, i_compute_serious_stuff,
                              inputA, inputB, outputA);

This hot patching of code is easy to do when no CPU thread is running the
code. However, when CPU threads are running it can be problematic. I am
sure you can do trickery, like delaying the patching until the next time
the function gets called, by doing clever things at build time, like
prepending each for_parallel section with enough nops to let you replace
them with a call to the dispatch function and a jump over the normal CPU
code.

I think compiler people want to solve the static case first, i.e. during
application initialization decide what devices are going to be used and
then update the application accordingly. But I expect it will grow to
support hotplug, as relaunching the application is not that user friendly,
even in this day and age where people start millions of containers with one
mouse click.

Anyway, the above example is how it looks today, and the accelerator can
turn out to be just a regular CPU core if you do not have any devices. The
idea is that we would like a common API that covers both CPU threads and
device threads. Same for the migration/policy functions: if it happens that
the accelerator is just a plain old CPU, then you want to migrate memory to
the CPU node and set the memory policy to that node too.

Cheers,
Jérôme
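
A minimal sketch of the "switch over on the next call" idea from the hot
patching discussion above. It swaps a function pointer atomically instead
of patching nops in the instruction stream, so it is only an approximation
of what a compiler runtime would do. Every name in it (kernel_slot,
run_kernel, on_accelerator_hotplug, ...) is hypothetical, and the
"accelerator" kernel is a CPU stand-in that only counts dispatches.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical kernel signature: writes 2 * in[i] to out[i]. */
typedef void (*kernel_fn)(const float *in, float *out, size_t n);

/* CPU version, i.e. the code the compiler emitted at build time. */
static void kernel_cpu(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = 2.0f * in[i];
}

/* Stand-in for the lowered device version (assumption: a real one would
 * hand the work to a device). It counts dispatches, then reuses the CPU
 * loop so the result stays the same. */
static atomic_int accel_dispatch_count;

static void kernel_accel(const float *in, float *out, size_t n)
{
    atomic_fetch_add(&accel_dispatch_count, 1);
    kernel_cpu(in, out, n);
}

/* One patchable slot per for_parallel section; starts on the CPU path. */
static _Atomic(kernel_fn) kernel_slot = kernel_cpu;

/* Hotplug handler: calls already in flight finish on the old target,
 * the next call picks up the new one. */
void on_accelerator_hotplug(void)
{
    atomic_store_explicit(&kernel_slot, kernel_accel, memory_order_release);
}

/* Unplug handler: fall back to the plain CPU code. */
void on_accelerator_unplug(void)
{
    atomic_store_explicit(&kernel_slot, kernel_cpu, memory_order_release);
}

/* Call site that replaces the for_parallel loop. */
void run_kernel(const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    kernel_fn fn = atomic_load_explicit(&kernel_slot, memory_order_acquire);
    fn(in, out, n);
}

/* Number of calls that went through the accelerator path so far. */
int accel_dispatches(void)
{
    return atomic_load(&accel_dispatch_count);
}
```

A caller keeps using run_kernel() throughout. Only the hotplug/unplug
handlers touch the slot, so no thread ever sees a half-updated call site.
The cost is one indirect call per dispatch, which the nop-padding approach
would avoid.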