Hi,

On 11/26/2012 10:21 PM, Andi Kleen wrote:
>> +{
>> +	int i;
>> +	int nr_nodes = numa_num_configured_nodes();
>> +	int pagesize = numa_pagesize();
>> +	size_t *node_scores = calloc(nr_nodes, sizeof(*node_scores));
>> +	if (!node_scores) {
>> +		perror("malloc");
>> +		exit(1);
> You cannot handle errors like this in the library.

Definitely; this was just a proof of concept of my patch.

> It's not fully clear to me what the interface of your proposed new function
> is. Can you send a spec of the new user interface in a manpage like format?

void *numa_alloc_interleaved_custom(size_t size, size_t *parts,
                                    size_t nr_parts, int *nodes);

User-specified custom interleaved allocation. This function allocates a
contiguous region of `size' bytes (rounded up to the system's page size),
interleaved in `nr_parts' partitions across the system's memory nodes
according to the interleaving specification passed in the `parts' and
`nodes' arrays. The size of each partition in bytes and the node it must
lie on are passed in the `parts' and `nodes' arrays, respectively, both of
size `nr_parts'. For example, the i-th partition, of size `parts[i]', will
be placed on node `nodes[i]'.

This function takes care of correctly aligning each partition to the
system's page size, and the `parts' array is updated accordingly. A page at
the boundary of two partitions is assigned to the partition owning the
larger part of it, provided that each partition is left with at least one
page. Very small partitions, not exceeding the system's page size, are
merged into larger ones and placed on the node with the largest allocation
requirements among the nodes of the merged partitions. The `parts' and
`nodes' arrays are updated accordingly. For example, given the following
interleaved allocation scheme (in bytes):

	parts = {100, 200, 300, 400}
	nodes = {0, 0, 1, 1}

the resulting allocation will be

	parts = {4096, 0, 0, 0}
	nodes = {1, 0, 1, 1}

Note that the sizes of the merged partitions are set to zero, while their
original node specification is left untouched.

On success, a pointer to the newly allocated area is returned; otherwise
NULL is returned. The `parts' and `nodes' arguments are set to the updated
partition sizes and nodes.

Similarly to numa_alloc_onnode(), this function just reserves the requested
allocation space and binds the user-specified partitions to the desired
memory nodes. The actual (physical page) allocation takes place on the
first write. In fact, `numa_alloc_interleaved_custom(mysize, &mysize, 1,
&mynode)' is equivalent to `numa_alloc_onnode(mysize, mynode)'.
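For illustration, a minimal sketch of how a caller would use it, matching
the merging example above (assuming a 4 KiB page size and a two-node
system; error handling kept trivial):

	#include <stdio.h>

	/* Proposed interface; not part of libnuma yet. */
	void *numa_alloc_interleaved_custom(size_t size, size_t *parts,
	                                    size_t nr_parts, int *nodes);

	int main(void)
	{
		size_t parts[] = {100, 200, 300, 400};
		int nodes[] = {0, 0, 1, 1};
		size_t size = 100 + 200 + 300 + 400;
		size_t i;

		void *buf = numa_alloc_interleaved_custom(size, parts, 4, nodes);
		if (!buf)
			return 1;

		/* On return, `parts' and `nodes' describe the page-aligned
		 * (here: merged) partitions actually bound, i.e.
		 * parts = {4096, 0, 0, 0} and nodes = {1, 0, 1, 1}. */
		for (i = 0; i < 4; i++)
			printf("part %zu: %zu bytes on node %d\n",
			       i, parts[i], nodes[i]);
		return 0;
	}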
> Also it would be good to have some more information on the use case.

Suppose you have a large data structure (e.g., a sparse matrix) and a set
of operations that work on it, assuming a contiguous allocation. Some of
these operations are quite time-consuming and memory-intensive, so you have
decided to use multiple threads for them. Your load balancing is based on a
static partitioning of the data structure, with the partition boundaries
determined by `marks' (e.g., starting/ending rows) in the original data
structure. The following is a typical (somewhat abstract) scenario:

	void do_sth_memory_intensive(matrix, matrix_parts)
	{
		for (j = matrix_parts.start to matrix_parts.end)
			do_sth(matrix[j])
	}

	matrix = malloc(size)
	init(matrix);
	other_trivial_ops(matrix);
	matrix_parts = split_balanced(matrix, nr_threads);
	for (i = 0; i < nr_threads; i++)
		spawn_thread(do_sth_memory_intensive, matrix, matrix_parts[i])

Since your workload is memory-intensive, NUMA capabilities are quite
tempting, but you want to exploit them with minimum effort. Having already
a static balancing scheme with the corresponding partition boundaries, all
you need is to replace the typical malloc() call with the proposed custom
interleaved allocation call, passing it your partition sizes and the
desired nodes. The rest of the operations on your data structure will work
as is. For example, converting the above code to be NUMA-aware would look
something like this:

	matrix = malloc(size)
	init(matrix);
	other_trivial_ops(matrix);
	matrix_parts = split_balanced(matrix, nr_threads);
+	matrix_interleaved = numa_alloc_interleaved_custom(size, matrix_parts,
+	                                                   nr_threads, mynodes);
+	memcpy(matrix_interleaved, matrix, size);
+	free(matrix);
	for (i = 0; i < nr_threads; i++)
		spawn_thread(do_sth_memory_intensive, matrix_interleaved,
		             matrix_parts[i])

Of course, if the splitting info can be obtained independently of `matrix',
the memcpy() could be omitted and `numa_alloc_interleaved_custom()' would
replace the first `malloc()'. But, anyway, this was my exact scenario.

> My first reaction is that you can create any interleaving scheme you
> want anyways using the low level functions.

This is true, but creating your own interleaving has some low-level
intricacies, which the proposed function hides effectively (see the P.S.
below for a rough sketch of doing it by hand). Supposing you already have
your own partitioning scheme, I think an out-of-the-box implementation
would be quite tempting.

> -Andi

Regards,
--
V.K.
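P.S. For completeness, here is roughly what the hand-rolled version built
on the low-level functions would look like. This is a simplified sketch
that glosses over exactly the intricacies (shared boundary pages, sub-page
partitions) the proposed call is meant to resolve; the helper name is made
up for illustration:

	#include <sys/mman.h>
	#include <numa.h>

	/* Simplified hand-rolled equivalent: reserve one contiguous
	 * mapping and bind page-rounded chunks of it node by node. */
	static void *alloc_interleaved_by_hand(size_t size, const size_t *parts,
	                                       size_t nr_parts, const int *nodes)
	{
		size_t pagesize = numa_pagesize();
		size_t total = (size + pagesize - 1) & ~(pagesize - 1);
		size_t i;
		char *base, *p;

		base = mmap(NULL, total, PROT_READ | PROT_WRITE,
		            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (base == MAP_FAILED)
			return NULL;

		p = base;
		for (i = 0; i < nr_parts; i++) {
			/* Naively round each partition up to a page boundary;
			 * this is where the boundary-page and small-partition
			 * decisions would really have to be made. */
			size_t chunk = (parts[i] + pagesize - 1) & ~(pagesize - 1);
			if (chunk > (size_t)(base + total - p))
				chunk = base + total - p;
			numa_tonode_memory(p, chunk, nodes[i]);
			p += chunk;
		}
		return base;
	}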