Hi,

On 11/26/2012 10:21 PM, Andi Kleen wrote:
>> +{
>> +	int i;
>> +	int nr_nodes = numa_num_configured_nodes();
>> +	int pagesize = numa_pagesize();
>> +	size_t *node_scores = calloc(nr_nodes, sizeof(*node_scores));
>> +	if (!node_scores) {
>> +		perror("malloc");
>> +		exit(1);
> You cannot handle errors like this in the library.

Definitely; this was just a proof of concept of my patch.

> It's not fully clear to me what the interface of your proposed new function
> is. Can you send a spec of the new user interface in a manpage like format?

void *numa_alloc_interleaved_custom(size_t size, size_t *parts,
                                    size_t nr_parts, int *nodes);

User-specified custom interleaved allocation. This function allocates a
contiguous region of `size' bytes (rounded up to the system's page size),
interleaved in `nr_parts' partitions across the system's memory nodes
according to the interleaving specification passed in the `parts' and
`nodes' arrays. The size of each partition in bytes and the node it must
lie on are passed in the `parts' and `nodes' arrays, respectively, both of
size `nr_parts'. For example, the i-th partition, of size `parts[i]', will
be placed on node `nodes[i]'.

This function takes care of correctly aligning each partition to the
system's page size, and the `parts' array is updated accordingly. A page at
the boundary of two partitions is assigned to the partition owning the
larger part of it, provided that each partition is left with at least one
page. Very small partitions, not exceeding the system's page size, are
merged into larger ones and placed on the node with the largest allocation
requirements among the nodes of the merged partitions. The `parts' and
`nodes' arrays are updated accordingly. For example, given the following
interleaved allocation scheme (in bytes):

	parts = {100, 200, 300, 400}
	nodes = {0, 0, 1, 1}

the resulting allocation will be

	parts = {4096, 0, 0, 0}
	nodes = {1, 0, 1, 1}

Note that the sizes of the merged partitions are set to zero, while their
original node specification is left untouched.

On success, a pointer to the newly allocated area is returned; otherwise
NULL is returned. The `parts' and `nodes' arguments are set to the updated
partition sizes and nodes.

Similarly to numa_alloc_onnode(), this function just reserves the requested
allocation space and binds the user-specified partitions to the desired
memory nodes. The actual (physical page) allocation takes place on the
first write. In fact, `numa_alloc_interleaved_custom(mysize, &mysize, 1,
&mynode)' is equivalent to `numa_alloc_onnode(mysize, mynode)'.
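For illustration, a minimal sketch of how a caller would use it, matching
the merging example above (assuming a 4 KiB page size and a two-node
system; error handling kept trivial):

	#include <stdio.h>

	/* Proposed interface; not part of libnuma yet. */
	void *numa_alloc_interleaved_custom(size_t size, size_t *parts,
	                                    size_t nr_parts, int *nodes);

	int main(void)
	{
		size_t parts[] = {100, 200, 300, 400};
		int nodes[] = {0, 0, 1, 1};
		size_t size = 100 + 200 + 300 + 400;
		size_t i;

		void *buf = numa_alloc_interleaved_custom(size, parts, 4, nodes);
		if (!buf)
			return 1;

		/* On return, `parts' and `nodes' describe the page-aligned
		 * (here: merged) partitions actually bound, i.e.
		 * parts = {4096, 0, 0, 0} and nodes = {1, 0, 1, 1}. */
		for (i = 0; i < 4; i++)
			printf("part %zu: %zu bytes on node %d\n",
			       i, parts[i], nodes[i]);
		return 0;
	}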
> Also it would be good to have some more information on the use case.

Suppose you have a large data structure (e.g., a sparse matrix) and a set
of operations that work on it, assuming a contiguous allocation. Some of
these operations are quite time-consuming and memory-intensive, so you have
decided to use multiple threads for them. Your load balancing is based on a
static partitioning of the data structure, with the partition boundaries
determined by `marks' (e.g., starting/ending rows) in the original data
structure. The following is a typical (somewhat abstract) scenario:

	void do_sth_memory_intensive(matrix, matrix_parts)
	{
		for (j = matrix_parts.start to matrix_parts.end)
			do_sth(matrix[j])
	}

	matrix = malloc(size)
	init(matrix);
	other_trivial_ops(matrix);
	matrix_parts = split_balanced(matrix, nr_threads);
	for (i = 0; i < nr_threads; i++)
		spawn_thread(do_sth_memory_intensive, matrix, matrix_parts[i])

Since your workload is memory-intensive, NUMA capabilities are quite
tempting, but you want to exploit them with minimum effort. Having already
a static balancing scheme with the corresponding partition boundaries, all
you need is to replace the typical malloc() call with the proposed custom
interleaved allocation call, passing it your partition sizes and the
desired nodes. The rest of the operations on your data structure will work
as is. For example, converting the above code to be NUMA-aware would look
something like this:

	matrix = malloc(size)
	init(matrix);
	other_trivial_ops(matrix);
	matrix_parts = split_balanced(matrix, nr_threads);
+	matrix_interleaved = numa_alloc_interleaved_custom(size, matrix_parts,
+	                                                   nr_threads, mynodes);
+	memcpy(matrix_interleaved, matrix, size);
+	free(matrix);
	for (i = 0; i < nr_threads; i++)
		spawn_thread(do_sth_memory_intensive, matrix_interleaved,
		             matrix_parts[i])

Of course, if the splitting info can be obtained independently of `matrix',
the memcpy() could be omitted and `numa_alloc_interleaved_custom()' would
replace the first `malloc()'. But, anyway, this was my exact scenario.

> My first reaction is that you can create any interleaving scheme you
> want anyways using the low level functions.

This is true, but creating your own interleaving has some low-level
intricacies, which the proposed function hides effectively (see the P.S.
below for a rough sketch of doing it by hand). Supposing you already have
your own partitioning scheme, I think an out-of-the-box implementation
would be quite tempting.

> -Andi

Regards,
--
V.K.
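P.S. For completeness, here is roughly what the hand-rolled version built
on the low-level functions would look like. This is a simplified sketch
that glosses over exactly the intricacies (shared boundary pages, sub-page
partitions) the proposed call is meant to resolve; the helper name is made
up for illustration:

	#include <sys/mman.h>
	#include <numa.h>

	/* Simplified hand-rolled equivalent: reserve one contiguous
	 * mapping and bind page-rounded chunks of it node by node. */
	static void *alloc_interleaved_by_hand(size_t size, const size_t *parts,
	                                       size_t nr_parts, const int *nodes)
	{
		size_t pagesize = numa_pagesize();
		size_t total = (size + pagesize - 1) & ~(pagesize - 1);
		size_t i;
		char *base, *p;

		base = mmap(NULL, total, PROT_READ | PROT_WRITE,
		            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (base == MAP_FAILED)
			return NULL;

		p = base;
		for (i = 0; i < nr_parts; i++) {
			/* Naively round each partition up to a page boundary;
			 * this is where the boundary-page and small-partition
			 * decisions would really have to be made. */
			size_t chunk = (parts[i] + pagesize - 1) & ~(pagesize - 1);
			if (chunk > (size_t)(base + total - p))
				chunk = base + total - p;
			numa_tonode_memory(p, chunk, nodes[i]);
			p += chunk;
		}
		return base;
	}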