Re: [PATCH] Properly align memory allocations and temporary buffers

René Scharfe <l.s.r@xxxxxx> · Sat, 8 Jan 2022 00:30:44 +0100

Am 07.01.22 um 22:30 schrieb Junio C Hamano:
> René Scharfe <l.s.r@xxxxxx> writes:
>
>> If falling back to malloc(3) is fine in general then I wonder if we can
>> just it use always.  This would save two branches and make buffer
>> management trivial here.  How much worse is malloc(3) on platforms with
>> alloca(3)?  Do we sort lots of short lists somewhere?  In other words:
>> Does this stack allocation optimization actually make a measurable
>> difference?
>
> Well all the preimage of this came from your 04ee8b87 (compat: add
> qsort_s(), 2017-01-22), so you tell me ;-)

Right, except I stole the code almost verbatim from compat/qsort.c,
which had this optimization since 43fe901b71 (compat: Add simplified
merge sort implementation from glibc, 2008-02-05).  :)  The
optimization may have raised my eyebrow a bit at the time, but not
enough to come up with a meaningful benchmark..

https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/msort.c (the
original) still uses alloca(3) (and more tricks), by the way;
https://cgit.freebsd.org/src/tree/lib/libc/stdlib/merge.c still
doesn't.

>
>> Heap fragmention should not be a concern here, at least, because the
>> pattern of requesting, using and releasing a single allocation won't
>> leave holes.
>
> Sure.  Even in a multi-threaded environment that should be true.
>
> ----- >8 --------- >8 --------- >8 --------- >8 --------- >8 -----
> Subject: compat/qsort_s.c: avoid using potentially unaligned access
>
> The compatibility definition for qsort_s() uses "char buffer[1024]"
> on the stack to avoid making malloc() calls for small temporary
> space, which essentially hand-rolls alloca().
>
> But the elements of the array being sorted may have alignment needs
> more strict than what an array of bytes may have. &buf[0] may be
> word aligned, but using the address as if it stores the first
> element of an array of a struct, whose first member may need to be
> aligned on double-word boundary, would be a no-no.
>
> We could use xalloca() from git-compat-util.h, or alloca() directly
> on platforms with HAVE_ALLOCA_H, but let's try using unconditionally
> xmalloc() before we know the performance characteristics of the
> callers.
>
> It may not make much of an argument to inspect the current callers
> and say "it shouldn't matter to any of them", but anyway:
>
>  * The one in object-name.c is used to sort potential matches to a
>    given ambiguous object name prefix in the error path;
>
>  * The one in pack-write.c is done once per a pack .idx file being
>    written to create the reverse index, so (1) the cost of malloc()
>    overhead is dwarfed by the cost of the packing operation, and (2)
>    the number of entries being sorted is the number of objects in a
>    pack;
>
>  * The one in ref-filter.c is used by "branch --list", "tag --list",
>    and "for-each-ref", only once per operation.  We sort an array of
>    pointers with entries, each corresponding to a ref that is shown.
>
>  * The one in string-list.c is used by sort_string_list(), which is
>    way too generic to assume any access patterns, so it may or may
>    not matter, but I do not care too much ;-)
>
> Signed-off-by: Junio C Hamano <gitster@xxxxxxxxx>
> ---
>  compat/qsort_s.c | 14 ++++----------
>  1 file changed, 4 insertions(+), 10 deletions(-)
>
> diff --git c/compat/qsort_s.c w/compat/qsort_s.c
> index 52d1f0a73d..0f7ff30f5f 100644
> --- c/compat/qsort_s.c
> +++ w/compat/qsort_s.c
> @@ -49,21 +49,15 @@ int git_qsort_s(void *b, size_t n, size_t s,
>  		int (*cmp)(const void *, const void *, void *), void *ctx)
>  {
>  	const size_t size = st_mult(n, s);
> -	char buf[1024];
> +	char *tmp;
>
>  	if (!n)
>  		return 0;
>  	if (!b || !cmp)
>  		return -1;
>
> -	if (size < sizeof(buf)) {
> -		/* The temporary array fits on the small on-stack buffer. */
> -		msort_with_tmp(b, n, s, cmp, buf, ctx);
> -	} else {
> -		/* It's somewhat large, so malloc it.  */
> -		char *tmp = xmalloc(size);
> -		msort_with_tmp(b, n, s, cmp, tmp, ctx);
> -		free(tmp);
> -	}
> +	tmp = xmalloc(size);
> +	msort_with_tmp(b, n, s, cmp, tmp, ctx);
> +	free(tmp);
>  	return 0;
>  }

--- >8 ---
Subject: [PATCH] stable-qsort: avoid using potentially unaligned access

Like in the previous patch for compat/qsort_s.c, remove the optimization
of using an on-stack buffer to avoid small allocations.  This ensures
maximum alignment for the array elements and simplifies the code a bit.

The performance impact for the current callers is unlikely to be
noticeable:

 * compat/mingw.c::make_environment_block() uses ALLOC_ARRAY and
   ALLOC_GROW several times already, so another allocation of up to 1KB
   should not matter much.

 * diffcore-rename.c::diffcore_rename_extended() is called once per diff
   or twice per merge, and those require allocations for each object and
   more already.

 * merge-ort.c::detect_and_process_renames() is called once per merge.
   It's responsible for the two per-merge diffcore_rename_extended()
   calls mentioned above as well, though.  So this is possibly the most
   impacted caller.  Per-object allocations are likely to dwarf the
   additional small allocations in git_stable_qsort(), though.

Signed-off-by: René Scharfe <l.s.r@xxxxxx>
---
 stable-qsort.c | 16 +++++-----------
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/stable-qsort.c b/stable-qsort.c
index 6cbaf39f7b..7ff12467cd 100644
--- a/stable-qsort.c
+++ b/stable-qsort.c
@@ -48,15 +48,9 @@ void git_stable_qsort(void *b, size_t n, size_t s,
 		      int (*cmp)(const void *, const void *))
 {
 	const size_t size = st_mult(n, s);
-	char buf[1024];
-
-	if (size < sizeof(buf)) {
-		/* The temporary array fits on the small on-stack buffer. */
-		msort_with_tmp(b, n, s, cmp, buf);
-	} else {
-		/* It's somewhat large, so malloc it.  */
-		char *tmp = xmalloc(size);
-		msort_with_tmp(b, n, s, cmp, tmp);
-		free(tmp);
-	}
+	char *tmp;
+
+	tmp = xmalloc(size);
+	msort_with_tmp(b, n, s, cmp, tmp);
+	free(tmp);
 }
--
2.34.1