Re: [PATCH v2 2/2] hex: make hash_to_hex_algop() and friends thread-safe

René Scharfe <l.s.r@xxxxxx> · Sun, 26 Jul 2020 19:41:08 +0200

Am 18.07.20 um 05:52 schrieb Matheus Tavares:
> On Thu, Jul 16, 2020 at 9:56 AM Johannes Schindelin <Johannes.Schindelin@xxxxxx> wrote:
>>
>> Now, I am _far_ from knowing what I'm doing with Coccinelle, but I think
>> this here semantic patch should get you going:
>>
>> -- snipsnap --
>> @@
>> expression E;
>> @@
>>   {
>> ++   char hex[GIT_MAX_HEXSZ + 1];
>>      ...
>> -    oid_to_hex(E)
>> +    oid_to_hex_r(hex, E)
>>      ...
>>   }
>>
>> @@
>> expression E1, E2;
>> @@
>>   {
>> ++   char hex1[GIT_MAX_HEXSZ + 1], hex2[GIT_MAX_HEXSZ + 1];
>>      ...
>> -    oid_to_hex(E1)
>> +    oid_to_hex_r(hex1, E1)
>>      ...
>> -    oid_to_hex(E2)
>> +    oid_to_hex_r(hex2, E2)
>>      ...
>>   }
>
> Thanks for this nice example! This already worked very well in some of
> my tests :)
>
> However, with my _very_ limited notion of Coccinelle, I didn't
> understand why some code snippets didn't match the above rules. For
> example, the structure below:
>
> func(...)
> {
> 	if (cond)
> 		func2("%s", oid_to_hex(a));
> }
>
> I thought it could be because the `if` statement is missing the curly
> brackets (and it does work if I add the brackets), but to my surprise,
> adding another oid_to_hex() call in an `else` case also made the code
> match the rule:
>
> func(...)
> {
> 	if (cond)
> 		func2("%s", oid_to_hex(a));
> 	else
> 		func2("%s", oid_to_hex(a));
> }
>
> The following snippet also correctly matches, but spatch introduces only
> one `hex` variable:
>
> 	if (cond)
> 		func2("%s, %s", oid_to_hex(a), oid_to_hex(b));
> 	else
> 		func2("%s", oid_to_hex(a));
>

The produced diff looks like this:

+	char hex[GIT_MAX_HEXSZ + 1];
 	if (cond)
-		func2("%s, %s", oid_to_hex(a), oid_to_hex(b));
+		func2("%s, %s", oid_to_hex_r(hex, a), oid_to_hex_r(hex, b));
 	else
-		func2("%s", oid_to_hex(a));
+		func2("%s", oid_to_hex_r(hex, a));

So the first rule was applied (the one for a single oid_to_hex call),
but we need the second one (for two oid_to_hex calls).  Using the same
hex buffer for two different values won't work.

> We will probably want our semantic rules to handle an arbitrary number
> of `oid_to_hex()` calls in each function, but in scenarios like the
> above one, we only really need 2 hex buffers despite having 3 calls...

oid_to_hex() has two interesting features, and we need to make sure they
are preserved for callers that use them: It has a ring of four buffers,
so you can use it for four different values, and those buffers are
static, so its results can be passed around arbitrarily.

> That might be a little tricky, I guess.

It may be impossible to cover all cases.  E.g. callers of oid_to_hex()
could return that buffer (like e.g. diff_aligned_abbrev()) or save
them in some global variable (like e.g. string_list_append() with a
non-DUP string_list).  But safe conversion rules can got us surprisingly
far.

> Another thing that might be tricky in this conversion is checking for
> name conflicts with the added `hex` variable (but maybe Coccinelle
> already has a facilitator mechanism for such cases? IDK).

That's easy.  There exists no hex_buffer, yet.  We can just claim that
name for automatic conversions.  It's a bit too long for people to
type out (we have a few variables named hexbuffer, though), but not
crazy long.

So what is safe?  Calls of oid_to_hex() in the argument list of many
functions is.  E.g. puts() and printf() just consume the string that
is passed to them, and they don't store it anywhere.  That means no
static storage is needed for those.

string_list_append() is only safe if it's the duplicating variant.
Since this is a runtime option of the underlying string_list this is
hard to prove in a Coccinelle rule.  The time is better spent
converting them manually, I think.

And function calls with more than four oid_to_hex() calls are broken
already, so we only need to have rules for up to four of them.  Here
is the simplest rule I can come up with for handling up to four
oid_to_hex() calls:

@@
identifier fn =~ "^(.*printf.*|puts)$";
@@
(
  fn(...,
-   oid_to_hex
+   oid_to_hex_r
    (
+     hex_buffer,
      ...
    ), ...,
-   oid_to_hex
+   oid_to_hex_r
    (
+     hex_buffer2,
      ...
    ), ...,
-   oid_to_hex
+   oid_to_hex_r
    (
+     hex_buffer3,
      ...
    ), ...,
-   oid_to_hex
+   oid_to_hex_r
    (
+     hex_buffer4,
      ...
    ), ...
  )
|
  fn(...,
-   oid_to_hex
+   oid_to_hex_r
    (
+     hex_buffer,
      ...
    ), ...,
-   oid_to_hex
+   oid_to_hex_r
    (
+     hex_buffer2,
      ...
    ), ...,
-   oid_to_hex
+   oid_to_hex_r
    (
+     hex_buffer3,
      ...
    ), ...
  )
|
  fn(...,
-   oid_to_hex
+   oid_to_hex_r
    (
+     hex_buffer,
      ...
    ), ...,
-   oid_to_hex
+   oid_to_hex_r
    (
+     hex_buffer2,
      ...
    ), ...
  )
|
  fn(...,
-   oid_to_hex
+   oid_to_hex_r
    (
+     hex_buffer,
      ...
    ), ...
  )
)

So the sub-rules are ordered from matching all four possible
oid_to_hex() calls down to a single one.  Only safe consumers are
matched.  That regex can and should be extended.

Having a list of good consumers has the nice side-effect of speeding up
the diff generation.  It still takes a few minutes on my system, though.

We still need to declare of the local buffers.  We can add them on
demand to each function with rules like this:

@avoid_duplicates@
identifier buf =~ "^hex_buffer[2-4]?$";
@@
- char buf[GIT_MAX_HEXSZ + 1];

@hex_buffer4_on_demand exists@
identifier fn;
@@
  fn(...) {
+   char hex_buffer4[GIT_MAX_HEXSZ + 1];
    <+... hex_buffer4 ...+>
  }

@hex_buffer3_on_demand exists@
identifier fn;
@@
  fn(...) {
+   char hex_buffer3[GIT_MAX_HEXSZ + 1];
    <+... hex_buffer3 ...+>
  }

@hex_buffer2_on_demand exists@
identifier fn;
@@
  fn(...) {
+   char hex_buffer2[GIT_MAX_HEXSZ + 1];
    <+... hex_buffer2 ...+>
  }

@hex_buffer_on_demand exists@
identifier fn;
@@
  fn(...) {
+   char hex_buffer[GIT_MAX_HEXSZ + 1];
    <+... hex_buffer ...+>
  }

Why remove them first?  To avoid duplicates when other convertible
oid_to_hex() calls are added later and the semantic patch applied
again.

Why declare the buffers with function scope?  To avoid shadowing.

Why are the rules reversed?  They add declarations at the top, so
hex_buffer to hex_buffer4 end up being declared in that order in
the resulting file.

Why not use the "fresh identifier" feature of Coccinelle to generate
an unused name each time?  I don't know how to integrate that into
the 4/3/2/1 rule above without having to repeat the list of safe
consumers or declaring unused buffers.  And reusing buffers between
safe consumers is fine.

René