On 1/25/23 10:59, Alan Maguire wrote:
On 25/01/2023 17:54, Kui-Feng Lee wrote:
On 1/24/23 05:45, Alan Maguire wrote:
+/*
+ * static functions with suffixes are not added yet - we need to
+ * observe across all CUs to see if the static function has
+ * optimized parameters in any CU, since in such a case it should
+ * not be included in the final BTF. NF_HOOK.constprop.0() is
+ * a case in point - it has optimized-out parameters in some CUs
+ * but not others. In order to have consistency (since we do not
+ * know which instance the BTF-specified function signature will
+ * apply to), we simply skip adding functions which have optimized
+ * out parameters anywhere.
+ */
+static int32_t btf_encoder__save_func(struct btf_encoder *encoder, struct function *fn)
+{
+ struct btf_encoder *parent = encoder->parent ? encoder->parent : encoder;
+ const char *name = function__name(fn);
+ struct function **nodep;
+ int ret = 0;
+
+ pthread_mutex_lock(&parent->saved_func_lock);
Do you have the number of static functions with suffixes?
There are a few thousand, and around 25000 static functions
overall ("."-suffixed are all static) that will participate in
the tree representations (see patch 5). This equates to roughly
half of the vmlinux BTF functions.
To evaluate the effectiveness of your patchset, I conducted an
experiment where I ran a command:
`time env LLVM_OBJCOPY=objcopy pahole -J --btf_gen_floats
--lang_exclude=rust -j .tmp_vmlinux.btf`.
On my machine, it took about
- 9s w/o the patchset (3s waiting for the worker threads)
- 13s w/ the patchset (7s waiting for the worker threads)
It was about a 4s difference.
If I turned multi-threading off (w/o -j), it took
- 28s w/o the patchset.
- 32s w/ the patchset.
It was about a 4s difference as well.
Hence, multi-threading does not benefit us in the case of this
patchset; lock contention should be weighed heavily here.
Those four seconds amount to roughly 4% of the time of a Linux
incremental build (about 96s~108s) after an insignificant modification
to the source tree.
Taking into consideration previous experience, which showed BTF info
processing time (not including loading and IO) being reduced to 13%, I am
uncertain whether it pays off to invest my time in reducing the 4s to <1s.
Then again, cutting down 3 seconds every single time I need to rebuild the
tree for some small change might be worth it.
If the number of static functions with suffixes is high, contention on the lock could be an issue.
Is it possible to keep a local pool of static functions with suffixes? The pool would be combined with its parent either at the completion of a CU, before the thread ends, or when merging into the main thread.
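To make the suggestion concrete, here is a rough sketch of what that could
look like, assuming hypothetical local_funcs/local_func_cnt/local_func_alloc
members on struct btf_encoder and helper names that do not exist in pahole
today (the per-function state recording done in btf_encoder__save_func() is
omitted for brevity):

#include <pthread.h>
#include <search.h>
#include <stdlib.h>

/* Batch suffixed static functions per encoder without taking any lock. */
static int btf_encoder__queue_func(struct btf_encoder *encoder, struct function *fn)
{
	if (encoder->local_func_cnt == encoder->local_func_alloc) {
		size_t new_alloc = encoder->local_func_alloc ? 2 * encoder->local_func_alloc : 128;
		struct function **tmp = realloc(encoder->local_funcs, new_alloc * sizeof(*tmp));

		if (tmp == NULL)
			return -1;
		encoder->local_funcs = tmp;
		encoder->local_func_alloc = new_alloc;
	}
	encoder->local_funcs[encoder->local_func_cnt++] = fn;
	return 0;
}

/* Flush the local pool into the parent's shared tree in one critical
 * section, e.g. at CU completion or before the worker thread exits. */
static int btf_encoder__flush_queued_funcs(struct btf_encoder *encoder)
{
	struct btf_encoder *parent = encoder->parent ? encoder->parent : encoder;
	size_t i;
	int ret = 0;

	pthread_mutex_lock(&parent->saved_func_lock);
	for (i = 0; i < encoder->local_func_cnt; i++) {
		struct function *fn = encoder->local_funcs[i];
		struct function **nodep = tsearch(fn, &parent->saved_func_tree, function__compare);

		if (nodep == NULL) {
			ret = -1;
			break;
		}
		if (*nodep != fn)	/* existing entry: merge optimized-out observations */
			(*nodep)->proto.optimized_parms |= fn->proto.optimized_parms;
	}
	pthread_mutex_unlock(&parent->saved_func_lock);
	encoder->local_func_cnt = 0;
	return ret;
}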
It's possible alright. I'll try to lay out the possibilities so we
can figure out the best way forward.
Option 1: global tree of static functions, created during DWARF loading
Pro: Quick addition/lookup, we can flag optimizations or inconsistent prototypes as
we encounter them.
Con: Lock contention between encoder threads.
Option 2: store static functions in a per-encoder tree, traverse them all
prior to BTF merging to eliminate unwanted functions
Pro: limits contention.
Con: for each static function in each encoder, we need to look it up in all other
encoder trees. In option 1 we paid that price as the function was added, here
we pay it later on prior to merging. So processing here is
O(number_functions * num_encoders). There may be a cleverer way to handle
this but I can't see it right now.
There may be other approaches to this of course, but these were the two I
could come up with. What do you think?
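To make the Option 2 cost concrete, here is a rough sketch of the pre-merge
reconciliation pass, assuming each encoder kept its own saved_func_tree and
using illustrative helper names that are not current pahole APIs:

#include <search.h>

static struct btf_encoder **all_encoders;	/* walk context; twalk() has no user data pointer */
static int nr_encoders;

/* For one saved static function, pull in the optimized-out-parameter
 * observations recorded by every other encoder for the same function. */
static void merge_func_observations(const void *nodep, VISIT which, int depth)
{
	struct function *fn = *(struct function **)nodep;
	int i;

	if (which != postorder && which != leaf)
		return;

	for (i = 0; i < nr_encoders; i++) {
		struct function **othernodep = tfind(fn, &all_encoders[i]->saved_func_tree,
						     function__compare);

		if (othernodep && *othernodep != fn)
			fn->proto.optimized_parms |= (*othernodep)->proto.optimized_parms;
	}
}

/* O(number_functions * num_encoders): every entry in every per-encoder tree
 * is looked up in all the other trees prior to BTF merging. */
static void btf_encoders__reconcile_saved_funcs(struct btf_encoder **encoders, int n)
{
	int i;

	all_encoders = encoders;
	nr_encoders = n;
	for (i = 0; i < n; i++)
		twalk(encoders[i]->saved_func_tree, merge_func_observations);
}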
Option 2 appears to be the more convenient and effective solution,
whereas Option 1, I guess, would require considerable effort for a
successful outcome.
Alan
+ nodep = tsearch(fn, &parent->saved_func_tree, function__compare);
+ if (nodep == NULL) {
+ fprintf(stderr, "error: out of memory adding local function '%s'\n",
+ name);
+ ret = -1;
+ goto out;
+ }
+ /* If we find an existing entry, we want to merge observations
+ * across both functions, checking that the "seen optimized-out
+ * parameters" status is reflected in our tree entry.
+ * If the entry is new, record the encoder state
+ * (encoder + type_id_off) required to add the
+ * local function later.
+ */
+ if (*nodep != fn) {
+ (*nodep)->proto.optimized_parms |= fn->proto.optimized_parms;
+ } else {
+ struct btf_encoder_state *state = zalloc(sizeof(*state));
+
+ if (state == NULL) {
+ fprintf(stderr, "error: out of memory adding local function '%s'\n",
+ name);
+ ret = -1;
+ goto out;
+ }
+ state->encoder = encoder;
+ state->type_id_off = encoder->type_id_off;
+ fn->priv = state;
+ encoder->saved_func_cnt++;
+ if (encoder->verbose)
+ printf("added local function '%s'%s\n", name,
+ fn->proto.optimized_parms ?
+ ", optimized-out params" : "");
+ }
+out:
+ pthread_mutex_unlock(&parent->saved_func_lock);
+ return ret;
+}
+
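As an aside, the dedup in btf_encoder__save_func() hinges on
function__compare() treating entries for the same function name as equal
across CUs; I am assuming it is essentially a name comparison along these
lines (illustrative only, not necessarily the actual implementation in the
patchset):

#include <string.h>

static int function__compare(const void *a, const void *b)
{
	struct function *fa = (struct function *)a, *fb = (struct function *)b;

	return strcmp(function__name(fa), function__name(fb));
}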