Re: [RFC dwarves] btf_encoder: record BTF-centric function state instead of DWARF-centric

Alan Maguire <alan.maguire@xxxxxxxxxx> · Mon, 9 Sep 2024 23:19:05 +0100

On 09/09/2024 21:01, Jiri Olsa wrote:
> On Mon, Sep 09, 2024 at 04:50:31PM +0100, Alan Maguire wrote:
>> When recording function information for later comparison (as we
>> do when skipping inconsistent function descriptions), we utilize
>> DWARF-based representations because we do not want to jump the
>> gun and add BTF representations for functions that have inconsistent
>> representations across CUs (and across encoders in parallel mode).
>>
>> So to handle this, we save info about functions, and we can then add
>> them later once we have ensured their various representations are
>> in fact consistent.  However to ensure that the function info
>> is still valid, we need to specify LSK__KEEPIT for CUs, which
>> bloats memory usage (in some cases to ~4Gb).  This is not a good
>> approach, mea culpa.
>>
>> Instead, store a BTF-centric representation where we
>>
>> - store the number of parameters
>> - store the BTF ids of return type and parameters
>> - store the name of the parameters where present
>> - store any LLVM annotation values, component idxs if present
>>
>> So in summary, store everything we need to add the BTF_KIND_FUNC
>> and BTF_KIND_FUNC_PROTO and any associated annotations.  This will
>> allow us to free CUs as we go but make it possible to add functions
>> later.
>>
>> For name storage we can take advantage of the fact that
>> BTF will avoid re-adding a name string so we btf__add_str()
>> to add the parameter name and store the string offset instead;
>> this prevents duplicate name storage while ensuring the parameter
>> name is in BTF.
>>
>> When we cross-compare functions for consistency, do a shallow
>> analysis akin to what was done with DWARF prototype comparisons;
>> compare base types by name, reference types by target type,
>> match loosely between fwds, structs and unions etc.
>>
>> When this is done, memory consumption peaks at 1Gb rather than
>> ~4Gb for vmlinux generation.  Time taken appears to be approximately
>> the same for -j1, but slightly faster for multiple threads;
>> for example:
> 
> nice! did not realize LSK__KEEPIT was that bad..
>

yeah, it was when it appeared we ran out of address space on 32-bit (!)
that I started digging; I'm not sure why I didn't catch this earlier
when I made this change. Moving to using a shared ELF function table
will likely ease the pain further too I suspect. Thanks for testing!
More replies below...

>>
>> Baseline
>>
>> $ time pahole -J vmlinux -j1 --btf_features=default
>>
>> real	0m17.268s
>> user	0m15.808s
>> sys	0m1.415s
>>
>> $ time pahole -J vmlinux -j8 --btf_features=default
>>
>> real	0m10.768s
>> user	0m30.793s
>> sys	0m4.199s
>>
>> With these changes:
>>
>> $ time pahole -J vmlinux -j1 --btf_features=default
>>
>> real	0m16.564s
>> user	0m16.029s
>> sys	0m0.492s
>>
>> $ time pahole -J vmlinux -j8 --btf_features=default
>>
>> real	0m8.332s
>> user	0m30.627s
>> sys	0m0.714s
>>
>> In terms of functions encoded, 360 fewer functions make it into
> 
> on my setup I'm getting 386 fewer fcuntions
> 
>> BTF due to the different approach in consistency checking, but after
>> examining these cases, they do appear to be legitimately inconsistent
>> functions where the optimized versions have parameter mismatches
>> with the non-optimized expectations.
> 
> I checked on one case and it seems like obvious inconsistency:
>
>   static void io_serial_out(unsigned long addr, int offset, int value)
>   static void io_serial_out(struct uart_port *p, int offset, int value)
> 
> not sure how we could missed that before
> 

I _think_ I know why; there was a bug in prototype comparison where we
were returning success early if prototype return values matched; in
funcs__match() we saw:

	if (f1->proto.tag.type == f2->proto.tag.type)
                return true;

This was meant to be a fastpath for matching tags, but this is just the
return value type (which matches in the above case you found). The BTF
matching doesn't have this issue, so that likely explains why we're
being a bit fussier and matching fewer functions.

>>
>> Mileage may vary of course, and any testing folks could do would
>> be greatly appreciated!
> 
> would be nice to have some test for before/after vmlinux images,
> that would compare generated BTF functions
> 

Yeah, and also ideally keep an eye on peak memory utilization during BTF
generation. Arnaldo has added a few tests, I'll see if I can come up
with something in this area too.

> thanks,
> jirka
> 
> 
> SNIP
> 
>>  static int32_t btf_encoder__save_func(struct btf_encoder *encoder, struct function *fn, struct elf_function *func)
>>  {
>> -	fn->priv = encoder->cu;
>> -	if (func->function) {
>> -		struct function *existing = func->function;
>> +	struct btf_encoder_func_state *state = zalloc(sizeof(*state));
>> +	struct ftype *ftype = &fn->proto;
>> +	struct btf *btf = encoder->btf;
>> +	struct llvm_annotation *annot;
>> +	struct parameter *param;
>> +	uint8_t param_idx = 0;
>> +
>> +	if (!state)
>> +		return -ENOMEM;
>> +	state->nr_parms = ftype->nr_parms + (ftype->unspec_parms ? 1 : 0);
>> +	state->ret_type_id = ftype->tag.type == 0 ? 0 : encoder->type_id_off + ftype->tag.type;
>> +	if (state->nr_parms > 0) {
>> +		state->parms = zalloc(state->nr_parms * sizeof(*state->parms));
>> +		if (!state->parms)
>> +			return -ENOMEM;
>> +	}
>> +	state->inconsistent_proto = ftype->inconsistent_proto;
>> +	state->unexpected_reg = ftype->unexpected_reg;
>> +	state->optimized_parms = ftype->optimized_parms;
>> +	ftype__for_each_parameter(ftype, param) {
>> +		const char *name = parameter__name(param) ?: "";
>> +
>> +		state->parms[param_idx].name_off = btf__add_str(btf, name);
>> +		if (state->parms[param_idx].name_off < 0)
>> +			return state->parms[param_idx].name_off;
>> +		state->parms[param_idx].type_id = param->tag.type == 0 ? 0 : encoder->type_id_off + param->tag.type;
> 
> so IIUC because functions are added as last we're sure all the used types
> for arguments are stored in encoder's BTF already.. for funcs__match later ?
>

Exactly. We add functions last, so their parameters and return values
are already assigned BTF types.  The only thing that _may_ be missing
(apart from the FUNC and FUNC_PROTO themselves) is the names of the
parameters; we can efficiently add them while saving function info
because libbpf is clever enough to dedup strings when added.  This is
the thing I'd missed before; as long as we can compare the return value
+ parameter types (which we can), we can do our function equivalence
comparisons in BTF. In some cases (same encoder) we will be comparing
within the same BTF, while after thread completion, we need to compare
across encoders to catch inconsistencies.

>> +		param_idx++;
>> +	}
>> +	if (ftype->unspec_parms)
>> +		state->parms[param_idx].type_id = 0;
>> +
>> +	list_for_each_entry(annot, &fn->annots, node)
>> +		state->nr_annots++;
>> +	if (state->nr_annots) {
>> +		uint8_t idx = 0;
>> +
>> +		state->annots = zalloc(sizeof(*state->annots));
> 
> zalloc needs sizeof(..) * state->nr_annots ?
> 

oops, thanks for catching that!

>> +		if (!state->annots)
>> +			return -ENOMEM;
>> +		list_for_each_entry(annot, &fn->annots, node) {
>> +			state->annots[idx].value = btf__add_str(encoder->btf, annot->value);
>> +			if (state->annots[idx].value < 0)
>> +				return state->annots[idx].value;
>> +			state->annots[idx].component_idx = annot->component_idx;
>> +			idx++;
>> +		}
>> +	}
>> +
>> +	if (func->state) {
>> +		struct btf_encoder_func_state *existing = func->state;
>>  
>>  		/* If saving and we find an existing entry, we want to merge
>>  		 * observations across both functions, checking that the
>> @@ -899,98 +1097,136 @@ static int32_t btf_encoder__save_func(struct btf_encoder *encoder, struct functi
>>  		 * to add the local function later (encoder + type_id_off)
>>  		 * such that we can add the function later.
>>  		 */
>> -		existing->proto.optimized_parms |= fn->proto.optimized_parms;
>> -		existing->proto.unexpected_reg |= fn->proto.unexpected_reg;
>> -		if (!existing->proto.unexpected_reg && !existing->proto.inconsistent_proto &&
>> -		     !funcs__match(encoder, func, fn))
>> -			existing->proto.inconsistent_proto = 1;
>> +		existing->optimized_parms |= state->optimized_parms;
>> +		existing->unexpected_reg |= state->unexpected_reg;
>> +		if (!existing->unexpected_reg && !state->inconsistent_proto &&
>> +		     !funcs__match(encoder, func, encoder->btf, state,
>> +				   encoder->btf, existing))
>> +			existing->inconsistent_proto = 1;
>> +		zfree(&state->annots);
>> +		zfree(&state->parms);
>> +		free(state);
> 
> some error returns above are missing whole state cleanup
>

yep, will rework to ensure we free in error case.

>>  	} else {
>> -		func->state.type_id_off = encoder->type_id_off;
>> -		func->function = fn;
>> -		encoder->cu->functions_saved++;
>> +		func->state = state;
>>  	}
>>  	return 0;
>>  }
>>  
> 
> SNIP

thanks again for looking at this and trying it out!

Alan