Re: [PATCH bpf-next v9 1/4] bpf: add bpf_get_cpu_time_counter kfunc

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Mon, 02 Dec 2024 21:52:55 +0100

On Sun, Dec 01 2024 at 17:45, Vadim Fedorenko wrote:
> On 01.12.2024 12:46, Thomas Gleixner wrote:
>> On Fri, Nov 22 2024 at 16:58, Vadim Fedorenko wrote:
>>> diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
>>> index de0f9e5f9f73..a549aea25f5f 100644
>>> --- a/arch/x86/net/bpf_jit_comp32.c
>>> +++ b/arch/x86/net/bpf_jit_comp32.c
>>> @@ -2094,6 +2094,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>>>   			if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
>>>   				int err;
>>>   
>>> +				if (imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter)) {
>>> +					if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
>>> +						EMIT3(0x0F, 0xAE, 0xE8);
>>> +					EMIT2(0x0F, 0x31);
>> 
>> What guarantees that RDTSC is supported by the CPU?
>
> Well, technically it may be a problem on x86_32 because there are x86 compatible
> platforms which don't have RDTSC, but they are almost 16+ years old, and I'm not
> quite sure we expose vDSO on such platforms.

Care to look at the VDSO related config symbols? They are
unconditionally selected independent of 16+ year old platforms.

Also TSC can be disabled at boot time by a particular platform because
it's implementation is buggy.

It does not matter at all whether those platforms are old or not. What
matters is that the code which tries to access the TSC has to be correct
under all circumstances and not under magic assumptions.

>> Aside of that, if you want the read to be ordered, then you need to take
>> RDTSCP into account too.
>
> Yes, we have already had this discussion. RDTSCP has the same ordering
> guaranties as "LFENCE; RDTSC" according to the programming manuals. But it also
> provides "cookie" value, which is not used in this case and just trashes the
> value of ECX. To avoid additional register manipulation, I used lfence option.

With zero comment and zero explanation in the change log.

>>> +#if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
>>> +__bpf_kfunc u64 bpf_get_cpu_time_counter(void)
>>> +{
>>> +	const struct vdso_data *vd = __arch_get_k_vdso_data();
>>> +
>>> +	vd = &vd[CS_RAW];
>>> +
>>> +	/* CS_RAW clock_mode translates to VDSO_CLOCKMODE_TSC on x86 and
>> 
>> How so?
>> 
>> vd->clock_mode is not guaranteed to be VDSO_CLOCKMODE_TSC or
>> VDSO_CLOCKMODE_ARCHTIMER. CS_RAW is the access to the raw (uncorrected)
>> time of the current clocksource. If the clock mode is not matching, then
>> you cannot access it.
>
> That's more about x86 and virtualization options. But in the end all this ends
> up in reading tsc value.

As long as VDSO_CLOCKMODE_TSC == 1. There is no guarantee for this.

> And we do JIT anyway, so this function call will never be executed on
> x86. Other architectures (well, apart from MIPS) don't care about
> vd->clock_mode at all. And we don't provide kfuncs for architectures
> without JIT

They care. Because all of them can set VDSO_CLOCKMODE_NONE, which forces
the fallback into the syscall. And they do so for good reasons. Either
because the clocksource is not functional or has other limitiations
which prevent VDSO usage or even usage in the kernel.

> For MIPS I think I can ifdef these new kfuncs to the case when CONFIG_CSRC_R4K
> is not defined.

Which excludes VDSO_CLOCKMODE_GIC, which is the most common clock source
on modern MIPS systems.

Aside of that a config symbol does not guarantee at all that the
clocksource exists on the actual hardware or is properly configured and
enabled.

>>> +	 * We still have to provide vdso_data for some architectures to avoid
>>> +	 * NULL pointer dereference.
>>> +	 */
>>> +	return __arch_get_hw_counter(1, vd);
>> 
>> This is outright dangerous. __arch_get_hw_counter() is for VDSO usage
>> and not for in kernel usage. What guarantees you that the architecture
>> specific implementation does not need access to user only mappings.
>> 
>> Aside of that what guarantees that '1' is what you want and stays that
>> way forever? It's already broken on MIPS.
>
> I can ifdef MIPS case until we have JIT for it (which has pretty much 
> straightforward implementation for HW counter)

Again. You cannot make any assumptions about any implementation detail
of __arch_get_hw_counter().

It is a function solely designed for user space VDSO usage and in kernel
usage is simply bogus.

Just because it "works" today, does not guarantee that it works tomorrow
and there is no justification for BPF to enforce compatibility with
magic number '1' and a constraint that the code has to work in the
kernel.

Thanks,

        tglx