> On Jan 20, 2023, at 11:35 PM, Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote:
>
> 2023-01-19 16:22 UTC+0800 ~ Tonghao Zhang <tong@xxxxxxxxxxxxx>
>>> On Jan 18, 2023, at 6:41 PM, Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote:
>>>
>>> 2023-01-17 12:49 UTC+0800 ~ tong@xxxxxxxxxxxxx
>>>> From: Tonghao Zhang <tong@xxxxxxxxxxxxx>
>>>>
>>>> The number of online cpu may be not equal to possible cpu.
>>>> bpftool prog profile, can not create pmu event on possible
>>>> but not online cpu.
>>>
>>> s/not/on/ ?
>>>
>>>>
>>>> $ dmidecode -s system-product-name
>>>> PowerEdge R620
>>>> $ cat /sys/devices/system/cpu/online
>>>> 0-31
>>>> $ cat /sys/devices/system/cpu/possible
>>>> 0-47
>>>>
>>>> To fix this issue, use online cpu instead of possible, to
>>>> create perf event and other resource.
>>>>
>>>> Signed-off-by: Tonghao Zhang <tong@xxxxxxxxxxxxx>
>>>> Cc: Quentin Monnet <quentin@xxxxxxxxxxxxx>
>>>> Cc: Alexei Starovoitov <ast@xxxxxxxxxx>
>>>> Cc: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
>>>> Cc: Andrii Nakryiko <andrii@xxxxxxxxxx>
>>>> Cc: Martin KaFai Lau <martin.lau@xxxxxxxxx>
>>>> Cc: Song Liu <song@xxxxxxxxxx>
>>>> Cc: Yonghong Song <yhs@xxxxxx>
>>>> Cc: John Fastabend <john.fastabend@xxxxxxxxx>
>>>> Cc: KP Singh <kpsingh@xxxxxxxxxx>
>>>> Cc: Stanislav Fomichev <sdf@xxxxxxxxxx>
>>>> Cc: Hao Luo <haoluo@xxxxxxxxxx>
>>>> Cc: Jiri Olsa <jolsa@xxxxxxxxxx>
>>>> ---
>>>>  tools/bpf/bpftool/prog.c | 5 +++--
>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
>>>> index cfc9fdc1e863..08b352dd799e 100644
>>>> --- a/tools/bpf/bpftool/prog.c
>>>> +++ b/tools/bpf/bpftool/prog.c
>>>> @@ -2056,6 +2056,7 @@ static int profile_parse_metrics(int argc, char **argv)
>>>>
>>>>  static void profile_read_values(struct profiler_bpf *obj)
>>>>  {
>>>> +	__u32 possible_cpus = libbpf_num_possible_cpus();
>>>>  	__u32 m, cpu, num_cpu = obj->rodata->num_cpu;
>>>>  	int reading_map_fd, count_map_fd;
>>>>  	__u64 counts[num_cpu];
>>>> @@ -2080,7 +2081,7 @@ static void profile_read_values(struct profiler_bpf *obj)
>>>>  			profile_total_count += counts[cpu];
>>>>
>>>>  	for (m = 0; m < ARRAY_SIZE(metrics); m++) {
>>>> -		struct bpf_perf_event_value values[num_cpu];
>>>> +		struct bpf_perf_event_value values[possible_cpus];
>>>>
>>>>  		if (!metrics[m].selected)
>>>>  			continue;
>>>> @@ -2321,7 +2322,7 @@ static int do_profile(int argc, char **argv)
>>>>  	if (num_metric <= 0)
>>>>  		goto out;
>>>>
>>>> -	num_cpu = libbpf_num_possible_cpus();
>>>> +	num_cpu = libbpf_num_online_cpus();
>>>>  	if (num_cpu <= 0) {
>>>>  		p_err("failed to identify number of CPUs");
>>>>  		goto out;
>>>
>>> Thanks, but it doesn't seem to be enough to solve the issue. How did you
>>> test it? With your series applied locally, I'm trying the following
>>> (Intel x86_64, CPUs: 0..7):
>>>
>>> # echo 0 > /sys/devices/system/cpu/cpu2/online
>>> # ./bpftool prog profile id 1525 duration 1 cycles instructions
>>> Error: failed to create event cycles on cpu 2
>>>
>>> It seems that we're still trying to open the perf events on the offline
>>> CPU in profile_open_perf_events(), because even though we try to use
>>> fewer of the possible CPUs we're still referencing them in order by
>>> their index. So it works if I only disabled the last CPUs instead (#7,
>>> then #6, ...), but to work with _any_ CPU disabled, we would need to
>>> retrieve the list of online CPUs.
>>>
>> Hi,
>> Thanks for your review and comments.
>> I didn't test the case where the offline CPU is not the last one.
>> Alternatively, to fix it, we can use the errno returned by
>> perf_event_open: if errno is ENODEV, we can skip profiling on that CPU.
>> The patch below fixes this issue.
>>
>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
>> index 620032042576..deec7196b48c 100644
>> --- a/tools/bpf/bpftool/common.c
>> +++ b/tools/bpf/bpftool/common.c
>> @@ -55,6 +55,24 @@ void p_err(const char *fmt, ...)
>>  	va_end(ap);
>>  }
>>
>> +void p_warn(const char *fmt, ...)
>> +{
>> +	va_list ap;
>> +
>> +	va_start(ap, fmt);
>> +	if (json_output) {
>> +		jsonw_start_object(json_wtr);
>> +		jsonw_name(json_wtr, "warning");
>> +		jsonw_vprintf_enquote(json_wtr, fmt, ap);
>> +		jsonw_end_object(json_wtr);
>> +	} else {
>> +		fprintf(stderr, "Warn: ");
>> +		vfprintf(stderr, fmt, ap);
>> +		fprintf(stderr, "\n");
>> +	}
>> +	va_end(ap);
>> +}
>> +
>>  void p_info(const char *fmt, ...)
>>  {
>>  	va_list ap;
>> diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
>> index a84224b6a604..e62edec9e13a 100644
>> --- a/tools/bpf/bpftool/main.h
>> +++ b/tools/bpf/bpftool/main.h
>> @@ -86,6 +86,7 @@ extern struct btf *base_btf;
>>  extern struct hashmap *refs_table;
>>
>>  void __printf(1, 2) p_err(const char *fmt, ...);
>> +void __printf(1, 2) p_warn(const char *fmt, ...);
>>  void __printf(1, 2) p_info(const char *fmt, ...);
>>
>>  bool is_prefix(const char *pfx, const char *str);
>> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
>> index cfc9fdc1e863..d9363ba01ec0 100644
>> --- a/tools/bpf/bpftool/prog.c
>> +++ b/tools/bpf/bpftool/prog.c
>> @@ -2233,10 +2233,36 @@ static void profile_close_perf_events(struct profiler_bpf *obj)
>>  	profile_perf_event_cnt = 0;
>>  }
>>
>> +static int profile_open_perf_event(int mid, int cpu, int map_fd)
>> +{
>> +	int pmu_fd;
>> +
>> +	pmu_fd = syscall(__NR_perf_event_open, &metrics[mid].attr,
>> +			 -1/*pid*/, cpu, -1/*group_fd*/, 0);
>> +	if (pmu_fd < 0) {
>> +		if (errno == ENODEV) {
>> +			p_warn("cpu %d may be offline, skip %s metric profiling.",
>> +			       cpu, metrics[mid].name);
>
> Nit: I think it's fine to keep this at the info level (p_info()).
Ok

>> +			profile_perf_event_cnt++;
>> +			return 0;
>> +		}
>> +		return -1;
>> +	}
>> +
>> +	if (bpf_map_update_elem(map_fd,
>> +				&profile_perf_event_cnt,
>> +				&pmu_fd, BPF_ANY) ||
>> +	    ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0))
>> +		return -1;
>> +
>> +	profile_perf_events[profile_perf_event_cnt++] = pmu_fd;
>> +	return 0;
>> +}
>> +
>>  static int profile_open_perf_events(struct profiler_bpf *obj)
>>  {
>>  	unsigned int cpu, m;
>> -	int map_fd, pmu_fd;
>> +	int map_fd;
>>
>>  	profile_perf_events = calloc(
>>  		sizeof(int), obj->rodata->num_cpu * obj->rodata->num_metric);
>> @@ -2255,17 +2281,11 @@ static int profile_open_perf_events(struct profiler_bpf *obj)
>>  		if (!metrics[m].selected)
>>  			continue;
>>  		for (cpu = 0; cpu < obj->rodata->num_cpu; cpu++) {
>> -			pmu_fd = syscall(__NR_perf_event_open, &metrics[m].attr,
>> -					 -1/*pid*/, cpu, -1/*group_fd*/, 0);
>> -			if (pmu_fd < 0 ||
>> -			    bpf_map_update_elem(map_fd, &profile_perf_event_cnt,
>> -						&pmu_fd, BPF_ANY) ||
>> -			    ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0)) {
>> +			if (profile_open_perf_event(m, cpu, map_fd)) {
>>  				p_err("failed to create event %s on cpu %d",
>>  				      metrics[m].name, cpu);
>>  				return -1;
>>  			}
>> -			profile_perf_events[profile_perf_event_cnt++] = pmu_fd;
>>  		}
>>  	}
>>  	return 0;
>>
>> ----
>> Best Regards, Tonghao <tong@xxxxxxxxxxxxx>
>>
>
> I haven't tested this patch, but yes, looks like it should address the
> issue. Could you submit a proper v2, please?

Yes, I will send a v2 soon. I am on vacation, so sorry for the late reply.

> Quentin

----
Best Regards, Tonghao <tong@xxxxxxxxxxxxx>