> On Nov 8, 2019, at 11:34 AM, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote: > > On Thu, Nov 7, 2019 at 10:39 PM Song Liu <songliubraving@xxxxxx> wrote: >> >> >> >>> On Nov 7, 2019, at 8:20 PM, Andrii Nakryiko <andriin@xxxxxx> wrote: >>> >>> Add ability to memory-map contents of BPF array map. This is extremely useful >>> for working with BPF global data from userspace programs. It allows to avoid >>> typical bpf_map_{lookup,update}_elem operations, improving both performance >>> and usability. >>> >>> There had to be special considerations for map freezing, to avoid having >>> writable memory view into a frozen map. To solve this issue, map freezing and >>> mmap-ing is happening under mutex now: >>> - if map is already frozen, no writable mapping is allowed; >>> - if map has writable memory mappings active (accounted in map->writecnt), >>> map freezing will keep failing with -EBUSY; >>> - once number of writable memory mappings drops to zero, map freezing can be >>> performed again. >>> >>> Only non-per-CPU arrays are supported right now. Maps with spinlocks can't be >>> memory mapped either. >>> >>> Cc: Rik van Riel <riel@xxxxxxxxxxx> >>> Cc: Johannes Weiner <hannes@xxxxxxxxxxx> >>> Signed-off-by: Andrii Nakryiko <andriin@xxxxxx> >> >> Acked-by: Song Liu <songliubraving@xxxxxx> >> >> With one nit below. >> >> >> [...] >> >>> - if (percpu) >>> + data_size = 0; >>> + if (percpu) { >>> array_size += (u64) max_entries * sizeof(void *); >>> - else >>> - array_size += (u64) max_entries * elem_size; >> >>> + } else { >>> + if (attr->map_flags & BPF_F_MMAPABLE) { >>> + data_size = (u64) max_entries * elem_size; >>> + data_size = round_up(data_size, PAGE_SIZE); >>> + } else { >>> + array_size += (u64) max_entries * elem_size; >>> + } >>> + } >>> >>> /* make sure there is no u32 overflow later in round_up() */ >>> - cost = array_size; >>> + cost = array_size + data_size; >> >> >> >> This is a little confusing. Maybe we can do >> > > I don't think I can do that without even bigger code churn. In > non-mmap()-able case, array_size specifies the size of one chunk of > memory, which consists of sizeof(struct bpf_array) bytes, followed by > actual data. This is accomplished in one allocation. That's current > case for arrays. > > For BPF_F_MMAPABLE case, though, we have to do 2 separate allocations, > to make sure that mmap()-able part is allocated with vmalloc() and is > page-aligned. So array_size keeps track of number of bytes allocated > for struct bpf_array plus, optionally, per-cpu or non-mmapable array > data, while data_size is explicitly for vmalloc()-ed mmap()-able chunk > of data. If not for this, I'd just keep adjusting array_size. > > So the invariant for per-cpu and non-mmapable case is that data_size = > 0, array_size = sizeof(struct bpf_array) + whatever amount of data we > need. For mmapable case: array_size = sizeof(struct bpf_array), > data_size = actual amount of array data. I see. Thanks for the explanation. Song