On 11/15/19 3:44 PM, Daniel Borkmann wrote: > On 11/16/19 12:37 AM, Alexei Starovoitov wrote: >> On 11/15/19 3:31 PM, Daniel Borkmann wrote: >>> On 11/15/19 5:02 AM, Andrii Nakryiko wrote: >>>> Add ability to memory-map contents of BPF array map. This is extremely >>>> useful >>>> for working with BPF global data from userspace programs. It allows to >>>> avoid >>>> typical bpf_map_{lookup,update}_elem operations, improving both >>>> performance >>>> and usability. >>>> >>>> There had to be special considerations for map freezing, to avoid >>>> having >>>> writable memory view into a frozen map. To solve this issue, map >>>> freezing and >>>> mmap-ing is happening under mutex now: >>>> - if map is already frozen, no writable mapping is allowed; >>>> - if map has writable memory mappings active (accounted in >>>> map->writecnt), >>>> map freezing will keep failing with -EBUSY; >>>> - once number of writable memory mappings drops to zero, map >>>> freezing can be >>>> performed again. >>>> >>>> Only non-per-CPU plain arrays are supported right now. Maps with >>>> spinlocks >>>> can't be memory mapped either. >>>> >>>> For BPF_F_MMAPABLE array, memory allocation has to be done through >>>> vmalloc() >>>> to be mmap()'able. We also need to make sure that array data memory is >>>> page-sized and page-aligned, so we over-allocate memory in such a way >>>> that >>>> struct bpf_array is at the end of a single page of memory with >>>> array->value >>>> being aligned with the start of the second page. On deallocation we >>>> need to >>>> accomodate this memory arrangement to free vmalloc()'ed memory >>>> correctly. >>>> >>>> One important consideration regarding how memory-mapping subsystem >>>> functions. >>>> Memory-mapping subsystem provides few optional callbacks, among them >>>> open() >>>> and close(). close() is called for each memory region that is >>>> unmapped, so >>>> that users can decrease their reference counters and free up >>>> resources, if >>>> necessary. open() is *almost* symmetrical: it's called for each memory >>>> region >>>> that is being mapped, **except** the very first one. So bpf_map_mmap >>>> does >>>> initial refcnt bump, while open() will do any extra ones after that. >>>> Thus >>>> number of close() calls is equal to number of open() calls plus one >>>> more. >>>> >>>> Cc: Johannes Weiner <hannes@xxxxxxxxxxx> >>>> Cc: Rik van Riel <riel@xxxxxxxxxxx> >>>> Acked-by: Song Liu <songliubraving@xxxxxx> >>>> Acked-by: John Fastabend <john.fastabend@xxxxxxxxx> >>>> Signed-off-by: Andrii Nakryiko <andriin@xxxxxx> >>> >>> [...] >>>> +/* called for any extra memory-mapped regions (except initial) */ >>>> +static void bpf_map_mmap_open(struct vm_area_struct *vma) >>>> +{ >>>> + struct bpf_map *map = vma->vm_file->private_data; >>>> + >>>> + bpf_map_inc(map); >>> >>> This would also need to inc uref counter since it's technically a >>> reference >>> of this map into user space as otherwise if map->ops->map_release_uref >>> would >>> be used for maps supporting mmap, then the callback would trigger even >>> if user >>> space still has a reference to it. >> >> I thought we use uref only for array that can hold FDs ? >> That's why I suggested Andrii earlier to drop uref++. > > Yeah, only for fd array currently. Question is, if we ever reuse that > map_release_uref > callback in future for something else, will we remember that we earlier > missed to add > it here? :/ What do you mean 'missed to add' ? This is mmap path. Anything that needs releasing (like FDs for prog_array or progs for sockmap) cannot be mmap-able.