> Here's my quick thoughts at this late hour, though I might have
> something else tomorrow. If the links are external to the dma addr
> being freed, then I think you need to change the entire alloc/free API
> to replace the dma_addr_t handle with a new type, like a tuple of
>
> { dma_addr_t, priv_list_link }
>
> That should make it possible to preserve the low constant time to alloc
> and free in the critical section, which I think is a good thing to have.
> I found 170 uses of dma_pool_alloc, and 360 dma_pool_free in the kernel,
> so changing the API is no small task. :(

Indeed, an API change like this might be the only way to move the metadata out of the DMA blocks while maintaining the allocator's current performance.

Regarding the performance hit of the submitted patches:

- *Allocations* remain constant time (`O(1)`): they simply check the `pool->next_block` pointer.
- *Frees* are no longer constant time. Previously, `dma_pool_free()` converted a `vaddr` to its corresponding `block` with a type cast; now it calls `pool_find_block()`, which iterates over all pages (`O(n)`).

Optimizing `dma_pool_free()` is therefore the key. While restructuring the API is the "best" solution, I implemented a compromise: a `struct maple_tree block_map` field in `struct dma_pool` that maps each `vaddr` to its corresponding `block`. A maple tree is not constant time, but its `O(log n)` lookups are a significant improvement over the current `O(n)` iteration.

Here are the performance results. I've already reported the first two sets of numbers, but I'll repeat them here for easy comparison:

**With no patches applied:**
```
dmapool test: size:16 align:16 blocks:8192 time:11860
dmapool test: size:64 align:64 blocks:8192 time:11951
dmapool test: size:256 align:256 blocks:8192 time:12287
dmapool test: size:1024 align:1024 blocks:2048 time:3134
dmapool test: size:4096 align:4096 blocks:1024 time:1686
dmapool test: size:68 align:32 blocks:8192 time:12050
```

**With the submitted patches applied:**
```
dmapool test: size:16 align:16 blocks:8192 time:34432
dmapool test: size:64 align:64 blocks:8192 time:62262
dmapool test: size:256 align:256 blocks:8192 time:238137
dmapool test: size:1024 align:1024 blocks:2048 time:61386
dmapool test: size:4096 align:4096 blocks:1024 time:75342
dmapool test: size:68 align:32 blocks:8192 time:88243
```

**With the submitted patches applied AND a maple tree for vaddr-to-block translations:**
```
dmapool test: size:16 align:16 blocks:8192 time:43668
dmapool test: size:64 align:64 blocks:8192 time:44746
dmapool test: size:256 align:256 blocks:8192 time:45434
dmapool test: size:1024 align:1024 blocks:2048 time:11013
dmapool test: size:4096 align:4096 blocks:1024 time:5250
dmapool test: size:68 align:32 blocks:8192 time:45900
```

The maple tree reduces the performance hit for every block size except the smallest (16 bytes), and especially for the larger ones. While performance is not fully back to baseline, this strikes a reasonable trade-off between protection, runtime performance, and ease of deployment (i.e., no API change required). To make the comparison concrete, rough sketches of both approaches follow below.

If this approach looks acceptable, I can submit it as a V3 patch series for further review and discussion.
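First, the handle-based API. This sketch is only my reading of the suggestion quoted above; `struct dma_pool_handle` and the changed signatures are hypothetical, not an existing kernel API:

```c
/*
 * Hypothetical sketch of the suggested tuple handle. It bundles the
 * dma_addr_t with a pool-private link back to the block's metadata,
 * so that both alloc and free stay O(1) even with the metadata stored
 * outside the DMA-able memory itself.
 */
struct dma_pool_handle {
	dma_addr_t	dma;		 /* address the caller programs into HW */
	void		*priv_list_link; /* pool-private; opaque to callers */
};

/* Callers would pass the whole handle back on free, so the pool can
 * reach the block's metadata directly instead of searching for it: */
void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
		     struct dma_pool_handle *handle);
void dma_pool_free(struct dma_pool *pool, void *vaddr,
		   struct dma_pool_handle *handle);
```

Converting all ~530 existing call sites to carry such a handle is what makes this change so invasive.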
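Second, the maple-tree compromise. This is a simplified sketch rather than the patch itself; the helper names are illustrative, and gfp handling, locking, and error paths are omitted:

```c
#include <linux/maple_tree.h>

struct dma_pool {
	/* ... existing fields ... */
	struct maple_tree block_map;	/* block vaddr -> its metadata */
};

/* dma_pool_create() would initialize the tree with
 * mt_init(&pool->block_map), and dma_pool_destroy() would tear it
 * down with mtree_destroy(&pool->block_map). */

/* Record a block when a new page is carved up. mtree_store() may
 * allocate tree nodes, so it takes gfp flags and can fail. */
static int pool_map_block(struct dma_pool *pool, void *vaddr,
			  struct dma_block *block, gfp_t gfp)
{
	return mtree_store(&pool->block_map, (unsigned long)vaddr,
			   block, gfp);
}

/* O(log n) lookup that replaces the O(n) pool_find_block() page walk
 * in dma_pool_free(). */
static struct dma_block *pool_lookup_block(struct dma_pool *pool,
					   void *vaddr)
{
	return mtree_load(&pool->block_map, (unsigned long)vaddr);
}
```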
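Note that in this sketch the tree is only written when a new page is carved into blocks, where gfp flags are available; the free path only does a lookup and never allocates.

Thanks,
Brian Johannesmeyer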