> >--------- Original Message ---------
> >Sender : John Stultz <jstultz@xxxxxxxxxx>
> >Date : 2023-02-07 13:37 (GMT+9)
> >Title : Re: (2) [PATCH] dma-buf: system_heap: avoid reclaim for order 4
> >
> >On Sat, Feb 4, 2023 at 7:02 AM Jaewon Kim <jaewon31.kim@xxxxxxxxxxx> wrote:
> >> Hello John Stultz, sorry for the late reply.
> >> I had to handle other urgent things, and this test also took some time
> >> to finish. Anyway, I hope you are happy with the following test results.
> >>
> >>
> >> 1. system heap modification
> >>
> >> To rule out any effect from allocating out of the pool, every freed dma
> >> buffer was passed straight back to the buddy allocator instead of being
> >> kept in the pool. Some trace_printk and order-counting logic were also
> >> added.
> >>
> >> 2. the test tool
> >>
> >> To measure dma-buf system heap allocation speed, I prepared a userspace
> >> test program which requests a specified size from a heap. With this
> >> program, I requested 10 MB sixteen times, with a sleep between each
> >> request. No buffer was freed until all sixteen allocations had completed.
> >
> >Oof. I really appreciate all the effort that I'm sure went into
> >generating these numbers, but this wasn't quite what I was asking for.
> >I know you've been focused on allocation performance under memory
> >pressure, but I was hoping to see the impact of __using__ order 0
> >pages over order 4 pages in real world conditions (for camera or video
> >recording or other use cases that use large allocations). These
> >results still seem focused on the difference in allocation
> >performance between order 0 and order 4 with and without contention.
> >
> >That said, re-reading my email, I probably could have been clearer
> >about this aspect.
> >
> >
> >> 3. the test device
> >>
> >> The test device has arm64 CPU cores and a v5.15-based kernel. To get
> >> stable results, the CPU clock was pinned so it could not change at run
> >> time, and the test tool was bound to specific CPU cores running at that
> >> fixed clock.
> >>
> >> 4. test results
> >>
> >> As expected, when enough order 4 pages exist in the buddy, the
> >> order 8, 4, 0 allocation was 1 to 4 times faster than order 8, 0, 0.
> >> But order 8, 0, 0 also looks fast enough.
> >>
> >> Here are the time diffs and the number of pages allocated at each order.
> >>
> >> order 8, 4, 0 in the enough order 4 case
> >>
> >>   diff (usec)      8      4      0
> >>           665      0    160      0
> >>         1,148      0    160      0
> >>         1,089      0    160      0
> >>         1,154      0    160      0
> >>         1,264      0    160      0
> >>         1,414      0    160      0
> >>           873      0    160      0
> >>         1,148      0    160      0
> >>         1,158      0    160      0
> >>         1,139      0    160      0
> >>         1,169      0    160      0
> >>         1,174      0    160      0
> >>         1,210      0    160      0
> >>           995      0    160      0
> >>         1,151      0    160      0
> >>           977      0    160      0
> >>
> >> order 8, 0, 0 in the enough order 4 case
> >>
> >>   diff (usec)      8      4      0
> >>           441     10      0      0
> >>           747     10      0      0
> >>         2,330      2      0   2048
> >>         2,469      0      0   2560
> >>         2,518      0      0   2560
> >>         1,176      0      0   2560
> >>         1,487      0      0   2560
> >>         1,402      0      0   2560
> >>         1,449      0      0   2560
> >>         1,330      0      0   2560
> >>         1,089      0      0   2560
> >>         1,481      0      0   2560
> >>         1,326      0      0   2560
> >>         3,057      0      0   2560
> >>         2,758      0      0   2560
> >>         3,271      0      0   2560
> >>
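(For reference, the kind of userspace tool described in item 2 above can be
written directly against the dma-buf heap uapi. The sketch below is only an
approximation under that assumption: the heap path, request size, sleep
duration, and timing logic are illustrative, and it is not the actual program
that produced the numbers in this thread.)

	/*
	 * Minimal sketch of a dma-buf heap allocation timing tool.
	 * Allocates 10 MB sixteen times from the system heap, keeps every
	 * buffer until the end, and prints the per-allocation latency.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <time.h>
	#include <unistd.h>
	#include <linux/dma-heap.h>

	#define NR_ALLOCS  16
	#define ALLOC_SIZE (10UL * 1024 * 1024)	/* 10 MB per request */

	int main(void)
	{
		struct dma_heap_allocation_data data;
		int heap_fd, buf_fds[NR_ALLOCS];
		struct timespec t0, t1;
		long diff_us;
		int i;

		heap_fd = open("/dev/dma_heap/system", O_RDONLY | O_CLOEXEC);
		if (heap_fd < 0) {
			perror("open");
			return 1;
		}

		for (i = 0; i < NR_ALLOCS; i++) {
			memset(&data, 0, sizeof(data));
			data.len = ALLOC_SIZE;
			data.fd_flags = O_RDWR | O_CLOEXEC;

			clock_gettime(CLOCK_MONOTONIC, &t0);
			if (ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data) < 0) {
				perror("DMA_HEAP_IOCTL_ALLOC");
				return 1;
			}
			clock_gettime(CLOCK_MONOTONIC, &t1);

			buf_fds[i] = data.fd;	/* keep every buffer until the end */
			diff_us = (t1.tv_sec - t0.tv_sec) * 1000000 +
				  (t1.tv_nsec - t0.tv_nsec) / 1000;
			printf("alloc %2d: %ld usec\n", i, diff_us);

			sleep(1);	/* pause between requests; exact delay assumed */
		}

		for (i = 0; i < NR_ALLOCS; i++)
			close(buf_fds[i]);
		close(heap_fd);
		return 0;
	}

Run against /dev/dma_heap/system, this prints per-allocation latencies
comparable to the "diff" columns in the tables above.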
> >> From the perspective of responsiveness, deterministic memory
> >> allocation speed is, I think, quite important. So I also tested a
> >> case where free memory is not sufficient.
> >>
> >> In this test, I ran the sixteen-allocation set twice in a row. The
> >> first order 8, 4, 0 run became very slow and varied widely, but the
> >> second run became faster because the high-order pages had already
> >> been created.
> >>
> >> order 8, 4, 0 in low memory
> >>
> >>   diff (usec)      8      4      0
> >>           584      0    160      0
> >>        28,428      0    160      0
> >>       100,701      0    160      0
> >>        76,645      0    160      0
> >>        25,522      0    160      0
> >>        38,798      0    160      0
> >>        89,012      0    160      0
> >>        23,015      0    160      0
> >>        73,360      0    160      0
> >>        76,953      0    160      0
> >>        31,492      0    160      0
> >>        75,889      0    160      0
> >>        84,551      0    160      0
> >>        84,352      0    160      0
> >>        57,103      0    160      0
> >>        93,452      0    160      0
> >>
> >>   diff (usec)      8      4      0
> >>           808     10      0      0
> >>           778      4     96      0
> >>           829      0    160      0
> >>           700      0    160      0
> >>           937      0    160      0
> >>           651      0    160      0
> >>           636      0    160      0
> >>           811      0    160      0
> >>           622      0    160      0
> >>           674      0    160      0
> >>           677      0    160      0
> >>           738      0    160      0
> >>         1,130      0    160      0
> >>           677      0    160      0
> >>           553      0    160      0
> >>         1,048      0    160      0
> >>
> >>
> >> order 8, 0, 0 in low memory
> >>
> >>   diff (usec)      8      4      0
> >>         1,699      2      0   2048
> >>         2,082      0      0   2560
> >>           840      0      0   2560
> >>           875      0      0   2560
> >>           845      0      0   2560
> >>         1,706      0      0   2560
> >>           967      0      0   2560
> >>         1,000      0      0   2560
> >>         1,905      0      0   2560
> >>         2,451      0      0   2560
> >>         3,384      0      0   2560
> >>         2,397      0      0   2560
> >>         3,171      0      0   2560
> >>         2,376      0      0   2560
> >>         3,347      0      0   2560
> >>         2,554      0      0   2560
> >>
> >>   diff (usec)      8      4      0
> >>         1,409      2      0   2048
> >>         1,438      0      0   2560
> >>         1,035      0      0   2560
> >>         1,108      0      0   2560
> >>           825      0      0   2560
> >>           927      0      0   2560
> >>         1,931      0      0   2560
> >>         2,024      0      0   2560
> >>         1,884      0      0   2560
> >>         1,769      0      0   2560
> >>         2,136      0      0   2560
> >>         1,738      0      0   2560
> >>         1,328      0      0   2560
> >>         1,438      0      0   2560
> >>         1,972      0      0   2560
> >>         2,963      0      0   2560
> >
> >So, thank you for generating all of this. I think it all looks as
> >expected, showing the benefit of your change to allocation under
> >contention and showing the potential downside in the non-contention
> >case.
> >
> >I still worry about the performance impact outside of allocation time
> >of using the smaller order pages (via map and unmap through iommu to
> >devices, etc.), so it would still be nice to have some confidence this
> >won't introduce other regressions, but I do agree the worst-case
> >impact is very bad.
> >
> >> Finally, if we change order 4 to use HIGH_ORDER_GFP, I expect that
> >> we could avoid the very slow cases.
> >>
> >
> >Yeah. Again, this all aligns with the upside of your changes, which
> >I'm eager for. I just want to have a strong sense of any regressions
> >they might also cause.
> >
> >I don't mean to discourage you, especially after all the effort here.
> >Do you think evaluating the before and after impact on buffer usage
> >(not just allocation) would be doable in the near term?
>
> Hello, sorry, but I don't have expertise on IOMMUs. Actually, I'm also
> wondering whether every IOMMU can make use of order 4 pages if they are
> allocated. I'm not sure, but I remember hearing that order 9 (2MB) could
> be used; I don't know about order 8 or order 4. I guess IOMMU mapping
> would follow the same pattern we expect here: if order 4 pages are
> prepared, it could be 1 to 4 times faster, but I think it should NOT be
> that much slower even when the entire free memory consists of order 0
> pages.
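(For context, the proposal mentioned above amounts to changing which gfp
mask the system heap uses for its order 4 allocations. The fragment below is
a rough sketch modeled on the order/gfp tables in
drivers/dma-buf/heaps/system_heap.c; the flag definitions are approximate,
and the "after" line only illustrates the HIGH_ORDER_GFP idea discussed in
this thread, not necessarily the final patch.)

	#include <linux/gfp.h>

	/*
	 * Approximate versions of the existing definitions in
	 * drivers/dma-buf/heaps/system_heap.c (illustrative only).
	 */
	#define LOW_ORDER_GFP	(GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)
	#define HIGH_ORDER_GFP	(((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
				  | __GFP_NORETRY) & ~__GFP_RECLAIM) \
				  | __GFP_COMP)

	/*
	 * before: order 4 shares LOW_ORDER_GFP with order 0 and may enter
	 * direct reclaim under memory pressure:
	 *
	 *   static gfp_t order_flags[] = {HIGH_ORDER_GFP, LOW_ORDER_GFP, LOW_ORDER_GFP};
	 *
	 * after (the idea discussed above): order 4 gives up without
	 * reclaiming and falls back to order 0 instead:
	 */
	static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
	static const unsigned int orders[] = {8, 4, 0};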
>
> >If you don't think so, given that the benefit to allocation under
> >pressure is large (and I don't mean to give you hurdles to jump), I'm
> >willing to ack your change to get it merged, but if we later see
> >performance trouble, I'll be quick to advocate for reverting it. Is
> >that ok?
>
> Yes, sure. I also want to know if it does. Thank you.
>
> >thanks
> >-john