RE:(2) [PATCH] dma-buf: system_heap: avoid reclaim for order 4

Jaewon Kim <jaewon31.kim@xxxxxxxxxxx> · Sun, 05 Feb 2023 00:02:15 +0900

> --------- Original Message ---------
> Sender : John Stultz <jstultz@xxxxxxxxxx>
> Date : 2023-01-26 14:04 (GMT+9)
> Title : Re: (2) [PATCH] dma-buf: system_heap: avoid reclaim for order 4
>
> On Wed, Jan 25, 2023 at 8:42 PM 김재원 <jaewon31.kim@xxxxxxxxxxx> wrote:
> > > On Wed, Jan 25, 2023 at 2:20 AM Jaewon Kim <jaewon31.kim@xxxxxxxxxxx> wrote:
> > > > > > On Tue, Jan 17, 2023 at 10:54 PM John Stultz <jstultz@xxxxxxxxxx> wrote:
> > > But because your change is different from what the old ion code did, I
> > > want to be a little cautious. So it would be nice to see some
> > > evaluation of not just the benefits the patch provides you but also of
> > > what negative impact it might have.  And so far you haven't provided
> > > any details there.
> > >
> > > A quick example might be for the use case where mid-order allocations
> > > are causing you trouble, you could see how the performance changes if
> > > you force all mid-order allocations to be single page allocations (so
> > > orders[] = {8, 0, 0};) and compare it with the current code when
> > > there's no memory pressure (right after reboot when pages haven't been
> > > fragmented) so the mid-order allocations will succeed.  That will let
> > > us know the potential downside if we have brief / transient pressure
> > > at allocation time that forces small pages.
> > >
> > > Does that make sense?
> >
> > Let me try this. It make take some days. But I guess it depends on memory
> > status as you said. If there were quite many order 4 pages, then 8 4 0
> > should be faster than 8 0 0.
> >
> > I don't know this is a right approach. In my opinion, except the specific
> > cases like right after reboot, there are not many order 4 pages. And
> > in determinisitic allocation time perspective, I think avoiding too long
> > allocations is more important than making faster with already existing
> > free order 4 pages.
>
> I suspect you are right, and do think your change will be helpful.
> But I just want to make sure we're doing some due diligence, instead
> of going on just gut instinct.
>
> Thanks so much for helping with this!
> -john
>

Hello John Stultz, sorry for the late reply.
I had to handle other urgent matters, and this test also took some time to finish.
Anyway, I hope you will be happy with the following test results.

1. system heap modification

To exclude the effect of allocating from the pool, every freed dma
buffer was returned to the buddy allocator instead of being kept in the pool.
I also added some trace_printk calls and order-counting logic.

2. the test tool

To measure the dma-buf system heap allocation speed, I prepared
a userspace test program that requests a buffer of a specified size from a heap.
With this program, I requested 10 MB sixteen times and
added one sleep between consecutive requests. No buffer was freed
until all 16 buffers had been allocated.

3. the test device

The test device has arm64 CPU cores and a v5.15-based kernel.
To get stable results, the CPU clock was fixed so that it would not change
at run time, and the test tool was pinned to specific CPU cores
running at the same clock.

4. test results

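As expected, when order-4 pages are available in the buddy allocator, the
order 8, 4, 0 allocation was 1 to 4 times faster than the order 8, 0, 0
allocation. But the order 8, 0, 0 allocation also looks fast enough.

Here are the time differences and the number of allocations at each order.
In each table, diff is the time taken by one 10 MB request, and the
columns 8, 4 and 0 count the allocations made at that order.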

order 8, 4, 0 when enough order-4 pages are free

         diff	8	4	0
     665 usec	0	160	0
   1,148 usec	0	160	0
   1,089 usec	0	160	0
   1,154 usec	0	160	0
   1,264 usec	0	160	0
   1,414 usec	0	160	0
     873 usec	0	160	0
   1,148 usec	0	160	0
   1,158 usec	0	160	0
   1,139 usec	0	160	0
   1,169 usec	0	160	0
   1,174 usec	0	160	0
   1,210 usec	0	160	0
     995 usec	0	160	0
   1,151 usec	0	160	0
     977 usec	0	160	0

order 8, 0, 0 when enough order-4 pages are free

         diff	8	4	0
     441 usec	10	0	0
     747 usec	10	0	0
   2,330 usec	2	0	2048
   2,469 usec	0	0	2560
   2,518 usec	0	0	2560
   1,176 usec	0	0	2560
   1,487 usec	0	0	2560
   1,402 usec	0	0	2560
   1,449 usec	0	0	2560
   1,330 usec	0	0	2560
   1,089 usec	0	0	2560
   1,481 usec	0	0	2560
   1,326 usec	0	0	2560
   3,057 usec	0	0	2560
   2,758 usec	0	0	2560
   3,271 usec	0	0	2560
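
As a sanity check on the order columns, every row should add up to the 10 MB request. The small sketch below shows that arithmetic. It assumes 4 KiB pages, which is what 2560 order-0 pages per 10 MB implies. The names pages_for and row_matches_request are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SIZE_BYTES 4096UL	/* assumption: 4 KiB pages */

/* Number of pages covered by `count` blocks of the given buddy order. */
static unsigned long pages_for(unsigned int order, unsigned long count)
{
	assert(order < 8 * sizeof(unsigned long));	/* keep the shift defined */
	return count << order;
}

/* True if a table row (counts for order 8, 4 and 0) adds up to req_bytes. */
static bool row_matches_request(unsigned long n8, unsigned long n4,
				unsigned long n0, unsigned long req_bytes)
{
	unsigned long pages = pages_for(8, n8) + pages_for(4, n4) +
			      pages_for(0, n0);

	return pages * PAGE_SIZE_BYTES == req_bytes;
}
```

For example, 160 order-4 blocks are 160 * 16 = 2560 pages, 10 order-8 blocks are 10 * 256 = 2560 pages, and the mixed row with 2 order-8 blocks and 2048 order-0 pages is 512 + 2048 = 2560 pages.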

From the perspective of responsiveness, I think deterministic
memory allocation speed is quite important. So I also
tested another case, where free memory is not sufficient.

In this test, I ran the set of 16 allocations twice
consecutively. The first set with order 8, 4, 0
became very slow and highly variable, but the second set became
faster because the high-order pages had already been created.
Each low-memory case below lists the first set, then the second set.

order 8, 4, 0 in low memory

         diff	8	4	0
     584 usec	0	160	0
  28,428 usec	0	160	0
 100,701 usec	0	160	0
  76,645 usec	0	160	0
  25,522 usec	0	160	0
  38,798 usec	0	160	0
  89,012 usec	0	160	0
  23,015 usec	0	160	0
  73,360 usec	0	160	0
  76,953 usec	0	160	0
  31,492 usec	0	160	0
  75,889 usec	0	160	0
  84,551 usec	0	160	0
  84,352 usec	0	160	0
  57,103 usec	0	160	0
  93,452 usec	0	160	0

         diff	8	4	0
     808 usec	10	0	0
     778 usec	4	96	0
     829 usec	0	160	0
     700 usec	0	160	0
     937 usec	0	160	0
     651 usec	0	160	0
     636 usec	0	160	0
     811 usec	0	160	0
     622 usec	0	160	0
     674 usec	0	160	0
     677 usec	0	160	0
     738 usec	0	160	0
   1,130 usec	0	160	0
     677 usec	0	160	0
     553 usec	0	160	0
   1,048 usec	0	160	0

order 8, 0, 0 in low memory

        diff	8	4	0
  1,699 usec	2	0	2048
  2,082 usec	0	0	2560
    840 usec	0	0	2560
    875 usec	0	0	2560
    845 usec	0	0	2560
  1,706 usec	0	0	2560
    967 usec	0	0	2560
  1,000 usec	0	0	2560
  1,905 usec	0	0	2560
  2,451 usec	0	0	2560
  3,384 usec	0	0	2560
  2,397 usec	0	0	2560
  3,171 usec	0	0	2560
  2,376 usec	0	0	2560
  3,347 usec	0	0	2560
  2,554 usec	0	0	2560

       diff	8	4	0
 1,409 usec	2	0	2048
 1,438 usec	0	0	2560
 1,035 usec	0	0	2560
 1,108 usec	0	0	2560
   825 usec	0	0	2560
   927 usec	0	0	2560
 1,931 usec	0	0	2560
 2,024 usec	0	0	2560
 1,884 usec	0	0	2560
 1,769 usec	0	0	2560
 2,136 usec	0	0	2560
 1,738 usec	0	0	2560
 1,328 usec	0	0	2560
 1,438 usec	0	0	2560
 1,972 usec	0	0	2560
 2,963 usec	0	0	2560
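
To compare how deterministic the two configurations are, the first low-memory sets above can be summarized by their minimum, maximum and sum. The sketch below does that. The helper name summarize and the struct lat_stats are made up for illustration, and the arrays are copied from the first-set tables above.

```c
#include <assert.h>
#include <stddef.h>

struct lat_stats {
	long min_usec;
	long max_usec;
	long sum_usec;
};

/* First low-memory set, order 8, 4, 0 (values from the table above). */
static const long low_mem_840_first[16] = {
	584, 28428, 100701, 76645, 25522, 38798, 89012, 23015,
	73360, 76953, 31492, 75889, 84551, 84352, 57103, 93452,
};

/* First low-memory set, order 8, 0, 0 (values from the table above). */
static const long low_mem_800_first[16] = {
	1699, 2082, 840, 875, 845, 1706, 967, 1000,
	1905, 2451, 3384, 2397, 3171, 2376, 3347, 2554,
};

/* Minimum, maximum and sum of n latency samples; n must be at least 1. */
static struct lat_stats summarize(const long *usec, size_t n)
{
	assert(n > 0);

	struct lat_stats s = { usec[0], usec[0], 0 };

	for (size_t i = 0; i < n; i++) {
		if (usec[i] < s.min_usec)
			s.min_usec = usec[i];
		if (usec[i] > s.max_usec)
			s.max_usec = usec[i];
		s.sum_usec += usec[i];
	}
	return s;
}
```

For the first set this gives a range of 584 to 100,701 usec for order 8, 4, 0, against 840 to 3,384 usec for order 8, 0, 0.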

Finally, if we change order 4 to use HIGH_ORDER_GFP,
I expect that we can avoid the very slow cases.

order 8, 4, 0 in low memory with HIGH_ORDER_GFP

          diff	8	4	0
 1,356 usec	0	155	80
 1,901 usec	0	11	2384
 1,912 usec	0	0	2560
 1,911 usec	0	0	2560
 1,884 usec	0	0	2560
 1,577 usec	0	0	2560
 1,366 usec	0	0	2560
 1,711 usec	0	0	2560
 1,635 usec	0	28	2112
   544 usec	10	0	0
   633 usec	2	128	0
   848 usec	0	160	0
   729 usec	0	160	0
 1,000 usec	0	160	0
 1,358 usec	0	160	0
 2,638 usec	0	31	2064

          diff	8	4	0
   669 usec	10	0	0
   789 usec	8	32	0
   603 usec	3	112	0
   578 usec	0	160	0
   562 usec	0	160	0
   564 usec	0	160	0
   686 usec	0	160	0
 1,621 usec	0	160	0
 2,080 usec	0	40	1920
 1,749 usec	0	0	2560
 2,244 usec	0	0	2560
 2,333 usec	0	0	2560
 1,257 usec	0	0	2560
 1,703 usec	0	0	2560
 1,782 usec	0	1	2544
 2,225 usec	0	0	2560
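
The mixed rows above, such as 0/155/80, are what a largest-first fallback produces when order-4 attempts fail quickly instead of reclaiming. The userspace model below only illustrates that behaviour. It is not the kernel code. struct mock_buddy, mock_alloc and fill_buffer are hypothetical names, the model treats "no free block" as the fast failure, and it ignores that the real buddy allocator can split larger blocks. The largest-first loop with a shrinking max_order follows my understanding of the system heap, which is an assumption here.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_ORDERS 3
static const unsigned int orders[NUM_ORDERS] = {8, 4, 0};

/* Hypothetical model: number of free blocks for each entry of orders[]. */
struct mock_buddy {
	unsigned long free_blocks[NUM_ORDERS];
};

/* Take one block of orders[idx] if one is free; never reclaims or splits. */
static int mock_alloc(struct mock_buddy *b, size_t idx)
{
	assert(idx < NUM_ORDERS);
	if (b->free_blocks[idx] == 0)
		return 0;
	b->free_blocks[idx]--;
	return 1;
}

/*
 * Fill a buffer of `pages` pages, trying the largest order first and never
 * going back to an order larger than the last one that succeeded.
 * used[i] receives the number of blocks taken at orders[i].
 * Returns 0 on success, -1 if the model runs out of memory.
 */
static int fill_buffer(struct mock_buddy *b, unsigned long pages,
		       unsigned long used[NUM_ORDERS])
{
	unsigned int max_order = orders[0];

	for (size_t i = 0; i < NUM_ORDERS; i++)
		used[i] = 0;

	while (pages > 0) {
		size_t got = NUM_ORDERS;

		for (size_t i = 0; i < NUM_ORDERS; i++) {
			if (pages < (1UL << orders[i]))
				continue;	/* block larger than what is left */
			if (max_order < orders[i])
				continue;	/* do not retry larger orders */
			if (!mock_alloc(b, i))
				continue;	/* fast failure, fall back */
			got = i;
			break;
		}
		if (got == NUM_ORDERS)
			return -1;

		used[got]++;
		pages -= 1UL << orders[got];
		max_order = orders[got];
	}
	return 0;
}
```

With no free order-8 blocks, 155 free order-4 blocks and plenty of order-0 pages, a 2560-page request in this model ends up as 155 order-4 blocks plus 80 order-0 pages, the same shape as the first row of the first HIGH_ORDER_GFP table.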

Thank you
Jaewon Kim