On Tue, Aug 04, 2020 at 11:12:55AM +0200, Vlastimil Babka wrote:
> On 8/4/20 4:35 AM, Cho KyongHo wrote:
> > On Mon, Aug 03, 2020 at 05:45:55PM +0200, Vlastimil Babka wrote:
> >> On 8/3/20 9:57 AM, David Hildenbrand wrote:
> >> > On 03.08.20 08:10, pullip.cho@xxxxxxxxxxx wrote:
> >> >> From: Cho KyongHo <pullip.cho@xxxxxxxxxxx>
> >> >>
> >> >> LPDDR5 introduces a rank switch delay. If three successive DRAM accesses
> >> >> happen and the first and the second ones access one rank while the last
> >> >> access happens on the other rank, the latency of the last access will
> >> >> be longer than that of the second one.
> >> >> To address this penalty, we can sort the freelist so that a specific
> >> >> rank is allocated prior to another rank. We expect the page allocator
> >> >> can allocate pages from the same rank successively with this
> >> >> change. It will hopefully improve the proportion of consecutive
> >> >> memory accesses to the same rank.
> >> >
> >> > This certainly needs performance numbers to justify ... and I am sorry,
> >> > "hopefully improves" is not a valid justification :)
> >> >
> >> > I can imagine that this works well initially, when there hasn't been a
> >> > lot of memory fragmentation going on. But quickly after your system is
> >> > under stress, I doubt this will be very useful. Prove me wrong. ;)
> >>
> >> Agreed. The implementation of __preferred_rank() seems to be very simple and
> >> optimistic.
> >
> > The DRAM rank is selected by the CS bits from the DRAM controller. In most
> > systems the CS bits are allocated to specific bit fields of the bus address.
> > For example, if the CS bit is allocated to bit[16] of the bus (physical)
> > address in a two-rank system, all 16KiB blocks with bit[16] = 1 are in rank 1
> > and the others are in rank 0.
> > This patch is not beneficial to systems other than mobile devices with
> > LPDDR5. That is why the default behavior of this patch is a no-op.
>
> Hmm, the patch requires at least pageblock_nr_pages, which is 2MB on x86 (dunno
> about ARM), so 16KiB would be way too small. What are the actual granularities
> then?

16KiB is just an example. pageblock_nr_pages corresponds to 4MB on both ARM and
ARM64, and __preferred_rank() works if the rank granule is >= 4MB.

> >> I think these systems could perhaps better behave as NUMA with (interleaved)
> >> nodes for each rank, then you immediately have all the mempolicies support etc.
> >> to achieve what you need? Of course there's some cost as well, but not the costs
> >> of adding hacks to page allocator core?
> >
> > Thank you for the proposal. NUMA would be helpful for allocating pages from
> > a specific rank programmatically. I should consider NUMA if rank affinity is
> > also required.
> > However, the page allocation overhead added by such a policy (page migration,
> > reclaim, etc.) will give users worse responsiveness. The intent of this patch
> > is to reduce the rank switch delay optimistically without hurting page
> > allocation speed.
>
> The problem is, without some control of page migration and reclaim, the simple
> preference approach will not work after some uptime, as David suggested. It will
> just mean that the preferred rank will be allocated first, then the
> non-preferred rank (Linux will fill all unused memory with page cache if
> possible), then reclaim will free memory from both ranks without any special
> care, and new allocations will thus come from both ranks.
>

In fact, I didn't consider NUMA in that way. I now need to check whether NUMA
gives us the same result as this patch.
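
(To make the CS-bit example above a bit more concrete: the rank of a page is
conceptually just a bit test on its physical address, roughly like the sketch
below. This is only an illustration, not the exact code in the patch;
DRAM_RANK_CS_SHIFT and the helper names are made up here, and the bit position
is SoC specific. On a real target the CS bit of course has to sit above the
pageblock boundary so that a whole pageblock lives on a single rank.)

#include <linux/mm.h>
#include <linux/pfn.h>

/* Example only: position of the chip-select bit in the physical address. */
#define DRAM_RANK_CS_SHIFT	16

/* Returns 0 or 1 depending on which rank backs this page. */
static inline unsigned int page_to_rank(struct page *page)
{
	phys_addr_t pa = PFN_PHYS(page_to_pfn(page));

	return (pa >> DRAM_RANK_CS_SHIFT) & 1;
}

/* A page is "preferred" when it sits on the rank we want to drain first. */
static inline bool page_on_preferred_rank(struct page *page,
					  unsigned int preferred_rank)
{
	return page_to_rank(page) == preferred_rank;
}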
Thank you again for your comments about NUMA :)
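
For reference, the first comparison I plan to try is roughly the userspace test
below. It assumes each rank were exposed as its own NUMA node (node 0 == rank 0),
which is purely an assumption on my side, not something the kernel does today,
and it uses libnuma (build with -lnuma):

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support on this kernel\n");
		return 1;
	}

	/* Assumption: node 0 represents the DRAM rank we want to prefer. */
	numa_set_preferred(0);

	/* Fault in 64MB so pages are actually allocated under the policy. */
	size_t len = 64UL << 20;
	char *buf = malloc(len);
	if (!buf)
		return 1;
	memset(buf, 0, len);

	/* Per-node placement can then be checked in /proc/self/numa_maps. */
	free(buf);
	return 0;
}

Whether this keeps allocations on one rank after some uptime is exactly the
reclaim / page cache question you raise above, so that is what I will measure.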