David Hildenbrand <david@xxxxxxxxxx> writes:

>>> If we could avoid instantiating more zones and rather improve existing
>>> mechanisms (PCP), that would be much more preferred IMHO. I'm sure
>>> it's not easy, but that shouldn't stop us from trying ;)
>> I do think improving PCP or adding another level of cache will help
>> performance and scalability.
>> And, I think that it has value too to improve the performance of zone
>> itself. Because there will always be some cases that the zone lock
>> itself is contended.
>> That is, PCP and zone work at different levels, and both deserve to be
>> improved. Do you agree?
>
> Spoiler: my humble opinion
>
>
> Well, the zone is kind-of your "global" memory provider, and PCPs
> cache a fraction of that to avoid exactly having to mess with that
> global datastructure and lock contention.
>
> One benefit I can see of such a "global" memory provider with caches
> on top is that it is nicely integrated: for example, the concept of
> memory pressure exists for the zone as a whole. All memory is of the
> same kind and managed in a single entity, but free memory is cached
> for performance.
>
> As soon as you manage the memory in multiple zones of the same kind,
> you lose that "global" view of your memory that is of the same kind,
> but managed in different buckets. You might end up with a lot of memory
> pressure in a single such zone, but still have plenty in another zone.
>
> As one example, hot(un)plug of memory is easy: there is only a single
> zone. No need to make smart decisions or deal with having memory we're
> hotunplugging be stranded in multiple zones.

I understand that there are some unresolved issues for splitting zone.
I will think more about them and the possible solutions.

>>
>>> I did not look into the details of this proposal, but seeing the
>>> change in include/linux/page-flags-layout.h scares me.
>> It's possible for us to use 1 more bit in page->flags. Do you think
>> that will cause severe issue?
>> Or you think some other stuff isn't acceptable?
>
> The issue is, everybody wants to consume more bits in page->flags, so
> if we can get away without it that would be much better :)

Yes.

> The more bits you want to consume, the more people will ask for making
> this a compile-time option and eventually compile it out on distro
> kernels (e.g., with many NUMA nodes). So we end up with more code and
> complexity and eventually not get the benefits where we really want
> them.

That's possible. Although I think we will still use more page flags when
necessary.

>>
>>> Further, I'm not so sure how that change really interacts with
>>> hot(un)plug of memory ... on a quick glimpse I feel like this series
>>> hacks the code such that the split works based on the boot
>>> memory size ...
>> Em..., the zone stuff is kind of static now. It's hard to add a zone at
>> run-time. So, in this series, we determine the number of zones per zone
>> type based on boot memory size. This may be improved in the future by
>> pre-allocating some empty zone instances during boot and hot-adding some
>> memory to these zones.
>
> Just to give you some idea: with virtio-mem, hyper-v, daxctl, and
> upcoming cxl dynamic memory pooling (some day I'm sure ;) ) you might
> see quite a small boot memory (e.g., 4 GiB) but a significant amount
> of memory getting hotplugged incrementally (e.g., up to 1 TiB) --
> well, and hotunplugged. With multiple zone instances you really have
> to be careful and might have to re-balance between the multiple zones
> to keep the scalability, to not create imbalances between the zones
> ...

Thanks for your information!

> Something like PCP auto-tuning would be able to handle that mostly
> automatically, as there is only a single memory pool.

I agree that optimizing PCP will help performance regardless of splitting
zone or not.

>>
>>> I agree with Michal that looking into auto-tuning PCP would be
>>> preferred.
>>> If that can't be done, adding another layer might end up
>>> cleaner and eventually cover more use cases.
>> I do agree that it's valuable to make PCP etc. cover more use cases. I
>> just think that this should not prevent us from optimizing zone itself
>> to cover remaining use cases.
>
> I really don't like the concept of replicating zones of the same kind
> for the same NUMA node. But that's just my personal opinion
> maintaining some memory hot(un)plug code :)
>
> Having that said, some kind of a sub-zone concept (additional layer)
> as outlined by Michal IIUC, for example, indexed by core
> id/hash/whatsoever could eventually be worth exploring. Yes, such a
> design raises various questions ... :)

Yes. That's another possible solution for the page allocation scalability
problem.

Best Regards,
Huang, Ying