Re: dax alignment problem on arm64 (and other architectures)

David Hildenbrand <david@xxxxxxxxxx> · Thu, 28 Jan 2021 16:03:07 +0100

One common issue is that firmware can allocate from available
system RAM and/or modify/initialize it. I assume you're running some
custom firmware :)

We have a special firmware that does not touch the last 2G of physical
memory for its allocations :)

Fancy :)

[...]

Personally, I think the future is 4k, especially for smaller machines.
(also, imagine right now how many 512MB THP you can actually use in your
8GB VM ..., simply not suitable for small machines).
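The 512MB figure follows from the page-table geometry: a PMD-level huge page maps one full last-level table of PTEs. The sketch below assumes 8-byte page-table entries (as on arm64) and only does the arithmetic; it is not kernel code.

```python
PTE_SIZE = 8  # assumed bytes per page-table entry (64-bit arm64)

def pmd_huge_page_size(base_page_size: int) -> int:
    """Bytes mapped by one PMD entry: base page size times PTEs per table."""
    entries_per_table = base_page_size // PTE_SIZE
    return base_page_size * entries_per_table

MIB = 1024 * 1024
for kib in (4, 16, 64):
    size = pmd_huge_page_size(kib * 1024)
    print(f"{kib}K base pages -> {size // MIB}MB PMD huge pages")

# With 64K base pages, an 8GB VM holds at most this many PMD huge pages:
print((8 * 1024 * MIB) // pmd_huge_page_size(64 * 1024))
```

With 4K base pages a PMD huge page is 2MB; with 64K base pages it is 512MB, so an 8GB VM has room for only 16 of them.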

Um, this is not really about 512MB THP. Yes, this is a smaller machine, but
performance is very important to us. The boot budget for the kernel is
under half a second. With 64K pages we save 0.2s (0.35s vs. 0.55s),
because fewer struct pages need to be initialized. Fewer TLB misses and
3-level page tables add up as further performance benefits.
For larger servers 64K pages make total sense: less memory is wasted on metadata.
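To put rough numbers on "fewer struct pages" and "less metadata", here is a back-of-the-envelope sketch. It assumes sizeof(struct page) is 64 bytes, which is typical on 64-bit kernels but depends on the configuration; the memory size is illustrative.

```python
STRUCT_PAGE_SIZE = 64  # assumed sizeof(struct page); config-dependent

def memmap_bytes(mem_bytes: int, page_size: int) -> int:
    """Memory consumed by the memmap (one struct page per base page)."""
    nr_pages = mem_bytes // page_size
    return nr_pages * STRUCT_PAGE_SIZE

GIB = 1 << 30
MIB = 1 << 20
for page_size in (4096, 65536):
    nr_pages = (8 * GIB) // page_size
    overhead = memmap_bytes(8 * GIB, page_size)
    print(f"{page_size // 1024}K pages: {nr_pages} struct pages, "
          f"{overhead // MIB}MB memmap ({100 * STRUCT_PAGE_SIZE / page_size:.2f}%)")
```

Under that assumption, 8GB of RAM needs about 2M struct pages (128MB) with 4K pages but only 128K struct pages (8MB) with 64K pages, i.e. 16 times fewer entries to initialize at boot.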

Yes, indeed, for very large servers it might make sense in that regard. 
However, once we can eventually free vmemmap of hugetlbfs things could 
change; assuming user space will be consuming huge pages (which large 
machines better be doing ... databases, hypervisors ... ).

Also, some hypervisors try allocating the memmap completely ... but I 
consider that rather a special case.

Personally, I consider being able to use THP/huge pages more important 
than having 64k base pages and saving some TLB space there. Also, with 
64k you have other drawbacks: for example, each stack and each TLS area for 
threads in applications suddenly consumes at least 16 times more memory.
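The factor of 16 comes from page-granular allocation: any mapping that touches memory costs at least one whole page. A minimal sketch, assuming each small per-thread area (stack or TLS) is rounded up to whole base pages; the thread count and request size are hypothetical.

```python
def min_footprint(request_bytes: int, page_size: int) -> int:
    """Round a nonzero request up to whole pages (ceiling division)."""
    pages = -(-request_bytes // page_size)
    return pages * page_size

THREADS = 1000   # hypothetical number of threads
REQUEST = 1024   # hypothetical small per-thread area, in bytes

for page_size in (4096, 65536):
    total = THREADS * min_footprint(REQUEST, page_size)
    print(f"{page_size // 1024}K pages: {total // 1024} KiB for {THREADS} threads")
```

For a 1KiB area per thread this gives 4000 KiB with 4K pages versus 64000 KiB with 64K pages, the 64K/4K = 16 ratio.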

Optimizing boot time/memmap initialization further is certainly an 
interesting topic.

Anyhow, you know your use case best, just sharing my thoughts :)

[...]

Right, but I do not think it is possible to do for dax devices (as of
right now). I assume the label contains information about what kind of
device it is: devdax, fsdax, sector, uuid, etc.
See the namespaces table in [1]. It contains a summary of pmem device types,
and which of them have a label (all except raw).

Interesting, I wonder if the label is really required to get this
special use case running. I mean, all you want is to have dax/kmem
expose the whole thing as system RAM. You don't want to lose even 2MB if
it's just for the sake of unnecessary metadata - this is not a real
device, it's "fake" already.

Hm, wouldn't it essentially mean allowing memory hot-plug for raw
pmem devices? Something like: create the mmap, and hot-add raw pmem?

Theoretically yes, but I have no idea if that would make sense for real 
"raw pmem" as well. Hope some of the pmem/nvdimm experts can clarify 
what's possible and what's not :)

--
Thanks,

David / dhildenb