On Wed, Oct 06, 2010 at 04:09:29PM -0700, H. Peter Anvin wrote:
> On 10/06/2010 03:47 PM, Vivek Goyal wrote:
> >
> > I really don't mind fixing things properly in the long term, just that
> > I am running out of ideas regarding how to fix it in a proper way.
> >
> > To me the best thing would be that this whole allocation thing be
> > dynamic from user space where kexec will run, determine what it is
> > loading, determine what the memory constraints on these segments are
> > (min, upper limit, alignment etc), and then ask the kernel to reserve
> > contiguous memory. This kind of dynamic reservation will remove a lot
> > of the problems associated with crashkernel= reservations.
> >
> > But I am not aware of any way of doing dynamic allocation and it
> > certainly does not seem to be easy to allocate 128M of memory
> > contiguously.
> >
> > Because we don't have a way to reserve memory dynamically later, we
> > end up doing a big chunk of reservation using the kernel command line
> > and later figure out what to load where. Now with this approach kexec
> > has not even run, so how can it tell you what the memory constraints
> > are?
> >
> > So to me one of the ways of properly fixing this is adding some kind
> > of capability to reserve the memory dynamically (maybe using
> > sys_kexec()) and get rid of this notion of reserving memory at boot
> > time.
>
> The problem, of course, with allocating very large chunks of memory at
> runtime is that there are going to be some number of non-movable and
> non-evictable pages that are going to break up the contiguous ranges.
> However, the mm recently added support for moving most pages, which
> should make that kind of allocation a lot more feasible. I haven't
> experimented how well it works in practice, but I rather suspect that
> as long as the crashkernel is installed sufficiently early in the boot
> process it should have a very good probability of success.

Ok.

> Another option, although one which has its own hackiness issues, is to
> do a conservative allocation at boot time in preparation of the kexec
> call, which is then freed. This doesn't really address the issue of
> location, though, which is part of the problem here.
>
> > The other concern you raised is hiding constraints from the kernel.
> > At this point of time the only problem with the crashkernel=X@0
> > syntax is that it does not tell you whether to look for memory
> > bottom up or top down. How about if we specify it explicitly in the
> > syntax so that the kernel does not have to assume things?
>
> See below.
>
> > In fact the initial crashkernel syntax was crashkernel=X@Y. This
> > meant allocate X amount of memory at location Y. This left no
> > ambiguity and the kernel did not have to assume things. It had the
> > problem though that we might not have physical RAM at location Y.
> > So I think that's when somebody came up with the idea of
> > crashkernel=X@0, so that we ideally want memory at location 0, but
> > if you can't provide that, then provide anything available next,
> > scanning bottom up.
> >
> > So the only part missing from the syntax is explicitly specifying
> > "next available location scanning bottom up". If we add that to the
> > syntax then the kernel does not have to make assumptions (except
> > the alignment part).
> >
> > So how about modifying the syntax to crashkernel=X@Y#BU.
> >
> > The "#BU" part can be optional and in that case the kernel is free
> > to allocate memory either top down or bottom up.
> >
> > Or any other string which can communicate the bottom up part in a
> > more intuitive manner.
>
> The whole problem here is that "bottoms up" isn't the true constraint --
> it's a proxy for "this chunk needs < address X, this chunk needs <
> address Y, ..." which is the real issue.
> This is particularly messy since low memory is a (sometimes very)
> precious resource that is used by a lot of things (BIOS stubs,
> DMA-mask-limited hardware devices, and perhaps especially 1:1 mappable
> pages on 32 bits, and so on), and one of the major reasons we want to
> switch to a top-down allocation scheme is to not waste a precious
> resource when we don't have to.
>
> The one improvement one could make to the crashkernel= syntax is perhaps
> "crashkernel=X<Y" meaning "allocate entirely below Y", since that is (at
> least in part) the real constraint. It could even be extended to
> multiple segments: "crashkernel=X<Y,Z<W,..." if we really need to...
> that way you have your preallocation.

Ok, I was browsing through the kexec-tools x86 bzImage code, trying to
refresh my memory about which segments are being loaded and what their
memory address constraints are.

- relocatable bzImage (max addr 0x37ffffff, 896MB).
  Though I don't know/understand where that 896MB comes from.

- initrd (max addr 0x37ffffff, 896MB)
  Don't know why 896MB is the upper limit.

- Purgatory (max addr 2G)

- A segment to keep ELF headers (no limit)
  These are accessed when the second kernel has fully booted, so they
  can be placed at higher addresses.

- A backup segment to copy the first 640K of memory (not aware of any
  limit)

- Setup/parameter segment (no limit)
  - We don't really execute anything here and just access it for the
    command line.

So at least for bzImage it looks like if we specify
crashkernel=128M<896M, it will work.

So I am fine with the above additional syntax for crashkernel=. Maybe we
shall have to deprecate the crashkernel=X<@0 syntax.

CCing the kexec list, in case others have any comments.

Thanks
Vivek
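
For illustration only, the crashkernel=128M<896M check against the
segment limits listed above could be sketched roughly as below. This is a
userspace Python sketch, not kernel or kexec-tools code: the helper names
(parse_size, parse_crashkernel, violations) are hypothetical, the "X<Y"
syntax is only the proposal from this thread, and reading "max addr 2G"
for purgatory as 0x7fffffff is an assumption.

```python
import re

# Binary unit multipliers for the K/M/G suffixes used in the thread.
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

# Highest usable address per segment, taken from the list in the mail.
# Purgatory's "max addr 2G" is assumed here to mean 0x7fffffff.
SEGMENT_MAX_ADDR = {
    "bzImage": 0x37FFFFFF,
    "initrd": 0x37FFFFFF,
    "purgatory": 0x7FFFFFFF,
}


def parse_size(text):
    """Parse '128M', '640K', '2G', '4096' or '0x1000' into bytes."""
    m = re.fullmatch(r"(0[xX][0-9a-fA-F]+|\d+)([KMG]?)", text.strip())
    if m is None:
        raise ValueError("bad size: %r" % text)
    digits, unit = m.groups()
    base = 16 if digits.lower().startswith("0x") else 10
    return int(digits, base) * _UNITS[unit]


def parse_crashkernel(value):
    """Parse the proposed 'X<Y[,Z<W...]' form into (size, limit) pairs."""
    pairs = []
    for part in value.split(","):
        size, sep, limit = part.partition("<")
        if not sep:
            raise ValueError("expected SIZE<LIMIT, got %r" % part)
        pairs.append((parse_size(size), parse_size(limit)))
    return pairs


def violations(limit):
    """Segments whose max address a region 'entirely below limit' may exceed.

    A region allocated entirely below `limit` can reach address limit - 1.
    """
    top = limit - 1
    return [name for name, max_addr in SEGMENT_MAX_ADDR.items()
            if top > max_addr]


if __name__ == "__main__":
    for size, limit in parse_crashkernel("128M<896M"):
        print(hex(size), hex(limit), violations(limit))
```

With these assumed limits, a region below 896M stays within every listed
maximum, while a region below 1G could place the bzImage or initrd too
high.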