Hi Don, I still need some help understanding this "resurrection" of the remote access facility, i.e., the new meanings of the REMOTE_xxx flags/macros. The currently-existing comment for the is_remote_daemon() function in remote.c describes the (obsolete) usage of the facility: /* * Parse, verify and establish a connection with the network daemon * specified on the crash command line. * * The format is: [remote-hostname]:port[,remote-namelist][,remote-dumpfile] * * where everything but the port number is optional, and the remote-namelist * and remote-dumpfile can be reversed. * * 1. The default remote host is the local host. * 2. The default dumpfile is /dev/mem. * 3. If no remote-namelist and remote-dumpfile are given, the daemon * is queried for a kernel that matches the remote /proc/version. * If no local kernel namelist is entered, the remote version will * be copied locally when fd_init() is called. * 4. If a remote-dumpfile is given with no remote namelist, it is presumed * that the kernel namelist will be entered locally. */ Back then, the remote "crashd" daemon could be contacted for: (1) live access to the remote machine's kernel using its /dev/mem, or (2) accessing a vmlinux/vmcore pair that existed on the remote machine. Can you explain this new usage? It sounds like there is a remote daemon running on a remote Dom0 kernel, and that your new functionality can only be used to: (1) access a running domU guest hosted by that remote system, or (2) access a paused domU guest hosted by that remote system. And that there is also a device that the the remote daemon uses to access the memory of the guest, be it paused or live. Am I in the ballpark? Dave ----- Original Message ----- > On 11/19/13 11:08, Dave Anderson wrote: > > > > ----- Original Message ----- > >> From: Don Slutz <dslutz@xxxxxxxxxxx> > >> > >> Currently crash 4.0 and later using the code: > >> > >> http://lists.xen.org/archives/html/xen-devel/2013-11/msg02569.html > >> > >> There is some very limited documention on crashes remote ptotocol in: > >> > >> http://lists.xen.org/archives/html/xen-devel/2013-11/msg02352.html > >> > >> and some fixes in the attachment > >> 0001-xen-crashd-Connect-crash-with-domain.patch > >> from the 1st link above. > >> > >> How ever there are issues with the current code: > >> > >> 1) The remote protocol has a minor issue (which I have not been able to > >> happen) based on the fact TCP/IP is stream based protocol. This means > >> a RECV or a SEND may not do the fully requested size of data. In fact > >> the current code assumes that the amount of data that a SEND is called > >> with will all be read with a single RECV. > >> > >> 2) The most common mismatch between crash and older kernels is > >> phys_base. In the remote case, see if the remote server supports > >> vitrual memory access, and if so, see if phys_base can be retreived. > >> > >> 3) crash assumes that the remote system is active and can not return > >> currect IP and SP to do a better back trace. > >> > >> 4) enable a non-active mode of remote access. > >> > >> This patch set attempts to fix these: > >> > >> The fix for #1 I have called NIL mode (patch 1). > >> The fix for #2 uses get_remote_phys_base (patch 3). > >> The fix for #3 uses get_remote_regs (patch 5). > >> The fix for #4 uses special file /dev/xenmem (patch 7). > >> It also REMOTE_NIL() to indicate "remote paused system". > >> > >> Don Slutz (7): > >> Add NIL mode to remote. > >> remote_proc_version: NULL terminate passed buffer on error. > >> Add get_remote_phys_base. > >> Add remote_vtop. > >> bt: get remote live registers if possible. > >> Add get_remote_cr3 > >> Add support for non-live remote. > >> > >> defs.h | 8 +- > >> kernel.c | 25 +++- > >> memory.c | 25 +++- > >> remote.c | 462 > >> +++++++++++++++++++++++++++++++++++++++++++++++++++++---------- > >> x86_64.c | 24 ++++ > >> 5 files changed, 465 insertions(+), 79 deletions(-) > >> > >> -- > >> 1.8.4 > > > > Hi Don, > > > > A few comments... > > > > I don't really understand what entity the code is communicating with, > > so for the most part, I'm primarily interested in making sure that your > > patches in no way can affect the normal usage of the crash utility. > > > > That being said, is the term "NIL mode" some kind of Xen-ism? > Nope, it is term I came up with to describe the protocol change. > > It wasn't > > until I saw the display_sys_stats() function that I even understood that > > REMOTE_NIL mean "(remote paused system)". Anyway, if you made it something > > like "REMOTE_PAUSED" or something like that, the code might make some sense > > to the untrained eye. > Will do. Looking back at it, I ended up overloading the term. :( > > > > And it's not entirely clear to me what REMOTE_ACTIVE() means now? It used > > to imply communicating with a remote crash daeamon that accessed the live > > system that the daemon was running on. Has that meaning been changed to > > mean > > that crash is communicating with a daemon that's accessing a "live" Xen > > guest > > kernel? > The meaning is not as clear. Since the existing versions of crash already > have REMOTE_ACTIVE() and I wanted the Xen utility to be usable to them, in > both cases the "live" Xen guest kernel is being accessed while it is paused. > I added REMOTE_PAUSED() to have a few changes to crash to correctly > understand that the guest is paused. As part of this I add the fake file > "/dev/xenmem". > > > > I'm confused about these two changes, where you re-patch your own patch: > > > > - else > > + else if (REMOTE_ACTIVE() && ((bt->task == tt->this_task) || > > is_task_active(bt->task))) { > > + if (get_remote_regs(bt, &eip, &esp)) > > + machdep->get_stack_frame(bt, &eip, &esp); > > + } else > > > > followed by: > > > > - else if (REMOTE_ACTIVE() && ((bt->task == tt->this_task) || > > is_task_active(bt->task))) { > > + else if (REMOTE_NIL() && ((bt->task == tt->this_task) || > > is_task_active(bt->task))) { > > if (get_remote_regs(bt, &eip, &esp)) > > machdep->get_stack_frame(bt, &eip, &esp); > > } else > Clearly I need to adjust the patch order. > > If I'm not mistaken, it would seem to be impossible for "(bt->task == > > tt->this_task)" > > to ever be true in the REMOTE_NIL() case? As I recall tt->this_task used > > to be the > > PID of the remote daemon itself, which would be accessing the memory of its > > own kernel. > > So how could the pid of the remote daemon ever be the target of a back > > trace of a > > paused guest? And so I don't understand why you first made an if clause > > for > > REMOTE_ACTIVE() to begin with -- whatever that means now -- and then > > changed it to > > REMOTE_NIL()? > This came about from some old changes from 2012 to a crash version 4.0 for > other work. I had not thought of /dev/xenmem at the time, and just copied > the condition from above. I see no reason not to drop "bt->task == > tt->this_task". Will check that the right actions are still done. > > In any case, since the ultimate change is only interested in REMOTE_NIL(), > > can you make it all less of an eyesore by encapulating its if case > > something like: > > > > else if (SADUMP_DUMPFILE()) > > get_sadump_regs(bt, &eip, &esp); > > + else if (REMOTE_NIL()) { > > + if (!((bt->task == tt->this_task) || > > is_task_active(bt->task)) || > > + !get_remote_regs(bt, &eip, &esp)) > > + machdep->get_stack_frame(bt, &eip, &esp); > > } else > > machdep->get_stack_frame(bt, &eip, &esp); > > > > Note above, for consistency's sake with the rest of the crash utility, > > can you make get_remote_regs() return non-zero when it successfully gets > > the registers, and zero if it fails? > Sure. > > And lastly, can you fix these? > > > > $ make warn > > ... > > remote.c: In function ‘validate_phys_base’: > > remote.c:2302:25: warning: format ‘%llx’ expects argument of type > > ‘long long unsigned int’, but argument 3 has type ‘physaddr_t’ > > [-Wformat] > > remote.c:2302:25: warning: format ‘%llx’ expects argument of type > > ‘long long unsigned int’, but argument 4 has type ‘physaddr_t’ > > [-Wformat] > > remote.c:2302:25: warning: format ‘%llx’ expects argument of type > > ‘long long unsigned int’, but argument 5 has type ‘physaddr_t’ > > [-Wformat] > > remote.c:2302:25: warning: format ‘%llx’ expects argument of type > > ‘long long unsigned int’, but argument 6 has type ‘physaddr_t’ > > [-Wformat] > > remote.c: In function ‘remote_vtop’: > > remote.c:2361:2: warning: format ‘%llx’ expects argument of type > > ‘long long unsigned int’, but argument 4 has type ‘physaddr_t’ > > [-Wformat] > > remote.c:2367:4: warning: format ‘%llx’ expects argument of type > > ‘long long unsigned int’, but argument 4 has type ‘physaddr_t’ > > [-Wformat] > > remote.c:2352:18: warning: variable ‘p3’ set but not used > > [-Wunused-but-set-variable] > > remote.c:2352:13: warning: variable ‘p2’ set but not used > > [-Wunused-but-set-variable] > > remote.c:2352:8: warning: variable ‘p1’ set but not used > > [-Wunused-but-set-variable] > > remote.c: In function ‘get_remote_regs’: > > remote.c:2392:33: warning: unused variable ‘p6’ [-Wunused-variable] > > remote.c:2392:28: warning: variable ‘p5’ set but not used > > [-Wunused-but-set-variable] > > remote.c:2392:18: warning: variable ‘p3’ set but not used > > [-Wunused-but-set-variable] > > remote.c:2392:8: warning: variable ‘p1’ set but not used > > [-Wunused-but-set-variable] > > remote.c: In function ‘get_remote_cr3’: > > remote.c:2449:13: warning: variable ‘p2’ set but not used > > [-Wunused-but-set-variable] > > remote.c:2449:8: warning: variable ‘p1’ set but not used > > [-Wunused-but-set-variable] > > ... > Yes. > > > And I'm not interested in appending these to the 4 files that you changed: > > > > +/* > > + * Specify Emacs local variables so the formating > > + * of the code stays the same. > > + * > > + * Local variables: > > + * mode: C > > + * c-set-style: "BSD" > > + * c-basic-offset: 8 > > + * indent-tabs-mode: t > > + * End: > > + */ > > > > Thanks, > > Dave > Ok, that is 2 votes to drop this, so I will. > -Don Slutz > -- Crash-utility mailing list Crash-utility@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/crash-utility