Hi Mike,

Thanks for the updated patch. I've applied your patches for the next
man-pages release, but would be happy if you could answer the questions below.

On 12/31/13 20:41, Mike Frysinger wrote:
> ---
> man2/syscall.2 | 6 +-
> man2/syscalls.2 | 3 +-
> man3/getauxval.3 | 4 +-
> man7/libc.7 | 5 +-
> man7/vdso.7 | 457 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 5 files changed, 468 insertions(+), 7 deletions(-)
> create mode 100644 man7/vdso.7
>
> diff --git a/man2/syscall.2 b/man2/syscall.2
> index e712b41..fe5f86d 100644
> --- a/man2/syscall.2
> +++ b/man2/syscall.2
> @@ -145,7 +145,8 @@ The details for various architectures are listed in the two tables below.
>
> The first table lists the instruction used to transition to kernel mode,
> (which might not be the fastest or best way to transition to the kernel,
> -so you might have to refer to the VDSO),
> +so you might have to refer to
> +.BR vdso (7)),
> the register used to indicate the system call number,
> and the register used to return the system call result.
> .if t \{\
> @@ -219,4 +220,5 @@ main(int argc, char *argv[])
> .SH SEE ALSO
> .BR _syscall (2),
> .BR intro (2),
> -.BR syscalls (2)
> +.BR syscalls (2),
> +.BR vdso (7)
> diff --git a/man2/syscalls.2 b/man2/syscalls.2
> index 265c654..0d085e1 100644
> --- a/man2/syscalls.2
> +++ b/man2/syscalls.2
> @@ -833,4 +833,5 @@ and similarly
> .SH SEE ALSO
> .BR syscall (2),
> .BR unimplemented (2),
> -.BR libc (7)
> +.BR libc (7),
> +.BR vdso (7)
> diff --git a/man3/getauxval.3 b/man3/getauxval.3
> index 8f27932..09d5bdc 100755
> --- a/man3/getauxval.3
> +++ b/man3/getauxval.3
> @@ -210,7 +210,5 @@ see
> for more information.
> .SH SEE ALSO
> .BR secure_getenv (3),
> +.BR vdso (7),
> .BR ld-linux.so (8)
> -
> -The kernel source file
> -.IR Documentation/ABI/stable/vdso
> diff --git a/man7/libc.7 b/man7/libc.7
> index a9aeba2..f687ced 100644
> --- a/man7/libc.7
> +++ b/man7/libc.7
> @@ -98,6 +98,9 @@ Details of these libraries are generally not covered by the
> project.
> .SH SEE ALSO
> .BR syscalls (2),
> +.BR getauxval (3),
> +.BR proc (5),
> .BR feature_test_macros (7),
> .BR man-pages (7),
> -.BR standards (7)
> +.BR standards (7),
> +.BR vdso (7)
> diff --git a/man7/vdso.7 b/man7/vdso.7
> new file mode 100644
> index 0000000..3c4b7fb
> --- /dev/null
> +++ b/man7/vdso.7
> @@ -0,0 +1,457 @@
> +.\" Written by Mike Frysinger <vapier@xxxxxxxxxx>
> +.\"
> +.\" %%%LICENSE_START(PUBLIC_DOMAIN)
> +.\" This page is in the public domain.
> +.\" %%%LICENSE_END
> +.\"
> +.TH VDSO 7 2013-04-09 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +vDSO \- overview of the virtual ELF dynamic shared object
> +.SH SYNOPSIS
> +.B #include <sys/auxv.h>
> +
> +.B void *vdso = (uintptr_t) getauxval(AT_SYSINFO_EHDR);
> +.SH DESCRIPTION
> +The "vDSO" is a small shared library that the kernel automatically maps into the
> +address space of all user-space applications.
> +Applications themselves usually need not concern themselves with these details
> +as the vDSO is most commonly called by the C library.
> +This way you can write using standard functions and the C library will take care

After "write" I added "programs". Okay?

> +of using any available functionality.

I made this piece:

    of using any functionality that is available via the vDSO.

Okay?

> +
> +Why does the vDSO exist at all?
> +There are some facilities the kernel provides that user space ends up using

I changed "facilities" to "system calls". Okay?

> +frequently to the point that such calls can dominate overall performance.
> +This is due both to the frequency of the call as well as the context overhead
> +from exiting user space and entering the kernel.
> +
> +The rest of this documentation is geared towards the curious and/or C library
> +writers rather than general developers.
> +If you're trying to call the vDSO in your own application rather than using
> +the C library, you're most likely doing it wrong.
> +.SS Example background
> +Making system calls can be slow.
> +In x86 32-bit systems, you can trigger a software interrupt (int $0x80) to tell
> +the kernel you wish to make a system call.
> +However, this instruction is expensive: it goes through the full interrupt
> +handling paths in the processor's microcode as well as in the kernel.
> +Newer processors have faster (but backwards incompatible) instructions to
> +initiate system calls.
> +Rather than require the C library to figure out if this functionality is
> +available at runtime itself, it can use functions provided by the kernel in
> +the vDSO.
> +
> +Note that the terminology can be confusing.
> +On x86 systems, the vDSO function is named "__kernel_vsyscall", but on x86_64,

After "function" I added

    used to determine the preferred method of making a system call is

Okay?

> +the term "vsyscall" also refers to an obsolete way to ask the kernel what time
> +it is or what CPU the caller is on.
> +
> +One system call frequently called is gettimeofday().
> +This is called both directly by user-space applications as well as indirectly by
> +the C library.
> +Think timestamps or timing loops or polling -- all of these frequently need to
> +know what time it is right now.
> +This information is also not secret -- any application in any privilege mode
> +(root or any user) will get the same answer.
> +Thus the kernel arranges for the information required to answer this question
> +to be placed in memory the process can access.
> +Now a call to gettimeofday() changes from a system call to a normal function
> +call and a few memory accesses.
> +.SS Finding the vDSO
> +The base address of the vDSO (if one exists) is passed by the kernel to each
> +program in the initial auxiliary vector.
> +Specifically, via the
> +.B AT_SYSINFO_EHDR
> +tag.
> +
> +You must not assume the vDSO is mapped at any particular location in the
> +user's memory map.
> +The base address will usually be randomized at runtime every time a new
> +process image is created (at
> +.BR execve (2)
> +time).
> +This is done for security reasons to prevent standard "return-to-libc" attacks.
> +
> +For some architectures, there is also a
> +.B AT_SYSINFO
> +tag.
> +This is used only for locating the vsyscall entry point and is frequently
> +omitted or set to 0 (meaning it's not available).
> +It is a throwback to the initial vDSO work (see
> +.IR HISTORY
> +below) and should be avoided.
> +
> +Refer to
> +.BR getauxval (3)
> +for more details on accessing these fields.
> +.SS File format
> +Since the vDSO is a fully formed ELF image, you can do symbol lookups on it.
> +This allows new symbols to be added with newer kernel releases, and for the
> +C library to detect available functionality at runtime when running under
> +different kernel versions.
> +Often times the C library will do detection with the first call and then
> +cache the result for subsequent calls.
> +
> +All symbols are also versioned (using the GNU version format).
> +This allows the kernel to update the function signature without breaking
> +backwards compatibility.
> +This means changing the arguments that the function accepts as well as the
> +return value.
> +Thus, when looking up a symbol in the vDSO, you must always include the version
> +to match the ABI you expect.
> +
> +Typically the vDSO follows the naming convention of prefixing all symbols with
> +"__vdso_" or "__kernel_" so as to distinguish them from other standard symbols.
> +e.g. The "gettimeofday" function is named "__vdso_gettimeofday".
> +
> +You use the standard C calling conventions when calling any of these functions.
> +No need to worry about weird register or stack behavior.
> +.SH NOTES
> +.SS Source
> +When you compile the kernel, it will automatically compile and link the vDSO
> +code for you.
> +You will frequently find it under the architecture-specific dir:
> +
> + find arch/$ARCH/ -name '*vdso*.so*' -o -name '*gate*.so*'
> +
> +Note that the vDSO that is used is based on the ABI of your user-space code
> +and not the ABI of the kernel.
> +i.e. If you run an i386 32-bit ELF under an i386 32-bit kernel or under an
> +x86_64 64-bit kernel, you'll get the same vDSO.
> +So when referring to sections below, use the user-space ABI.

I still can't make any sense of that last sentence. What are "sections" in
this context? What does it mean to "*use* the user-space ABI"?

> +.SS vDSO names
> +The name of this shared object varies across architectures.
> +It will often show up in things like glibc's `ldd` output.
> +The exact name should not matter to any code, so do not hardcode it.
> +.if t \{\
> +.ft CW
> +\}
> +.TS
> +l l.
> +user ABI	vDSO name
> +_
> +aarch64	linux-vdso.so.1
> +ia64	linux-gate.so.1
> +ppc/32	linux-vdso32.so.1
> +ppc/64	linux-vdso64.so.1
> +s390	linux-vdso32.so.1
> +s390x	linux-vdso64.so.1
> +sh	linux-gate.so.1
> +i386	linux-gate.so.1
> +x86_64	linux-vdso.so.1
> +x86/x32	linux-vdso.so.1
> +.TE
> +.if t \{\
> +.in
> +.ft P
> +\}
> +.SS arm functions
> +.\" See linux/arch/arm/kernel/entry-armv.S
> +.\" See linux/Documentation/arm/kernel_user_helpers.txt
> +The arm port has a code page full of utility functions.
> +Since it's just a raw page of code, there is no ELF information for doing
> +symbol lookups or versioning.
> +It does provide support for different versions though.
> +
> +For documentation on this code page, it's better you refer to the kernel doc
> +as it's extremely detailed and covers everything you need to know:
> +.br
> +Documentation/arm/kernel_user_helpers.txt
> +.SS aarch64 functions
> +.\" See linux/arch/arm64/kernel/vdso/vdso.lds.S
> +.if t \{\
> +.ft CW
> +\}
> +.TS
> +l l.
> +symbol	version

You don't explicitly say what tables such as the below are about.
Could you provide me with a sentence to describe them?

Cheers,

Michael

> +_
> +__kernel_rt_sigreturn	LINUX_2.6.39
> +__kernel_gettimeofday	LINUX_2.6.39
> +__kernel_clock_gettime	LINUX_2.6.39
> +__kernel_clock_getres	LINUX_2.6.39
> +.TE
> +.if t \{\
> +.in
> +.ft P
> +\}
> +.SS bfin (Blackfin) functions
> +.\" See linux/arch/blackfin/kernel/fixed_code.S
> +.\" See http://docs.blackfin.uclinux.org/doku.php?id=linux-kernel:fixed-code
> +As this CPU lacks a memory management unit (MMU), it doesn't set up a vDSO in
> +the normal sense.
> +Instead, it maps at boot time a few raw functions into a fixed location in
> +memory.
> +User-space applications then call directly into that region.
> +There is no provision for backwards compatibility beyond sniffing raw opcodes,
> +but as this is an embedded CPU, it can get away with things -- some of the
> +object formats it runs aren't even ELF based (they're bFLT/FLAT).
> +
> +For documentation on this code page, it's better you refer to the public docs:
> +.br
> +http://docs.blackfin.uclinux.org/doku.php?id=linux-kernel:fixed-code
> +.SS ia64 (Itanium) functions
> +.\" See linux/arch/ia64/kernel/gate.lds.S
> +.\" Also linux/arch/ia64/kernel/fsys.S and linux/Documentation/ia64/fsys.txt
> +.if t \{\
> +.ft CW
> +\}
> +.TS
> +l l.
> +symbol	version
> +_
> +__kernel_sigtramp	LINUX_2.5
> +__kernel_syscall_via_break	LINUX_2.5
> +__kernel_syscall_via_epc	LINUX_2.5
> +.TE
> +.if t \{\
> +.in
> +.ft P
> +\}
> +
> +The Itanium port actually likes to get tricky.
> +In addition to the vDSO above, it also has "light-weight system calls" (also
> +known as "fast syscalls" or "fsys").
> +You can invoke these via the __kernel_syscall_via_epc vDSO helper.
> +The system calls listed here have the same semantics as if you called them
> +directly via
> +.BR syscall (3),
> +so refer to the relevant
> +documentation for each.
> +The table below lists the functions available via this mechanism.
> +.if t \{\
> +.ft CW
> +\}
> +.TS
> +l.
> +function
> +_
> +clock_gettime
> +getcpu
> +getpid
> +getppid
> +gettimeofday
> +set_tid_address
> +.TE
> +.if t \{\
> +.in
> +.ft P
> +\}
> +.SS parisc (hppa) functions
> +.\" See linux/arch/parisc/kernel/syscall.S
> +.\" See linux/Documentation/parisc/registers
> +The parisc port has a code page full of utility functions called a gateway page.
> +Rather than use the normal ELF aux vector approach, it passes the address of
> +the page to the process via the SR2 register.
> +The permissions on the page are such that merely executing those addresses
> +automatically executes with kernel privileges and not in user-space.
> +This is done to match the way HP-UX works.
> +
> +Since it's just a raw page of code, there is no ELF information for doing
> +symbol lookups or versioning.
> +Simply call into the appropriate offset via the branch instruction, e.g.:
> +.br
> +ble <offset>(%sr2, %r0)
> +.if t \{\
> +.ft CW
> +\}
> +.TS
> +l l.
> +offset	function
> +_
> +00b0	lws_entry
> +00e0	set_thread_pointer
> +0100	linux_gateway_entry (syscall)
> +0268	syscall_nosys
> +0274	tracesys
> +0324	tracesys_next
> +0368	tracesys_exit
> +03a0	tracesys_sigexit
> +03b8	lws_start
> +03dc	lws_exit_nosys
> +03e0	lws_exit
> +03e4	lws_compare_and_swap64
> +03e8	lws_compare_and_swap
> +0404	cas_wouldblock
> +0410	cas_action
> +.TE
> +.if t \{\
> +.in
> +.ft P
> +\}
> +.SS ppc/32 functions
> +.\" See linux/arch/powerpc/kernel/vdso32/vdso32.lds.S
> +The functions marked with a
> +.I *
> +below are only available when the kernel is
> +a powerpc64 (64-bit) kernel.
> +.if t \{\
> +.ft CW
> +\}
> +.TS
> +l l.
> +symbol	version
> +_
> +__kernel_clock_getres	LINUX_2.6.15
> +__kernel_clock_gettime	LINUX_2.6.15
> +__kernel_datapage_offset	LINUX_2.6.15
> +__kernel_get_syscall_map	LINUX_2.6.15
> +__kernel_get_tbfreq	LINUX_2.6.15
> +__kernel_getcpu \fI*\fR	LINUX_2.6.15
> +__kernel_gettimeofday	LINUX_2.6.15
> +__kernel_sigtramp_rt32	LINUX_2.6.15
> +__kernel_sigtramp32	LINUX_2.6.15
> +__kernel_sync_dicache	LINUX_2.6.15
> +__kernel_sync_dicache_p5	LINUX_2.6.15
> +.TE
> +.if t \{\
> +.in
> +.ft P
> +\}
> +.SS ppc/64 functions
> +.\" See linux/arch/powerpc/kernel/vdso64/vdso64.lds.S
> +.if t \{\
> +.ft CW
> +\}
> +.TS
> +l l.
> +symbol	version
> +_
> +__kernel_clock_getres	LINUX_2.6.15
> +__kernel_clock_gettime	LINUX_2.6.15
> +__kernel_datapage_offset	LINUX_2.6.15
> +__kernel_get_syscall_map	LINUX_2.6.15
> +__kernel_get_tbfreq	LINUX_2.6.15
> +__kernel_getcpu	LINUX_2.6.15
> +__kernel_gettimeofday	LINUX_2.6.15
> +__kernel_sigtramp_rt64	LINUX_2.6.15
> +__kernel_sync_dicache	LINUX_2.6.15
> +__kernel_sync_dicache_p5	LINUX_2.6.15
> +.TE
> +.if t \{\
> +.in
> +.ft P
> +\}
> +.SS s390 functions
> +.\" See linux/arch/s390/kernel/vdso32/vdso32.lds.S
> +.if t \{\
> +.ft CW
> +\}
> +.TS
> +l l.
> +symbol	version
> +_
> +__kernel_clock_getres	LINUX_2.6.29
> +__kernel_clock_gettime	LINUX_2.6.29
> +__kernel_gettimeofday	LINUX_2.6.29
> +.TE
> +.if t \{\
> +.in
> +.ft P
> +\}
> +.SS s390x functions
> +.\" See linux/arch/s390/kernel/vdso64/vdso64.lds.S
> +.if t \{\
> +.ft CW
> +\}
> +.TS
> +l l.
> +symbol	version
> +_
> +__kernel_clock_getres	LINUX_2.6.29
> +__kernel_clock_gettime	LINUX_2.6.29
> +__kernel_gettimeofday	LINUX_2.6.29
> +.TE
> +.if t \{\
> +.in
> +.ft P
> +\}
> +.SS sh (SuperH) functions
> +.\" See linux/arch/sh/kernel/vsyscall/vsyscall.lds.S
> +.if t \{\
> +.ft CW
> +\}
> +.TS
> +l l.
> +symbol	version
> +_
> +__kernel_rt_sigreturn	LINUX_2.6
> +__kernel_sigreturn	LINUX_2.6
> +__kernel_vsyscall	LINUX_2.6
> +.TE
> +.if t \{\
> +.in
> +.ft P
> +\}
> +.SS i386 functions
> +.\" See linux/arch/x86/vdso/vdso32/vdso32.lds.S
> +.if t \{\
> +.ft CW
> +\}
> +.TS
> +l l.
> +symbol	version
> +_
> +__kernel_sigreturn	LINUX_2.5
> +__kernel_rt_sigreturn	LINUX_2.5
> +__kernel_vsyscall	LINUX_2.5
> +.TE
> +.if t \{\
> +.in
> +.ft P
> +\}
> +.SS x86_64 functions
> +.\" See linux/arch/x86/vdso/vdso.lds.S
> +All of these symbols are also available without the "__vdso_" prefix, but
> +you should ignore those and stick to the names below.
> +.if t \{\
> +.ft CW
> +\}
> +.TS
> +l l.
> +symbol	version
> +_
> +__vdso_clock_gettime	LINUX_2.6
> +__vdso_getcpu	LINUX_2.6
> +__vdso_gettimeofday	LINUX_2.6
> +__vdso_time	LINUX_2.6
> +.TE
> +.if t \{\
> +.in
> +.ft P
> +\}
> +.SS x86/x32 functions
> +.\" See linux/arch/x86/vdso/vdso32.lds.S
> +.if t \{\
> +.ft CW
> +\}
> +.TS
> +l l.
> +symbol	version
> +_
> +__vdso_clock_gettime	LINUX_2.6
> +__vdso_getcpu	LINUX_2.6
> +__vdso_gettimeofday	LINUX_2.6
> +__vdso_time	LINUX_2.6
> +.TE
> +.if t \{\
> +.in
> +.ft P
> +\}
> +.SS History
> +The vDSO was originally just a single function -- the vsyscall.
> +In older kernels, you might see that in a process's memory map rather than vdso.
> +Over time, people realized that this was a great way to pass more functionality
> +to user space, so it was reconceived as a vDSO in the current format.
> +.SH SEE ALSO
> +.BR syscalls (2),
> +.BR getauxval (3),
> +.BR proc (5)
> +
> +The docs/examples/sources in the Linux sources:
> +.nf
> +Documentation/ABI/stable/vdso
> +linux/Documentation/ia64/fsys.txt
> +Documentation/vDSO/* (includes examples of using the vDSO)
> +find arch/ -iname '*vdso*' -o -iname '*gate*'
> +.fi
>

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/