Dmitry Safonov <dima@xxxxxxxxxx> writes:

> Discussions around time virtualization are there for a long time.
> The first attempt to implement time namespace was in 2006 by Jeff Dike.
> From that time, the topic appears on and off in various discussions.
>
> There are two main use cases for time namespaces:
> 1. change date and time inside a container;
> 2. adjust clocks for a container restored from a checkpoint.
>
> “It seems like this might be one of the last major obstacles keeping
> migration from being used in production systems, given that not all
> containers and connections can be migrated as long as a time dependency
> is capable of messing it up.” (by github.com/dav-ell)
>
> The kernel provides access to several clocks: CLOCK_REALTIME,
> CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonic, but the
> start points for them are not defined and are different for each running
> system. When a container is migrated from one node to another, all
> clocks have to be restored into consistent states; in other words, they
> have to continue running from the same points where they have been
> dumped.
>
> The main idea behind this patch set is adding per-namespace offsets for
> system clocks. When a process in a non-root time namespace requests
> time of a clock, a namespace offset is added to the current value of
> this clock on a host and the sum is returned.
>
> All offsets are placed on a separate page, this allows us to map it as
> part of vvar into user processes and use offsets from vdso calls.
>
> Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> clocks.
>
> Questions to discuss:
>
> * Clone flags exhaustion. Currently there is only one unused clone flag
>   bit left, and it may be worth to use it to extend arguments of the clone
>   system call.
>
> * Realtime clock implementation details:
>   Is having a simple offset enough?
>   What to do when date and time is changed on the host?
>   Is there a need to adjust vfs modification and creation times?
>   Implementation for adjtime() syscall.

Overall I support this effort. In my quick skim this code looked good.

My feeling is that we need to be able to support running ntpd and
support one namespace doing Google's smoothing of leap seconds while
another namespace takes the leap second.

What I was imagining when I was last thinking about this was one
instance of struct timekeeper aka tk_core per time namespace. That
structure already keeps offsets for all of the various clocks from the
kernel internal time sources. What would be needed would be to pass in
an appropriate time namespace pointer.

I could be completely wrong as I have not taken the time to completely
trace through the code. Have you looked at pushing the time namespace
down as far as tk_core?

What I think would be the big advantage (besides ntp working) is that
the bulk of the code could be reused, allowing testing of the kernel's
time code by setting up a new time namespace. So a person in production
could set up a time namespace with the time set ahead a little bit and
be able to verify that the kernel handles the upcoming leap second
properly.

I don't know about the vfs. I think the danger is being able to write
dates in the future or in the past. It appears that utimes(2) and
utimensat(2) already allow this except for status change. So it is
possible we simply don't care. I seem to remember that what nfs does
is take the time stamp from the host writing to the file.

I think the guide for filesystem timestamps should be to first ensure we
don't introduce security issues, and then do what distributed
filesystems do when dealing with hosts with different clocks.

Given those two guidelines above I don't think there is a need to
change timestamps the way the user namespace changes uid when displayed.

As for the hardware like the real time clock we definitely should not
let a root in a time namespace change it.
We might even be able to get away with leaving the real time clock out
of the time namespace. If not we need to be very careful how the real
time clock is abstracted. I would start by leaving the real time clock
hardware out of the time namespace and see if there is any part of
userspace that cares.

Eric

> Cc: Dmitry Safonov <0x7f454c46@xxxxxxxxx>
> Cc: Adrian Reber <adrian@xxxxxxxx>
> Cc: Andrei Vagin <avagin@xxxxxxxxxx>
> Cc: Andy Lutomirski <luto@xxxxxxxxxx>
> Cc: Christian Brauner <christian.brauner@xxxxxxxxxx>
> Cc: Cyrill Gorcunov <gorcunov@xxxxxxxxxx>
> Cc: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
> Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Jeff Dike <jdike@xxxxxxxxxxx>
> Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
> Cc: Pavel Emelyanov <xemul@xxxxxxxxxxxxx>
> Cc: Shuah Khan <shuah@xxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
> Cc: criu@xxxxxxxxxx
> Cc: linux-api@xxxxxxxxxxxxxxx
> Cc: x86@xxxxxxxxxx
>
> Andrei Vagin (12):
>   ns: Introduce Time Namespace
>   timens: Add timens_offsets
>   timens: Introduce CLOCK_MONOTONIC offsets
>   timens: Introduce CLOCK_BOOTTIME offset
>   timerfd/timens: Take into account ns clock offsets
>   kernel: Take into account timens clock offsets in clock_nanosleep
>   x86/vdso/timens: Add offsets page in vvar
>   x86/vdso: Use set_normalized_timespec() to avoid 32 bit overflow
>   posix-timers/timens: Take into account clock offsets
>   selftest/timens: Add test for timerfd
>   selftest/timens: Add test for clock_nanosleep
>   timens/selftest: Add timer offsets test
>
> Dmitry Safonov (8):
>   timens: Shift /proc/uptime
>   x86/vdso: Restrict splitting vvar vma
>   x86/vdso: Purge timens page on setns()/unshare()/clone()
>   x86/vdso: Look for vvar vma to purge timens page
>   timens: Add align for timens_offsets
>   timens: Optimize zero-offsets
>   selftest: Add Time Namespace test for supported clocks
>   timens/selftest: Add procfs selftest
>
>  arch/Kconfig                                     |   5 +
>  arch/x86/Kconfig                                 |   1 +
>  arch/x86/entry/vdso/vclock_gettime.c             |  52 +++++
>  arch/x86/entry/vdso/vdso-layout.lds.S            |   9 +-
>  arch/x86/entry/vdso/vdso2c.c                     |   3 +
>  arch/x86/entry/vdso/vma.c                        |  67 +++++++
>  arch/x86/include/asm/vdso.h                      |   2 +
>  fs/proc/namespaces.c                             |   3 +
>  fs/proc/uptime.c                                 |   3 +
>  fs/timerfd.c                                     |  16 +-
>  include/linux/nsproxy.h                          |   1 +
>  include/linux/proc_ns.h                          |   1 +
>  include/linux/time_namespace.h                   |  72 +++++++
>  include/linux/timens_offsets.h                   |  25 +++
>  include/linux/user_namespace.h                   |   1 +
>  include/uapi/linux/sched.h                       |   1 +
>  init/Kconfig                                     |   8 +
>  kernel/Makefile                                  |   1 +
>  kernel/fork.c                                    |   3 +-
>  kernel/nsproxy.c                                 |  19 +-
>  kernel/time/hrtimer.c                            |   8 +
>  kernel/time/posix-timers.c                       |  89 ++++++++-
>  kernel/time/posix-timers.h                       |   2 +
>  kernel/time_namespace.c                          | 230 +++++++++++++++++++++++
>  tools/testing/selftests/timens/.gitignore        |   5 +
>  tools/testing/selftests/timens/Makefile          |   6 +
>  tools/testing/selftests/timens/clock_nanosleep.c |  98 ++++++++++
>  tools/testing/selftests/timens/config            |   1 +
>  tools/testing/selftests/timens/log.h             |  21 +++
>  tools/testing/selftests/timens/procfs.c          | 145 ++++++++++++++
>  tools/testing/selftests/timens/timens.c          | 196 +++++++++++++++++++
>  tools/testing/selftests/timens/timer.c           |  95 ++++++++++
>  tools/testing/selftests/timens/timerfd.c         |  96 ++++++++++
>  33 files changed, 1272 insertions(+), 13 deletions(-)
>  create mode 100644 include/linux/time_namespace.h
>  create mode 100644 include/linux/timens_offsets.h
>  create mode 100644 kernel/time_namespace.c
>  create mode 100644 tools/testing/selftests/timens/.gitignore
>  create mode 100644 tools/testing/selftests/timens/Makefile
>  create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c
>  create mode 100644 tools/testing/selftests/timens/config
>  create mode 100644 tools/testing/selftests/timens/log.h
>  create mode 100644 tools/testing/selftests/timens/procfs.c
>  create mode 100644 tools/testing/selftests/timens/timens.c
>  create mode 100644 tools/testing/selftests/timens/timer.c
>  create mode 100644 tools/testing/selftests/timens/timerfd.c
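
For readers following along, the per-namespace offset idea from the
quoted cover letter (host clock value plus a namespace offset, returned
as the sum) can be illustrated with a small userspace sketch. This is
not code from the patch set: struct timens_offsets_sketch,
timespec_add_normalized() and ns_clock_gettime() are hypothetical names
made up for illustration, and the real series applies the offsets
inside the kernel and the vdso rather than in a wrapper like this.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical stand-in for the per-namespace offsets described above. */
struct timens_offsets_sketch {
	struct timespec monotonic;	/* added to CLOCK_MONOTONIC */
	struct timespec boottime;	/* added to CLOCK_BOOTTIME */
};

/*
 * Add two timespecs and keep tv_nsec in [0, NSEC_PER_SEC).
 * A negative offset is expressed with a negative tv_sec and a
 * non-negative tv_nsec, e.g. -1.5s is { -2, 500000000 }.
 */
static struct timespec timespec_add_normalized(struct timespec a,
					       struct timespec b)
{
	struct timespec r;

	assert(a.tv_nsec >= 0 && a.tv_nsec < NSEC_PER_SEC);
	assert(b.tv_nsec >= 0 && b.tv_nsec < NSEC_PER_SEC);

	r.tv_sec = a.tv_sec + b.tv_sec;
	r.tv_nsec = a.tv_nsec + b.tv_nsec;
	if (r.tv_nsec >= NSEC_PER_SEC) {
		r.tv_nsec -= NSEC_PER_SEC;
		r.tv_sec += 1;
	}
	return r;
}

/*
 * Read the host clock and return host value + namespace offset.
 * Clocks without an offset in this sketch are passed through unchanged.
 */
static int ns_clock_gettime(const struct timens_offsets_sketch *ns,
			    clockid_t id, struct timespec *out)
{
	struct timespec host;

	if (clock_gettime(id, &host) != 0)
		return -1;

	switch (id) {
	case CLOCK_MONOTONIC:
		*out = timespec_add_normalized(host, ns->monotonic);
		break;
	case CLOCK_BOOTTIME:
		*out = timespec_add_normalized(host, ns->boottime);
		break;
	default:
		*out = host;
		break;
	}
	return 0;
}
```

A restored container would, under this model, get an offset chosen so
that its CLOCK_MONOTONIC continues from the value seen at dump time.
Only the two clocks the cover letter lists as implemented carry an
offset here; CLOCK_REALTIME is passed through, matching the open
question about how the realtime clock should be handled.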