Oren Laadan [orenl@xxxxxxxxxxxxxxx] wrote: > > > john stultz wrote: >> On Fri, 2010-07-30 at 15:45 -0400, Oren Laadan wrote: >>> Sukadev Bhattiprolu wrote: >>>> Oren Laadan [orenl@xxxxxxxxxxxxxxx] wrote: >>>> | | | > h->fl_type = lock->fl_type; >>>> | > + h->fl_type_prev = lock->fl_type_prev; >>>> | > h->fl_flags = lock->fl_flags; >>>> | > + if (h->fl_type & F_INPROGRESS && | > >>>> + (lock->fl_break_time > jiffies)) >>>> | > + h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ; >>>> | | Hmmm -- we have a ctx->ktime_begin marking the start of the >>>> checkpoint. >>>> | It is used for relative-time calculations, for example, the expiry of >>>> | restart-blocks and timeouts. I suggest that we use it here too to be >>>> | consistent. >>>> >>>> Well, I started off using ktime_begin but discussed this with John Stultz >>>> (CC'd here) who pointed out that mixing different domains of time is likely >>>> to cause errors. ktime is an absolute time and fl_break_time uses jiffies - >>>> a relative time. >>>> >>>> I think use of ktime_begin for restart_blocks is fine (since they use >>>> ktime_t) but using ktime_t for file leases and converting between jiffies >>>> and nanoseconds could be a problem, unless we convert fl_break_time to >>>> seconds. >>>> >>>> IOW, can we leave the above computation of ->fl_rem_lease for now ? >>> The data on restart_blocks keep relative time - it's the the time >>> to expiry relative to ktime_begin (which is absolute). >>> >>> ktime_begin merely gives a reference point in time against which >>> all other time-related values should be saved. The advantage is >>> that all time computation are relative to the same point in time >>> at checkpoint/restart - the time when the ktime_begin is set. It's >>> more consistent this way. >> >> First, forgive me for not being very aware of the checkpoint/restart >> code. > > On the contrary, forgive me if I'm stating the obvious below ... > >> >> So, ktime_begin is an absolute CLOCK_MONOTONIC time, relative to the >> time the system booted (more or less). And it represents the checkpoint >> time, correct? > > As a rule, all time measurements in the checkpoint image are > saved as relative values, using the start-of-checkpoint as the > reference point in time (*). > > So at checkpoint, every absolute time value should be converted > to a value relative to the start-of-checkpoint. At restart, every > relative time value from the image is converted back (if needed) > to an absolute time value using a respective start-of-restart. One general observation, slightly off-topic. You mention "start-of-restart" here and ... > > This makes sense for the most common case, where if a process > had 5 more seconds to sleep at the time of checkpoint, we would > like it to have 5 more seconds to sleep after it restarts. ... "after it restarts" here. These two can be quite different if, as you mention below, the C/R is a lengthy operation. If application had 5 seconds remaining on the lease, and C/R takes more than 5 seconds, the application would have spent the remaining lease frozen. Of course trying to compute the values relative to the "end-of-restart" is theoretically impossible :-). Maybe the user can adjust the lease-expiry in the checkpoint-image to allow for this as well. > > (For the rare case where you care about the absolute timeout, > userspace can modify the checkpoint image on-the-fly during > restart to adjust the timeout value to be 0, or close enough > to zero that the timeout will happen as soon as the restart > completes). > > That said, ktime_begin has two purposes: > > 1) It stores the start-of-{checkpoint,restart} during checkpoint > and restart respectively, and used for all conversions from > absolute to relative time values. This is purely internal use > during the checkpoint/restart operations. > It make sense to use a single value, as opposed to compute the > deltas "on the go", because checkpoint/restart can be lengthy > operations (depending on the image size and IO speed). Doing > so against current time, which is a moving target, can yield > inconsistent results. > > (*) Looking at the code, I see that the hrtimer expiry times > are actually calculated against current time - I'll fix that... > > 2) It provides information for the admin/user about what was > the absolute time when the checkpoint operation began (maybe > we should also end ktime_end...). Otherwise, we would have to > wrap each checkpoint image with additional meta-data from > userspace. > (In particular, having that absolute time in the checkpoint > image will allow a userspace filter to modify timeout values > as needed in the rare case mentioned above). > >> Is there a similar checkpointed jiffies value? > > No. but ... > >>> I don't see the problem with jiffies vs ktime - there are functions >>> to convert between different units (see jiffies.h). Even if you are >>> concerned about reducing the resolution because of a conversion - >>> well, it isn't realistic to expect nano-sec resolution after restart... >> >> You're right, there are functions to convert from relative jiffies to >> relative nanoseconds, but there really aren't good functions to convert >> from absolute jiffies to absolute CLOCK_MONOTONIC time. >> >> One can make a rough approximation, but its not a very precise method >> with a number of complications (ie: 32bit jiffies wraps every ~50 days, >> the INITIAL_JIFFIES offset has to be remembered, and the larger issue of >> the fact that CLOCK_MONOTONIC is NTP corrected, while jiffies may or may >> not be). So I'd strongly advise against trying to directly convert abs >> jiffies values to abs CLOCK_MONOTONIC time. > > ... given this, to be able to compute the relative time where > jiffies are involved (e.g. the remaining lease time relative to > ktime_begin) we could also keep a ctx->jiffies_begin, calculate > the delta, and then convert to ktime_t delta. > > Suka: if that works for you, can you please add this piece to > your patch ? Yes, sure I can do that. Thanks John, Oren. Sukadev -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html