Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Oren Laadan [orenl@xxxxxxxxxxxxxxx] wrote:
>
>
> john stultz wrote:
>> On Fri, 2010-07-30 at 15:45 -0400, Oren Laadan wrote:
>>> Sukadev Bhattiprolu wrote:
>>>> Oren Laadan [orenl@xxxxxxxxxxxxxxx] wrote:
>>>> | | | >  		h->fl_type = lock->fl_type;
>>>> | > +		h->fl_type_prev = lock->fl_type_prev;
>>>> | >  		h->fl_flags = lock->fl_flags;
>>>> | > +		if (h->fl_type & F_INPROGRESS && | > 
>>>> +					(lock->fl_break_time > jiffies))
>>>> | > +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
>>>> | | Hmmm -- we have a ctx->ktime_begin marking the start of the 
>>>> checkpoint.
>>>> | It is used for relative-time calculations, for example, the expiry of
>>>> | restart-blocks and timeouts.  I suggest that we use it here too to be
>>>> | consistent.
>>>>
>>>> Well, I started off using ktime_begin but discussed this with John Stultz
>>>> (CC'd here) who pointed out that mixing different domains of time is likely
>>>> to cause errors. ktime is an absolute time and fl_break_time uses jiffies -
>>>> a relative time. 
>>>>
>>>> I think use of ktime_begin for restart_blocks is fine (since they use
>>>> ktime_t) but using ktime_t for file leases and converting between jiffies
>>>> and nanoseconds could be a problem, unless we convert fl_break_time to
>>>> seconds.
>>>>
>>>> IOW, can we leave the above computation of ->fl_rem_lease for now ?
>>> The data on restart_blocks keep relative time - it's the the time
>>> to expiry relative to ktime_begin (which is absolute).
>>>
>>> ktime_begin merely gives a reference point in time against which
>>> all other time-related values should be saved. The advantage is
>>> that all time computation are relative to the same point in time
>>> at checkpoint/restart - the time when the ktime_begin is set. It's
>>> more consistent this way.
>>
>> First, forgive me for not being very aware of the checkpoint/restart
>> code. 
>
> On the contrary, forgive me if I'm stating the obvious below ...
>
>>
>> So, ktime_begin is an absolute CLOCK_MONOTONIC time, relative to the
>> time the system booted (more or less). And it represents the checkpoint
>> time, correct?
>
> As a rule, all time measurements in the checkpoint image are
> saved as relative values, using the start-of-checkpoint as the
> reference point in time (*).
>
> So at checkpoint, every absolute time value should be converted
> to a value relative to the start-of-checkpoint. At restart, every
> relative time value from the image is converted back (if needed)
> to an absolute time value using a respective start-of-restart.

One general observation, slightly off-topic. You mention 
"start-of-restart" here and ...
>
> This makes sense for the most common case, where if a process
> had 5 more seconds to sleep at the time of checkpoint, we would
> like it to have 5 more seconds to sleep after it restarts.

... "after it restarts" here. These two can be quite different if,
as you mention below, the C/R is a lengthy operation.

If application had 5 seconds remaining on the lease, and C/R takes
more than 5 seconds, the application would have spent the remaining
lease frozen.  Of course trying to compute the values relative to the
"end-of-restart" is theoretically impossible :-).

Maybe the user can adjust the lease-expiry in the checkpoint-image 
to allow for this as well.

>
> (For the rare case where you care about the absolute timeout,
> userspace can modify the checkpoint image on-the-fly during
> restart to adjust the timeout value to be 0, or close enough
> to zero that the timeout will happen as soon as the restart
> completes).
>
> That said, ktime_begin has two purposes:
>
> 1) It stores the start-of-{checkpoint,restart} during checkpoint
> and restart respectively, and used for all conversions from
> absolute to relative time values.  This is purely internal use
> during the checkpoint/restart operations.
> It make sense to use a single value, as opposed to compute the
> deltas "on the go", because checkpoint/restart can be lengthy
> operations (depending on the image size and IO speed). Doing
> so against current time, which is a moving target, can yield
> inconsistent results.
>
> (*) Looking at the code, I see that the hrtimer expiry times
> are actually calculated against current time - I'll fix that...
>
> 2) It provides information for the admin/user about what was
> the absolute time when the checkpoint operation began (maybe
> we should also end ktime_end...). Otherwise, we would have to
> wrap each checkpoint image with additional meta-data from
> userspace.
> (In particular, having that absolute time in the checkpoint
> image will allow a userspace filter to modify timeout values
> as needed in the rare case mentioned above).
>
>> Is there a similar checkpointed jiffies value?
>
> No. but ...
>
>>> I don't see the problem with jiffies vs ktime - there are functions
>>> to convert between different units (see jiffies.h). Even if you are
>>> concerned about reducing the resolution because of a conversion -
>>> well, it isn't realistic to expect nano-sec resolution after restart...
>>
>> You're right, there are functions to convert from relative jiffies to
>> relative nanoseconds, but there really aren't good functions to convert
>> from absolute jiffies to absolute CLOCK_MONOTONIC time.
>>
>> One can make a rough approximation, but its not a very precise method
>> with a number of complications (ie: 32bit jiffies wraps every ~50 days,
>> the INITIAL_JIFFIES offset has to be remembered, and the larger issue of
>> the fact that CLOCK_MONOTONIC is NTP corrected, while jiffies may or may
>> not be). So I'd strongly advise against trying to directly convert abs
>> jiffies values to abs CLOCK_MONOTONIC time.
>
> ... given this, to be able to compute the relative time where
> jiffies are involved (e.g. the remaining lease time relative to
> ktime_begin) we could also keep a ctx->jiffies_begin, calculate
> the delta, and then convert to ktime_t delta.
>
> Suka: if that works for you, can you please add this piece to
> your patch ?

Yes, sure I can do that.

Thanks John, Oren.

Sukadev
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux