Re: Copying Data Blocks

On Wed, Jan 14, 2009 at 7:59 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
> Frankly, I am quite lost in the sea of arguments :-) ... but let me
> try sharing my points:
>
>>>
>>> I can see the inode locking for the entire duration of a file reorg
>>> would be unacceptable at some later point in the development / release
>>> cycle.
>>>
>> That's exactly what we are intending.  We are planning to get down
>> to the block level in our later milestones.  I have mentioned that
>> earlier as well.
>>
>
> Yes, locking can be done either at the block or the inode level.  But
> what I would like to suggest is Oracle's mechanism - NO LOCKS at all!
>
> To copy something from A to B, you either freeze all changes to A and
> copy it, or you just go straight and copy it, AND UNDO any changes
> that were made while copying.  The latter you can get from the
> journalling logs (errr... does not apply to ext2) - whereby all
> changes are expressed as transactions.  So any data change whose
> transaction is not closed is considered incomplete, and therefore
> will be undone.  Another criterion is TIME (explained later).
>
> But then again, journalling comes in two flavours: either it holds
> all of the latest data changes, or it does not, and only records
> WHERE changes were made (the metadata alone) - data journalling
> versus writeback
> (http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html).
>
> So based on all of this, it is possible to reconstruct the data changes.
>
> So now compare - in your imagination - one operation of (lock + copy)
> repeated 1000000 times against one of (copy) x 1000000 plus undoing
> changes based on the journal.  In a scenario where there are very few
> changes, the Oracle mechanism wins in terms of performance.  The
> emphasis here is that there is NO LOCKING OVERHEAD now.
>
> That is how "online backup in Oracle" works.  And the other criterion
> for slicing the journalling logs is time.  Relative to a certain
> point in time, all transactions before it will be flushed out
> (meaning done), and everything after it undone, if data changes have
> been made.  Oracle can do that because ALL data changes are recorded
> in the journal (FULL journalling, and that is always the case).  In
> the ext3 case, if we have data journalling (not writeback) then this
> is possible... this is the default in my distro (Fedora) and is
> mentioned in the ext3-faq as well.
>

I'm surprised Fedora enables data journaling by default.  Interesting.
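
To make the comparison Peter describes concrete, here is a rough
user-space sketch of the two strategies.  Everything in it is my own
illustration - the chunked granularity, the per-chunk change counter
standing in for the journal's transaction records, and all the names -
none of it is taken from ext3/4 or Oracle code.

/* Rough user-space sketch, not ext3/4 or Oracle code.  The chunked
 * granularity, the per-chunk change counter standing in for the
 * journal, and all names are my own illustration. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define CHUNK  4096
#define CHUNKS 16

static char src[CHUNKS][CHUNK];     /* the "file" being reorganised  */
static char dst[CHUNKS][CHUNK];     /* destination of the copy       */
static unsigned long gen[CHUNKS];   /* per-chunk change counter,     */
                                    /* standing in for the journal's */
                                    /* transaction records           */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Strategy 1: freeze writers for every chunk (lock + copy, repeated). */
static void copy_with_locks(void)
{
    for (int i = 0; i < CHUNKS; i++) {
        pthread_mutex_lock(&lock);          /* writers blocked here  */
        memcpy(dst[i], src[i], CHUNK);
        pthread_mutex_unlock(&lock);
    }
}

/* Strategy 2: copy with no lock at all, then redo only the chunks
 * whose counter moved while we were copying - the "undo/redo from
 * the journal" idea. */
static void copy_then_reconcile(void)
{
    unsigned long seen[CHUNKS];

    for (int i = 0; i < CHUNKS; i++) {
        seen[i] = gen[i];                   /* note the "transaction" */
        memcpy(dst[i], src[i], CHUNK);      /* copy without locking   */
    }
    for (int i = 0; i < CHUNKS; i++) {
        if (gen[i] != seen[i])              /* changed under us?      */
            memcpy(dst[i], src[i], CHUNK);  /* redo just that chunk   */
    }
}

int main(void)
{
    copy_with_locks();
    copy_then_reconcile();
    puts("both copies done");
    return 0;
}

With few concurrent changes the second strategy touches each chunk
roughly once and pays no locking overhead, which is the performance
argument as I read it.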

> Another emphasis is point-in-time.  Everything at a particular point
> in time is always consistent.  But if you lock one file, unlock it,
> and then lock the other files... different files get copied at
> different times, and you may end up with one file being much more
> recent (more "up to date") than another.  I.e., data inconsistency.
> Which is why Oracle backup procedures never lock at the per-file
> level - which is what your inode locking amounts to.  All files are
> always backed up as of the same point in time.

I don't think the above paragraph is an issue with re-org as currently
designed.  Neither for the ext4_defrag patchset that is under
consideration for acceptance, nor for the work the OHSM team is doing.

> Hopefully I have explained myself clearly enough?

You have, and I agree that the "best" solution from a purely technical
perspective would be for a file re-org tool to utilize a data journal.
The issue is that I suspect it would be a significant amount of work
to implement, and personally I doubt the performance payoff would
justify it.

Especially with rotational media, the call stack at the filesystem
layer is so much faster than the drive that blocking access to the
write queue for a few milliseconds while some block-level re-org is
happening should not slow down data getting to disk.  I.e., during the
lock the lower-level block queues will drain some, but as soon as the
lock is released the queues should fill back up.

The key in my mind is eventually getting the file re-org to only hold
the inode lock for a few milliseconds at a time.  Once that is
achieved, I don't believe it is worth the extra work to implement an
Oracle-like solution.
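
As a concrete (and purely illustrative) example of what I mean, here
is a minimal user-space sketch of bounding each lock hold to one small
batch of blocks.  move_blocks() is a made-up stand-in for whatever
actually relocates the data, and the 64-block batch size is just a
number I picked:

/* Minimal sketch of bounding each inode-lock hold to a small batch.
 * move_blocks() is a stand-in I made up; the 64-block batch is an
 * arbitrary illustrative size, not anything from a real patch. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define BATCH_BLOCKS 64UL                  /* ~256 KiB with 4 KiB blocks */

static pthread_mutex_t inode_lock = PTHREAD_MUTEX_INITIALIZER;

static void move_blocks(unsigned long first, unsigned long count)
{
    /* placeholder for the real block relocation */
    printf("moved blocks %lu..%lu\n", first, first + count - 1);
}

static void reorg_file(unsigned long nr_blocks)
{
    unsigned long done = 0;

    while (done < nr_blocks) {
        unsigned long count = nr_blocks - done;

        if (count > BATCH_BLOCKS)
            count = BATCH_BLOCKS;

        pthread_mutex_lock(&inode_lock);   /* held only for one small    */
        move_blocks(done, count);          /* batch - milliseconds, not  */
        pthread_mutex_unlock(&inode_lock); /* the whole file             */

        done += count;
        sched_yield();                     /* give blocked writers a     */
    }                                      /* chance to refill the queue */
}

int main(void)
{
    reorg_file(200);
    return 0;
}

The point is simply that each lock hold is bounded by the batch size,
so writers never wait longer than one small copy.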

I believe Oracle has to do so because their backups would otherwise
hold locks in place for extended periods, and obviously an enterprise
DB cannot shut down for extended periods.

Not to be snide, but if you truly feel a design that does use inode
locking to get the job done is unacceptable, then you should post your
objections on the ext4 list.

I reviewed the ext4_defrag patch and it holds the lock for 64MB at a
time.  So not for the full file, but not for just a few blocks either.

Personally I think 64MB is too much data to process under one lock.
That will require 128MB of working space if you try to keep both the
source and the destination in RAM at the same time.  Especially for
the netbooks that are causing so much developer activity right now,
using 128MB for this is unacceptable.
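
Just to spell out the arithmetic (the 4MB alternative below is only a
number I am throwing out, not anything from the patch):

/* Working space if both the source and destination chunks are kept
 * in RAM at once: footprint = 2 * chunk_size. */
#include <stdio.h>

int main(void)
{
    unsigned long mib = 1024UL * 1024UL;
    unsigned long chunks[] = { 64 * mib, 4 * mib };  /* current vs smaller */

    for (int i = 0; i < 2; i++)
        printf("chunk %3lu MiB -> working space %3lu MiB\n",
               chunks[i] / mib, 2 * chunks[i] / mib);
    return 0;
}

At 4MB per lock hold the working space drops to 8MB, which is much
easier to justify on a memory-constrained machine.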

I am not on that list, so I don't know if they have discussed the
issue or not.  Possibly they will rework the lock holding time in the
future.  Or possibly there is more mutex logic going on than what I
saw at first glance.

> --
> Regards,
> Peter Teoh
>

Greg
-- 
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com


