Re: Copying Data Blocks

On Wed, Jan 14, 2009 at 10:42 AM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>
>>
>> Well, Well, Well
>> Firstly, the source inode remains intact; only the pointers in the
>> source inode are updated.
>> And we don't freeze the whole FS, we just take a lock on the inode.
>> So, you can operate on all other inodes. We intend to reduce the
>> granularity of locking from per inode to per block, sometime in the
>> near future for sure.
>>
>
> Inode-level or block-level locking is not good for performance, and I
> suspect it is also highly susceptible to deadlock scenarios.  Normally
> it should be point-in-time...like that of Oracle...

Peter,

I can see that holding the inode lock for the entire duration of a file
reorg would be unacceptable at some later point in the development /
release cycle.

But why would block level be bad?  I would envision it as something like:

===
preallocate all blocks in destination file

for each block in file
  prefetch block into cache
  lock updates to all blocks for the given inode
  copy data from pre-fetched source block to pre-allocated dest block
  release lock
===

Thus the lock is only held for the duration of the data copy, which is
a RAM-based activity.
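
Roughly what I have in mind, as a userspace analogue only (a pthread
mutex stands in for whatever lock the kernel side would use; the names
are mine, not the HSM patch): the slow prefetch happens outside the
lock, and the lock is held just for the memcpy.

===
/*
 * Userspace analogue of the loop above.  All names are hypothetical.
 * The point: the prefetch (disk read) runs unlocked, the lock is held
 * only for the in-memory copy.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define NR_BLOCKS  256

struct reorg_ctx {
    pthread_mutex_t lock;   /* "lock updates to all blocks for the inode" */
    char *src;              /* source file blocks */
    char *dst;              /* destination blocks, already preallocated */
};

/* Stand-in for prefetching a source block into the cache; in the kernel
 * this would sleep on disk I/O, so no lock is held around it. */
static void prefetch_block(struct reorg_ctx *c, size_t i, char *buf)
{
    memcpy(buf, c->src + i * BLOCK_SIZE, BLOCK_SIZE);
}

static void reorg_copy(struct reorg_ctx *c)
{
    char buf[BLOCK_SIZE];
    size_t i;

    for (i = 0; i < NR_BLOCKS; i++) {
        prefetch_block(c, i, buf);                        /* slow, unlocked */

        pthread_mutex_lock(&c->lock);                     /* lock updates   */
        memcpy(c->dst + i * BLOCK_SIZE, buf, BLOCK_SIZE); /* RAM-only copy  */
        pthread_mutex_unlock(&c->lock);                   /* release lock   */
    }
}

int main(void)
{
    struct reorg_ctx c;

    pthread_mutex_init(&c.lock, NULL);
    c.src = calloc(NR_BLOCKS, BLOCK_SIZE);
    c.dst = calloc(NR_BLOCKS, BLOCK_SIZE);
    if (!c.src || !c.dst)
        return 1;

    reorg_copy(&c);
    printf("copied %d blocks; lock held only during each memcpy\n", NR_BLOCKS);

    free(c.src);
    free(c.dst);
    return 0;
}
===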

I'm curious: how long does ext4 hold a lock during its defrag implementation?

> first read this email:
>
> https://kerneltrap.org/mailarchive/linux-fsdevel/2008/4/21/1523184/thread
>
> (I elaborated on filesystem level integrity)

I read it but don't see the relevance.

As I understand what Sandeep and team have done, there is not an
integrity issue that I can see.  I would like to see the actual code,
but conceptually it sounds fine.

And I think I saw a message that said the write path is unmodified.
That can be true with a single lock around the entire re-org process.
Once you move to a block-level lock, the write path will have to be
modified.  See my much earlier pseudocode email.  My pseudocode does
not address writes that extend the file length.  Obviously that would
have to be handled as well.
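
For what it's worth, this is the kind of change I mean, again only as a
userspace analogue (the hash of mutexes, the bucket count, and all the
names are invented for illustration): once the lock is per block, the
write path has to take the same lock the re-org thread takes before it
touches a block.

===
/*
 * Userspace analogue of a block-granular write path.  Invented names;
 * not HSM code.  The write path can no longer stay unmodified: it must
 * serialize with the re-org copy of the same block.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NR_LOCK_BUCKETS 64
#define BLOCK_SIZE      4096

static pthread_mutex_t block_locks[NR_LOCK_BUCKETS];

static void init_block_locks(void)
{
    int i;

    for (i = 0; i < NR_LOCK_BUCKETS; i++)
        pthread_mutex_init(&block_locks[i], NULL);
}

static pthread_mutex_t *lock_for_block(unsigned long blocknr)
{
    return &block_locks[blocknr % NR_LOCK_BUCKETS];
}

/* Modified write path: take the per-block lock before updating the block.
 * Writes that extend the file length would also need to coordinate with
 * the re-org's view of the file size (not shown). */
static void write_block(char *block, const char *data, unsigned long blocknr)
{
    pthread_mutex_t *l = lock_for_block(blocknr);

    pthread_mutex_lock(l);
    memcpy(block, data, BLOCK_SIZE);
    pthread_mutex_unlock(l);
}

int main(void)
{
    static char block[BLOCK_SIZE], data[BLOCK_SIZE] = "hello";

    init_block_locks();
    write_block(block, data, 42);
    printf("wrote block 42 under its bucket lock\n");
    return 0;
}
===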

> then read the direct reply from a Oracle guy to me:
>
> Peter Teoh wrote:
>
>    invalidate the earlier fsck results.   This idea has its equivalence
>    in the Oracle database world - "online datafile backup" feature, where
>    all transactions goes to memory + journal logs (a physical file
>    itself), and datafile is frozen for writing, enabling it to be
>    physically copied):
>
>
> Sunil Mushran from Oracle==>
>
> Actually, no. The dbfile is not frozen. What happens is that the
> redo is generated in such a way that fractured blocks can be fixed
> on restore. You will notice the redo size increase when online
> backup is enabled.

Most DBs use a full transaction log (journal).  They can freeze the DB
/ caches / etc. by utilizing the log.

Ext3 is typically run with only a metadata journal.  It does have
support for full data journaling (data=journal) as well.

In the very long term, I can see that the re-org might be optimized by
leveraging the data logging code, but I think that can wait.

I again wonder whether the ext4 defrag logic uses it.  If so, HSM could
leverage that design and basically piggyback on top of it, but only
for an ext4 implementation.

>> Secondly, for files which are already opened, if an application tries
>> to do a read/write it sleeps, but doesn't break.
>> The period of locking will depend on the file size.
>> See, we take a lock, we read the source inode size, allocate the
>> required number of blocks in the dest inode, and exchange the 15 block
>> pointers, release the lock, mark the source inode dirty, and delete
>> the dummy inode.
>> As there is no copy of data, the time will not be much.  For a 10GB
>> file the time for relocation was in seconds.
>>
>>> If you choose the second you might freeze the FS for a long time, and
>>> if you choose the first then how do you plan to handle the case below?
>>>
>> The cost will be very high if I freeze the FS.  The inode lock saves
>> us to some extent here.
>>
>>
>>> a) Application opens a file for writes (remember space checks are done
>>> at this place and blocks are preallocated only in memory).
>>> b) During relocation and before deletion of your destination inode you
>>> are using 2X size of your inode.
>>> c) Now if you unfreeze your FS, it might get ENOSPC in this window.
>>>

When it says blocks are only preallocated in memory, what does that mean?

I would think you would "allocate" real sectors on real drives.  The
fact that you don't flush the allocation info out to disk is
unimportant.  You own the blocks, and no other filesystem activity can
steal them from you.  Thus the ENOSPC should occur during the
preallocate.

And it should be relatively easy to have a minimum-freespace check
prior to doing the preallocate.

i.e. require room for one full copy of the file being reorganized,
plus X%.  Eventually X will need to be a userspace-settable value, but
at this early R&D stage it could be hardcoded to 0%.
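
A minimal sketch of that check, done from userspace with statfs() just
to show the arithmetic (the in-kernel version would look at the
superblock's free-block counters instead; the names here are mine):

===
/*
 * Require room for one full copy of the file plus X% headroom before
 * starting the re-org.  X is hardcoded to 0 for now.  Illustrative
 * userspace sketch, not the HSM code.
 */
#include <sys/vfs.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>

#define HEADROOM_PCT 0ULL   /* eventually a userspace-settable knob */

static int reorg_space_check(const char *mountpoint, const char *file)
{
    struct statfs fs;
    struct stat st;
    unsigned long long file_blocks, needed;

    if (statfs(mountpoint, &fs) != 0 || stat(file, &st) != 0)
        return -errno;

    file_blocks = ((unsigned long long)st.st_size + fs.f_bsize - 1)
                  / fs.f_bsize;
    needed = file_blocks + (file_blocks * HEADROOM_PCT) / 100;

    /* Refuse to start the re-org rather than hit ENOSPC halfway through. */
    if ((unsigned long long)fs.f_bavail < needed)
        return -ENOSPC;

    return 0;
}

int main(int argc, char **argv)
{
    int rc;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <mountpoint> <file>\n", argv[0]);
        return 1;
    }

    rc = reorg_space_check(argv[1], argv[2]);
    printf("space check: %s\n", rc == 0 ? "ok" : "insufficient (would ENOSPC)");
    return rc ? 1 : 0;
}
===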

>
> The above is just one possible problem....many deadlocks are possible...

Deadlocks are always possible.  The HSM team just has to find them and
squash them.

The deadlock issues in this process seem pretty minimal to me, but
maybe I'm missing something.

> But I am amazed by Oracle's strategy....it is really good for performance.
>

Filesystem backups have been doing the following for at least a decade:

Quiesce Application
Quiesce Filesystem
Create read-only Snapshot
Release Filesystem
Release Application
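
The Quiesce Filesystem / Release Filesystem pair maps onto the
FIFREEZE / FITHAW ioctls on kernels recent enough to provide them (they
are the mechanism behind xfs_freeze / fsfreeze).  A rough sketch, with
the snapshot step left as a comment:

===
/*
 * Freeze a mounted filesystem, take a snapshot, thaw it again.
 * Requires a kernel that exports FIFREEZE/FITHAW; error handling is
 * minimal.  Illustrative only.
 */
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/fs.h>    /* FIFREEZE, FITHAW */
#include <stdio.h>

int main(int argc, char **argv)
{
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (ioctl(fd, FIFREEZE, 0) != 0) {          /* Quiesce Filesystem */
        perror("FIFREEZE");
        close(fd);
        return 1;
    }

    /* ... create the read-only snapshot here (LVM / DM / array level) ... */

    if (ioctl(fd, FITHAW, 0) != 0)              /* Release Filesystem */
        perror("FITHAW");

    close(fd);
    return 0;
}
===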

==>Quiesce Application

Enterprise-class databases and other applications have the quiesce
feature implemented in various ways.

The strategy you describe for Oracle is their implementation of the
quiesce step for Oracle.

Once the HSM team has a solid implementation done for generic
applications, I agree they should evaluate adding userspace calls into
the major applications to tell them to quiesce themselves.  By doing
that, those applications may experience even less impact from any
locking activity the HSM kernel code has to implement.

==>Quiesce Filesystem

The Quiesce Filesystem concept above has a design goal of creating a
stable block device, but that is not needed by the HSM team.  Instead
they only need to worry about the in-memory representation of the
inodes and data blocks associated with the file under re-org.

The block device does not need to be static / locked for that to happen.

> Check this product:
>
> http://www.redbooks.ibm.com/abstracts/redp4065.html?Open
>
> and this one:
>
> http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg246786.html?Open
>
> and within the above document:
>
> FlashCopy
> The primary objective of FlashCopy is to very quickly create a
> point-in-time copy of a source volume on a target volume. The benefits
> of FlashCopy are that the point-in-time target copy is immediately
> available for use for backups or testing and that the source volume is
> immediately released so that applications can continue processing with
> minimal application downtime. The target volume can be either a logical
> or physical copy of the data, with the latter copying the data as a
> background process. In a z/OS environment, FlashCopy can also operate
> at a data set level.
>
> So, the same features as yours....done point-in-time.
>
> comments?

Tools like FlashCopy block I/O as it leaves the filesystem code and
goes into the underlying block device.

FYI: Device Mapper snapshots are very similar to the above.

Not useful for this application in my opinion.

> --
> Regards,
> Peter Teoh

Greg
-- 
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com


