Re: Copying Data Blocks

"Sandeep K Sinha" <sandeepksinha@xxxxxxxxx> · Thu, 15 Jan 2009 03:48:49 +0530

Hi,

On Wed, Jan 14, 2009 at 10:43 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
> On Wed, Jan 14, 2009 at 10:42 AM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>>
>>>
>>> Well, Well, Well
>>> Firstly, the source inode remains intact; only the pointers in the
>>> source inode are updated.
>>> And we don't freeze the whole FS, we just take a lock on the inode.
>>> So, you can operate on all other inodes. We intend to reduce the
>>> granularity of locking from per inode to per block sometime in the
>>> near future, for sure.
>>>
>>
>> Inode-level or block-level locking is not good for performance, and I
>> suspect it is also highly susceptible to deadlock scenarios. Normally
>> it should be point in time, like that of Oracle...
>
> Peter,
>
> I can see the inode locking for the entire duration of a file reorg
> would be unacceptable at some later point in the development / release
> cycle.
>
That's exactly what we intend. We are planning to get down to block
level in our later milestones. I have mentioned that earlier as well.

> But why would block level be bad?  I would envision it something like:
>
> ===
> preallocate all blocks in destination file
>
> for each block in file
>  prefetch block into cache
>  lock updates to all blocks for the given inode
>  copy data from pre-fetched source block to pre-allocated dest block
>  release lock
>
> Thus the lock is only held for the duration of the data copy, which is
> a RAM-based activity.
>

The only problem that we saw here was a change in the length of a file
that is already open. So, in order to avoid that in the first
implementation, we are taking a lock for the complete re-org.
We will have to move away from this whole-re-org lock for sure.
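
For reference, the per-block scheme in Greg's pseudo code above could look
roughly like the following user-space Python sketch. It only illustrates the
lock scope and is not OHSM code: `reorg_copy`, the list-of-bytes "file" and
the `threading.Lock` standing in for the inode lock are all assumptions made
for illustration, and it deliberately ignores writes that change the file
length.

```python
import threading

def reorg_copy(src_blocks, inode_lock):
    """Copy src_blocks into a preallocated destination, one lock hold per block."""
    # Preallocate every destination slot up front (hypothetical stand-in
    # for real block allocation on the destination tier).
    dst_blocks = [None] * len(src_blocks)

    for i in range(len(src_blocks)):
        # "Prefetch" outside the lock; in a kernel this would be the slow I/O.
        _prefetched = src_blocks[i]
        with inode_lock:
            # The lock is held only for the in-memory copy. The block is
            # re-read under the lock so an update that landed after the
            # prefetch is not lost.
            dst_blocks[i] = bytes(src_blocks[i])
    return dst_blocks
```

A writer that takes the same lock before updating a block would be blocked
for the duration of one block copy at most, which is the point of the scheme.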

> I'm curious how long ext4 holds a lock during its defrag implementation?
>

Not very sure, but I will look into it and update you. It could be of
great use to us.

>> first read this email:
>>
>> https://kerneltrap.org/mailarchive/linux-fsdevel/2008/4/21/1523184/thread
>>
>> (I elaborated on filesystem level integrity)
>
> I read it but don't see the relevancy.
>
> As I understand what Sandeep and team have done, there is not an
> integrity issue that I see.  I would like to see the actual code, but
> conceptually it sounds fine.
>
> And I think I saw a message that said the write path is unmodified.

Yes, that's true.

> That can be true with a single lock around the entire re-org process.
> Once you move to a block level lock, the write path will have to be
> modified.  See my much earlier pseudo code email.  My pseudo code does
> not address writes that extend the file length.  Obviously that would
> have to be handled as well.
>

I will tell you frankly, after our analysis and team discussions, the
algorithm given by Greg will work perfectly fine.

But what about an increase in the file length? The problem is that
we need to allocate the required number of blocks before actually
copying the data. And while copying these blocks, we need a lock on
the inode.

My point is that OHSM works on files as objects. Keeping that in
mind, is it possible to move everything completely to the block level?

Kindly suggest.

I think if our granularity had been a "block", then it would
have been best to move everything to the block level. I have heard
of systems where the whole concept is applied at the disk level too.

>> then read the direct reply from an Oracle guy to me:
>>
>> Peter Teoh wrote:
>>
>>    invalidate the earlier fsck results.   This idea has its equivalence
>>    in the Oracle database world - "online datafile backup" feature, where
>>    all transactions goes to memory + journal logs (a physical file
>>    itself), and datafile is frozen for writing, enabling it to be
>>    physically copied):
>>
>>
>> Sunil Mushran from Oracle==>
>>
>> Actually, no. The dbfile is not frozen. What happens is that the
>> redo is generated in such a way that fractured blocks can be fixed
>> on restore. You will notice the redo size increase when online
>> backup is enabled.
>
> Most DBs use a full transaction log (journal).  They can freeze the DB
> / caches / etc. by utilizing the log.
>
> Ext3 is typically run with only a metadata log.  It does have support
> for a data log (journal).
>
> In the very long term, I can see that the re-org might be optimized by
> leveraging the data logging code, but I think that can wait.
>
> I again wonder if the ext4 defrag logic uses it?  If so, HSM could
> leverage their design and basically piggy back on top of it.  But only
> for an ext4 implementation.
>
Makes sense.
Yes, we have started analyzing this. We will update you on this too.
I think it should be helpful for us.

>>> Secondly, for files which are already opened, if an application tries
>>> to do a read/write it sleeps, but doesn't break.
>>> The period of locking will depend on the file size.
>>> See, we take a lock, we read the source inode size, allocate the required
>>> number of blocks in the dest inode, exchange the 15 block pointers, release
>>> the lock, mark the source inode dirty, and delete the dummy inode.
>>> As there is no copy of data, the time will not be much. For a 10GB
>>> file the time for relocation was in seconds.
>>>
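
A toy user-space model of the sequence quoted above (lock, allocate in the
destination, exchange block pointers, release, drop the dummy inode) might
look like this. `ToyInode`, `relocate` and the `allocate` callback are
hypothetical names for illustration only; the 15-slot array mirrors the
ext2/ext3 `i_block` layout (12 direct plus 3 indirect pointers). How data
reaches the newly allocated blocks is outside this sketch, and none of it is
taken from the OHSM sources.

```python
import threading

N_BLOCK_PTRS = 15  # ext2/ext3 i_block: 12 direct + 3 indirect pointers

class ToyInode:
    """Minimal stand-in for an inode: a size, 15 block pointers, a lock."""
    def __init__(self, size=0, ptrs=None):
        self.size = size
        self.i_block = list(ptrs) if ptrs is not None else [0] * N_BLOCK_PTRS
        self.dirty = False
        self.lock = threading.Lock()

def relocate(src, allocate):
    """Point src at newly allocated blocks; return the old pointers for freeing."""
    with src.lock:  # concurrent readers/writers of src would sleep here
        dummy = ToyInode(size=src.size)
        # Hypothetical allocator: returns 15 pointers on the destination tier.
        dummy.i_block = allocate(src.size)
        # Exchange the 15 block pointers between the source and dummy inode.
        src.i_block, dummy.i_block = dummy.i_block, src.i_block
    src.dirty = True       # mark the source inode dirty after unlocking
    old = dummy.i_block    # the dummy now owns the old blocks
    del dummy              # models "delete dummy inode"
    return old
```

Because only the pointer array changes hands under the lock, the lock hold
time in this model does not grow with the amount of data in the file.
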
>>>> If you choose second you might freeze the FS for a long time and if
>>>> you choose first then how do you plan to handle the below case.
>>>>
>>> The cost will be very high if I freeze the FS. The inode lock saves us
>>> to some extent here.
>>>
>>>
>>>> a) Application opens a file for writes (remember space checks are done
>>>> at this place and blocks are preallocated only in memory).
>>>> b) During relocation and before deletion of your destination inode you
>>>> are using 2X size of your inode.
>>>> c) Now if you unfreeze your FS, it might get ENOSPC in this window.
>>>>
>
> When it says blocks are only preallocated in memory, what does that mean?
>

Preallocation is just a hint that tells you where you can
start hunting for new blocks.
Not very sure what Peter meant.

> I would think you would "allocate" real sectors on real drives.  The
> fact that you don't flush the allocation info out to disk is
> unimportant.  You own the blocks, and no other filesystem activity can
> steal it from you.  Thus the ENOSPC should occur during the
> preallocate.
>
> And it should be relatively easy to have a min freespace check prior
> to doing the preallocate.
>

We warn the admin at two stages: 70% (WARNING) and 90% (CRITICAL) full.
Also, admins can get the state of tiers at any point in time through
/proc or /sys interfaces.
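
As a rough illustration of the two warning stages, a classification like the
one below would do. The function name `tier_state`, the "OK" label and the
choice to treat the thresholds as inclusive are assumptions for this sketch;
only the 70% and 90% figures come from the text above.

```python
WARNING_PCT = 70    # from the text: warn at 70% full
CRITICAL_PCT = 90   # from the text: critical at 90% full

def tier_state(used_blocks, total_blocks):
    """Classify a tier by how full it is (thresholds assumed inclusive)."""
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    pct_full = 100 * used_blocks / total_blocks
    if pct_full >= CRITICAL_PCT:
        return "CRITICAL"
    if pct_full >= WARNING_PCT:
        return "WARNING"
    return "OK"
```
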
> ie. Require room for one full copy of file being re-organized, plus
> X%.  Eventually X% will need to be a userspace settable value, but at
> this early R&D stage it could be hardcoded to 0%.
>

This can be done, marking it for later work.
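
A minimal sketch of the check Greg suggests, assuming X% is taken as a
percentage of the size of the file being re-organized (the mail does not say
what X% is relative to). `has_room_for_reorg` and its parameter names are
hypothetical.

```python
def has_room_for_reorg(free_bytes, file_size, margin_pct=0):
    """True if there is room for one full copy of the file plus margin_pct percent."""
    if file_size < 0 or margin_pct < 0:
        raise ValueError("file_size and margin_pct must be non-negative")
    # One full copy of the file, plus the configurable safety margin.
    required = file_size * (100 + margin_pct) / 100
    return free_bytes >= required
```

From user space, the free byte count could be obtained with `os.statvfs(path)`
as `f_bavail * f_frsize`; with `margin_pct=0` this reduces to the hardcoded
0% case mentioned above.
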
>>
>> The above is just one possible problem... many deadlocks are possible...
>
> Deadlocks are always possible.  The HSM team just has to find them and
> squash them.
>
:-)

> The deadlock issues in this process seem pretty minimal to me, but
> maybe I'm missing something.
>

I have my fingers crossed.

>> But I am amazed by Oracle's strategy....it is really good for performance.
>>
>
> File system backup has been doing:
>
> Quiesce Application
> Quiesce Filesystem
> Create readonly Snapshot
> Release Filesystem
> Release Application
>
> for at least a decade.
>
> ==>Quiesce Application
>
> Enterprise class databases and other applications have the Quiesce
> feature implemented in various ways.
>
> The strategy you describe for Oracle is their implementation of Quiesce Oracle.
>
> Once the HSM team has a solid implementation done for generic
> applications, I agree they should evaluate having user space calls
> into the major applications that tell them to quiesce themselves.  By
> doing that, those applications may experience even less impact from
> any locking activity the HSM kernel has to implement.
>
> ==>Quiesce Filesystem
>
> The Quiesce Filesystem concept above has a design goal of creating a
> stable block device, but that is not needed by the HSM team.  Instead
> they need to only worry about the in memory representation of the
> inodes and data blocks associated with the file under re-org.
>
> The block device does not need to be static / locked for that to happen.
>

Thanks for this info. This is great.

We are surely looking forward to this.
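
The backup sequence Greg lists above (quiesce application, quiesce
filesystem, snapshot, then release in reverse order) maps naturally onto
nested scopes. The sketch below only models the ordering; `quiesced`,
`snapshot_with_quiesce` and the log strings are made up for illustration and
do not call into any real application or filesystem.

```python
from contextlib import contextmanager

@contextmanager
def quiesced(name, log):
    """Record a quiesce on entry and the matching release on exit."""
    log.append(f"quiesce {name}")
    try:
        yield
    finally:
        # Runs even if the snapshot fails, so nothing stays quiesced.
        log.append(f"release {name}")

def snapshot_with_quiesce(take_snapshot):
    """Run take_snapshot() with application and filesystem quiesced."""
    log = []
    with quiesced("application", log):
        with quiesced("filesystem", log):
            log.append("snapshot")
            take_snapshot()
    return log
```

Nesting guarantees that the filesystem is released before the application,
matching the order in the list above.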

>> Check this product:
>>
>> http://www.redbooks.ibm.com/abstracts/redp4065.html?Open
>>
>> and this one:
>>
>> http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg246786.html?Open
>>
>> and within the above document:
>>
>> FlashCopy
>> The primary objective of FlashCopy is to very quickly create a
>> point-in-time copy of a source
>> volume on a target volume. The benefits of FlashCopy are that the
>> point-in-time target copy is
>> immediately available for use for backups or testing and that the
>> source volume is
>> immediately released so that applications can continue processing with
>> minimal application
>> downtime. The target volume can be either a logical or physical copy
>> of the data, with the
>> latter copying the data as a background process. In a z/OS
>> environment, FlashCopy can also
>> operate at a data set level.
>>
>> So same features like yours....done point-in-time.
>>
>> comments?
>
> Tools like FlashCopy block i/o leaving the filesystem code and going
> into the underlying block device.
>
> FYI: Device Mapper snapshots are very similar to the above.
>
> Not useful for this application in my opinion.
>
>> --
>> Regards,
>> Peter Teoh
>
> Greg
> --
> Greg Freemyer
> Litigation Triage Solutions Specialist
> http://www.linkedin.com/in/gregfreemyer
> First 99 Days Litigation White Paper -
> http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
>
> The Norcross Group
> The Intersection of Evidence & Technology
> http://www.norcrossgroup.com
>

-- 
Regards,
Sandeep.

"To learn is to change. Education is a process that changes the learner."
