Re: Copying Data Blocks

"Sandeep K Sinha" <sandeepksinha@xxxxxxxxx> · Thu, 15 Jan 2009 23:25:25 +0530

Hi Manish,

On Thu, Jan 15, 2009 at 10:31 PM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote:
> On Thu, Jan 15, 2009 at 10:27 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
>> On Thu, Jan 15, 2009 at 10:41 AM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>> On Thu, Jan 15, 2009 at 10:49 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
>>>>
>>>> I don't think the above paragraph is an issue with re-org as currently
>>>> designed.  Neither for the ext4_defrag patchset that is under
>>>> consideration for acceptance, nor the work the OHSM team is doing.
>>>>
>>>
>>> well...it boils down to probability....the lower level the locks, the
>>> more complex it gets....and Nick Piggin echoed this, to quote from
>>> article:
>>>
>>> http://lwn.net/Articles/275185/:  (Toward better direct I/O scalability)
>>>
>>> "There are two common approaches to take when faced with this sort of
>>> scalability problem. One is to go with more fine-grained locking,
>>> where each lock covers a smaller part of the kernel. Splitting up
>>> locks has been happening since the initial creation of the Big Kernel
>>> Lock, which is the definitive example of coarse-grained locking. There
>>> are limits to how much fine-grained locking can help, though, and the
>>> addition of more locks comes at the cost of more complexity and more
>>> opportunities to create deadlocks. "
>>>
>>>>
>>>> Especially with rotational media, the call stack at the filesystem
>>>
>>> be aware of SSD....and they are coming down very fast in terms of
>>> cost.   right now....IBM is testing 4TB SSD.......discussed in a
>>> separate thread.   (not really sure about properties of SSD....but I
>>> think physical contiguity of data may not matter any more, as there
>>> are no moving heads to read the data?)
>>
>> I'm very aware of SSD.  I've been actively researching it for the last
>> week or so.  That is why I was careful to say rotational media is slow.
>>
>> Third generation SSD is spec'ing its random i/o speed and its
>> sequential i/o speed separately.
>>
>> The first couple generations tended to only spec. sequential because
>> random was so bad they did not want to advertise it.
>>
>>>> layer is just so much faster than the drive, that blocking access to
>>>> the write queue for a few milliseconds while some block level re-org
>>>
>>> how about doing it in-memory?  ie, reading the inode blocks (which can
>>> be scattered all over the place) into memory as a contiguous chunk.
>>> then allocate the inodes sequence...physically contiguously....and
>>> then write to it in sequence.   so there exists COPY + PHYSICAL-REORG
>>> at the same time.....partly through memory?   so while this is
>>> happening, and the source blocks got modified....then the memory for
>>> destination blocks will be updated immediately....no time delay.
>>>
>> Doing it in memory is what I think the goal should be.
>>
>> I don't think the ext4_defrag patchset accomplishes that, but maybe
>> I'm missing something.
>>
>> I think I've said it before, but I would think the best real world
>> implementation would be:
>>
>> ===
>> pre-allocate destination data blocks
>>
>> For each block
>>  prefetch source data block
>>  lock inode
>>  copy source data block to dest data block IN MEMORY ONLY and put in
>> block queue for delivery to disk
>>  release lock
>> end
>>
>> perform_inode_level_block_pointer_swap
>> ===
>>
>> thus the lock is only held long enough to perform a memory copy of one block.
>
> No, the lock has to be held till the destination buffer hits disk,
> because HSM is not copying the block; it just points to the buffer
> from the source inode. So we need to ensure that it doesn't get
> modified till the destination buffer is written. Of course this
> contention can be limited only till assignment of the pointers, if
> you can set some flag on the parent inode/page which tells anyone who
> needs to modify it to do a copy on write.
>

I totally agree with you that we will have to wait till the data
hits the disk.
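
For reference, the per-block loop Greg outlines above can be sketched
in user space roughly as follows. This is only a toy model in Python,
not kernel code: Inode, reorg, write_queue and BLOCK_SIZE are made-up
names, and the "disk" is just a queue. It copies each block, which is
the case where the lock only covers one memory copy; it does not model
the buffer-sharing case Manish describes, and it does not handle a
source block being modified after it has been copied.

```python
import threading
from collections import deque

BLOCK_SIZE = 4096  # assumed block size, for illustration only


class Inode:
    """Toy stand-in for an inode: a lock plus a list of in-memory blocks."""

    def __init__(self, blocks):
        self.lock = threading.Lock()
        self.blocks = blocks  # list of bytearray, one per data block


def reorg(inode, write_queue):
    # Pre-allocate destination data blocks before taking any lock.
    dest = [bytearray(BLOCK_SIZE) for _ in inode.blocks]

    for i in range(len(inode.blocks)):
        with inode.lock:                      # held for one memory copy only
            dest[i][:] = inode.blocks[i]      # copy source -> dest in memory
            write_queue.append((i, dest[i]))  # queue for delivery to "disk"

    # Inode-level block pointer swap, done once at the end.
    with inode.lock:
        inode.blocks, old = dest, inode.blocks
    return old  # the old blocks, which the caller could now free
```

Calling reorg(ino, deque()) on an Inode with three blocks leaves three
entries in the queue and ino.blocks pointing at the new copies.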

> Of course this contention can be limited only till
> assignment of the pointers,

What do you mean by this?

Even if we take the lock for the whole re-org, we still cannot ensure
that the data hits the disk while we hold the lock. How can we avoid
that? If we implement COW, the write logic would have to be changed.
But we can surely try that too.
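
To make the COW idea concrete, here is a rough user-space sketch in
Python of what a "copy before modify" flag on a page could look like.
It is only an illustration under assumptions: Page, cow_pending,
share_for_reorg and write_page are hypothetical names, not kernel
interfaces. It also ignores how the later write would reach the new
location on disk.

```python
class Page:
    """Toy page: a data buffer plus a flag set while a re-org write is pending."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.cow_pending = False


def share_for_reorg(page):
    # Hand the existing buffer to the destination without copying it,
    # and mark the page so that later writers must copy first.
    page.cow_pending = True
    return page.data


def write_page(page, offset, payload):
    if page.cow_pending:
        # Copy on write: leave the shared buffer untouched for the pending
        # disk write and give the source page a private copy to modify.
        page.data = bytearray(page.data)
        page.cow_pending = False
    page.data[offset:offset + len(payload)] = payload
```

With this, a write that arrives after share_for_reorg() changes only
the page's private copy, so the buffer queued for the destination stays
stable without holding the inode lock until the write completes. The
cost is the extra check in the write path mentioned above.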

Soon I will be able to give you a time estimate for these copy
operations under normal conditions. Maybe that will give Greg enough
information to decide which approach would be better to go ahead
with.

> Thanks -
> Manish
>
>
>>
>>
>>>> Not to be snide, but if you truly feel a design that does use inode
>>>> locking to get the job done is unacceptable, then you should post your
>>>> objections on the ext4 list.
>>>
>>> sorry.....I am just a newbie....and I enjoy discussing all these with
>>> those at my level.....for the ext4 list? well....they already know
>>> that - and I quote from the same article above:
>>>
>>> http://lwn.net/Articles/275185/
>>>
>>> "The other approach is to do away with locking altogether; this has
>>> been the preferred way of improving scalability in recent years. That
>>> is, for example, what all of the work around read-copy-update has been
>>> doing. And this is the direction Nick has chosen to improve
>>> get_user_pages()."
>>>
>>> I will discuss in the list if i can understand 80% to 90% of this
>>> article, which is still far from true :-(.
>>>
>>> Thanks.....
>>>
>>> --
>>> Regards,
>>> Peter Teoh
>>
>> Good luck, and I'm glad you're enjoying the discussion.
>>
>> Personally, I'm just very excited about the idea of a HSM in Linux
>> that will allow SSDs to be more highly leveraged in a tiered storage
>> environment.  As a linux user I think that is one of the most
>> interesting things I've seen discussed in a while.
>>
>> Greg
>> --
>> Greg Freemyer
>> Litigation Triage Solutions Specialist
>> http://www.linkedin.com/in/gregfreemyer
>> First 99 Days Litigation White Paper -
>> http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
>>
>> The Norcross Group
>> The Intersection of Evidence & Technology
>> http://www.norcrossgroup.com
>>
>

-- 
Regards,
Sandeep.

"To learn is to change. Education is a process that changes the learner."
