Re: Copying Data Blocks

On Fri, Jan 16, 2009 at 12:03 AM, Sandeep K Sinha
<sandeepksinha@xxxxxxxxx> wrote:
> Hi Manish,
>
> On Thu, Jan 15, 2009 at 11:54 PM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote:
>> On Thu, Jan 15, 2009 at 11:25 PM, Sandeep K Sinha
>> <sandeepksinha@xxxxxxxxx> wrote:
>>> Hi Manish,
>>>
>>> On Thu, Jan 15, 2009 at 10:31 PM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote:
>>>> On Thu, Jan 15, 2009 at 10:27 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
>>>>> On Thu, Jan 15, 2009 at 10:41 AM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>>>> On Thu, Jan 15, 2009 at 10:49 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>> I don't think the above paragraph is an issue with re-org as currently
>>>>>>> designed, neither for the ext4_defrag patchset that is under
>>>>>>> consideration for acceptance, nor for the work the OHSM team is doing.
>>>>>>>
>>>>>>
>>>>>> well...it boils down to probability....the lower the level of the
>>>>>> locks, the more complex it gets....and Nick Piggin echoed this; to
>>>>>> quote from the article:
>>>>>>
>>>>>> http://lwn.net/Articles/275185/:  (Toward better direct I/O scalability)
>>>>>>
>>>>>> "There are two common approaches to take when faced with this sort of
>>>>>> scalability problem. One is to go with more fine-grained locking,
>>>>>> where each lock covers a smaller part of the kernel. Splitting up
>>>>>> locks has been happening since the initial creation of the Big Kernel
>>>>>> Lock, which is the definitive example of coarse-grained locking. There
>>>>>> are limits to how much fine-grained locking can help, though, and the
>>>>>> addition of more locks comes at the cost of more complexity and more
>>>>>> opportunities to create deadlocks. "
>>>>>>
>>>>>>>
>>>>>>> Especially with rotational media, the call stack at the filesystem
>>>>>>
>>>>>> be aware of SSD....and they are coming down very fast in terms of
>>>>>> cost.   right now....IBM is testing 4TB SSD.......discussed in a
>>>>>> separate thread.   (not really sure about properties of SSD....but I
>>>>>> think physical contiguity of data may not matter any more, as there
>>>>>> are no moving heads to read the data?)
>>>>>
>>>>> I'm very aware of SSDs.  I've been actively researching them for the
>>>>> last week or so.  That is why I was careful to say rotational media
>>>>> is slow.
>>>>>
>>>>> Third-generation SSDs spec their random I/O speed and their
>>>>> sequential I/O speed separately.
>>>>>
>>>>> The first couple of generations tended to spec only sequential,
>>>>> because random was so bad they did not want to advertise it.
>>>>>
>>>>>>> layer is just so much faster than the drive, that blocking access to
>>>>>>> the write queue for a few milliseconds while some block level re-org
>>>>>>
>>>>>> how about doing it in memory?  ie, reading the inode's blocks (which
>>>>>> can be scattered all over the place) into memory as a contiguous
>>>>>> chunk, then allocating the destination blocks in
>>>>>> sequence...physically contiguously....and then writing them out in
>>>>>> sequence.   so there exists COPY + PHYSICAL-REORG at the same
>>>>>> time.....partly through memory?   and if, while this is happening,
>>>>>> the source blocks get modified....then the memory for the
>>>>>> destination blocks will be updated immediately....no time delay.
>>>>>>
>>>>> Doing it in memory is what I think the goal should be.
>>>>>
>>>>> I don't think the ext4_defrag patchset accomplishes that, but maybe
>>>>> I'm missing something.
>>>>>
>>>>> I think I've said it before, but I would think the best real world
>>>>> implementation would be:
>>>>>
>>>>> ===
>>>>> pre-allocate destination data blocks
>>>>>
>>>>> For each block
>>>>>  prefetch source data block
>>>>>  lock inode
>>>>>  copy source data block to dest data block IN MEMORY ONLY and put in
>>>>> block queue for delivery to disk
>>>>>  release lock
>>>>> end
>>>>>
>>>>> perform_inode_level_block_pointer_swap
>>>>> ===
>>>>>
>>>>> thus the lock is only held long enough to perform a memory copy of one block.
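The per-block scheme quoted above can be sketched as a user-space simulation (Python for brevity; the `Lock` stands in for the inode lock, and `relocate` and the allocator are illustrative names, not kernel API):

```python
from threading import Lock

def relocate(src_blocks, allocate_dest, inode_lock):
    """Sketch: pre-allocate destination blocks, copy each block in
    memory under a short-lived lock, then swap pointers in one step."""
    dest_blocks = [allocate_dest() for _ in src_blocks]  # pre-allocate destination
    for src, dst in zip(src_blocks, dest_blocks):
        with inode_lock:        # held only long enough for one in-memory copy
            dst[:] = src        # copy IN MEMORY; the disk write is only queued
    with inode_lock:            # perform_inode_level_block_pointer_swap
        return dest_blocks

inode_lock = Lock()
src = [bytearray(b"a" * 4096), bytearray(b"b" * 4096)]
new_map = relocate(src, lambda: bytearray(4096), inode_lock)
```

The point of the sketch is only the lock scope: the copy happens under the lock, the disk I/O does not, which is exactly the property Manish questions below.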
>>>>
>>>> No, it must be held till the destination buffer hits disk, because HSM
>>>> is not copying the block; instead it just points at the buffer from the
>>>> source inode. So we need to ensure that it doesn't get modified till
>>>> the destination buffer is written. Of course this contention can be
>>>> limited to just the assignment of the pointers, if you can set some
>>>> flag on the parent inode/page which tells anyone who needs to modify
>>>> it to do a copy-on-write first.
>>>>
>>>
>>> I totally agree with what you are saying: we will have to wait till
>>> the data hits the disk.
>>>
>>>>  Of course this contention can be limited only till
>>>> assignment of the pointers ,
>>>
>>> What do you mean by this ??
>>
>> What I meant was that if you have some implementation of COW, you only
>> need to hold the lock till you read the source inode and assign the
>> pointers.
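The copy-on-write idea being discussed could look like this in a user-space sketch (illustrative names throughout; `relocate_pointer` is assumed to run under the inode lock, which is not modeled):

```python
class Block:
    def __init__(self, data):
        self.data = bytearray(data)
        self.cow = False            # set while a relocation shares this buffer

def relocate_pointer(src_inode, dst_inode, idx):
    """Assign pointers only: share the buffer and mark it copy-on-write."""
    blk = src_inode[idx]
    blk.cow = True                  # any later writer must copy first
    dst_inode[idx] = blk            # destination points at the same buffer

def write_block(inode, idx, data):
    blk = inode[idx]
    if blk.cow:                     # shared with a pending relocation:
        blk = Block(blk.data)       # copy the old contents...
        inode[idx] = blk            # ...and redirect this inode's pointer
    blk.data[:len(data)] = data

src, dst = {0: Block(b"old data")}, {}
relocate_pointer(src, dst, 0)
write_block(src, 0, b"new data")    # writer copies; relocation still sees old data
```

This shows why the lock would only be needed for the pointer assignment: a writer that arrives later diverges onto its own copy instead of touching the shared buffer. Whether this can be wired into the real write path is exactly the open question in the thread.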
>>
> Do we really need it?
>
>> But after a bit more thinking, I feel that COW is not going to
>> work. BTW I hope you also plan to invalidate the caches and
>> repopulate them, right? What about clients who do caching on their
>> side? (I am not sure this really matters, because they would be
>> dealing with file offsets and not real file block numbers.)
>>
> That's right. They will work on file offsets and not block numbers.
>
>> Also, I think I missed this point: how do new writes (hole fills
>> etc.) to a relocated file go to the tier-2 disk? Should they go there
>> at all? What kind of policy ensures this?
>>
>
> for holes I have a reference mail:
> From: Chris Snook <csnook@...>
> To: Lars Noschinski <lklml@...>
> Cc: <linux-kernel@...>
> Subject: Re: How does ext2 implement sparse files?
> Date: Thursday, January 31, 2008 - 1:18 pm
>
> In ext2 (and most other block filesystems) all files are sparse files.
> If you write to an address in the file for which no block is allocated,
> the filesystem allocates a block and writes the contents to disk,

Correct, so my question is: after relocation of a file to tier-2, what
policy/method/code guarantees that newly allocated blocks will come
from the tier-2 disk, and should they be allocated from tier-2 at all?

Basically, what I am asking is what HSM does under the following
sequence of events, assuming you have a 4K hole at offset 4K, so
i_data[1] = 0:

a) Application reads the second block and finds a hole.
b) HSM reads, finds a hole, and skips it.
c) Application fills the block. <<== On which disk does this go?
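To make the question concrete, here is a toy model of step (c). Nothing about the relocated file forces the new block onto tier-2; it depends entirely on which allocator the write path consults, which is the policy being asked about. The allocators and the per-inode flag are invented for illustration:

```python
def make_allocator(start):
    """Toy block allocator: hands out consecutive block numbers."""
    state = [start]
    def alloc():
        state[0] += 1
        return state[0]
    return alloc

tier1_alloc = make_allocator(1000)      # tier-1 blocks: 1001, 1002, ...
tier2_alloc = make_allocator(9000)      # tier-2 blocks: 9001, 9002, ...

i_data = [1001, 0, 1002]                # i_data[1] = 0: a 4K hole at offset 4K
relocated_to_tier2 = True               # hypothetical per-inode policy flag

def fill_hole(i_data, idx):
    """Step (c): the write path must pick an allocator; this choice IS the policy."""
    alloc = tier2_alloc if relocated_to_tier2 else tier1_alloc
    i_data[idx] = alloc()

fill_hole(i_data, 1)                    # hole now points at a tier-2 block
```

Without some per-inode marker like `relocated_to_tier2` consulted at allocation time, the default allocator would fill the hole on the original tier.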

> regardless of whether that block is at the end of the file (the usual
> case of lengthening a non-sparse file), in the middle of the file
> (filling in holes in a sparse file), or past the end of the file
> (making a file sparse).
>
>       -- Chris
>> Also, what are you doing with buffers which are already dirty? Do you
>
> Even if they are dirty, we copy the pointers, and we will point to the
> same actual data, which will be dirty.
> And after the swap we mark the new data buffer as dirty anyway.
>
>> wait for them to hit the disk and then reallocate, or move them
>> directly to the new disk? (I think if you just check the page_uptodate
>> flag and see that the page is not locked, we can directly write it to
>> the new destination, but I am not very sure of all the
>> concurrent-access issues here.)
>>
> IMHO, this is not required. Remember that if you take a lock on the
> inode itself, user applications can only queue jobs. They can't access
> the data for the locking period.

Correct, but there might be pages which are already in flight to disk,
and you need to check for them. (pdflush ???)
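The in-flight case can be sketched as a check before the pointer swap: a page already under writeback belongs to the flusher until the I/O completes, much as the kernel waits on a writeback flag, while a merely dirty page can still be redirected. The flag names and `safe_to_swap` are illustrative, not kernel API:

```python
def safe_to_swap(page):
    """Sketch: a page already handed to the disk must be waited on;
    clean or dirty-but-not-in-flight pages can be redirected now."""
    if page["writeback"]:       # pdflush (or its successor) already owns it
        return False            # must wait until writeback completes
    return True

pages = [
    {"dirty": False, "writeback": False},   # clean
    {"dirty": True,  "writeback": False},   # dirty, not yet in flight
    {"dirty": True,  "writeback": True},    # in flight to disk
]
swap_now = [safe_to_swap(p) for p in pages]
```

Holding the inode lock keeps new writers out, but it does not recall I/O that was queued before the lock was taken, which is why this extra check (or a wait) is needed.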

Thanks -
Manish

>
>> More questions in queue :-)
>>
>
> Sure.... :)
>> Thanks -
>> Manish
>>
>>>
>>> If we take a lock for the whole re-org, we still cannot ensure that
>>> the data hits the disk while we hold the lock. How can we avoid that?
>>> If we implement COW, the write logic would need to be changed.
>>> But we can surely try that too.
>>>
>>> Soon I will be able to provide you with a time estimate for these copy
>>> operations under normal conditions. Maybe that will give Greg some
>>> information to take a call on what exactly would be better to go ahead
>>> with.
>>>
>>>> Thanks -
>>>> Manish
>>>>
>>>>
>>>>>
>>>>>
>>>>>>> Not to be snide, but if you truly feel a design that does use inode
>>>>>>> locking to get the job done is unacceptable, then you should post your
>>>>>>> objections on the ext4 list.
>>>>>>
>>>>>> sorry.....I am just a newbie....and I enjoy discussing all this with
>>>>>> those at my level.....as for the ext4 list? well....they already know
>>>>>> that - and I quote from the same article above:
>>>>>>
>>>>>> http://lwn.net/Articles/275185/
>>>>>>
>>>>>> "The other approach is to do away with locking altogether; this has
>>>>>> been the preferred way of improving scalability in recent years. That
>>>>>> is, for example, what all of the work around read-copy-update has been
>>>>>> doing. And this is the direction Nick has chosen to improve
>>>>>> get_user_pages()."
>>>>>>
>>>>>> I will discuss it on the list once I can understand 80% to 90% of
>>>>>> this article, which is still far from true :-(.
>>>>>>
>>>>>> Thanks.....
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Peter Teoh
>>>>>
>>>>> Good luck, and I'm glad you're enjoying the discussion.
>>>>>
>>>>> Personally, I'm just very excited about the idea of an HSM in Linux
>>>>> that will allow SSDs to be more highly leveraged in a tiered storage
>>>>> environment.  As a Linux user I think that is one of the most
>>>>> interesting things I've seen discussed in a while.
>>>>>
>>>>> Greg
>>>>> --
>>>>> Greg Freemyer
>>>>> Litigation Triage Solutions Specialist
>>>>> http://www.linkedin.com/in/gregfreemyer
>>>>> First 99 Days Litigation White Paper -
>>>>> http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
>>>>>
>>>>> The Norcross Group
>>>>> The Intersection of Evidence & Technology
>>>>> http://www.norcrossgroup.com
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Sandeep.
>>>
>>>
>>>
>>>
>>>
>>>
>>> "To learn is to change. Education is a process that changes the learner."
>>>
>>
>
>
>
> --
> Regards,
> Sandeep.
>
>
>
>
>
>
> "To learn is to change. Education is a process that changes the learner."
>

--
To unsubscribe from this list: send an email with
"unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ

