Re: Copying Data Blocks

On Fri, Jan 16, 2009 at 12:13 AM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote:
> On Fri, Jan 16, 2009 at 12:03 AM, Sandeep K Sinha
> <sandeepksinha@xxxxxxxxx> wrote:
>> Hi Manish,
>>
>> On Thu, Jan 15, 2009 at 11:54 PM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote:
>>> On Thu, Jan 15, 2009 at 11:25 PM, Sandeep K Sinha
>>> <sandeepksinha@xxxxxxxxx> wrote:
>>>> Hi Manish,
>>>>
>>>> On Thu, Jan 15, 2009 at 10:31 PM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote:
>>>>> On Thu, Jan 15, 2009 at 10:27 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
>>>>>> On Thu, Jan 15, 2009 at 10:41 AM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>>>>> On Thu, Jan 15, 2009 at 10:49 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> I don't think the above paragraph is an issue with re-org as currently
>>>>>>>> designed, neither for the ext4_defrag patchset that is under
>>>>>>>> consideration for acceptance, nor for the work the OHSM team is doing.
>>>>>>>>
>>>>>>>
>>>>>>> well...it boils down to probability....the lower-level the locks, the
>>>>>>> more complex it gets....and Nick Piggin echoed this; to quote from the
>>>>>>> article:
>>>>>>>
>>>>>>> http://lwn.net/Articles/275185/:  (Toward better direct I/O scalability)
>>>>>>>
>>>>>>> "There are two common approaches to take when faced with this sort of
>>>>>>> scalability problem. One is to go with more fine-grained locking,
>>>>>>> where each lock covers a smaller part of the kernel. Splitting up
>>>>>>> locks has been happening since the initial creation of the Big Kernel
>>>>>>> Lock, which is the definitive example of coarse-grained locking. There
>>>>>>> are limits to how much fine-grained locking can help, though, and the
>>>>>>> addition of more locks comes at the cost of more complexity and more
>>>>>>> opportunities to create deadlocks. "
>>>>>>>
>>>>>>>>
>>>>>>>> Especially with rotational media, the call stack at the filesystem
>>>>>>>
>>>>>>> be aware of SSDs....they are coming down very fast in terms of
>>>>>>> cost.   Right now IBM is testing a 4TB SSD.......discussed in a
>>>>>>> separate thread.   (Not really sure about the properties of SSDs....but
>>>>>>> I think physical contiguity of data may not matter any more, as there
>>>>>>> are no moving heads to read the data?)
>>>>>>
>>>>>> I'm very aware of SSDs.  I've been actively researching them for the
>>>>>> last week or so.  That is why I was careful to say rotational media is
>>>>>> slow.
>>>>>>
>>>>>> Third-generation SSDs spec their random I/O speed and their
>>>>>> sequential I/O speed separately.
>>>>>>
>>>>>> The first couple of generations tended to spec only sequential,
>>>>>> because random was so bad they did not want to advertise it.
>>>>>>
>>>>>>>> layer is just so much faster than the drive, that blocking access to
>>>>>>>> the write queue for a few milliseconds while some block level re-org
>>>>>>>
>>>>>>> how about doing it in memory?  i.e., reading the inode's blocks (which
>>>>>>> can be scattered all over the place) into memory as a contiguous chunk,
>>>>>>> then allocating the new blocks in sequence...physically
>>>>>>> contiguously....and then writing to them in sequence.   So there exists
>>>>>>> COPY + PHYSICAL-REORG at the same time.....partly through memory?   And
>>>>>>> while this is happening, if the source blocks get modified....then the
>>>>>>> memory for the destination blocks will be updated immediately....no
>>>>>>> time delay.
>>>>>>>
>>>>>> Doing it in memory is what I think the goal should be.
>>>>>>
>>>>>> I don't think the ext4_defrag patchset accomplishes that, but maybe
>>>>>> I'm missing something.
>>>>>>
>>>>>> I think I've said it before, but I would think the best real world
>>>>>> implementation would be:
>>>>>>
>>>>>> ===
>>>>>> pre-allocate destination data blocks
>>>>>>
>>>>>> For each block
>>>>>>  prefetch source data block
>>>>>>  lock inode
>>>>>>  copy source data block to dest data block IN MEMORY ONLY
>>>>>>  and put in block queue for delivery to disk
>>>>>>  release lock
>>>>>> end
>>>>>>
>>>>>> perform_inode_level_block_pointer_swap
>>>>>> ===
>>>>>>
>>>>>> thus the lock is only held long enough to perform a memory copy of one block.
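>>>>>>
>>>>>> In buffer-head terms, a rough C sketch of that loop (just a sketch;
>>>>>> ohsm_swap_pointers() and the pre-allocated dst_blk[] array are made-up
>>>>>> names, not code from either patchset):
>>>>>>
>>>>>> ===
>>>>>> #include <linux/fs.h>
>>>>>> #include <linux/buffer_head.h>
>>>>>> #include <linux/string.h>
>>>>>>
>>>>>> static int copy_blocks(struct inode *inode, sector_t *src_blk,
>>>>>>                        sector_t *dst_blk, unsigned long nr)
>>>>>> {
>>>>>>         struct super_block *sb = inode->i_sb;
>>>>>>         unsigned long i;
>>>>>>
>>>>>>         for (i = 0; i < nr; i++) {
>>>>>>                 /* prefetch source data block */
>>>>>>                 struct buffer_head *src = sb_bread(sb, src_blk[i]);
>>>>>>                 struct buffer_head *dst = sb_getblk(sb, dst_blk[i]);
>>>>>>
>>>>>>                 if (!src || !dst) {
>>>>>>                         brelse(src);
>>>>>>                         brelse(dst);
>>>>>>                         return -EIO;
>>>>>>                 }
>>>>>>
>>>>>>                 mutex_lock(&inode->i_mutex);     /* lock inode */
>>>>>>                 /* copy IN MEMORY ONLY ... */
>>>>>>                 memcpy(dst->b_data, src->b_data, sb->s_blocksize);
>>>>>>                 set_buffer_uptodate(dst);
>>>>>>                 /* ... and queue for delivery to disk */
>>>>>>                 mark_buffer_dirty(dst);
>>>>>>                 mutex_unlock(&inode->i_mutex);   /* release lock */
>>>>>>
>>>>>>                 brelse(src);
>>>>>>                 brelse(dst);
>>>>>>         }
>>>>>>
>>>>>>         /* perform_inode_level_block_pointer_swap */
>>>>>>         return ohsm_swap_pointers(inode, dst_blk, nr);
>>>>>> }
>>>>>> ===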
>>>>>
>>>>> No, till the destination buffer hits disk, because HSM is not copying
>>>>> the block; instead it just points to the buffer from the source inode.
>>>>> So we need to ensure that it doesn't get modified till the destination
>>>>> buffer is written. Of course, this contention can be limited to just the
>>>>> assignment of the pointers, if you can set some flag on the parent
>>>>> inode/page which tells anyone who needs to modify the data to do a
>>>>> copy-on-write first.
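>>>>>
>>>>> Roughly (a sketch only, to show where the lock would have to end;
>>>>> dst_bh stands for the destination buffer_head):
>>>>>
>>>>> ===
>>>>> mutex_lock(&inode->i_mutex);
>>>>> /* ... point the destination buffer at the source data ... */
>>>>> mark_buffer_dirty(dst_bh);
>>>>> sync_dirty_buffer(dst_bh);     /* blocks until the write completes */
>>>>> mutex_unlock(&inode->i_mutex); /* only now is it safe to unlock    */
>>>>> ===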
>>>>>
>>>>
>>>> I totally agree with what you are saying, that we will have to wait
>>>> till the data hits the disk.
>>>>
>>>>>  Of course, this contention can be limited to just the
>>>>> assignment of the pointers,
>>>>
>>>> What do you mean by this?
>>>
>>> What I meant was that if you have some implementation of COW, you only
>>> need to hold the lock till you read the source inode and assign the
>>> pointers.
>>>
>> Do we really need it?
>>
>>> But after a bit more thinking, I feel that COW is not going to
>>> work. BTW, I hope you also plan to invalidate the caches and
>>> repopulate them, right? What about clients who do caching on their
>>> side? (I am not sure if this really makes sense, because they would be
>>> dealing with file offsets and not real file block numbers.)
>>>
>> That's right. They will work on file offsets and not block numbers.
>>
>>> Also, I think I missed this point: how do new writes (hole fills
>>> etc.) to a relocated file go to the tier-2 disk? Or should they go
>>> there at all? What kind of policy ensures this?
>>>
>>
>> For holes I have a reference mail:
>> From: Chris Snook <csnook@...>
>> To: Lars Noschinski <lklml@...>
>> Cc: <linux-kernel@...>
>> Subject: Re: How does ext2 implement sparse files?
>> Date: Thursday, January 31, 2008 - 1:18 pm
>>
>> In ext2 (and most other block filesystems) all files are sparse files.
>> If you write to an address in the file for which no block is allocated,
>> the filesystem allocates a block and writes the contents to disk,
>
> Correct, so my question is: after relocation of a file to tier-2, what
> policy/method/code guarantees that the newly allocated blocks will be
> from the tier-2 disk? Or should they be allocated from tier-2 at all?
>
> Basically, what I am asking is about the HSM's action under the
> following sequence of events, assuming you have a 4K hole at offset 4K,
> so i_data[1] = 0:
>
> a) Application reads the second block and finds a hole.
> b) HSM reads, finds a hole and skips it.
> c) Application fills the block <<== On which disk does this go?
>

Good question. I have been expecting this one for a while.

Well, we have home_tier_id and destination_tier_id stored in the inode.
The home_tier_id is set at the time of allocation/creation of the
file, depending upon the allocation policy specified.
If the file doesn't match any allocation policy, its home_tier_id
becomes 0, and it can then be allocated anywhere across the file
system, on any tier.

Now, once we relocate the file, we change the home_tier_id, and any
further allocation of blocks for that file will be from that
home_tier_id only.

Also, for relocation we need the home_tier_id, as the user can specify
a relocation like this: move *.mp3 from tier2 to tier3.

During the FS-scan we read all the inodes and first match the
policy-specified from-tier with the inode's home_tier_id. Then we check
the condition, e.g. whether it is a *.mp3 file, and if it matches we
set the destination_tier_id to the policy-specified to-tier and send
the inode for reallocation. The relocation code allocates blocks only
from the destination_tier_id (see the sketch below).
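
A rough C sketch of that scan step (struct ohsm_policy and all the
helper names here are illustrative only, not the actual OHSM code):

===
#include <linux/fs.h>

struct ohsm_policy {
        int         from_tier;  /* policy-specified from-tier  */
        int         to_tier;    /* policy-specified to-tier    */
        const char *pattern;    /* the condition, e.g. "*.mp3" */
};

static void scan_inode(struct inode *inode, const char *name,
                       const struct ohsm_policy *pol)
{
        /* 1. match the policy's from-tier against home_tier_id */
        if (ohsm_home_tier(inode) != pol->from_tier)
                return;

        /* 2. check the condition, e.g. the *.mp3 name pattern */
        if (!ohsm_name_matches(name, pol->pattern))
                return;

        /* 3. set destination_tier_id and send for reallocation;
         *    the relocation code then allocates blocks only from
         *    this tier */
        ohsm_set_dest_tier(inode, pol->to_tier);
        ohsm_queue_relocation(inode);
}
===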

Also, in the normal case, we allocate data blocks only from the
home_tier_id of the inode.

For general files, the value of home_tier_id is initialized to 0,
which tells the allocator that it can allocate from anywhere across
the file system.
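
So at block-allocation time the tier choice boils down to something
like this (again only a sketch; the helper names and OHSM_ANY_TIER are
made up). This is also why a hole fill after relocation lands on the
new tier: home_tier_id has already been changed by then.

===
static int pick_tier(const struct inode *inode)
{
        int dest = ohsm_dest_tier(inode);  /* destination_tier_id */
        int home = ohsm_home_tier(inode);  /* home_tier_id        */

        if (dest)
                return dest;      /* mid-relocation: target tier only */
        if (home)
                return home;      /* normal case: stay on home tier   */
        return OHSM_ANY_TIER;     /* 0: allocate anywhere             */
}
===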

We will soon be publishing:
1. Allocation FAQs
2. Relocation implementation details

These should answer most of these queries.

>> regardless of whether that block is at the end of the file (the usual
>> case of lengthening a non-sparse file), in the middle of the file
>> (filling in holes in a sparse file), or past the end of the file
>> (making a file sparse).
>>
>>       -- Chris
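>>
>> A tiny userspace demo of that (plain C; block 1 stays a hole, i.e.
>> i_data[1] == 0, until the last pwrite() fills it and the filesystem
>> allocates a block for it only then):
>>
>> ===
>> #define _XOPEN_SOURCE 500
>> #include <fcntl.h>
>> #include <unistd.h>
>>
>> int main(void)
>> {
>>         char buf[4096] = "data";
>>         int fd = open("sparse", O_CREAT | O_RDWR | O_TRUNC, 0644);
>>
>>         pwrite(fd, buf, 4096, 0);    /* block 0 allocated          */
>>         pwrite(fd, buf, 4096, 8192); /* block 2; block 1 is a hole */
>>         pwrite(fd, buf, 4096, 4096); /* hole filled: block 1 is    */
>>                                      /* allocated only now         */
>>         close(fd);
>>         return 0;
>> }
>> ===
>>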
>>> Also what are you doing with buffers which are already dirty? Do you
>>
>> Even if they are dirty, we copy the pointers, and we will point to the
>> same actual data, which will be dirty.
>> And after the swap we mark the new data buffer as dirty anyway.
>>
>>> wait for them to hit the disk and then reallocate, or move them
>>> directly to the new disk? (I think if you just check the page_uptodate
>>> flag and see that it is not locked, we can directly write it to the
>>> new destination, but I am not very sure of all the concurrent access
>>> issues here.)
>>>
>> IMHO, this is not required. Remember that if you take a lock on the
>> inode itself, user applications can only queue requests. They can't
>> access the data for the duration of the lock.
>
> Correct, but there might be pages which are already in flight to disk,
> and you need to check for them (pdflush?).
>
> Thanks -
> Manish
>
>>
>>> More questions in queue :-)
>>>
>>
>> Sure.... :)
>>> Thanks -
>>> Manish
>>>
>>>>
>>>> Even if we take the lock for the whole re-org, we still cannot
>>>> ensure that the data hits the disk while we hold the lock. How can we
>>>> avoid that? If we implement COW, the write logic would have to be
>>>> changed, but we can surely try that too.
>>>>
>>>> Soon I will be able to provide you with time estimates for these
>>>> copy operations under normal conditions. Maybe that will give Greg
>>>> some information to decide what exactly would be better to go ahead
>>>> with.
>>>>
>>>>> Thanks -
>>>>> Manish
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>>> Not to be snide, but if you truly feel a design that does use inode
>>>>>>>> locking to get the job done is unacceptable, then you should post your
>>>>>>>> objections on the ext4 list.
>>>>>>>
>>>>>>> sorry.....I am just a newbie....and I enjoy discussing all this with
>>>>>>> those at my level.....As for the ext4 list? well....they already know
>>>>>>> that - and I quote from the same article above:
>>>>>>>
>>>>>>> http://lwn.net/Articles/275185/
>>>>>>>
>>>>>>> "The other approach is to do away with locking altogether; this has
>>>>>>> been the preferred way of improving scalability in recent years. That
>>>>>>> is, for example, what all of the work around read-copy-update has been
>>>>>>> doing. And this is the direction Nick has chosen to improve
>>>>>>> get_user_pages()."
>>>>>>>
>>>>>>> I will discuss it on the list once I can understand 80% to 90% of
>>>>>>> this article, which is still far from true :-(.
>>>>>>>
>>>>>>> Thanks.....
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Peter Teoh
>>>>>>
>>>>>> Good luck, and I'm glad you're enjoying the discussion.
>>>>>>
>>>>>> Personally, I'm just very excited about the idea of an HSM in Linux
>>>>>> that will allow SSDs to be more highly leveraged in a tiered storage
>>>>>> environment.  As a Linux user I think that is one of the most
>>>>>> interesting things I've seen discussed in a while.
>>>>>>
>>>>>> Greg
>>>>>> --
>>>>>> Greg Freemyer
>>>>>> Litigation Triage Solutions Specialist
>>>>>> http://www.linkedin.com/in/gregfreemyer
>>>>>> First 99 Days Litigation White Paper -
>>>>>> http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
>>>>>>
>>>>>> The Norcross Group
>>>>>> The Intersection of Evidence & Technology
>>>>>> http://www.norcrossgroup.com
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Sandeep.
>>>>
>>>> "To learn is to change. Education is a process that changes the learner."
>>>>
>>>
>>
>>
>>
>> --
>> Regards,
>> Sandeep.
>>
>> "To learn is to change. Education is a process that changes the learner."
>>
>



-- 
Regards,
Sandeep.

"To learn is to change. Education is a process that changes the learner."
