Re: Copying Data Blocks

"Greg Freemyer" <greg.freemyer@xxxxxxxxx> · Fri, 16 Jan 2009 07:49:51 -0500

On Thu, Jan 15, 2009 at 6:24 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
> On Thu, Jan 15, 2009 at 12:55 PM, Sandeep K Sinha
> <sandeepksinha@xxxxxxxxx> wrote:
>> Hi Manish,
>>
>> On Thu, Jan 15, 2009 at 10:31 PM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote:
>>> On Thu, Jan 15, 2009 at 10:27 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
>>>> On Thu, Jan 15, 2009 at 10:41 AM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>>> On Thu, Jan 15, 2009 at 10:49 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
>>>>>>
>>>>>> I don't think the above paragraph is an issue with re-org as currently
>>>>>> designed.  Neither for the ext4_defrag patchset that is under
>>>>>> consideration for acceptance, nor the work the OHSM team is doing.
>>>>>>
>>>>>
>>>>> well...it boils down to probability....the lower level the locks, the
>>>>> more complex it gets....and Nick Piggin echoed this, to quote from
>>>>> article:
>>>>>
>>>>> http://lwn.net/Articles/275185/:  (Toward better direct I/O scalability)
>>>>>
>>>>> "There are two common approaches to take when faced with this sort of
>>>>> scalability problem. One is to go with more fine-grained locking,
>>>>> where each lock covers a smaller part of the kernel. Splitting up
>>>>> locks has been happening since the initial creation of the Big Kernel
>>>>> Lock, which is the definitive example of coarse-grained locking. There
>>>>> are limits to how much fine-grained locking can help, though, and the
>>>>> addition of more locks comes at the cost of more complexity and more
>>>>> opportunities to create deadlocks. "
>>>>>
>>>>>>
>>>>>> Especially with rotational media, the call stack at the filesystem
>>>>>
>>>>> be aware of SSD....and they are coming down very fast in terms of
>>>>> cost.   right now....IBM is testing 4TB SSD.......discussed in a
>>>>> separate thread.   (not really sure about properties of SSD....but I
>>>>> think physical contiguity of data may not matter any more, as there
>>>>> are no moving heads to read the data?)
>>>>
>>>> I'm very aware of SSD.  I've been actively researching it for the last
>>>> week or so.  That is why I was careful to say rotational media is slow.
>>>>
>>>> Third-generation SSDs spec their random I/O speed and their
>>>> sequential I/O speed separately.
>>>>
>>>> The first couple of generations tended to spec only sequential speed
>>>> because random was so bad they did not want to advertise it.
>>>>
>>>>>> layer is just so much faster than the drive, that blocking access to
>>>>>> the write queue for a few milliseconds while some block level re-org
>>>>>
>>>>> how about doing it in-memory?  ie, reading the inode blocks (which can
>>>>> be scattered all over the place) into memory as a contiguous chunk.
>>>>> then allocate the inodes sequence...physically contiguously....and
>>>>> then write to it in sequence.   so there exists COPY + PHYSICAL-REORG
>>>>> at the same time.....partly through memory?   so while this is
>>>>> happening, and the source blocks got modified....then the memory for
>>>>> destination blocks will be updated immediately....no time delay.
>>>>>
>>>> Doing it in memory is what I think the goal should be.
>>>>
>>>> I don't think the ext4_defrag patchset accomplishes that, but maybe
>>>> I'm missing something.
>>>>
>>>> I think I've said it before, but I would think the best real world
>>>> implementation would be:
>>>>
>>>> ===
>>>> pre-allocate destination data blocks
>>>>
>>>> For each block
>>>>  prefetch source data block
>>>>  lock inode
>>>>  copy source data block to dest data block IN MEMORY ONLY and put in
>>>> block queue for delivery to disk
>>>>  release lock
>>>> end
>>>>
>>>> perform_inode_level_block_pointer_swap
>>>> ===
>>>>
>>>> thus the lock is only held long enough to perform a memory copy of one block.
>>>
>>> No, till the destination buffer hits disk, because HSM is not copying
>>> the block,  instead just pointing to the buffer from source inode. So
>>> we need to ensure that it doesn't get modified till the destination
>>> buffer is written. Of course this contention can be limited only till
>>> assignment of the pointers , if you can set some flag on the parent
>>> inode/page which tells that anyone who needs to modify needs to do a
>>> copy on write.
>>>
>>
>> I totally agree with what you are saying: we will have to wait till
>> the data hits the disk.
>
> I must be missing something.  I have not seen the actual code so that
> could be my main confusion.  It is still not available anywhere,
> right?  Could you at least post the current patch of the high-level
> loop that is equivalent to patch 1 of the ext4_defrag() patchset?  I
> realize it is not done, and may not even be functional yet, but it is
> getting hard to continue the conversation without seeing the actual
> code.
>
> IN THE LONG RUN:
>
> I can think of no fundamental reason why the code should be written
> such that it requires the lock be held until the data hits the disk.
> If this is being done to allow the read buffer to be reused as the
> write buffer thus saving a memcpy, I think it is a bad design choice.
>
> Memory operations are very cheap compared to disk operations.
>
> For now doing the entire file while under the lock makes sense.  But
> even there I don't see why the lock has to be held until the very last
> block has hit disk.
>
> Greg

I re-read most of this thread.  I see that it was repeatedly said that
only a pointer was being set in the write buffer.

I think you have two goals:
1) Perform the block move with the least CPU cycles you can.

2) Maintain the semantic purity of the block queues.

Hopefully a way can be found to do both.  If not, honoring the design
of the queues needs to be the priority.

I really hope to see OHSM in the vanilla kernel at some point, and
forcing the lock to be in place until the data goes to disk will not
be accepted by the kernel maintainers in my opinion.  It violates the
conceptual design of the block layer.
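
To make the distinction concrete, here is a rough userspace model of the two approaches. It is Python, not kernel code, and every name in it (Inode, queue_by_pointer, queue_by_copy, flush, the 4096-byte BLOCK_SIZE) is made up for illustration; it is not the OHSM or ext4_defrag implementation. It only shows why a queued pointer keeps tracking later changes to the source buffer, while a queued copy is a stable snapshot once the memcpy is done.

```python
import threading
from collections import deque

BLOCK_SIZE = 4096  # assumed block size for this model


class Inode:
    """Toy stand-in for an inode: a lock plus a list of in-memory blocks."""

    def __init__(self, blocks):
        self.lock = threading.Lock()
        self.blocks = [bytearray(b) for b in blocks]


def queue_by_pointer(inode, write_queue):
    """Queue references to the source buffers (no copy).

    The queued entries alias the live source data, so the inode lock
    would have to stay held until the queue has been flushed.
    """
    for block in inode.blocks:
        write_queue.append(block)


def queue_by_copy(inode, write_queue):
    """Queue private copies, holding the lock for one block copy at a time."""
    for i in range(len(inode.blocks)):
        with inode.lock:
            write_queue.append(bytes(inode.blocks[i]))


def flush(write_queue):
    """Pretend to write the queue to disk; return what would land there."""
    disk = []
    while write_queue:
        disk.append(bytes(write_queue.popleft()))
    return disk


def demo():
    original = [b"A" * BLOCK_SIZE, b"B" * BLOCK_SIZE]

    # Pointer approach: a write arriving before the flush changes
    # what reaches the destination.
    src = Inode(original)
    q = deque()
    queue_by_pointer(src, q)
    src.blocks[0][0:1] = b"X"  # late write to the source buffer
    aliased = flush(q)

    # Copy approach: the same late write leaves the queued data untouched.
    src = Inode(original)
    q = deque()
    queue_by_copy(src, q)
    src.blocks[0][0:1] = b"X"
    copied = flush(q)

    return aliased[0][:1], copied[0][:1]
```

The model deliberately leaves out how a write that arrives after a block has been copied, but before the final block pointer swap, gets reconciled with the destination. That still has to be solved, but it can be solved without pinning the lock until the I/O completes.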

FYI: I'm sure there were other questions posed in this thread you
hoped I would answer.  If they are still unanswered, just "ping" me
about those again.

Greg
-- 
Greg Freemyer