Re: Copying Data Blocks

Hi Greg,

Thanks for such great insights.

On Fri, Jan 16, 2009 at 11:41 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
> On Fri, Jan 16, 2009 at 8:26 AM, Sandeep K Sinha
> <sandeepksinha@xxxxxxxxx> wrote:
>> Hi Greg,
>>
>> On Fri, Jan 16, 2009 at 5:50 AM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
>>> On Thu, Jan 15, 2009 at 12:47 PM, Sandeep K Sinha
>>> <sandeepksinha@xxxxxxxxx> wrote:
>>>> Hey,
>>>>
>>>> On Thu, Jan 15, 2009 at 10:27 PM, Greg Freemyer
>>> <snip>
>>>>> I think I've said it before, but I would think the best real world
>>>>> implementation would be:
>>>>>
>>>>> ===
>>>>> pre-allocate destination data blocks
>>>>>
>>>>> For each block
>>>>>  prefetch source data block
>>>>>  lock inode
>>>>>  copy source data block to dest data block IN MEMORY ONLY and put in
>>>>> block queue for delivery to disk
>>>>>  release lock
>>>>> end
>>>>>
>>>>> perform_inode_level_block_pointer_swap
>>>>> ===
>>>>>
>>>>
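Just so I am sure I read this right, below is roughly how I picture that
loop in kernel-style C. It is only a sketch of the idea above;
ohsm_swap_block_pointers() and the pre-allocated src[]/dst[] arrays are
placeholders, not code we actually have yet.

===
#include <linux/buffer_head.h>
#include <linux/fs.h>
#include <linux/string.h>

/* hypothetical helper that installs the new block numbers in the inode */
extern int ohsm_swap_block_pointers(struct inode *inode, sector_t *dst,
                                    unsigned long nr);

/* copy nr data blocks from src[] into the pre-allocated dst[] blocks */
static int ohsm_copy_blocks(struct inode *inode, sector_t *src,
                            sector_t *dst, unsigned long nr)
{
        struct super_block *sb = inode->i_sb;
        unsigned long i;

        for (i = 0; i < nr; i++) {
                struct buffer_head *sbh, *dbh;

                /* prefetch (read) the source data block */
                sbh = sb_bread(sb, src[i]);
                if (!sbh)
                        return -EIO;

                mutex_lock(&inode->i_mutex);            /* lock inode */

                dbh = sb_getblk(sb, dst[i]);
                if (!dbh) {
                        mutex_unlock(&inode->i_mutex);
                        brelse(sbh);
                        return -EIO;
                }

                /* copy in memory only; mark_buffer_dirty() queues the
                 * destination block for delivery to disk later */
                lock_buffer(dbh);
                memcpy(dbh->b_data, sbh->b_data, sb->s_blocksize);
                set_buffer_uptodate(dbh);
                unlock_buffer(dbh);
                mark_buffer_dirty(dbh);

                mutex_unlock(&inode->i_mutex);          /* release lock */

                brelse(dbh);
                brelse(sbh);
        }

        /* finally, swap the inode's block pointers to the new blocks */
        return ohsm_swap_block_pointers(inode, dst, nr);
}
===
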
>>>> I would be more than happy if I am able to accomplish this. Greg,
>>>> the only problem that I see here is that somebody who has already
>>>> opened the file may keep growing it after I preallocate the
>>>> destination data blocks.
>>>> And I don't see a way to avoid that, but I am surely looking for a way.
>>>>
>>>> I have seen many similar implementations, and most of them suffer
>>>> from this issue. But surely there must be a way to optimize it, if
>>>> not avoid it.
>>>
>>> The way ext4_defrag works, I believe, is to take a lock around the
>>> inode's block list every 64MB, and I assume that under that lock it
>>> has a static list of inode block pointers to work with.
>>>
>>> At the conclusion of the 64MB chunk, it releases the lock and allows
>>> writes to occur.  That includes writes that extend the file.
>>>
>>
>> For us, this granularity is initially the size of the whole file,
>> meaning however many data blocks it has.
>> We could also break the relocation of a file's blocks into 64MB chunks,
>> but then my question would be: why not 100MB, and why not 20MB?
>>
>> It's just a granularity that ext4_defrag happens to use, and I don't
>> think there is any performance philosophy behind it. I would say it
>> adds the extra cost of taking/releasing locks every 64MB. And what if
>> someone else takes the lock and doesn't give it up soon? Your
>> relocation process would be delayed for that reason. I know, above
>> all, the lock period should be kept short for all these reasons.
>>
> Ultimately, I think the granularity should be user configurable.  As
> should the "priority" from a scheduling perspective.
>
> Personally, I would like to see the unit of work be a time slice and
> then have the ioctl return to user space.  That is conceptually
> similar to what ext4_defrag() does, but as you say the 64MB value
> seems arbitrary.
>
> By returning to user space between chunks, the normal task
> scheduler gets back in the loop.
>
> If a user then wants to ensure the re-org is done ASAP, he can use
> nice etc. to raise the user space tool's priority.
>
> If the user wants another app using the files under re-org to have
> priority, then he can lower the user space tool's priority.
>
We can do that.
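Roughly, the user-space side could then be as simple as the loop below.
This is only a sketch; OHSM_IOC_RELOCATE_CHUNK is a hypothetical ioctl
(name and return convention invented for illustration), nothing we have
defined yet.

===
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define OHSM_IOC_RELOCATE_CHUNK _IO('O', 1)     /* hypothetical */

int main(int argc, char **argv)
{
        int fd, ret;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* each call does one chunk's worth of work in the kernel and
         * then returns, so the normal scheduler (and nice) stays in
         * control of how fast the re-org proceeds */
        while ((ret = ioctl(fd, OHSM_IOC_RELOCATE_CHUNK)) > 0)
                ;       /* could also sleep/yield here to lower impact */

        if (ret < 0)
                perror("ioctl");
        close(fd);
        return ret < 0 ? 1 : 0;
}
===
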

> No fancy in-kernel stuff has to be done.
>
> If there is no contention for the inode lock, then that file's re-org
> goes as fast as the normal task scheduler will schedule the user space
> tool.
>
>>> Then it locks the inode again and once again gets a full fresh list of
>>> the inode block pointers.  If the file has grown between release and
>>> the next lock, then the new inode block pointer list will reflect
>>> those new blocks as well.
>>>
>> What if you don't get the lock again?
>
> Then the file does not get migrated any further.  The key is that
> after each chunk you put the original inode back into a fully
> operational state and delete the ghost inode.
>
> Then on the next chunk recreate everything and do another chunk's worth of work.
>
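Right. If I follow you, each pass on the kernel side would then look
roughly like the sketch below; the ohsm_* helpers are placeholders for
the ghost-inode handling you describe, not actual code.

===
#include <linux/err.h>
#include <linux/fs.h>

/* hypothetical helpers standing in for the real migration code */
extern struct inode *ohsm_create_ghost(struct inode *inode);
extern int  ohsm_migrate_chunk(struct inode *inode, struct inode *ghost);
extern void ohsm_commit_and_destroy_ghost(struct inode *inode,
                                          struct inode *ghost);

/* migrate at most one chunk; returns 1 when the file is fully moved,
 * 0 when more chunks remain, or a negative errno */
static int ohsm_relocate_one_chunk(struct inode *inode)
{
        struct inode *ghost;
        int done;

        mutex_lock(&inode->i_mutex);

        /* the ghost inode holds the newly allocated destination blocks */
        ghost = ohsm_create_ghost(inode);
        if (IS_ERR(ghost)) {
                mutex_unlock(&inode->i_mutex);
                return PTR_ERR(ghost);
        }

        /* the block pointer list is re-read under the lock, so blocks
         * appended between chunks are picked up as well */
        done = ohsm_migrate_chunk(inode, ghost);

        /* swap the migrated pointers and put the original inode back
         * into a fully operational state before dropping the lock */
        ohsm_commit_and_destroy_ghost(inode, ghost);

        mutex_unlock(&inode->i_mutex);
        return done;
}
===
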
>> How are the Linux kernel maintainers accepting a lock for a 64MB
>> block copy?
>
> I have not read all the ext4 messages about ext4_defrag, but it
> appears that locking the inode for each 64 MB chunk is what was
> proposed in Sept. and I did not see anyone arguing about it.  My
> theory is that 64MB is less than or equal to what the kernel can do in
> a single timeslice, so locking an inode for a single timeslice is very
> acceptable.
>
>> If that's allowed, would they have issues with us
>> locking it for a granularity of some X?
>
> If X is the same as whatever ext4_defrag uses, then you have a strong
> argument that other parts of the kernel are already using it.
>
> If X is 10x what ext4_defrag uses, you have a much bigger argument to make.
>
>> But first I will look at the performance metrics of dividing the copy
>> operation into chunks.
>>
> Agreed.
>
> Somewhere I think I read you were doing 1 GB in less than a second or
> something like that.
>
> Am I remembering right?
>
Yes, that is true. The approximate figure for a 512MB file was 230 milliseconds.
The code is currently in the testing phase, so we will let you know
the exact figures.

> I don't see how that could be true if you mean the full
> transfer from one disk to the other.  For simple disks, the fastest I
> have seen is about 5GB/min, or 12 seconds per GB.
>
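For what it is worth, 512MB in 230 milliseconds works out to roughly
2.2GB/s, while 5GB/min is only about 85MB/s, so our figure almost
certainly reflects the copies being queued in memory (buffered writes)
rather than completed disk-to-disk transfers. We will post figures
measured with the data actually synced to the destination disk.
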
>>> I think you said ext4_defrag() is using 2 different locks.  Maybe one
>>> is just to stop updates to the inode data block pointers, and the
>>> other is finer grained and deals with individual blocks being locked?
>>>
>>
>> That's very true, they take two locks. But if the inode is locked,
>> how can the size of the file increase? Is that possible?
>
> Maybe it only changes between 64MB chunks.  If so, I like that
> behavior very much.
>
>> As I mentioned, are you saying that they check the size after every 64MB copy?
>
> That makes sense to me.  Lock out any writes that require new data
> blocks to be allocated for the entire chunk.  Then put the inode /
> file back into a consistent state and release the lock.
>
> Let the scheduler run another task, and if that task causes new data
> blocks to be allocated, that's fine.
>
> Then lock the inode and handle the next chunk.
>
>>
>>> That would make me happier and seems like a more reasonable
>>> implementation than locking the file for all writes for the full 64MB
>>> move.
>>>
>>
>> No, they are locking the inode with both locks in ext4_defrag, as
>> any read/write would go through the inode. This prevents any
>> updates to the inode and to all the existing data blocks.
>>
>
> Too bad, but again as long as the chunk is small enough to be handled
> in a single time slice, I think you are golden.
>
> This brings up a question.  Are you always "moving" a data block, or
> do you have a test in the loop to verify it is not already on the
> correct tier of storage?
>>
>> See, I will tell you a bit more in detail. We have two fields in the inode,
>> home_tier_id and destination_tier_id.
>> home_tier_id is set if a file qualifies under a file allocation policy. If
>> it doesn't qualify under any of the policies, its data can be allocated
>> anywhere in the FS; we actually fall back to the original block
>> allocation method of the FS.
>
>> If a file qualifies, we set its home_tier_id to the respective tier as
>> mentioned in the policy, and we restrict its block allocation to that
>> particular tier.
>>
>> Now, at the time of relocation,
>> if the policy was (in the XML policy file)  SELECT *.mp3 from TIER 1,
>> RELOCATE to TIER 4, when file access temperature (FAT) > 200:
>>
>> We do an FS scan and read each inode one by one.
>> We check whether its home tier id != 0, as that means it was
>> allocated by OHSM; otherwise we leave that inode alone.
>> Then we check the type of the file; if it is an mp3 we set its
>> destination_tier_id to the destination tier from the policy
>> and pass it for relocation. The relocation function fetches the
>> destination tier id from the inode, allocates new blocks from that tier,
>> and then sets the home_tier_id to the dest_tier_id.
>> Does that answer your question, sir?
>>
>
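In C-like pseudocode, the per-inode check I described above looks
roughly like this; the structures and helper names are simplified for
illustration and are not the actual OHSM code.

===
#include <linux/fs.h>

/* simplified, illustrative types -- not the real OHSM structures */
struct ohsm_policy {
        int src_tier;
        int dest_tier;
        int fat_threshold;      /* file access temperature trigger */
};

/* hypothetical helpers standing in for the real OHSM code */
extern int  ohsm_home_tier(struct inode *inode);
extern int  ohsm_file_matches(struct inode *inode,
                              const struct ohsm_policy *p);
extern void ohsm_set_dest_tier(struct inode *inode, int tier);
extern int  ohsm_relocate(struct inode *inode);
extern void ohsm_set_home_tier(struct inode *inode, int tier);

/* called for each inode visited by the FS scan */
static void ohsm_check_inode(struct inode *inode,
                             const struct ohsm_policy *p)
{
        /* home tier id 0 means the file was never allocated by OHSM */
        if (ohsm_home_tier(inode) == 0)
                return;

        /* e.g. "*.mp3 on TIER 1, relocate to TIER 4 when FAT > 200" */
        if (!ohsm_file_matches(inode, p))
                return;

        ohsm_set_dest_tier(inode, p->dest_tier);

        /* allocate new blocks from the destination tier, copy, and
         * swap the block pointers; on success update the home tier */
        if (ohsm_relocate(inode) == 0)
                ohsm_set_home_tier(inode, p->dest_tier);
}
===
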
> Not quite.  Assume that the mp3 files do not have a policy set, so
> they are randomly spread across 2 tiers.
>
> Then you assign a policy to all mp3s to move them to tier 2, thus
> freeing up tier 1.
>
> You will have mp3 files in three states originally:
>
> 1) Fully on tier 1
> 2) Fully on tier 2
> 3) Some data blocks on tier1 and some on tier2
>
> My question is whether you try to recognize the data blocks that are
> already on tier2 and skip them, or whether you move them all
> regardless of where they happen to be sitting when the
> policy is set.
>
> I don't think it is necessarily bad to always move the data blocks
> when a new policy is set.  I'm just curious.

For your curiosity :)
OHSM sets the home tier id of such files to -1.
The current implementation will move files regardless of where they happen to be,
but we will definitely come up with a better solution.

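One possible optimization, just as a sketch and assuming purely for
illustration that a tier maps to one contiguous range of physical
blocks, would be to skip data blocks that already sit inside the
destination tier's range:

===
#include <linux/types.h>

/* illustrative only -- not the real OHSM tier description */
struct ohsm_tier_range {
        sector_t start_blk;
        sector_t end_blk;
};

/* true if the block already lives on the given tier, so the copy
 * loop can skip it instead of moving it again */
static inline int ohsm_block_on_tier(sector_t blk,
                                     const struct ohsm_tier_range *t)
{
        return blk >= t->start_blk && blk < t->end_blk;
}
===
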
Soon we will be uploading the code, and then you can review it better.

Thanks.
>
>> --
>> Regards,
>> Sandeep.
>
> Greg
> --
> Greg Freemyer
> Litigation Triage Solutions Specialist
> http://www.linkedin.com/in/gregfreemyer
> First 99 Days Litigation White Paper -
> http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
>
> The Norcross Group
> The Intersection of Evidence & Technology
> http://www.norcrossgroup.com
>

--
To unsubscribe from this list: send an email with
"unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ

