Hi Greg,

Thanks for such great insights.

On Fri, Jan 16, 2009 at 11:41 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
> On Fri, Jan 16, 2009 at 8:26 AM, Sandeep K Sinha
> <sandeepksinha@xxxxxxxxx> wrote:
>> Hi Greg,
>>
>> On Fri, Jan 16, 2009 at 5:50 AM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
>>> On Thu, Jan 15, 2009 at 12:47 PM, Sandeep K Sinha
>>> <sandeepksinha@xxxxxxxxx> wrote:
>>>> Hey,
>>>>
>>>> On Thu, Jan 15, 2009 at 10:27 PM, Greg Freemyer
>>> <snip>
>>>>> I think I've said it before, but I would think the best real-world
>>>>> implementation would be:
>>>>>
>>>>> ===
>>>>> pre-allocate destination data blocks
>>>>>
>>>>> for each block
>>>>>     prefetch source data block
>>>>>     lock inode
>>>>>     copy source data block to dest data block IN MEMORY ONLY and
>>>>>     put in block queue for delivery to disk
>>>>>     release lock
>>>>> end
>>>>>
>>>>> perform_inode_level_block_pointer_swap
>>>>> ===
>>>>>
>>>> I would be more than happy if I am able to accomplish this. Greg,
>>>> the only problem that I see here is that somebody who has already
>>>> opened the file can keep growing it after I pre-allocate the
>>>> destination data blocks, and I don't see a way to avoid that. But
>>>> I am certainly looking into it.
>>>>
>>>> I have seen many similar implementations, and most of them suffer
>>>> from this issue. But surely there is a way to optimize it, if not
>>>> avoid it entirely.
>>>
>>> The way ext4_defrag works, I believe, is to put a lock around the
>>> inode's block list every 64MB, and I assume that under that lock it
>>> has a static list of inode block pointers to work with.
>>>
>>> At the conclusion of the 64MB chunk, it releases the lock and
>>> allows writes to occur. That includes writes that extend the file.
>>>
>> For us, this granularity is initially the size of the whole file,
>> i.e. however many data blocks it has. We could also break the
>> relocation of a file's blocks into 64MB chunks, but then my question
>> would be: why not 100MB, and why not 20MB?
>>
>> It is just the granularity that ext4_defrag happens to use, and I
>> don't think there is any performance philosophy behind it. I would
>> say it adds the extra cost of taking and releasing locks every 64MB.
>> And what if someone else takes the lock and doesn't give it up soon?
>> Your relocation process would be delayed for that reason. I do know
>> that, above all, the lock-hold period should be kept as short as
>> possible.
>>
> Ultimately, I think the granularity should be user configurable, as
> should the "priority" from a scheduling perspective.
>
> Personally, I would like to see the unit of work be a time slice and
> then have the ioctl return to user space. That is conceptually
> similar to what ext4_defrag() does, but as you say the 64MB value
> seems arbitrary.
>
> By returning to user space between each chunk, the normal task
> scheduler gets into the loop.
>
> If a user then wants to ensure the re-org is done ASAP, he can use
> nice etc. to raise the user-space tool's priority.
>
> If the user wants another app using the files under re-org to have
> priority, then he can lower the user-space tool's priority.
>
We can do that.

> No fancy in-kernel stuff has to be done.
>
> If there is no contention for the inode lock, then that file's
> re-org goes as fast as the normal task scheduler will schedule the
> user-space tool.
>
>>> Then it locks the inode again and once again gets a full fresh
>>> list of the inode block pointers. If the file has grown between
>>> release and the next lock, then the new inode block pointer list
>>> will reflect those new blocks as well.
>>>
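Right, and if each chunk re-snapshots the block list like that, it
fits nicely with your idea of returning to user space between chunks.
From the user-space side, I picture the driver loop roughly like
this. A sketch only, to check my understanding: OHSM_IOC_RELOCATE_CHUNK
and struct ohsm_chunk_arg are invented names, no such interface exists
in our code yet.

===
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* Invented for illustration -- no such ioctl exists yet. */
struct ohsm_chunk_arg {
        unsigned long moved;    /* blocks moved so far, kernel-updated */
        unsigned long nr_blks;  /* chunk size per call, user-tunable   */
        int done;               /* nonzero once the file is migrated   */
};
#define OHSM_IOC_RELOCATE_CHUNK _IOWR('O', 1, struct ohsm_chunk_arg)

int main(int argc, char *argv[])
{
        struct ohsm_chunk_arg arg = { 0, 16384, 0 };
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /*
         * One chunk per ioctl: the kernel locks the inode, takes a
         * fresh snapshot of its block pointers (picking up any blocks
         * appended since the last chunk), moves nr_blks blocks,
         * unlocks, and returns.  Because control comes back to user
         * space between chunks, the ordinary task scheduler governs
         * how fast the relocation proceeds.
         */
        while (!arg.done) {
                if (ioctl(fd, OHSM_IOC_RELOCATE_CHUNK, &arg) < 0) {
                        perror("relocate chunk");
                        break;
                }
        }

        close(fd);
        return 0;
}
===

A plain "nice -n 19" on such a tool would then de-prioritize the whole
re-org, and a raised priority would speed it up, with no fancy
in-kernel machinery, exactly as you describe.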
>> What if you don't get a lock again?
>
> Then the file does not get migrated any further. The key is that
> after each chunk you put the original inode back into a fully
> operational state and delete the ghost inode.
>
> Then on the next chunk you recreate everything and do another
> chunk's worth of work.
>
>> How are the Linux kernel maintainers accepting a lock for a 64MB
>> block copy?
>
> I have not read all the ext4 messages about ext4_defrag, but it
> appears that locking the inode for each 64MB chunk is what was
> proposed in Sept., and I did not see anyone arguing about it. My
> theory is that 64MB is less than or equal to what the kernel can do
> in a single timeslice, so locking an inode for a single timeslice is
> very acceptable.
>
>> If that's allowed, would they have issues with us locking it for a
>> granularity of some X?
>
> If X is the same as whatever ext4_defrag uses, then you have a
> strong argument that other parts of the kernel are already using it.
>
> If X is 10x what ext4_defrag uses, you have a much bigger argument
> to make.
>
>> But first, I will look at the performance metrics of dividing the
>> copy operation into chunks.
>
> Agreed.
>
> Somewhere I think I read you were doing 1 GB in less than a second
> or something like that.
>
> Am I remembering right?
>
Yes, that is true. The approximate figure for a 512MB file was 230
milliseconds. The code is currently in the testing phase, so we will
let you know the exact figures.

> I don't see how that could be true if you mean the full transfer
> from one disk to the other. For simple disks, the fastest I have
> seen is about 5GB/min, or 12 seconds per GB.
>
>>> I think you said ext4_defrag() is using 2 different locks. Maybe
>>> one is just to stop updates to the inode data block pointers, and
>>> the other is finer grained and deals with individual blocks being
>>> locked?
>>>
>> That's very true, they do take two locks. But if the inode is
>> locked, how can the size of the file increase? Is that possible?
>
> Maybe it only changes between 64MB chunks. If so, I like that
> behavior very much.
>
>> As I mentioned, are you telling me that they check the size after
>> every 64MB copy?
>
> That makes sense to me. Lock out any writes that require new data
> blocks to be allocated for the entire chunk. Then put the inode /
> file back into a consistent state and release the lock.
>
> Let the scheduler run another task, and if that task causes new data
> blocks to be allocated, that's fine.
>
> Then lock the inode and handle the next chunk.
>
>>> That would make me happier and seems like a more reasonable
>>> implementation than locking the file for all writes for the full
>>> 64MB move.
>>>
>> No, they are locking the inode with both locks in ext4_defrag, as
>> any read or write goes through the inode. This protects against
>> updates to the inode and to all the existing data blocks.
>>
> Too bad, but again, as long as the chunk is small enough to be
> handled in a single time slice, I think you are golden.
>
>>> This brings up a question. Are you always "moving" a data block,
>>> or do you have a test in the loop to verify it is not already on
>>> the correct tier of storage?
>>
>> See, I will tell you a bit in detail. We have two fields in the
>> inode: home_tier_id and destination_tier_id. home_tier_id is set if
>> a file qualifies a file allocation policy. If it doesn't qualify
>> any of the policies, its data can be allocated anywhere in the FS;
>> we actually default to the original block allocation method of the
>> FS.
>>
>> If a file qualifies, we set its home_tier_id to the respective tier
>> as mentioned in the policy, and restrict block allocation to that
>> particular tier.
>>
>> Now, at the time of relocation, suppose the policy was (in the XML
>> policy file): SELECT *.mp3 FROM TIER 1, RELOCATE TO TIER 4, WHEN
>> file access temperature (FAT) > 200.
>>
>> We do an FS scan and read each inode one by one, and check whether
>> its home tier id != 0, as that means it was allocated by OHSM;
>> otherwise we leave that inode alone. Then we check the type of the
>> file: if it is an mp3, we set destination_tier_id to the
>> destination tier in the policy and pass it on for relocation. The
>> relocation function fetches the destination tier id from the inode,
>> allocates new blocks from that tier, and then sets home_tier_id to
>> destination_tier_id.
>>
>> Does that answer your question, sir?
>>
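To make that flow concrete, the scan stage in code terms is roughly
the following. A simplified sketch only; the ohsm_* helpers are
placeholder names and will not match the actual code we post, and all
locking around the inode list is omitted.

===
/*
 * Simplified sketch of the OHSM scan stage -- illustrative only; the
 * ohsm_* helpers are placeholder names and locking is omitted.
 */
static void ohsm_scan_and_relocate(struct super_block *sb,
                                   struct ohsm_policy *pol)
{
        struct inode *inode;

        /* Walk every inode of the filesystem, one by one. */
        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                /* home_tier_id == 0: never allocated under an OHSM
                 * policy, so we leave this inode alone. */
                if (ohsm_home_tier(inode) == 0)
                        continue;

                /* Policy match, e.g. "*.mp3" with file access
                 * temperature (FAT) > 200. */
                if (!ohsm_policy_matches(inode, pol))
                        continue;

                /* Record the target tier in the inode; the relocation
                 * routine reads destination_tier_id, allocates new
                 * blocks from that tier, copies the data, and finally
                 * sets home_tier_id = destination_tier_id. */
                ohsm_set_dest_tier(inode, pol->dest_tier);
                ohsm_relocate(inode);
        }
}
===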
> Not quite. Assume that the mp3 files do not have a policy set, so
> they are randomly spread across 2 tiers.
>
> Then you assign a policy to all mp3s to move them to tier 2, thus
> freeing up tier 1.
>
> You will have mp3 files in three states originally:
>
> 1) Fully on tier 1
> 2) Fully on tier 2
> 3) Some data blocks on tier 1 and some on tier 2
>
> My question is whether you try to recognize the data blocks that
> are already on tier 2 and avoid moving them, or whether you move
> them all regardless of where they happen to be sitting when the
> policy is set.
>
> I don't think it is necessarily bad to always move the data blocks
> when a new policy is set. I'm just curious.
>
For your curiosity :) OHSM sets the home tier id of such files to -1.
The current implementation will move files regardless of where they
happen to be, but we will definitely come up with a better solution.
Soon we will be uploading the code, and then you can review it
better.

Thanks.

>> --
>> Regards,
>> Sandeep.
>
> Greg
> --
> Greg Freemyer
> Litigation Triage Solutions Specialist
> http://www.linkedin.com/in/gregfreemyer
> First 99 Days Litigation White Paper -
> http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
>
> The Norcross Group
> The Intersection of Evidence & Technology
> http://www.norcrossgroup.com

--
To unsubscribe from this list: send an email with
"unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ