On Fri, Jan 16, 2009 at 8:26 AM, Sandeep K Sinha
<sandeepksinha@xxxxxxxxx> wrote:
> Hi Greg,
>
> On Fri, Jan 16, 2009 at 5:50 AM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
>> On Thu, Jan 15, 2009 at 12:47 PM, Sandeep K Sinha
>> <sandeepksinha@xxxxxxxxx> wrote:
>>> Hey,
>>>
>>> On Thu, Jan 15, 2009 at 10:27 PM, Greg Freemyer
>> <snip>
>>>> I think I've said it before, but I would think the best real world
>>>> implementation would be:
>>>>
>>>> ===
>>>> pre-allocate destination data blocks
>>>>
>>>> For each block
>>>>     prefetch source data block
>>>>     lock inode
>>>>     copy source data block to dest data block IN MEMORY ONLY and put in
>>>>         block queue for delivery to disk
>>>>     release lock
>>>> end
>>>>
>>>> perform_inode_level_block_pointer_swap
>>>> ===
>>>>
>>>
>>> I would be more than very happy if I am able to accomplish this. Greg,
>>> the only problem that I see here is that somebody who has already opened
>>> the file may be increasing the size of the file once I
>>> preallocate destination data blocks.
>>> And I don't see a way to avoid that. But surely looking forward to.
>>>
>>> I have seen many similar implementations and most of them suffer from
>>> this issue. But surely there can be a way to optimize it, if not avoid
>>> it.
>>
>> The way ext4_defrag works I believe is to put a lock around the
>> inode's block list every 64MB and I assume that under that lock it has
>> a static list of inode block pointers to work with.
>>
>> At the conclusion of the 64MB chunk, it releases the lock and allows
>> writes to occur. That includes writes that extend the file.
>>
>
> For us this granularity is initially the size of the file. Meaning
> whatever number of data blocks it has.
> We can also break up relocation of a file's blocks into 64MB chunks,
> but then my question would be why not 100MB and why not 20MB?
>
> It's just a granularity that has been taken by ext4_defrag and I don't
> think there would be any performance philosophy behind that.
> I would say it will have the extra cost of taking/giving locks every
> 64MB. And what if someone else takes a lock and doesn't give it up soon?
> Your relocation process would be delayed for that reason. I know, above
> all the lock period should be shorter for all reasons.
>

Ultimately, I think the granularity should be user configurable. As
should the "priority" from a scheduling perspective.

Personally, I would like to see the unit of work be a time slice and
then have the ioctl return to user space. That is conceptually
similar to what ext4_defrag() does, but as you say the 64MB value
seems arbitrary.

By returning to user space between each chunk, the normal task
scheduler gets to get in the loop. If a user then wants to ensure the
re-org is done ASAP, he can use nice etc. to raise the user space
tool's priority. If the user wants another app using the files under
re-org to have priority, then he can lower the user space tool's
priority. No fancy in-kernel stuff has to be done.

If there is no contention for the inode lock, then that file's re-org
goes as fast as the normal task scheduler will schedule the user space
tool.

>> Then it locks the inode again and once again gets a full fresh list of
>> the inode block pointers. If the file has grown between release and
>> the next lock, then the new inode block pointer list will reflect
>> those new blocks as well.
>>
> What if you don't get a lock again?

Then the file does not get migrated any further. The key is that
after each chunk you put the original inode back into a fully
operational state and delete the ghost inode. Then on the next chunk
recreate everything and do another chunk's worth of work.

> How are the linux kernel maintainers accepting a lock for a 64MB
> block copy?

I have not read all the ext4 messages about ext4_defrag, but it
appears that locking the inode for each 64 MB chunk is what was
proposed in Sept. and I did not see anyone arguing about it.
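
To make the chunk-at-a-time idea concrete, here is a rough sketch in
Python. It is only a toy user-space model, not kernel code, and it is
not how ext4_defrag or OHSM actually work. The names Inode,
relocate_chunk, relocate_file and make_writer are made up for
illustration, and a "block" is just a list entry. Each call to
relocate_chunk stands in for one ioctl: take the lock, get a fresh
block list, move one chunk, drop the lock, return to the caller.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Inode:
    """Toy stand-in for an inode (hypothetical): a block list plus a lock."""
    blocks: list
    lock: threading.Lock = field(default_factory=threading.Lock)


def relocate_chunk(inode, start, chunk_blocks, dest):
    """Model of one ioctl call: move at most chunk_blocks blocks.

    Returns the index to resume from, or None once the end of the
    (freshly read) block list has been reached.
    """
    with inode.lock:
        snapshot = list(inode.blocks)      # fresh block list on every call
        end = min(start + chunk_blocks, len(snapshot))
        for i in range(start, end):
            dest[i] = snapshot[i]          # "copy" the block to its new home
    # Lock is released here; the file is usable again between chunks.
    return end if end < len(snapshot) else None


def relocate_file(inode, chunk_blocks, between_chunks=None):
    """User-space loop: one chunk per call, control returns in between."""
    dest, pos, calls = {}, 0, 0
    while pos is not None:
        pos = relocate_chunk(inode, pos, chunk_blocks, dest)
        calls += 1
        if pos is not None and between_chunks is not None:
            between_chunks(inode)          # other tasks get to run here
    return dest, calls


def make_writer(extra_blocks):
    """Build a writer that extends the file once, under the inode lock."""
    done = []

    def writer(inode):
        if not done:
            with inode.lock:
                inode.blocks.extend(extra_blocks)
            done.append(True)

    return writer


if __name__ == "__main__":
    inode = Inode(blocks=list(range(10)))
    dest, calls = relocate_file(inode, 4, make_writer([100, 101, 102, 103]))
    print(calls, len(dest))
```

In this model a writer that extends the file between two chunks is
simply picked up by the next call, because the block list is re-read
under the lock each time. If the loop stops early, everything moved so
far is still consistent, which is the property I am after.
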
My theory is that 64MB is less than or equal to what the kernel can do
in a single timeslice, so locking an inode for a single timeslice is
very acceptable.

> If that's allowed, why would they have issues with us
> locking it for a granularity of some X?

If X is the same as whatever ext4_defrag uses, then you have a strong
argument that other parts of the kernel are already using it. If X is
10x what ext4_defrag uses, you have a much bigger argument to make.

> But, first I will see the performance metrics of dividing the copy
> operation into some chunks.
>

Agreed.

Somewhere I think I read you were doing 1 GB in less than a second or
something like that. Am I remembering right? I don't see how that
could be true if you mean the full transfer from one disk to
the other. For simple disks, the fastest I have seen is about
5GB/min, or 12 seconds per GB.

>> I think you said ext4_defrag() is using 2 different locks. Maybe one
>> is just to stop updates to the inode data block pointers, and the
>> other is finer grained and deals with individual blocks being locked?
>>
>
> That's very true, that they take two locks. But if the inode is locked
> how can the size of the file increase? Is that possible?

Maybe it only changes between 64MB chunks. If so, I like that
behavior very much.

> As I mentioned, you are telling that they check the size after every
> 64MB copy?

That makes sense to me. Lock out any writes that require new data
blocks to be allocated for the entire chunk. Then put the inode /
file back into a consistent state and release the lock. Let the
scheduler run another task, and if the task causes new data blocks to
be allocated, that's fine. Then lock the inode and handle the next chunk.

>
>> That would make me happier and seems like a more reasonable
>> implementation than locking the file for all writes for the full 64MB
>> move.
>>
>
> No, they are locking the inode with both the locks in ext4_defrag. As
> any read/write would go through the inode.
> This will protect any updates to the inodes and to all the existing
> data blocks.
>

Too bad, but again as long as the chunk is small enough to be handled
in a single time slice, I think you are golden.

>> This brings up a question. Are you always "moving" a data block, or
>> do you have a test in the loop to verify it is not already on the
>> correct tier of storage?
>
> See, I will tell you a bit in detail. We have two fields in the inode,
> home_tier_id and destination_tier_id.
> home_tier_id is set if a file qualifies for a file allocation policy. If
> it doesn't qualify for any of the policies, its data can be allocated
> anywhere in the FS; we actually default to the original block
> allocation method of the FS.
> If a file qualifies, we set its home_tier_id to the respective tier as
> mentioned in the policy. And restrict the block allocation to that
> particular tier.
>
> Now, at the time of relocation,
> if the policy was (in XML policy file) SELECT *.mp3 from TIER 1,
> RELOCATE to TIER 4, When file Access temp(FAT) > 200
>
> We do a FS scan and read each inode one by one,
> now check if its home tier id != 0, as that means that it has been
> allocated by OHSM, else we leave that inode.
> Now we check for the type of the file; if it's mp3 we set the
> destination_tier_id = the dest_tier_in policy.
> And pass it for relocation. And the relocation function fetches the
> destination tier_id from the inode and allocates new blocks from that tier.
> And then sets the home_tier_id to dest_tier_id.
> Does that answer your question sir?
>

Not quite.

Assume that the mp3 files do not have a policy set, so they are
randomly spread across 2 tiers.

Then you assign a policy to all mp3s to move them to tier 2, thus
freeing up tier 1.

You will have mp3 files in three states originally:

1) Fully on tier 1
2) Fully on tier 2
3) Some data blocks on tier1 and some on tier2

My question is if you try to recognize the data blocks that are
already on tier2 and not move them.
Or do you move them all regardless of where they happen to be
currently sitting when the policy is set?

I don't think it is necessarily bad to always move the data blocks
when a new policy is set. I'm just curious.

> --
> Regards,
> Sandeep.

Greg
--
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

--
To unsubscribe from this list: send an email with
"unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ