On Tue, Jan 13, 2009 at 1:21 PM, Sandeep K Sinha <sandeepksinha@xxxxxxxxx> wrote:
> Hi Manish,
>
> On Mon, Jan 12, 2009 at 11:48 PM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote:
>> On Mon, Jan 12, 2009 at 11:31 PM, Sandeep K Sinha
>> <sandeepksinha@xxxxxxxxx> wrote:
>>> Hi Peter,
>>>
>>> On Mon, Jan 12, 2009 at 9:49 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>> On Mon, Jan 12, 2009 at 4:26 PM, Sandeep K Sinha
>>>> <sandeepksinha@xxxxxxxxx> wrote:
>>>>> Hi Peter,
>>>>>
>>>>> Don't you think that this will restrict us to a specific file system?
>>>>> Shouldn't the VFS inode be used rather than the FS in-core inode?
>>>>>
>>>>
>>>> The VFS has APIs for this: fsync_buffer_list() and
>>>> invalidate_inode_buffers(), and these APIs seem to use a spinlock for
>>>> syncing:
>>>>
>>>> void invalidate_inode_buffers(struct inode *inode)
>>>> {
>>>>         if (inode_has_buffers(inode)) {
>>>>                 struct address_space *mapping = &inode->i_data;
>>>>                 struct list_head *list = &mapping->private_list;
>>>>                 struct address_space *buffer_mapping = mapping->assoc_mapping;
>>>>
>>>>                 spin_lock(&buffer_mapping->private_lock);
>>>>                 while (!list_empty(list))
>>>>                         __remove_assoc_queue(BH_ENTRY(list->next));
>>>>                         /* ======> modify this for writing out the data instead */
>>>>                 spin_unlock(&buffer_mapping->private_lock);
>>>>         }
>>>> }
>>>> EXPORT_SYMBOL(invalidate_inode_buffers);
>>>>
>>>>> The purpose is to block all I/O while we are updating the i_data
>>>>> from the new inode to the old inode (the update of the data blocks).
>>>>>
>>>>> I think i_alloc_sem should work here, but I could not find any
>>>>> instance of its use in the code.
>>>>
>>>> For the case of ext3's block allocation, the lock seems to be
>>>> truncate_mutex - read the remark:
>>>>
>>>>         /*
>>>>          * From here we block out all ext3_get_block() callers who want to
>>>>          * modify the block allocation tree.
>>>>          */
>>>>         mutex_lock(&ei->truncate_mutex);
>>>>
>>>> So while it is building the tree, the mutex will lock it.
>>>>
>>>> And the remarks for ext3_get_blocks_handle() are:
>>>>
>>>> /*
>>>>  * Allocation strategy is simple: if we have to allocate something, we will
>>>>  * have to go the whole way to leaf. So let's do it before attaching anything
>>>>  * to tree, set linkage between the newborn blocks, write them if sync is
>>>>  * required, recheck the path, free and repeat if check fails, otherwise
>>>>  * set the last missing link (that will protect us from any truncate-generated
>>>>  ...
>>>>
>>>> Reading the source... go down to the mutex_lock() (where multiblock
>>>> allocations are needed); after the lock, all the block allocation,
>>>> merging etc. is done:
>>>>
>>>>         /* Next simple case - plain lookup or failed read of indirect block */
>>>>         if (!create || err == -EIO)
>>>>                 goto cleanup;
>>>>
>>>>         mutex_lock(&ei->truncate_mutex);
>>>> <snip>
>>>>         count = ext3_blks_to_allocate(partial, indirect_blks,
>>>>                                       maxblocks, blocks_to_boundary);
>>>> <snip>
>>>>         err = ext3_alloc_branch(handle, inode, indirect_blks, &count, goal,
>>>>                                 offsets + (partial - chain), partial);
>>>>
>>>>> It's currently working fine with i_mutex, meaning we hold the i_mutex
>>>>
>>>> As far as I know, i_mutex is used for modifying the inode's structural
>>>> information: grep for i_mutex in fs/ext3/ioctl.c - every time there is
>>>> a need to maintain the inode's structural info, the lock on i_mutex is
>>>> taken.
>>>>
>>>>> lock on the inode while updating the i_data pointers, and any I/O
>>>>> attempted from user space in the meantime is queued. The file was
>>>>> opened in r/w mode prior to taking the lock inside the kernel.
>>>>>
>>>>> But, I still feel i_alloc_sem would be the right option to go ahead with.
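For reference, a minimal sketch of the locking pattern being discussed -
hold i_mutex (and, if i_alloc_sem turns out to be the right choice, its
write side) across the block-pointer update so that user-space I/O queues
behind it. ohsm_update_inode() and update_block_pointers() are
hypothetical names for illustration, not OHSM's actual code:

#include <linux/fs.h>

/* Sketch only: serialize user-space I/O against the i_data update.
 * update_block_pointers() is a hypothetical placeholder for the
 * copy/swap logic discussed later in this thread. */
static int update_block_pointers(struct inode *inode);

static int ohsm_update_inode(struct inode *inode)
{
        int err;

        mutex_lock(&inode->i_mutex);      /* queues read/write from user space */
        down_write(&inode->i_alloc_sem);  /* blocks direct I/O and truncate */

        err = update_block_pointers(inode);

        up_write(&inode->i_alloc_sem);
        mutex_unlock(&inode->i_mutex);
        return err;
}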
>>>>>
>>>>> On Mon, Jan 12, 2009 at 1:11 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>>>> If you grep for spinlock, mutex, or "sem" in the fs/ext4 directory,
>>>>>> you can find all three types of lock in use - for different classes
>>>>>> of objects.
>>>>>>
>>>>>> For data blocks I guess it is a semaphore - read
>>>>>> fs/ext4/inode.c:ext4_get_branch():
>>>>>>
>>>>>> /**
>>>>>>  * ext4_get_branch - read the chain of indirect blocks leading to data
>>>>>> <snip>
>>>>>>  *
>>>>>>  * Need to be called with
>>>>>>  * down_read(&EXT4_I(inode)->i_data_sem)
>>>>>>  */
>>>>>>
>>>>>> I guess you have no choice; as it is a semaphore, you have to follow
>>>>>> the rest of the kernel for consistency - don't create your own
>>>>>> semaphore :-).
>>>>>>
>>>>>> There also exists i_lock, a spinlock, which as far as I know is for
>>>>>> i_blocks accounting purposes:
>>>>>>
>>>>>>         spin_lock(&inode->i_lock);
>>>>>>         inode->i_blocks += tmp_inode->i_blocks;
>>>>>>         spin_unlock(&inode->i_lock);
>>>>>>         up_write(&EXT4_I(inode)->i_data_sem);
>>>>>>
>>>>>> But for data it should be i_data_sem. Is that correct?
>>>>>>
>>>>>> On Mon, Jan 12, 2009 at 2:18 PM, Rohit Sharma <imreckless@xxxxxxxxx> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am having some issues locking the inode while copying data blocks.
>>>>>>> We are trying to keep the file system live during this operation, so
>>>>>>> both read and write operations should work.
>>>>>>> In this case, what type of lock should be used on the inode: a
>>>>>>> semaphore, a mutex or a spinlock?
>>>>>>>
>>>>>>> On Sun, Jan 11, 2009 at 8:45 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>>>>>> Sorry... some mistakes... a resend:
>>>>>>>>
>>>>>>>> Here are some tips on the blockdevice API:
>>>>>>>>
>>>>>>>> http://lkml.org/lkml/2006/1/24/287
>>>>>>>> http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-01/msg09388.html
>>>>>>>>
>>>>>>>> As indicated, documentation is rather sparse in this area.
>>>>>>>>
>>>>>>>> Not sure if anyone else has a summary list of the blockdevice API
>>>>>>>> and its explanation?
>>>>>>>>
>>>>>>>> Now, wrt the following "cleanup patch", I am not sure how the API
>>>>>>>> will change:
>>>>>>>>
>>>>>>>> http://lwn.net/Articles/304485/
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> On Tue, Jan 6, 2009 at 6:36 PM, Rohit Sharma <imreckless@xxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> I want to read data blocks from one inode
>>>>>>>>> and copy them to another inode.
>>>>>>>>>
>>>>>>>>> I mean, to copy data from the data blocks associated with one inode
>>>>>>>>> to the data blocks associated with another inode.
>>>>>>>>>
>>>>>>>>> Is that possible in kernel space?
>>>>>>>>> --
>>>>>>
>>>>
>>>> Comments????
>>>
>>> That's very right!!!
>>>
>>> So, finally we were able to perform the copy operation successfully.
>>>
>>> We did something like this, and we named it "ohsm's tricky copy".
>>> Rohit will soon be uploading a new doc on the fscops page which will
>>> detail it further.
>>
>> Thanks, let us know when the docs and the *source code* are available ;-)
>>
>>>
>>> 1. Read the source inode.
>>> 2. Allocate a new ghost inode.
>>> 3. Take a lock on the source inode. /* mutex, because nr_blocks can
>>>    change if a write comes in now from user space */
>>> 4. Read the number of blocks.
>>> 5. Allocate the same number of blocks for the dummy ghost inode. /*
>>>    the chain will be created automatically */
>>> 6. Read the source buffer head of each block of the source inode and
>>>    the destination buffer head of the corresponding block of the
>>>    destination inode.
>>> 7. dest_buffer->b_data = source_buffer->b_data; /* it's a char * and
>>>    this is where the trick is */
>>> 8. Mark the destination buffer dirty.
>>>
>>> Perform 6, 7, 8 for all the blocks (see the sketch just after this list).
>>>
>>> 9. Swap the 15 entries of src_inode->i_data[] and
>>>    dest_dummy_inode->i_data[]. /* This helps us simply avoid copying
>>>    the block numbers back from the destination dummy inode to the
>>>    source inode */
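To make steps 6-8 concrete, here is a rough sketch of the per-block copy,
assuming both inodes live on the same superblock and the physical block
numbers (src_blk, dst_blk) have already been resolved. Note that it
memcpy()s the block contents rather than assigning the b_data pointer as
step 7 literally says; copy_one_block() is a hypothetical helper, not the
OHSM code:

#include <linux/fs.h>
#include <linux/buffer_head.h>
#include <linux/string.h>

/* Sketch of steps 6-8 for a single block.  src_blk and dst_blk are
 * assumed to be already-allocated physical block numbers. */
static int copy_one_block(struct super_block *sb,
                          sector_t src_blk, sector_t dst_blk)
{
        struct buffer_head *src_bh, *dst_bh;

        src_bh = sb_bread(sb, src_blk);     /* step 6: source buffer head */
        if (!src_bh)
                return -EIO;
        dst_bh = sb_getblk(sb, dst_blk);    /* step 6: destination buffer head */
        if (!dst_bh) {
                brelse(src_bh);
                return -EIO;
        }

        lock_buffer(dst_bh);
        memcpy(dst_bh->b_data, src_bh->b_data, sb->s_blocksize); /* step 7 */
        set_buffer_uptodate(dst_bh);
        unlock_buffer(dst_bh);
        mark_buffer_dirty(dst_bh);          /* step 8 */

        brelse(dst_bh);
        brelse(src_bh);
        return 0;
}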
>>
>> I don't know anything about LVM, so this might be a dumb question. Why
>> is this required?
>>
> See, the point here is that we will have a single namespace, meaning a
> single file system over all this underlying storage. LVM is a tool
> which provides you an API to create logical devices over physical ones.
> It uses the Device Mapper inside the kernel. The device mapper keeps
> the mapping between the logical device and all the underlying physical
> devices.
>
> Now, we require this for defining our storage classes (tiers). At the
> time of defining the allocation and relocation policy itself, we accept
> the information as a (dev, tier) list.
> We pass this information to our OHSM module inside the kernel, extract
> the mapping from the device mapper, and keep it in the OHSM metadata,
> which is later referred to for all allocation and relocation processes.
>
>> Did you mean swapping all the block numbers rather than just the [15]?
>
> See, the point here is that if we copy each and every new block number
> to the old inode

Ohh yes... I got thoroughly confused by your word "swap". I thought it
was just like your "tricky swap" :-) and not the "copy and swap" which
you actually meant.

> and try to free each block from the old inode, then we will have the
> overhead of freeing each and every old block and, at the end, freeing
> the dummy inode that was created.
>
> So, what we did was swap the [15] pointers of both the inodes.
> See, on Linux the arrangement is something like this:
>
> i_data[0]  -> direct pointer to a data block
> i_data[1]  -> direct pointer to a data block
> i_data[2]  -> direct pointer to a data block
> i_data[3]  -> direct pointer to a data block
> i_data[4]  -> direct pointer to a data block
> i_data[5]  -> direct pointer to a data block
> i_data[6]  -> direct pointer to a data block
> i_data[7]  -> direct pointer to a data block
> i_data[8]  -> direct pointer to a data block
> i_data[9]  -> direct pointer to a data block
> i_data[10] -> direct pointer to a data block
> i_data[11] -> direct pointer to a data block
> i_data[12] -> single indirect block
> i_data[13] -> double indirect block
> i_data[14] -> triple indirect block
>
> For all the pointers, we just swap between the inodes. This works
> because for direct blocks it is trivial, and for indirect blocks it is
> just like swapping the roots of the chains of blocks, which eventually
> changes everything below them.
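Step 9 on ext3's in-core inode would then look something like the sketch
below. EXT3_I(), ext3_inode_info and the 15-slot i_data[] array are real
ext3 structures; swap_block_pointers() itself is a hypothetical helper,
and the locks taken in step 3 are assumed to still be held:

#include <linux/fs.h>
#include <linux/ext3_fs.h>
#include <linux/ext3_fs_i.h>

/* Sketch of step 9: swap all 15 block pointers (direct pointers and
 * indirect roots) between the source inode and the dummy ghost inode. */
static void swap_block_pointers(struct inode *src, struct inode *ghost)
{
        struct ext3_inode_info *src_ei = EXT3_I(src);
        struct ext3_inode_info *ghost_ei = EXT3_I(ghost);
        int i;

        for (i = 0; i < EXT3_N_BLOCKS; i++) {
                __le32 tmp = src_ei->i_data[i];

                src_ei->i_data[i] = ghost_ei->i_data[i];
                ghost_ei->i_data[i] = tmp;
        }
        mark_inode_dirty(src);
        mark_inode_dirty(ghost);
}

Swapping the indirect roots moves each whole chain in one shot, which is
why no per-block copy-back is needed.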
Of course... another thing which you might want is to copy only the
block numbers which fall within the range of your inode->i_size. This
might help in case of corruption.

> Now, we simply free the dummy inode with a standard FS function, which
> performs the cleanup of the inode as well as of the blocks we wanted
> to free.
> It reduces our work, and obviously the cleanup code already in the FS
> is more trustworthy :P
>
>> Here, is src_inode the VFS "struct inode" or the FS-specific
>> struct FS_inode_info? I didn't get this completely; can you explain
>> this point a bit more?
>>
> See, what we do is take a lock on the VFS inode and then perform the
> job of moving the data blocks on the FS in-core inode (FS_inode_info).
>
> So, this will be the in-core inode.
> Also, the VFS inode doesn't have pointers to the data blocks. The data
> block pointers (i_data) are present in the in-core and on-disk inode
> structures.
>
> Hope this answers your query. Let me know if you have more.

Unless I missed it, I didn't get an answer to my earlier question about
*special writing* the inode and maintaining consistency.

Thanks -
Manish

>
>> Thanks -
>> Manish
>>
>>> /* This also helps us simply destroy the inode, which will eventually
>>>    free all the blocks, which otherwise we would have been freeing
>>>    separately */
>>>
>>> 9.1 Release the mutex on the src inode.
>>>
>>> 10. Set the I_FREEING bit in dest_inode->i_state.
>>>
>>> 11. Call FS_delete_inode(dest_inode).
>>>
>>> Any application which has already opened this inode for read/write
>>> and tries to do a read/write while the mutex lock is held will be
>>> queued.
>>>
>>> Thanks a lot Greg, Manish, Peter and all the others for all your
>>> valuable inputs and help.
>>>
>>>> --
>>>> Regards,
>>>> Peter Teoh
>>>
>>> --
>>> Regards,
>>> Sandeep.
>>>
>>> "To learn is to change. Education is a process that changes the learner."
>
> --
> Regards,
> Sandeep.
>
> "To learn is to change. Education is a process that changes the learner."

--
To unsubscribe from this list: send an email with
"unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ