Re: Copying Data Blocks

On Tue, Jan 13, 2009 at 1:21 PM, Sandeep K Sinha
<sandeepksinha@xxxxxxxxx> wrote:
> Hi Manish,
>
>
> On Mon, Jan 12, 2009 at 11:48 PM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote:
>> On Mon, Jan 12, 2009 at 11:31 PM, Sandeep K Sinha
>> <sandeepksinha@xxxxxxxxx> wrote:
>>> Hi Peter,
>>>
>>> On Mon, Jan 12, 2009 at 9:49 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>> On Mon, Jan 12, 2009 at 4:26 PM, Sandeep K Sinha
>>>> <sandeepksinha@xxxxxxxxx> wrote:
>>>>> Hi Peter,
>>>>>
>>>>> Don't you think that this will restrict it to a specific file system?
>>>>> Shouldn't the VFS inode be used rather than the FS in-core inode?
>>>>>
>>>>
>>>> The VFS has APIs for this: fsync_buffer_list() and
>>>> invalidate_inode_buffers(), and these APIs seem to use a spinlock for
>>>> syncing:
>>>>
>>>> void invalidate_inode_buffers(struct inode *inode)
>>>> {
>>>>        if (inode_has_buffers(inode)) {
>>>>                struct address_space *mapping = &inode->i_data;
>>>>                struct list_head *list = &mapping->private_list;
>>>>                struct address_space *buffer_mapping = mapping->assoc_mapping;
>>>>
>>>>                spin_lock(&buffer_mapping->private_lock);
>>>>                while (!list_empty(list))
>>>>                        __remove_assoc_queue(BH_ENTRY(list->next));
>>>> ======> modify this to write out the data instead.
>>>>                spin_unlock(&buffer_mapping->private_lock);
>>>>        }
>>>> }
>>>> EXPORT_SYMBOL(invalidate_inode_buffers);
>>>>
>>>>
>>>>> The purpose is to put all the I/Os to sleep while we are updating the
>>>>> i_data from the new inode to the old inode (the update of the data blocks).
>>>>>
>>>>> I think i_alloc_sem should work here, but I could not find any instance
>>>>> of its use in the code.
>>>>
>>>> for the case of ext3's block allocation, the lock seems to be
>>>> truncate_mutex - read the comment:
>>>>
>>>>        /*
>>>>         * From here we block out all ext3_get_block() callers who want to
>>>>         * modify the block allocation tree.
>>>>         */
>>>>        mutex_lock(&ei->truncate_mutex);
>>>>
>>>> So while it is building the tree, the mutex will lock it.
>>>>
>>>> And the remarks for ext3_get_blocks_handle() are:
>>>>
>>>> /*
>>>>  * Allocation strategy is simple: if we have to allocate something, we will
>>>>  * have to go the whole way to leaf. So let's do it before attaching anything
>>>>  * to tree, set linkage between the newborn blocks, write them if sync is
>>>>  * required, recheck the path, free and repeat if check fails, otherwise
>>>>  * set the last missing link (that will protect us from any truncate-generated
>>>> ...
>>>>
>>>> Reading the source... go down and see the mutex_lock() (where
>>>> multiblock allocations are needed); after taking the lock, all the block
>>>> allocation/merging etc. is done:
>>>>
>>>>        /* Next simple case - plain lookup or failed read of indirect block */
>>>>        if (!create || err == -EIO)
>>>>                goto cleanup;
>>>>
>>>>        mutex_lock(&ei->truncate_mutex);
>>>> <snip>
>>>>        count = ext3_blks_to_allocate(partial, indirect_blks,
>>>>                                        maxblocks, blocks_to_boundary);
>>>> <snip>
>>>>        err = ext3_alloc_branch(handle, inode, indirect_blks, &count, goal,
>>>>                                offsets + (partial - chain), partial);
>>>>
>>>>
>>>>> It's working fine currently with i_mutex, meaning if we hold an i_mutex
>>>>
>>>> as far as I know, i_mutex is used for modifying an inode's structural
>>>> information:
>>>>
>>>> grep for i_mutex in fs/ext3/ioctl.c - every time there is a need to
>>>> maintain the inode's structural info, a lock on i_mutex is taken.
>>>>
>>>>> lock on the inode while updating the i_data pointers and then try to
>>>>> perform I/O from user space, the I/Os get queued. The file was
>>>>> opened in r/w mode prior to taking the lock inside the kernel.
>>>>>
>>>>> But, I still feel i_alloc_sem would be the right option to go ahead with.
>>>>>
>>>>> On Mon, Jan 12, 2009 at 1:11 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>>>> If you grep for spinlock, mutex, or "sem" in the fs/ext4 directory, you
>>>>>> will find all three types of lock in use - for different classes of
>>>>>> object.
>>>>>>
>>>>>> For data blocks my guess is a semaphore - read
>>>>>> fs/ext4/inode.c:ext4_get_branch():
>>>>>>
>>>>>> /**
>>>>>>  *      ext4_get_branch - read the chain of indirect blocks leading to data
>>>>>> <snip>
>>>>>>  *
>>>>>>  *      Need to be called with
>>>>>>  *      down_read(&EXT4_I(inode)->i_data_sem)
>>>>>>  */
>>>>>>
>>>>>> I guess you have no choice: as it is a semaphore, you have to follow the
>>>>>> rest of the kernel for consistency - don't create your own semaphore :-).
>>>>>>
>>>>>> There is also i_lock, a spinlock - which as far as I know is for
>>>>>> i_blocks accounting purposes:
>>>>>>
>>>>>>        spin_lock(&inode->i_lock);
>>>>>>        inode->i_blocks += tmp_inode->i_blocks;
>>>>>>        spin_unlock(&inode->i_lock);
>>>>>>        up_write(&EXT4_I(inode)->i_data_sem);
>>>>>>
>>>>>> But for data it should be i_data_sem.   Is that correct?
>>>>>>
>>>>>> On Mon, Jan 12, 2009 at 2:18 PM, Rohit Sharma <imreckless@xxxxxxxxx> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am having some issues locking the inode while copying data blocks.
>>>>>>> We are trying to keep file system live during this operation, so
>>>>>>> both read and write operations should work.
>>>>>>> In this case what type of lock on inode should be used, semaphore,
>>>>>>> mutex or spinlock?
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Jan 11, 2009 at 8:45 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>>>>>> Sorry... some mistakes; resending:
>>>>>>>>
>>>>>>>> Here are some tips on the blockdevice API:
>>>>>>>>
>>>>>>>> http://lkml.org/lkml/2006/1/24/287
>>>>>>>> http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-01/msg09388.html
>>>>>>>>
>>>>>>>> as indicated, documentation is rather sparse in this area.
>>>>>>>>
>>>>>>>> Not sure if anyone else has a summary list of the blockdevice API and
>>>>>>>> its explanation?
>>>>>>>>
>>>>>>>> Now, wrt the following "cleanup patch", I am not sure how the API will change:
>>>>>>>>
>>>>>>>> http://lwn.net/Articles/304485/
>>>>>>>>
>>>>>>>> thanks.
>>>>>>>>
>>>>>>>> On Tue, Jan 6, 2009 at 6:36 PM, Rohit Sharma <imreckless@xxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> I want to read data blocks from one inode
>>>>>>>>> and copy them to another inode.
>>>>>>>>>
>>>>>>>>> I mean, to copy data from the data blocks associated with one inode
>>>>>>>>> to the data blocks associated with another inode.
>>>>>>>>>
>>>>>>>>> Is that possible in kernel space?
>>>>>>>>> --
>>>>>>
>>>>
>>>> comments ????
>>>
>>> That's very right!
>>>
>>> So, finally we were able to perform the copy operation successfully.
>>>
>>> We did something like this and we named it "OHSM's tricky copy".
>>> Rohit will soon be uploading a new doc on the fscops page which
>>> will detail it further.
>>
>> Thanks, let us know when the docs and the *source code* are available ;-)
>>
>>>
>>> 1. Read the source inode.
>>> 2. Allocate a new ghost inode.
>>> 3. Take a lock on the source inode. /* mutex, because nr_blocks
>>> can change if a write comes in from user space */
>>> 4. Read the number of blocks.
>>> 5. Allocate the same number of blocks for the dummy ghost inode. /*
>>> the chain will be created automatically */
>>> 6. Read the source buffer heads of the blocks from the source inode and
>>> the destination buffer heads of the blocks of the destination inode.
>>>
>>> 7. dest_buffer->b_data = source_buffer->b_data; /* it's a char * and
>>> this is where the trick is */
>>> 8. Mark the destination buffer dirty.
>>>
>>> Perform 6, 7, 8 for all the blocks.
>>>
>>> 9. Swap src_inode->i_data[15] and dest_dummy_inode->i_data[15]. /*
>>> This helps us simply avoid copying the block numbers back from the
>>> destination dummy inode to the source inode */
>>
>> I don't know anything about LVM, so this might be a dumb question. Why
>> is this required ?
>>
> See, the point here is that we will have a single namespace, meaning a
> single file system over all this underlying storage. LVM is a tool
> which provides APIs to create logical devices over physical ones. It
> uses the Device Mapper inside the kernel. The device mapper keeps the
> mapping between the logical device and all the underlying
> physical devices.
>
> Now, we require this for defining our storage classes (tiers). At the
> time of defining the allocation and relocation policy itself, we
> accept information as a (dev, tier) list.
> We pass this information to our OHSM module inside the kernel,
> extract the mapping from the device mapper and keep it in the OHSM
> metadata, which is later referred to for all allocation and
> relocation processes.
>
>> Did you mean swapping all the block numbers rather than just the [15] ??
> See the point here is that if we copy each and every new block number
> to the old inode

Ohh yes... I got thoroughly confused by your word "swap". I thought
it was just like your "tricky swap" :-) and not the "copy and swap"
which you actually meant.

> and try to free each block from the old inode then we
> will have the overhead of freeing each and every old block and at the
> end freeing the dummy inode that would be created.
>
> So, what we did was that we swapped the [15] pointers of both the inodes.
> See, on Linux the arrangement is something like this:
>
> i_data[0] .. i_data[11] -> direct pointers to data blocks
> i_data[12] -> single indirect block
> i_data[13] -> double indirect block
> i_data[14] -> triple indirect block
>
>
> For all the pointers, we just swap between the inodes. It works
> because for direct blocks it's pretty trivial, and for indirect blocks
> it's just like swapping the roots of the chains of blocks, which
> eventually changes everything.

Of course... another thing you might want is to copy only the
block numbers which fall within the range of your inode->i_size. This
might help in case of corruption.

>
> Now, we simply free the dummy inode with a standard FS function, which
> performs the cleanup for the inode and for the blocks we
> wanted to free.
> It reduces our work, and obviously the cleanup code already existing in
> the FS is more trustworthy :P
>
>> Here, is src_inode the
>> VFS "struct inode" or the
>> FS-specific struct FS_inode_info? I didn't get this completely;
>> can you explain this point a bit more.
>>
>
> See, what we do is take a lock on the VFS inode and then
> perform the job of moving the data blocks in the FS in-core inode
> (FS_inode_info).
>
> So, this will be the in-core inode.
> Also, the VFS inode doesn't have pointers to the data blocks. The data
> block pointers (i_data) are present in the in-core and on-disk inode
> structures.
>
> Hope this answers your query. Let me know if you have more.

Unless I missed it, I didn't get an answer to my earlier question about
*special writing* the inode and maintaining consistency.

Thanks -
Manish
>
>
>
>> Thanks -
>> Manish
>>
>>
>>> /* This also helps us to simply destroy the inode, which will eventually
>>> free all the blocks, which we would otherwise have had to free
>>> separately */
>>>
>>> 9.1 Release the mutex on the source inode.
>>>
>>> 10. Set the I_FREEING bit in dest_inode->i_state.
>>>
>>> 11. Call FS_delete_inode(dest_inode);
>>>
>>> Any application which has already opened this inode for read/write and
>>> tries to read/write while the mutex lock is held will be
>>> queued.
>>>
>>>
>>> Thanks a lot Greg,Manish, Peter and all others for all your valuable
>>> inputs and help.
>>>
>>>> --
>>>> Regards,
>>>> Peter Teoh
>>>>
>>>
>>> --
>>> Regards,
>>> Sandeep.
>>>
>>>
>>>
>>>
>>>
>>> "To learn is to change. Education is a process that changes the learner."
>>>
>>> --
>>> To unsubscribe from this list: send an email with
>>> "unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
>>> Please read the FAQ at http://kernelnewbies.org/FAQ
>>>
>>>
>>
>
>
>
> --
> Regards,
> Sandeep.
>
>
>
>
>
>
> "To learn is to change. Education is a process that changes the learner."
>


