Re: Copying Data Blocks

"Sandeep K Sinha" <sandeepksinha@xxxxxxxxx> · Tue, 13 Jan 2009 13:21:32 +0530

Hi Manish,

On Mon, Jan 12, 2009 at 11:48 PM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote:
> On Mon, Jan 12, 2009 at 11:31 PM, Sandeep K Sinha
> <sandeepksinha@xxxxxxxxx> wrote:
>> Hi Peter,
>>
>> On Mon, Jan 12, 2009 at 9:49 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>> On Mon, Jan 12, 2009 at 4:26 PM, Sandeep K Sinha
>>> <sandeepksinha@xxxxxxxxx> wrote:
>>>> Hi Peter,
>>>>
>>>> Don't you think that it will restrict this to a specific file system?
>>>> Shouldn't the VFS inode be used rather than the FS incore inode?
>>>>
>>>
>>> The VFS has functions such as fsync_buffers_list() and
>>> invalidate_inode_buffers(), and these seem to use a spinlock for
>>> syncing:
>>>
>>> void invalidate_inode_buffers(struct inode *inode)
>>> {
>>>        if (inode_has_buffers(inode)) {
>>>                struct address_space *mapping = &inode->i_data;
>>>                struct list_head *list = &mapping->private_list;
>>>                struct address_space *buffer_mapping = mapping->assoc_mapping;
>>>
>>>                spin_lock(&buffer_mapping->private_lock);
>>>                while (!list_empty(list))
>>>                        __remove_assoc_queue(BH_ENTRY(list->next));
>>>                        /* ======> modify this for writing out the data instead. */
>>>                spin_unlock(&buffer_mapping->private_lock);
>>>        }
>>> }
>>> EXPORT_SYMBOL(invalidate_inode_buffers);
>>>
>>>
>>>> The purpose is to put all I/O to sleep while we are updating the i_data
>>>> of the old inode from the new inode (updating the data blocks).
>>>>
>>>> I think i_alloc_sem should work here, but could not find any instance
>>>> of its use in the code.
>>>
>>> For the case of ext3's block allocation, the lock seems to be
>>> truncate_mutex - read the comment:
>>>
>>>        /*
>>>         * From here we block out all ext3_get_block() callers who want to
>>>         * modify the block allocation tree.
>>>         */
>>>        mutex_lock(&ei->truncate_mutex);
>>>
>>> So while it is building the tree, the mutex will lock it.
>>>
>>> And the remarks for ext3_get_blocks_handle() are:
>>>
>>> /*
>>>  * Allocation strategy is simple: if we have to allocate something, we will
>>>  * have to go the whole way to leaf. So let's do it before attaching anything
>>>  * to tree, set linkage between the newborn blocks, write them if sync is
>>>  * required, recheck the path, free and repeat if check fails, otherwise
>>>  * set the last missing link (that will protect us from any truncate-generated
>>> ...
>>>
>>> Reading the source further down, you see the mutex_lock() (where
>>> multiblock allocation is needed), and after the lock all the block
>>> allocation/merging etc. is done:
>>>
>>>        /* Next simple case - plain lookup or failed read of indirect block */
>>>        if (!create || err == -EIO)
>>>                goto cleanup;
>>>
>>>        mutex_lock(&ei->truncate_mutex);
>>> <snip>
>>>        count = ext3_blks_to_allocate(partial, indirect_blks,
>>>                                        maxblocks, blocks_to_boundary);
>>> <snip>
>>>        err = ext3_alloc_branch(handle, inode, indirect_blks, &count, goal,
>>>                                offsets + (partial - chain), partial);
>>>
>>>
>>>> It's working fine currently with i_mutex, meaning if we hold a i_mutex
>>>
>>> As far as I know, i_mutex is used when modifying the inode's structural
>>> information:
>>>
>>> grep for i_mutex in fs/ext3/ioctl.c: every time there is a need to
>>> maintain the inode's structural info, i_mutex is taken.
>>>
>>>> lock on the inode while updating the i_data pointers.
>>>> And try to perform i/o from user space, they are queued. The file was
>>>> opened in r/w mode prior to taking the lock inside the kernel.
>>>>
>>>> But, I still feel i_alloc_sem would be the right option to go ahead with.
>>>>
>>>> On Mon, Jan 12, 2009 at 1:11 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>>> If you grep for spinlock, mutex, or "sem" in the fs/ext4 directory, you
>>>>> can find that all three types of lock are used - for different classes
>>>>> of object.
>>>>>
>>>>> For data blocks I guess it is a semaphore - read this
>>>>> fs/ext4/inode.c:ext4_get_branch():
>>>>>
>>>>> /**
>>>>>  *      ext4_get_branch - read the chain of indirect blocks leading to data
>>>>> <snip>
>>>>>  *
>>>>>  *      Need to be called with
>>>>>  *      down_read(&EXT4_I(inode)->i_data_sem)
>>>>>  */
>>>>>
>>>>> I guess you have no choice: as it is a semaphore, you have to follow
>>>>> the rest of the kernel for consistency - don't create your own
>>>>> semaphore :-).
>>>>>
>>>>> There exists i_lock as a spinlock - which as far as I know is for
>>>>> i_blocks counting purposes:
>>>>>
>>>>>        spin_lock(&inode->i_lock);
>>>>>        inode->i_blocks += tmp_inode->i_blocks;
>>>>>        spin_unlock(&inode->i_lock);
>>>>>        up_write(&EXT4_I(inode)->i_data_sem);
>>>>>
>>>>> But for data it should be i_data_sem.   Is that correct?
>>>>>
>>>>> On Mon, Jan 12, 2009 at 2:18 PM, Rohit Sharma <imreckless@xxxxxxxxx> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I am having some issues in locking inode while copying data blocks.
>>>>>> We are trying to keep file system live during this operation, so
>>>>>> both read and write operations should work.
>>>>>> In this case what type of lock on inode should be used, semaphore,
>>>>>> mutex or spinlock?
>>>>>>
>>>>>>
>>>>>> On Sun, Jan 11, 2009 at 8:45 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>>>>> Sorry, some mistakes. A resend:
>>>>>>>
>>>>>>> Here are some tips on the blockdevice API:
>>>>>>>
>>>>>>> http://lkml.org/lkml/2006/1/24/287
>>>>>>> http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-01/msg09388.html
>>>>>>>
>>>>>>> As indicated, documentation is rather sparse in this area.
>>>>>>>
>>>>>>> Not sure if anyone else has a summary list of the blockdevice API and
>>>>>>> its explanation?
>>>>>>>
>>>>>>> Now, wrt the following "cleanup patch", I am not sure how the API will change:
>>>>>>>
>>>>>>> http://lwn.net/Articles/304485/
>>>>>>>
>>>>>>> thanks.
>>>>>>>
>>>>>>> On Tue, Jan 6, 2009 at 6:36 PM, Rohit Sharma <imreckless@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> I want to read data blocks from one inode
>>>>>>>> and copy it to other inode.
>>>>>>>>
>>>>>>>> I mean to copy data from data blocks associated with one inode
>>>>>>>> to the data blocks associated with other inode.
>>>>>>>>
>>>>>>>> Is that possible in kernel space?
>>>>>>>> --
>>>>>
>>>
>>> comments ????
>>
>> That's very right!
>>
>> So, finally we were able to perform the copy operation successfully.
>>
>> We did something like this and we named it "ohsm's tricky copy".
>> Rohit will soon be uploading a new doc on the fscops page which
>> will detail it further.
>
> Thanks let us know when the docs and the *source code* is available ;-)
>
>>
>> 1. Read the source inode.
>> 2. Allocate a new ghost inode.
>> 3. Take a lock on the source inode. /* mutex , because the nr_blocks
>> can change if write comes now from user space */
>> 4. Read the number of blocks.
>>>
>> 5. Allocate the same number of blocks for the dummy ghost inode. /*
>> the chain will be created automatically */
>> 6. Read the source buffer head of the blocks from source inode and
>> destination buffer head of the blocks of the destination inode.
>>
>> 7. dest_buffer->b_data = source_buffer->b_data ; /* it's a char * and
>> this is where the trick is */
>> 8. mark the destination buffer dirty.
>>
>> perform 6,7,8 for all the blocks.
>>
>> 9. swap the src_inode->i_data[15] and dest_dummy_inode->i_data[15]; /*
>> This helps us to simply avoid copying the block number back from
>> destination dummy inode to source inode */
>
> I don't know anything about LVM, so this might be a dumb question. Why
> is this required ?
>
See, the point here is that we will have a single namespace, meaning a
single file system over all of the underlying storage. LVM is a tool
that provides an API to create logical devices on top of physical ones.
It uses the Device Mapper inside the kernel, and the device mapper keeps
the mapping between the logical device and all the underlying physical
devices.

Now, we require this for defining our storage classes (tiers). At the
time of defining the allocation and relocation policy itself, we
accept a list of (dev, tier) pairs.
We pass this information to our OHSM module inside the kernel, which
extracts the mapping from the device mapper and keeps it in the OHSM
metadata. That metadata is later referred to for all allocation and
relocation processes.
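
To make the (dev, tier) idea concrete, here is a rough user-space sketch of the kind of lookup that metadata allows. The struct tier_extent and tier_of_block() names are made up for illustration; they are not the OHSM code and not a device mapper interface. The assumption is that each contiguous range of the logical device is backed by one physical device with one tier number.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: one contiguous range of the logical device and the tier
 * (storage class) of the physical device that backs it. */
struct tier_extent {
    unsigned long start;   /* first logical block of the range */
    unsigned long len;     /* number of blocks in the range */
    int tier;              /* administrator-assigned tier number */
};

/* Return the tier backing logical block `blk`, or -1 if it is unmapped. */
static int tier_of_block(const struct tier_extent *map, size_t n,
                         unsigned long blk)
{
    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* blk - start cannot underflow because blk >= start is checked first */
        if (blk >= map[i].start && blk - map[i].start < map[i].len)
            return map[i].tier;
    }
    return -1;
}
```

An allocation policy could then ask for blocks only from ranges whose tier matches the one the policy names for the file.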

> Did you mean swapping all the block numbers rather than just the [15] ??
See, the point here is that if we copied each and every new block number
to the old inode and freed each block from the old inode ourselves, we
would have the overhead of freeing each and every old block and, at the
end, freeing the dummy inode that was created.

So, what we did was swap all 15 i_data pointers of the two inodes.
See, on ext2/ext3 the arrangement is something like this.

i_data[0] -> direct pointer to a data block
i_data[1] -> direct pointer to a data block
i_data[2] -> direct pointer to a data block
i_data[3] -> direct pointer to a data block
i_data[4] -> direct pointer to a data block
i_data[5] -> direct pointer to a data block
i_data[6] -> direct pointer to a data block
i_data[7] -> direct pointer to a data block
i_data[8] -> direct pointer to a data block
i_data[9] -> direct pointer to a data block
i_data[10] -> direct pointer to a data block
i_data[11] -> direct pointer to a data block
i_data[12] -> single indirect block
i_data[13] -> double indirect block
i_data[14] -> triple indirect block
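
For reference, the ext2/ext3 headers fix this layout with constants (EXT2_NDIR_BLOCKS = 12, then EXT2_IND_BLOCK, EXT2_DIND_BLOCK, EXT2_TIND_BLOCK, and EXT2_N_BLOCKS = 15). A small user-space sketch of the same arithmetic follows; the enum names are shortened and classify_slot() is a made-up helper, only there to show which slot is which.

```c
#include <assert.h>

/* Same values as the ext2/ext3 header constants, renamed so this
 * compiles in user space without kernel headers. */
enum {
    NDIR_BLOCKS = 12,               /* i_data[0..11]: direct blocks */
    IND_BLOCK   = NDIR_BLOCKS,      /* i_data[12]: single indirect */
    DIND_BLOCK  = IND_BLOCK + 1,    /* i_data[13]: double indirect */
    TIND_BLOCK  = DIND_BLOCK + 1,   /* i_data[14]: triple indirect */
    N_BLOCKS    = TIND_BLOCK + 1    /* 15 slots in total */
};

enum slot_kind { SLOT_DIRECT, SLOT_IND, SLOT_DIND, SLOT_TIND };

/* Hypothetical helper: say what kind of pointer lives in slot `idx`. */
static enum slot_kind classify_slot(int idx)
{
    assert(idx >= 0 && idx < N_BLOCKS);
    if (idx < NDIR_BLOCKS)
        return SLOT_DIRECT;
    if (idx == IND_BLOCK)
        return SLOT_IND;
    if (idx == DIND_BLOCK)
        return SLOT_DIND;
    return SLOT_TIND;
}
```

So classify_slot(11) is still a direct slot and classify_slot(12) is the single indirect one.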

We simply swap every pointer between the two inodes. This works
because for direct blocks it is trivial, and for indirect blocks
it is just like swapping the roots of the chains of blocks, which
eventually changes everything below them.

Now we simply free the dummy inode with a standard FS function, which
cleans up the inode as well as the blocks that we
wanted to free.
This reduces our work, and obviously the cleanup code already existing
in the FS would be more trustworthy :P
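
The swap itself is tiny. Here is a user-space sketch of the idea only, not our actual module code: struct fs_inode_info is a cut-down stand-in for the FS incore inode, swap_i_data() is a made-up name, and the caller is assumed to already hold the lock that keeps I/O off the source inode.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N_BLOCKS 15

/* Stand-in for the FS incore inode; only the block-pointer array matters. */
struct fs_inode_info {
    uint32_t i_data[N_BLOCKS];
};

/* Exchange all 15 block pointers between the source and the ghost inode.
 * Afterwards the source owns the new blocks and the ghost owns the old
 * ones, so deleting the ghost frees the old blocks. */
static void swap_i_data(struct fs_inode_info *src, struct fs_inode_info *ghost)
{
    uint32_t tmp[N_BLOCKS];

    assert(src != NULL && ghost != NULL && src != ghost);
    memcpy(tmp, src->i_data, sizeof(tmp));
    memcpy(src->i_data, ghost->i_data, sizeof(tmp));
    memcpy(ghost->i_data, tmp, sizeof(tmp));
}
```

Because slots 12 to 14 are roots of indirect chains, exchanging them moves the whole chain in one assignment; nothing below the root has to be touched.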

> Here src_inode is the
> vfs "struct inode" or the
> FS specific struct FS_inode_info ???  i didn't get this completely,
> can you explain this point a bit more.
>

See, what we do is take a lock on the VFS inode and then
perform the job of moving the data blocks on the FS incore inode
(FS_inode_info).

So, this will be the incore inode.
Also, the VFS inode doesn't have pointers to the data blocks. The data
block pointers (i_data) are present in the incore and on-disk inode
structures.
Hope this answers your query. Let me know if you have more.

> Thanks -
> Manish
>
>
>> /* This also helps to simply destroy the inode, which will eventually
>> free all the blocks, which otherwise we would have been doing
>> separately */
>>
>> 9.1 Release the mutex on the src inode.
>>
>> 10. set the bit for I_FREEING in dest_inode->i_state.
>>
>> 11. call FS_delete_inode(dest_inode);
>>
>> If any application that has already opened this inode for read/write
>> tries to read or write while the mutex is held, it will be
>> queued.
>>
>>>
>>
>> Thanks a lot Greg, Manish, Peter and all others for all your valuable
>> inputs and help.
>>
>>> --
>>> Regards,
>>> Peter Teoh
>>>
>>
>> --
>> Regards,
>> Sandeep.
>>
>>
>>
>>
>>
>> "To learn is to change. Education is a process that changes the learner."
>>
>> --
>> To unsubscribe from this list: send an email with
>> "unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
>> Please read the FAQ at http://kernelnewbies.org/FAQ
>>
>>
>

-- 
Regards,
Sandeep.

"To learn is to change. Education is a process that changes the learner."
