On Wed, Jan 14, 2009 at 10:44 AM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote: > On Wed, Jan 14, 2009 at 10:32 AM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote: >> thinking deeper, i have a concern for you. >> >> based on the VFS layering concept, u have no problem if the internals of ext3 are not touched. but now u seem to be doing a lot of stuff at the ext2/ext3 layer. > > IIRC they are dealing only with ext2 currently > That's very right! The code-level work is being done for ext2. But, ya, nothing much needs to be done for ext3; we have looked into its design. >> >> i was reading about the reservation concept in ext2/ext3 and found it quite complex. blocks can be preallocated and put on the reservation list, but at the same time, it can and should be possible for them to be taken away by another file, if that file needs the block for use. ie, reservation does not guarantee u ownership of that block, but is based on "polite courtesy" (as i read somewhere): other parts of ext2/ext3 are supposed to avoid using those blocks. but if storage space is really low....well...and since that block is not being used...it should be reassigned to someone. > > Correct.... but that also raises a few more questions. Sandeep, do you have any pre-requisites about the sizing of disks for OHSM to work? > For example, let's say I have 3 disks d1, d2 & d3 in descending order of speed. Do all of them have to be of the same size? If they are then you No, they don't need to be. We would never like to have such restrictions. Obviously, the cheaper disks would be bigger than the expensive disks, and the same goes based on speed too. Also, the tiers will never be logically identical at any instant, even if the disk sizes are the same. Suppose during the first relocation only a few files qualify and they are relocated to some other tier. Then your assumption of same sizes will fail, right? Correct me if I got you wrong.
> really don't need to worry much about space preallocation, because you > know that if you had space in d1 to allocate an inode in first place, > you can replicate the same layout in d2. > Manish, if you can refer to http://fscops.googlecode.com/files/OHSM_Relocation_faqs_0.1.pdf It mentions that we check for the amount of space required on the destination tier to proceed with relocation. If it's not there, we ask the admin to free some space and issue an IOCTL (which will be a command for him) to tell us that he is done, so that we can start relocating. He will have the facility to either re-trigger or just say that he has freed space. The reason is that re-triggering will cause the whole FS to be scanned again. > Problem comes when d2 is less than d1. Is it possible that you migrate > only some of the blocks to d2 and leave some in d1 if d2 runs out of > space ? > No, that won't make sense at all. See, the home_tier_id of a file signifies its property based on the initial allocation policy. Say the admin applies one allocation policy, thinking "I have allocated mp3 files on TIER 1", and then triggers relocation (the policy being to move all mp3s to tier 3). Afterwards he will have the feeling that all mp3s are fetched from tier 3, whereas it would actually be a mixture. So, in the design we decided two things: first, as said earlier, "fail but shout clearly". And secondly, we warn the user if the tier is 80% full and also, at the time of reloc, if the destination tier has less space than required. Does that sound OK to you guys? > Thanks - > Manish > >> >> ie.....when u allocate blocks and use them....do u actually update any >> reservation list? or is it necessary to do so? or are u supposed to >> read the reservation list before allocation of blocks? i am not >> sure. all these are protocols obeyed internally within ext3.
and >> since block allocation is not part of ext3 but at the blocks level, >> the API will not care about the existence of any reservation lists, >> which are part of ext3. >> >> In general, if your software is not going into the mainline kernel, my >> personal preference is NOT to do it at the ext3 layer....but higher >> than that.......noticed that fs/ext2, fs/ext3 and fs/ext4 all do not >> have any exported API for other subsystems to call? well....this is for >> internal consistency as described above. >> >> but ext3 uses jbd, so fs/jbd does have exported APIs. so anyone can >> call these exported APIs without messing up the internal consistency of >> jbd. >> >> end of the day, i may be plain wrong :-). >> >> comments? >> >> On Tue, Jan 13, 2009 at 6:41 PM, Sandeep K Sinha >> <sandeepksinha@xxxxxxxxx> wrote: >>> Hey Manish, >>> >>> On Tue, Jan 13, 2009 at 2:00 PM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote: >>>> On Tue, Jan 13, 2009 at 1:21 PM, Sandeep K Sinha >>>> <sandeepksinha@xxxxxxxxx> wrote: >>>>> Hi Manish, >>>>> >>>>> >>>>> On Mon, Jan 12, 2009 at 11:48 PM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote: >>>>>> On Mon, Jan 12, 2009 at 11:31 PM, Sandeep K Sinha >>>>>> <sandeepksinha@xxxxxxxxx> wrote: >>>>>>> Hi Peter, >>>>>>> >>>>>>> On Mon, Jan 12, 2009 at 9:49 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote: >>>>>>>> On Mon, Jan 12, 2009 at 4:26 PM, Sandeep K Sinha >>>>>>>> <sandeepksinha@xxxxxxxxx> wrote: >>>>>>>>> Hi Peter, >>>>>>>>> >>>>>>>>> Don't you think that it will restrict this to a specific file system? >>>>>>>>> The VFS inode should be used rather than the FS incore inode?
>>>>>>>>> >>
>>>>>>>> vfs has an API: fsync_buffer_list(), and invalidate_inode_buffers(), and these APIs seem to use a spinlock for syncing:
>>>>>>>>
>>>>>>>> void invalidate_inode_buffers(struct inode *inode)
>>>>>>>> {
>>>>>>>>         if (inode_has_buffers(inode)) {
>>>>>>>>                 struct address_space *mapping = &inode->i_data;
>>>>>>>>                 struct list_head *list = &mapping->private_list;
>>>>>>>>                 struct address_space *buffer_mapping = mapping->assoc_mapping;
>>>>>>>>
>>>>>>>>                 spin_lock(&buffer_mapping->private_lock);
>>>>>>>>                 while (!list_empty(list))
>>>>>>>>                         __remove_assoc_queue(BH_ENTRY(list->next)); ======> modify this for writing out the data instead.
>>>>>>>>                 spin_unlock(&buffer_mapping->private_lock);
>>>>>>>>         }
>>>>>>>> }
>>>>>>>> EXPORT_SYMBOL(invalidate_inode_buffers);
>>>>>>>>
>>>>>>>>> The purpose is to block all the I/Os while we are updating the i_data >>>>>>>>> from the new inode to the old inode (updating the data blocks). >>>>>>>>> >>>>>>>>> I think i_alloc_sem should work here, but I could not find any instance >>>>>>>>> of its use in the code. >>>>>>>> >>>>>>>> for the case of ext3's block allocation, the lock seems to be >>>>>>>> truncate_mutex - read the remark:
>>>>>>>>
>>>>>>>> /*
>>>>>>>>  * From here we block out all ext3_get_block() callers who want to
>>>>>>>>  * modify the block allocation tree.
>>>>>>>>  */
>>>>>>>> mutex_lock(&ei->truncate_mutex);
>>>>>>>>
>>>>>>>> So while it is building the tree, the mutex will lock it. >>>>>>>> >>>>>>>> And the remarks for ext3_get_blocks_handle() are:
>>>>>>>>
>>>>>>>> /*
>>>>>>>>  * Allocation strategy is simple: if we have to allocate something, we will
>>>>>>>>  * have to go the whole way to leaf.
So let's do it before attaching anything
>>>>>>>>  * to tree, set linkage between the newborn blocks, write them if sync is
>>>>>>>>  * required, recheck the path, free and repeat if check fails, otherwise
>>>>>>>>  * set the last missing link (that will protect us from any truncate-generated
>>>>>>>> ...
>>>>>>>>
>>>>>>>> reading the source....go down and see the mutex_lock() (where multiblock allocations are needed); after the lock, all the block allocation/merging etc. are done:
>>>>>>>>
>>>>>>>> /* Next simple case - plain lookup or failed read of indirect block */
>>>>>>>> if (!create || err == -EIO)
>>>>>>>>         goto cleanup;
>>>>>>>>
>>>>>>>> mutex_lock(&ei->truncate_mutex);
>>>>>>>> <snip>
>>>>>>>> count = ext3_blks_to_allocate(partial, indirect_blks,
>>>>>>>>                               maxblocks, blocks_to_boundary);
>>>>>>>> <snip>
>>>>>>>> err = ext3_alloc_branch(handle, inode, indirect_blks, &count, goal,
>>>>>>>>                         offsets + (partial - chain), partial);
>>>>>>>>
>>>>>>>>> It's working fine currently with i_mutex, meaning if we hold an i_mutex >>>>>>>> >>>>>>>> as far as i know, i_mutex is used for modifying an inode's structural information: >>>>>>>> >>>>>>>> grep for i_mutex in fs/ext3/ioctl.c; every time there is a need to >>>>>>>> maintain the inode's structural info, the lock on i_mutex is taken. >>>>>>>> >>>>>>>>> lock on the inode while updating the i_data pointers. >>>>>>>>> And try to perform I/O from user space; they are queued. The file was >>>>>>>>> opened in r/w mode prior to taking the lock inside the kernel. >>>>>>>>> >>>>>>>>> But, I still feel i_alloc_sem would be the right option to go ahead with. >>>>>>>>> >>>>>>>>> On Mon, Jan 12, 2009 at 1:11 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote: >>>>>>>>>> If u grep for spinlock, mutex, or "sem" in the fs/ext4 directory, u >>>>>>>>>> can find all three types of locks are used - for different classes of >>>>>>>>>> objects.
>>>>>>>>>> >>
>>>>>>>>>> For data blocks I guess it is a semaphore - read this in fs/ext4/inode.c:ext4_get_branch():
>>>>>>>>>>
>>>>>>>>>> /**
>>>>>>>>>>  * ext4_get_branch - read the chain of indirect blocks leading to data
>>>>>>>>>> <snip>
>>>>>>>>>>  *
>>>>>>>>>>  * Need to be called with
>>>>>>>>>>  * down_read(&EXT4_I(inode)->i_data_sem)
>>>>>>>>>>  */
>>>>>>>>>>
>>>>>>>>>> i guess u have no choice, as it is a semaphore; have to follow the rest >>>>>>>>>> of the kernel for consistency - don't create your own semaphore :-). >>>>>>>>>> >>>>>>>>>> There exists i_lock as a spinlock - which as far as i know is for i_blocks >>>>>>>>>> counting purposes:
>>>>>>>>>>
>>>>>>>>>> spin_lock(&inode->i_lock);
>>>>>>>>>> inode->i_blocks += tmp_inode->i_blocks;
>>>>>>>>>> spin_unlock(&inode->i_lock);
>>>>>>>>>> up_write(&EXT4_I(inode)->i_data_sem);
>>>>>>>>>>
>>>>>>>>>> But for data it should be i_data_sem. Is that correct? >>>>>>>>>> >>>>>>>>>> On Mon, Jan 12, 2009 at 2:18 PM, Rohit Sharma <imreckless@xxxxxxxxx> wrote: >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> I am having some issues in locking an inode while copying data blocks. >>>>>>>>>>> We are trying to keep the file system live during this operation, so >>>>>>>>>>> both read and write operations should work. >>>>>>>>>>> In this case what type of lock on the inode should be used: semaphore, >>>>>>>>>>> mutex or spinlock? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Sun, Jan 11, 2009 at 8:45 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote: >>>>>>>>>>>> Sorry.....some mistakes...a resend: >>>>>>>>>>>> >>>>>>>>>>>> Here are some tips on the blockdevice API: >>>>>>>>>>>> >>>>>>>>>>>> http://lkml.org/lkml/2006/1/24/287 >>>>>>>>>>>> http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-01/msg09388.html >>>>>>>>>>>> >>>>>>>>>>>> as indicated, documentation is rather sparse in this area. >>>>>>>>>>>> >>>>>>>>>>>> not sure if anyone else has a summary list of the blockdevice API and its >>>>>>>>>>>> explanation?
>>>>>>>>>>>> >>
>>>>>>>>>>>> also, wrt the following "cleanup patch", i am not sure how the API will change: >>>>>>>>>>>> >>>>>>>>>>>> http://lwn.net/Articles/304485/ >>>>>>>>>>>> >>>>>>>>>>>> thanks. >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Jan 6, 2009 at 6:36 PM, Rohit Sharma <imreckless@xxxxxxxxx> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> I want to read data blocks from one inode >>>>>>>>>>>>> and copy them to another inode. >>>>>>>>>>>>> >>>>>>>>>>>>> I mean to copy data from the data blocks associated with one inode >>>>>>>>>>>>> to the data blocks associated with another inode. >>>>>>>>>>>>> >>>>>>>>>>>>> Is that possible in kernel space? >>>>>>>>>>>>> -- >>>>>>>>>> >>>>>>>> >>>>>>>> comments ???? >>>>>>> >>>>>>> That's very right! >>>>>>> >>>>>>> So, finally we were able to perform the copy operation successfully. >>>>>>> >>>>>>> We did something like this and we named it "ohsm's tricky copy". >>>>>>> Rohit will soon be uploading a new doc on the fscops page which >>>>>>> will detail it further. >>>>>> >>>>>> Thanks, let us know when the docs and the *source code* are available ;-) >>>>>>
>>>>>>> 1. Read the source inode.
>>>>>>> 2. Allocate a new ghost inode.
>>>>>>> 3. Take a lock on the source inode. /* mutex, because nr_blocks can change if a write comes in now from user space */
>>>>>>> 4. Read the number of blocks.
>>>>>>> 5. Allocate the same number of blocks for the dummy ghost inode. /* the chain will be created automatically */
>>>>>>> 6. Read the source buffer heads of the blocks from the source inode and the destination buffer heads of the blocks of the destination inode.
>>>>>>> 7. dest_buffer->b_data = source_buffer->b_data; /* it's a char * and this is where the trick is */
>>>>>>> 8. Mark the destination buffer dirty.
>>>>>>>
>>>>>>> Perform 6, 7, 8 for all the blocks.
>>>>>>>
>>>>>>> 9.
swap the src_inode->i_data[15] and dest_dummy_inode->i_data[15]; /* >>>>>>> This helps us to simply avoid copying the block numbers back from the >>>>>>> destination dummy inode to the source inode */ >>>>>> >>>>>> I don't know anything about LVM, so this might be a dumb question. Why >>>>>> is this required? >>>>>> >>>>> See, the point here is that we will have a single namespace, meaning a >>>>> single file system over all this underlying storage. LVM is a tool >>>>> which provides you an API to create logical devices over physical ones. It >>>>> uses Device Mapper inside the kernel. The device mapper keeps the >>>>> mapping between the logical device and all the underlying >>>>> physical devices. >>>>> >>>>> Now, we require this for defining our storage classes (tiers). At the >>>>> time of defining the allocation and relocation policy itself, we >>>>> accept information as a (dev, tier) list. >>>>> And we pass this information to our OHSM module inside the kernel, >>>>> extract the mapping from the device mapper and keep it in the OHSM >>>>> metadata, which is later referred to for all allocation and >>>>> relocation processes. >>>>> >>>>>> Did you mean swapping all the block numbers rather than just the [15]? >>>>> See, the point here is that if we copy each and every new block number >>>>> to the old inode >>>> >>>> Ohh yes.... I got thoroughly confused by your word "swap". I thought >>>> it is just like your "tricky swap" :-) and not the "copy and swap" >>>> which you actually meant. >>>> >>>>> and try to free each block from the old inode, then we >>>>> will have the overhead of freeing each and every old block and, at the >>>>> end, freeing the dummy inode that was created. >>>>> >>>>> So, what we did was swap the [15] pointers of both the inodes. >>>>> See, on Linux the arrangement is something like this.
>>>>>
>>>>> i_data[0]  -> direct pointer to a data block
>>>>> i_data[1]  -> direct pointer to a data block
>>>>> i_data[2]  -> direct pointer to a data block
>>>>> i_data[3]  -> direct pointer to a data block
>>>>> i_data[4]  -> direct pointer to a data block
>>>>> i_data[5]  -> direct pointer to a data block
>>>>> i_data[6]  -> direct pointer to a data block
>>>>> i_data[7]  -> direct pointer to a data block
>>>>> i_data[8]  -> direct pointer to a data block
>>>>> i_data[9]  -> direct pointer to a data block
>>>>> i_data[10] -> direct pointer to a data block
>>>>> i_data[11] -> direct pointer to a data block
>>>>> i_data[12] -> single indirect block
>>>>> i_data[13] -> double indirect block
>>>>> i_data[14] -> triple indirect block
>>>>>
>>>>> For all the pointers, we just swap between the inodes. Now, it works >>>>> because for direct blocks it's pretty trivial, and for indirect blocks >>>>> it's just like swapping the roots of the chains of blocks, which >>>>> eventually changes everything. >>>> >>>> Of course........another thing which you might want is to copy only >>>> the block numbers which fall in the range of your inode->i_size. This >>>> might help in case of corruptions. >>>> >>>>> >>>>> Now, we simply free the dummy inode by a standard FS function, which >>>>> performs the cleanup of the inode and of the blocks we >>>>> wanted to free. >>>>> It reduces our work, and obviously the cleanup code already existing in the FS >>>>> is more trustworthy :P >>>>> >>>>>> Here, is src_inode the >>>>>> vfs "struct inode" or the >>>>>> FS-specific struct FS_inode_info? I didn't get this completely; >>>>>> can you explain this point a bit more. >>>>>> >>>>> >>>>> See, what we do is that we take a lock on the VFS inode and then we >>>>> perform the job of moving the data blocks from the FS incore inode >>>>> (FS_inode_info). >>>>> >>>>> So, this will be the incore inode. >>>>> Also, the VFS inode doesn't have pointers to the data blocks.
The data >>>>> block pointers (i_data) are present in the incore and on-disk inode >>>>> structures. >>>>> >>>>> Hope this answers your query. Let me know if you have more. >>>> >>>> Unless I missed it, I didn't get answers to my earlier question about >>>> *special writing* the inode and maintaining consistency. >>> >>> Here is the question that you asked.... >>> >>>>>btw how are you going to *special write* the inode? If I remember >>>>>correctly you said that you will make the filesystem read-only. I >>>>>don't know at what places in the write stack we assert the read-only >>>>>flag on the FS. One of the places IIRC is do_open(): when you try opening >>>>>the file for the first time it checks for permission. How do you plan to >>>>>deal with already open file descriptors which are in write mode? If >>>>>you have already investigated all the paths for the MS_RDONLY flag, it >>>>>would be great if you can push it somewhere on the web. It might be >>>>>helpful for others. And what about the applications which were happily >>>>>doing writes till now, if suddenly their operations start failing? >>> >>> Well, now we have a complete change in design here. You will >>> understand things better when we release our design doc, which we will >>> be doing soon. >>> >>> So, as you must have seen by now, we are not creating a new inode >>> as a replacement of the old one. >>> >>> We just create a dummy inode, allocate blocks into it, copy the data >>> from the source blocks and finally swap. >>> >>> Here we take a lock on the inode while making any changes to the >>> inode. Kindly refer to the algo that I provided in my previous mails. >>> >>> Case 1: Trying to open a file while relocation is going on? >>> Case 2: An open file descriptor tries to read/write? >>> >>> In both cases, as we have taken a lock on the inode, the >>> user application will queue itself.
>>> >>> Now, looking at the time for which the process will have to wait: >>> as we are not spending time physically copying data and releasing data >>> blocks and the inode, we expect this time to be quite small. >>> Vineet is working on the timing and performance stuff. Vineet, can you >>> provide some kind of time metrics for, say, a file that is 10 Gigs? >>> >>> PS: We have not made any changes to the write code path at all. >>> The lock synchronizes everything. >>> >>> Manish, does that answer your question, or am I getting it wrong somewhere? >>> >>> >>>> Thanks - >>>> Manish >>>>> >>>>> >>>>> >>>>>> Thanks - >>>>>> Manish >>>>>> >>>>>> >>>>>>> /* This also helps to simply destroy the inode, which will eventually >>>>>>> free all the blocks which otherwise we would have been freeing >>>>>>> separately */ >>>>>>> >>>>>>> 9.1 Release the mutex on the src inode. >>>>>>> >>>>>>> 10. Set the I_FREEING bit in dest_inode->i_state. >>>>>>> >>>>>>> 11. Call FS_delete_inode(dest_inode); >>>>>>> >>>>>>> Any application which has already opened this inode for read/write and >>>>>>> tries to do read/write while the mutex lock is taken will be >>>>>>> queued. >>>>>>> >>>>>>> Thanks a lot Greg, Manish, Peter and all the others for your valuable >>>>>>> inputs and help. >>>>>>> >>>>>>>> -- >>>>>>>> Regards, >>>>>>>> Peter Teoh >>>>>>> >>>>>>> -- >>>>>>> Regards, >>>>>>> Sandeep.
>> -- >> Regards, >> Peter Teoh -- Regards, Sandeep. "To learn is to change. Education is a process that changes the learner." -- To unsubscribe from this list: send an email with "unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx Please read the FAQ at http://kernelnewbies.org/FAQ