Re: Copying Data Blocks

Hi Peter,

On Wed, Jan 14, 2009 at 10:32 AM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
> thinking deeper, i have a concern for you.
>
> based on the VFS layering concept, u have no problem if the internal
> of ext3 is not touch.   but now u seemed to be doing a lot of stuff at
> the ext2/ext3 layer.
>

Yes, that's true, but let me try to justify it.
The point is that we have been fighting with the way ext2/ext3 is
implemented. The major issue is that we are working on a system whose
objects are files/inodes; that is, we operate on inodes inside the
kernel. For that we need to work closely with the FS, be it ext2, ext3
or ext4.

Just for your information, the current design only exports some API
from the FS. We don't modify the FS code except in one place: when we
allocate new data blocks for an inode, we restrict the allocation to
certain block group (BG) ranges, depending on our tiers.
If you look at the ext2/ext3 code, it has a goal group, which decides
which group would be the best one to allocate the blocks from; failing
that, it tries the other block groups in the FS.
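
To make the idea above concrete, here is a rough userspace sketch in C.
It is only a toy model under my own assumptions, not the actual
ext2/ext3 or OHSM code: the names fs_model, bg_range and
pick_group_in_range are hypothetical, and the free-block bookkeeping is
reduced to a plain array. It starts at the goal group, clamps it into
the range allowed for the tier, and then tries the remaining groups of
that range only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model: one free-block counter per block group. */
struct fs_model {
        unsigned int ngroups;
        unsigned int *free_blocks;      /* free_blocks[g] = free blocks in group g */
};

/* Hypothetical: inclusive block group range that a tier may use. */
struct bg_range {
        unsigned int first;
        unsigned int last;
};

/*
 * Pick a block group for an allocation. The search starts at the goal
 * group (moved to the start of the range if it lies outside it) and
 * wraps around inside the range only, never touching other groups.
 * Returns the group number, or -1 if no group in the range has space.
 */
int pick_group_in_range(const struct fs_model *fs, struct bg_range r,
                        unsigned int goal)
{
        unsigned int span, i;

        assert(fs != NULL && fs->free_blocks != NULL);

        if (r.first > r.last || r.last >= fs->ngroups)
                return -1;
        if (goal < r.first || goal > r.last)
                goal = r.first;

        span = r.last - r.first + 1;
        for (i = 0; i < span; i++) {
                unsigned int g = r.first + (goal - r.first + i) % span;

                if (fs->free_blocks[g] > 0)
                        return (int)g;
        }
        return -1;
}
```

For example, with free counts {5, 0, 0, 3, 0, 7, 0, 2} and an allowed
range of groups 1 to 4, a goal of 2 ends up in group 3, even though
groups 0 and 5 have more free space.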


We will be working to make this independent of the FS, but our
iterative design model failed to do so in the first two iterations,
simply because ext2 was never designed with such tools in mind and
hence does not export any API.

We will be working on our own FS down the line; maybe it can be a
mixture of features from ext4+zfs+vxfs. I know I am talking big, but I
am making efforts towards that as well. I have my fingers crossed.

> i was reading about the reservation concept in ext2/ext3 and found it
> quite complex.   blocks can be preallocated and put on the reservation
> list, but at the same time, it can and should be possible to be taken
> away by another file, if the file needed that block for use.   ie,
> reservation does not guarantee u ownership of that block, but based on
> "polite courtesy" (as i read somewhere), other parts of ext2/ext3 is
> supposed to avoid using that blocks.   but if storage space are really
> low....well...and since that block is not being used...it should be
> reassigned to someone.

This is very true, to the best of my knowledge.

> ie.....when u allocate blocks and use it....do u actually update any
> reservation list?   or is necessary to do so?   or are u supposed to

The system does that for me.
We just fool the FS block allocator into believing that the file
system starts at block group X and ends at block group Y, which is
effectively the extent of the tier on which the file resides. A tier
may have more than one disk and may not be contiguous. For example,
say I create an LVM volume /dev/vg/lvol0 from hda, hdb and hdc, and LVM
arranges the mapping linearly as hda|hdb|hdc. Then I define my tier 1
as hda and hdc and tier 2 as hdb. Here the BG ranges for tier 1 would
be, say, 1 to 100 and 201 to 300, whereas 101 to 200 would be for
tier 2. In such a case, if the first block allocation request fails,
we try the other BG ranges of the tier. If all of them fail, there is
no space left on the tier.

So we don't need to do any reservation window update; it's all done by
the FS code itself.
It is not required on our side at all.
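
A rough sketch of the per-tier fallback, again as a userspace toy model
in C and not the real OHSM code. The types bg_range and tier, the
helpers alloc_in_range and tier_alloc_block, and the group numbers are
all assumptions made for illustration (tier 1 = groups 1 to 100 and 201
to 300, tier 2 = groups 101 to 200). Each range of the tier is tried in
order, and -ENOSPC is returned only when every range is full.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NGROUPS 301     /* groups 0..300; group 0 is unused in this toy layout */

/* Hypothetical: inclusive block group range. */
struct bg_range {
        unsigned int first;
        unsigned int last;
};

/* Hypothetical: a tier is a list of block group ranges. */
struct tier {
        const struct bg_range *ranges;
        size_t nranges;
};

/*
 * Take one block from the first group inside r that has free space.
 * Returns the group number, or -1 if the whole range is full.
 */
int alloc_in_range(unsigned int *free_blocks, struct bg_range r)
{
        unsigned int g;

        for (g = r.first; g <= r.last && g < NGROUPS; g++) {
                if (free_blocks[g] > 0) {
                        free_blocks[g]--;
                        return (int)g;
                }
        }
        return -1;
}

/*
 * Try every range of the tier in order. Returns the group the block
 * came from, or -ENOSPC when there is no space left on the tier.
 */
int tier_alloc_block(unsigned int *free_blocks, const struct tier *t)
{
        size_t i;

        assert(free_blocks != NULL && t != NULL);

        for (i = 0; i < t->nranges; i++) {
                int g = alloc_in_range(free_blocks, t->ranges[i]);

                if (g >= 0)
                        return g;
        }
        return -ENOSPC;
}
```

Note that a full tier 1 reports -ENOSPC even if tier 2 still has free
blocks; the allocator never spills over into another tier's ranges.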

> read the reservation list before allocation of blocks?   i am not
> sure.   all these are protocols obeyed internally within ext3.   and

Yes, that's right. It is the same with ext2 as well. But that is not
much of a concern for us, as we have not made any changes to the
existing ext3 code and have used its functionality very cleanly.

> since block allocation is not part of ext3 but at the blocks level,
> the API will not care about the existence of any reservation lists
> which is part of ext3.

That's true, but remember that a lot happens before control reaches
the block layer. And yes, I hope you are talking about the FS blocks
and not the block device level. They are different and we should be
bothered about block dev level.
>
> In general, if your software is not going into mainline kernel, my
> personal preference is NOT to do it at the ext3 layer....but higher
> than that.......noticed that fs/ext2 fs/ext3 and fs/ext4 all does not
> have any EXPORT API for other subsystem to call?   well....this is for
> internal consistency as described above.
>

The point here is that I require APIs from the FS, and before I can
make it to mainline I will need to get those APIs exported from the
file system, which is quite a difficult job. Once we are code complete
and testing is done, I will try to get those APIs exported from
ext2/ext3.


> but ext3 used jbd, so fs/jbd does have exported API.   so anyone can
> call these exported API without messing up the internal consistency of
> jbd.
>

Not sure, will have to check this :)
> end of the day, i may be plain wrong :-).
>

No, you are right and thanks a lot for the insight.
> comments?
>

Inline.

We will be publishing a document of OHSM allocation FAQs. It should
give you more details from the user's perspective.

There will also be an allocation functional spec doc, which you can
refer to for the design and API-related issues.


Btw, Thanks Again.
> On Tue, Jan 13, 2009 at 6:41 PM, Sandeep K Sinha
> <sandeepksinha@xxxxxxxxx> wrote:
>> Hey Manish,
>>
>> On Tue, Jan 13, 2009 at 2:00 PM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote:
>>> On Tue, Jan 13, 2009 at 1:21 PM, Sandeep K Sinha
>>> <sandeepksinha@xxxxxxxxx> wrote:
>>>> Hi Manish,
>>>>
>>>>
>>>> On Mon, Jan 12, 2009 at 11:48 PM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote:
>>>>> On Mon, Jan 12, 2009 at 11:31 PM, Sandeep K Sinha
>>>>> <sandeepksinha@xxxxxxxxx> wrote:
>>>>>> Hi Peter,
>>>>>>
>>>>>> On Mon, Jan 12, 2009 at 9:49 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>>>>> On Mon, Jan 12, 2009 at 4:26 PM, Sandeep K Sinha
>>>>>>> <sandeepksinha@xxxxxxxxx> wrote:
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> Don't you think that it will restrict this to a specific file system?
>>>>>>>> VFS inode should be used rather than the FS incore inode ?
>>>>>>>>
>>>>>>>
>>>>>>> vfs has an API: fsync_buffers_list() and
>>>>>>> invalidate_inode_buffers(), and these APIs seem to use a spinlock for
>>>>>>> syncing:
>>>>>>>
>>>>>>> void invalidate_inode_buffers(struct inode *inode)
>>>>>>> {
>>>>>>>        if (inode_has_buffers(inode)) {
>>>>>>>                struct address_space *mapping = &inode->i_data;
>>>>>>>                struct list_head *list = &mapping->private_list;
>>>>>>>                struct address_space *buffer_mapping = mapping->assoc_mapping;
>>>>>>>
>>>>>>>                spin_lock(&buffer_mapping->private_lock);
>>>>>>>                /* ======> modify the call below to write out the data instead */
>>>>>>>                while (!list_empty(list))
>>>>>>>                        __remove_assoc_queue(BH_ENTRY(list->next));
>>>>>>>                spin_unlock(&buffer_mapping->private_lock);
>>>>>>>        }
>>>>>>> }
>>>>>>> EXPORT_SYMBOL(invalidate_inode_buffers);
>>>>>>>
>>>>>>>
>>>>>>>> The purpose is to put all the I/Os to sleep while we are updating the
>>>>>>>> i_data from the new inode to the old inode (updating the data blocks).
>>>>>>>>
>>>>>>>> I think i_alloc_sem should work here, but could not find any instance
>>>>>>>> of its use in the code.
>>>>>>>
>>>>>>> for the case of ext3's block allocation, the lock seemed to be
>>>>>>> truncate_mutex - read the remark:
>>>>>>>
>>>>>>>        /*
>>>>>>>         * From here we block out all ext3_get_block() callers who want to
>>>>>>>         * modify the block allocation tree.
>>>>>>>         */
>>>>>>>        mutex_lock(&ei->truncate_mutex);
>>>>>>>
>>>>>>> So while it is building the tree, the mutex will lock it.
>>>>>>>
>>>>>>> And the remarks for ext3_get_blocks_handle() are:
>>>>>>>
>>>>>>> /*
>>>>>>>  * Allocation strategy is simple: if we have to allocate something, we will
>>>>>>>  * have to go the whole way to leaf. So let's do it before attaching anything
>>>>>>>  * to tree, set linkage between the newborn blocks, write them if sync is
>>>>>>>  * required, recheck the path, free and repeat if check fails, otherwise
>>>>>>>  * set the last missing link (that will protect us from any truncate-generated
>>>>>>> ...
>>>>>>>
>>>>>>> reading the source....go down and see the mutex_lock() (where
>>>>>>> multiblock allocation are needed) and after the lock, all the blocks
>>>>>>> allocation/merging etc are done:
>>>>>>>
>>>>>>>        /* Next simple case - plain lookup or failed read of indirect block */
>>>>>>>        if (!create || err == -EIO)
>>>>>>>                goto cleanup;
>>>>>>>
>>>>>>>        mutex_lock(&ei->truncate_mutex);
>>>>>>> <snip>
>>>>>>>        count = ext3_blks_to_allocate(partial, indirect_blks,
>>>>>>>                                        maxblocks, blocks_to_boundary);
>>>>>>> <snip>
>>>>>>>        err = ext3_alloc_branch(handle, inode, indirect_blks, &count, goal,
>>>>>>>                                offsets + (partial - chain), partial);
>>>>>>>
>>>>>>>
>>>>>>>> It's working fine currently with i_mutex, meaning if we hold a i_mutex
>>>>>>>
>>>>>>> as far as i know, i_mutex are used for modifying inode's structures information:
>>>>>>>
>>>>>>> grep for i_mutex in fs/ext3/ioctl.c and everytime there is a need to
>>>>>>> maintain inode's structural info, the lock on i_mutex is called.
>>>>>>>
>>>>>>>> lock on the inode while updating the i_data pointers.
>>>>>>>> And try to perform i/o from user space, they are queued. The file was
>>>>>>>> opened in r/w mode prior to taking the lock inside the kernel.
>>>>>>>>
>>>>>>>> But, I still feel i_alloc_sem would be the right option to go ahead with.
>>>>>>>>
>>>>>>>> On Mon, Jan 12, 2009 at 1:11 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>>>>>>> If u grep for spinlock, mutex, or "sem" in the fs/ext4 directory, u
>>>>>>>>> can find all three types of lock are used - for different class of
>>>>>>>>> object.
>>>>>>>>>
>>>>>>>>> For data blocks I guessed is semaphore - read this
>>>>>>>>> fs/ext4/inode.c:ext4_get_branch():
>>>>>>>>>
>>>>>>>>> /**
>>>>>>>>>  *      ext4_get_branch - read the chain of indirect blocks leading to data
>>>>>>>>> <snip>
>>>>>>>>>  *
>>>>>>>>>  *      Need to be called with
>>>>>>>>>  *      down_read(&EXT4_I(inode)->i_data_sem)
>>>>>>>>>  */
>>>>>>>>>
>>>>>>>>> i guess u have no choice, as it is semaphore, have to follow the rest
>>>>>>>>> of kernel for consistency - don't create your own semaphore :-).
>>>>>>>>>
>>>>>>>>> There exists i_lock as spinlock - which so far i know is for i_blocks
>>>>>>>>> counting purposes:
>>>>>>>>>
>>>>>>>>>       spin_lock(&inode->i_lock);
>>>>>>>>>        inode->i_blocks += tmp_inode->i_blocks;
>>>>>>>>>        spin_unlock(&inode->i_lock);
>>>>>>>>>        up_write(&EXT4_I(inode)->i_data_sem);
>>>>>>>>>
>>>>>>>>> But for data it should be i_data_sem.   Is that correct?
>>>>>>>>>
>>>>>>>>> On Mon, Jan 12, 2009 at 2:18 PM, Rohit Sharma <imreckless@xxxxxxxxx> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am having some issues in locking inode while copying data blocks.
>>>>>>>>>> We are trying to keep file system live during this operation, so
>>>>>>>>>> both read and write operations should work.
>>>>>>>>>> In this case what type of lock on inode should be used, semaphore,
>>>>>>>>>> mutex or spinlock?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Jan 11, 2009 at 8:45 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>>>>>>>>> Sorry.....some mistakes...a resent:
>>>>>>>>>>>
>>>>>>>>>>> Here are some tips on the blockdevice API:
>>>>>>>>>>>
>>>>>>>>>>> http://lkml.org/lkml/2006/1/24/287
>>>>>>>>>>> http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-01/msg09388.html
>>>>>>>>>>>
>>>>>>>>>>> as indicated, documentation is rather sparse in this area.
>>>>>>>>>>>
>>>>>>>>>>> not sure if anyone else have a summary list of blockdevice API and its
>>>>>>>>>>> explanation?
>>>>>>>>>>>
>>>>>>>>>>> now wrt the following "cleanup patch", i am not sure how the API will change:
>>>>>>>>>>>
>>>>>>>>>>> http://lwn.net/Articles/304485/
>>>>>>>>>>>
>>>>>>>>>>> thanks.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 6, 2009 at 6:36 PM, Rohit Sharma <imreckless@xxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I want to read data blocks from one inode
>>>>>>>>>>>> and copy it to other inode.
>>>>>>>>>>>>
>>>>>>>>>>>> I mean to copy data from data blocks associated with one inode
>>>>>>>>>>>> to the data blocks associated with other inode.
>>>>>>>>>>>>
>>>>>>>>>>>> Is that possible in kernel space.?
>>>>>>>>>>>> --
>>>>>>>>>
>>>>>>>
>>>>>>> comments ????
>>>>>>
>>>>>> Thats very right !!!
>>>>>>
>>>>>> So, finally we were able to perform the copy operation successfully.
>>>>>>
>>>>>> We did something like this and we named it "ohsm's tricky copy".
>>>>>> Rohit will soon be uploading a new doc on the fscops page which
>>>>>> will detail it further.
>>>>>
>>>>> Thanks let us know when the docs and the *source code* is available ;-)
>>>>>
>>>>>>
>>>>>> 1. Read the source inode.
>>>>>> 2. Allocate a new ghost inode.
>>>>>> 3. Take a lock on the source inode. /* mutex , because the nr_blocks
>>>>>> can change if write comes now from user space */
>>>>>> 4. Read the number of blocks.
>>>>>>>
>>>>>> 5. Allocate the same number of blocks for the dummy ghost inode. /*
>>>>>> the chain will be created automatically */
>>>>>> 6. Read the source buffer head of the blocks from source inode and
>>>>>> destination buffer head of the blocks of the destination inode.
>>>>>>
>>>>>> 7. dest_buffer->b_data = source_buffer->b_data ; /* its a char * and
>>>>>> this is where the trick is */
>>>>>> 8. mark the destination buffer dirty.
>>>>>>
>>>>>> perform 6,7,8 for all the blocks.
>>>>>>
>>>>>> 9. swap the src_inode->i_data[15] and dest_dummy_inode->i_data[15]; /*
>>>>>> This helps us to simply avoid copying the block number back from
>>>>>> destination dummy inode to source inode */
>>>>>
>>>>> I don't know anything about LVM, so this might be a dumb question. Why
>>>>> is this required ?
>>>>>
>>>> See, the point here is that we will have a single namespace, meaning a
>>>> single file system over all this underlying storage. LVM is a tool
>>>> which provides you an API to create logical devices over physical ones.
>>>> It uses the Device Mapper inside the kernel. The device mapper keeps
>>>> the mapping between the logical device and all the underlying
>>>> physical devices.
>>>>
>>>> Now, we require this for defining our storage classes( tiers).  At the
>>>> time of defining the allocation and relocation policy itself , we
>>>> accept information about (dev, tier) list.
>>>> We pass this information to our OHSM module inside the kernel,
>>>> extract the mapping from the device mapper and keep it in the OHSM
>>>> metadata, which is later referred to for all allocation and
>>>> relocation processes.
>>>>
>>>>> Did you mean swapping all the block numbers rather than just the [15] ??
>>>> See the point here is that if we copy each and every new block number
>>>> to the old inode
>>>
>>> Ohh yes.... I got thoroughly confused by your word "swap". I thought
>>> it is just like your "tricky swap" :-) and not the "copy and swap"
>>> which you actually meant.
>>>
>>>> and try to free each block from the old inode then we
>>>> will have the overhead of freeing each and every old block and at the
>>>> end freeing the dummy inode that would be created.
>>>>
>>>> So, what we did was to swap the i_data[15] arrays of the two inodes.
>>>> On Linux (ext2/ext3) the arrangement is something like this.
>>>>
>>>> i_data[0] -> direct pointer to a data block
>>>> i_data[1] -> direct pointer to a data block
>>>> i_data[2] -> direct pointer to a data block
>>>> i_data[3] -> direct pointer to a data block
>>>> i_data[4] -> direct pointer to a data block
>>>> i_data[5] -> direct pointer to a data block
>>>> i_data[6] -> direct pointer to a data block
>>>> i_data[7] -> direct pointer to a data block
>>>> i_data[8] -> direct pointer to a data block
>>>> i_data[9] -> direct pointer to a data block
>>>> i_data[10] -> direct pointer to a data block
>>>> i_data[11] -> direct pointer to a data block
>>>> i_data[12] -> single indirect block
>>>> i_data[13] -> double indirect block
>>>> i_data[14] -> triple indirect block
>>>>
>>>>
>>>> We just swap all of these pointers between the two inodes. It works
>>>> because for direct blocks it is pretty trivial, and for indirect
>>>> blocks it is just like swapping the roots of the chains of blocks,
>>>> which eventually changes everything beneath them.
>>>
>>> Of course........another thing which you might want is to just copy
>>> the block numbers which fall in range of your inode->i_size. This
>>> might help in case of corruptions.
>>>
>>>>
>>>> Now, we simply free the dummy inode with a standard FS function, which
>>>> performs the cleanup for the inode and also for the blocks that we
>>>> wanted to free.
>>>> It reduces our work and obviously the cleanup code existing in FS
>>>> would be more trustworthy :P
>>>>
>>>>> Here src_inode is the
>>>>> vfs "struct inode" or the
>>>>> FS specific struct FS_inode_info ???  i didn't get this completely,
>>>>> can you explain this point a bit more.
>>>>>
>>>>
>>>> See, what we do is that we take a lock at the VFS inode and then we
>>>> perform the job of moving the data blocks from FS incore inode (
>>>> FS_inode_info) .
>>>>
>>>> So, this will be the incore inode.
>>>> Also, the VFS inode doesn't have pointers to the data blocks. The data
>>>> block pointers (i_data) are present in the incore and on-disk inode
>>>> structures.
>>>>
>>>> Hope this answers your query. Let me know if you have more.
>>>
>>> Unless I missed , I didn't get answers to my earlier question about
>>> *special writing* the inode and maintaining consistency.
>>>
>>
>> Here is the question that you asked....
>>
>>>>btw how are you going to *special write* the inode ? If i remember
>>>>correctly you said that you will make the filesystem as readonly. I
>>>>don't know at what all places in write stack we assert for readonly
>>>>flag on FS. One of the places IIRC is do_open() when you try opening
>>>>the file for first time it checks for permission. How do you plan to
>>>>deal with already open file descriptors which are in write mode. If
>>>>you have already investigated all the paths for MS_RDONLY flag, it
>>>>would be great if you can push it somewhere on web. It might be
>>>>helpful for others. And what about the applications who were happily
>>>>doing writes till now , if suddenly their operations start failing ?
>>
>> Well, now we have a complete change in design here. You will
>> understand things better when we release our design doc, which we will
>> be doing soon.
>>
>> So, as you must have seen by now that we are not creating a new inode
>> as a replacement of the old one.
>>
>> We just create a dummy inode, allocate blocks into it, copy the data
>> from source blocks and finally swap.
>>
>> Here we take a lock on the inode while making any changes to the
>> inode. Kindly refer the algo that I provided in my previous mails.
>>
>> Case 1: Trying to open a file while relocation is going on ?
>> Case 2: open file descriptor tries to read/write ?
>>
>> In both cases, as we have taken a lock on the inode, the user
>> application will queue itself.
>>
>> Now, looking at the time for which the process will have to wait:
>> as we are not spending time in physically copying data and releasing
>> data blocks and the inode, we expect this time to be quite short.
>> Vineet is working on the timing and performance stuff. Vineet, can you
>> provide some kind of time metrics for, say, a file of 10 GB?
>>
>> PS: We have not made any changes to the write code path at all.
>> The lock synchronizes everything.
>>
>> Manish does that answer your question or I am getting it wrong somewhere ?
>>
>>
>>> Thanks -
>>> Manish
>>>>
>>>>
>>>>
>>>>> Thanks -
>>>>> Manish
>>>>>
>>>>>
>>>>>> /* This also helps to simply destroy the inode, which will eventually
>>>>>> free all the blocks, which otherwise we would have been doing
>>>>>> separately */
>>>>>>
>>>>>> 9.1 Release the mutex on the src inode.
>>>>>>
>>>>>> 10. set the bit for I_FREEING in dest_inode->i_state.
>>>>>>
>>>>>> 11. call FS_delete_inode(dest_inode);
>>>>>>
>>>>>>  Any application which has already opened this inode for read/write
>>>>>> and tries to do a read/write while the mutex is held will be
>>>>>> queued.
>>>>>>
>>>>>>>
>>>>>>
>>>>>> Thanks a lot Greg,Manish, Peter and all others for all your valuable
>>>>>> inputs and help.
>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Peter Teoh
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Sandeep.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> "To learn is to change. Education is a process that changes the learner."
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send an email with
>>>>>> "unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
>>>>>> Please read the FAQ at http://kernelnewbies.org/FAQ
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Sandeep.
>>>>
>>>
>>
>>
>>
>> --
>> Regards,
>> Sandeep.
>>
>
>
>
> --
> Regards,
> Peter Teoh
>



-- 
Regards,
Sandeep.





 	
"To learn is to change. Education is a process that changes the learner."

--
To unsubscribe from this list: send an email with
"unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ

