On Fri, Jan 16, 2009 at 12:03 AM, Sandeep K Sinha <sandeepksinha@xxxxxxxxx> wrote:
> Hi Manish,
>
> On Thu, Jan 15, 2009 at 11:54 PM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote:
>> On Thu, Jan 15, 2009 at 11:25 PM, Sandeep K Sinha
>> <sandeepksinha@xxxxxxxxx> wrote:
>>> Hi Manish,
>>>
>>> On Thu, Jan 15, 2009 at 10:31 PM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote:
>>>> On Thu, Jan 15, 2009 at 10:27 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
>>>>> On Thu, Jan 15, 2009 at 10:41 AM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>>>> On Thu, Jan 15, 2009 at 10:49 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>> I don't think the above paragraph is an issue with re-org as currently
>>>>>>> designed, neither for the ext4_defrag patchset that is under
>>>>>>> consideration for acceptance, nor for the work the OHSM team is doing.
>>>>>>>
>>>>>>
>>>>>> Well... it boils down to probability. The lower-level the locks, the
>>>>>> more complex it gets, and Nick Piggin echoed this. To quote from the
>>>>>> article:
>>>>>>
>>>>>> http://lwn.net/Articles/275185/ (Toward better direct I/O scalability)
>>>>>>
>>>>>> "There are two common approaches to take when faced with this sort of
>>>>>> scalability problem. One is to go with more fine-grained locking,
>>>>>> where each lock covers a smaller part of the kernel. Splitting up
>>>>>> locks has been happening since the initial creation of the Big Kernel
>>>>>> Lock, which is the definitive example of coarse-grained locking. There
>>>>>> are limits to how much fine-grained locking can help, though, and the
>>>>>> addition of more locks comes at the cost of more complexity and more
>>>>>> opportunities to create deadlocks."
>>>>>>
>>>>>>>
>>>>>>> Especially with rotational media, the call stack at the filesystem
>>>>>>
>>>>>> Be aware of SSDs... they are coming down very fast in terms of
>>>>>> cost. Right now IBM is testing a 4TB SSD, discussed in a
>>>>>> separate thread. (I am not really sure about the properties of SSDs,
>>>>>> but I think physical contiguity of data may not matter any more, as
>>>>>> there are no moving heads to read the data?)
>>>>>
>>>>> I'm very aware of SSDs. I've been actively researching them for the last
>>>>> week or so. That is why I was careful to say rotational media is slow.
>>>>>
>>>>> Third-generation SSDs spec their random I/O speed and their
>>>>> sequential I/O speed separately.
>>>>>
>>>>> The first couple of generations tended to spec only sequential, because
>>>>> random was so bad they did not want to advertise it.
>>>>>
>>>>>>> layer is just so much faster than the drive, that blocking access to
>>>>>>> the write queue for a few milliseconds while some block-level re-org
>>>>>>
>>>>>> How about doing it in memory? I.e., reading the inode's blocks (which
>>>>>> can be scattered all over the place) into memory as a contiguous chunk,
>>>>>> then allocating the destination blocks in sequence, physically
>>>>>> contiguously, and then writing to them in sequence. So there exists
>>>>>> COPY + PHYSICAL-REORG at the same time, partly through memory. And if,
>>>>>> while this is happening, the source blocks get modified, then the
>>>>>> memory for the destination blocks will be updated immediately, with no
>>>>>> time delay.
>>>>>>
>>>>> Doing it in memory is what I think the goal should be.
>>>>>
>>>>> I don't think the ext4_defrag patchset accomplishes that, but maybe
>>>>> I'm missing something.
>>>>>
>>>>> I think I've said it before, but I would think the best real-world
>>>>> implementation would be:
>>>>>
>>>>> ===
>>>>> pre-allocate destination data blocks
>>>>>
>>>>> For each block
>>>>>     prefetch source data block
>>>>>     lock inode
>>>>>     copy source data block to dest data block IN MEMORY ONLY and put in
>>>>>         block queue for delivery to disk
>>>>>     release lock
>>>>> end
>>>>>
>>>>> perform_inode_level_block_pointer_swap
>>>>> ===
>>>>>
>>>>> Thus the lock is only held long enough to perform a memory copy of one
>>>>> block.
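To make the shape of that loop concrete, here is a minimal user-space
sketch of the same idea. It is only a toy model: the struct, helper names,
block count and block size are invented for illustration, and a real
implementation would go through the kernel's inode lock and block I/O
paths rather than pthreads and memcpy(). It is not the ext4_defrag or
OHSM code.

===
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define NBLOCKS    8

struct toy_inode {                       /* invented for this sketch        */
    pthread_mutex_t lock;                /* stand-in for the inode lock     */
    unsigned char  *blocks[NBLOCKS];     /* stand-in for the block pointers */
};

/* Relocate one file: the lock is held only for the memcpy of one block. */
static void relocate(struct toy_inode *ino, unsigned char *dest[NBLOCKS])
{
    for (int i = 0; i < NBLOCKS; i++) {
        /* "prefetch source data block" would happen here (no-op in the toy) */
        pthread_mutex_lock(&ino->lock);
        memcpy(dest[i], ino->blocks[i], BLOCK_SIZE); /* copy in memory only  */
        /* the real code would now queue dest[i] for write-out to disk      */
        pthread_mutex_unlock(&ino->lock);
    }

    /* "perform_inode_level_block_pointer_swap": switch the inode over to
     * the pre-allocated destination blocks in one short critical section. */
    pthread_mutex_lock(&ino->lock);
    for (int i = 0; i < NBLOCKS; i++)
        ino->blocks[i] = dest[i];
    pthread_mutex_unlock(&ino->lock);
}

int main(void)
{
    struct toy_inode ino = { .lock = PTHREAD_MUTEX_INITIALIZER };
    unsigned char *dest[NBLOCKS];

    for (int i = 0; i < NBLOCKS; i++) {
        ino.blocks[i] = calloc(1, BLOCK_SIZE); /* the file's "source" blocks */
        dest[i] = malloc(BLOCK_SIZE);          /* pre-allocated destination  */
    }

    relocate(&ino, dest);
    printf("relocated %d blocks\n", NBLOCKS);
    return 0;
}
===

The point of the sketch is the same one Greg makes above: each per-block
critical section covers only a memory copy, and the block-pointer swap is a
second, short critical section at the end.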
>>>> No, till the destination buffer hits the disk, because HSM is not copying
>>>> the block, instead just pointing to the buffer from the source inode. So
>>>> we need to ensure that it doesn't get modified till the destination
>>>> buffer is written. Of course this contention can be limited only till
>>>> the assignment of the pointers, if you can set some flag on the parent
>>>> inode/page which tells anyone who needs to modify it to do a
>>>> copy-on-write.
>>>>
>>>
>>> I totally agree with what you are saying, that we will have to wait till
>>> the data hits the disk.
>>>
>>>> Of course this contention can be limited only till
>>>> the assignment of the pointers,
>>>
>>> What do you mean by this?
>>
>> What I meant was that if you have some implementation of COW, you only
>> need to hold the lock till you read the source inode and assign the
>> pointers.
>>
> Do we really need it?
>
>> But after a bit more thinking, I feel that COW is not going to
>> work. BTW, I hope you also plan to invalidate the caches and
>> repopulate them, right? What about clients who do caching on their
>> side? (I am not sure if this really makes sense, because they would be
>> dealing with file offsets and not real file block numbers.)
>>
> That's right. They will work on file offsets and not block numbers.
>
>> Also, I think I missed this point: how do new writes (hole fills
>> etc.) to a relocated file go to the tier-2 disk? Or should they go there
>> at all? What kind of policy ensures this?
>>
>
> For holes I have a reference mail:
> From: Chris Snook <csnook@...>
> To: Lars Noschinski <lklml@...>
> Cc: <linux-kernel@...>
> Subject: Re: How does ext2 implement sparse files?
> Date: Thursday, January 31, 2008 - 1:18 pm
>
> In ext2 (and most other block filesystems) all files are sparse files.
> If you write to an address in the file for which no block is allocated,
> the filesystem allocates a block and writes the contents to disk,

Correct, so my question is: after relocation of a file to tier-2, what
policy/method/code guarantees that the new blocks allocated will be from
the tier-2 disk, or should they be allocated from tier-2 at all?
Basically what I am asking is the HSM's action under the following
sequence of events, assuming you have a hole of 4K at offset 4K, so
i_data[1] = 0:
a) Application reads the second block and finds a hole.
b) HSM reads, finds a hole and skips it.
c) Application fills the block.  <<== On which disk does this go?

> regardless of whether that block is at the end of the file (the usual
> case of lengthening a non-sparse file), in the middle of the file
> (filling in holes in a sparse file), or past the end of the file
> (making a file sparse).
>
> -- Chris
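For the hole case above (i_data[1] == 0 for a 4 KB hole at offset 4 KB),
one way to see it from user space is the long-standing FIBMAP ioctl, which
reports the physical block number for a logical block and 0 for an
unallocated block. A small probe, assuming a block-based filesystem such
as ext2/ext3 and root privileges (the file name and block count passed on
the command line are just placeholders):

===
#include <fcntl.h>
#include <linux/fs.h>       /* FIBMAP */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <nblocks>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    int nblocks = atoi(argv[2]);
    for (int i = 0; i < nblocks; i++) {
        int blk = i;                       /* in: logical block number      */
        if (ioctl(fd, FIBMAP, &blk) < 0) { /* out: physical block, 0 = hole */
            perror("FIBMAP");
            break;
        }
        printf("logical block %d -> physical block %d%s\n",
               i, blk, blk == 0 ? "  (hole)" : "");
    }
    close(fd);
    return 0;
}
===

Once the application later fills the hole, the filesystem's block allocator
picks the new block, and that is exactly where the tier-1 versus tier-2
policy question lands.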
>> Also, what are you doing with buffers which are already dirty? Do you
>
> Even if they are dirty, we copy the pointers, and we will point to the
> same actual data, which will be dirty.
> And after the swap we mark the new data buffer as dirty anyway.
>
>> wait for them to hit the disk and then reallocate, or move them
>> directly to the new disk? (I think if you just check the page uptodate
>> flag and see that it is not locked, we can directly write it to the new
>> destination, but I am not very sure of all the concurrent access
>> issues here.)
>>
> IMHO, this is not required. Remember that if you take a lock on the inode
> itself, the user application can only queue jobs. They can't access the
> data for the locking period.

Correct, but there might be pages which are already in flight to disk, and
you need to check for them. (pdflush ???)

Thanks -
Manish

>
>> More questions in queue :-)
>>
>
> Sure.... :)
>
>> Thanks -
>> Manish
>>
>>>
>>> If we take a lock for the whole re-org, then we still cannot ensure that
>>> the data hits the disk while we hold the lock. How can we avoid that?
>>> If we implement COW, the write logic would need to be changed.
>>> But we can surely try that too.
>>>
>>> Soon I will be able to provide you the time estimate of these copy
>>> operations under normal conditions. Maybe that will provide Greg some
>>> information to take a call on what exactly would be better to go ahead
>>> with.
>>>
>>>> Thanks -
>>>> Manish
>>>>
>>>>>
>>>>>>> Not to be snide, but if you truly feel a design that does use inode
>>>>>>> locking to get the job done is unacceptable, then you should post your
>>>>>>> objections on the ext4 list.
>>>>>>
>>>>>> Sorry, I am just a newbie, and I enjoy discussing all this with
>>>>>> those at my level. As for the ext4 list, well, they already know
>>>>>> that - and I quote from the same article above:
>>>>>>
>>>>>> http://lwn.net/Articles/275185/
>>>>>>
>>>>>> "The other approach is to do away with locking altogether; this has
>>>>>> been the preferred way of improving scalability in recent years. That
>>>>>> is, for example, what all of the work around read-copy-update has been
>>>>>> doing. And this is the direction Nick has chosen to improve
>>>>>> get_user_pages()."
>>>>>>
>>>>>> I will discuss it on the list if I can understand 80% to 90% of this
>>>>>> article, which is still far from true :-(.
>>>>>>
>>>>>> Thanks.....
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Peter Teoh
>>>>>
>>>>> Good luck, and I'm glad you're enjoying the discussion.
>>>>>
>>>>> Personally, I'm just very excited about the idea of an HSM in Linux
>>>>> that will allow SSDs to be more highly leveraged in a tiered storage
>>>>> environment. As a Linux user I think that is one of the most
>>>>> interesting things I've seen discussed in a while.
>>>>>
>>>>> Greg
>>>>> --
>>>>> Greg Freemyer
>>>>> Litigation Triage Solutions Specialist
>>>>> http://www.linkedin.com/in/gregfreemyer
>>>>> First 99 Days Litigation White Paper -
>>>>> http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
>>>>>
>>>>> The Norcross Group
>>>>> The Intersection of Evidence & Technology
>>>>> http://www.norcrossgroup.com
>>>>
>>>
>>> --
>>> Regards,
>>> Sandeep.
>>>
>>> "To learn is to change. Education is a process that changes the learner."
>>
>
> --
> Regards,
> Sandeep.
>
> "To learn is to change. Education is a process that changes the learner."
>
--
To unsubscribe from this list: send an email with
"unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ