Hi Manish,

On Thu, Jan 15, 2009 at 10:31 PM, Manish Katiyar <mkatiyar@xxxxxxxxx> wrote:
> On Thu, Jan 15, 2009 at 10:27 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
>> On Thu, Jan 15, 2009 at 10:41 AM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>> On Thu, Jan 15, 2009 at 10:49 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
>>>>
>>>> I don't think the above paragraph is an issue with re-org as currently
>>>> designed. Neither for the ext4_defrag patchset that is under
>>>> consideration for acceptance, nor the work the OHSM team is doing.
>>>>
>>>
>>> well...it boils down to probability....the lower level the locks, the
>>> more complex it gets....and Nick Piggin echoed this, to quote from
>>> article:
>>>
>>> http://lwn.net/Articles/275185/: (Toward better direct I/O scalability)
>>>
>>> "There are two common approaches to take when faced with this sort of
>>> scalability problem. One is to go with more fine-grained locking,
>>> where each lock covers a smaller part of the kernel. Splitting up
>>> locks has been happening since the initial creation of the Big Kernel
>>> Lock, which is the definitive example of coarse-grained locking. There
>>> are limits to how much fine-grained locking can help, though, and the
>>> addition of more locks comes at the cost of more complexity and more
>>> opportunities to create deadlocks. "
>>>
>>>>
>>>> Especially with rotational media, the call stack at the filesystem
>>>
>>> be aware of SSD....and they are coming down very fast in terms of
>>> cost. right now....IBM is testing 4TB SSD.......discussed in a
>>> separate thread. (not really sure about properties of SSD....but I
>>> think physical contiguity of data may not matter any more, as there
>>> are no moving heads to read the data?)
>>
>> I'm very aware of SSD. I've been actively researching it for the last
>> week or so. That is why I was careful to say rotation media is slow.
>>
>> Third generation SSD is spec'ing its random i/o speed and its
>> sequential i/o speed separately.
>>
>> The first couple generations tended to only spec. sequential because
>> random was so bad they did not want to advertise it.
>>
>>>> layer is just so much faster than the drive, that blocking access to
>>>> the write queue for a few milliseconds while some block level re-org
>>>
>>> how about doing it in-memory? ie, reading the inode blocks (which can
>>> be scattered all over the place) into memory as a contiguous chunk.
>>> then allocate the inodes sequence...physically contiguously....and
>>> then write to it in sequence. so there exists COPY + PHYSICAL-REORG
>>> at the same time.....partly through memory? so while this is
>>> happening, and the source blocks got modified....then the memory for
>>> destination blocks will be updated immediately....no time delay.
>>>
>> Doing it in memory is what I think the goal should be.
>>
>> I don't think the ext4_defrag patchset accomplishes that, but maybe
>> I'm missing something.
>>
>> I think I've said it before, but I would think the best real world
>> implementation would be:
>>
>> ===
>> pre-allocate destination data blocks
>>
>> For each block
>>    prefetch source data block
>>    lock inode
>>    copy source data block to dest data block IN MEMORY ONLY and put in
>>    block queue for delivery to disk
>>    release lock
>> end
>>
>> perform_inode_level_block_pointer_swap
>> ===
>>
>> thus the lock is only held long enough to perform a memory copy of one block.
>
> No, till the destination buffer hits disk, because HSM is not copying
> the block, instead just pointing to the buffer from source inode. So
> we need to ensure that it doesn't get modified till the destination
> buffer is written. Of course this contention can be limited only till
> assignment of the pointers , if you can set some flag on the parent
> inode/page which tells that anyone who needs to modify needs to do a
> copy on write.
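To make the two positions easier to compare, here is a rough user-space
sketch of the per-block loop Greg outlines above. It is only a toy model
under stated assumptions: Inode, reorg, BLOCK_SIZE and write_queue are
hypothetical names invented for illustration, a Python list stands in for
the block queue, and nothing here is kernel or ext4_defrag code. Note that
it releases the lock right after the memory copy, so it does not address
the objection that the source must stay unmodified until the destination
buffer reaches the disk.

```python
import threading

BLOCK_SIZE = 4096  # assumed block size, for illustration only


class Inode:
    """Toy stand-in for an inode: one lock plus a list of data blocks."""

    def __init__(self, blocks):
        self.lock = threading.Lock()
        self.blocks = blocks  # list of bytearray, one per data block


def reorg(inode, write_queue):
    """Copy each block under a short-lived lock, then swap block pointers."""
    # Pre-allocate destination data blocks.
    dest = [bytearray(BLOCK_SIZE) for _ in inode.blocks]

    for i in range(len(inode.blocks)):
        # The lock is held only for one in-memory block copy.
        with inode.lock:
            dest[i][:] = inode.blocks[i]
            # Queue the destination block for later delivery to disk
            # (the list is a placeholder; no I/O happens here).
            write_queue.append((i, dest[i]))

    # Inode-level block pointer swap.
    with inode.lock:
        inode.blocks = dest
    return dest
```

A writer that modifies a source block after it has been copied but before
the final swap would lose its update in this sketch, which is exactly the
window that a copy-on-write flag on the inode/page would have to close.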
>

I totally agree with what you are saying: we will have to wait until the
data hits the disk.

> Of course this contention can be limited only till
> assignment of the pointers ,

What do you mean by this? If we take the lock for the whole re-org, we
still cannot ensure that the data hits the disk while we are holding the
lock. How can we avoid that?

If we implement COW, the write logic would have to be changed. But we
can surely try that too.

Soon I will be able to provide you with a time estimate for these copy
operations under normal conditions. Maybe that will give Greg enough
information to decide which approach would be better to go ahead with.

> Thanks -
> Manish
>
>
>>
>>
>>>> Not to be snide, but if you truly feel a design that does use inode
>>>> locking to get the job done is unacceptable, then you should post your
>>>> objections on the ext4 list.
>>>
>>> sorry.....I am just a newbie....and I enjoy discussing all these with
>>> those at my level.....for the ext4 list? well....they already know
>>> that - and I quote from the same article above:
>>>
>>> http://lwn.net/Articles/275185/
>>>
>>> "The other approach is to do away with locking altogether; this has
>>> been the preferred way of improving scalability in recent years. That
>>> is, for example, what all of the work around read-copy-update has been
>>> doing. And this is the direction Nick has chosen to improve
>>> get_user_pages()."
>>>
>>> I will discuss in the list if i can understand 80% to 90% of this
>>> article, which is still far from true :-(.
>>>
>>> Thanks.....
>>>
>>> --
>>> Regards,
>>> Peter Teoh
>>
>> Good Luck and I'm glad you're enjoying the discussion.
>>
>> Personally, I'm just very excited about the idea of a HSM in Linux
>> that will allow SSDs to be more highly leveraged in a tiered storage
>> environment. As a linux user I think that is one of the most
>> interesting things I've seen discussed in a while.
>>
>> Greg
>> --
>> Greg Freemyer
>> Litigation Triage Solutions Specialist
>> http://www.linkedin.com/in/gregfreemyer
>> First 99 Days Litigation White Paper -
>> http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
>>
>> The Norcross Group
>> The Intersection of Evidence & Technology
>> http://www.norcrossgroup.com
>>
>

--
Regards,
Sandeep.

"To learn is to change. Education is a process that changes the learner."

--
To unsubscribe from this list: send an email with
"unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ