Hi,

On Wed, Jan 14, 2009 at 10:43 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
> On Wed, Jan 14, 2009 at 10:42 AM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>> Well, well, well.
>>> Firstly, the source inode remains intact; only the pointers in the
>>> source inode are updated.
>>> And we don't freeze the whole FS, we just take a lock on the inode.
>>> So, you can operate on all other inodes. We intend to reduce the
>>> granularity of locking from per inode to per block sometime in the
>>> near future for sure.
>>>
>> inode level or block level locking is not good for performance, and I
>> suspect it is also highly susceptible to deadlock scenarios. Normally
>> it should be point in time... like that of Oracle...
>
> Peter,
>
> I can see that inode locking for the entire duration of a file reorg
> would be unacceptable at some later point in the development / release
> cycle.

That's exactly what we are intending. We are planning to get down to the
block level in our later milestones. I have mentioned that earlier as
well.

> But why would block level be bad? I would envision it something like:
>
> ===
> preallocate all blocks in destination file
>
> for each block in file
>     prefetch block into cache
>     lock updates to all blocks for the given inode
>     copy data from pre-fetched source block to pre-allocated dest block
>     release lock
> ===
>
> Thus the lock is only held for the duration of the data copy, which is
> a RAM-based activity.

The only problem that we saw here was a change in the length of a file
which is already open. So, in order to avoid that, in the first
implementation we are taking a lock for the complete re-org. We will
have to avoid this for sure.

> I'm curious how long ext4 holds a lock during its defrag
> implementation?

Not very sure, but I will look into it and update you. It could be of
great use to us.

>> first read this email:
>>
>> https://kerneltrap.org/mailarchive/linux-fsdevel/2008/4/21/1523184/thread
>>
>> (I elaborated on filesystem level integrity)
>
> I read it but don't see the relevancy.
>
> As I understand what Sandeep and team have done, there is not an
> integrity issue that I see. I would like to see the actual code, but
> conceptually it sounds fine.
>
> And I think I saw a message that said the write path is unmodified.

Yes, that's true.

> That can be true with a single lock around the entire re-org process.
> Once you move to a block level lock, the write path will have to be
> modified. See my much earlier pseudo code email. My pseudo code does
> not address writes that extend the file length. Obviously that would
> have to be handled as well.

I will tell you frankly: after our analysis and team discussions, the
algorithm given by Greg will work perfectly fine. But what about an
increase in the file length? The problem is that we need to allocate the
required number of blocks before actually copying the data, and while
copying these blocks we need a lock on the inode.

My point is that OHSM works on files as objects, and keeping that in
mind, is it possible to completely move everything to the block level?
Kindly suggest. I think if our granularity had been a "block", then
moving everything to the block level would have been the best option. I
have heard of systems where the whole concept is implemented at the
disk level too.
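By the way, so that we are sure we both read your loop the same way,
here is how I would sketch it in C. This is illustration only, not OHSM
code: every ohsm_* helper is a hypothetical stand-in for the
fs-specific allocation, locking, and page-cache primitives.

===
/*
 * Sketch only -- not OHSM code.  All ohsm_* helpers are hypothetical
 * stand-ins for the fs-specific primitives.
 */
static int ohsm_copy_file(struct inode *src, struct inode *dst)
{
        loff_t size = i_size_read(src);
        pgoff_t idx, nr = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
        int err;

        /* Reserve every destination block up front, so any ENOSPC
         * shows up here, before a single byte has moved. */
        err = ohsm_preallocate(dst, size);              /* hypothetical */
        if (err)
                return err;

        for (idx = 0; idx < nr; idx++) {
                /* Slow part: pull the source block into the cache
                 * with no lock held. */
                ohsm_prefetch(src, idx);                /* hypothetical */

                /* Fast part: writers are blocked only while the
                 * in-RAM copy of this one block runs. */
                ohsm_lock_blocks(src);                  /* hypothetical */
                err = ohsm_copy_block(src, dst, idx);   /* hypothetical */
                ohsm_unlock_blocks(src);                /* hypothetical */
                if (err)
                        return err;
        }
        /* Writes that extend i_size are NOT covered by this loop and
         * need separate handling, as discussed above. */
        return 0;
}
===

The point being: the lock is held only across an in-RAM copy of one
block, the slow disk reads happen unlocked, and the file-extension case
still sits outside the loop.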
>> then read the direct reply from an Oracle guy to me:
>>
>> Peter Teoh wrote:
>>
>> invalidate the earlier fsck results. This idea has its equivalence in
>> the Oracle database world - the "online datafile backup" feature,
>> where all transactions go to memory + journal logs (a physical file
>> itself), and the datafile is frozen for writing, enabling it to be
>> physically copied):
>>
>> Sunil Mushran from Oracle ==>
>>
>> Actually, no. The dbfile is not frozen. What happens is that the
>> redo is generated in such a way that fractured blocks can be fixed
>> on restore. You will notice the redo size increase when online
>> backup is enabled.
>
> Most DBs use a full transaction log (journal). They can freeze the DB
> / caches / etc. by utilizing the log.
>
> Ext3 is typically run with only a metadata log. It does have support
> for a data log (journal).
>
> In the very long term, I can see that the re-org might be optimized by
> leveraging the data logging code, but I think that can wait.
>
> I again wonder if the ext4 defrag logic uses it? If so, HSM could
> leverage their design and basically piggyback on top of it. But only
> for an ext4 implementation.

Makes sense. Yes, we have started analyzing this. We will update you on
this too. I think it should be helpful for us.

>>> Secondly, for files which are already opened, if a process tries to
>>> do a read/write it sleeps, but doesn't break.
>>> The period of locking will depend on the file size.
>>> See, we take a lock, we read the source inode size, allocate the
>>> required number of blocks in the dest inode, exchange the 15 block
>>> pointers, release the lock, mark the source inode dirty, and delete
>>> the dummy inode.
>>> As there is no copy of data, the time will not be much. For a 10GB
>>> file the time for relocation was in seconds.
>>>
>>>> If you choose the second you might freeze the FS for a long time,
>>>> and if you choose the first then how do you plan to handle the case
>>>> below?
>>>>
>>> The cost will be very high if I freeze the FS. The inode lock saves
>>> us to some extent here.
>>>
>>>> a) Application opens a file for writes (remember space checks are
>>>> done at this place and blocks are preallocated only in memory).
>>>> b) During relocation and before deletion of your destination inode
>>>> you are using 2X the size of your inode.
>>>> c) Now if you unfreeze your FS, it might get ENOSPC in this window.
>
> When it says blocks are only preallocated in memory, what does that
> mean?

Preallocation is just information telling you where you can start
hunting for new blocks. Not very sure what Peter meant.

> I would think you would "allocate" real sectors on real drives. The
> fact that you don't flush the allocation info out to disk is
> unimportant. You own the blocks, and no other filesystem activity can
> steal them from you. Thus the ENOSPC should occur during the
> preallocate.
>
> And it should be relatively easy to have a min freespace check prior
> to doing the preallocate.

We warn the admin at two stages: 70% (WARNING) and 90% (CRITICAL) full.
Also, admins can get the state of the tiers at any point in time
through the /proc or /sys interfaces.

> ie. Require room for one full copy of the file being re-organized,
> plus X%. Eventually X% will need to be a userspace settable value, but
> at this early R&D stage it could be hardcoded to 0%.

This can be done; marking it for later work. A rough sketch of such a
check follows.
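So that it does not get lost, here is roughly what I have in mind for
that check. Again a sketch only, not OHSM code: ohsm_check_space is a
made-up name, and reserve_pct is the user-settable "X%" you mention,
hardcoded to 0 for now.

===
/* Sketch only -- a pre-reorg freespace check.  ohsm_check_space is
 * a hypothetical name; reserve_pct is Greg's "X%" knob. */
static int ohsm_check_space(struct super_block *sb, loff_t file_size,
                            unsigned int reserve_pct)
{
        struct kstatfs st;
        u64 free_bytes, need;
        int err;

        /* 2.6.x vfs_statfs() takes the root dentry and a kstatfs. */
        err = vfs_statfs(sb->s_root, &st);
        if (err)
                return err;

        free_bytes = (u64)st.f_bfree * st.f_bsize;
        /* room for one full copy of the file, plus X% headroom */
        need = (u64)file_size + div_u64((u64)file_size * reserve_pct, 100);

        return free_bytes >= need ? 0 : -ENOSPC;
}
===

Run before the preallocation, this turns the mid-reorg ENOSPC that
Peter describes into an up-front refusal.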
>> The above is just one possible problem.... many deadlocks are
>> possible...
>
> Deadlocks are always possible. The HSM team just has to find them and
> squash them.

:-)

> The deadlock issues in this process seem pretty minimal to me, but
> maybe I'm missing something.
> I have my fingers crossed.

>> But I am amazed by Oracle's strategy.... it is really good for
>> performance.
>
> File system backup has been doing:
>
>     Quiesce Application
>     Quiesce Filesystem
>     Create readonly Snapshot
>     Release Filesystem
>     Release Application
>
> for at least a decade.
>
> ==> Quiesce Application
>
> Enterprise class databases and other applications have the Quiesce
> feature implemented in various ways.
>
> The strategy you describe for Oracle is their implementation of
> Quiesce for Oracle.
>
> Once the HSM team has a solid implementation done for generic
> applications, I agree they should evaluate having user space calls
> into the major applications that tell them to quiesce themselves. By
> doing that, those applications may experience even less impact from
> any locking activity the HSM kernel has to implement.
>
> ==> Quiesce Filesystem
>
> The Quiesce Filesystem concept above has a design goal of creating a
> stable block device, but that is not needed by the HSM team. Instead
> they only need to worry about the in-memory representation of the
> inodes and data blocks associated with the file under re-org.
>
> The block device does not need to be static / locked for that to
> happen.

Thanks for this info. This is great. We are surely looking forward to
this.

>> Check this product:
>>
>> http://www.redbooks.ibm.com/abstracts/redp4065.html?Open
>>
>> and this one:
>>
>> http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg246786.html?Open
>>
>> and within the above document:
>>
>> FlashCopy
>> The primary objective of FlashCopy is to very quickly create a
>> point-in-time copy of a source volume on a target volume. The
>> benefits of FlashCopy are that the point-in-time target copy is
>> immediately available for use for backups or testing, and that the
>> source volume is immediately released so that applications can
>> continue processing with minimal application downtime. The target
>> volume can be either a logical or physical copy of the data, with the
>> latter copying the data as a background process. In a z/OS
>> environment, FlashCopy can also operate at a data set level.
>>
>> So, same features as yours.... done point-in-time.
>>
>> comments?
>
> Tools like FlashCopy block I/O leaving the filesystem code and going
> into the underlying block device.
>
> FYI: Device Mapper snapshots are very similar to the above.
>
> Not useful for this application in my opinion.
>
>> --
>> Regards,
>> Peter Teoh
>
> Greg
> --
> Greg Freemyer
> Litigation Triage Solutions Specialist
> http://www.linkedin.com/in/gregfreemyer
> First 99 Days Litigation White Paper -
> http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
>
> The Norcross Group
> The Intersection of Evidence & Technology
> http://www.norcrossgroup.com

--
Regards,
Sandeep.

"To learn is to change. Education is a process that changes the
learner."