On Wed, Jan 14, 2009 at 10:42 AM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>
>>
>> Well, Well, Well
>> Firstly, the source inode remains intact; only the pointers in the
>> source inode are updated.
>> And we don't freeze the whole FS, we just take a lock on the inode.
>> So, you can operate on all other inodes. We intend to reduce the
>> granularity of locking to per block from per inode, sometime in the
>> near future for sure.
>>
>
> inode level or block level locking is not good for performance, and I
> suspect it is also highly susceptible to deadlock scenarios. Normally
> it should be point in time... like that of Oracle...

Peter, I can see that holding the inode lock for the entire duration of
a file reorg would be unacceptable at some later point in the
development / release cycle. But why would block level be bad?

I would envision it something like:

===
preallocate all blocks in destination file
for each block in file
    prefetch block into cache
    lock updates to all blocks for the given inode
    copy data from pre-fetched source block to pre-allocated dest block
    release lock

Thus the lock is only held for the duration of the data copy, which is
a RAM-based activity.

I'm curious how long ext4 holds a lock during its defrag implementation?
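To put that pseudo code in slightly more concrete terms, here is a
rough kernel-style C sketch. To be clear, every hsm_* helper below is a
made-up name for illustration; none of them exist in the kernel today:

    #include <linux/fs.h>

    /*
     * Sketch only.  All hsm_* helpers are hypothetical stand-ins for
     * whatever the HSM patches would actually provide.
     */
    static int hsm_reorg_file(struct inode *src, struct inode *dst)
    {
            unsigned long nr_blocks = hsm_block_count(src);
            unsigned long blk;
            int err;

            /* Reserve every destination block up front, so ENOSPC
             * shows up here and not halfway through the copy. */
            err = hsm_preallocate_blocks(dst, nr_blocks);
            if (err)
                    return err;

            for (blk = 0; blk < nr_blocks; blk++) {
                    /* Pull the source block into the page cache while
                     * nothing is locked; the disk I/O happens here. */
                    hsm_prefetch_block(src, blk);

                    /* Block updates to this inode's blocks... */
                    hsm_lock_inode_blocks(src);

                    /* ...do the RAM-to-RAM copy while locked... */
                    err = hsm_copy_block(src, dst, blk);

                    /* ...and drop the lock immediately. */
                    hsm_unlock_inode_blocks(src);

                    if (err)
                            return err;
            }
            return 0;
    }

The only work done under the lock is a memory-to-memory copy, so the
hold time per block should be microseconds, not disk latencies.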
> first read this email:
>
> https://kerneltrap.org/mailarchive/linux-fsdevel/2008/4/21/1523184/thread
>
> (I elaborated on filesystem level integrity)

I read it but don't see the relevance. As I understand what Sandeep and
team have done, there is not an integrity issue that I can see. I would
like to see the actual code, but conceptually it sounds fine.

And I think I saw a message that said the write path is unmodified.
That can be true with a single lock around the entire re-org process.
Once you move to a block level lock, the write path will have to be
modified. See my much earlier pseudo code email.

My pseudo code does not address writes that extend the file length.
Obviously that would have to be handled as well.

> then read the direct reply from an Oracle guy to me:
>
> Peter Teoh wrote:
> > invalidate the earlier fsck results. This idea has its equivalence
> in the Oracle database world - the "online datafile backup" feature,
> where all transactions go to memory + journal logs (a physical file
> itself), and the datafile is frozen for writing, enabling it to be
> physically copied):
>
> Sunil Mushran from Oracle ==>
>
> Actually, no. The dbfile is not frozen. What happens is that the
> redo is generated in such a way that fractured blocks can be fixed
> on restore. You will notice the redo size increase when online
> backup is enabled.

Most DBs use a full transaction log (journal). They can freeze the DB /
caches / etc. by utilizing the log.

Ext3 is typically run with only a metadata log. It does have support
for a data log (journal). In the very long term, I can see that the
re-org might be optimized by leveraging the data logging code, but I
think that can wait. I again wonder if the ext4 defrag logic uses it?
If so, HSM could leverage their design and basically piggyback on top
of it. But only for an ext4 implementation.

>> Secondly, for files which are already opened, if it tries to do a
>> read/write it sleeps, but doesn't break.
>> The period of locking will depend on the file size.
>> See, we take a lock, we read the source inode size, allocate the
>> required number of blocks in the dest inode, exchange the 15 block
>> pointers, release the lock, mark the source inode dirty, and delete
>> the dummy inode.
>> As there is no copy of data, the time will not be much. For a 10GB
>> file the time for relocation was in seconds.
>>
>>> If you choose second you might freeze the FS for a long time, and if
>>> you choose first then how do you plan to handle the below case.
>>>
>> The cost will be very high if I freeze the FS. The inode lock saves
>> us to some extent here.
>>
>>
>>> a) Application opens a file for writes (remember space checks are done
>>> at this place and blocks are preallocated only in memory).
>>> b) During relocation and before deletion of your destination inode you
>>> are using 2X the size of your inode.
>>> c) Now if you unfreeze your FS, it might get ENOSPC in this window.
>>>

When it says blocks are only preallocated in memory, what does that
mean? I would think you would "allocate" real sectors on real drives.
The fact that you don't flush the allocation info out to disk is
unimportant. You own the blocks, and no other filesystem activity can
steal them from you. Thus the ENOSPC should occur during the
preallocate.

And it should be relatively easy to have a min freespace check prior to
doing the preallocate, i.e. require room for one full copy of the file
being re-organized, plus X%. Eventually X% will need to be a userspace
settable value, but at this early R&D stage it could be hardcoded to 0%.
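As a sketch of the kind of check I mean, in userspace C. statvfs() and
stat() are real calls; reorg_space_ok() and the reserve_pct knob are
just made-up names for the "X%" idea:

    #include <sys/stat.h>
    #include <sys/statvfs.h>

    /*
     * Sketch of the pre-reorg free-space check: require room for one
     * full copy of the file being re-organized, plus reserve_pct
     * percent.  reserve_pct is the "X%" knob; hardcode 0 for now.
     */
    int reorg_space_ok(const char *path, unsigned reserve_pct)
    {
            struct stat st;
            struct statvfs vfs;
            unsigned long long need, have;

            if (stat(path, &st) != 0 || statvfs(path, &vfs) != 0)
                    return 0;       /* treat errors as "no room" */

            need = (unsigned long long)st.st_size;
            need += need * reserve_pct / 100;   /* one copy + X% */

            /* free blocks available to unprivileged users */
            have = (unsigned long long)vfs.f_bavail * vfs.f_frsize;

            return have >= need;
    }

It is only advisory, of course - another writer can still consume the
space between the check and the preallocate - which is why the
preallocate itself stays the authoritative ENOSPC point.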
> The above is just one possible problem....many deadlocks are possible...

Deadlocks are always possible. The HSM team just has to find them and
squash them. The deadlock issues in this process seem pretty minimal to
me, but maybe I'm missing something.

> But I am amazed by Oracle's strategy....it is really good for performance.

File system backup has been doing:

Quiesce Application
Quiesce Filesystem
Create readonly Snapshot
Release Filesystem
Release Application

for at least a decade.

==> Quiesce Application

Enterprise class databases and other applications have the Quiesce
feature implemented in various ways. The strategy you describe for
Oracle is their implementation of Quiesce for Oracle.

Once the HSM team has a solid implementation done for generic
applications, I agree they should evaluate having user space calls into
the major applications that tell them to quiesce themselves. By doing
that, those applications may experience even less impact from any
locking activity the HSM kernel code has to implement.

==> Quiesce Filesystem

The Quiesce Filesystem concept above has a design goal of creating a
stable block device, but that is not needed by the HSM team. Instead
they need only worry about the in-memory representation of the inodes
and data blocks associated with the file under re-org. The block device
does not need to be static / locked for that to happen.

> Check this product:
>
> http://www.redbooks.ibm.com/abstracts/redp4065.html?Open
>
> and this one:
>
> http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg246786.html?Open
>
> and within the above document:
>
> FlashCopy
> The primary objective of FlashCopy is to very quickly create a
> point-in-time copy of a source volume on a target volume. The benefits
> of FlashCopy are that the point-in-time target copy is immediately
> available for use for backups or testing and that the source volume is
> immediately released so that applications can continue processing with
> minimal application downtime. The target volume can be either a
> logical or physical copy of the data, with the latter copying the data
> as a background process. In a z/OS environment, FlashCopy can also
> operate at a data set level.
>
> So, same features as yours.... done point-in-time.
>
> comments?

Tools like FlashCopy operate on the i/o leaving the filesystem code and
going into the underlying block device.

FYI: Device Mapper snapshots are very similar to the above. Not useful
for this application in my opinion.

> --
> Regards,
> Peter Teoh

Greg
--
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

--
To unsubscribe from this list: send an email with
"unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ