On Wed, Apr 24, 2013 at 11:25 AM, Shishir Gowda <sgowda@xxxxxxxxxx> wrote:
> Hi Xavi,
>
> I would be interested in knowing what the gfsck tool would try to accomplish.
>
> I have certain scenarios from the distribute xlator which would be ideal candidates to be handled in fsck.
>
> Please let me know the scope of gfsck, so that I could share the ideas with you.
>
> With regards,
> Shishir
>
> ----- Original Message -----
> From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
> To: "Krishnan Parthasarathi" <kparthas@xxxxxxxxxx>
> Cc: gluster-devel@xxxxxxxxxx
> Sent: Monday, April 22, 2013 7:20:36 PM
> Subject: Re: GlusterFS projects page for GSOC and the likes
>
> I've just added 'gfsck', a tool to check file system integrity and
> repair any detected error.
>
> I'm already working on it.
>
> Xavi
>
> On 22/04/13 15:03, Krishnan Parthasarathi wrote:
>> Hi All,
>>
>> I am trying to collect all GlusterFS project ideas into a single page
>> in the wiki, here:
>> http://www.gluster.org/community/documentation/index.php/Projects
>>
>> I have added the first entry. It is about building a diagnostic tool
>> like nfsiostat for GlusterFS mounts. I volunteer to mentor anyone
>> interested in this.
>>
>> I hope to see more entries and volunteers :-)
>>
>> cheers,
>> krish

I have added some of my thoughts here.

Why did we not implement 'gfsck' in the first place? The traditional 'fsck' approach is not scalable:

* It may take anywhere from days to months to complete one full check.
* It requires the filesystem to be offline (unmounted).
* Every n'th boot (mount) requires a full check, but GlusterFS is mounted and running all the time. Errors can quickly accumulate in that window.
* Healing and reliability cannot be an afterthought.

The GlusterFS self-healing mechanism solves these problems by integrating fsck tightly into the file system core. Errors are expected and treated as normally as file operations: they are noticed and caught then and there, while the filesystem still has full context of the problem to fix it.
Healing code is also modular: each translator implements how to handle broken data with respect to its own context.

Then why do we need the 'gfsck' project now?

* Self-healing is inefficient when it comes to a full verify (ls -lR).
* Self-healing focuses on active data only. It assumes the rest of the data is immutable and durable. In reality, that is not the case: there are circumstances where backend brick content can change without notice. For example, if your disk filesystem's ABI changes after a kernel upgrade, your data may get corrupted and go unnoticed; your fsck.ext4 may do a partial recovery after a power failure; admins sometimes fiddle with the backend directly. Such corruption can confuse self-heal and propagate to other nodes.
* There can be bugs in the self-heal code itself.
* gfsck is not a replacement for self-heal; it provides a secondary, additional verification. Users can be fairly confident in the integrity of their data if both self-heal and gfsck report it healthy.

Here are some points:

* catch errors left unnoticed by self-heal
* must perform online fsck
* speed is very important: faster means more frequent gfsck runs
* quick-scan and full-scan options
* verify-only and verify+fix modes
* interactive and non-interactive (--yes) modes
* quiet and verbose output
* preferably a % completion progress report
* ability to resume partial checks from previous runs
* ability to scan only a subdirectory (with a recursive option)
* cooperate with built-in self-heal and active I/O
* ability for a non-root user to perform gfsck on his/her own content alone
* daemon mode (ability to run in a loop at low priority)
* concurrent gfsck: from different clients, on different folders
* one unified UI for both self-heal and gfsck's own mechanism
* incorporate some heuristic checks to speed things up

Implementing all of these is beyond the scope of your GSoC project. Pick some of them and get your project accepted into the official gluster branch. You can do the rest in phases.
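To make a couple of these points concrete, here is a hypothetical sketch (not GlusterFS code) of how "resume partial checks from previous runs" and "quick and full scan" could look in a scanner loop. The function name `gfsck_scan`, the JSON checkpoint format, and the pluggable `check` callback are all invented for illustration:

```python
# Hypothetical sketch only -- not GlusterFS code. Illustrates two wishlist
# items: resumable partial checks and a quick-scan vs. full-scan option.
import json
import os


def gfsck_scan(root, check, checkpoint_path, quick=False):
    """Walk `root`, run `check(path)` on each regular file, and record
    verified paths in `checkpoint_path` so an interrupted run resumes
    where it left off instead of starting over."""
    try:
        with open(checkpoint_path) as f:
            done = set(json.load(f))
    except (OSError, ValueError):
        done = set()

    errors = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if path in done:
                continue  # already verified in a previous (partial) run
            if quick:
                os.stat(path)  # quick scan: metadata reachability only
            elif not check(path):
                errors.append(path)  # full scan: content check failed
            done.add(path)
            # checkpoint after every file so a crash loses little work
            with open(checkpoint_path, "w") as f:
                json.dump(sorted(done), f)
    return errors
```

A real implementation would of course checkpoint in batches, prioritize cooperation with self-heal and active I/O, and use gfid-based state rather than paths, but the resume-and-skip loop above is the basic shape.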
You will have our full support.

-ab

Imagination is more important than knowledge --Albert Einstein