glusterfs missing files on ls

Dear all,
this is my first message to this mailing list, which I have only just
subscribed to, so please forgive my inexperience. I hope this is also
the right place to ask this question. I'm not a system administrator,
even though I'm asked to act as one (PhD student here). I enjoy the
work, but sometimes I lack the required knowledge. Anyway, here is my
problem, which, as always, needs to be solved by me as soon as
possible.
I installed Gluster 3.3.1 on Ubuntu 12.10 (from the repository) on 4
machines, all connected together via LAN; two of them also have a
dedicated InfiniBand link between them. On two machines I created a
"scratch" volume (distributed, 8 TB total); on the other two I created
a "storage" volume (distributed + replicated, 12 TB total, but because
of the replica only 6 TB is available to users). All of the machines
see both volumes, and for now users have to ssh to one of them (in the
future the volumes will be exported: would you suggest NFS or the
native Gluster client as the mount type?)
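For reference, the setup was roughly along these lines. The hostnames
and brick paths below are placeholders for illustration, not the real
ones:

```shell
# "scratch": plain distributed volume across the two InfiniBand nodes
gluster volume create scratch ib-node1:/export/brick1 ib-node2:/export/brick1
gluster volume start scratch

# "storage": distributed + replicated (replica 2) across the other two nodes
gluster volume create storage replica 2 \
    node3:/export/brick1 node4:/export/brick1 \
    node3:/export/brick2 node4:/export/brick2
gluster volume start storage

# Native FUSE mount from a client:
mount -t glusterfs node3:/storage /mnt/storage
# NFS alternative (Gluster's built-in NFSv3 server):
mount -t nfs -o vers=3 node3:/storage /mnt/storage
```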
The distributed but _not_ replicated volume seems to work very well
(at least so far) and is perfectly accessible from all machines, even
though it is built on the two connected by InfiniBand.
The replicated _and_ distributed volume, on the other hand, has
problems. From all nodes, some files are missing when listing a folder
with commands like 'ls'. This happened from one day to the next: I'm
sure that three days ago it was working perfectly, and the
configuration didn't change (one machine was rebooted, but even
rebooting all of them didn't fix anything).
I tried a volume rebalance to see whether it would help (one magically
fixed a problem at the very beginning of my Gluster adventure), but it
never completed: its status grew to hundreds of millions of scanned
files, even though there should be orders of magnitude fewer files in
this volume. I then listed the contents of the individual bricks and
found that the files are still present on them, each on two bricks
(because of the replica), and perfectly readable when accessed
directly, so it does not seem to be a hardware problem on a particular
brick.
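To compare the bricks with the client view, I used commands roughly
like these (paths are again made up for illustration):

```shell
# The same directory, through the mount and directly on each brick:
ls /mnt/storage/some/folder
ls /export/brick1/some/folder   # on node3
ls /export/brick1/some/folder   # on node4 (its replica pair)

# Gluster's extended attributes on a file that 'ls' doesn't show
# through the mount (run on the brick where the file lives):
getfattr -d -m . -e hex /export/brick1/some/folder/missing-file
```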
As another strategy, I read on the internet that a "find . >
/dev/null" run as root on the root folder of the GlusterFS mount
should trigger a re-hash of the files, so maybe that could help me.
Unfortunately it hangs almost immediately in one of the folders that,
as said, is missing files when listed through the volume.
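Concretely, this is what I ran; 3.3 also seems to have an explicit
heal command, if I understand correctly:

```shell
# Walk the whole volume through the mount point to trigger self-heal:
cd /mnt/storage && find . > /dev/null

# Explicit heal on the "storage" volume (3.3, if I'm not mistaken):
gluster volume heal storage full
gluster volume heal storage info
```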
I tried to read the logs, but nothing strange seems to be happening.
(By the way, while analysing the logs I found that the rebalance also
got stuck in one of these folders and just started counting millions
and millions of "nonexistent" files, which are not present even on the
individual bricks; I'm sure those folders are not that big. That's why
the status reported hundreds of millions of files not requiring
rebalance.)
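For completeness, these are the logs I looked at (default locations on
Ubuntu, if I'm not mistaken; the exact file names depend on the mount
point and brick paths):

```shell
# Client-side log for a volume mounted at /mnt/storage:
less /var/log/glusterfs/mnt-storage.log
# Brick logs on each server:
less /var/log/glusterfs/bricks/export-brick1.log
# Rebalance log for the "storage" volume:
less /var/log/glusterfs/storage-rebalance.log
```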

Do you have any suggestion?
Sorry for the long mail, I hope it's enough to explain my problem.

Thanks a lot in advance for your time and your help.
Best regards to all,

     Stefano
