No, we are not using NFS. Our setup is:

1. Two-node cluster with the two-node option enabled.
2. Hitachi SAN (RAID 6) connected to both nodes via 4 Gbit links.
3. One 10TB, two 4TB and one 2TB disk presented to each node, each with
   its own gfs2 file system and with user quotas enabled. Only the two
   nodes in the cluster mount the drives.
4. A user fills up their quota on the 10TB disk and the system crashes;
   this appears to happen consistently. The quota was only 10G for the
   user, so they were not using a vast amount of space. In total 5TB is
   currently used on the drive:

Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/vg_chadwick-LogVol00  9.8G  3.4G  6.0G  36% /
tmpfs                             253G   47M  253G   1% /dev/shm
/dev/mapper/mpathap3             1008M  148M  810M  16% /boot
/dev/mapper/vg_chadwick-LogVol06   11T  4.7T  5.8T  45% /home
/dev/mapper/vg_chadwick-LogVol05  9.8G  7.5G  1.8G  82% /opt
/dev/mapper/vg_chadwick-LogVol01  5.0G  140M  4.6G   3% /tmp
/dev/mapper/vg_chadwick-LogVol02  9.8G  8.3G  976M  90% /usr
/dev/mapper/vg_chadwick-LogVol03  5.0G  2.7G  2.1G  57% /var
/dev/mapper/sanvg1-sanlv1         4.0T  2.9T  1.2T  71% /san1
/dev/mapper/sanvg2-sanlv2         4.0T  3.2T  851G  80% /san2
/dev/mapper/sanvg3-sanlv3         2.0T  1.8T  259G  88% /san3
/dev/mapper/sanvg4-lvol0           10T  5.1T  5.0T  51% /san4

Filesystem                           Inodes   IUsed      IFree IUse% Mounted on
/dev/mapper/vg_chadwick-LogVol00     647168   54317     592851    9% /
tmpfs                              66157732      58   66157674    1% /dev/shm
/dev/mapper/mpathap3                  65536      62      65474    1% /boot
/dev/mapper/vg_chadwick-LogVol06  749502464 1002734  748499730    1% /home
/dev/mapper/vg_chadwick-LogVol05     647168  236023     411145   37% /opt
/dev/mapper/vg_chadwick-LogVol01     327680     378     327302    1% /tmp
/dev/mapper/vg_chadwick-LogVol02     647168  318728     328440   50% /usr
/dev/mapper/vg_chadwick-LogVol03     327680    7228     320452    3% /var
/dev/mapper/sanvg1-sanlv1         320266537  140997  320125540    1% /san1
/dev/mapper/sanvg2-sanlv2         223028034   44074  222983960    1% /san2
/dev/mapper/sanvg3-sanlv3          67820453    8357   67812096    1% /san3
/dev/mapper/sanvg4-lvol0         1336002497  392526 1335609971    1% /san4

Thanks,

Stephen

-----Original Message-----
From: Abhijith Das [mailto:adas@xxxxxxxxxx]
Sent: 10 March 2014 19:38
To: linux clustering
Subject: Re: gfs2 and quotas - system crash

----- Original Message -----
> From: "stephen rankin" <stephen.rankin@xxxxxxxxxx>
> To: linux-cluster@xxxxxxxxxx
> Sent: Monday, March 10, 2014 1:15:08 PM
> Subject: gfs2 and quotas - system crash
>
> Hello,
>
> When using gfs2 with quotas on a SAN that is providing storage to two
> clustered systems running CentOS 6.5, one of the systems can crash.
> The crash appears to be triggered when a user tries to add something
> to a SAN disk after they have exceeded their quota on that disk.
> Sometimes a stack trace is produced in /var/log/messages which appears
> to indicate that gfs2 caused the problem. At the same time as the gfs2
> stack trace you also see messages about the user exceeding their
> quota.
>
> The stack trace is below.
>
> Has anyone got a solution to this, other than switching off quotas? I
> have switched off quotas, which appears to have stabilised the system
> so far, but I do need the quotas on.
>
> Your help is appreciated.

Hi Stephen,

We have another report of this bug, where gfs2 was exported using NFS:
https://bugzilla.redhat.com/show_bug.cgi?id=1059808. Are you using NFS in
your setup as well? We have not been able to reproduce it ourselves to
figure out what might be going on. Do you have a set procedure with which
you can reproduce it reliably? If so, it would be of great help.
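Even a rough script would do. As a hypothetical sketch of the kind of
thing we would try (assuming the standard Linux quota tools, which gfs2
on RHEL/CentOS 6.1 and later plugs into when the file system is mounted
with quota=on; the user name, mount point and sizes below are
placeholders, not taken from your setup):

    # give a test user a 10G block quota (soft and hard) on one gfs2
    # file system; block limits are in 1K units, inode limits left at 0
    setquota -u testuser 10485760 10485760 0 0 /san4

    # as that user, write past the limit; dd should fail with EDQUOT
    # part way through, which is the point
    su - testuser -c 'dd if=/dev/zero of=/san4/testuser/fill bs=1M count=11000'

    # ...then keep creating inodes while over quota; the trace below
    # goes through gfs2_mkdir, so repeated mkdirs look like a good
    # candidate
    su - testuser -c 'for i in $(seq 1 100); do mkdir /san4/testuser/d$i; done'

If a sequence along those lines (or whatever your users were actually
doing at the time) triggers the warning for you, please send it over.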
Also, more info about your setup (file sizes, number of files, how many
nodes mounting gfs2, what kinds of operations are being run, etc.) would
be helpful as well. There is also a note on capturing the quota state
after the quoted trace below.

Cheers!
--Abhi

> Stephen Rankin
> STFC, RAL, ISIS
>
> Mar 5 11:40:50 chadwick kernel: GFS2: fsid=analysis:lvol0.1: quota exceeded for user 101355
> Mar 5 11:40:50 chadwick nslcd[11420]: [767df3] ldap_explode_dn(usi660) returned NULL: Success
> Mar 5 11:40:50 chadwick nslcd[11420]: [767df3] ldap_result() failed: Invalid DN syntax
> Mar 5 11:40:50 chadwick nslcd[11420]: [767df3] lookup of user usi660 failed: Invalid DN syntax
> Mar 5 11:41:46 chadwick kernel: ------------[ cut here ]------------
> Mar 5 11:41:46 chadwick kernel: WARNING: at lib/list_debug.c:26 __list_add+0x6d/0xa0() (Not tainted)
> Mar 5 11:41:46 chadwick kernel: Hardware name: PowerEdge R910
> Mar 5 11:41:46 chadwick kernel: list_add corruption. next->prev should be prev (ffff8820531518d0), but was ffff884d4c4594d0. (next=ffff884d4c4594d0).
> Mar 5 11:41:46 chadwick kernel: Modules linked in: gfs2 dlm configfs bridge autofs4 des_generic ecb md4 nls_utf8 cifs bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ipv6 microcode power_meter iTCO_wdt iTCO_vendor_support dcdbas serio_raw ixgbe dca ptp pps_core mdio lpc_ich mfd_core sg ses enclosure i7core_edac edac_core bnx2 ext4 jbd2 mbcache dm_round_robin sr_mod cdrom sd_mod crc_t10dif qla2xxx scsi_transport_fc scsi_tgt pata_acpi ata_generic ata_piix megaraid_sas dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
> Mar 5 11:41:46 chadwick kernel: Pid: 74823, comm: vncserver Not tainted 2.6.32-431.3.1.el6.x86_64 #1
> Mar 5 11:41:46 chadwick kernel: Call Trace:
> Mar 5 11:41:46 chadwick kernel: [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
> Mar 5 11:41:46 chadwick kernel: [<ffffffff81071f16>] ? warn_slowpath_fmt+0x46/0x50
> Mar 5 11:41:46 chadwick kernel: [<ffffffff812944ed>] ? __list_add+0x6d/0xa0
> Mar 5 11:41:46 chadwick kernel: [<ffffffff811a6c02>] ? new_inode+0x72/0xb0
> Mar 5 11:41:46 chadwick kernel: [<ffffffffa03f45d5>] ? gfs2_create_inode+0x1b5/0x1150 [gfs2]
> Mar 5 11:41:46 chadwick kernel: [<ffffffffa03f3986>] ? gfs2_glock_nq_init+0x16/0x40 [gfs2]
> Mar 5 11:41:46 chadwick kernel: [<ffffffffa03ffc74>] ? gfs2_mkdir+0x24/0x30 [gfs2]
> Mar 5 11:41:46 chadwick kernel: [<ffffffff8122766f>] ? security_inode_mkdir+0x1f/0x30
> Mar 5 11:41:46 chadwick kernel: [<ffffffff81198149>] ? vfs_mkdir+0xd9/0x140
> Mar 5 11:41:46 chadwick kernel: [<ffffffff8119ab67>] ? sys_mkdirat+0xc7/0x1b0
> Mar 5 11:41:46 chadwick kernel: [<ffffffff8119ac68>] ? sys_mkdir+0x18/0x20
> Mar 5 11:41:46 chadwick kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
> Mar 5 11:41:46 chadwick kernel: ---[ end trace e51734a39976a028 ]---
> Mar 5 11:41:46 chadwick kernel: GFS2: fsid=analysis:lvol0.1: quota exceeded for user 101355
> Mar 5 11:41:47 chadwick abrtd: Directory 'oops-2014-03-05-11:41:47-12194-1' creation detected
> Mar 5 11:41:47 chadwick abrt-dump-oops: Reported 1 kernel oopses to Abrt
> Mar 5 11:41:47 chadwick abrtd: Can't open file '/var/spool/abrt/oops-2014-03-05-11:41:47-12194-1/uid': No such file or directory
> Mar 5 11:41:54 chadwick kernel: GFS2: fsid=analysis:lvol0.1: quota exceeded for user 101355
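P.S. If you can capture the quota state for the affected file system
around the time of a crash, that would also be useful. A sketch, on the
assumption that you are using the standard Linux quota tools that gfs2
hooks into on RHEL/CentOS 6 (the mount point is taken from your df
output; the uid is the one reported in the trace):

    # usage and limits for every user on the file system
    repquota -u /san4

    # or just the one user that tripped the quota
    quota -u 101355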
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster