Hi everyone,

I have a "2 x (4 + 2) = 12" Distributed-Disperse volume. After upgrading to 3.7.8, I noticed the volume is frequently out of service. glustershd.log is flooded with the following, even in normal working state:
[ec-combine.c:866:ec_combine_check] 0-mainvol-disperse-1: Mismatching xdata in answers of 'LOOKUP'
[ec-common.c:116:ec_check_status] 0-mainvol-disperse-1: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1E, bad=21)
[ec-common.c:71:ec_heal_report] 0-mainvol-disperse-1: Heal failed [Invalid argument]
[ec-combine.c:206:ec_iatt_combine] 0-mainvol-disperse-0: Failed to combine iatt (inode: xxx, links: 1-1, uid: 1000-1000, gid: 1000-1000, rdev: 0-0, size: xxx-xxx, mode: 100600-100600)
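For anyone who wants to cross-check the heal state, I assume the standard heal query applies to disperse volumes as well (mainvol is my volume name):

    # list entries the self-heal daemon still considers pending
    gluster volume heal mainvol info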
Sometimes there are also 1000+ lines of:

[client-rpc-fops.c:466:client3_3_open_cbk] 0-mainvol-client-7: remote operation failed. Path: <gfid:xxxx> (xxxx) [Too many open files]
after which the brick went offline. "top open" showed "Max open fds: 899195".

Can anyone suggest what happened and what I should do? I was trying to deal with a terrible IOPS problem, but things got even worse.
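"top open" above is the gluster top command; for reference, this is the sort of thing I mean (the brick path and pid 16501 are sm11's first brick, taken from the attached volume status):

    # per-brick open-fd statistics, including the "Max open fds" counter
    gluster volume top mainvol open brick sm11:/mnt/disk1/mainvol
    # raw fd count and limit of the same brick process
    ls /proc/16501/fd | wc -l
    grep "open files" /proc/16501/limits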
Each server has 2 x E5-2630v3 CPUs (32 threads per server) and 32 GB of RAM. Additional info is in the attachments.
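My only stopgap idea so far is raising the fd ceiling the bricks inherit from glusterd; this is an untested sketch on my part, assuming glusterd is managed by systemd, and 1048576 is an arbitrary value:

    # systemd drop-in raising the open-files limit for glusterd
    mkdir -p /etc/systemd/system/glusterd.service.d
    printf '[Service]\nLimitNOFILE=1048576\n' \
        > /etc/systemd/system/glusterd.service.d/limits.conf
    systemctl daemon-reload && systemctl restart glusterd
    # note: running bricks pick up the new limit only after they are restarted

I would much rather understand why the fd count keeps growing, though. Many thanks.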
Sincerely yours,
Chen

--
Chen Chen
Shanghai SmartQuerier Biotechnology Co., Ltd.
Add: Room 410, 781 Cai Lun Road,
China (Shanghai) Pilot Free Trade Zone,
Shanghai 201203, P. R. China
Mob: +86 15221885893
Email: chenchen@xxxxxxxxxxxxxxxx
Web: www.smartquerier.com
Volume Name: mainvol
Type: Distributed-Disperse
Volume ID: 2e190c59-9e28-43a5-b22a-24f75e9a580b
Status: Started
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: sm11:/mnt/disk1/mainvol
Brick2: sm12:/mnt/disk1/mainvol
Brick3: sm13:/mnt/disk1/mainvol
Brick4: sm14:/mnt/disk2/mainvol
Brick5: sm15:/mnt/disk2/mainvol
Brick6: sm16:/mnt/disk2/mainvol
Brick7: sm11:/mnt/disk2/mainvol
Brick8: sm12:/mnt/disk2/mainvol
Brick9: sm13:/mnt/disk2/mainvol
Brick10: sm14:/mnt/disk1/mainvol
Brick11: sm15:/mnt/disk1/mainvol
Brick12: sm16:/mnt/disk1/mainvol
Options Reconfigured:
server.outstanding-rpc-limit: 256
network.remote-dio: false
performance.io-cache: true
performance.readdir-ahead: on
auth.allow: 172.16.135.*
performance.cache-size: 16GB
client.event-threads: 8
server.event-threads: 8
performance.io-thread-count: 32
performance.write-behind-window-size: 4MB
nfs.disable: on
diagnostics.client-log-level: WARNING
diagnostics.brick-log-level: WARNING
cluster.lookup-optimize: on
cluster.readdir-optimize: on
Status of volume: mainvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sm11:/mnt/disk1/mainvol               49152     0          Y       16501
Brick sm12:/mnt/disk1/mainvol               49152     0          Y       15007
Brick sm13:/mnt/disk1/mainvol               49154     0          Y       13123
Brick sm14:/mnt/disk2/mainvol               49154     0          Y       14947
Brick sm15:/mnt/disk2/mainvol               49152     0          Y       13236
Brick sm16:/mnt/disk2/mainvol               49152     0          Y       14762
Brick sm11:/mnt/disk2/mainvol               49153     0          Y       23039
Brick sm12:/mnt/disk2/mainvol               49153     0          Y       19614
Brick sm13:/mnt/disk2/mainvol               49155     0          Y       15387
Brick sm14:/mnt/disk1/mainvol               49155     0          Y       23231
Brick sm15:/mnt/disk1/mainvol               49153     0          Y       28494
Brick sm16:/mnt/disk1/mainvol               49153     0          Y       17656
Self-heal Daemon on localhost               N/A       N/A        Y       25029
Self-heal Daemon on sm11                    N/A       N/A        Y       23634
Self-heal Daemon on sm13                    N/A       N/A        Y       17394
Self-heal Daemon on sm14                    N/A       N/A        Y       31322
Self-heal Daemon on sm12                    N/A       N/A        Y       19609
Self-heal Daemon on hw10                    N/A       N/A        Y       14926
Self-heal Daemon on sm16                    N/A       N/A        Y       17648