Following up here on a related issue that is very serious
for us.
I took down one of the 4 replicated gluster servers for
maintenance today. There are 2 gluster volumes totaling about
600GB. Not that much data. After the server came back online,
it started auto-healing, and pretty much all operations on
gluster froze for many minutes.
For example, I was trying to run an ls -alrt in a folder
with 7300 files, and it took a good 15-20 minutes before
returning.
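To put a repeatable number on the listing slowdown, something like the sketch below could be run against the mounted volume during and after a heal. It only approximates what ls -alrt does (one stat per entry, no sorting), and the function name timed_listing is made up for illustration, not part of any gluster tooling.

```python
import os
import sys
import time


def timed_listing(path):
    """List a directory and stat every entry, roughly like `ls -al`.

    Returns (number_of_entries, elapsed_seconds).
    """
    start = time.monotonic()
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            # ls -l needs metadata for each entry, so force a stat call here.
            entry.stat(follow_symlinks=False)
            count += 1
    return count, time.monotonic() - start


if __name__ == "__main__":
    # Path to measure; defaults to the current directory if none is given.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    n, seconds = timed_listing(target)
    print(f"{n} entries in {seconds:.2f}s")
```

Running it once while the heal is in progress and once afterwards gives two comparable timings for the same folder.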
During this time, iostat shows 100% utilization on the
brick, heal status takes many minutes to return, and
glusterfsd uses up tons of CPU (I saw it spike to 600%).
gluster already has massive performance issues for me, but
healing after a 4-hour downtime is on another level of bad
perf.
For example, this command took many minutes to run:
gluster volume heal androidpolice_data3 info summary
Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
Status: Connected
Total Number of entries: 91
Number of entries in heal pending: 90
Number of entries in split-brain: 0
Number of entries possibly healing: 1
Brick forge:/mnt/forge_block4/androidpolice_data3
Status: Connected
Total Number of entries: 87
Number of entries in heal pending: 86
Number of entries in split-brain: 0
Number of entries possibly healing: 1
Brick hive:/mnt/hive_block4/androidpolice_data3
Status: Connected
Total Number of entries: 87
Number of entries in heal pending: 86
Number of entries in split-brain: 0
Number of entries possibly healing: 1
Brick citadel:/mnt/citadel_block4/androidpolice_data3
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0
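For tracking these counts over time, a minimal sketch of a parser for the summary text is below. It assumes the output keeps exactly the "Brick ..." / "key: number" layout pasted above; the helper names parse_heal_summary and total_pending are hypothetical, and the sample holds only two of the four bricks to keep it short.

```python
from typing import Dict

# Two bricks copied from the output above, used as sample input.
SAMPLE = """\
Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
Status: Connected
Total Number of entries: 91
Number of entries in heal pending: 90
Number of entries in split-brain: 0
Number of entries possibly healing: 1
Brick citadel:/mnt/citadel_block4/androidpolice_data3
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0
"""


def parse_heal_summary(text: str) -> Dict[str, Dict[str, int]]:
    """Map each brick to its numeric counters from a heal summary."""
    bricks: Dict[str, Dict[str, int]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            # A new brick section starts; everything after "Brick " is its name.
            current = line[len("Brick "):]
            bricks[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            value = value.strip()
            # Skip non-numeric fields such as "Status: Connected".
            if value.isdigit():
                bricks[current][key.strip()] = int(value)
    return bricks


def total_pending(bricks: Dict[str, Dict[str, int]]) -> int:
    """Sum the heal-pending counters across all bricks."""
    return sum(b.get("Number of entries in heal pending", 0)
               for b in bricks.values())


if __name__ == "__main__":
    summary = parse_heal_summary(SAMPLE)
    for brick, counts in summary.items():
        print(brick, counts)
    print("total pending:", total_pending(summary))
```

Feeding it the real command output on a timer would give a per-brick pending count that can be logged next to iostat readings.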
Statistics showed a diminishing number of failed heals:
...
Ending time of crawl: Tue Apr 17 21:13:08 2018
Type of crawl: INDEX
No. of entries healed: 2
No. of entries in split-brain: 0
No. of heal failed entries: 102
Starting time of crawl: Tue Apr 17 21:13:09 2018
Ending time of crawl: Tue Apr 17 21:14:30 2018
Type of crawl: INDEX
No. of entries healed: 4
No. of entries in split-brain: 0
No. of heal failed entries: 91
Starting time of crawl: Tue Apr 17 21:14:31 2018
Ending time of crawl: Tue Apr 17 21:15:34 2018
Type of crawl: INDEX
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 88
...
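The trend in these crawl records can be pulled out with a short script. This is only a sketch: it assumes the field labels and timestamp format shown above, uses the two complete crawl records as sample input, and the names parse_crawls and duration_seconds are made up for illustration.

```python
from datetime import datetime

# The two complete crawl records from the statistics output above.
SAMPLE = """\
Starting time of crawl: Tue Apr 17 21:13:09 2018
Ending time of crawl: Tue Apr 17 21:14:30 2018
Type of crawl: INDEX
No. of entries healed: 4
No. of entries in split-brain: 0
No. of heal failed entries: 91
Starting time of crawl: Tue Apr 17 21:14:31 2018
Ending time of crawl: Tue Apr 17 21:15:34 2018
Type of crawl: INDEX
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 88
"""

# Timestamp layout as printed in the output (English day and month names).
TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_crawls(text):
    """Return one dict per crawl with start, end, healed and failed fields."""
    crawls = []
    current = {}
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # lines without a "key: value" shape, e.g. "..."
        key, value = key.strip(), value.strip()
        if key == "Starting time of crawl":
            current = {"start": datetime.strptime(value, TIME_FORMAT)}
        elif key == "Ending time of crawl":
            current["end"] = datetime.strptime(value, TIME_FORMAT)
        elif key == "No. of entries healed":
            current["healed"] = int(value)
        elif key == "No. of heal failed entries":
            # This is the last field of a record, so close the record here.
            current["failed"] = int(value)
            crawls.append(current)
            current = {}
    return crawls


def duration_seconds(crawl):
    """Crawl duration in seconds, or None if a timestamp is missing."""
    if "start" in crawl and "end" in crawl:
        return (crawl["end"] - crawl["start"]).total_seconds()
    return None


if __name__ == "__main__":
    for crawl in parse_crawls(SAMPLE):
        print(duration_seconds(crawl), crawl["healed"], crawl["failed"])
```

A record whose starting line was cut off still parses; it just reports no duration.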
Eventually, everything heals and things get back to at
least a point where the roof isn't on fire anymore.
The server stats and volume options were given in one of
the previous replies to this thread.
Any ideas or things I could run and show the output of to
help diagnose? I'm also very open to working with someone on
the team on a live debugging session if there's interest.