> This morning, I had a symptom of an I/O throughput problem in which
> dirty pages appeared to be taking a long time to write to disk.

That can happen for a lot of reasons, like elevator issues (CFQ has serious problems) and even CPU scheduler issues, RAID HA firmware problems (if you are using one, and you seem to be using MD, but then you may be using several in JBOD mode to handle all the disks), or problems with the Linux page cache (read ahead, the abominable plugger) or the flusher (the defaults are not so hot). Sometimes there are odd resonances between the page cache and multiple layers of MD or LVM too. Lots of people have been burned even with much simpler setups than the one you describe below:

> The system is a large x64 192GiB dell 810 server running
> 2.6.38.5 from kernel.org - the basic workload was data
> intensive - concurrent large NFS (with high metadata/low
> filesize),

Very imaginative. :-)

> rsync/lftp (with low metadata/high file size)

More suitable, but insignificant compared to this:

> all working in a 200TiB XFS volume on a software MD raid0 on
> top of 7 software MD raid6, each w/18 drives.

That's rather more than imaginative :-). But this is a family oriented mailing list, so I can't use appropriate euphemisms, because they no longer look like euphemisms.

> [ ... ] (the array can readily do >1000MiB/second for big
> I/O). [ ... ]

In a very specific narrow case, and you can get that with a lot fewer disks. You have 126 drives that can each do 130MB/s (outer tracks), so you should be getting 10GB/s :-). Also, your 1000MiB/s set probably is not full yet, so that's outer tracks only; when it fills up, data gets onto the inner tracks and gets a bit churned, and then the real performance will "shine" through.

> I did "echo 3 > /proc/sys/vm/drop_caches" repeatedly and
> noticed that according to top, the total amount of cached data
> would drop down rapidly (first time had the big drop), but
> still be stuck at around 8-10Gigabytes.

You have to watch '/proc/meminfo' to check the dirty pages in the cache. But you seem to have 8-10GiB of dirty pages in your 192GiB system. Extraordinarily imaginative.
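For the record, watching that is trivial; something like the following will do (the sysctl names are the usual writeback knobs, and the default percentages are quoted from memory, so check your own kernel):

    # how much of the page cache is dirty or under writeback right now
    watch -n1 'grep -E "^(Dirty|Writeback|NFS_Unstable):" /proc/meminfo'

    # the knobs that decide how much dirt may pile up before writeback
    # is forced; if memory serves the defaults are 10%/20% of RAM, which
    # on a 192GiB machine is a rather "imaginative" amount of dirty pages
    sysctl vm.dirty_background_ratio vm.dirty_ratio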
> While continuing to do this, I noticed finally that the cached
> data value was in fact dropping slowly (at the rate of
> 5-30MiB/second), and in fact finally dropped down to
> approximately 60Megabytes at which point the stuck dpkg
> command finished, and I was again able to issue sync commands
> that finished instantly.

Fantastic stuff: is that cached data, or cached and dirty data? Guessing that it is cached and dirty (also because of the "Subject" line), do you really want to have several GiB of cached dirty pages? Do you want these to be zillions of little metadata transactions scattered at random all over the place? How "good" (I hesitate to use the very word in this context) is this more than imaginative RAID60 set at writing widely scattered small transactions?

> [ ... ] since we will have 5 of these machines running at
> very high rates soon.

Look forward to that :-).

> Also, any suggestions for better metadata

Use some kind of low overhead database if you need a database, else pray :-)

> or log management are very welcome.

Separate drives/flash SSD/RAM SSD. As previously revealed by a question I asked, Linux MD does full-width stripe updates with RAID6. The wider, the better of course :-).

> This particular machine is probably our worst, since it has
> the widest variation in offered file I/O load (tens of
> millions of small files, thousands of >1GB files).

Wide variation is not the problem, and neither is the machine; the approach is.

> If this workload is pushing XFS too hard,

XFS is a very good design within a fairly well defined envelope, and often the problems are more with Linux or application issues, but you may be a bit outside that envelope (euphemism alert), and you need to work with the grain of the storage system (understatement of the week).

> I can deploy new hardware to split the workload across
> different filesystems.

My usual recommendation is to default (unless you have extraordinarily good arguments otherwise, and almost nobody does) to RAID10 sets of at most 10 pairs (of "enterprise" drives of no more than 1TB each), with XFS or JFS depending on workload, as many servers as needed (if at all possible located topologically near their users, to avoid some potentially nasty network syndromes like incast), and to forget about having a single large storage pool. Other details, such as the flusher (wake it every 1-2 seconds) and the elevator ('deadline' or 'noop'), can matter a great deal; see the sketch at the end of this message.

If you do need a single large storage pool, almost the only reasonable way currently (even if I have great hopes for GlusterFS) is Lustre or one of its forks (or much simpler imitators like DPM), and that has its own downsides (it takes a lot of work). But a single large storage pool is almost never needed; at most a single large namespace is, and that can be instantiated with an automounter (and Lustre/DPM/... is in effect a more sophisticated automounter).

If you know better, go ahead and build 200TB XFS filesystems on top of a 7x(16+2) drive RAID60, put lots of small files in them (or whatever), and don't even think about 'fsck' because you "know" it will never happen. And what about backing up one of those storage sets to another one? That can happen in the "background" of course, with no extra load :-).

Just realized another imaginative detail: a 126-drive RAID60 set delivering 200TB, so it looks like you are using 2TB drives. Why am I not surprised? It would be just picture-perfect if they were low cost "eco" drives, and only a bit less so if they were ordinary drives without ERC. Indeed, cost-conscious budget heroes can only suggest using 2TB drives in a 126-drive RAID60 set even for a small-file, metadata-intensive workload, because IOPS and concurrent read/write are obsolete concepts in many parts of the world.

Disclaimer: some smart people I know knowingly built a similar, and fortunately much smaller, collection of RAID6 sets because that was the least worst option for them, and since they know it will not fill up before they can replace it, they are effectively short-stroking all those 2TB drives (I still would have bought ERC ones if possible), so it's cooler than it looks.

> Thanks very much for any thoughts or suggestions,

* Don't expect to slap together a lot of stuff at random and have it
  just work. But then if you didn't expect that, you wouldn't have
  done any of the above.

* "My usual recommendation" above is freely given, yet often worth
  more than months/years of very expensive consultants.

* This mailing list is continuing proof that the "let's bang it
  together, it will just work" club is large.
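PS: since I mentioned the flusher and the elevator above, here is roughly the sort of thing I mean, purely as an illustration of the "every 1-2 seconds" and "deadline or noop" points (the exact values are workload dependent, and the 'sd*' glob assumes ordinary SATA/SAS member disks):

    # flusher: wake up every second and treat pages dirty for more than
    # ~2 seconds as expired, instead of letting GiBs of dirt pile up
    sysctl -w vm.dirty_writeback_centisecs=100
    sysctl -w vm.dirty_expire_centisecs=200

    # elevator: 'deadline' (or 'noop') on every member disk, not CFQ
    for d in /sys/block/sd*/queue/scheduler; do echo deadline > "$d"; done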