On Sat, Jan 26, 2013 at 3:40 AM, Sam Lang <sam.lang@xxxxxxxxxxx> wrote:
> On Fri, Jan 25, 2013 at 10:07 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>> Sorry, I wrote too little yesterday because I was sleepy. That's
>> obviously cache pressure, since dropping caches made these errors
>> disappear for a long period. I'm not very familiar with kernel memory
>> mechanisms, but shouldn't the kernel first try to allocate memory on
>> the second node, if that isn't prohibited by the process's cpuset, and
>> only then report an allocation failure (as can be seen, only node 0 is
>> involved in the failures)? I really have no idea how NUMA awareness
>> comes into play for the osd daemons.
>
> Hi Andrey,
>
> You said that the allocation failure doesn't occur if you flush
> caches, but the kernel should evict pages from the cache as needed so
> that the osd can allocate more memory (unless they're dirty, but it
> doesn't look like you have many dirty pages in this case). It looks
> like you have plenty of reclaimable pages as well. Does the osd
> remain running after that error occurs?

Yes, it keeps running flawlessly, without even a single bit changing in
the osdmap, but unfortunately logging wasn't turned on at that moment.
As soon as I finish the massive test for the ``suicide timeout'' bug,
I'll check your idea with dd and also rerun the test below with
``debug osd = 20''.

My thought is that the kernel has ready-to-be-freed memory on node 1,
but for some strange reason the osd process tries to reserve pages on
node 0 (where it obviously allocated memory on start, since node 1's
memory starts only at high addresses, above 32G), and the kernel then
refuses to free cache on that specific node (it's quite murky, at least
to me, why the kernel does not just invalidate some buffers, even ones
more deserving to stay in RAM than those at the tail of the LRU).

Allocation looks like the following on most of the nodes:

MemTotal:       66081396 kB
MemFree:          278216 kB
Buffers:           15040 kB
Cached:         62422368 kB
SwapCached:            0 kB
Active:          2063908 kB
Inactive:       60876892 kB
Active(anon):     509784 kB
Inactive(anon):       56 kB
Active(file):    1554124 kB
Inactive(file): 60876836 kB

OSD-node free memory, with two osd processes on each node (libvirt
prints the ``Free'' field there):

0:            207500 KiB
1:             72332 KiB
------------------------
Total:        279832 KiB

0:            208528 KiB
1:             80692 KiB
------------------------
Total:        289220 KiB

Since the kernel is known to reserve more memory on the node with
higher memory pressure, this looks consistent: the osd processes work
mostly with node 0's memory, so there is a bigger gap there than on
node 1, which holds almost nothing but fs cache.
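
For what it's worth, the per-node state can also be checked straight
from sysfs, without going through libvirt; a quick sketch (the exact
field set in the per-node meminfo varies a bit between kernel
versions):

    # print free memory and page-cache counters for every NUMA node
    for n in /sys/devices/system/node/node*; do
        echo "== ${n##*/} =="
        grep -E 'MemFree|FilePages|Inactive\(file\)' "$n/meminfo"
    done

numastat -m (from the numactl package) gives a similar per-node
breakdown. If node 0 exhaustion really is the cause, one untested idea
would be to start the daemons with interleaved allocation, e.g.
"numactl --interleave=all ceph-osd -i 0", so their anonymous memory is
spread across both nodes instead of piling up on node 0.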
> I wonder if you see the same error if you do a long write-intensive
> workload on the local disk for the osd in question, maybe
> dd if=/dev/zero of=/data/osd.0/foo
>
> -sam
>
>
>>
>> On Fri, Jan 25, 2013 at 2:42 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>> Hi,
>>>
>>> Those traces happen only under constant heavy writes and seem to be
>>> quite rare. OSD processes do not consume more memory after this
>>> event, and the peaks are not distinguishable in monitoring. I was
>>> able to catch it during four hours of constant writes on the
>>> cluster.
>>>
>>> http://xdel.ru/downloads/ceph-log/allocation-failure/
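
For the dd run, I'll probably use something along these lines; the
block size, count and sync flag are just my choice, not from Sam's
mail:

    # sustained sequential write against the osd's data disk;
    # 64G of zeroes goes through the page cache, which should create
    # the same cache pressure as the cluster test; conv=fdatasync
    # forces the data to the disk before dd exits
    dd if=/dev/zero of=/data/osd.0/foo bs=1M count=65536 conv=fdatasync

If that alone triggers the allocation failure on node 0, it would rule
out the ceph messaging path and point at plain page-cache reclaim.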