On Sat, Jan 26, 2013 at 12:41 PM, Andrey Korolyov <andrey@xxxxxxx> wrote:
> On Sat, Jan 26, 2013 at 3:40 AM, Sam Lang <sam.lang@xxxxxxxxxxx> wrote:
>> On Fri, Jan 25, 2013 at 10:07 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>> Sorry, I wrote too little yesterday because I was sleepy.
>>> This is obviously cache pressure, since dropping caches made these
>>> errors disappear for a long period. I'm not very familiar with kernel
>>> memory mechanisms, but shouldn't the kernel first try to allocate
>>> memory on the second node, if that is not prohibited by the process's
>>> cpuset, and only then report an allocation failure (as far as can be
>>> seen, only node 0 is involved in the failures)? I really have no idea
>>> where NUMA awareness might come into play for the osd daemons.
>>
>> Hi Andrey,
>>
>> You said that the allocation failure doesn't occur if you flush
>> caches, but the kernel should evict pages from the cache as needed so
>> that the osd can allocate more memory (unless they're dirty, and it
>> doesn't look like you have many dirty pages in this case). It looks
>> like you have plenty of reclaimable pages as well. Does the osd
>> remain running after that error occurs?
>
> Yes, it keeps running flawlessly without even a change in the osdmap,
> but unfortunately logging wasn't turned on at that moment. As soon as I
> finish the massive test for the ``suicide timeout'' bug I'll check your
> idea with dd and also rerun the test below with ``debug osd = 20''.
>
> My thought is that the kernel has ready-to-be-freed memory on node1,
> but for some strange reason the osd process tries to reserve pages from
> node0 (where it obviously allocated its memory at start, since node1's
> memory only starts at high addresses above 32G), and the kernel then
> refuses to free cache on that specific node (it's quite murky, at least
> to me, why the kernel does not simply invalidate some buffers, even if
> they are more preferable to keep in RAM than the tail of the LRU ones).
>
> Allocation looks like the following on most nodes:
> MemTotal:       66081396 kB
> MemFree:          278216 kB
> Buffers:           15040 kB
> Cached:         62422368 kB
> SwapCached:            0 kB
> Active:          2063908 kB
> Inactive:       60876892 kB
> Active(anon):     509784 kB
> Inactive(anon):       56 kB
> Active(file):    1554124 kB
> Inactive(file): 60876836 kB
>
> OSD-node free memory, with two osd processes on each node (libvirt
> prints the ``Free'' field there):
>
> 0:       207500 KiB
> 1:        72332 KiB
> --------------------
> Total:   279832 KiB
>
> 0:       208528 KiB
> 1:        80692 KiB
> --------------------
> Total:   289220 KiB
>
> Since it is known that the kernel reserves more memory on the node with
> higher memory pressure, this seems legitimate - the osd processes work
> mostly with node 0's memory, so there is a bigger gap there than on
> node 1, which holds almost nothing but fs cache.
>
> Ahem, once the same trace was produced on an almost empty node by a
> qemu process (which was actually pinned to a specific NUMA node), so it
> seems this is generally a scheduler/mm bug, not directly related to the
> osd processes. In other words, the smaller the share of memory that is
> actually RSS, the higher the probability of such an allocation failure.
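
For completeness, this is roughly how I flush caches and look at the
per-node counters instead of the aggregate /proc/meminfo (plain
sysfs/numactl, nothing cluster-specific; the sysctl knobs at the end are
just the usual suspects for such allocation failures, I have not tuned
them yet):

  # flush page cache, dentries and inodes ("dropping caches" above)
  sync && echo 3 > /proc/sys/vm/drop_caches

  # per-node view of free memory and page cache, node0 vs node1
  numactl --hardware
  grep -E 'MemFree|FilePages|Dirty' /sys/devices/system/node/node*/meminfo

  # common knobs for allocation failures under cache pressure; not yet
  # confirmed whether touching them helps in this case
  sysctl vm.zone_reclaim_mode vm.min_free_kbytes
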
I have printed timestamps of failure events on selected nodes, just for
reference: http://xdel.ru/downloads/ceph-log/allocation-failure/stat.txt

>>
>> I wonder if you see the same error if you do a long write-intensive
>> workload on the local disk for the osd in question, maybe
>> dd if=/dev/zero of=/data/osd.0/foo
>>
>> -sam
>>
>>
>>>
>>> On Fri, Jan 25, 2013 at 2:42 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> Those traces happen only under constant heavy writes and seem to be
>>>> quite rare. OSD processes do not consume more memory after this
>>>> event and the peaks are not distinguishable by monitoring. I was
>>>> able to catch it during four hours of constant writes on the cluster.
>>>>
>>>> http://xdel.ru/downloads/ceph-log/allocation-failure/
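
P.S. For the dd test above, I plan to run something along these lines on
the data disk of the affected osd and watch the per-node counters while
it runs (block size and count are only an example, not what is actually
configured on this cluster):

  # long sequential write through the page cache on the osd data disk;
  # fdatasync at the end so everything is really pushed to the disk
  dd if=/dev/zero of=/data/osd.0/foo bs=4M count=100000 conv=fdatasync

That should keep the fs cache on the osd node full long enough to tell
whether the failure is tied to the osd itself or just to cache pressure
on node0.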