Re: Lot of blocked operations

> On 18 Sep 2015, at 11:28, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> On Fri, 18 Sep 2015 11:07:49 +0200 Olivier Bonvalet wrote:
> 
>> On Friday 18 September 2015 at 10:59 +0200, Jan Schermer wrote:
>>> In that case it can be slow monitors (slow network, slow disks(!!!),
>>> or a CPU or memory problem).
>>> But it can also be on the OSD side, in the form of either CPU usage
>>> or memory pressure - in my case lots of memory was used for the
>>> pagecache (so for all intents and purposes considered "free"), but
>>> when peering the OSD had trouble allocating any memory from it, which
>>> caused lots of slow ops and left peering hanging there for a while.
>>> This also doesn't show up as high CPU usage; only kswapd spins up a
>>> bit (don't be fooled by its name - it has nothing to do with swap in
>>> this case).
>> 
>> My nodes have 256GB of RAM (for the 12x300GB ones) or 128GB of RAM
>> (for the 4x800GB ones), so I will try to track this too. Thanks!
>> 
> I haven't seen this (known problem) with 64GB or 128GB nodes, probably
> because I set /proc/sys/vm/min_free_kbytes to 512MB or 1GB respectively.
> 

I had this set to 6G and it didn't help. This "buffer" is probably only useful for certain atomic allocations that can use it, not for userland processes and their memory. Or maybe they do get memory from this pool, but it gets replenished immediately.
QEMU has no problem allocating 64G on the same host, while the OSD struggles to allocate memory during startup or when PGs are added during rebalancing - probably because it does a lot of smaller allocations instead of one big one.
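
For reference, the knobs being discussed are the standard Linux VM sysctls; a minimal sketch (the values are just the ones from this thread, not recommendations):

    # reserve memory for atomic allocations (value in kB; 6G in my case)
    echo 6291456 > /proc/sys/vm/min_free_kbytes

    # check free-memory fragmentation per zone and order; empty
    # right-hand columns mean no higher-order blocks are available
    cat /proc/buddyinfo

    # the workaround mentioned below: drop the pagecache before touching
    # anything (sync first so dirty pages are written back)
    sync
    echo 1 > /proc/sys/vm/drop_caches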


> Christian.
> 
>> 
>>> Running echo 1 >/proc/sys/vm/drop_caches before I touch anything has
>>> become routine now, and that problem is gone.
>>> 
>>> Jan
>>> 
>>>> On 18 Sep 2015, at 10:53, Olivier Bonvalet <ceph.list@xxxxxxxxx>
>>>> wrote:
>>>> 
>>>> mmm good point.
>>>> 
>>>> I don't see CPU or IO problems on the mons, but in the logs I have this:
>>>> 
>>>> 2015-09-18 01:55:16.921027 7fb951175700  0 log [INF] : pgmap v86359128:
>>>> 6632 pgs: 77 inactive, 1 remapped, 10 active+remapped+wait_backfill,
>>>> 25 peering, 5 active+remapped, 6 active+remapped+backfilling,
>>>> 6499 active+clean, 9 remapped+peering; 18974 GB data, 69004 GB used,
>>>> 58578 GB / 124 TB avail; 915 kB/s rd, 26383 kB/s wr, 1671 op/s;
>>>> 8417/15680513 objects degraded (0.054%); 1062 MB/s, 274 objects/s recovering
>>>> 
>>>> 
>>>> So... it can be a peering problem. Didn't see that, thanks.
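
To see which PGs are stuck peering and where, something like the following should work on Firefly (exact output varies by version; <pgid> is a placeholder for a PG id taken from the dump):

    ceph health detail            # lists blocked requests and stuck PGs
    ceph pg dump_stuck inactive   # PGs that are not active (e.g. peering)
    ceph pg <pgid> query          # peering-state details for a single PG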
>>>> 
>>>> 
>>>> 
>>>> On Friday 18 September 2015 at 09:52 +0200, Jan Schermer wrote:
>>>>> Could this be caused by the monitors? In my case lagging monitors
>>>>> can also cause slow requests (because of slow peering). Not sure if
>>>>> that's expected or not, but of course it doesn't show up on the OSDs
>>>>> as any kind of bottleneck when you try to investigate...
>>>>> 
>>>>> Jan
>>>>> 
>>>>>> On 18 Sep 2015, at 09:37, Olivier Bonvalet <ceph.list@xxxxxxxxx>
>>>>>> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> sorry for the missing information. I was trying to avoid including
>>>>>> too much irrelevant detail ;)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Friday 18 September 2015 at 12:30 +0900, Christian Balzer
>>>>>> wrote:
>>>>>>> Hello,
>>>>>>> 
>>>>>>> On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
>>>>>>> 
>>>>>>> The items below help, but be as specific as possible, from OS and
>>>>>>> kernel version to Ceph version, "ceph -s", and any other specific
>>>>>>> details (pool type, replica size).
>>>>>>> 
>>>>>> 
>>>>>> So, all nodes run Debian Wheezy with a vanilla 3.14.x kernel,
>>>>>> and Ceph 0.80.10.
>>>>>> I don't have a ceph status output at hand right now, but I have
>>>>>> data to move again tonight, so I'll capture it then.
>>>>>> 
>>>>>> The affected pool is a standard one (no erasure coding), with
>>>>>> only 2 replicas (size=2).
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>>> Some additional information:
>>>>>>>> - I have 4 SSDs per node.
>>>>>>> Type, if nothing else for anecdotal reasons.
>>>>>> 
>>>>>> I have 7 storage nodes here:
>>>>>> - 3 nodes with 12 x 300GB SSD OSDs each
>>>>>> - 4 nodes with 4 x 800GB SSD OSDs each
>>>>>> 
>>>>>> And I'm trying to replace the 12x300GB nodes with the 4x800GB ones.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>>> - the CPU usage is near 0
>>>>>>>> - IO wait is near 0 too
>>>>>>> Including the trouble OSD(s)?
>>>>>> 
>>>>>> Yes
>>>>>> 
>>>>>> 
>>>>>>> Measured how, iostat or atop?
>>>>>> 
>>>>>> iostat, htop, and confirmed with our Zabbix monitoring.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>>> - bandwith usage is also near 0
>>>>>>>> 
>>>>>>> Yeah, all of the above is not surprising if everything is stuck
>>>>>>> waiting on some ops to finish.
>>>>>>> 
>>>>>>> How many nodes are we talking about?
>>>>>> 
>>>>>> 
>>>>>> 7 nodes, 52 OSDs.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>>> The whole cluster seems to be waiting for something... but I
>>>>>>>> don't see what.
>>>>>>>> 
>>>>>>> Is it just one specific OSD (or a set of them), or is it all over
>>>>>>> the place?
>>>>>> 
>>>>>> A set of them. When I increase the weight of all 4 OSDs on a
>>>>>> node, I frequently get blocked IO from 1 OSD of that node.
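
A common mitigation when a reweight triggers blocked IO like this is to throttle backfill and recovery; these options exist in 0.80, and the values below are illustrative, not tuned:

    ceph tell osd.* injectargs '--osd_max_backfills 1'
    ceph tell osd.* injectargs '--osd_recovery_max_active 1'
    ceph tell osd.* injectargs '--osd_recovery_op_priority 1'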
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> Does restarting the OSD fix things?
>>>>>> 
>>>>>> Yes. For several minutes.
>>>>>> 
>>>>>> 
>>>>>>> Christian
>>>>>>>> 
>>>>>>>> On Friday 18 September 2015 at 02:35 +0200, Olivier Bonvalet
>>>>>>>> wrote:
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> I have a cluster with a lot of blocked operations each time I
>>>>>>>>> try to move data (by slightly reweighting an OSD).
>>>>>>>>> 
>>>>>>>>> It's a full-SSD cluster, with a 10GbE network.
>>>>>>>>> 
>>>>>>>>> In the logs, when I have a blocked OSD, I can see this on the
>>>>>>>>> primary OSD:
>>>>>>>>> 2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow requests, 1 included below; oldest blocked for > 33.976680 secs
>>>>>>>>> 2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow request 30.125556 seconds old, received at 2015-09-18 01:54:46.855821: osd_op(client.29760717.1:18680817544 rb.0.1c16005.238e1f29.00000000027f [write 180224~16384] 6.c11916a4 snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently reached pg
>>>>>>>>> 2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow requests, 1 included below; oldest blocked for > 63.981596 secs
>>>>>>>>> 2015-09-18 01:55:46.986324 7f89e8cb8700  0 log [WRN] : slow request 60.130472 seconds old, received at 2015-09-18 01:54:46.855821: osd_op(client.29760717.1:18680817544 rb.0.1c16005.238e1f29.00000000027f [write 180224~16384] 6.c11916a4 snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently reached pg
>>>>>>>>> 
>>>>>>>>> How should I read that? What is this OSD waiting for?
>>>>>>>>> 
>>>>>>>>> Thanks for any help,
>>>>>>>>> 
>>>>>>>>> Olivier
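
Regarding "currently reached pg": the op has arrived at its PG but is waiting there, typically on peering or on earlier ops against the same PG. To see exactly what an op is stuck on, the primary OSD's admin socket can be queried - this exists in 0.80 (substitute the real OSD id for <id>):

    ceph --admin-daemon /var/run/ceph/ceph-osd.<id>.asok dump_ops_in_flight
    ceph --admin-daemon /var/run/ceph/ceph-osd.<id>.asok dump_historic_ops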
>>>>> 
>>>>> 
>>> 
>>> 
>> 
> 
> 
> -- 
> Christian Balzer        Network/Systems Engineer                
> chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
> http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



