I swear I remember a thread on the ML that talked about someone having increased memory usage on their OSDs after upgrading their MONs to Luminous as well, but I can't seem to find it. Iirc the problem for them was resolved when they finished the upgrade to Luminous. It might not be a bad idea for you to run `ceph osd pause` to stop all IO on the cluster and get through the upgrade without worrying about which OSDs are up or not. When all of the peering is done for the OSDs and the upgrade is complete, `ceph osd unpause` and you should be good to go. That works especially well if you're in a situation where you can't serve data to clients anyway since it'll just help things get through faster. If you haven't incurred down time yet, then you might want to try something less drastic.
On Wed, Aug 22, 2018 at 5:44 PM Adrien Gillard <gillard.adrien@xxxxxxxxx> wrote:
Thank you so much for your feedback. It always helps with this kind of situation.We fixed the network issue, added as much RAM and as much swap as possible butwere still far from anything with OOM killer decimating the OSDs which at timesused more than 35GB of memory.We decided to shut half of the OSDs and let the remaining peer and recover. We'llsee in a few hours.We'll keep investigating the root cause but nothing really interesting so far :/_______________________________________________On Wed, Aug 22, 2018 at 10:04 PM Lothar Gesslein <gesslein@xxxxxxxxxxxxx> wrote:Hi Adrien,
I don't expect I can fully explain what happened to your cluster, but
since you got no other feedback so far I'll try my best.
So you have 517 million RADOS objects. Assuming at least 3 copies each
for normal replication or 5 "shards" for EC pools, there are somewhere
between 1.5 to 2.5 billion objects stored on 168 osds, which gives each
osd 9-15 million objects to store and manage.
I am not surprised that this could consume a lot of memory for each osd,
and needs a lot of bandwith when the peering osds exchange info about
all objects. When peering they basically iterate over all objects to
compare metadata and state.
I would not try to downgrade the mons again, which AFAIK is not really
possible, but rather upgrade all OSDs.
As long as you see reasonable progress in the recovery after the memory
and bandwith increase I would give it some time. The most important part
in my opinion is that the number of pgs still peering is going down,
while all osds remain up.
If random osds still go down, I'd try to start only a few osds across
systems at a time to keep the number of concurrently peering pgs low and
a chance to carefully increase and monitor system and network load.
Regards,
Lothar
On 08/22/2018 08:34 PM, Adrien Gillard wrote:
> Some follow-up.
>
> After doubling the RAM (so 128GB for 12x4TB OSD), all the RAM is still
> consumed a bit after the OSD restart.
>
>
> We are considering going through with the update of the OSDs to luminous
> (or maybe go back to jewel on the mons...) but the
> cluster is in bad shape...
>
> health: HEALTH_ERR
> 5054 pgs are stuck inactive for more than 60 seconds
> 73 pgs backfill_wait
> 23 pgs backfilling
> 835 pgs degraded
> 1456 pgs down
> 4632 pgs peering
> 11 pgs recovering
> 11 pgs recovery_wait
> 1341 pgs stale
> 293 pgs stuck degraded
> 5051 pgs stuck inactive
> 3 pgs stuck stale
> 5719 pgs stuck unclean
> 284 pgs stuck undersized
> 820 pgs undersized
> 4 requests are blocked > 32 sec
> recovery 3408607/517873855 objects degraded (0.658%)
> recovery 173949/517873855 objects misplaced (0.034%)
> recovery 3/67978083 objects unfound (0.000%)
> 22 osds down
> pauserd,pausewr,nobackfill,norecover,noscrub,nodeep-scrub
> flag(s) set
>
>
> On Wed, Aug 22, 2018 at 3:35 PM Adrien Gillard <gillard.adrien@xxxxxxxxx>
> wrote:
>
>> Hi everyone,
>>
>> We have a hard time figuring out a behaviour encountered after upgrading
>> the monitors of one of our cluster from Jewel to Luminous yesterday.
>>
>> The cluster is composed of 14 OSDs hosts (2xE5-2640 v3 and 64 GB of RAM),
>> each containing 12x4TB OSD with journals on DC grade SSDs). The cluster was
>> installed in Infernalis and upgraded to Jewel. We mainly use it for RBD
>> with a cache tier in front of an EC pool.
>>
>> We started the upgrade from Jewel 10.2.10 to Luminous 12.2.7 yesterday
>> afternoon with the monitors and the provisionning of the MGR. We were
>> planning on upgrading the OSD servers today.
>> All went well until the cluster received the following commands, a few
>> hours after the upgrade :
>>
>> ceph osd pool set pool-cache-01 target_max_bytes 0
>> ceph osd pool set pool-cache-02 target_max_bytes 0
>>
>> (The context here is that we use this cluster for backups and set
>> target_max_bytes to 0 to stop the flush and give maximum ressources to the
>> backups IOs)
>>
>> Below is an extract from the logs of one of the monitors.
>> The cluster started exchanging a lot over the cluster network, apparently
>> backfill/recovery and multiple Gbits/s, in an amount which we've never seen
>> before. Due to some network uplinks sub-sizing, the traffic could not be
>> handled well and this lead to OSDs crashing.
>>
>> At the moment we are still investigating the logs to try to understand
>> what happened exactly. The cluster is down. When we restart the OSDs, PG
>> start activating and peering but at some point all the RAM and swap of the
>> servers is consumed and OSDs crash.
>> I tried to set the ovious cluster parameters (osd_recovery_max_active=1,
>> osd_recovery_threads=1, osd_max_backfills=1) and norecover, nobackfill, but
>> when the OSDs start, we still see 4 or 5 Gbits/s over the cluster network.
>>
>> We are currently reconfiguring the network to try to remove the bandwith
>> problem and are preparing to add more RAM to the servers.
>>
>> But still, I have a couple of questions :
>>
>> - Has anyone encoutered the issue we have with the cache ? Or have a
>> possible explanation ?
>> - Is it normal to observe so much traffic on the cluster network when the
>> PG are activating / peering with norecover and nobackfill set ?
>>
>> Thank you for reading and for your input !
>>
>>
>> 2018-08-21 21:05:02.766969 7f10de377700 0 log_channel(cluster) log [INF]
>> : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225 TB / 611
>> TB avail; 298 MB/s rd, 212 MB/s wr, 456 op/s
>> 2018-08-21 21:05:03.805124 7f10de377700 0 log_channel(cluster) log [INF]
>> : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225 TB / 611
>> TB avail; 296 MB/s rd, 193 MB/s wr, 433 op/s
>> 2018-08-21 21:05:03.870690 7f10d83cc700 0 mon.clb-cph-frpar2-mon-01@0(leader)
>> e6 handle_command mon_command({"var": "target_max_bytes", "prefix": "osd
>> pool set", "pool": "pool-cache-02", "val": "0"} v 0
>> ) v1
>> 2018-08-21 21:05:03.870749 7f10d83cc700 0 log_channel(audit) log [INF] :
>> from='client.146529236 -' entity='client.admin' cmd=[{"var":
>> "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-02
>> ", "val": "0"}]: dispatch
>> 2018-08-21 21:05:03.913663 7f10d83cc700 0 mon.clb-cph-frpar2-mon-01@0(leader)
>> e6 handle_command mon_command({"prefix": "osd pool stats", "format":
>> "json"} v 0) v1
>> 2018-08-21 21:05:03.913745 7f10d83cc700 0 log_channel(audit) log [DBG] :
>> from='client.? 10.25.25.3:0/4101432770' entity='client.admin'
>> cmd=[{"prefix": "osd pool stats", "format": "json"}]: dispatch
>> 2018-08-21 21:05:04.042835 7f10d83cc700 0 mon.clb-cph-frpar2-mon-01@0(leader)
>> e6 handle_command mon_command({"var": "target_max_bytes", "prefix": "osd
>> pool set", "pool": "pool-cache-01", "val": "0"} v 0
>> ) v1
>> 2018-08-21 21:05:04.042894 7f10d83cc700 0 log_channel(audit) log [INF] :
>> from='client.146565481 -' entity='client.admin' cmd=[{"var":
>> "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-01
>> ", "val": "0"}]: dispatch
>> 2018-08-21 21:05:04.813217 7f10de377700 0 log_channel(cluster) log [INF]
>> : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225 TB / 611
>> TB avail; 271 MB/s rd, 181 MB/s wr, 415 op/s
>> 2018-08-21 21:05:04.821486 7f10de377700 1 mon.clb-cph-frpar2-mon-01@0(leader).osd
>> e114693 e114693: 168 total, 168 up, 168 in
>> 2018-08-21 21:05:04.821711 7f10de377700 0 mon.clb-cph-frpar2-mon-01@0(leader).osd
>> e114693 crush map has features 432365734501548032, adjusting msgr requires
>> 2018-08-21 21:05:04.823032 7f10de377700 1 mon.clb-cph-frpar2-mon-01@0(leader).osd
>> e114693 create_pending setting backfillfull_ratio = 0.9
>> 2018-08-21 21:05:04.823156 7f10de377700 0 log_channel(audit) log [INF] :
>> from='client.146529236 -' entity='client.admin' cmd='[{"var":
>> "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-0
>> 2", "val": "0"}]': finished
>> 2018-08-21 21:05:04.823280 7f10de377700 0 log_channel(audit) log [INF] :
>> from='client.146565481 -' entity='client.admin' cmd='[{"var":
>> "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-0
>> 1", "val": "0"}]': finished
>> 2018-08-21 21:05:04.825528 7f10de377700 0 log_channel(cluster) log [DBG]
>> : osdmap e114693: 168 total, 168 up, 168 in
>> 2018-08-21 21:05:04.844932 7f10de377700 0 log_channel(cluster) log [INF]
>> : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225 TB / 611
>> TB avail; 275 MB/s rd, 192 MB/s wr, 429 op/s
>> 2018-08-21 21:05:05.873682 7f10de377700 0 log_channel(cluster) log [INF]
>> : pgmap 6208 pgs: 187 peering, 6021 active+clean; 261 TB data, 385 TB used,
>> 225 TB / 611 TB avail; 220 MB/s rd, 188 MB/s wr, 383
>> op/s
>> 2018-08-21 21:05:05.895414 7f10de377700 1 mon.clb-cph-frpar2-mon-01@0(leader).osd
>> e114694 e114694: 168 total, 168 up, 168 in
>> 2018-08-21 21:05:05.895721 7f10de377700 0 mon.clb-cph-frpar2-mon-01@0(leader).osd
>> e114694 crush map has features 432365734501548032, adjusting msgr requires
>> 2018-08-21 21:05:05.905473 7f10de377700 1 mon.clb-cph-frpar2-mon-01@0(leader).osd
>> e114694 create_pending setting backfillfull_ratio = 0.9
>> 2018-08-21 21:05:05.940729 7f10de377700 0 log_channel(cluster) log [DBG]
>> : osdmap e114694: 168 total, 168 up, 168 in
>> 2018-08-21 21:05:05.947921 7f10de377700 0 log_channel(cluster) log [INF]
>> : pgmap 6208 pgs: 204 peering, 6004 active+clean; 261 TB data, 385 TB used,
>> 225 TB / 611 TB avail; 201 MB/s rd, 200 MB/s wr, 355
>> op/s
>> 2018-08-21 21:05:06.912752 7f10de377700 1 mon.clb-cph-frpar2-mon-01@0(leader).osd
>> e114695 e114695: 168 total, 168 up, 168 in
>> 2018-08-21 21:05:06.913104 7f10de377700 0 mon.clb-cph-frpar2-mon-01@0(leader).osd
>> e114695 crush map has features 432365734501548032, adjusting msgr requires
>> 2018-08-21 21:05:06.928521 7f10de377700 1 mon.clb-cph-frpar2-mon-01@0(leader).osd
>> e114695 create_pending setting backfillfull_ratio = 0.9
>> 2018-08-21 21:05:06.968793 7f10de377700 0 log_channel(cluster) log [DBG]
>> : osdmap e114695: 168 total, 168 up, 168 in
>> 2018-08-21 21:05:07.001112 7f10d93ce700 1 heartbeat_map reset_timeout
>> 'Monitor::cpu_tp thread 0x7f10d93ce700' had timed out after 0
>> 2018-08-21 21:05:07.004274 7f10de377700 0 log_channel(cluster) log [INF]
>> : pgmap 6208 pgs: 12 unknown, 114 remapped+peering, 5 activating, 481
>> peering, 5596 active+clean; 261 TB data, 385 TB used, 225 T
>> B / 611 TB avail; 177 MB/s rd, 173 MB/s wr, 317 op/s
>> 2018-08-21 21:05:07.025200 7f10d8bcd700 1 heartbeat_map reset_timeout
>> 'Monitor::cpu_tp thread 0x7f10d8bcd700' had timed out after 0
>> 2018-08-21 21:05:07.948138 7f10de377700 1 mon.clb-cph-frpar2-mon-01@0(leader).osd
>> e114696 e114696: 168 total, 168 up, 168 in
>> 2018-08-21 21:05:07.948487 7f10de377700 0 mon.clb-cph-frpar2-mon-01@0(leader).osd
>> e114696 crush map has features 432365734501548032, adjusting msgr requires
>> 2018-08-21 21:05:07.965671 7f10de377700 1 mon.clb-cph-frpar2-mon-01@0(leader).osd
>> e114696 create_pending setting backfillfull_ratio = 0.9
>> 2018-08-21 21:05:08.007585 7f10de377700 0 log_channel(cluster) log [DBG]
>> : osdmap e114696: 168 total, 168 up, 168 in
>> 2018-08-21 21:05:08.052153 7f10de377700 0 log_channel(cluster) log [INF]
>> : pgmap 6208 pgs: 46 activating+remapped, 15 unknown, 305 remapped+peering,
>> 11 activating, 697 peering, 8 active+remapped+backfil
>> ling, 5126 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail;
>> 115 MB/s rd, 182 MB/s wr, 292 op/s; 26681/519327262 objects misplaced
>> (0.005%); 4856 kB/s, 3 objects/s recovering
>>
>>
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
--
Lothar Gesslein
Linux Consultant
Mail: gesslein@xxxxxxxxxxxxx
B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com