Hi,
I have set noout, noscrub and nodeep-scrub, and the last time we added OSDs we added a few at a time.
The main issue here is IOPS: the existing OSDs cannot backfill at a higher rate - not even 1 backfill thread during peak hours, and at most 2 threads during off-peak. Client I/O keeps increasing, and documents are being ingested faster than space is freed up by backfilling PGs to the newly added OSDs.
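For reference, this is roughly how we are toggling things - a sketch only, assuming the backfill "threads" above correspond to osd_max_backfills:

    ceph osd set noout
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # peak hours: 1 backfill per OSD
    ceph tell osd.* injectargs '--osd_max_backfills 1'

    # off-peak: allow 2 per OSD
    ceph tell osd.* injectargs '--osd_max_backfills 2'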
Below is our cluster health
health HEALTH_WARN
5221 pgs backfill_wait
31 pgs backfilling
1453 pgs degraded
4 pgs recovering
1054 pgs recovery_wait
1453 pgs stuck degraded
6310 pgs stuck unclean
384 pgs stuck undersized
384 pgs undersized
recovery 130823732/9142530156 objects degraded (1.431%)
recovery 2446840943/9142530156 objects misplaced (26.763%)
noout,nobackfill,noscrub,nodeep-scrub flag(s) set
mon.mon_1 store is getting too big! 26562 MB >= 15360 MB
mon.mon_2 store is getting too big! 26828 MB >= 15360 MB
mon.mon_3 store is getting too big! 26504 MB >= 15360 MB
monmap e1: 3 mons at {mon_1=x.x.x.x:yyyy/0,mon_2=x.x.x.x:yyyy/0,mon_3=x.x.x.x:yyyy/0}
election epoch 7996, quorum 0,1,2 mon_1,mon_2,mon_3
osdmap e194833: 105 osds: 105 up, 105 in; 5931 remapped pgs
flags noout,nobackfill,noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
pgmap v48390703: 10536 pgs, 18 pools, 144 TB data, 2906 Mobjects
475 TB used, 287 TB / 763 TB avail
130823732/9142530156 objects degraded (1.431%)
2446840943/9142530156 objects misplaced (26.763%)
4851 active+remapped+wait_backfill
4226 active+clean
659 active+recovery_wait+degraded+remapped
377 active+recovery_wait+degraded
357 active+undersized+degraded+remapped+wait_backfill
18 active+recovery_wait+undersized+degraded+remapped
16 active+degraded+remapped+backfilling
13 active+degraded+remapped+wait_backfill
9 active+undersized+degraded+remapped+backfilling
6 active+remapped+backfilling
2 active+recovering+degraded
2 active+recovering+degraded+remapped
client io 11894 kB/s rd, 105 kB/s wr, 981 op/s rd, 72 op/s wr
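On the "mon store is getting too big" warnings in the output above (15360 MB is the default mon_data_size_warn threshold): my understanding is the mons cannot trim old osdmaps while this many PGs are unclean, so the stores should shrink once backfill completes, but compacting the leveldb store can reclaim some space in the meantime, e.g.:

    ceph tell mon.mon_1 compact
    ceph tell mon.mon_2 compact
    ceph tell mon.mon_3 compact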
So, would adding the new OSDs on a new node with SSDs as journals be a good option?
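If we do, my understanding of the Jewel/filestore way is roughly the following - a sketch only, device names are placeholders and not our actual layout:

    # ceph.conf on the new node, so ceph-disk carves 50GB journal partitions
    [osd]
    osd_journal_size = 51200

    # data on the HDDs, journal partitions carved out of the shared SSD
    ceph-disk prepare /dev/sdb /dev/nvme0n1
    ceph-disk prepare /dev/sdc /dev/nvme0n1
    ceph-disk activate /dev/sdb1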
On Sun, Apr 28, 2019 at 6:05 AM Erik McCormick <emccormick@xxxxxxxxxxxxxxx> wrote:
> On Sat, Apr 27, 2019, 3:49 PM Nikhil R <nikh.ravindra@xxxxxxxxx> wrote:
>
>> We have baremetal nodes with 256GB RAM and 36-core CPUs.
>> We are on ceph jewel 10.2.9 with leveldb.
>> The osd's and journals are on the same hdd.
>> We have 1 backfill_max_active, 1 recovery_max_active and 1 recovery_op_priority.
>> The osd crashes and restarts once a pg is backfilled and the next pg tries to backfill. This is when we see in iostat that the disk is utilised up to 100%.
>
> I would set noout to prevent excess movement in the event of OSD flapping, and disable scrubbing and deep scrubbing until your backfilling has completed. I would also bring the new OSDs online a few at a time rather than all 25 at once if you add more servers.
>
>> Appreciate your help, David
>>
>> --
>>
>> On Sun, 28 Apr 2019 at 00:46, David C <dcsysengineer@xxxxxxxxx> wrote:
>>
>>> On Sat, 27 Apr 2019, 18:50 Nikhil R, <nikh.ravindra@xxxxxxxxx> wrote:
>>>
>>>> Guys, we now have a total of 105 osd's on 5 baremetal nodes, each hosting 21 osd's on HDD which are 7TB, with journals on HDD too. Each journal is about 5GB.
>>>
>>> This would imply you've got a separate hdd partition for journals. I don't think there's any value in that and it would probably be detrimental to performance.
>>>
>>>> We expanded our cluster last week and added 1 more node with 21 HDD and journals on the same disk. Our client i/o is too heavy and we are not able to backfill even 1 thread during peak hours - in case we backfill during peak hours, osd's crash, causing undersized pg's, and if we have another osd crash we won't be able to use our cluster due to undersized and recovery pg's. During non-peak hours we can backfill just 8-10 pgs. Due to this our MAX AVAIL is draining out very fast.
>>>
>>> How much ram have you got in your nodes? In my experience that's a common reason for crashing OSDs during recovery ops. What does your recovery and backfill tuning look like?
>>>
>>>> We are thinking of adding 2 more baremetal nodes with 21 * 7TB osd's on HDD and adding 50GB SSD journals for these. We aim to backfill from the 105 osd's a bit faster and expect writes of backfills coming to these osd's faster.
>>>
>>> SSD journals would certainly help, just be sure it's a model that performs well with Ceph.
>>>
>>>> Is this a good viable idea? Thoughts please?
>>>
>>> I'd recommend sharing more detail e.g. full spec of the nodes, Ceph version etc.
>>>
>>>> -Nikhil
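(To answer David's question about our recovery and backfill tuning in the quoted thread: assuming the names above map to the standard options, our current ceph.conf effectively has

    [osd]
    osd_max_backfills = 1
    osd_recovery_max_active = 1
    osd_recovery_op_priority = 1

on all OSD nodes.)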
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com