Hi Dave,
On 23/10/20 at 22:28, Dave Hall wrote:
Eneko,
# ceph health detail
HEALTH_WARN BlueFS spillover detected on 7 OSD(s)
BLUEFS_SPILLOVER BlueFS spillover detected on 7 OSD(s)
osd.1 spilled over 648 MiB metadata from 'db' device (28 GiB used
of 124 GiB) to slow device
osd.3 spilled over 613 MiB metadata from 'db' device (28 GiB used
of 124 GiB) to slow device
osd.4 spilled over 485 MiB metadata from 'db' device (28 GiB used
of 124 GiB) to slow device
osd.10 spilled over 1008 MiB metadata from 'db' device (28 GiB
used of 124 GiB) to slow device
osd.17 spilled over 808 MiB metadata from 'db' device (28 GiB
used of 124 GiB) to slow device
osd.18 spilled over 2.5 GiB metadata from 'db' device (28 GiB
used of 124 GiB) to slow device
osd.20 spilled over 1.5 GiB metadata from 'db' device (28 GiB
used of 124 GiB) to slow device
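For reference, the per-OSD spillover sizes can be pulled out of that `ceph health detail` text mechanically. A minimal sketch (the sample lines are copied from the output above; the regex is my assumption about the message format, not a stable Ceph interface):

```python
import re

# Two sample lines taken from the `ceph health detail` output above.
health_detail = """\
osd.1 spilled over 648 MiB metadata from 'db' device (28 GiB used of 124 GiB) to slow device
osd.18 spilled over 2.5 GiB metadata from 'db' device (28 GiB used of 124 GiB) to slow device
"""

UNITS = {"MiB": 1, "GiB": 1024}  # normalize everything to MiB

total_mib = 0.0
for osd, size, unit in re.findall(
        r"(osd\.\d+) spilled over ([\d.]+) (MiB|GiB)", health_detail):
    total_mib += float(size) * UNITS[unit]

print(f"total spillover: {total_mib / 1024:.2f} GiB")  # 3.13 GiB for the sample
```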
nvme0n1 259:1 0 1.5T 0 disk
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--6dcbb748--13f5--45cb--9d49--6c78d6589a71
│ 253:1 0 124G 0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--736a22a8--e4aa--4da9--b63b--295d8f5f2a3d
│ 253:3 0 124G 0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--751c6623--9870--4123--b551--1fd7fc837341
│ 253:5 0 124G 0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--2a376e8d--abb1--42af--a4bd--4ae8734d703e
│ 253:7 0 124G 0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--54fbe282--9b29--422b--bdb2--d7ed730bc589
│ 253:9 0 124G 0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--c1153cd2--2ec0--4e7f--a3d7--91dac92560ad
│ 253:11 0 124G 0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--d613f4eb--6ddc--4dd5--a2b5--cb520b6ba922
│ 253:13 0 124G 0 lvm
└─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--41f75c25--67db--46e8--a3fb--ddee9e7f7fc4
253:15 0 124G 0 lvm
So, this means that if you use 300GB WAL/DB partitions, your spillovers
will be gone (BlueStore is only using 28 GiB, as you can see).
I don't know what the performance penalty of the current spillover is,
but at least you know those 300GB will be put to use :)
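That 28 GiB figure lines up with the RocksDB level arithmetic that comes up repeatedly on this list: BlueFS only places a RocksDB level on the fast device if the whole level fits, so with a 256 MiB base level and a 10x size multiplier only about 0.25 + 2.5 + 25 ≈ 28 GiB of a 124 GiB device ever gets used, and the next step needs roughly 280 GiB. A rough sketch, assuming those default options (the values are assumptions, check your build):

```python
# Assumed RocksDB defaults: max_bytes_for_level_base = 256 MiB,
# max_bytes_for_level_multiplier = 10.
BASE_GIB = 0.25
MULT = 10

def usable_db_gib(db_device_gib, max_levels=6):
    """Return the DB space BlueFS can actually use: the cumulative size of
    the levels that fit entirely on the fast device; the rest spills over."""
    usable, level = 0.0, BASE_GIB
    for _ in range(max_levels):
        if usable + level > db_device_gib:
            break
        usable += level
        level *= MULT
    return usable

print(usable_db_gib(124))   # 27.75 GiB usable -> matches the '28 GiB used'
print(usable_db_gib(300))   # 277.75 GiB usable -> room for the next level
```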
Cheers
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx
607-760-2328 (Cell)
607-777-4641 (Office)
On 10/23/2020 6:00 AM, Eneko Lacunza wrote:
Hi Dave,
On 22/10/20 at 19:43, Dave Hall wrote:
On 22/10/20 at 16:48, Dave Hall wrote:
(BTW, Nautilus 14.2.7 on Debian non-container.)
We're about to purchase more OSD nodes for our cluster, but I have
a couple questions about hardware choices. Our original nodes
were 8 x 12TB SAS drives and a 1.6TB Samsung NVMe card for WAL,
DB, etc.
We chose the NVMe card for performance, since it has an 8-lane PCIe
interface. However, we're currently seeing BlueFS spillovers.
The Tyan chassis we are considering has the option of 4 x U.2 NVMe
bays - each with 4 PCIe lanes, (and 8 SAS bays). It has occurred
to me that I might stripe 4 1TB NVMe drives together to get much
more space for WAL/DB and a net performance of 16 PCIe lanes.
Any thoughts on this approach?
Don't stripe them; if one NVMe fails you'll lose all the OSDs. Just use
1 NVMe drive for every 2 SAS drives and provision 300GB of WAL/DB for
each OSD (see related threads on this mailing list about why that
exact size).
This way, if an NVMe fails, you'll only lose 2 OSDs.
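As a quick sanity check on that layout (the numbers are assumed from this thread: 4 x 1 TB U.2 NVMe, 8 SAS OSDs, so 2 OSDs per NVMe at 300 GB of WAL/DB each):

```python
# Assumed layout from the thread: 2 OSDs per 1 TB NVMe, 300 GB WAL/DB each.
TB = 1000**4          # drives are sold in decimal units
GB = 1000**3

NVME_TB = 1.0
osds_per_nvme = 2
db_per_osd_gb = 300

needed = osds_per_nvme * db_per_osd_gb * GB
available = NVME_TB * TB
print(needed <= available)                      # True: 600 GB fits on 1 TB
print(f"headroom: {(available - needed) / GB:.0f} GB")
```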
I was under the impression that everything that BlueStore puts on
the SSD/NVMe could be reconstructed from information on the OSD. Am
I mistaken about this? If so, my single 1.6TB NVMe card is equally
vulnerable.
I don't think so; that info only exists on that partition, as was the
case with the filestore journal. Your single 1.6TB NVMe is vulnerable, yes.
Also, what size of WAL/DB partitions do you have now, and what
spillover size?
I recently posted another question to the list on this topic, since
I now have spillover on 7 of 24 OSDs. Since the data layout on the
NVMe for BlueStore is not traditional, I've never quite figured out
how to get this information. The current partition size is 1.6TB
/ 12, since we had the possibility of adding four more drives to each
node. How that was divided between WAL, DB, etc. is something I'd
like to be able to understand. However, we're not going to add the
extra 4 drives, so expanding the LVM partitions is now a possibility.
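For what it's worth, that 1.6TB / 12 split is consistent with the 124G logical volumes in the lsblk output earlier in the thread; a quick unit-conversion check (the 1.6 TB figure is from this thread, and the decimal-vs-binary interpretation is my assumption):

```python
GB = 1000**3
GiB = 1024**3

part_bytes = 1.6e12 / 12          # 1.6 TB NVMe divided into 12 equal slots
print(f"{part_bytes / GB:.1f} GB = {part_bytes / GiB:.1f} GiB")
# -> 133.3 GB, i.e. ~124.2 GiB, matching the 124G LVs shown by lsblk
```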
Can you paste the warning message? It shows the spillover size. What
size are the partitions on the NVMe disk (lsblk)?
Cheers
--
Eneko Lacunza | +34 943 569 206
| elacunza@xxxxxxxxx
Zuzendari teknikoa | https://www.binovo.es
Director técnico | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L | oficina 10-11, 20180 Oiartzun
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx