Eneko and all,
Regarding my current BlueFS Spillover issues, I've just noticed in
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/
that it says:
If there is only a small amount of fast storage available (e.g., less
than a gigabyte), we recommend using it as a WAL device. If there is
more, provisioning a DB device makes more sense. The BlueStore journal
will always be placed on the fastest device available, so using a DB
device will provide the same benefit that the WAL device would while
/also/ allowing additional metadata to be stored there (if it will fit).
This makes me wonder whether I should just move my DBs onto the HDDs
and use the NVMe partitions for WAL only. Does anybody have any
thoughts on this?
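Just to make sure I understand the two options: if I were to rebuild an
OSD either way, I believe the ceph-volume invocations would look roughly
like the following (device names are only placeholders for my layout):

    # DB on NVMe - BlueStore places the WAL on the DB device automatically:
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

    # WAL-only on NVMe - the DB stays on the HDD with the data:
    ceph-volume lvm create --bluestore --data /dev/sdb --block.wal /dev/nvme0n1p1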
BTW, I don't think I have a separate WAL device set up, but I'd really
like to check both the WAL and journal settings to see if I can make
any improvements. I also have 150GB left on my mirrored boot drive; I
could un-mirror part of it and get 300GB of SATA SSD.
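In the meantime, this is roughly how I was planning to check what each
OSD currently has (osd.0 is just an example, and the daemon command has
to run on the node hosting that OSD):

    # Shows whether a separate DB/WAL device is configured for the OSD:
    ceph osd metadata 0 | grep -E 'bluefs|bdev'

    # BlueFS usage per device, including anything spilled to the slow device:
    ceph daemon osd.0 perf dump bluefs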
Thoughts?
-Dave
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx
On 10/23/2020 6:00 AM, Eneko Lacunza wrote:
Hi Dave,
On 22/10/20 at 19:43, Dave Hall wrote:
On 22/10/20 at 16:48, Dave Hall wrote:
(BTW, Nautilus 14.2.7 on Debian non-container.)
We're about to purchase more OSD nodes for our cluster, but I have
a couple questions about hardware choices. Our original nodes were
8 x 12TB SAS drives and a 1.6TB Samsung NVMe card for WAL, DB, etc.
We chose the NVMe card for performance since it has an 8-lane PCIe
interface. However, we're currently seeing BlueFS spillovers.
The Tyan chassis we are considering has the option of 4 x U.2 NVMe
bays, each with 4 PCIe lanes (plus 8 SAS bays). It has occurred to
me that I might stripe 4 x 1TB NVMe drives together to get much more
space for WAL/DB and a net 16 PCIe lanes of performance.
Any thoughts on this approach?
Don't stripe them; if one NVMe fails you'll lose all OSDs. Just use
1 NVMe drive for every 2 SAS drives and provision 300GB of WAL/DB for
each OSD (see related threads on this mailing list about why that
exact size).
This way, if an NVMe fails, you'll only lose 2 OSDs.
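For example, something like this (VG/LV names are only illustrative):

    # Carve two ~300G DB LVs out of each NVMe, one per SAS OSD:
    vgcreate ceph-db-0 /dev/nvme0n1
    lvcreate -L 300G -n db-osd0 ceph-db-0
    lvcreate -L 300G -n db-osd1 ceph-db-0
    ceph-volume lvm create --bluestore --data /dev/sda --block.db ceph-db-0/db-osd0
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db ceph-db-0/db-osd1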
I was under the impression that everything that BlueStore puts on the
SSD/NVMe could be reconstructed from information on the OSD. Am I
mistaken about this? If so, my single 1.6TB NVMe card is equally
vulnerable.
I don't think so; that info only exists on that partition, just as was
the case with the Filestore journal. Your single 1.6TB NVMe is
vulnerable, yes.
Also, what size WAL/DB partitions do you have now, and what is the
spillover size?
I recently posted another question to the list on this topic, since I
now have spillover on 7 of 24 OSDs. Since the data layout BlueStore
uses on the NVMe is not traditional, I've never quite figured out how
to get this information. The current partition size is 1.6TB / 12,
since we had the possibility of adding four more drives to each node.
How that space is divided between WAL, DB, etc. is something I'd like
to be able to understand. However, we're not going to add the extra 4
drives, so expanding the LVM partitions is now a possibility (roughly
as sketched below, if I understand the tooling correctly).
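If I'm reading the docs right, growing an existing DB would go
something like this (the VG/LV name and OSD id are just placeholders):

    # Stop the OSD, grow its DB LV, then tell BlueFS to use the new space:
    systemctl stop ceph-osd@7
    lvextend -L +100G /dev/ceph-db-vg/db-osd7
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-7
    systemctl start ceph-osd@7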
Can you paste the warning message? It shows the spillover size. What
size are the partitions on the NVMe disk (lsblk)?
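If you don't have the warning handy, something like this should show
the relevant numbers (the OSD id is just an example, and the perf dump
has to run on that OSD's node):

    ceph health detail | grep -i spillover   # per-OSD spillover warnings
    ceph daemon osd.7 perf dump bluefs       # db_total_bytes / db_used_bytes / slow_used_bytes
    lsblk /dev/nvme0n1                       # partition/LV sizes on the NVMe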
Cheers
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx