Re: Best way to add caching to a new raid setup.

Appreciate the details. I agree spinning down the disks is a good idea to consider. I will look more into it.

Ramesh


On 9/11/20 3:37 PM, Roger Heflin wrote:
It is simpler, and has very simple, easy-to-maintain moving parts.  I have
been a Linux admin for 20+ years, and a professional Unix admin for
longer, and too often the complicated approach seems nice but has burned
me with bugs and other unexpected results, so simple is best.  The daily
move uses nothing complicated, can be expected to work on any Unix system
that has ever existed, and relies on heavily used operations that have a
high probability of working and of being caught quickly if they break.
Any of the others are a bit more complicated, more likely to have bugs,
and less likely to get caught as quickly as the moving parts I rely on.
I also wanted to be able to spin down my array for the hours when no one
is watching the DVR (usually 18+ hours per day; across 7 drives that is
roughly 1.25 kWh/day, or about 37 kWh/month, or $4-$10 depending on power
costs), and I also have motion software collecting security cams that go
to the SSD and are also copied onto the array nightly.  The security cams
would have kept the array spinning whenever anything moved outside, so
pretty much 100% of the time.
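
For reference, one way to get that spin-down behaviour is to let the
drives' own standby timers handle it via hdparm; a minimal sketch,
assuming the array members are /dev/sd[b-h] (hypothetical names) and the
drives honor the standby timer:

    # Set a 60-minute standby (spindown) timeout on each data drive.
    # -S values 241-251 mean (value-240) x 30 minutes, so 242 = 60 min.
    for d in /dev/sd[b-h]; do
        hdparm -S 242 "$d"
    done
    # Or force an immediate spindown once the box goes idle:
    for d in /dev/sd[b-h]; do
        hdparm -y "$d"
    done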

On Fri, Sep 11, 2020 at 1:39 PM R. Ramesh <rramesh@xxxxxxxxxxx> wrote:
On 8/30/20 10:42 AM, Roger Heflin wrote:
The LSI should be a good controller as long as you use the HBA firmware
and not the RAID firmware.

I use an LSI in HBA mode plus the 8 AMD chipset SATA ports; currently I
have 12 ports cabled to hot-swap bays, but only 7 data disks plus a boot
disk are in use.

How many recordings do you think you will have, and how many
clients/watchers?  With the SSD handling the writes for recordings, my
disks actually spin down if no one is watching anything.

The other trick the partitions let me do: I initially moved from 1.5TB to
3TB disks (2x750GB partitions per disk -> 4x750GB), and once I had three
3TB disks in, I added RAID6 arrays on the 2 extra partitions (+1.5TB),
since I bought the 3TB drives slowly; then each subsequent 3TB disk gets
added to all 4 partitions (+3TB).
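
To illustrate that growth step, a hedged sketch with hypothetical names
(new disk /dev/sdh carrying partitions sdh1-sdh4, arrays /dev/md1-/dev/md4,
each going from 4 to 5 members) might look like:

    # Add the new disk's partition to each array, then reshape from
    # 4 to 5 raid-devices; the backup file covers the critical
    # section at the start of the reshape.
    for n in 1 2 3 4; do
        mdadm --add /dev/md$n /dev/sdh$n
        mdadm --grow /dev/md$n --raid-devices=5 \
              --backup-file=/root/md$n-grow.backup
    done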

On reads, each disk can do at least 50 IOPS, and for the most part the
disks themselves are very likely to cache the entire track the head goes
over, so a 2nd sequential read likely comes from the disk's read cache
and does not have to actually be read.  So several sequential workloads
jumping back and forth do not behave as badly as one would expect.
Writes are a different story and a lot more expensive.  I isolate those
to the SSD and copy them in the middle of the night when activity is low.
And since they are being copied as big fast streams, one file at a time,
they end up with very few fragments and write very quickly.  The way I
have mine set up, MythTV will find the file whether it is in the SSD
recording directory or the RAID recording directory, so when I mv the
files nothing has to be done except the mv.
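
That nightly copy can be as simple as a single cron entry; a minimal
sketch with hypothetical paths, assuming both directories belong to the
same MythTV storage group so the frontend finds a recording in either
place:

    # /etc/cron.d/move-recordings -- move finished recordings to the
    # array at 03:30; skipping files modified in the last hour keeps
    # an in-progress recording from being touched.
    30 3 * * * root  find /mnt/ssd/recordings -type f -mmin +60 -exec mv -n {} /mnt/raid/recordings/ \;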


On Sat, Aug 29, 2020 at 7:56 PM Ram Ramesh <rramesh2400@xxxxxxxxx> wrote:
On 8/29/20 4:26 PM, Roger Heflin wrote:
It is worth noting that if you buy 2 SSDs of exactly the same model at
the same time and use them in a mirror, they are very likely to wear at
about the same rate.

I am hesitant to go much bigger on disks, especially since the $$/GB
really does not change much as the disks get bigger.

And be careful about adding a cheap SATA controller, as a lot of them work badly.

Most of my disks have died from bad blocks causing errors in one section
of the disk, or from bad blocks causing the array to pause for 7 seconds.
Make sure to get a disk with SCTERC settable (the timeout when bad blocks
happen; otherwise the default timeout is 60-120 seconds, but with SCTERC
you can set it to no more than 7 seconds).  In the cases where the entire
disk did not just stop and is only getting bad blocks in places, you
typically have time, as only a single section is getting bad blocks, so
having sections does help there.  Also note that with 4 sections like I
have, mdadm will only run a single rebuild at a time, as it understands
that the underlying disks are shared; this makes replacing a disk with 1
section or 4 sections work pretty much the same.  It does the same thing
on the weekly scans: it sets all 4 to scan, then scans 1 and defers the
others because the disks are shared.

It seems that a disk completely dying is a lot less common than bad-block issues.
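
For anyone shopping for such disks: on drives that support it, the SCTERC
timeouts can be set with smartctl (values are in tenths of a second); a
sketch with a hypothetical device name:

    # Cap error recovery at 7.0 seconds for both reads and writes:
    smartctl -l scterc,70,70 /dev/sdb
    # Show the current setting:
    smartctl -l scterc /dev/sdb
    # Many drives forget this on power cycle, so it is usually
    # reapplied at boot (e.g. from a udev rule or rc script).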

On Sat, Aug 29, 2020 at 3:50 PM Ram Ramesh <rramesh2400@xxxxxxxxx> wrote:
On 8/29/20 12:02 AM, Roman Mamedov wrote:
On Fri, 28 Aug 2020 22:08:22 -0500
"R. Ramesh" <rramesh@xxxxxxxxxxx> wrote:

I do not know how SSD caching is implemented.  I assumed it would be
somewhat similar to a memory cache (L2 vs L3 vs L4, etc.).  I am hoping
that with SSD caching, reads/writes to disk will be larger and sequential
within a file (similar to cache-line fills in a memory cache, which
result in efficient memory bursts).  I thought that is what SSD caching
would do to disk reads/writes.  I assumed that once read-ahead and writes
(assuming a writeback cache) buffer enough data in the SSD, all
reads/writes would go to the SSD, with periodic, well-organized large
transfers to disk.  If I am wrong here, then I do not see any point in an
SSD as a cache.  My aim is not to optimize for cache hits, but to
optimize by preventing the disks from thrashing back and forth, seeking
after every block read.  I suppose the Linux (memory) buffer cache
alleviates some of that.  I was hoping the SSD would provide the next
level.  If not, I am off in my understanding of the SSD as a disk cache.
Just try it; as I said before, with LVM it is easy to remove if it
doesn't work out.  You can always go to the manual copying method or
whatnot, but first why not check whether the automatic caching solution
might be "good enough" for your needs.

Yes, it usually tries to avoid caching long sequential reads or writes,
but there's also quite a bit of other load on the FS, i.e. metadata.  I
found that browsing directories, and especially mounting the filesystem,
benefited greatly from caching.

You are correct that it will try to increase performance via writeback
caching; however, with LVM that needs to be enabled explicitly:
https://www.systutorials.com/docs/linux/man/7-lvmcache/#lbAK
And of course a failure of that cache SSD will mean losing some data,
even if the main array is RAID.  Perhaps you should consider a RAID of
SSDs for the cache in that case.
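
For a concrete picture of what enabling that looks like, a minimal sketch
with hypothetical names (the array is PV /dev/md0 in VG "vg0" holding LV
"media"; the SSD, or an SSD mirror, is /dev/md10):

    # Add the SSD device to the volume group and build a cache pool on it:
    pvcreate /dev/md10
    vgextend vg0 /dev/md10
    lvcreate --type cache-pool -L 450G -n mediacache vg0 /dev/md10
    # Attach it to the big LV in writeback mode (writethrough is default):
    lvconvert --type cache --cachepool vg0/mediacache \
              --cachemode writeback vg0/media
    # To remove the cache later (flushes dirty blocks back to the array):
    lvconvert --uncache vg0/media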

Yes, I have 2x500GB SSDs for cache.  Maybe I should do RAID1 on them and
use that as the cache volume.
I thought SSDs were more reliable, and that even when they begin to die
they become read-only before quitting.  Of course, this is all theory,
and I do not think standards exist on how they behave when reaching EoL.
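
A RAID1 of the two SSDs for that would be straightforward; a sketch with
hypothetical device names, feeding into the lvmcache example above:

    # Mirror the two 500GB SSDs; /dev/md10 can then serve as the
    # cache PV as in the lvmcache sketch above.
    mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sdi /dev/sdj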

Ramesh

My SSDs are from different companies and bought at different times
(2019/2016, I think).

I have not had many hard disk failures.  However, each time I had one, it
has been a total death, so I am a bit biased.  Maybe with sections I can
replace one md at a time and let the others run degraded.  I am sure
there are other tricks.  I am simply saying that cold replacement of
disks involves a lot of reads/writes, and of course computation, in RAID6
vs. RAID1.
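
On that point, mdadm can also rebuild onto a replacement while the failing
member is still in place, so the array never runs degraded during the
swap; a sketch with hypothetical names, repeated per md array when the
disks are partitioned into sections:

    # Add a spare partition, then copy the failing member onto it
    # in the background without degrading the array:
    mdadm /dev/md1 --add /dev/sdh1
    mdadm /dev/md1 --replace /dev/sdc1 --with /dev/sdh1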

Yes, larger disks are not cheaper per GB, but they use one SATA port
versus several smaller disks.  Also, they use less power in the long run
(mine run 24x7).  That is why I have a policy of replacing disks once
disks of 2x the size I currently own become commonplace.

I have an LSI 9211 SAS HBA, which this community touts as reliable.

Regards
Ramesh

Roger,

    Just curious: in your search for an SSD solution for MythTV
recordings, did you consider overlayfs, unionfs, or mergerfs?  If you
did, why did you decide that a simple copy was better?

Ramesh




