On April 8, 2020 10:15:27 PM GMT+03:00, Erik Jacobson <erik.jacobson@xxxxxxx> wrote:

I wanted to share some positive news with the group here.

Summary: Using sharding and squashfs image files, instead of expanded directory trees, for the read-only NFS OS images has led to impressive boot times on 2k-node diskless clusters using 12 servers for gluster+tftp+etc.

Details:

As you may have seen in some of my other posts, we have been using gluster to boot giant clusters, some of which are in the top500 list of HPC resources. The compute nodes are diskless.

Up until now, we have done this by pushing an operating system from our head node to the storage cluster, which is made up of one or more 3-server (3-brick) subvolumes in a distributed/replicate configuration. The servers are also PXE-boot and tftpboot servers and also serve the "miniroot" (basically a fat initrd with a cluster manager toolchain). We also locate other management functions there unrelated to boot and root.

This copy of the operating system is a simple directory tree representing the whole operating system image. You could 'chroot' into it, for example.

This operating system is a read-only NFS mount point that all compute nodes use as the base for their root filesystem.

This has been working well, getting us boot times (not including BIOS startup) of between 10 and 15 minutes for a 2,000-node cluster. Typically a cluster like this has 12 gluster/NFS servers in 3 subvolumes. On simple RHEL8 images without much customization, I tend to get 10 minutes.

We have observed some slowdowns with customers whose job-launch workloads are very metadata intensive. The metadata load of such an operation is very high, with giant loads observed on the gluster servers.

We recently started supporting RW NFS, as opposed to tmpfs, for the writable components of root, since our customers tend to prefer keeping every byte of memory for jobs. We came up with a solution that hosts per-node sparse XFS filesystem image files in a writable gluster area exported over NFS. This makes the RW NFS solution very fast because it reduces the per-node RW NFS metadata traffic. Boot times didn't go up significantly (our first attempt, which used a plain directory tree, was a slow disaster, hitting the worst-case workload of lots of small file writes plus lots of metadata operations). So we solved that problem with XFS filesystem images on RW NFS.

Building on that idea, we have in our development branch a version of the solution that changes the RO NFS image to a squashfs file on a sharded volume. That is, instead of each operating system being many thousands of files that are (slowly) synced to the gluster servers, the head node makes a squashfs file out of the image and pushes that. All the compute nodes then mount the squashfs image from the NFS mount (mount the RO NFS export, then loop-mount the squashfs image).

On a 2,000-node cluster I had access to for a time, our prototype got us boot times of 5 minutes -- including the RO NFS squashfs root and the RW NFS writable areas like /etc and /var (on an XFS image file).
 * We also tried RW NFS with overlay and had no problems there.

I expect that, for people who prefer the squashfs non-expanded format, we can reduce the leader-per-compute density.
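To make the moving parts concrete, here is roughly what the flow looks like. This is only an illustrative sketch: the paths, export names, and image file names are made up for the example, not our actual tooling.

    # Head node: pack the expanded OS tree into a single squashfs file
    # and place it on the sharded gluster volume behind the RO NFS export.
    mksquashfs /opt/images/rhel8 /mnt/gluster/images/rhel8.squashfs \
        -comp xz -noappend

    # Head node: create one sparse XFS image file per node for the
    # writable areas, on the RW NFS gluster area.
    truncate -s 50G /mnt/gluster/rw/node0042.xfs
    mkfs.xfs -q /mnt/gluster/rw/node0042.xfs

    # Compute node (from the miniroot/initrd):
    # 1. Mount the RO NFS export, then loop-mount the squashfs root.
    mount -o ro,nolock leader:/images /mnt/ro
    mount -t squashfs -o loop,ro /mnt/ro/rhel8.squashfs /mnt/rootfs

    # 2. Mount the RW NFS export, then loop-mount this node's XFS image.
    mount -o rw leader:/rw /mnt/rw
    mount -o loop /mnt/rw/node0042.xfs /mnt/node-rw

    # 3. Combine them (an overlay worked fine in our testing) and
    #    switch_root into the result.
    mkdir -p /mnt/node-rw/upper /mnt/node-rw/work
    mount -t overlay overlay \
        -o lowerdir=/mnt/rootfs,upperdir=/mnt/node-rw/upper,workdir=/mnt/node-rw/work \
        /sysroot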
Now, not all customers will want squashfs. Some want to be able to edit a file and see it instantly on all nodes. However, customers looking for fast boot times, or who are suffering slowness on metadata-intensive job-launch workloads, will have a new fast option.

Therefore, it's very important that we still solve the bug we're working on in the other thread. But I wanted to share something positive.

So now I've said something positive instead of only asking for help :) :)

Erik

Good Job Erik!

Best Regards,
Strahil Nikolov
________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users