High-Performance, Highly Available Lustre Solution with xiRAID 4.1 on Dual-Node Shared NVMe
This comprehensive guide demonstrates how to create a robust, high-performance Lustre file system using xiRAID Classic 4.1 and Pacemaker on an SBB platform. We'll walk through the entire process, from system layout and hardware configuration to software installation, cluster setup, and performance tuning. By leveraging dual-ported NVMe drives and advanced clustering techniques, we'll achieve a highly available storage solution capable of delivering impressive read and write speeds. Whether you're building a new Lustre installation or looking to expand an existing one, this article provides a detailed roadmap for creating a cutting-edge, fault-tolerant parallel file system suitable for demanding high-performance computing environments.
System layout
xiRAID Classic 4.1 supports integration of RAIDs into Pacemaker-based HA-clusters. This capability allows users who need to cluster their services to benefit from xiRAID Classic's performance and reliability.
This article describes using an NVMe SBB system (a single box with two x86-64 servers and a set of shared NVMe drives) as a basic Lustre parallel filesystem HA-cluster with data placed on clustered RAIDs based on xiRAID Classic 4.1.
This article will familiarize you with how to deploy xiRAID Classic for a real-life task.
Lustre server SBB Platform
We will use Viking VDS2249R as the SBB platform. The configuration details are presented in the table below.
Viking VDS2249R
 | Node 0 | Node 1 |
---|---|---|
Hostname | node26 | node27 |
CPU | AMD EPYC 7713P 64-Core | AMD EPYC 7713P 64-Core |
Memory | 256GB | 256GB |
OS drives | 2 x Samsung SSD 970 EVO Plus 250GB mirrored | 2 x Samsung SSD 970 EVO Plus 250GB mirrored |
OS | Rocky Linux 8.9 | Rocky Linux 8.9 |
IPMI address | 192.168.64.106 | 192.168.67.23 |
IPMI login | admin | admin |
IPMI password | admin | admin |
Management NIC | enp194s0f0: 192.168.65.26/24 | enp194s0f0: 192.168.65.27/24 |
Cluster Heartbeat NIC | enp194s0f1: 10.10.10.1 | enp194s0f1: 10.10.10.2 |
Infiniband LNET HDR | ib0: 100.100.100.26, ib3: 100.100.100.126 | ib0: 100.100.100.27, ib3: 100.100.100.127 |
NVMes | 24 x Kioxia CM6-R 3.84TB KCM61RUL3T84 | 24 x Kioxia CM6-R 3.84TB KCM61RUL3T84 |
System configuration and tuning
Before software installation and configuration, we need to prepare the platform to provide optimal performance.
Performance tuning
Apply the accelerator-performance tuned profile on both nodes:
# tuned-adm profile accelerator-performance
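You can confirm afterwards that the profile is active (it should report accelerator-performance):
# tuned-adm active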
Network configuration
Check that all IP addresses are resolvable from both hosts. In our case, we will use resolving via the hosts file, so we have the following content in /etc/hosts on both nodes:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.65.26 node26
192.168.65.27 node27
10.10.10.1 node26-ic
10.10.10.2 node27-ic
192.168.64.50 node26-ipmi
192.168.64.76 node27-ipmi
100.100.100.26 node26-ib
100.100.100.27 node27-ib
Policy-based routing setup
We use a multirail configuration on the servers: the two IB interfaces on each server are configured in the same IPv4 network. To make the Linux IP stack work properly in this configuration, we need to set up policy-based routing for these interfaces on both servers.
node26 setup:
node26# nmcli connection modify ib0 ipv4.route-metric 100
node26# nmcli connection modify ib3 ipv4.route-metric 101
node26# nmcli connection modify ib0 ipv4.routes "100.100.100.0/24 src=100.100.100.26 table=100"
node26# nmcli connection modify ib0 ipv4.routing-rules "priority 101 from 100.100.100.26 table 100"
node26# nmcli connection modify ib3 ipv4.routes "100.100.100.0/24 src=100.100.100.126 table=200"
node26# nmcli connection modify ib3 ipv4.routing-rules "priority 102 from 100.100.100.126 table 200"
node26# nmcli connection up ib0
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/6)
node26# nmcli connection up ib3
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/7)
node27 setup:
node27# nmcli connection modify ib0 ipv4.route-metric 100
node27# nmcli connection modify ib3 ipv4.route-metric 101
node27# nmcli connection modify ib0 ipv4.routes "100.100.100.0/24 src=100.100.100.27 table=100"
node27# nmcli connection modify ib0 ipv4.routing-rules "priority 101 from 100.100.100.27 table 100"
node27# nmcli connection modify ib3 ipv4.routes "100.100.100.0/24 src=100.100.100.127 table=200"
node27# nmcli connection modify ib3 ipv4.routing-rules "priority 102 from 100.100.100.127 table 200"
node27# nmcli connection up ib0
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/6)
node27# nmcli connection up ib3
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/7)
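To verify that the policy rules and per-interface routing tables are in place, you can inspect them with iproute2 (shown for node26; node27 is analogous):
node26# ip rule show
node26# ip route show table 100
node26# ip route show table 200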
NVMe drives setup
In the SBB system, we have 24 Kioxia CM6-R 3.84TB KCM61RUL3T84 drives. They are PCIe 4.0, dual-ported, read-intensive drives with 1DWPD endurance. A single drive's performance can theoretically reach up to 6.9GB/s for sequential read and 4.2GB/s for sequential write (according to the vendor specification).
In our setup, we plan to create a simple Lustre installation with sufficient performance. However, since each NVMe in the SBB system is connected to each server with only 2 PCIe lanes, the NVMe drives' performance will be limited. To overcome this limitation, we will create 2 namespaces on each NVMe drive, which will be used for the Lustre OST RAIDs, and create separate RAIDs from the first NVMe namespaces and the second NVMe namespaces. By configuring our cluster software to use the RAIDs made from the first namespaces (and their Lustre servers) on Lustre node #0 and the RAIDs created from the second namespaces on node #1, we will be able to utilize all four PCIe lanes for each NVMe used to store OST data, as Lustre itself will distribute the workload among all OSTs.
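You can confirm the negotiated per-port link width via sysfs before planning the layout (nvme4 is used as an example here; on this platform it is expected to report a width of 2):
# cat /sys/class/nvme/nvme4/device/current_link_width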
Since we are deploying a simple Lustre installation, we will use a simple filesystem scheme with just one metadata server. As we will have only one metadata server, we will need only one RAID for the metadata. Because of this, we will not create two namespaces on the drives used for the MDT RAID.
Here is how the NVMe drive configuration looks initially:
# nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 21G0A046T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme1n1 21G0A04BT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme10n1 21G0A04ET2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme11n1 21G0A045T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme12n1 S59BNM0R702322Z Samsung SSD 970 EVO Plus 250GB 1 8.67 GB / 250.06 GB 512 B + 0 B 2B2QEXM7
/dev/nvme13n1 21G0A04KT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme14n1 21G0A047T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme15n1 21G0A04CT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme16n1 11U0A00KT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme17n1 21G0A04JT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme18n1 21G0A048T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme19n1 S59BNM0R702439A Samsung SSD 970 EVO Plus 250GB 1 208.90 kB / 250.06 GB 512 B + 0 B 2B2QEXM7
/dev/nvme2n1 21G0A041T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme20n1 21G0A03TT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme21n1 21G0A04FT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme22n1 21G0A03ZT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme23n1 21G0A04DT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme24n1 21G0A03VT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme25n1 21G0A044T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme3n1 21G0A04GT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme4n1 21G0A042T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme5n1 21G0A04HT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme6n1 21G0A049T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme7n1 21G0A043T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme8n1 21G0A04AT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme9n1 21G0A03XT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
The Samsung drives are used for the operating system installation.
Let's reserve the /dev/nvme0 and /dev/nvme1 drives for the metadata RAID1. Currently, xiRAID does not support spare pools in a cluster configuration, but having a spare drive is really useful for quick manual drive replacement. So, let's also reserve /dev/nvme3 as a spare for the RAID1 and split all other KCM61RUL3T84 drives into 2 namespaces.
Let's take /dev/nvme4 as an example. All other drives will be split in exactly the same way.
Check the maximum possible size of the drive to be sure:
# nvme id-ctrl /dev/nvme4 | grep -i tnvmcap
tnvmcap : 3840755982336
Check the maximum number of namespaces supported by the drive:
# nvme id-ctrl /dev/nvme4 | grep ^nn
nn : 64
Check the controller ID used for the drive connection on both servers (the IDs will differ):
node27# nvme id-ctrl /dev/nvme4 | grep ^cntlid
cntlid : 0x1
node26# nvme id-ctrl /dev/nvme4 | grep ^cntlid
cntlid : 0x2
We need to calculate the size of the namespaces we are going to create. The real size of the drive in 4K blocks is:
3840755982336/4096=937684566
So, each namespace size in 4K blocks will be:
937684566/2=468842283
In fact, it is not possible to create 2 namespaces of exactly this size because of the NVMe internal architecture. So, we will create namespaces of 468700000 blocks.
If you are building a system for write-intensive tasks, we recommend using write-intensive drives with 3DWPD endurance. If that is not possible and you have to use read-optimized drives, consider leaving some space (10-25%) of the NVMe volume unallocated by namespaces. In many cases, this helps turn the NVMe behavior in terms of write performance degradation closer to that of write-intensive drives.
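For example, if you decided to leave 20% of this drive unallocated, each of the two namespaces would be about 937684566 × 0.8 / 2 ≈ 375,073,826 blocks, rounded down in the same way as above.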
As a first step, remove the existing namespace on one of the nodes:
node26# nvme delete-ns /dev/nvme4 -n 1
After that, create namespaces on the same node:
node26# nvme create-ns /dev/nvme4 --nsze=468700000 --ncap=468700000 -b=4096 --dps=0 -m 1
create-ns: Success, created nsid:1
node26# nvme create-ns /dev/nvme4 --nsze=468700000 --ncap=468700000 -b=4096 --dps=0 -m 1
create-ns: Success, created nsid:2
node26# nvme attach-ns /dev/nvme4 --namespace-id=1 -controllers=0x2
attach-ns: Success, nsid:1
node26# nvme attach-ns /dev/nvme4 --namespace-id=2 -controllers=0x2
attach-ns: Success, nsid:2
Attach the namespaces on the second node with the proper controller:
node27# nvme attach-ns /dev/nvme4 --namespace-id=1 -controllers=0x1
attach-ns: Success, nsid:1
node27# nvme attach-ns /dev/nvme4 --namespace-id=2 -controllers=0x1
attach-ns: Success, nsid:2
It looks like this on both nodes:
# nvme list |grep nvme4
/dev/nvme4n1 21G0A042T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme4n2 21G0A042T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
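Repeating this sequence by hand for every remaining data drive is error-prone, so it can be scripted. Below is a minimal sketch for node26, assuming the same namespace size for every drive and that the local controller ID is 0x2 for all of them; verify cntlid per drive first, and remember to attach both namespaces on node27 with its own controller ID afterwards:
node26# for d in 5 6 7 8 9 10 11 13 14 15 16 17 18 20 21 22 23 24 25; do
  nvme delete-ns /dev/nvme${d} -n 1
  nvme create-ns /dev/nvme${d} --nsze=468700000 --ncap=468700000 -b 4096 --dps=0 -m 1
  nvme create-ns /dev/nvme${d} --nsze=468700000 --ncap=468700000 -b 4096 --dps=0 -m 1
  nvme attach-ns /dev/nvme${d} --namespace-id=1 --controllers=0x2
  nvme attach-ns /dev/nvme${d} --namespace-id=2 --controllers=0x2
done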
All other drives were split in the same way. Here is the resulting configuration:
# nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 21G0A046T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme1n1 21G0A04BT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme10n1 21G0A04ET2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme10n2 21G0A04ET2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme11n1 21G0A045T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme11n2 21G0A045T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme12n1 S59BNM0R702322Z Samsung SSD 970 EVO Plus 250GB 1 8.67 GB / 250.06 GB 512 B + 0 B 2B2QEXM7
/dev/nvme13n1 21G0A04KT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme13n2 21G0A04KT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme14n1 21G0A047T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme14n2 21G0A047T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme15n1 21G0A04CT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme15n2 21G0A04CT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme16n1 11U0A00KT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme16n2 11U0A00KT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme17n1 21G0A04JT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme17n2 21G0A04JT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme18n1 21G0A048T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme18n2 21G0A048T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme19n1 S59BNM0R702439A Samsung SSD 970 EVO Plus 250GB 1 208.90 kB / 250.06 GB 512 B + 0 B 2B2QEXM7
/dev/nvme2n1 21G0A041T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme20n1 21G0A03TT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme20n2 21G0A03TT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme21n1 21G0A04FT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme21n2 21G0A04FT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme22n1 21G0A03ZT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme22n2 21G0A03ZT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme23n1 21G0A04DT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme23n2 21G0A04DT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme24n1 21G0A03VT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme24n2 21G0A03VT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme25n1 21G0A044T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme25n2 21G0A044T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme3n1 21G0A04GT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme4n1 21G0A042T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme4n2 21G0A042T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme5n1 21G0A04HT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme5n2 21G0A04HT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme6n1 21G0A049T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme6n2 21G0A049T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme7n1 21G0A043T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme7n2 21G0A043T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme8n1 21G0A04AT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme8n2 21G0A04AT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme9n1 21G0A03XT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme9n2 21G0A03XT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
Software components installation
Lustre installation
Create the Lustre repo file /etc/yum.repos.d/lustre-repo.repo:
[lustre-server]
name=lustre-server
baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el8.9/server
# exclude=*debuginfo*
gpgcheck=0
[lustre-client]
name=lustre-client
baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el8.9/client
# exclude=*debuginfo*
gpgcheck=0
[e2fsprogs-wc]
name=e2fsprogs-wc
baseurl=https://downloads.whamcloud.com/public/e2fsprogs/latest/el8
# exclude=*debuginfo*
gpgcheck=0
Installing e2fs tools:
yum --nogpgcheck --disablerepo=* --enablerepo=e2fsprogs-wc install e2fsprogs
Installing Lustre kernel:
yum --nogpgcheck --disablerepo=baseos,extras,updates --enablerepo=lustre-server install kernel kernel-devel kernel-headers
Reboot to the new kernel:
reboot
Check the kernel version after reboot:
node26# uname -a
Linux node26 4.18.0-513.9.1.el8_lustre.x86_64 #1 SMP Sat Dec 23 05:23:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Installing Lustre server components:
yum --nogpgcheck --enablerepo=lustre-server,ha install kmod-lustre kmod-lustre-osd-ldiskfs lustre-osd-ldiskfs-mount lustre lustre-resource-agents
Check Lustre module load:
[root@node26 ~]# modprobe -v lustre
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/net/libcfs.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/net/lnet.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/obdclass.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/ptlrpc.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/fld.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/fid.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/osc.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/lov.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/mdc.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/lmv.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/lustre.ko
Unload modules:
# lustre_rmmod
Installing xiRAID Classic 4.1
Installing xiRAID Classic 4.1 on both nodes from the repositories, following the Xinnor xiRAID 4.1.0 Installation Guide:
# yum install -y epel-release
# yum install https://pkg.xinnor.io/repository/Repository/xiraid/el/8/kver-4.18/xiraid-repo-1.1.0-446.kver.4.18.noarch.rpm
# yum install xiraid-release
Pacemaker installation
Run the following steps on both nodes.
Enable the cluster repo:
# yum config-manager --set-enabled ha appstream
Installing the cluster packages:
# yum install pcs pacemaker psmisc policycoreutils-python3
Csync2 installation
Since we are installing the system on Rocky Linux 8, there is no need to compile Csync2 from sources ourselves. Just install the Csync2 package from the Xinnor repository on both nodes:
# yum install csync2
NTP server installation
# yum install chrony
HA cluster setup
Time synchronisation setup
Modify the /etc/chrony.conf file if needed to point to the proper NTP servers. In this setup, we will use the default settings.
# systemctl enable --now chronyd.service
Verify that time synchronisation works properly by running chronyc tracking.
Pacemaker cluster creation
In this chapter, the cluster configuration is described. In our cluster, we use a dedicated network as the cluster interconnect. This network is physically a single direct connection (a dedicated Ethernet cable without any switch) between the enp194s0f1 interfaces on the servers. The cluster interconnect is a critical component of any HA cluster, so its reliability should be high. A Pacemaker-based cluster can be configured with two interconnect networks for improved reliability through redundancy. While we will use a single interconnect network in this configuration, please consider a dual network interconnect for your own projects if needed.
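For reference, a redundant dual-link interconnect can be defined at cluster creation time by passing two addresses per node to pcs; the 10.10.11.x addresses below are hypothetical and only illustrate the syntax:
node26# pcs cluster setup lustrebox0 node26-ic addr=10.10.10.1 addr=10.10.11.1 node27-ic addr=10.10.10.2 addr=10.10.11.2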
Set the firewall to allow pacemaker software to work (on both nodes):
# firewall-cmd --add-service=high-availability
# firewall-cmd --permanent --add-service=high-availability
Set the same password for the hacluster user at both nodes:
# passwd hacluster
Start the cluster software at both nodes:
# systemctl start pcsd.service
# systemctl enable pcsd.service
Authenticate the cluster nodes from one node by their interconnect interfaces:
node26# pcs host auth node26-ic node27-ic -u hacluster
Password:
node26-ic: Authorized
node27-ic: Authorized
Create and start the cluster (start at one node):
node26# pcs cluster setup lustrebox0 node26-ic node27-ic
No addresses specified for host 'node26-ic', using 'node26-ic'
No addresses specified for host 'node27-ic', using 'node27-ic'
Destroying cluster on hosts: 'node26-ic', 'node27-ic'...
node26-ic: Successfully destroyed cluster
node27-ic: Successfully destroyed cluster
Requesting remove 'pcsd settings' from 'node26-ic', 'node27-ic'
node26-ic: successful removal of the file 'pcsd settings'
node27-ic: successful removal of the file 'pcsd settings'
Sending 'corosync authkey', 'pacemaker authkey' to 'node26-ic', 'node27-ic'
node26-ic: successful distribution of the file 'corosync authkey'
node26-ic: successful distribution of the file 'pacemaker authkey'
node27-ic: successful distribution of the file 'corosync authkey'
node27-ic: successful distribution of the file 'pacemaker authkey'
Sending 'corosync.conf' to 'node26-ic', 'node27-ic'
node26-ic: successful distribution of the file 'corosync.conf'
node27-ic: successful distribution of the file 'corosync.conf'
Cluster has been successfully set up.
node26# pcs cluster start --all
node26-ic: Starting Cluster...
node27-ic: Starting Cluster...
Check the current cluster status:
node26# pcs status
Cluster name: lustrebox0
WARNINGS:
No stonith devices and stonith-enabled is not false
Cluster Summary:
* Stack: corosync (Pacemaker is running)
* Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
* Last updated: Fri Jul 12 20:55:53 2024 on node26-ic
* Last change: Fri Jul 12 20:55:12 2024 by hacluster via hacluster on node27-ic
* 2 nodes configured
* 0 resource instances configured
Node List:
* Online: [ node26-ic node27-ic ]
Full List of Resources:
* No resources
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
Fencing setup
It's very important to have properly configured and working fencing (STONITH) in any HA cluster that works with shared storage devices. In our case, the shared devices are all the NVMe namespaces we created earlier. The fencing (STONITH) design should be developed and implemented by the cluster administrator with the system's capabilities and architecture in mind. In this system, we will use fencing via IPMI. In any case, when designing and deploying your own cluster, choose the fencing configuration yourself, considering all the possibilities, limitations, and risks.
First of all, let's check the list of installed fencing agents in our system:
node26# pcs stonith list
fence_watchdog - Dummy watchdog fence agent
So, the IPMI fencing agent is not installed on our cluster nodes. To install it, run the following command (on both nodes):
# yum install fence-agents-ipmilan
You may check the IPMI fencing agent options description by running the following command:
pcs stonith describe fence_ipmilan
Adding the fencing resources:
node26# pcs stonith create node27.stonith fence_ipmilan ip="192.168.67.23" auth=password password="admin" username="admin" method="onoff" lanplus=true pcmk_host_list="node27-ic" pcmk_host_check=static-list op monitor interval=10s
node26# pcs stonith create node26.stonith fence_ipmilan ip="192.168.64.106" auth=password password="admin" username="admin" method="onoff" lanplus=true pcmk_host_list="node26-ic" pcmk_host_check=static-list op monitor interval=10s
Prevent each STONITH resource from starting on the node it is meant to fence:
node26# pcs constraint location node27.stonith avoids node27-ic=INFINITY
node26# pcs constraint location node26.stonith avoids node26-ic=INFINITY
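Once both STONITH resources are running, it is worth testing fencing before any services are placed on the cluster, for example by fencing one node from the other (note that this will power-cycle the target node):
node26# pcs stonith fence node27-ic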
Csync2 configuration
Configure firewall to allow Csync2 to work (run at both nodes):
# firewall-cmd --add-port=30865/tcp
# firewall-cmd --permanent --add-port=30865/tcp
Create the Csync2 configuration file /usr/local/etc/csync2.cfg with the following content at node26 only:
nossl * *;
group csxiha {
  host node26;
  host node27;
  key /usr/local/etc/csync2.key_ha;
  include /etc/xiraid/raids;
}
Generate the key:
node26# csync2 -k /usr/local/etc/csync2.key_ha
Copy the config and the key file to the second node:
node26# scp /usr/local/etc/csync2.cfg /usr/local/etc/csync2.key_ha node27:/usr/local/etc/
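At this point, you can run an initial synchronisation manually to confirm that the nodes can reach each other over TCP port 30865:
node26# /usr/local/sbin/csync2 -xv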
To schedule Csync2 synchronisation once per minute, run crontab -e on both nodes and add the following record:
* * * * * /usr/local/sbin/csync2 -x
In addition, for asynchronous synchronisation, create a synchronisation script by running the following command (repeat the script creation procedure on both nodes):
# vi /etc/xiraid/config_update_handler.sh
Fill the created script with the following content:
#!/usr/bin/bash
/usr/local/sbin/csync2 -xv
Save the file.
After that run the following command to set correct permissions for the script file:
# chmod +x /etc/xiraid/config_update_handler.sh
xiRAID Configuration for cluster setup
Disable RAID autostart to prevent RAIDs from being activated by xiRAID itself during a node boot. In a cluster configuration, RAIDs have to be activated by Pacemaker via cluster resources. Run the following command on both nodes:
# xicli settings cluster modify --raid_autostart 0
Make the xiRAID Classic 4.1 resource agent visible to Pacemaker (run this command sequence on both nodes):
# mkdir -p /usr/lib/ocf/resource.d/xraid
# ln -s /etc/xraid/agents/raid /usr/lib/ocf/resource.d/xraid/raid
xiRAID RAIDs creation
To be able to create RAIDs, we need to install licenses for xiRAID Classic 4.1 on both hosts first. The licenses should be received from Xinnor. To generate the licenses, Xinnor requires the output of the xicli license show command (from both nodes).
node26# xicli license show
Kernel version: 4.18.0-513.9.1.el8_lustre.x86_64
hwkey: B8828A09E09E8F48
license_key: null
version: 0
crypto_version: 0
created: 0-0-0
expired: 0-0-0
disks: 4
levels: 0
type: nvme
disks_in_use: 2
status: trial
The license files received from Xinnor need to be installed with the xicli license update -p command (once again, on both nodes):
node26# xicli license show
Kernel version: 4.18.0-513.9.1.el8_lustre.x86_64
hwkey: B8828A09E09E8F48
license_key: 0F5A4B87A0FC6DB7544EA446B1B4AF5F34A08169C44E5FD119CE6D2352E202677768ECC78F56B583DABE11698BBC800EC96E556AA63E576DAB838010247678E7E3B95C7C4E3F592672D06C597045EAAD8A42CDE38C363C533E98411078967C38224C9274B862D45D4E6DED70B7E34602C80B60CBA7FDE93316438AFDCD7CBD23
version: 1
crypto_version: 1
created: 2024-7-16
expired: 2024-9-30
disks: 600
levels: 70
type: nvme
disks_in_use: 2
status: valid
Since we plan to deploy a small Lustre installation, combining MGT and MDT on the same target device is absolutely OK. But for medium or large Lustre installations, it's better to use a separate target (and RAID) for MGT.
Here is the list of the RAIDs we need to create.
RAID Name | RAID Level | Number of devices | Strip size | Drive list | Lustre target |
---|---|---|---|---|---|
r_mdt0 | 1 | 2 | 16 | /dev/nvme0n1 /dev/nvme1n1 | MGT + MDT index=0 |
r_ost0 | 6 | 10 | 128 | /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme13n1 /dev/nvme14n1 | OST index=0 |
r_ost1 | 6 | 10 | 128 | /dev/nvme4n2 /dev/nvme5n2 /dev/nvme6n2 /dev/nvme7n2 /dev/nvme8n2 /dev/nvme9n2 /dev/nvme10n2 /dev/nvme11n2 /dev/nvme13n2 /dev/nvme14n2 | OST index=1 |
r_ost2 | 6 | 10 | 128 | /dev/nvme15n1 /dev/nvme16n1 /dev/nvme17n1 /dev/nvme18n1 /dev/nvme20n1 /dev/nvme21n1 /dev/nvme22n1 /dev/nvme23n1 /dev/nvme24n1 /dev/nvme25n1 | OST index=2 |
r_ost3 | 6 | 10 | 128 | /dev/nvme15n2 /dev/nvme16n2 /dev/nvme17n2 /dev/nvme18n2 /dev/nvme20n2 /dev/nvme21n2 /dev/nvme22n2 /dev/nvme23n2 /dev/nvme24n2 /dev/nvme25n2 | OST index=3 |
Creating all the RAIDs at the first node:
node26# xicli raid create -n r_mdt0 -l 1 -d /dev/nvme0n1 /dev/nvme1n1
node26# xicli raid create -n r_ost0 -l 6 -ss 128 -d /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme13n1 /dev/nvme14n1
node26# xicli raid create -n r_ost1 -l 6 -ss 128 -d /dev/nvme4n2 /dev/nvme5n2 /dev/nvme6n2 /dev/nvme7n2 /dev/nvme8n2 /dev/nvme9n2 /dev/nvme10n2 /dev/nvme11n2 /dev/nvme13n2 /dev/nvme14n2
node26# xicli raid create -n r_ost2 -l 6 -ss 128 -d /dev/nvme15n1 /dev/nvme16n1 /dev/nvme17n1 /dev/nvme18n1 /dev/nvme20n1 /dev/nvme21n1 /dev/nvme22n1 /dev/nvme23n1 /dev/nvme24n1 /dev/nvme25n1
node26# xicli raid create -n r_ost3 -l 6 -ss 128 -d /dev/nvme15n2 /dev/nvme16n2 /dev/nvme17n2 /dev/nvme18n2 /dev/nvme20n2 /dev/nvme21n2 /dev/nvme22n2 /dev/nvme23n2 /dev/nvme24n2 /dev/nvme25n2
At this stage, there is no need to wait for the RAID initialization to finish; it can safely be left running in the background.
Checking the RAID statuses at the first node:
node26# xicli raid show
╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦═══════════════════╗
║ name ║ static ║ state ║ devices ║ info ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_mdt0 ║ size: 3576 GiB ║ online ║ 0 /dev/nvme0n1 online ║ ║
║ ║ level: 1 ║ initialized ║ 1 /dev/nvme1n1 online ║ ║
║ ║ strip_size: 16 ║ ║ ║ ║
║ ║ block_size: 4096 ║ ║ ║ ║
║ ║ sparepool: - ║ ║ ║ ║
║ ║ active: True ║ ║ ║ ║
║ ║ config: True ║ ║ ║ ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost0 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme4n1 online ║ init_progress: 11 ║
║ ║ level: 6 ║ initing ║ 1 /dev/nvme5n1 online ║ ║
║ ║ strip_size: 128 ║ ║ 2 /dev/nvme6n1 online ║ ║
║ ║ block_size: 4096 ║ ║ 3 /dev/nvme7n1 online ║ ║
║ ║ sparepool: - ║ ║ 4 /dev/nvme8n1 online ║ ║
║ ║ active: True ║ ║ 5 /dev/nvme9n1 online ║ ║
║ ║ config: True ║ ║ 6 /dev/nvme10n1 online ║ ║
║ ║ ║ ║ 7 /dev/nvme11n1 online ║ ║
║ ║ ║ ║ 8 /dev/nvme13n1 online ║ ║
║ ║ ║ ║ 9 /dev/nvme14n1 online ║ ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost1 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme4n2 online ║ init_progress: 7 ║
║ ║ level: 6 ║ initing ║ 1 /dev/nvme5n2 online ║ ║
║ ║ strip_size: 128 ║ ║ 2 /dev/nvme6n2 online ║ ║
║ ║ block_size: 4096 ║ ║ 3 /dev/nvme7n2 online ║ ║
║ ║ sparepool: - ║ ║ 4 /dev/nvme8n2 online ║ ║
║ ║ active: True ║ ║ 5 /dev/nvme9n2 online ║ ║
║ ║ config: True ║ ║ 6 /dev/nvme10n2 online ║ ║
║ ║ ║ ║ 7 /dev/nvme11n2 online ║ ║
║ ║ ║ ║ 8 /dev/nvme13n2 online ║ ║
║ ║ ║ ║ 9 /dev/nvme14n2 online ║ ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost2 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme15n1 online ║ init_progress: 5 ║
║ ║ level: 6 ║ initing ║ 1 /dev/nvme16n1 online ║ ║
║ ║ strip_size: 128 ║ ║ 2 /dev/nvme17n1 online ║ ║
║ ║ block_size: 4096 ║ ║ 3 /dev/nvme18n1 online ║ ║
║ ║ sparepool: - ║ ║ 4 /dev/nvme20n1 online ║ ║
║ ║ active: True ║ ║ 5 /dev/nvme21n1 online ║ ║
║ ║ config: True ║ ║ 6 /dev/nvme22n1 online ║ ║
║ ║ ║ ║ 7 /dev/nvme23n1 online ║ ║
║ ║ ║ ║ 8 /dev/nvme24n1 online ║ ║
║ ║ ║ ║ 9 /dev/nvme25n1 online ║ ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost3 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme15n2 online ║ init_progress: 2 ║
║ ║ level: 6 ║ initing ║ 1 /dev/nvme16n2 online ║ ║
║ ║ strip_size: 128 ║ ║ 2 /dev/nvme17n2 online ║ ║
║ ║ block_size: 4096 ║ ║ 3 /dev/nvme18n2 online ║ ║
║ ║ sparepool: - ║ ║ 4 /dev/nvme20n2 online ║ ║
║ ║ active: True ║ ║ 5 /dev/nvme21n2 online ║ ║
║ ║ config: True ║ ║ 6 /dev/nvme22n2 online ║ ║
║ ║ ║ ║ 7 /dev/nvme23n2 online ║ ║
║ ║ ║ ║ 8 /dev/nvme24n2 online ║ ║
║ ║ ║ ║ 9 /dev/nvme25n2 online ║ ║
╚════════╩══════════════════╩═════════════╩════════════════════════╩═══════════════════╝
Checking that the RAID configs were successfully replicated to the second node (please note that on the second node, the RAID status is None, which is expected in this case):
node27# xicli raid show
╔RAIDs═══╦══════════════════╦═══════╦═════════╦══════╗
║ name ║ static ║ state ║ devices ║ info ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_mdt0 ║ size: 3576 GiB ║ None ║ ║ ║
║ ║ level: 1 ║ ║ ║ ║
║ ║ strip_size: 16 ║ ║ ║ ║
║ ║ block_size: 4096 ║ ║ ║ ║
║ ║ sparepool: - ║ ║ ║ ║
║ ║ active: False ║ ║ ║ ║
║ ║ config: True ║ ║ ║ ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost0 ║ size: 14302 GiB ║ None ║ ║ ║
║ ║ level: 6 ║ ║ ║ ║
║ ║ strip_size: 128 ║ ║ ║ ║
║ ║ block_size: 4096 ║ ║ ║ ║
║ ║ sparepool: - ║ ║ ║ ║
║ ║ active: False ║ ║ ║ ║
║ ║ config: True ║ ║ ║ ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost1 ║ size: 14302 GiB ║ None ║ ║ ║
║ ║ level: 6 ║ ║ ║ ║
║ ║ strip_size: 128 ║ ║ ║ ║
║ ║ block_size: 4096 ║ ║ ║ ║
║ ║ sparepool: - ║ ║ ║ ║
║ ║ active: False ║ ║ ║ ║
║ ║ config: True ║ ║ ║ ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost2 ║ size: 14302 GiB ║ None ║ ║ ║
║ ║ level: 6 ║ ║ ║ ║
║ ║ strip_size: 128 ║ ║ ║ ║
║ ║ block_size: 4096 ║ ║ ║ ║
║ ║ sparepool: - ║ ║ ║ ║
║ ║ active: False ║ ║ ║ ║
║ ║ config: True ║ ║ ║ ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost3 ║ size: 14302 GiB ║ None ║ ║ ║
║ ║ level: 6 ║ ║ ║ ║
║ ║ strip_size: 128 ║ ║ ║ ║
║ ║ block_size: 4096 ║ ║ ║ ║
║ ║ sparepool: - ║ ║ ║ ║
║ ║ active: False ║ ║ ║ ║
║ ║ config: True ║ ║ ║ ║
╚════════╩══════════════════╩═══════╩═════════╩══════╝
After RAID creation, there's no need to wait for RAID initialization to finish. The RAIDs are available for use immediately after creation, albeit with slightly reduced performance.
For optimal performance, it's better to dedicate disjoint CPU core sets to each RAID. Currently, all RAIDs are active on node26, so the sets overlap, but once the RAIDs are spread between node26 and node27, they will not.
node26# xicli raid modify -n r_mdt0 -ca 0-7 -se 1
node26# xicli raid modify -n r_ost0 -ca 8-67 -se 1
node26# xicli raid modify -n r_ost1 -ca 8-67 -se 1 # will be running at node27
node26# xicli raid modify -n r_ost2 -ca 68-127 -se 1
node26# xicli raid modify -n r_ost3 -ca 68-127 -se 1 # will be running at node27
Lustre setup
LNET configuration
To make Lustre work, we need to configure the Lustre networking stack (LNET).
Run on both nodes:
# systemctl start lnet
# systemctl enable lnet
# lnetctl net add --net o2ib0 --if ib0
# lnetctl net add --net o2ib0 --if ib3
Check the configuration:
# lnetctl net show -v
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
statistics:
send_count: 289478
recv_count: 289474
drop_count: 4
tunables:
peer_timeout: 0
peer_credits: 0
peer_buffer_credits: 0
credits: 0
lnd tunables:
dev cpt: 0
CPT: "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]"
- net type: o2ib
local NI(s):
- nid: 100.100.100.26@o2ib
status: down
interfaces:
0: ib0
statistics:
send_count: 213607
recv_count: 213604
drop_count: 7
tunables:
peer_timeout: 180
peer_credits: 8
peer_buffer_credits: 0
credits: 256
lnd tunables:
peercredits_hiw: 4
map_on_demand: 1
concurrent_sends: 8
fmr_pool_size: 512
fmr_flush_trigger: 384
fmr_cache: 1
ntx: 512
conns_per_peer: 1
dev cpt: -1
CPT: "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]"
- nid: 100.100.100.126@o2ib
status: up
interfaces:
0: ib3
statistics:
send_count: 4
recv_count: 4
drop_count: 0
tunables:
peer_timeout: 180
peer_credits: 8
peer_buffer_credits: 0
credits: 256
lnd tunables:
peercredits_hiw: 4
map_on_demand: 1
concurrent_sends: 8
fmr_pool_size: 512
fmr_flush_trigger: 384
fmr_cache: 1
ntx: 512
conns_per_peer: 1
dev cpt: -1
CPT: "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]"
Please pay attention to the LNET NIDs of the hosts. We will use 100.100.100.26@o2ib for node26 and 100.100.100.27@o2ib for node27 as the primary NIDs.
Save the LNET configuration:
# lnetctl export -b > /etc/lnet.conf
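Before moving on, you can verify LNET connectivity between the nodes (shown from node26; the reverse direction from node27 is analogous):
node26# lnetctl ping 100.100.100.27@o2ib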
LDISKFS filesystems creation
At this step, we format the RAIDs into LDISKFS filesystem format. During formatting, we specify the target type (--mgs/--mdt/--ost), unique number of the specific target type (--index), Lustre filesystem name (--fsname), NIDs where each target filesystem could be mounted and where the corresponding servers will get started automatically (--servicenode), and NIDs where MGS could be found (--mgsnode).
Since our RAIDs will work within a cluster, we specify NIDs of both server nodes as the NIDs where the target filesystem could be mounted and where the corresponding servers will get started automatically for each target filesystem. For the same reason, we specify two NIDs where other servers should look for the MGS service.
node26# mkfs.lustre --mgs --mdt --fsname=lustre0 --index=0 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_mdt0
node26# mkfs.lustre --ost --fsname=lustre0 --index=0 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost0
node26# mkfs.lustre --ost --fsname=lustre0 --index=1 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost1
node26# mkfs.lustre --ost --fsname=lustre0 --index=2 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost2
node26# mkfs.lustre --ost --fsname=lustre0 --index=3 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost3
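If you want to double-check the parameters written to a target, tunefs.lustre can print them without modifying anything (the MDT is shown as an example):
node26# tunefs.lustre --dryrun /dev/xi_r_mdt0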
More details can be found in the Lustre documentation.
Cluster resources creation
Please check the table below. It describes the cluster resource configuration we are going to create.
RAID name | HA cluster RAID resource name | Lustre target | Mountpoint | HA cluster filesystem resource name | Preferred cluster node |
---|---|---|---|---|---|
r_mdt0 | rr_mdt0 | MGT + MDT index=0 | /lustre_t/mdt0 | fsr_mdt0 | node26 |
r_ost0 | rr_ost0 | OST index=0 | /lustre_t/ost0 | fsr_ost0 | node26 |
r_ost1 | rr_ost1 | OST index=1 | /lustre_t/ost1 | fsr_ost1 | node27 |
r_ost2 | rr_ost2 | OST index=2 | /lustre_t/ost2 | fsr_ost2 | node26 |
r_ost3 | rr_ost3 | OST index=3 | /lustre_t/ost3 | fsr_ost3 | node27 |
To create Pacemaker resources for xiRAID Classic RAIDs, we will use the xiRAID resource agent, which was installed with xiRAID Classic and made available to Pacemaker in one of the previous steps.
To cluster Lustre services, there are two options, as currently two resource agents are capable of managing Lustre OSDs:
- ocf