High-Performance, Highly Available Lustre Solution with xiRAID 4.1 on Dual-Node Shared NVMe
This comprehensive guide demonstrates how to create a robust, high-performance Lustre file system using xiRAID Classic 4.1 and Pacemaker on an SBB platform. We'll walk through the entire process, from system layout and hardware configuration to software installation, cluster setup, and performance tuning. By leveraging dual-ported NVMe drives and advanced clustering techniques, we'll achieve a highly available storage solution capable of delivering impressive read and write speeds. Whether you're building a new Lustre installation or looking to expand an existing one, this article provides a detailed roadmap for creating a cutting-edge, fault-tolerant parallel file system suitable for demanding high-performance computing environments.
System layout
xiRAID Classic 4.1 supports integration of RAIDs into Pacemaker-based HA-clusters. This capability allows users who need to cluster their services to benefit from xiRAID Classic's performance and reliability.
This article describes using an NVMe SBB system (a single box with two x86-64 servers and a set of shared NVMe drives) as a basic Lustre parallel filesystem HA-cluster with data placed on clustered RAIDs based on xiRAID Classic 4.1.
This article will familiarize you with how to deploy xiRAID Classic for a real-life task.
Lustre server SBB Platform
We will use Viking VDS2249R as the SBB platform. The configuration details are presented in the table below.
Viking VDS2249R
 | Node 0 | Node 1 |
---|---|---|
Hostname | node26 | node27 |
CPU | AMD EPYC 7713P 64-Core | AMD EPYC 7713P 64-Core |
Memory | 256GB | 256GB |
OS drives | 2 x Samsung SSD 970 EVO Plus 250GB mirrored | 2 x Samsung SSD 970 EVO Plus 250GB mirrored |
OS | Rocky Linux 8.9 | Rocky Linux 8.9 |
IPMI address | 192.168.64.106 | 192.168.67.23 |
IPMI login | admin | admin |
IPMI password | admin | admin |
Management NIC | enp194s0f0: 192.168.65.26/24 | enp194s0f0: 192.168.65.27/24 |
Cluster Heartbeat NIC | enp194s0f1: 10.10.10.1 | enp194s0f1: 10.10.10.2 |
Infiniband LNET HDR | ib0: 100.100.100.26, ib3: 100.100.100.126 | ib0: 100.100.100.27, ib3: 100.100.100.127 |
NVMes | 24 x Kioxia CM6-R 3.84TB KCM61RUL3T84 | 24 x Kioxia CM6-R 3.84TB KCM61RUL3T84 |
System configuration and tuning
Before software installation and configuration, we need to prepare the platform to provide optimal performance.
Performance tuning
Apply the accelerator-performance tuned profile on both nodes:
# tuned-adm profile accelerator-performance
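You can confirm afterwards that the profile is active (it should report accelerator-performance):
# tuned-adm active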
Network configuration
Check that all IP addresses are resolvable from both hosts. In our case, we will use resolving via the hosts file, so we have the following content in /etc/hosts on both nodes:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.65.26 node26
192.168.65.27 node27
10.10.10.1 node26-ic
10.10.10.2 node27-ic
192.168.64.50 node26-ipmi
192.168.64.76 node27-ipmi
100.100.100.26 node26-ib
100.100.100.27 node27-ib
Policy-based routing setup
We use a multirail configuration on the servers: the two IB interfaces on each server are configured in the same IPv4 network. To make the Linux IP stack work properly in this configuration, we need to set up policy-based routing for these interfaces on both servers.
node26 setup:
node26# nmcli connection modify ib0 ipv4.route-metric 100
node26# nmcli connection modify ib3 ipv4.route-metric 101
node26# nmcli connection modify ib0 ipv4.routes "100.100.100.0/24 src=100.100.100.26 table=100"
node26# nmcli connection modify ib0 ipv4.routing-rules "priority 101 from 100.100.100.26 table 100"
node26# nmcli connection modify ib3 ipv4.routes "100.100.100.0/24 src=100.100.100.126 table=200"
node26# nmcli connection modify ib3 ipv4.routing-rules "priority 102 from 100.100.100.126 table 200"
node26# nmcli connection up ib0
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/6)
node26# nmcli connection up ib3
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/7)
node27 setup:
node27# nmcli connection modify ib0 ipv4.route-metric 100
node27# nmcli connection modify ib3 ipv4.route-metric 101
node27# nmcli connection modify ib0 ipv4.routes "100.100.100.0/24 src=100.100.100.27 table=100"
node27# nmcli connection modify ib0 ipv4.routing-rules "priority 101 from 100.100.100.27 table 100"
node27# nmcli connection modify ib3 ipv4.routes "100.100.100.0/24 src=100.100.100.127 table=200"
node27# nmcli connection modify ib3 ipv4.routing-rules "priority 102 from 100.100.100.127 table 200"
node27# nmcli connection up ib0
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/6)
node27# nmcli connection up ib3
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/7)
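To verify that the policy rules and per-interface routing tables are in place, you can inspect them with iproute2 (shown for node26; node27 is analogous):
node26# ip rule show
node26# ip route show table 100
node26# ip route show table 200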
NVMe drives setup
In the SBB system, we have 24 Kioxia CM6-R 3.84TB KCM61RUL3T84 drives. They are PCIe 4.0, dual-ported, read-intensive drives with 1DWPD endurance. A single drive's performance can theoretically reach up to 6.9GB/s for sequential read and 4.2GB/s for sequential write (according to the vendor specification).
In our setup, we plan to create a simple Lustre installation with sufficient performance. However, since each NVMe in the SBB system is connected to each server with only 2 PCIe lanes, the NVMe drives' performance will be limited. To overcome this limitation, we will create 2 namespaces on each NVMe drive, which will be used for the Lustre OST RAIDs, and create separate RAIDs from the first NVMe namespaces and the second NVMe namespaces. By configuring our cluster software to use the RAIDs made from the first namespaces (and their Lustre servers) on Lustre node #0 and the RAIDs created from the second namespaces on node #1, we will be able to utilize all four PCIe lanes for each NVMe used to store OST data, as Lustre itself will distribute the workload among all OSTs.
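You can confirm the negotiated per-port link width via sysfs before planning the layout (nvme4 is used as an example here; on this platform it is expected to report a width of 2):
# cat /sys/class/nvme/nvme4/device/current_link_width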
Since we are deploying a simple Lustre installation, we will use a simple filesystem scheme with just one metadata server. As we will have only one metadata server, we will need only one RAID for the metadata. Because of this, we will not create two namespaces on the drives used for the MDT RAID.
Here is how the NVMe drive configuration looks initially:
# nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 21G0A046T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme1n1 21G0A04BT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme10n1 21G0A04ET2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme11n1 21G0A045T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme12n1 S59BNM0R702322Z Samsung SSD 970 EVO Plus 250GB 1 8.67 GB / 250.06 GB 512 B + 0 B 2B2QEXM7
/dev/nvme13n1 21G0A04KT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme14n1 21G0A047T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme15n1 21G0A04CT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme16n1 11U0A00KT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme17n1 21G0A04JT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme18n1 21G0A048T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme19n1 S59BNM0R702439A Samsung SSD 970 EVO Plus 250GB 1 208.90 kB / 250.06 GB 512 B + 0 B 2B2QEXM7
/dev/nvme2n1 21G0A041T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme20n1 21G0A03TT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme21n1 21G0A04FT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme22n1 21G0A03ZT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme23n1 21G0A04DT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme24n1 21G0A03VT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme25n1 21G0A044T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme3n1 21G0A04GT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme4n1 21G0A042T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme5n1 21G0A04HT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme6n1 21G0A049T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme7n1 21G0A043T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme8n1 21G0A04AT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme9n1 21G0A03XT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
The Samsung drives are used for the operating system installation.
Let's reserve the /dev/nvme0 and /dev/nvme1 drives for the metadata RAID1. Currently, xiRAID does not support spare pools in a cluster configuration, but having a spare drive is really useful for quick manual drive replacement. So, let's also reserve /dev/nvme3 as a spare for the RAID1 and split all other KCM61RUL3T84 drives into 2 namespaces.
Let's take /dev/nvme4 as an example. All other drives will be split in exactly the same way.
Check the maximum possible size of the drive to be sure:
# nvme id-ctrl /dev/nvme4 | grep -i tnvmcap
tnvmcap : 3840755982336
Check the maximum number of namespaces supported by the drive:
# nvme id-ctrl /dev/nvme4 | grep ^nn
nn : 64
Check the controller ID used for the drive connection on both servers (the IDs will differ):
node27# nvme id-ctrl /dev/nvme4 | grep ^cntlid
cntlid : 0x1
node26# nvme id-ctrl /dev/nvme4 | grep ^cntlid
cntlid : 0x2
We need to calculate the size of the namespaces we are going to create. The real size of the drive in 4K blocks is:
3840755982336/4096=937684566
So, each namespace size in 4K blocks will be:
937684566/2=468842283
In fact, it is not possible to create 2 namespaces of exactly this size because of the NVMe internal architecture. So, we will create namespaces of 468700000 blocks.
If you are building a system for write-intensive tasks, we recommend using write-intensive drives with 3DWPD endurance. If that is not possible and you have to use read-optimized drives, consider leaving some space (10-25%) of the NVMe volume unallocated by namespaces. In many cases, this helps turn the NVMe behavior in terms of write performance degradation closer to that of write-intensive drives.
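For example, if you decided to leave 20% of this drive unallocated, each of the two namespaces would be about 937684566 × 0.8 / 2 ≈ 375,073,826 blocks, rounded down in the same way as above.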
As a first step, remove the existing namespace on one of the nodes:
node26# nvme delete-ns /dev/nvme4 -n 1
After that, create namespaces on the same node:
node26# nvme create-ns /dev/nvme4 --nsze=468700000 --ncap=468700000 -b=4096 --dps=0 -m 1
create-ns: Success, created nsid:1
node26# nvme create-ns /dev/nvme4 --nsze=468700000 --ncap=468700000 -b=4096 --dps=0 -m 1
create-ns: Success, created nsid:2
node26# nvme attach-ns /dev/nvme4 --namespace-id=1 -controllers=0x2
attach-ns: Success, nsid:1
node26# nvme attach-ns /dev/nvme4 --namespace-id=2 -controllers=0x2
attach-ns: Success, nsid:2
Attach the namespaces on the second node with the proper controller:
node27# nvme attach-ns /dev/nvme4 --namespace-id=1 -controllers=0x1
attach-ns: Success, nsid:1
node27# nvme attach-ns /dev/nvme4 --namespace-id=2 -controllers=0x1
attach-ns: Success, nsid:2
It looks like this on both nodes:
# nvme list |grep nvme4
/dev/nvme4n1 21G0A042T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme4n2 21G0A042T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
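Repeating this sequence by hand for every remaining data drive is error-prone, so it can be scripted. Below is a minimal sketch for node26, assuming the same namespace size for every drive and that the local controller ID is 0x2 for all of them; verify cntlid per drive first, and remember to attach both namespaces on node27 with its own controller ID afterwards:
node26# for d in 5 6 7 8 9 10 11 13 14 15 16 17 18 20 21 22 23 24 25; do
  nvme delete-ns /dev/nvme${d} -n 1
  nvme create-ns /dev/nvme${d} --nsze=468700000 --ncap=468700000 -b 4096 --dps=0 -m 1
  nvme create-ns /dev/nvme${d} --nsze=468700000 --ncap=468700000 -b 4096 --dps=0 -m 1
  nvme attach-ns /dev/nvme${d} --namespace-id=1 --controllers=0x2
  nvme attach-ns /dev/nvme${d} --namespace-id=2 --controllers=0x2
done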
All other drives were split in the same way. Here is the resulting configuration:
# nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 21G0A046T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme1n1 21G0A04BT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme10n1 21G0A04ET2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme10n2 21G0A04ET2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme11n1 21G0A045T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme11n2 21G0A045T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme12n1 S59BNM0R702322Z Samsung SSD 970 EVO Plus 250GB 1 8.67 GB / 250.06 GB 512 B + 0 B 2B2QEXM7
/dev/nvme13n1 21G0A04KT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme13n2 21G0A04KT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme14n1 21G0A047T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme14n2 21G0A047T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme15n1 21G0A04CT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme15n2 21G0A04CT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme16n1 11U0A00KT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme16n2 11U0A00KT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme17n1 21G0A04JT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme17n2 21G0A04JT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme18n1 21G0A048T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme18n2 21G0A048T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme19n1 S59BNM0R702439A Samsung SSD 970 EVO Plus 250GB 1 208.90 kB / 250.06 GB 512 B + 0 B 2B2QEXM7
/dev/nvme2n1 21G0A041T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme20n1 21G0A03TT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme20n2 21G0A03TT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme21n1 21G0A04FT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme21n2 21G0A04FT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme22n1 21G0A03ZT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme22n2 21G0A03ZT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme23n1 21G0A04DT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme23n2 21G0A04DT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme24n1 21G0A03VT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme24n2 21G0A03VT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme25n1 21G0A044T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme25n2 21G0A044T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme3n1 21G0A04GT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme4n1 21G0A042T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme4n2 21G0A042T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme5n1 21G0A04HT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme5n2 21G0A04HT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme6n1 21G0A049T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme6n2 21G0A049T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme7n1 21G0A043T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme7n2 21G0A043T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme8n1 21G0A04AT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme8n2 21G0A04AT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme9n1 21G0A03XT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme9n2 21G0A03XT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
Software components installation
Lustre installation
Create the Lustre repo file /etc/yum.repos.d/lustre-repo.repo:
[lustre-server]
name=lustre-server
baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el8.9/server
# exclude=*debuginfo*
gpgcheck=0
[lustre-client]
name=lustre-client
baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el8.9/client
# exclude=*debuginfo*
gpgcheck=0
[e2fsprogs-wc]
name=e2fsprogs-wc
baseurl=https://downloads.whamcloud.com/public/e2fsprogs/latest/el8
# exclude=*debuginfo*
gpgcheck=0
Installing e2fs tools:
yum --nogpgcheck --disablerepo=* --enablerepo=e2fsprogs-wc install e2fsprogs
Installing Lustre kernel:
yum --nogpgcheck --disablerepo=baseos,extras,updates --enablerepo=lustre-server install kernel kernel-devel kernel-headers
Reboot to the new kernel:
reboot
Check the kernel version after reboot:
node26# uname -a
Linux node26 4.18.0-513.9.1.el8_lustre.x86_64 #1 SMP Sat Dec 23 05:23:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Installing Lustre server components:
yum --nogpgcheck --enablerepo=lustre-server,ha install kmod-lustre kmod-lustre-osd-ldiskfs lustre-osd-ldiskfs-mount lustre lustre-resource-agents
Check Lustre module load:
[root@node26 ~]# modprobe -v lustre
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/net/libcfs.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/net/lnet.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/obdclass.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/ptlrpc.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/fld.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/fid.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/osc.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/lov.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/mdc.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/lmv.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/lustre.ko
Unload modules:
# lustre_rmmod
Installing xiRAID Classic 4.1
Installing xiRAID Classic 4.1 on both nodes from the repositories, following the Xinnor xiRAID 4.1.0 Installation Guide:
# yum install -y epel-release
# yum install https://pkg.xinnor.io/repository/Repository/xiraid/el/8/kver-4.18/xiraid-repo-1.1.0-446.kver.4.18.noarch.rpm
# yum install xiraid-release
Pacemaker installation
Run the following steps on both nodes.
Enable the cluster repo:
# yum config-manager --set-enabled ha appstream
Installing the cluster packages:
# yum install pcs pacemaker psmisc policycoreutils-python3
Csync2 installation
Since we are installing the system on Rocky Linux 8, there is no need to compile Csync2 from sources ourselves. Just install the Csync2 package from the Xinnor repository on both nodes:
# yum install csync2
NTP server installation
# yum install chrony
HA cluster setup
Time synchronisation setup
Modify the /etc/chrony.conf file if needed to point to the proper NTP servers. In this setup, we will use the default settings.
# systemctl enable --now chronyd.service
Verify that time synchronisation works properly by running chronyc tracking.
Pacemaker cluster creation
In this chapter, the cluster configuration is described. In our cluster, we use a dedicated network as the cluster interconnect. This network is physically a single direct connection (a dedicated Ethernet cable without any switch) between the enp194s0f1 interfaces on the servers. The cluster interconnect is a critical component of any HA cluster, so its reliability should be high. A Pacemaker-based cluster can be configured with two interconnect networks for improved reliability through redundancy. While we will use a single interconnect network in this configuration, please consider a dual network interconnect for your own projects if needed.
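For reference, a redundant dual-link interconnect can be defined at cluster creation time by passing two addresses per node to pcs; the 10.10.11.x addresses below are hypothetical and only illustrate the syntax:
node26# pcs cluster setup lustrebox0 node26-ic addr=10.10.10.1 addr=10.10.11.1 node27-ic addr=10.10.10.2 addr=10.10.11.2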
Set the firewall to allow pacemaker software to work (on both nodes):
# firewall-cmd --add-service=high-availability
# firewall-cmd --permanent --add-service=high-availability
Set the same password for the hacluster user at both nodes:
# passwd hacluster
Start the cluster software at both nodes:
# systemctl start pcsd.service
# systemctl enable pcsd.service
Authenticate the cluster nodes from one node by their interconnect interfaces:
node26# pcs host auth node26-ic node27-ic -u hacluster
Password:
node26-ic: Authorized
node27-ic: Authorized
Create and start the cluster (start at one node):
node26# pcs cluster setup lustrebox0 node26-ic node27-ic
No addresses specified for host 'node26-ic', using 'node26-ic'
No addresses specified for host 'node27-ic', using 'node27-ic'
Destroying cluster on hosts: 'node26-ic', 'node27-ic'...
node26-ic: Successfully destroyed cluster
node27-ic: Successfully destroyed cluster
Requesting remove 'pcsd settings' from 'node26-ic', 'node27-ic'
node26-ic: successful removal of the file 'pcsd settings'
node27-ic: successful removal of the file 'pcsd settings'
Sending 'corosync authkey', 'pacemaker authkey' to 'node26-ic', 'node27-ic'
node26-ic: successful distribution of the file 'corosync authkey'
node26-ic: successful distribution of the file 'pacemaker authkey'
node27-ic: successful distribution of the file 'corosync authkey'
node27-ic: successful distribution of the file 'pacemaker authkey'
Sending 'corosync.conf' to 'node26-ic', 'node27-ic'
node26-ic: successful distribution of the file 'corosync.conf'
node27-ic: successful distribution of the file 'corosync.conf'
Cluster has been successfully set up.
node26# pcs cluster start --all
node26-ic: Starting Cluster...
node27-ic: Starting Cluster...
Check the current cluster status:
node26# pcs status
Cluster name: lustrebox0
WARNINGS:
No stonith devices and stonith-enabled is not false
Cluster Summary:
* Stack: corosync (Pacemaker is running)
* Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
* Last updated: Fri Jul 12 20:55:53 2024 on node26-ic
* Last change: Fri Jul 12 20:55:12 2024 by hacluster via hacluster on node27-ic
* 2 nodes configured
* 0 resource instances configured
Node List:
* Online: [ node26-ic node27-ic ]
Full List of Resources:
* No resources
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
Fencing setup
It's very important to have properly configured and working fencing (STONITH) in any HA cluster that works with shared storage devices. In our case, the shared devices are all the NVMe namespaces we created earlier. The fencing (STONITH) design should be developed and implemented by the cluster administrator with the system's capabilities and architecture in mind. In this system, we will use fencing via IPMI. In any case, when designing and deploying your own cluster, choose the fencing configuration yourself, considering all the possibilities, limitations, and risks.
First of all, let's check the list of installed fencing agents in our system:
node26# pcs stonith list
fence_watchdog - Dummy watchdog fence agent
So, the IPMI fencing agent is not installed on our cluster nodes. To install it, run the following command (on both nodes):
# yum install fence-agents-ipmilan
You may check the IPMI fencing agent options description by running the following command:
pcs stonith describe fence_ipmilan
Adding the fencing resources:
node26# pcs stonith create node27.stonith fence_ipmilan ip="192.168.67.23" auth=password password="admin" username="admin" method="onoff" lanplus=true pcmk_host_list="node27-ic" pcmk_host_check=static-list op monitor interval=10s
node26# pcs stonith create node26.stonith fence_ipmilan ip="192.168.64.106" auth=password password="admin" username="admin" method="onoff" lanplus=true pcmk_host_list="node26-ic" pcmk_host_check=static-list op monitor interval=10s
Prevent each STONITH resource from starting on the node it is meant to fence:
node26# pcs constraint location node27.stonith avoids node27-ic=INFINITY
node26# pcs constraint location node26.stonith avoids node26-ic=INFINITY
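Once both STONITH resources are running, it is worth testing fencing before any services are placed on the cluster, for example by fencing one node from the other (note that this will power-cycle the target node):
node26# pcs stonith fence node27-ic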
Csync2 configuration
Configure firewall to allow Csync2 to work (run at both nodes):
# firewall-cmd --add-port=30865/tcp
# firewall-cmd --permanent --add-port=30865/tcp
Create the Csync2 configuration file /usr/local/etc/csync2.cfg with the following content at node26 only:
nossl * *;
group csxiha {
  host node26;
  host node27;
  key /usr/local/etc/csync2.key_ha;
  include /etc/xiraid/raids;
}
Generate the key:
node26# csync2 -k /usr/local/etc/csync2.key_ha
Copy the config and the key file to the second node:
node26# scp /usr/local/etc/csync2.cfg /usr/local/etc/csync2.key_ha node27:/usr/local/etc/
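At this point, you can run an initial synchronisation manually to confirm that the nodes can reach each other over TCP port 30865:
node26# /usr/local/sbin/csync2 -xv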
To schedule Csync2 synchronisation once per minute, run crontab -e on both nodes and add the following record:
* * * * * /usr/local/sbin/csync2 -x
In addition, for asynchronous synchronisation, create a synchronisation script by running the following command (repeat the script creation procedure on both nodes):
# vi /etc/xiraid/config_update_handler.sh
Fill the created script with the following content:
#!/usr/bin/bash
/usr/local/sbin/csync2 -xv
Save the file.
After that run the following command to set correct permissions for the script file:
# chmod +x /etc/xiraid/config_update_handler.sh
xiRAID Configuration for cluster setup
Disable RAID autostart to prevent RAIDs from being activated by xiRAID itself during a node boot. In a cluster configuration, RAIDs have to be activated by Pacemaker via cluster resources. Run the following command on both nodes:
# xicli settings cluster modify --raid_autostart 0
Make the xiRAID Classic 4.1 resource agent visible to Pacemaker (run this command sequence on both nodes):
# mkdir -p /usr/lib/ocf/resource.d/xraid
# ln -s /etc/xraid/agents/raid /usr/lib/ocf/resource.d/xraid/raid
xiRAID RAIDs creation
To be able to create RAIDs, we need to install licenses for xiRAID Classic 4.1 on both hosts first. The licenses should be received from Xinnor. To generate the licenses, Xinnor requires the output of the xicli license show command (from both nodes).
node26# xicli license show
Kernel version: 4.18.0-513.9.1.el8_lustre.x86_64
hwkey: B8828A09E09E8F48
license_key: null
version: 0
crypto_version: 0
created: 0-0-0
expired: 0-0-0
disks: 4
levels: 0
type: nvme
disks_in_use: 2
status: trial
The license files received from Xinnor need to be installed with the xicli license update -p command (once again, on both nodes):
node26# xicli license show
Kernel version: 4.18.0-513.9.1.el8_lustre.x86_64
hwkey: B8828A09E09E8F48
license_key: 0F5A4B87A0FC6DB7544EA446B1B4AF5F34A08169C44E5FD119CE6D2352E202677768ECC78F56B583DABE11698BBC800EC96E556AA63E576DAB838010247678E7E3B95C7C4E3F592672D06C597045EAAD8A42CDE38C363C533E98411078967C38224C9274B862D45D4E6DED70B7E34602C80B60CBA7FDE93316438AFDCD7CBD23
version: 1
crypto_version: 1
created: 2024-7-16
expired: 2024-9-30
disks: 600
levels: 70
type: nvme
disks_in_use: 2
status: valid
Since we plan to deploy a small Lustre installation, combining MGT and MDT on the same target device is absolutely OK. But for medium or large Lustre installations, it's better to use a separate target (and RAID) for MGT.
Here is the list of the RAIDs we need to create.
RAID Name | RAID Level | Number of devices | Strip size | Drive list | Lustre target |
---|---|---|---|---|---|
r_mdt0 | 1 | 2 | 16 | /dev/nvme0n1 /dev/nvme1n1 | MGT + MDT index=0 |
r_ost0 | 6 | 10 | 128 | /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme13n1 /dev/nvme14n1 | OST index=0 |
r_ost1 | 6 | 10 | 128 | /dev/nvme4n2 /dev/nvme5n2 /dev/nvme6n2 /dev/nvme7n2 /dev/nvme8n2 /dev/nvme9n2 /dev/nvme10n2 /dev/nvme11n2 /dev/nvme13n2 /dev/nvme14n2 | OST index=1 |
r_ost2 | 6 | 10 | 128 | /dev/nvme15n1 /dev/nvme16n1 /dev/nvme17n1 /dev/nvme18n1 /dev/nvme20n1 /dev/nvme21n1 /dev/nvme22n1 /dev/nvme23n1 /dev/nvme24n1 /dev/nvme25n1 | OST index=2 |
r_ost3 | 6 | 10 | 128 | /dev/nvme15n2 /dev/nvme16n2 /dev/nvme17n2 /dev/nvme18n2 /dev/nvme20n2 /dev/nvme21n2 /dev/nvme22n2 /dev/nvme23n2 /dev/nvme24n2 /dev/nvme25n2 | OST index=3 |
Creating all the RAIDs at the first node:
node26# xicli raid create -n r_mdt0 -l 1 -d /dev/nvme0n1 /dev/nvme1n1
node26# xicli raid create -n r_ost0 -l 6 -ss 128 -d /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme13n1 /dev/nvme14n1
node26# xicli raid create -n r_ost1 -l 6 -ss 128 -d /dev/nvme4n2 /dev/nvme5n2 /dev/nvme6n2 /dev/nvme7n2 /dev/nvme8n2 /dev/nvme9n2 /dev/nvme10n2 /dev/nvme11n2 /dev/nvme13n2 /dev/nvme14n2
node26# xicli raid create -n r_ost2 -l 6 -ss 128 -d /dev/nvme15n1 /dev/nvme16n1 /dev/nvme17n1 /dev/nvme18n1 /dev/nvme20n1 /dev/nvme21n1 /dev/nvme22n1 /dev/nvme23n1 /dev/nvme24n1 /dev/nvme25n1
node26# xicli raid create -n r_ost3 -l 6 -ss 128 -d /dev/nvme15n2 /dev/nvme16n2 /dev/nvme17n2 /dev/nvme18n2 /dev/nvme20n2 /dev/nvme21n2 /dev/nvme22n2 /dev/nvme23n2 /dev/nvme24n2 /dev/nvme25n2
At this stage, there is no need to wait for the RAID initialization to finish; it can safely be left running in the background.
Checking the RAID statuses at the first node:
node26# xicli raid show
╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦═══════════════════╗
║ name ║ static ║ state ║ devices ║ info ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_mdt0 ║ size: 3576 GiB ║ online ║ 0 /dev/nvme0n1 online ║ ║
║ ║ level: 1 ║ initialized ║ 1 /dev/nvme1n1 online ║ ║
║ ║ strip_size: 16 ║ ║ ║ ║
║ ║ block_size: 4096 ║ ║ ║ ║
║ ║ sparepool: - ║ ║ ║ ║
║ ║ active: True ║ ║ ║ ║
║ ║ config: True ║ ║ ║ ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost0 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme4n1 online ║ init_progress: 11 ║
║ ║ level: 6 ║ initing ║ 1 /dev/nvme5n1 online ║ ║
║ ║ strip_size: 128 ║ ║ 2 /dev/nvme6n1 online ║ ║
║ ║ block_size: 4096 ║ ║ 3 /dev/nvme7n1 online ║ ║
║ ║ sparepool: - ║ ║ 4 /dev/nvme8n1 online ║ ║
║ ║ active: True ║ ║ 5 /dev/nvme9n1 online ║ ║
║ ║ config: True ║ ║ 6 /dev/nvme10n1 online ║ ║
║ ║ ║ ║ 7 /dev/nvme11n1 online ║ ║
║ ║ ║ ║ 8 /dev/nvme13n1 online ║ ║
║ ║ ║ ║ 9 /dev/nvme14n1 online ║ ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost1 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme4n2 online ║ init_progress: 7 ║
║ ║ level: 6 ║ initing ║ 1 /dev/nvme5n2 online ║ ║
║ ║ strip_size: 128 ║ ║ 2 /dev/nvme6n2 online ║ ║
║ ║ block_size: 4096 ║ ║ 3 /dev/nvme7n2 online ║ ║
║ ║ sparepool: - ║ ║ 4 /dev/nvme8n2 online ║ ║
║ ║ active: True ║ ║ 5 /dev/nvme9n2 online ║ ║
║ ║ config: True ║ ║ 6 /dev/nvme10n2 online ║ ║
║ ║ ║ ║ 7 /dev/nvme11n2 online ║ ║
║ ║ ║ ║ 8 /dev/nvme13n2 online ║ ║
║ ║ ║ ║ 9 /dev/nvme14n2 online ║ ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost2 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme15n1 online ║ init_progress: 5 ║
║ ║ level: 6 ║ initing ║ 1 /dev/nvme16n1 online ║ ║
║ ║ strip_size: 128 ║ ║ 2 /dev/nvme17n1 online ║ ║
║ ║ block_size: 4096 ║ ║ 3 /dev/nvme18n1 online ║ ║
║ ║ sparepool: - ║ ║ 4 /dev/nvme20n1 online ║ ║
║ ║ active: True ║ ║ 5 /dev/nvme21n1 online ║ ║
║ ║ config: True ║ ║ 6 /dev/nvme22n1 online ║ ║
║ ║ ║ ║ 7 /dev/nvme23n1 online ║ ║
║ ║ ║ ║ 8 /dev/nvme24n1 online ║ ║
║ ║ ║ ║ 9 /dev/nvme25n1 online ║ ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost3 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme15n2 online ║ init_progress: 2 ║
║ ║ level: 6 ║ initing ║ 1 /dev/nvme16n2 online ║ ║
║ ║ strip_size: 128 ║ ║ 2 /dev/nvme17n2 online ║ ║
║ ║ block_size: 4096 ║ ║ 3 /dev/nvme18n2 online ║ ║
║ ║ sparepool: - ║ ║ 4 /dev/nvme20n2 online ║ ║
║ ║ active: True ║ ║ 5 /dev/nvme21n2 online ║ ║
║ ║ config: True ║ ║ 6 /dev/nvme22n2 online ║ ║
║ ║ ║ ║ 7 /dev/nvme23n2 online ║ ║
║ ║ ║ ║ 8 /dev/nvme24n2 online ║ ║
║ ║ ║ ║ 9 /dev/nvme25n2 online ║ ║
╚════════╩══════════════════╩═════════════╩════════════════════════╩═══════════════════╝
Checking that the RAID configs were successfully replicated to the second node (please note that on the second node, the RAID status is None, which is expected in this case):
node27# xicli raid show
╔RAIDs═══╦══════════════════╦═══════╦═════════╦══════╗
║ name ║ static ║ state ║ devices ║ info ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_mdt0 ║ size: 3576 GiB ║ None ║ ║ ║
║ ║ level: 1 ║ ║ ║ ║
║ ║ strip_size: 16 ║ ║ ║ ║
║ ║ block_size: 4096 ║ ║ ║ ║
║ ║ sparepool: - ║ ║ ║ ║
║ ║ active: False ║ ║ ║ ║
║ ║ config: True ║ ║ ║ ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost0 ║ size: 14302 GiB ║ None ║ ║ ║
║ ║ level: 6 ║ ║ ║ ║
║ ║ strip_size: 128 ║ ║ ║ ║
║ ║ block_size: 4096 ║ ║ ║ ║
║ ║ sparepool: - ║ ║ ║ ║
║ ║ active: False ║ ║ ║ ║
║ ║ config: True ║ ║ ║ ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost1 ║ size: 14302 GiB ║ None ║ ║ ║
║ ║ level: 6 ║ ║ ║ ║
║ ║ strip_size: 128 ║ ║ ║ ║
║ ║ block_size: 4096 ║ ║ ║ ║
║ ║ sparepool: - ║ ║ ║ ║
║ ║ active: False ║ ║ ║ ║
║ ║ config: True ║ ║ ║ ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost2 ║ size: 14302 GiB ║ None ║ ║ ║
║ ║ level: 6 ║ ║ ║ ║
║ ║ strip_size: 128 ║ ║ ║ ║
║ ║ block_size: 4096 ║ ║ ║ ║
║ ║ sparepool: - ║ ║ ║ ║
║ ║ active: False ║ ║ ║ ║
║ ║ config: True ║ ║ ║ ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost3 ║ size: 14302 GiB ║ None ║ ║ ║
║ ║ level: 6 ║ ║ ║ ║
║ ║ strip_size: 128 ║ ║ ║ ║
║ ║ block_size: 4096 ║ ║ ║ ║
║ ║ sparepool: - ║ ║ ║ ║
║ ║ active: False ║ ║ ║ ║
║ ║ config: True ║ ║ ║ ║
╚════════╩══════════════════╩═══════╩═════════╩══════╝
After RAID creation, there's no need to wait for RAID initialization to finish. The RAIDs are available for use immediately after creation, albeit with slightly reduced performance.
For optimal performance, it's better to dedicate disjoint CPU core sets to each RAID. Currently, all RAIDs are active on node26, so the sets overlap, but once the RAIDs are spread between node26 and node27, they will not.
node26# xicli raid modify -n r_mdt0 -ca 0-7 -se 1
node26# xicli raid modify -n r_ost0 -ca 8-67 -se 1
node26# xicli raid modify -n r_ost1 -ca 8-67 -se 1 # will be running at node27
node26# xicli raid modify -n r_ost2 -ca 68-127 -se 1
node26# xicli raid modify -n r_ost3 -ca 68-127 -se 1 # will be running at node27
Lustre setup
LNET configuration
To make Lustre work, we need to configure the Lustre networking stack (LNET).
Run on both nodes:
# systemctl start lnet
# systemctl enable lnet
# lnetctl net add --net o2ib0 --if ib0
# lnetctl net add --net o2ib0 --if ib3
Check the configuration:
# lnetctl net show -v
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
statistics:
send_count: 289478
recv_count: 289474
drop_count: 4
tunables:
peer_timeout: 0
peer_credits: 0
peer_buffer_credits: 0
credits: 0
lnd tunables:
dev cpt: 0
CPT: "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]"
- net type: o2ib
local NI(s):
- nid: 100.100.100.26@o2ib
status: down
interfaces:
0: ib0
statistics:
send_count: 213607
recv_count: 213604
drop_count: 7
tunables:
peer_timeout: 180
peer_credits: 8
peer_buffer_credits: 0
credits: 256
lnd tunables:
peercredits_hiw: 4
map_on_demand: 1
concurrent_sends: 8
fmr_pool_size: 512
fmr_flush_trigger: 384
fmr_cache: 1
ntx: 512
conns_per_peer: 1
dev cpt: -1
CPT: "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]"
- nid: 100.100.100.126@o2ib
status: up
interfaces:
0: ib3
statistics:
send_count: 4
recv_count: 4
drop_count: 0
tunables:
peer_timeout: 180
peer_credits: 8
peer_buffer_credits: 0
credits: 256
lnd tunables:
peercredits_hiw: 4
map_on_demand: 1
concurrent_sends: 8
fmr_pool_size: 512
fmr_flush_trigger: 384
fmr_cache: 1
ntx: 512
conns_per_peer: 1
dev cpt: -1
CPT: "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]"
Please pay attention to the LNET NIDs of the hosts. We will use 100.100.100.26@o2ib for node26 and 100.100.100.27@o2ib for node27 as the primary NIDs.
Save the LNET configuration:
# lnetctl export -b > /etc/lnet.conf
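Before moving on, you can verify LNET connectivity between the nodes (shown from node26; the reverse direction from node27 is analogous):
node26# lnetctl ping 100.100.100.27@o2ib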
LDISKFS filesystems creation
At this step, we format the RAIDs into LDISKFS filesystem format. During formatting, we specify the target type (--mgs/--mdt/--ost), unique number of the specific target type (--index), Lustre filesystem name (--fsname), NIDs where each target filesystem could be mounted and where the corresponding servers will get started automatically (--servicenode), and NIDs where MGS could be found (--mgsnode).
Since our RAIDs will work within a cluster, we specify NIDs of both server nodes as the NIDs where the target filesystem could be mounted and where the corresponding servers will get started automatically for each target filesystem. For the same reason, we specify two NIDs where other servers should look for the MGS service.
node26# mkfs.lustre --mgs --mdt --fsname=lustre0 --index=0 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_mdt0
node26# mkfs.lustre --ost --fsname=lustre0 --index=0 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost0
node26# mkfs.lustre --ost --fsname=lustre0 --index=1 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost1
node26# mkfs.lustre --ost --fsname=lustre0 --index=2 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost2
node26# mkfs.lustre --ost --fsname=lustre0 --index=3 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost3
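If you want to double-check the parameters written to a target, tunefs.lustre can print them without modifying anything (the MDT is shown as an example):
node26# tunefs.lustre --dryrun /dev/xi_r_mdt0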
More details can be found in the Lustre documentation.
Cluster resources creation
Please check the table below. It describes the cluster resource configuration we are going to create.
RAID name | HA cluster RAID resource name | Lustre target | Mountpoint | HA cluster filesystem resource name | Preferred cluster node |
---|---|---|---|---|---|
r_mdt0 | rr_mdt0 | MGT + MDT index=0 | /lustre_t/mdt0 | fsr_mdt0 | node26 |
r_ost0 | rr_ost0 | OST index=0 | /lustre_t/ost0 | fsr_ost0 | node26 |
r_ost1 | rr_ost1 | OST index=1 | /lustre_t/ost1 | fsr_ost1 | node27 |
r_ost2 | rr_ost2 | OST index=2 | /lustre_t/ost2 | fsr_ost2 | node26 |
r_ost3 | rr_ost3 | OST index=3 | /lustre_t/ost3 | fsr_ost3 | node27 |
To create Pacemaker resources for xiRAID Classic RAIDs, we will use the xiRAID resource agent, which was installed with xiRAID Classic and made available to Pacemaker in one of the previous steps.
To cluster Lustre services, there are two options, as currently two resource agents are capable of managing Lustre OSDs:
- ocf