Lustre filesystem over ZFS
Introduction.
Scheme and equipment configurations.
The following layout was used:
One combined MGS/MDS/OSS server and five OSS servers.
Configuration of MGS/MDS/OSS server:
Proc Intel Xeon 56xx 2×2.4GHz
Mem 72GB*
Net 6x1Gbit/s
SSD 2x120GB
HDD RAID6+HS - 24x3TB disks
*The large amount of memory is because this node is also used for SMB and NFS exports.
Configuration of OSS server:
Proc Intel Xeon 56xx 2×2.4GHz
Mem 12GB
Net 4x1Gbit/s
HDD Adaptec RAID6+HS - 24x3TB disks
Network:
All servers are in one VLAN (there is no separate back-end or front-end network).
OS on all servers: CentOS 6.5
Preparing and tuning
A natural question is where the SSDs are plugged in if the chassis has only 24 hot-swap bays. They are connected to the motherboard and mounted inside the server (there was free space). Our production requirements for this solution allow the hardware to be powered off for 10 minutes. If your requirements are stricter, use only hot-swap drives. If your production must run 24/7, use a fault-tolerant solution.
* Install CentOS 6.5
* Update and install packages
yum --exclude=kernel* update -y
yum localinstall --nogpgcheck https://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
yum localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release.el6.noarch.rpm
yum install zfs strace sysstat man wget net-snmp openssh-clients ntp ntpdate tuned
Check that the zfs module was compiled (Lustre 2.5.3 is compatible with ZFS 0.6.3).
* Create a bond on each of the MGS/MDS/OSS and OSS servers:
bond0
BONDING_OPTS="miimon=100 mode=0"
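On CentOS 6 a bond is normally described by ifcfg files under /etc/sysconfig/network-scripts. The sketch below writes such files for bond0 and its slave interfaces into a directory of your choice. The function name, the slave interface names, the IP address and the netmask are assumptions for illustration; only the BONDING_OPTS line is taken from the setup above.

```shell
# Write ifcfg files for bond0 and its slave interfaces.
# usage: write_bond_cfg OUTDIR IPADDR NETMASK SLAVE...
write_bond_cfg() {
    outdir=$1; ipaddr=$2; netmask=$3
    shift 3
    # Master interface: carries the IP address and the bonding options.
    cat > "$outdir/ifcfg-bond0" <<EOF
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
IPADDR=$ipaddr
NETMASK=$netmask
BONDING_OPTS="miimon=100 mode=0"
EOF
    # One file per slave NIC, each enslaved to bond0.
    for nic in "$@"; do
        cat > "$outdir/ifcfg-$nic" <<EOF
DEVICE=$nic
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
EOF
    done
}
```

For example, write_bond_cfg /etc/sysconfig/network-scripts 192.168.5.182 255.255.255.0 eth0 eth1 followed by a network restart would bring up the bond, assuming eth0 and eth1 are the NICs to be bonded.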
* Disable SELinux
* Install the following packages:
yum install mc openssh-clients openssh-server net-snmp man sysstat rsync htop trafshow nslookup ntp
* Configure ntp
* Set identical uid:gid values on all servers
* Set scheduler parameters: tuned-adm profile latency-performance
* Tune sysctl.conf
# increase Linux TCP buffer limits
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
# increase default and maximum Linux TCP buffer sizes
net.ipv4.tcp_rmem = 4096 262144 8388608
net.ipv4.tcp_wmem = 4096 262144 8388608
# increase max backlog to avoid dropped packets
net.core.netdev_max_backlog=2500
net.ipv4.tcp_mem=8388608 8388608 8388608
# disable ECN
net.ipv4.tcp_ecn=0
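Once the settings are loaded (for example with sysctl -p), it can be useful to confirm that the running kernel really uses them. The following is a minimal sketch, not part of the original procedure: the function name check_sysctl is hypothetical, and it only assumes the standard sysctl -n call.

```shell
# Compare "key = value" lines of a sysctl.conf-style file with live values.
# Prints one line per mismatch and returns non-zero if any were found.
check_sysctl() {
    conf=$1
    mismatches=$(
        # Skip comment lines, keep only lines that contain a setting.
        grep -v '^[[:space:]]*#' "$conf" | grep '=' |
        while IFS='=' read -r key want; do
            # Unquoted echo trims and collapses whitespace (tabs, spaces).
            key=$(echo $key)
            want=$(echo $want)
            have=$(echo $(sysctl -n "$key" 2>/dev/null))
            [ "$want" = "$have" ] || echo "$key: want '$want', have '$have'"
        done
    )
    if [ -n "$mismatches" ]; then
        echo "$mismatches"
        return 1
    fi
    return 0
}
```

Running check_sysctl /etc/sysctl.conf on each server prints nothing when all values match.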
Installing Lustre
For servers:
Download utils:
wget -r https://downloads.hpdd.intel.com/public/e2fsprogs/1.42.9.wc1/el6/RPMS/x86_64/
and lustre:
wget -r https://downloads.hpdd.intel.com/public/lustre/lustre-2.5.3/el6/server/RPMS/x86_64/
Install utils:
Remove the stock utils: rpm -e --nodeps e2fsprogs e2fsprogs-libs libcom_err libss
Install the new ones:
rpm -ivh libcom_err-1.42.9.wc1-7.el6.x86_64.rpm
rpm -ivh e2fsprogs-libs-1.42.9.wc1-7.el6.x86_64.rpm
rpm -ivh e2fsprogs-1.42.9.wc1-7.el6.x86_64.rpm
Install Lustre:
rpm -ivh --force kernel-2.6.32-431.23.3.el6_lustre.x86_64.rpm
rpm -ivh lustre-modules-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
rpm -ivh lustre-osd-zfs-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
rpm -ivh lustre-osd-ldiskfs-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
rpm -ivh lustre-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
Check in /boot/grub/grub.conf that the Lustre kernel will be booted by default.
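The check can be scripted. The sketch below reads a GRUB legacy configuration and prints the title of the entry selected by default=N (entries are counted from zero). The function name grub_default_title is hypothetical; the file layout assumed is the usual grub.conf one, with default= in the header and one title line per entry.

```shell
# Print the title of the entry that GRUB legacy boots by default.
# usage: grub_default_title [path-to-grub.conf]
grub_default_title() {
    conf=${1:-/boot/grub/grub.conf}
    awk '
        BEGIN { want = 0 }
        # "default=N" selects the N-th title entry, counting from zero.
        /^default=/ { split($0, kv, "="); want = kv[2] + 0 }
        /^[ \t]*title[ \t]/ {
            if (n++ == want) {
                sub(/^[ \t]*title[ \t]+/, "")
                print
                exit
            }
        }
    ' "$conf"
}
```

If the printed title does not contain "lustre", adjust the default= line before rebooting.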
Configure LNET:
echo "options lnet networks=tcp0(bond0)" > /etc/modprobe.d/lustre.conf
Reboot nodes
reboot
For the clients:
Download and install utils.
Update kernel:
yum install -y kernel-2.6.32-431.23.3.el6
reboot
Download lustre:
wget -r https://downloads.hpdd.intel.com/public/lustre/lustre-2.5.3/el6/client/RPMS/x86_64/
Install Lustre:
rpm -ivh lustre-client-modules-2.5.3-2.6.32_431.23.3.el6.x86_64.x86_64.rpm
rpm -ivh lustre-client-2.5.3-2.6.32_431.23.3.el6.x86_64.x86_64.rpm
Deploying Lustre
Deployment steps:
1. Create the MGS/MDS.
2. Create the OSS/OST.
For MGS/MDS/OSS:
Just in case (create the library symlink if it is missing):
ln -s /lib64/libzfs.so.2.0.0 /lib64/libzfs.so.2
mkfs.lustre --reformat --mgs --backfstype=zfs --fsname=lustrerr rumrrlustre-mdt0msg/mgs mirror /dev/sdd /dev/sde
mkfs.lustre --mdt --backfstype=zfs --index=0 --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-mdt0msg/mdt0
Create /etc/ldev.conf
# example /etc/ldev.conf
#
# local foreign/- label [md|zfs:]device-path [journal-path]
#
ls-1 - MGS zfs:rumrrlustre-mdt0msg/mgs
ls-1 - lustrerr:MDT0000 zfs:rumrrlustre-mdt0msg/mdt0
ls-1 - lustrerr:OST0000 zfs:rumrrlustre-oss0/ost0
service lustre start MGS
service lustre start MDT0000
In case of problems, check LNET:
lctl list_nids
If there is no output:
lctl network up
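The same check can be wrapped in a small function, for example to call it before starting the Lustre services. This is a sketch only: the function name ensure_lnet_up is hypothetical, and it uses nothing beyond the two lctl calls shown above.

```shell
# Bring LNET up if no NIDs are configured, then print the NIDs.
# Returns non-zero if there are still no NIDs afterwards.
ensure_lnet_up() {
    nids=$(lctl list_nids 2>/dev/null)
    if [ -z "$nids" ]; then
        # No NIDs listed: try to configure LNET.
        lctl network up >/dev/null || return 1
        nids=$(lctl list_nids 2>/dev/null)
    fi
    [ -n "$nids" ] || return 1
    echo "$nids"
}
```

A failure here usually points at /etc/modprobe.d/lustre.conf or at the bond0 interface rather than at Lustre itself.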
mkfs.lustre --ost --backfstype=zfs --index=0 --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-oss0/ost0 /dev/ost-drive
Here /dev/ost-drive is the RAID6 volume, named by udev rules.
mkdir /lustre
Add to /etc/fstab:
192.168.5.182@tcp0:/lustrerr /lustre lustre defaults,_netdev 0 0
For OSS servers:
mkfs.lustre --ost --backfstype=zfs --index=N --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-ossN/ost0 /dev/ost-drive
where N is the serial number of the OSS.
Example: mkfs.lustre --ost --backfstype=zfs --index=1 --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-oss1/ost0 /dev/ost-drive
Create /etc/ldev.conf
# example /etc/ldev.conf
#
# local foreign/- label [md|zfs:]device-path [journal-path]
#
ls-M - lustrerr:OST000N zfs:rumrrlustre-ossN/ost0
#where M = N+1
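Since the per-server commands differ only in N, they can be generated. The sketch below prints the mkfs.lustre command and the matching ldev.conf line for a given N, following the naming used above (pool rumrrlustre-ossN, host ls-M with M = N+1). The function names are hypothetical. Lustre writes the index in target labels as four hexadecimal digits, so the label is formatted with %04x.

```shell
FSNAME=lustrerr
MGSNODE=192.168.5.182@tcp0

# Print the mkfs.lustre command for the OST on OSS number N.
ost_mkfs_cmd() {
    n=$1
    echo "mkfs.lustre --ost --backfstype=zfs --index=$n --fsname=$FSNAME" \
         "--mgsnode=$MGSNODE rumrrlustre-oss$n/ost0 /dev/ost-drive"
}

# Print the matching /etc/ldev.conf line (host ls-M, where M = N+1).
ost_ldev_line() {
    n=$1
    # Target labels carry the index as four hex digits.
    printf 'ls-%d - %s:OST%04x zfs:rumrrlustre-oss%d/ost0\n' \
        "$((n + 1))" "$FSNAME" "$n" "$n"
}
```

Both functions only print text, so the output can be reviewed before anything is run on a server.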
For the clients:
mkdir /lustre
Add to /etc/fstab:
192.168.5.182@tcp0:/lustrerr /lustre lustre defaults,_netdev 0 0
On every node with the Lustre filesystem mounted:
lfs df -h
UUID bytes Used Available Use% Mounted on
lustrerr-MDT0000_UUID 108.4G 2.1G 106.2G 2% /lustre[MDT:0]
lustrerr-OST0000_UUID 55.7T 6.7T 48.9T 12% /lustre[OST:0]
lustrerr-OST0001_UUID 55.7T 6.8T 48.9T 12% /lustre[OST:1]
lustrerr-OST0002_UUID 55.7T 6.8T 48.9T 12% /lustre[OST:2]
lustrerr-OST0003_UUID 55.7T 6.7T 48.9T 12% /lustre[OST:3]
lustrerr-OST0004_UUID 55.7T 6.9T 48.8T 12% /lustre[OST:4]
lustrerr-OST0005_UUID 55.7T 6.7T 48.9T 12% /lustre[OST:5]
filesystem summary: 334.0T 40.6T 293.4T 12% /lustre
Working with Lustre
The following tasks are covered:
Rebalancing data, deleting an OST, backup/restore, restoring data from a snapshot
1. Rebalancing data across OSTs when a new node is added
Example (look at lustrerr-OST0005_UUID):
lfs df -h
UUID bytes Used Available Use% Mounted on
lustrerr-MDT0000_UUID 108.4G 2.1G 106.2G 2% /lustre[MDT:0]
lustrerr-OST0000_UUID 55.7T 6.7T 48.9T 12% /lustre[OST:0]
lustrerr-OST0001_UUID 55.7T 6.8T 48.9T 12% /lustre[OST:1]
lustrerr-OST0002_UUID 55.7T 6.8T 48.9T 12% /lustre[OST:2]
lustrerr-OST0003_UUID 55.7T 6.7T 48.9T 12% /lustre[OST:3]
lustrerr-OST0004_UUID 55.7T 6.9T 48.8T 12% /lustre[OST:4]
lustrerr-OST0005_UUID 55.7T 52.7T 5.0T 94% /lustre[OST:5]
filesystem summary: 334.0T 40.6T 293.4T 12% /lustre
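An unbalanced OST like the one above can be spotted automatically by parsing the lfs df output. The sketch below is an illustration, not part of the original procedure: the function name full_osts and the default limit of 80% are assumptions.

```shell
# Read "lfs df" output on stdin and print OSTs whose Use% exceeds a limit.
# usage: lfs df -h /lustre | full_osts 80
full_osts() {
    limit=${1:-80}
    awk -v limit="$limit" '
        # OST lines only; inactive devices have no Use% column and are skipped.
        $1 ~ /-OST[0-9a-f]+_UUID$/ && $5 ~ /%$/ {
            pct = $5
            sub(/%/, "", pct)
            if (pct + 0 > limit + 0) print $1, $5
        }
    '
}
```

With the listing above as input and a limit of 80, it prints only the lustrerr-OST0005_UUID line.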
Two problems can arise:
1.1 New data cannot be written because one of the OSTs runs out of free space.
1.2 The I/O load on the new node increases.
Use the following algorithm to solve this: deactivate the OST, migrate large files off it, then activate it again.
Example:
lctl --device N deactivate
lfs find /lustre --ost {OST_UUID} -size +1G | lfs_migrate -y
lctl --device N activate
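The three steps can be combined so that the OST is reactivated even if the migration fails. This is a sketch under assumptions: the function name rebalance_ost is hypothetical, it is meant to run on the MDS node (which also mounts /lustre in this setup), and DEVNO is the device number that lctl dl shows for the OST there. The OST is selected with --obd, as in the OST removal commands in this document.

```shell
# Drain large files from one OST while it is closed for new allocations.
# usage: rebalance_ost DEVNO OST_UUID MOUNTPOINT [MIN_SIZE]
rebalance_ost() {
    devno=$1; ost=$2; mnt=$3; minsize=${4:-+1G}
    lctl --device "$devno" deactivate || return 1
    # Rewrite files that have objects on this OST; the new copies are
    # allocated on the OSTs that are still active.
    lfs find "$mnt" --obd "$ost" -size "$minsize" | lfs_migrate -y
    rc=$?
    # Reactivate the OST even if the migration failed.
    lctl --device "$devno" activate || return 1
    return $rc
}
```

Example call: rebalance_ost N lustrerr-OST0005_UUID /lustre, with N taken from lctl dl.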
2. Deleting an OST
Use the algorithm above for this task:
OST deactivation (the OST will be available for reading only)
Moving data to OSTs with more free space
Permanent OST deactivation
lctl --device FS-OST0003_UUID deactivate #temporary deactivate
lfs find --obd FS-OST0003_UUID /lustre | lfs_migrate -y #migrate data
lctl conf_param FS-OST0003_UUID.osc.active=0 #permanently deactivate
Result:
lfs df -h
UUID bytes Used Available Use% Mounted on
lustrerr-MDT0000_UUID 108.4G 2.1G 106.2G 2% /lustre[MDT:0]
lustrerr-OST0000_UUID 55.7T 6.7T 48.9T 12% /lustre[OST:0]
lustrerr-OST0001_UUID 55.7T 6.8T 48.9T 12% /lustre[OST:1]
lustrerr-OST0002_UUID 55.7T 6.8T 48.9T 12% /lustre[OST:2]
lustrerr-OST0003_UUID : inactive device
lustrerr-OST0004_UUID 55.7T 6.9T 48.8T 12% /lustre[OST:4]
lustrerr-OST0005_UUID 55.7T 6.7T 48.9T 12% /lustre[OST:5]
3. Backup and restore
This is solved with ZFS snapshots, which can also be moved to other locations. Example of an MDT backup (the OST lines are commented out):
vi /usr/local/bin/snapscript.sh
#!/bin/sh
# snapshot names use the date in YYYY-MM-DD form
currdate=`/bin/date +%Y-%m-%d`
olddate=`/bin/date --date="21 days ago" +%Y-%m-%d`
chk=`/sbin/zfs list -t snapshot | grep "@$olddate"`
# create snapshots of the MDT (and OST) datasets
/sbin/zfs snapshot rumrrlustre-mdt0msg/mdt0@$currdate
#/sbin/zfs snapshot rumrrlustre-ossN/ost0@$currdate # must be run on every OSS; it can also be started remotely over ssh
# delete the 21-day-old snapshots (if they exist)
if [ -n "$chk" ]; then
/sbin/zfs destroy rumrrlustre-mdt0msg/mdt0@$olddate
#/sbin/zfs destroy rumrrlustre-ossN/ost0@$olddate # for OST
fi
/sbin/zfs send -p rumrrlustre-mdt0msg/mdt0@$currdate | /bin/gzip > /root/meta-snap.gz # back up only the MDT
# the MDT and OSTs can also be backed up to a remote node
# example: zfs send -R rumrrlustre-mdt0msg/mdt0@$currdate | ssh some-node zfs receive rumrrlustre-mdt0msg/mdt0@$currdate
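The script above removes only the snapshot taken exactly 21 days ago, so a missed run leaves an old snapshot behind. A more forgiving variant selects every dated snapshot older than the cutoff. The sketch below is an assumption-laden illustration: the function name old_snapshots is hypothetical, and it relies on snapshot names ending in @YYYY-MM-DD so that plain string comparison orders them by date.

```shell
# Read snapshot names (dataset@YYYY-MM-DD) on stdin and print those of
# DATASET that are older than CUTOFF (YYYY-MM-DD).
# usage:
#   cutoff=$(date --date="21 days ago" +%Y-%m-%d)
#   zfs list -H -t snapshot -o name |
#       old_snapshots rumrrlustre-mdt0msg/mdt0 "$cutoff" |
#       xargs -r -n1 zfs destroy
old_snapshots() {
    dataset=$1; cutoff=$2
    awk -F@ -v ds="$dataset" -v cutoff="$cutoff" '
        # Appending "" forces a string comparison of the two dates.
        $1 == ds &&
        $2 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ &&
        ($2 "") < (cutoff "")
    '
}
```

Snapshots whose suffix is not a date (for example manually created ones) are never selected.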
Restoring from a backup (example for the MDT only):
service lustre stop lustrerr:MDT0000
zfs rename rumrrlustre-mdt0msg/mdt0 rumrrlustre-mdt0msg/mdt0-old
gunzip -c /root/meta-snap.gz | zfs receive rumrrlustre-mdt0msg/mdt0
service lustre start lustrerr:MDT0000
Check the logs:
tail -f /var/log/messages
Oct 14 14:12:44 ls-1 kernel: Lustre: lustrerr-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Oct 14 14:13:08 ls-1 kernel: Lustre: 3937:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1413281557/real 1413281557] req@ffff880c60512c00 x1474855950917400/t0(0) o38->lustrerz-MDT0000-mdc-ffff880463edc000@0@lo:12/10 lens 400/544 e 0 to 1 dl 1413281588 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 14 14:13:08 ls-1 kernel: Lustre: 3937:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 71364 previous similar messages
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-MDT0000: Will be in recovery for at least 2:30, or until 1 client reconnects
Oct 14 14:13:08 ls-1 kernel: LustreError: 3937:0:(import.c:1000:ptlrpc_connect_interpret()) lustrerr-MDT0000_UUID went back in time (transno 55834576660 was previously committed, server now claims 55834576659)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-MDT0000-mdc-ffff880463edc000: Connection restored to lustrerz-MDT0000 (at 0@lo)
Oct 14 14:13:08 ls-1 kernel: Lustre: Skipped 1 previous similar message
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-OST0000: deleting orphan objects from 0x0:1571748 to 0x0:1571857
Oct 14 14:13:33 ls-1 kernel: LustreError: 167-0: lustrerz-MDT0000-lwp-OST0000: This client was evicted by lustrerz-MDT0000; in progress operations using this service will fail.
Oct 14 14:13:33 ls-1 kernel: Lustre: lustrerr-MDT0000-lwp-OST0000: Connection restored to lustrerz-MDT0000 (at 0@lo)
4. Restoring data from a snapshot
The same script as for backup/restore is used, this time with the OST lines enabled.
4.1.
vi /usr/local/bin/snapscript.sh
#!/bin/sh
# snapshot names use the date in YYYY-MM-DD form
currdate=`/bin/date +%Y-%m-%d`
olddate=`/bin/date --date="21 days ago" +%Y-%m-%d`
chk=`/sbin/zfs list -t snapshot | grep "@$olddate"`
# create snapshots of the MDT and OST datasets
/sbin/zfs snapshot rumrrlustre-mdt0msg/mdt0@$currdate
/sbin/zfs snapshot rumrrlustre-ossN/ost0@$currdate # must be run on every OSS; it can also be started remotely over ssh
# delete the 21-day-old snapshots (if they exist)
if [ -n "$chk" ]; then
/sbin/zfs destroy rumrrlustre-mdt0msg/mdt0@$olddate
/sbin/zfs destroy rumrrlustre-ossN/ost0@$olddate # for OST
fi
4.2.
For MDT:
zfs clone -o canmount=off -o xattr=sa -o lustre:svname=lustrerr-MDT0000 -o lustre:mgsnode=192.168.5.182@tcp -o lustre:flags=1 -o lustre:fsname=lustrerr -o lustre:index=0 -o lustre:version=1 rumrrlustre-mdt0msg/mdt0@date rumrrlustre-mdt0msg/mdt00
4.3.
For OST (N is the OST number):
zfs clone -o canmount=off -o xattr=sa -o lustre:svname=lustrerr-OST000N -o lustre:mgsnode=192.168.5.182@tcp -o lustre:flags=34 -o lustre:fsname=lustrerr -o lustre:index=N -o lustre:version=1 rumrrlustre-ossN/ost0@date rumrrlustre-ossN/ostN0
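Because the OST clone command has to be repeated for every N, it can be generated to avoid typing mistakes. The sketch below only prints the command; the property values are copied from the command above. The function name ost_clone_cmd is hypothetical, and the service name is assumed to be the filesystem name followed by the target label with a four-digit hexadecimal index.

```shell
FSNAME=lustrerr
MGSNODE=192.168.5.182@tcp

# Print the zfs clone command that turns the snapshot of OST N taken on
# SNAPDATE into a separate Lustre target dataset named ostN0.
ost_clone_cmd() {
    n=$1; snapdate=$2
    # Assumed naming: <fsname>-OST<index as four hex digits>.
    svname=$(printf '%s-OST%04x' "$FSNAME" "$n")
    echo "zfs clone -o canmount=off -o xattr=sa" \
         "-o lustre:svname=$svname -o lustre:mgsnode=$MGSNODE" \
         "-o lustre:flags=34 -o lustre:fsname=$FSNAME" \
         "-o lustre:index=$n -o lustre:version=1" \
         "rumrrlustre-oss$n/ost0@$snapdate rumrrlustre-oss$n/ost${n}0"
}
```

Review the printed line before running it on the corresponding OSS.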
4.4.
Stop lustre (on all nodes)
service lustre stop
4.5.
In /etc/ldev.conf (it must be edited on all nodes; the following is an example for the first server):
ls-1.scanex.ru - lustrerr:MDT0000 zfs:rumrrlustre-mdt0msg/mdt00
ls-1.scanex.ru - lustrerr:OST000N zfs:rumrrlustre-ossN/ostN0
4.6.
Start lustre (on all nodes)
service lustre start
4.7.
Copy the data to the chosen location (a local or remote path), then stop Lustre on all nodes:
service lustre stop
4.8.
Restore initial configuration /etc/ldev.conf and start lustre.
service lustre start
4.9.
Copy the data from the chosen location back to Lustre.
4.10.
Delete the ZFS clones with zfs destroy.