This article describes the upgrade of the following cluster, with a few changes along the way. Everything below applies to Lustre version 2.5.3.

The layout used:

One MGS/MDS/OSS server and five OSS servers.
Configuration of the MGS/MDS/OSS server:

* CPU: Intel Xeon 56xx, 2×2.4 GHz
* RAM: 72 GB*
* Network: 6×1 Gbit/s
* SSD: 2×120 GB
* HDD: RAID6+HS, 24×3 TB disks

*The large amount of memory is explained by this node also serving SMB and NFS exports.
Configuration of an OSS server:

* CPU: Intel Xeon 56xx, 2×2.4 GHz
* RAM: 12 GB
* Network: 4×1 Gbit/s
* HDD: Adaptec RAID6+HS, 24×3 TB disks
Network:
All servers are in a single VLAN (there is no separate backend or frontend network).
OS on all servers: CentOS 6.5
A fair question is where the SSDs are plugged in if the chassis has only 24 hot-swap bays. The answer: they are connected to the motherboard and placed inside the server case (there was free space). Our production requirements allow powering off the hardware for up to 10 minutes. If your requirements are stricter, use only hot-swap drives; if they demand 24/7 operation, use fault-tolerant solutions.
* Install CentOS 6.5
* Update the system and install packages:

```
yum --exclude=kernel* update -y
yum localinstall --nogpgcheck https://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
yum localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release.el6.noarch.rpm
yum install zfs strace sysstat man wget net-snmp openssh-clients ntp ntpdate tuned
```
Check that the zfs kernel module was built (Lustre 2.5.3 is compatible with ZFS 0.6.3).
* Create a bond on the MGS/MDS/OSS server and each OSS server:

```
bond0 BONDING_OPTS="miimon=100 mode=0"
```
* Disable SELinux
* Install the following packages (nslookup comes from bind-utils):

```
yum install mc openssh-clients openssh-server net-snmp man sysstat rsync htop trafshow bind-utils ntp
```
* Configure NTP
* Set identical uid:gid mappings on all servers
* Apply a performance profile: `tuned-adm profile latency-performance`
* Tune sysctl.conf:
```
# increase Linux TCP buffer limits
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
# increase default and maximum Linux TCP buffer sizes
net.ipv4.tcp_rmem = 4096 262144 8388608
net.ipv4.tcp_wmem = 4096 262144 8388608
# increase max backlog to avoid dropped packets
net.core.netdev_max_backlog = 2500
net.ipv4.tcp_mem = 8388608 8388608 8388608
net.ipv4.tcp_ecn = 0
```
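For reference, `mode=0` in the bonding step above is round-robin. A minimal sketch of the bond configuration files, assuming static addressing (the IP address and slave interface name are examples, not taken from the article):

```shell
# Hypothetical /etc/sysconfig/network-scripts/ifcfg-bond0 (values are examples)
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.168.5.182
NETMASK=255.255.255.0
BONDING_OPTS="miimon=100 mode=0"

# Each slave interface gets its own file, e.g. a hypothetical ifcfg-eth0:
# DEVICE=eth0
# MASTER=bond0
# SLAVE=yes
# ONBOOT=yes
# BOOTPROTO=none
```

Repeat the slave stanza for every NIC that joins the bond (6 on the MGS/MDS/OSS node, 4 on each OSS).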
For the servers:

Download the utilities:

```
wget -r https://downloads.hpdd.intel.com/public/e2fsprogs/1.42.9.wc1/el6/RPMS/x86_64/
```

and Lustre:

```
wget -r https://downloads.hpdd.intel.com/public/lustre/lustre-2.5.3/el6/server/RPMS/x86_64/
```
Install the utilities. First remove the old ones:

```
rpm -e --nodeps e2fsprogs e2fsprogs-libs libcom_err libss
```

then install the new ones:

```
rpm -ivh libcom_err-1.42.9.wc1-7.el6.x86_64.rpm
rpm -ivh e2fsprogs-libs-1.42.9.wc1-7.el6.x86_64.rpm
rpm -ivh e2fsprogs-1.42.9.wc1-7.el6.x86_64.rpm
```
Install Lustre:

```
rpm -ivh --force kernel-2.6.32-431.23.3.el6_lustre.x86_64.rpm
rpm -ivh lustre-modules-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
rpm -ivh lustre-osd-zfs-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
rpm -ivh lustre-osd-ldiskfs-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
rpm -ivh lustre-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
```
Check in /boot/grub/grub.conf that the Lustre kernel will boot by default.
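A quick way to confirm which entry grub will boot. This is a sketch that builds a sample grub.conf in /tmp for illustration; on a real server, point the commands at /boot/grub/grub.conf instead:

```shell
# Sample grub.conf for illustration; on a server use /boot/grub/grub.conf
cat > /tmp/grub.conf <<'EOF'
default=0
timeout=5
title CentOS (2.6.32-431.23.3.el6_lustre.x86_64)
title CentOS (2.6.32-431.el6.x86_64)
EOF

# Read the default entry index, then print the title with that index
def=$(awk -F= '/^default=/ {print $2}' /tmp/grub.conf)
title=$(awk '/^title / {if (i++ == n) {print; exit}}' n="$def" /tmp/grub.conf)
echo "$title"
```

The printed title should name the `*_lustre` kernel; if it does not, adjust `default=` (entries are counted from 0).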
Configure LNET:

```
echo "options lnet networks=tcp0(bond0)" > /etc/modprobe.d/lustre.conf
```

Reboot the nodes:

```
reboot
```
For the clients:

Download and install the utilities (as above). Update the kernel:

```
yum install -y kernel-2.6.32-431.23.3.el6
reboot
```
Download Lustre:

```
wget -r https://downloads.hpdd.intel.com/public/lustre/lustre-2.5.3/el6/client/RPMS/x86_64/
```

Install Lustre:

```
rpm -ivh lustre-client-modules-2.5.3-2.6.32_431.23.3.el6.x86_64.x86_64.rpm
rpm -ivh lustre-client-2.5.3-2.6.32_431.23.3.el6.x86_64.x86_64.rpm
```
Deployment steps:

1. Set up the MGS/MDS.
2. Set up the OSS/OSTs.
For MGS/MDS/OSS:

Just in case:

```
ln -s /lib64/libzfs.so.2.0.0 libzfs.so.2
```

Format the MGS and MDT:

```
mkfs.lustre --reformat --mgs --backfstype=zfs --fsname=lustrerr rumrrlustre-mdt0msg/mgs mirror /dev/sdd /dev/sde
mkfs.lustre --mdt --backfstype=zfs --index=0 --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-mdt0msg/mdt0
```

Create /etc/ldev.conf:

```
# example /etc/ldev.conf
#
# local  foreign/-  label  [md|zfs:]device-path  [journal-path]
#
ls-1 - MGS zfs:rumrrlustre-mdt0msg/mgs
ls-1 - lustrerr:MDT0000 zfs:rumrrlustre-mdt0msg/mdt0
ls-1 - lustrerr:OST0000 zfs:rumrrlustre-oss0/ost0
```

Start the services:

```
service lustre start MGS
service lustre start MDT0000
```
In case of problems, check the Lustre LNET setup:

```
lctl list_nids
```

If there is no output:

```
lctl network up
```
Create the OST:

```
mkfs.lustre --ost --backfstype=zfs --index=0 --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-oss0/ost0 /dev/ost-drive
```

Here /dev/ost-drive is the RAID6 array, named via udev rules.

Create the mount point:

```
mkdir /lustre
```

and add to /etc/fstab:

```
192.168.5.182@tcp0:/lustrerr /lustre lustre defaults,_netdev 0 0
```
For the OSS servers:

```
mkfs.lustre --ost --backfstype=zfs --index=N --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-ossN/ost0 /dev/ost-drive
```

where N is the serial number of the server. Example:

```
mkfs.lustre --ost --backfstype=zfs --index=1 --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-oss1/ost0 /dev/ost-drive
```

Create /etc/ldev.conf:

```
# example /etc/ldev.conf
#
# local  foreign/-  label  [md|zfs:]device-path  [journal-path]
#
ls-M - lustrerr:OST000N zfs:rumrrlustre-ossN/ost0
# where M = N+1
```
For the clients:

```
mkdir /lustre
```

/etc/fstab:

```
192.168.5.182@tcp0:/lustrerr /lustre lustre defaults,_netdev 0 0
```
On any server with the Lustre filesystem mounted:

```
lfs df -h
UUID                   bytes   Used   Available  Use%  Mounted on
lustrerr-MDT0000_UUID  108.4G  2.1G   106.2G     2%    /lustre[MDT:0]
lustrerr-OST0000_UUID  55.7T   6.7T   48.9T      12%   /lustre[OST:0]
lustrerr-OST0001_UUID  55.7T   6.8T   48.9T      12%   /lustre[OST:1]
lustrerr-OST0002_UUID  55.7T   6.8T   48.9T      12%   /lustre[OST:2]
lustrerr-OST0003_UUID  55.7T   6.7T   48.9T      12%   /lustre[OST:3]
lustrerr-OST0004_UUID  55.7T   6.9T   48.8T      12%   /lustre[OST:4]
lustrerr-OST0005_UUID  55.7T   6.7T   48.9T      12%   /lustre[OST:5]

filesystem summary:    334.0T  40.6T  293.4T     12%   /lustre
```
The following tasks are considered below: rebalancing data, removing an OST, backup/restore, and restoring data from a snapshot.

1. Rebalancing data across OSTs after a new node is added. Example (note lustrerr-OST0005_UUID):
```
lfs df -h
UUID                   bytes   Used   Available  Use%  Mounted on
lustrerr-MDT0000_UUID  108.4G  2.1G   106.2G     2%    /lustre[MDT:0]
lustrerr-OST0000_UUID  55.7T   6.7T   48.9T      12%   /lustre[OST:0]
lustrerr-OST0001_UUID  55.7T   6.8T   48.9T      12%   /lustre[OST:1]
lustrerr-OST0002_UUID  55.7T   6.8T   48.9T      12%   /lustre[OST:2]
lustrerr-OST0003_UUID  55.7T   6.7T   48.9T      12%   /lustre[OST:3]
lustrerr-OST0004_UUID  55.7T   6.9T   48.8T      12%   /lustre[OST:4]
lustrerr-OST0005_UUID  55.7T   52.7T  5.0T       94%   /lustre[OST:5]

filesystem summary:    334.0T  40.6T  293.4T     12%   /lustre
```
Two problems can arise here:

1.1 Writing new data can fail because free space runs out on that single overloaded OST.
1.2 I/O load becomes concentrated on the new node.

Use the following procedure to solve them:
Example:
```
lctl --device N deactivate
lfs find --ost {OST_UUID} -size +1G | lfs_migrate -y
lctl --device N activate
```
2. Removing an OST.

A similar procedure solves this task:

```
lctl --device FS-OST0003_UUID deactivate                   # temporarily deactivate
lfs find --obd FS-OST0003_UUID /lustre | lfs_migrate -y    # migrate the data
lctl conf_param FS-OST0003_UUID.osc.active=0               # permanently deactivate
```
Result:
```
lfs df -h
UUID                   bytes   Used   Available  Use%  Mounted on
lustrerr-MDT0000_UUID  108.4G  2.1G   106.2G     2%    /lustre[MDT:0]
lustrerr-OST0000_UUID  55.7T   6.7T   48.9T      12%   /lustre[OST:0]
lustrerr-OST0001_UUID  55.7T   6.8T   48.9T      12%   /lustre[OST:1]
lustrerr-OST0002_UUID  55.7T   6.8T   48.9T      12%   /lustre[OST:2]
lustrerr-OST0003_UUID  : inactive device
lustrerr-OST0004_UUID  55.7T   6.9T   48.8T      12%   /lustre[OST:4]
lustrerr-OST0005_UUID  55.7T   6.7T   48.9T      12%   /lustre[OST:5]
```
3. Backup and restore.

This is solved with snapshots, which can be sent to other locations. Example of an MDT backup (the OST lines are commented out):
vi /usr/local/bin/snapscript.sh
```
#!/bin/sh
currdate=`/bin/date +%Y-%m-%0e`
olddate=`/bin/date --date="21 days ago" +%Y-%m-%0e`
chk=`zfs list -t snapshot | grep $olddate`
# create the snapshots
/sbin/zfs snapshot rumrrlustre-mdt0msg/mdt0@$currdate
#/sbin/zfs snapshot rumrrlustre-ossN/ost0@$currdate   # must be run on every OST; can also be launched over ssh
# delete 21-day-old snapshots (if they exist)
/sbin/zfs destroy rumrrlustre-mdt0msg/mdt0@$olddate
#/sbin/zfs destroy rumrrlustre-ossN/ost0@$olddate     # for an OST
# back up only the MDT
/sbin/zfs send -p rumrrlustre-mdt0msg/mdt0@$currdate | /bin/gzip > /root/meta-snap.gz
# the MDT and OSTs can also be backed up to a remote node, for example:
# zfs send -R rumrrlustre-mdt0msg/mdt0@$currdate | ssh some-node zfs receive rumrrlustre-mdt0msg/mdt0@$currdate
```
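To make the naming scheme concrete: the script tags each snapshot with the current date and destroys the snapshot from 21 days earlier. A small illustration with a pinned date (2014-10-14 is an arbitrary example; the real script uses "now"):

```shell
# Pin the dates so the example is reproducible
currdate=$(date --date="2014-10-14" +%Y-%m-%0e)
olddate=$(date --date="2014-10-14 21 days ago" +%Y-%m-%0e)

echo "create:  rumrrlustre-mdt0msg/mdt0@$currdate"   # ...@2014-10-14
echo "destroy: rumrrlustre-mdt0msg/mdt0@$olddate"    # ...@2014-09-23
```

So, run daily from cron, the script keeps a rolling 21-day window of snapshots per dataset.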
Restore from backup (example for the MDT only):

```
service lustre stop lustrerr:MDT0000
zfs rename rumrrlustre-mdt0msg/mdt0 rumrrlustre-mdt0msg/mdt0-old
gunzip -c /root/meta-snap.gz | zfs receive rumrrlustre-mdt0msg/mdt0
service lustre start lustrerr:MDT0000
```
Watch the logs:

```
tail -f /var/log/messages
Oct 14 14:12:44 ls-1 kernel: Lustre: lustrerr-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Oct 14 14:13:08 ls-1 kernel: Lustre: 3937:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1413281557/real 1413281557] req@ffff880c60512c00 x1474855950917400/t0(0) o38->lustrerz-MDT0000-mdc-ffff880463edc000@0@lo:12/10 lens 400/544 e 0 to 1 dl 1413281588 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 14 14:13:08 ls-1 kernel: Lustre: 3937:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 71364 previous similar messages
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-MDT0000: Will be in recovery for at least 2:30, or until 1 client reconnects
Oct 14 14:13:08 ls-1 kernel: LustreError: 3937:0:(import.c:1000:ptlrpc_connect_interpret()) lustrerr-MDT0000_UUID went back in time (transno 55834576660 was previously committed, server now claims 55834576659)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-MDT0000-mdc-ffff880463edc000: Connection restored to lustrerz-MDT0000 (at 0@lo)
Oct 14 14:13:08 ls-1 kernel: Lustre: Skipped 1 previous similar message
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-OST0000: deleting orphan objects from 0x0:1571748 to 0x0:1571857
Oct 14 14:13:33 ls-1 kernel: LustreError: 167-0: lustrerz-MDT0000-lwp-OST0000: This client was evicted by lustrerz-MDT0000; in progress operations using this service will fail.
Oct 14 14:13:33 ls-1 kernel: Lustre: lustrerr-MDT0000-lwp-OST0000: Connection restored to lustrerz-MDT0000 (at 0@lo)
```
4. Restoring data from a snapshot.

The same script as for backup/restore is used.
4.1.
vi /usr/local/bin/snapscript.sh
```
#!/bin/sh
currdate=`/bin/date +%Y-%m-%0e`
olddate=`/bin/date --date="21 days ago" +%Y-%m-%0e`
chk=`zfs list -t snapshot | grep $olddate`
# create the snapshots
/sbin/zfs snapshot rumrrlustre-mdt0msg/mdt0@$currdate
/sbin/zfs snapshot rumrrlustre-ossN/ost0@$currdate   # must be run on every OST; can also be launched over ssh
# delete 21-day-old snapshots (if they exist)
/sbin/zfs destroy rumrrlustre-mdt0msg/mdt0@$olddate
/sbin/zfs destroy rumrrlustre-ossN/ost0@$olddate     # for an OST
```
4.2.
For the MDT:

```
zfs clone -o canmount=off -o xattr=sa -o lustre:svname=lustrerr-MDT0000 -o lustre:mgsnode=192.168.5.182@tcp -o lustre:flags=1 -o lustre:fsname=lustrerr -o lustre:index=0 -o lustre:version=1 rumrrlustre-mdt0msg/mdt0@date rumrrlustre-mdt0msg/mdt00
```
4.3.
For an OST (N is the OST number):

```
zfs clone -o canmount=off -o xattr=sa -o lustre:svname=lustrerr-OST000N -o lustre:mgsnode=192.168.5.182@tcp -o lustre:flags=34 -o lustre:fsname=lustrerr -o lustre:index=N -o lustre:version=1 rumrrlustre-ossN/ost0@date rumrrlustre-ossN/ostN0
```
4.4.
Stop Lustre (on all nodes):

```
service lustre stop
```
4.5.
In /etc/ldev.conf (must be edited on all servers; below is an example for the first server):

```
ls-1.scanex.ru - lustrerr:MDT0000 zfs:rumrrlustre-mdt0msg/mdt00
ls-1.scanex.ru - lustrerr:OST000N zfs:rumrrlustre-ossN/ostN0
```
4.6.
Start Lustre (on all nodes):

```
service lustre start
```
4.7.
Copy the data to the chosen location (local or remote paths), then stop Lustre on all nodes:

```
service lustre stop
```
4.8.
Restore the original /etc/ldev.conf and start Lustre:

```
service lustre start
```
4.9.
Copy the data from that location back to Lustre.
4.10.
Delete the ZFS clones with zfs destroy (see the ZFS documentation).