This article describes the upgrade of the following cluster, with a few changes along the way. Everything below applies to Lustre version 2.5.3.

The layout used:

One MGS/MDS/OSS server and five OSS servers.
Configuration of the MGS/MDS/OSS server:

* CPU: Intel Xeon 56xx, 2×2.4 GHz
* RAM: 72 GB*
* Network: 6×1 Gbit/s
* SSD: 2×120 GB
* HDD: RAID6+HS, 24×3 TB disks

*The large amount of memory is explained by this node also serving SMB and NFS exports.
Configuration of an OSS server:

* CPU: Intel Xeon 56xx, 2×2.4 GHz
* RAM: 12 GB
* Network: 4×1 Gbit/s
* HDD: Adaptec RAID6+HS, 24×3 TB disks
Network:
All servers are in a single VLAN (there is no separate backend or frontend network).
OS on all servers: CentOS 6.5
A fair question is where the SSDs are plugged in if the chassis has only 24 hot-swap bays. The answer: they are connected to the motherboard and placed inside the server case (there was free space). Our production requirements allow powering off the hardware for up to 10 minutes. If your requirements are stricter, use only hot-swap drives; if they demand 24/7 operation, use fault-tolerant solutions.
* Install CentOS 6.5
* Update the system and install packages:

```
yum --exclude=kernel* update -y
yum localinstall --nogpgcheck https://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
yum localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release.el6.noarch.rpm
yum install zfs strace sysstat man wget net-snmp openssh-clients ntp ntpdate tuned
```
Check that the zfs kernel module was built (Lustre 2.5.3 is compatible with ZFS 0.6.3).
* Create a bond on the MGS/MDS/OSS server and each OSS server:

```
bond0 BONDING_OPTS="miimon=100 mode=0"
```
* Disable SELinux
* Install the following packages (nslookup comes from bind-utils):

```
yum install mc openssh-clients openssh-server net-snmp man sysstat rsync htop trafshow bind-utils ntp
```
* Configure NTP
* Set identical uid:gid mappings on all servers
* Apply a performance profile: `tuned-adm profile latency-performance`
* Tune sysctl.conf:
```
# increase Linux TCP buffer limits
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
# increase default and maximum Linux TCP buffer sizes
net.ipv4.tcp_rmem = 4096 262144 8388608
net.ipv4.tcp_wmem = 4096 262144 8388608
# increase max backlog to avoid dropped packets
net.core.netdev_max_backlog = 2500
net.ipv4.tcp_mem = 8388608 8388608 8388608
net.ipv4.tcp_ecn = 0
```
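For reference, `mode=0` in the bonding step above is round-robin. A minimal sketch of the bond configuration files, assuming static addressing (the IP address and slave interface name are examples, not taken from the article):

```shell
# Hypothetical /etc/sysconfig/network-scripts/ifcfg-bond0 (values are examples)
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.168.5.182
NETMASK=255.255.255.0
BONDING_OPTS="miimon=100 mode=0"

# Each slave interface gets its own file, e.g. a hypothetical ifcfg-eth0:
# DEVICE=eth0
# MASTER=bond0
# SLAVE=yes
# ONBOOT=yes
# BOOTPROTO=none
```

Repeat the slave stanza for every NIC that joins the bond (6 on the MGS/MDS/OSS node, 4 on each OSS).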
For the servers:

Download the utilities:

```
wget -r https://downloads.hpdd.intel.com/public/e2fsprogs/1.42.9.wc1/el6/RPMS/x86_64/
```

and Lustre:

```
wget -r https://downloads.hpdd.intel.com/public/lustre/lustre-2.5.3/el6/server/RPMS/x86_64/
```
Install the utilities. First remove the old ones:

```
rpm -e --nodeps e2fsprogs e2fsprogs-libs libcom_err libss
```

then install the new ones:

```
rpm -ivh libcom_err-1.42.9.wc1-7.el6.x86_64.rpm
rpm -ivh e2fsprogs-libs-1.42.9.wc1-7.el6.x86_64.rpm
rpm -ivh e2fsprogs-1.42.9.wc1-7.el6.x86_64.rpm
```
Install Lustre:

```
rpm -ivh --force kernel-2.6.32-431.23.3.el6_lustre.x86_64.rpm
rpm -ivh lustre-modules-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
rpm -ivh lustre-osd-zfs-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
rpm -ivh lustre-osd-ldiskfs-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
rpm -ivh lustre-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
```
Check in /boot/grub/grub.conf that the Lustre kernel will boot by default.
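A quick way to confirm which entry grub will boot. This is a sketch that builds a sample grub.conf in /tmp for illustration; on a real server, point the commands at /boot/grub/grub.conf instead:

```shell
# Sample grub.conf for illustration; on a server use /boot/grub/grub.conf
cat > /tmp/grub.conf <<'EOF'
default=0
timeout=5
title CentOS (2.6.32-431.23.3.el6_lustre.x86_64)
title CentOS (2.6.32-431.el6.x86_64)
EOF

# Read the default entry index, then print the title with that index
def=$(awk -F= '/^default=/ {print $2}' /tmp/grub.conf)
title=$(awk '/^title / {if (i++ == n) {print; exit}}' n="$def" /tmp/grub.conf)
echo "$title"
```

The printed title should name the `*_lustre` kernel; if it does not, adjust `default=` (entries are counted from 0).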
Configure LNET:

```
echo "options lnet networks=tcp0(bond0)" > /etc/modprobe.d/lustre.conf
```

Reboot the nodes:

```
reboot
```
For the clients:

Download and install the utilities (as above). Update the kernel:

```
yum install -y kernel-2.6.32-431.23.3.el6
reboot
```
Download Lustre:

```
wget -r https://downloads.hpdd.intel.com/public/lustre/lustre-2.5.3/el6/client/RPMS/x86_64/
```

Install Lustre:

```
rpm -ivh lustre-client-modules-2.5.3-2.6.32_431.23.3.el6.x86_64.x86_64.rpm
rpm -ivh lustre-client-2.5.3-2.6.32_431.23.3.el6.x86_64.x86_64.rpm
```
Deployment steps:

1. Set up the MGS/MDS.
2. Set up the OSS/OSTs.
For MGS/MDS/OSS:

Just in case:

```
ln -s /lib64/libzfs.so.2.0.0 libzfs.so.2
```

Format the MGS and MDT:

```
mkfs.lustre --reformat --mgs --backfstype=zfs --fsname=lustrerr rumrrlustre-mdt0msg/mgs mirror /dev/sdd /dev/sde
mkfs.lustre --mdt --backfstype=zfs --index=0 --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-mdt0msg/mdt0
```

Create /etc/ldev.conf:

```
# example /etc/ldev.conf
#
# local  foreign/-  label  [md|zfs:]device-path  [journal-path]
#
ls-1 - MGS zfs:rumrrlustre-mdt0msg/mgs
ls-1 - lustrerr:MDT0000 zfs:rumrrlustre-mdt0msg/mdt0
ls-1 - lustrerr:OST0000 zfs:rumrrlustre-oss0/ost0
```

Start the services:

```
service lustre start MGS
service lustre start MDT0000
```
In case of problems, check the Lustre LNET setup:

```
lctl list_nids
```

If there is no output:

```
lctl network up
```
Create the OST:

```
mkfs.lustre --ost --backfstype=zfs --index=0 --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-oss0/ost0 /dev/ost-drive
```

Here /dev/ost-drive is the RAID6 array, named via udev rules.

Create the mount point:

```
mkdir /lustre
```

and add to /etc/fstab:

```
192.168.5.182@tcp0:/lustrerr /lustre lustre defaults,_netdev 0 0
```
For the OSS servers:

```
mkfs.lustre --ost --backfstype=zfs --index=N --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-ossN/ost0 /dev/ost-drive
```

where N is the serial number of the server. Example:

```
mkfs.lustre --ost --backfstype=zfs --index=1 --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-oss1/ost0 /dev/ost-drive
```

Create /etc/ldev.conf:

```
# example /etc/ldev.conf
#
# local  foreign/-  label  [md|zfs:]device-path  [journal-path]
#
ls-M - lustrerr:OST000N zfs:rumrrlustre-ossN/ost0
# where M = N+1
```
For the clients:

```
mkdir /lustre
```

/etc/fstab:

```
192.168.5.182@tcp0:/lustrerr /lustre lustre defaults,_netdev 0 0
```
On any server with the Lustre filesystem mounted:

```
lfs df -h
UUID                   bytes   Used   Available  Use%  Mounted on
lustrerr-MDT0000_UUID  108.4G  2.1G   106.2G     2%    /lustre[MDT:0]
lustrerr-OST0000_UUID  55.7T   6.7T   48.9T      12%   /lustre[OST:0]
lustrerr-OST0001_UUID  55.7T   6.8T   48.9T      12%   /lustre[OST:1]
lustrerr-OST0002_UUID  55.7T   6.8T   48.9T      12%   /lustre[OST:2]
lustrerr-OST0003_UUID  55.7T   6.7T   48.9T      12%   /lustre[OST:3]
lustrerr-OST0004_UUID  55.7T   6.9T   48.8T      12%   /lustre[OST:4]
lustrerr-OST0005_UUID  55.7T   6.7T   48.9T      12%   /lustre[OST:5]

filesystem summary:    334.0T  40.6T  293.4T     12%   /lustre
```
The following tasks are considered below: rebalancing data, removing an OST, backup/restore, and restoring data from a snapshot.

1. Rebalancing data across OSTs after a new node is added. Example (note lustrerr-OST0005_UUID):
```
lfs df -h
UUID                   bytes   Used   Available  Use%  Mounted on
lustrerr-MDT0000_UUID  108.4G  2.1G   106.2G     2%    /lustre[MDT:0]
lustrerr-OST0000_UUID  55.7T   6.7T   48.9T      12%   /lustre[OST:0]
lustrerr-OST0001_UUID  55.7T   6.8T   48.9T      12%   /lustre[OST:1]
lustrerr-OST0002_UUID  55.7T   6.8T   48.9T      12%   /lustre[OST:2]
lustrerr-OST0003_UUID  55.7T   6.7T   48.9T      12%   /lustre[OST:3]
lustrerr-OST0004_UUID  55.7T   6.9T   48.8T      12%   /lustre[OST:4]
lustrerr-OST0005_UUID  55.7T   52.7T  5.0T       94%   /lustre[OST:5]

filesystem summary:    334.0T  40.6T  293.4T     12%   /lustre
```
Two problems can arise here:

1.1 Writing new data can fail because free space runs out on that single overloaded OST.
1.2 I/O load becomes concentrated on the new node.

Use the following procedure to solve them:
Example:
```
lctl --device N deactivate
lfs find --ost {OST_UUID} -size +1G | lfs_migrate -y
lctl --device N activate
```
2. Removing an OST.

A similar procedure solves this task:

```
lctl --device FS-OST0003_UUID deactivate                   # temporarily deactivate
lfs find --obd FS-OST0003_UUID /lustre | lfs_migrate -y    # migrate the data
lctl conf_param FS-OST0003_UUID.osc.active=0               # permanently deactivate
```
Result:
```
lfs df -h
UUID                   bytes   Used   Available  Use%  Mounted on
lustrerr-MDT0000_UUID  108.4G  2.1G   106.2G     2%    /lustre[MDT:0]
lustrerr-OST0000_UUID  55.7T   6.7T   48.9T      12%   /lustre[OST:0]
lustrerr-OST0001_UUID  55.7T   6.8T   48.9T      12%   /lustre[OST:1]
lustrerr-OST0002_UUID  55.7T   6.8T   48.9T      12%   /lustre[OST:2]
lustrerr-OST0003_UUID  : inactive device
lustrerr-OST0004_UUID  55.7T   6.9T   48.8T      12%   /lustre[OST:4]
lustrerr-OST0005_UUID  55.7T   6.7T   48.9T      12%   /lustre[OST:5]
```
3. Backup and restore.

This is solved with snapshots, which can be sent to other locations. Example of an MDT backup (the OST lines are commented out):
vi /usr/local/bin/snapscript.sh
```
#!/bin/sh
currdate=`/bin/date +%Y-%m-%0e`
olddate=`/bin/date --date="21 days ago" +%Y-%m-%0e`
chk=`zfs list -t snapshot | grep $olddate`
# create the snapshots
/sbin/zfs snapshot rumrrlustre-mdt0msg/mdt0@$currdate
#/sbin/zfs snapshot rumrrlustre-ossN/ost0@$currdate   # must be run on every OST; can also be launched over ssh
# delete 21-day-old snapshots (if they exist)
/sbin/zfs destroy rumrrlustre-mdt0msg/mdt0@$olddate
#/sbin/zfs destroy rumrrlustre-ossN/ost0@$olddate     # for an OST
# back up only the MDT
/sbin/zfs send -p rumrrlustre-mdt0msg/mdt0@$currdate | /bin/gzip > /root/meta-snap.gz
# the MDT and OSTs can also be backed up to a remote node, for example:
# zfs send -R rumrrlustre-mdt0msg/mdt0@$currdate | ssh some-node zfs receive rumrrlustre-mdt0msg/mdt0@$currdate
```
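To make the naming scheme concrete: the script tags each snapshot with the current date and destroys the snapshot from 21 days earlier. A small illustration with a pinned date (2014-10-14 is an arbitrary example; the real script uses "now"):

```shell
# Pin the dates so the example is reproducible
currdate=$(date --date="2014-10-14" +%Y-%m-%0e)
olddate=$(date --date="2014-10-14 21 days ago" +%Y-%m-%0e)

echo "create:  rumrrlustre-mdt0msg/mdt0@$currdate"   # ...@2014-10-14
echo "destroy: rumrrlustre-mdt0msg/mdt0@$olddate"    # ...@2014-09-23
```

So, run daily from cron, the script keeps a rolling 21-day window of snapshots per dataset.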
Restore from backup (example for the MDT only):

```
service lustre stop lustrerr:MDT0000
zfs rename rumrrlustre-mdt0msg/mdt0 rumrrlustre-mdt0msg/mdt0-old
gunzip -c /root/meta-snap.gz | zfs receive rumrrlustre-mdt0msg/mdt0
service lustre start lustrerr:MDT0000
```
Watch the logs:

```
tail -f /var/log/messages
Oct 14 14:12:44 ls-1 kernel: Lustre: lustrerr-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Oct 14 14:13:08 ls-1 kernel: Lustre: 3937:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1413281557/real 1413281557] req@ffff880c60512c00 x1474855950917400/t0(0) o38->lustrerz-MDT0000-mdc-ffff880463edc000@0@lo:12/10 lens 400/544 e 0 to 1 dl 1413281588 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 14 14:13:08 ls-1 kernel: Lustre: 3937:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 71364 previous similar messages
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-MDT0000: Will be in recovery for at least 2:30, or until 1 client reconnects
Oct 14 14:13:08 ls-1 kernel: LustreError: 3937:0:(import.c:1000:ptlrpc_connect_interpret()) lustrerr-MDT0000_UUID went back in time (transno 55834576660 was previously committed, server now claims 55834576659)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-MDT0000-mdc-ffff880463edc000: Connection restored to lustrerz-MDT0000 (at 0@lo)
Oct 14 14:13:08 ls-1 kernel: Lustre: Skipped 1 previous similar message
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-OST0000: deleting orphan objects from 0x0:1571748 to 0x0:1571857
Oct 14 14:13:33 ls-1 kernel: LustreError: 167-0: lustrerz-MDT0000-lwp-OST0000: This client was evicted by lustrerz-MDT0000; in progress operations using this service will fail.
Oct 14 14:13:33 ls-1 kernel: Lustre: lustrerr-MDT0000-lwp-OST0000: Connection restored to lustrerz-MDT0000 (at 0@lo)
```
4. Restoring data from a snapshot.

The same script as for backup/restore is used.
4.1.
vi /usr/local/bin/snapscript.sh
```
#!/bin/sh
currdate=`/bin/date +%Y-%m-%0e`
olddate=`/bin/date --date="21 days ago" +%Y-%m-%0e`
chk=`zfs list -t snapshot | grep $olddate`
# create the snapshots
/sbin/zfs snapshot rumrrlustre-mdt0msg/mdt0@$currdate
/sbin/zfs snapshot rumrrlustre-ossN/ost0@$currdate   # must be run on every OST; can also be launched over ssh
# delete 21-day-old snapshots (if they exist)
/sbin/zfs destroy rumrrlustre-mdt0msg/mdt0@$olddate
/sbin/zfs destroy rumrrlustre-ossN/ost0@$olddate     # for an OST
```
4.2.
For the MDT:

```
zfs clone -o canmount=off -o xattr=sa -o lustre:svname=lustrerr-MDT0000 -o lustre:mgsnode=192.168.5.182@tcp -o lustre:flags=1 -o lustre:fsname=lustrerr -o lustre:index=0 -o lustre:version=1 rumrrlustre-mdt0msg/mdt0@date rumrrlustre-mdt0msg/mdt00
```
4.3.
For an OST (N is the OST number):

```
zfs clone -o canmount=off -o xattr=sa -o lustre:svname=lustrerr-OST000N -o lustre:mgsnode=192.168.5.182@tcp -o lustre:flags=34 -o lustre:fsname=lustrerr -o lustre:index=N -o lustre:version=1 rumrrlustre-ossN/ost0@date rumrrlustre-ossN/ostN0
```
4.4.
Stop Lustre (on all nodes):

```
service lustre stop
```
4.5.
In /etc/ldev.conf (must be edited on all servers; below is an example for the first server):

```
ls-1.scanex.ru - lustrerr:MDT0000 zfs:rumrrlustre-mdt0msg/mdt00
ls-1.scanex.ru - lustrerr:OST000N zfs:rumrrlustre-ossN/ostN0
```
4.6.
Start Lustre (on all nodes):

```
service lustre start
```
4.7.
Copy the data to the chosen location (local or remote paths), then stop Lustre on all nodes:

```
service lustre stop
```
4.8.
Restore the original /etc/ldev.conf and start Lustre:

```
service lustre start
```
4.9.
Copy the data from that location back to Lustre.
4.10.
Delete the ZFS clones with zfs destroy (see the ZFS documentation).