Task:

* Deploy a 300 TB file cluster
* Provide the ability to expand the volume
* Provide fast file search
* Provide 1 Gbit/s read/write for 10 clients (hosts)
* Provide access for Linux/Windows hosts
* Average file size - 50 MB
* Budget - $53,000

Solution - Lustre

Introduction.

Lustre is a distributed file system that stores data and metadata separately. Data and metadata are stored directly on the server nodes in an ext3-based file system. A ZFS-based Lustre variant (using the zfs file system and its associated features) also exists, but it is not considered in this article.

Lustre consists of following components:

  • Management Server - MGS

Stores information about all file systems in the cluster (a cluster can have only one MGS) and provides this information to the other components.

  • Metadata Server - MDS

Handles metadata, which is stored on one or more Metadata Targets - MDT (a single file system is limited to one MDS).

  • Object Storage Server - OSS

Handles file data, which is stored on one or more Object Storage Targets - OST.

  • Lustre networking - LNET

The network module that handles communication within the cluster. It can run over Ethernet or InfiniBand.

  • Lustre client

Nodes with the Lustre client software installed, which provides access to the cluster file systems.

Below is the diagram from the official documentation.

[Lustre architecture diagram]

Choosing hardware*.

* Everything below applies to MGS/MDS/OSS nodes running CentOS 6.2 x64 + Lustre Whamcloud 2.1.2 and to Lustre clients running CentOS 5.8 x64 + Lustre Whamcloud 1.8.8. A memory-leak problem with CentOS 6.2 + Lustre Whamcloud 2.1.2 on clients forced us to use different versions on the clients and the servers.

Taking the budget restrictions into account, we decided on the following scheme:
One node combines the MGS, MDS, and OSS roles.
Four nodes act as OSS only.

Memory requirements:
Each MGS, MDS, OSS, and Lustre client node should have more than 4 GB of memory. Caching is enabled by default on all nodes and must be taken into account: the cache on a Lustre client can occupy up to 3/4 of the available memory.
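
If the client cache needs to be capped (for example on a machine with little memory), the llite cache limit can be lowered at run time. This is only a sketch: the 2048 MB value is an arbitrary example, and the parameter name should be checked against your Lustre version.

lctl set_param llite.*.max_cached_mb=2048
lctl get_param llite.*.max_cached_mb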

CPU requirements:
Lustre is efficient enough to run on slow CPUs; a 4-core Xeon 3000 would be sufficient. To minimize possible delays, however, we decided to use the Intel Xeon E5620.

For the MGS/MDS/OSS node - 48 GB
For the OSS nodes - 24 GB
For the client nodes - 24 GB. This value was chosen because of the extended workload on the clients; in fact 8 GB was enough, while a system with 4 GB of memory showed serious delays.

MDT size requirement:
300 TB total volume / 50 MB average file size * 2 KB per inode ≈ 12 GB (12,288 MB).
To provide high-speed I/O for metadata we decided to use SSDs.
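
A quick back-of-the-envelope check of that figure (2 KB per inode, 50 MB average file size, and binary units are the assumptions):

echo "300*1024*1024/50*2/1024/1024" | bc   # 300 TB -> MB, / 50 MB per file, * 2 KB per inode, -> GB: prints 12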

Network requirements:
To meet the required characteristics within the budget restrictions, we decided to use Ethernet for the back-end network.

Working scheme and hardware configuration.


[Working scheme diagram]

Configuration of MGS/MDS/OSS server:

Type | Model | Quantity
Chassis | Supermicro SC846TQ-R900B | 1
MotherBoard | Supermicro X8DTN+-F | 1
Memory | KVR1333D3E9SK3/12G | 2
CPU | Intel® Xeon® Processor E5620 | 2
RAID controller | Adaptec RAID 52445 | 1
Cable for RAID | Cable SAS SFF8087 - 4*SATA MS36-4SATA-100 | 6
HDD | Seagate ST3000DM001 | 24
Ethernet card | 4-port Intel <E1G44HT> Gigabit Adapter Quad Port (OEM) PCI-E 4x10/100/1000 Mbps | 1
SSD 40 GB | SATA-II 300 Intel 320 Series <SSDSA2CT040G3K5> | 2
SSD 120 GB | SATA 6 Gb/s Intel 520 Series <SSDSC2CW120A310/01> 2.5" MLC | 2

Configuration of OSS server:

Type | Model | Quantity
Chassis | Supermicro SC846TQ-R900B | 1
MotherBoard | Supermicro X8DTN+-F | 1
Memory | KVR1333D3E9SK3/12G | 2
CPU | Intel® Xeon® Processor E5620 | 2
RAID controller | Adaptec RAID 52445 | 1
Cable for RAID | Cable SAS SFF8087 - 4*SATA MS36-4SATA-100 | 6
HDD | Seagate ST3000DM001 | 24
Ethernet card | 4-port Intel <E1G44HT> Gigabit Adapter Quad Port (OEM) PCI-E 4x10/100/1000 Mbps | 1
SSD 40 GB | SATA-II 300 Intel 320 Series <SSDSA2CT040G3K5> | 2

Configuration of the Lustre client server:*
* this configuration was already available in our storehouse

Type | Model | Quantity
Server | HP DL160R06 E5520 DL160 G6 E5620 2.40 GHz, 8 GB (590161-421) | 1
CPU | Intel Xeon E5620 (2.40 GHz / 4-core / 80 W / 12 MB) for HP DL160 G6 (589711-B21) | 1
Memory | MEM 4GB (1x4Gb 2Rank) 2Rx4 PC3-10600R-9 Registered DIMM (500658-B21) | 4
HDD | 450GB 15k 6G LFF SAS 3.5" HotPlug Dual Port Universal Hard Drive (516816-B21) | 2

Network Switch:

Type | Model | Quantity
Switch | Cisco WS-C2960S-24TS-L | 2
Stack module | Cisco Catalyst 2960S FlexStack Stack Module, optional for LAN Base [C2960S-STACK] | 2

We also included an HP 36U rack, an APC 5000VA UPS, and a KVM switch.

OS preparation and tuning

A fair question is where the SSDs are plugged in, given that the chassis has only 24 hot-swap bays. The answer is that they are connected to the motherboard and placed inside the server (there was free space). Our production requirements for this solution allow the hardware to be powered off for 10 minutes. If your requirements are stricter, use only hot-swap drives; if they call for 24/7 operation, use a fault-tolerant solution.

  1. Create two RAID5 volumes on the RAID controller, each consisting of 12 HDDs.
  2. Create software RAID1 md0 from the 40 GB SSDs with mdadm.
  3. Create software RAID1 md1 from the 120 GB SSDs on the MGS/MDS node (a sketch of the mdadm and bonding setup for steps 2-4 follows the sysctl block below).
  4. Create two bonds on each MGS/MDS/OSS server:
    1. bond0 (2 ports) - FrontEnd
    2. bond1 (4 ports) - BackEnd
    3. BONDING_OPTS="miimon=100 mode=6"
  5. Disable SELinux.
  6. Install the following software:
    1. yum install mc openssh-clients openssh-server net-snmp man sysstat rsync htop trafshow nslookup ntp
  7. Configure ntp.
  8. Create the same users on all servers (matching uid:gid).
  9. Tune TCP parameters in sysctl.conf:


   # increase Linux TCP buffer limits
   net.core.rmem_max = 8388608
   net.core.wmem_max = 8388608
   # increase default and maximum Linux TCP buffer sizes
   net.ipv4.tcp_rmem = 4096 262144 8388608
   net.ipv4.tcp_wmem = 4096 262144 8388608
   # increase max backlog to avoid dropped packets
   net.core.netdev_max_backlog = 2500
   net.ipv4.tcp_mem = 8388608 8388608 8388608
   # disable explicit congestion notification
   net.ipv4.tcp_ecn = 0
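
For steps 2-4, a minimal sketch of the mdadm and bonding setup described above. The SSD device names (/dev/sdy, /dev/sdz) are placeholders, the bond1 address matches the MGS node used later, and the slave interface layout depends on your NICs.

# software RAID1 for the MDT (the two 120 GB SSDs on the MGS/MDS node); device names are placeholders
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdy /dev/sdz
mdadm --detail --scan >> /etc/mdadm.conf

# /etc/sysconfig/network-scripts/ifcfg-bond1 (BackEnd bond), CentOS 6 style:
#   DEVICE=bond1
#   ONBOOT=yes
#   BOOTPROTO=none
#   IPADDR=10.255.255.1
#   NETMASK=255.255.255.0
#   BONDING_OPTS="miimon=100 mode=6"
# each slave port gets MASTER=bond1 and SLAVE=yes in its own ifcfg file

# apply the sysctl settings above without rebooting
sysctl -p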

Installing Lustre

On the servers:

Download the e2fsprogs packages: http://downloads.whamcloud.com/public/e2fsprogs/1.42.3.wc1/el6/RPMS/x86_64/
and Lustre: http://downloads.whamcloud.com/public/lustre/lustre-2.1.2/el6/server/RPMS/x86_64/

Install utilities:
rpm -e e2fsprogs-1.41.12-11.el6.x86_64
rpm -e e2fsprogs-libs-1.41.12-11.el6.x86_64
rpm -Uvh e2fsprogs-libs-1.42.3.wc1-7.el6.x86_64.rpm
rpm -Uvh e2fsprogs-1.42.3.wc1-7.el6.x86_64.rpm
rpm -Uvh libss-1.42.3.wc1-7.el6.x86_64.rpm
rpm -Uvh libcom_err-1.42.3.wc1-7.el6.x86_64.rpm

Install Lustre:
rpm -ivh kernel-firmware-2.6.32-220.el6_lustre.g4554b65.x86_64.rpm
rpm -ivh kernel-2.6.32-220.el6_lustre.g4554b65.x86_64.rpm
rpm -ivh lustre-ldiskfs-3.3.0-2.6.32_220.el6_lustre.g4554b65.x86_64.x86_64.rpm
rpm -ivh perf-2.6.32-220.el6_lustre.g4554b65.x86_64.rpm
rpm -ivh lustre-modules-2.1.1-2.6.32_220.el6_lustre.g4554b65.x86_64.x86_64.rpm
rpm -ivh lustre-2.1.1-2.6.32_220.el6_lustre.g4554b65.x86_64.x86_64.rpm

Check /boot/grub/grub.conf to make sure the Lustre kernel is the default boot entry.
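
For example, a quick way to see which entry boots by default and which kernels are installed (just a convenience check):

grep '^default' /boot/grub/grub.conf
grep '^title' /boot/grub/grub.conf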

Configure the LNET network:
echo "options lnet networks=tcp0(bond1)" > /etc/modprobe.d/lustre.conf

reboot
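
After the reboot you can confirm that LNET picked up the bonded interface. lctl list_nids prints the local NID and lctl ping checks reachability of a peer; 10.255.255.2 is an example address of another node.

modprobe lnet
lctl network up
lctl list_nids                 # should print something like 10.255.255.1@tcp
lctl ping 10.255.255.2@tcp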

On the clients:

Update the kernel:
yum update kernel-2.6.18-308.4.1.el5.x86_64
reboot

Download the e2fsprogs packages: http://downloads.whamcloud.com/public/e2fsprogs/1.41.90.wc4/el5/x86_64/
and Lustre: http://downloads.whamcloud.com/public/lustre/lustre-1.8.8-wc1/el5/client/RPMS/x86_64/

Install utilities:
rpm -Uvh --nodeps e2fsprogs-1.41.90.wc4-0redhat.x86_64.rpm
rpm -ivh uuidd-1.41.90.wc4-0redhat.x86_64.rpm

Install Lustre:
rpm -ivh lustre-client-modules-1.8.8-wc1_2.6.18_308.4.1.el5_gbc88c4c.x86_64.rpm
rpm -ivh lustre-client-1.8.8-wc1_2.6.18_308.4.1.el5_gbc88c4c.x86_64.rpm

Configure the LNET network:
echo "options lnet networks=tcp0(eth1)" > /etc/modprobe.d/lustre.conf

reboot

Deploying Lustre

On the MGS/MDS/OSS server:

mkfs.lustre --fsname=FS --mgs --mdt --index=0 /dev/md1   (/dev/md1 is the software RAID1)
mkdir /mdt
mount -t lustre /dev/md1 /mdt
echo "/dev/md1 /mdt lustre defaults,_netdev 0 0" >> /etc/fstab

mkfs.lustre --fsname=FS --mgsnode=10.255.255.1@tcp0 --ost --index=0 /dev/sda   (where /dev/sda is a RAID5 volume)
mkfs.lustre --fsname=FS --mgsnode=10.255.255.1@tcp0 --ost --index=1 /dev/sdb   (where /dev/sdb is a RAID5 volume)

mkdir /ost0
mkdir /ost1

mount -t lustre /dev/sda /ost0
mount -t lustre /dev/sdb /ost1

echo "/dev/sda /ost0 lustre defaults,_netdev 0 0" >> /etc/fstab
echo "/dev/sdb /ost1 lustre defaults,_netdev 0 0" >> /etc/fstab

mkdir /FS
mount -t lustre 10.255.255.1@tcp0:/FS /FS
echo "10.255.255.1@tcp0:/FS /FS lustre defaults,_netdev 0 0" >> /etc/fstab
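
Before moving on to the other OSS nodes, you can check that the MGS, MDT, and the two local OSTs are up (a sketch; the exact output will differ):

lctl dl                 # lists local Lustre devices (mgs, mdt, obdfilter, ...)
mount | grep lustre     # shows the /mdt, /ost0, /ost1 and /FS mounts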

On the OSS servers:

mkfs.lustre --fsname=FS --mgsnode=10.255.255.1@tcp0 --ost --index=N /dev/sda   (where N is the node number and /dev/sda is a RAID5 volume)
mkfs.lustre --fsname=FS --mgsnode=10.255.255.1@tcp0 --ost --index=N+1 /dev/sdb   (where /dev/sdb is a RAID5 volume)

mkdir /ostN
mkdir /ostN+1

mount -t lustre /dev/sda /ostN
mount -t lustre /dev/sdb /ostN+1

echo "/dev/sda /ostN lustre defaults,_netdev 0 0" >> /etc/fstab
echo "/dev/sdb /ostN+1 lustre defaults,_netdev 0 0" >> /etc/fstab

mkdir /FS
mount -t lustre 10.255.255.1@tcp0:/FS /FS
echo "10.255.255.1@tcp0:/FS /FS lustre defaults,_netdev 0 0" >> /etc/fstab

You can also reclaim the reserved 5% of space on each OST without shutting down the service:
tune2fs -m 0 /dev/sda

On the clients:

mkdir /FS
mount -t lustre 10.255.255.1@tcp0:/FS /FS
echo "10.255.255.1@tcp0:/FS /FS lustre defaults,_netdev 0 0" >> /etc/fstab

Now you can display the system state:
lfs df -h
FS-MDT0000_UUID 83.8G 2.2G 76.1G 3% /FS[MDT:0]
FS-OST0000_UUID 30.0T 28.6T 1.4T 95% /FS[OST:0]
FS-OST0001_UUID 30.0T 28.7T 1.3T 96% /FS[OST:1]
FS-OST0002_UUID 30.0T 28.6T 1.3T 96% /FS[OST:2]
FS-OST0003_UUID 30.0T 28.7T 1.3T 96% /FS[OST:3]
FS-OST0004_UUID 30.0T 28.3T 1.7T 94% /FS[OST:4]
FS-OST0005_UUID 30.0T 28.2T 1.8T 94% /FS[OST:5]
FS-OST0006_UUID 30.0T 28.3T 1.7T 94% /FS[OST:6]
FS-OST0007_UUID 30.0T 28.2T 1.7T 94% /FS[OST:7]
FS-OST0008_UUID 30.0T 28.3T 1.7T 94% /FS[OST:8]
FS-OST0009_UUID 30.0T 28.2T 1.8T 94% /FS[OST:9]

Working with Lustre

This topic is covered in detail in the official Manual, so here we dwell on only two tasks:

1. Rebalancing data across OSTs after a new node has been added
Example:

FS-MDT0000_UUID 83.8G 2.2G 76.1G 3% /FS[MDT:0]
FS-OST0000_UUID 30.0T 28.6T 1.4T 95% /FS[OST:0]
FS-OST0001_UUID 30.0T 28.7T 1.3T 96% /FS[OST:1]
FS-OST0002_UUID 30.0T 28.6T 1.3T 96% /FS[OST:2]
FS-OST0003_UUID 30.0T 28.7T 1.3T 96% /FS[OST:3]
FS-OST0004_UUID 30.0T 28.3T 1.7T 94% /FS[OST:4]
FS-OST0005_UUID 30.0T 28.2T 1.8T 94% /FS[OST:5]
FS-OST0006_UUID 30.0T 28.3T 1.7T 94% /FS[OST:6]
FS-OST0007_UUID 30.0T 28.2T 1.7T 94% /FS[OST:7]
FS-OST0008_UUID 30.0T 28.3T 1.7T 94% /FS[OST:8]
FS-OST0009_UUID 30.0T 28.2T 1.8T 94% /FS[OST:9]
FS-OST000a_UUID 30.0T 2.1T 27.9T 7% /FS[OST:10]
FS-OST000b_UUID 30.0T 2.2T 27.8T 7% /FS[OST:11]

Two problems can arise:
1.1 Writes of new data can fail because just one of the OSTs runs out of free space.
1.2 I/O load increases on the new node.

Use the following algorithm to solve these problems:

  • Deactivate the full OST (the OST remains available read-only)
  • Migrate data to OSTs with more free space
  • Reactivate the OST

Example:
lctl --device N deactivate
lfs find /FS --ost {OST_UUID} -size +1G | lfs_migrate -y
lctl --device N activate
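
The device number N is not the OST index; it can be looked up with lctl dl on the node where you run the commands (on the MDS an OST appears as an osc device). FS-OST000a here is just the example OST from the listing above.

lctl dl | grep FS-OST000a      # N is the number in the first column of the matching line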

2. Backup

You need to stop writes to the target (by deactivating it) and take the backup afterwards, or use an LVM2 snapshot, in which case production will go down.
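
A minimal sketch of the first approach for a single OST, using a raw device-level copy. The device number 7, the /ost0 target, and the /backup destination are all assumptions; the OST must stay unmounted while it is copied.

# on the MDS: stop new object allocation to this OST
lctl --device 7 deactivate
# on the OSS that serves it: unmount the target and copy the block device
umount /ost0
dd if=/dev/sda of=/backup/ost0.img bs=1M
mount -t lustre /dev/sda /ost0
# on the MDS: bring the OST back
lctl --device 7 activate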

Afterword

At the moment I recommend Lustre 1.8.8 wc4 with CentOS 5.8 as the stable combination.
Special thanks to Whamcloud for the exhaustive documentation (Manual).

About author

Profile of the author
