====== Task ======
  * Deploy a file cluster with a volume of 300 TB
  * Provide the ability to expand the volume
  * Provide fast file search
  * Provide read/write at 1 Gbit/s for 10 clients (hosts)
  * Provide access for Linux/Windows hosts
  * Average file size - 50 MB
  * Budget - $53,000
====== Solution - Lustre ======
==== Introduction ====
Lustre is a distributed file system that stores data and metadata separately. The **ext3** file system is used to store data and metadata directly on the hosts. A ZFS-based variant with its own set of capabilities is described in this article: [[http://myitnotes.info/doku.php?id=en:jobs:lustrefszfs|ZFS based Lustre solution]] (we will not consider it here).

**Lustre consists of the following components:**\\
  * **Management Server - MGS**\\ Stores information about all file systems in the cluster (a cluster can have only one MGS) and provides it to the other components.
  * **Metadata Server - MDS**\\ Handles metadata, which is stored on one or more Metadata Targets - MDT (a file system is limited to one MDS).
  * **Object Storage Server - OSS**\\ Handles data, which is stored on one or more Object Storage Targets - OST.
  * **Lustre networking - LNET**\\ The network module that handles communication within the cluster. It can run over Ethernet or %%InfiniBand%%.
  * **Lustre client**\\ Nodes with the Lustre client software installed, which provides access to the cluster file systems.
Below is the diagram from the official documentation.
{{:ru:jobs:lustrecomp.jpg?600|300}}
==== Choosing hardware* ====
* **Everything below refers to MGS/MDS/OSS - %%CentOS%% 6.2 x64 + Lustre Whamcloud 2.1.2\\ Lustre client - %%CentOS%% 5.8 x64 + Lustre Whamcloud 1.8.8\\ A memory leak affecting clients on %%CentOS%% 6.2 + Lustre Whamcloud 2.1.2 forced us to use different versions on the clients and the servers.**\\
Taking the budget restrictions into account, we decided on the following scheme:\\
One node combines the MGS, MDS and OSS roles.\\
Four nodes are dedicated OSS nodes.\\
Memory requirements:\\
Each of the MGS, MDS, OSS and Lustre client nodes should have more than 4 GB of memory. Keep in mind that caching is enabled by default on all nodes, and the Lustre client cache can occupy up to 3/4 of the available memory.\\
We chose 48 GB for the MGS/MDS/OSS node, 24 GB for the OSS nodes and 24 GB for the client nodes. The client value was chosen for extended client-side functionality; in practice 8 GB was enough, while a system with 4 GB showed serious delays.\\
CPU requirements:\\
Lustre is efficient enough to run on fairly slow CPUs - a 4-core Xeon 3000 series would be sufficient, for example - but to minimize possible delays we decided to use the Intel Xeon E5620.\\
MDT HDD requirements:\\
300 TB (storage volume) / 50 MB (average file size) * 2 KB (inode size) ≈ 12.3 GB.\\
To provide high-speed metadata I/O we decided to use SSDs.\\
Network requirements:\\
To meet the required characteristics within the budget we decided to use Ethernet for the %%BackEnd%% network.
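As a cross-check of the MDT sizing above, the same estimate can be reproduced with simple shell arithmetic; the ~2 KB of metadata per file is the planning figure used above, not a measured value:
<code bash>
# Rough MDT sizing: (total capacity / average file size) * ~2 KB of metadata per file
FILES=$((300 * 1000 * 1000 / 50))     # 300 TB expressed in MB, divided by 50 MB -> ~6,000,000 files
MDT_GB=$((FILES * 2 / 1000 / 1000))   # ~2 KB per file, converted to GB
echo "files: $FILES, MDT size: ~${MDT_GB} GB"
</code>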
==== Working scheme and hardware configuration ====
{{:ru:jobs:lustre-cluster.jpg?600|300}}
**Configuration of the MGS/MDS/OSS server:**
^ Type ^ Model ^ Quantity ^
| Chassis | Supermicro SC846TQ-R900B | 1 |
| Motherboard | Supermicro X8DTN+-F | 1 |
| Memory | KVR1333D3E9SK3/12G | 2 |
| CPU | Intel® Xeon® Processor E5620 | 2 |
| RAID controller | Adaptec RAID 52445 | 1 |
| Cable for RAID | Cable SAS SFF8087 - 4*SATA MS36-4SATA-100 | 6 |
| HDD | Seagate ST3000DM001 | 24 |
| Ethernet card, 4 ports | Intel Gigabit Adapter Quad Port (OEM) PCI-E 4x10/100/1000Mbps | 1 |
| SSD | 40 GB SATA-II 300 Intel 320 Series | 2 |
| SSD | 120 GB SATA 6Gb/s Intel 520 Series <SSDSC2CW120A310/01> 2.5" MLC | 2 |
**Configuration of the OSS servers:**
^ Type ^ Model ^ Quantity ^
| Chassis | Supermicro SC846TQ-R900B | 1 |
| Motherboard | Supermicro X8DTN+-F | 1 |
| Memory | KVR1333D3E9SK3/12G | 2 |
| CPU | Intel® Xeon® Processor E5620 | 2 |
| RAID controller | Adaptec RAID 52445 | 1 |
| Cable for RAID | Cable SAS SFF8087 - 4*SATA MS36-4SATA-100 | 6 |
| HDD | Seagate ST3000DM001 | 24 |
| Ethernet card, 4 ports | Intel Gigabit Adapter Quad Port (OEM) PCI-E 4x10/100/1000Mbps | 1 |
| SSD | 40 GB SATA-II 300 Intel 320 Series | 2 |
**Configuration of the Lustre client server:***
* this configuration was already available in our warehouse
^ Type ^ Model ^ Quantity ^
| Server | HP DL160R06 E5520 DL160 G6 E5620 2.40 GHz, 8 GB (590161-421) | 1 |
| CPU | Intel Xeon E5620 (2.40 GHz / 4-core / 80 W / 12 MB) for HP DL160 G6 (589711-B21) | 1 |
| Memory | MEM 4GB (1x4Gb 2Rank) 2Rx4 PC3-10600R-9 Registered DIMM (500658-B21) | 4 |
| HDD | HDD 450GB 15k 6G LFF SAS 3.5" %%HotPlug%% Dual Port Universal Hard Drive (516816-B21) | 2 |
**Network switch:**
^ Type ^ Model ^ Quantity ^
| Switch | Cisco WS-C2960S-24TS-L | 2 |
| Stack module | Cisco Catalyst 2960S %%FlexStack%% Stack Module optional for LAN Base [C2960S-STACK] | 2 |
We also included an HP 36U rack, an APC 5000VA UPS and a KVM switch.
==== OS preparation and tuning ====
A reasonable question is where the SSDs are plugged in, given that the chassis has only 24 hot-swap bays. The answer is that they are connected directly to the motherboard and placed inside the case (there was free space). Our production requirements for this solution allow the hardware to be powered off for 10 minutes. If your requirements are stricter, use only hot-swap drives; if you need 24/7 operation, use fault-tolerant solutions.\\
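Before the step-by-step preparation list that follows, here is a minimal sketch of how the two software RAID1 arrays mentioned there (md0 on the 40 GB SSDs, md1 on the 120 GB SSDs of the MGS/MDS node) might be assembled with mdadm. The device names /dev/sdw-/dev/sdz are assumptions - check how the SSDs actually enumerate on your system first.
<code bash>
# Device names are assumptions - verify with lsblk or fdisk -l before running.
# md0: software RAID1 over the two 40 GB SSDs (all servers)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdy /dev/sdz

# md1: software RAID1 over the two 120 GB SSDs (MGS/MDS node only, used later for the MDT)
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdw /dev/sdx

# Record the arrays so they are assembled automatically on boot
mdadm --detail --scan >> /etc/mdadm.conf
</code>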
  - Create two RAID5 volumes on the RAID controller, each consisting of 12 HDDs.
  - Create software RAID1 md0 from the 40 GB SSDs with mdadm.
  - Create software RAID1 md1 from the 120 GB SSDs on the MGS/MDS node.
  - Create two bonds on each of the MGS/MDS/OSS servers:
    - bond0 (2 ports) - %%FrontEnd%%
    - bond1 (4 ports) - %%BackEnd%%
    - BONDING_OPTS="miimon=100 mode=6"
  - Disable SELinux.
  - Install the following software:
    - //yum install mc openssh-clients openssh-server net-snmp man sysstat rsync htop trafshow nslookup ntp//
  - Configure NTP.
  - Create the same users (uid:gid) on all servers.
  - Tune the TCP parameters in sysctl.conf:\\ # increase Linux TCP buffer limits\\ net.core.rmem_max = 8388608\\ net.core.wmem_max = 8388608\\ # increase default and maximum Linux TCP buffer sizes\\ net.ipv4.tcp_rmem = 4096 262144 8388608\\ net.ipv4.tcp_wmem = 4096 262144 8388608\\ # increase max backlog to avoid dropped packets\\ net.core.netdev_max_backlog = 2500\\ net.ipv4.tcp_mem = 8388608 8388608 8388608\\ net.ipv4.tcp_ecn = 0
==== Installing Lustre ====
**On the servers:**\\
Download the utility packages from http://downloads.whamcloud.com/public/e2fsprogs/1.42.3.wc1/el6/RPMS/x86_64/\\
and Lustre from http://downloads.whamcloud.com/public/lustre/lustre-2.1.2/el6/server/RPMS/x86_64/\\
Install the utilities:\\
rpm -e e2fsprogs-1.41.12-11.el6.x86_64\\
rpm -e e2fsprogs-libs-1.41.12-11.el6.x86_64\\
rpm -Uvh e2fsprogs-libs-1.42.3.wc1-7.el6.x86_64.rpm\\
rpm -Uvh e2fsprogs-1.42.3.wc1-7.el6.x86_64.rpm\\
rpm -Uvh libss-1.42.3.wc1-7.el6.x86_64.rpm\\
rpm -Uvh libcom_err-1.42.3.wc1-7.el6.x86_64.rpm\\
Install Lustre:\\
rpm -ivh kernel-firmware-2.6.32-220.el6_lustre.g4554b65.x86_64.rpm\\
rpm -ivh kernel-2.6.32-220.el6_lustre.g4554b65.x86_64.rpm\\
rpm -ivh lustre-ldiskfs-3.3.0-2.6.32_220.el6_lustre.g4554b65.x86_64.x86_64.rpm\\
rpm -ivh perf-2.6.32-220.el6_lustre.g4554b65.x86_64.rpm\\
rpm -ivh lustre-modules-2.1.1-2.6.32_220.el6_lustre.g4554b65.x86_64.x86_64.rpm\\
rpm -ivh lustre-2.1.1-2.6.32_220.el6_lustre.g4554b65.x86_64.x86_64.rpm\\
Check /boot/grub/grub.conf to make sure the Lustre kernel is the default boot entry.\\
Configure the LNET network:\\
echo "options lnet networks=tcp0(bond1)" > /etc/modprobe.d/lustre.conf\\
reboot

**On the clients:**\\
Update the kernel:\\
yum update kernel-2.6.18-308.4.1.el5.x86_64\\
reboot\\
Download the utility packages from http://downloads.whamcloud.com/public/e2fsprogs/1.41.90.wc4/el5/x86_64/\\
and Lustre from http://downloads.whamcloud.com/public/lustre/lustre-1.8.8-wc1/el5/client/RPMS/x86_64/\\
Install the utilities:\\
rpm -Uvh --nodeps e2fsprogs-1.41.90.wc4-0redhat.x86_64.rpm\\
rpm -ivh uuidd-1.41.90.wc4-0redhat.x86_64.rpm\\
Install Lustre:\\
rpm -ivh lustre-client-modules-1.8.8-wc1_2.6.18_308.4.1.el5_gbc88c4c.x86_64.rpm\\
rpm -ivh lustre-client-1.8.8-wc1_2.6.18_308.4.1.el5_gbc88c4c.x86_64.rpm\\
Configure the LNET network:\\
echo "options lnet networks=tcp0(eth1)" > /etc/modprobe.d/lustre.conf\\
reboot
==== Deploying Lustre ====
**On the MGS/MDS/OSS server:**\\
//mkfs.lustre --fsname=FS --mgs --mdt --index=0 /dev/md1// (where /dev/md1 is the software RAID1 array)\\
//mkdir /mdt//\\
//mount -t lustre /dev/md1 /mdt//\\
//echo "/dev/md1 /mdt lustre defaults,_netdev 0 0" >> /etc/fstab//\\
//mkfs.lustre --fsname=FS --mgsnode=10.255.255.1@tcp0 --ost --index=0 /dev/sda// (where /dev/sda is a RAID5 volume)\\
//mkfs.lustre --fsname=FS --mgsnode=10.255.255.1@tcp0 --ost --index=1 /dev/sdb// (where /dev/sdb is a RAID5 volume)\\
//mkdir /ost0//\\
//mkdir /ost1//\\
//mount -t lustre /dev/sda /ost0//\\
//mount -t lustre /dev/sdb /ost1//\\
//echo "/dev/sda /ost0 lustre defaults,_netdev 0 0" >> /etc/fstab//\\
//echo "/dev/sdb /ost1 lustre defaults,_netdev 0 0" >> /etc/fstab//\\
//mkdir /FS//\\
//mount -t lustre 10.255.255.1@tcp0:/FS /FS//\\
//echo "10.255.255.1@tcp0:/FS /FS lustre defaults,_netdev 0 0" >> /etc/fstab//
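Before moving on to the remaining OSS nodes, a quick sanity check that LNET is up and the new targets are mounted can look roughly like this (the NID shown is just an example and will differ in your setup):
<code bash>
# NIDs this node announces over LNET (should show bond1's address, e.g. 10.255.255.1@tcp)
lctl list_nids

# Local Lustre devices - the MGS, MDT and OST entries should be in the UP state
lctl dl

# Capacity as seen through the mounted client
lfs df -h /FS
</code>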
**On the OSS servers:**\\
//mkfs.lustre --fsname=FS --mgsnode=10.255.255.1@tcp0 --ost --index=N /dev/sda// (where N is the node number and /dev/sda is a RAID5 volume)\\
//mkfs.lustre --fsname=FS --mgsnode=10.255.255.1@tcp0 --ost --index=N+1 /dev/sdb// (where /dev/sdb is a RAID5 volume)\\
//mkdir /ostN//\\
//mkdir /ostN+1//\\
//mount -t lustre /dev/sda /ostN//\\
//mount -t lustre /dev/sdb /ostN+1//\\
//echo "/dev/sda /ostN lustre defaults,_netdev 0 0" >> /etc/fstab//\\
//echo "/dev/sdb /ostN+1 lustre defaults,_netdev 0 0" >> /etc/fstab//\\
//mkdir /FS//\\
//mount -t lustre 10.255.255.1@tcp0:/FS /FS//\\
//echo "10.255.255.1@tcp0:/FS /FS lustre defaults,_netdev 0 0" >> /etc/fstab//

**You can also release the 5% of reserved space on each OST without shutting down the service:**\\
//tune2fs -m 0 /dev/sda//

**On the clients:**\\
//mkdir /FS//\\
//mount -t lustre 10.255.255.1@tcp0:/FS /FS//\\
//echo "10.255.255.1@tcp0:/FS /FS lustre defaults,_netdev 0 0" >> /etc/fstab//

Now you can display the state of the system:\\
//lfs df -h//\\
//FS-MDT0000_UUID 83.8G 2.2G 76.1G 3% /FS[MDT:0]//\\
//FS-OST0000_UUID 30.0T 28.6T 1.4T 95% /FS[OST:0]//\\
//FS-OST0001_UUID 30.0T 28.7T 1.3T 96% /FS[OST:1]//\\
//FS-OST0002_UUID 30.0T 28.6T 1.3T 96% /FS[OST:2]//\\
//FS-OST0003_UUID 30.0T 28.7T 1.3T 96% /FS[OST:3]//\\
//FS-OST0004_UUID 30.0T 28.3T 1.7T 94% /FS[OST:4]//\\
//FS-OST0005_UUID 30.0T 28.2T 1.8T 94% /FS[OST:5]//\\
//FS-OST0006_UUID 30.0T 28.3T 1.7T 94% /FS[OST:6]//\\
//FS-OST0007_UUID 30.0T 28.2T 1.7T 94% /FS[OST:7]//\\
//FS-OST0008_UUID 30.0T 28.3T 1.7T 94% /FS[OST:8]//\\
//FS-OST0009_UUID 30.0T 28.2T 1.8T 94% /FS[OST:9]//
==== Working with Lustre ====
This area is covered in detail in the official [[http://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.pdf|manual]]; here we touch on only two tasks.

1. Rebalancing data across OSTs after a new node has been added.\\
**Example:**\\
//FS-MDT0000_UUID 83.8G 2.2G 76.1G 3% /FS[MDT:0]//\\
//FS-OST0000_UUID 30.0T 28.6T 1.4T 95% /FS[OST:0]//\\
//FS-OST0001_UUID 30.0T 28.7T 1.3T 96% /FS[OST:1]//\\
//FS-OST0002_UUID 30.0T 28.6T 1.3T 96% /FS[OST:2]//\\
//FS-OST0003_UUID 30.0T 28.7T 1.3T 96% /FS[OST:3]//\\
//FS-OST0004_UUID 30.0T 28.3T 1.7T 94% /FS[OST:4]//\\
//FS-OST0005_UUID 30.0T 28.2T 1.8T 94% /FS[OST:5]//\\
//FS-OST0006_UUID 30.0T 28.3T 1.7T 94% /FS[OST:6]//\\
//FS-OST0007_UUID 30.0T 28.2T 1.7T 94% /FS[OST:7]//\\
//FS-OST0008_UUID 30.0T 28.3T 1.7T 94% /FS[OST:8]//\\
//FS-OST0009_UUID 30.0T 28.2T 1.8T 94% /FS[OST:9]//\\
//FS-OST000a_UUID 30.0T 2.1T 27.9T 7% /FS[OST:10]//\\
//FS-OST000b_UUID 30.0T 2.2T 27.8T 7% /FS[OST:11]//\\
Two problems can arise:\\
1.1 Writes of new data can fail because of a lack of free space on just one of the OSTs.\\
1.2 The I/O load concentrates on the new node.\\
Use the following algorithm to solve these problems:
  * Deactivate the full OSTs (a deactivated OST remains available read-only).
  * Migrate data to the OSTs with more free space.
  * Activate the OSTs again.
Example:\\
//lctl --device N deactivate//\\
//lfs find --ost {OST_UUID} -size +1G | lfs_migrate -y//\\
//lctl --device N activate//

2. Backup. You will need to stop writes by deactivating the target and then take the backup, or use an LVM2 snapshot, but production will still go down.
==== Afterword ====
At the moment I recommend Lustre 1.8.8-wc4 on %%CentOS%% 5.8 as the stable combination.\\
Special thanks to Whamcloud for the exhaustive [[http://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.pdf|documentation]].\\
==== About the author ====
[[https://www.linkedin.com/pub/alexey-vyrodov/59/976/16b|Profile]] of the author