Background
Normally, we run our servers on RAID1 and add disks a pair at a time. Since we also run DRBD on top of our LVM LVs with 3-way replication, there are 3 servers in play, so new disks have to be added in groups of 6, which gets expensive.
So, with our new servers, we are looking into replacing the RAID1s (grouped into a single VG) with a single RAID5 under the LVM PV (still in a single VG). Linux RAID5 arrays can be expanded on the fly, which lets us grow the data disk by adding only 1 disk at a time (3 across the 3 servers). The bulk of the application servers are not I/O intensive, so we are not worried about the RAID5 write performance hit.
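To make the arithmetic behind that decision concrete, here is a rough sketch. The function names are made up for illustration, and it assumes equal-sized members, with RAID5 giving up one member's worth of space to parity.

```shell
#!/bin/sh
# Usable capacity of a RAID5 array: one member's worth of space holds parity.
# Arguments: number of members, size of each member (any unit).
raid5_usable() {
    echo $(( ($1 - 1) * $2 ))
}

# Disks needed for one expansion step across all DRBD replicas.
# Arguments: disks added per server, number of replicated servers.
disks_per_expansion() {
    echo $(( $1 * $2 ))
}

echo "RAID1 step: $(disks_per_expansion 2 3) disks"
echo "RAID5 step: $(disks_per_expansion 1 3) disks"
echo "3 x 2 GB members: $(raid5_usable 3 2) GB usable"
echo "4 x 2 GB members: $(raid5_usable 4 2) GB usable"
```

Each RAID5 expansion step costs half as many disks as a RAID1 step, and every added disk contributes its full size to usable space.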
Here’s the setup we used to test this process:
- Ubuntu 8.04 LTS Server
- AMD 64 x2 5600+ with 8GB
- 3 x 1TB SATA2 drives, 2 x 750GB SATA2 drives
- Xen’ified 2.6.18 kernel with Xen 3.3.1-rc4
- Linux RAID, LVM2, etc, etc.
We sized the RAID partitions on the data disks at 2GB each, just so the sync doesn’t take *forever*. The partition table on one of the disks:
Disk /dev/sde: 750.1 GB, 750156374016 bytes
255 heads, 63 sectors/track, 91201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00046fcd
Device      Boot   Start   End    Blocks     Id   System
/dev/sde1          1       1217   9775521    fd   Linux raid autodetect
/dev/sde2          1218    1704   3911827+   82   Linux swap / Solaris
/dev/sde3          1705    1946   1959930    fd   Linux raid autodetect
The initial RAID5 configuration (from /proc/mdstat):
md1 : active raid5 sda3[0] sdc3[2] sdb3[1]
3919616 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
LVM configuration:
root@xen-80-31:~# pvdisplay /dev/md1
--- Physical volume ---
PV Name /dev/md1
VG Name testvg
PV Size 3.74 GB / not usable 3.75 MB
Allocatable yes
PE Size (KByte) 4096
Total PE 956
Free PE 956
Allocated PE 0
PV UUID h6qBlQ-RCy3-YeE9-zQXw-j1oa-bg8K-2JULo1
root@xen-80-31:~# vgdisplay testvg
--- Volume group ---
VG Name testvg
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 1
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 0
Open LV 0
Max PV 0
Cur PV 1
Act PV 1
VG Size 3.73 GB
PE Size 4.00 MB
Total PE 956
Alloc PE / Size 0 / 0
Free PE / Size 956 / 3.73 GB
VG UUID iOVFKf-8iSV-k1VK-G37u-Ivns-9uqx-vZCduc
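The extent count above follows from the array size. As a rough check, with `pv_extents` being a name made up for this sketch, and ignoring the small metadata area LVM keeps on the PV:

```shell
#!/bin/sh
# Whole physical extents that fit in a PV.
# Arguments: PV size in KiB, extent size in KiB.
# Simplification: ignores the LVM metadata area at the start of the PV.
pv_extents() {
    echo $(( $1 / $2 ))
}

# 3919616 KiB array (from /proc/mdstat) with 4096 KiB extents.
pv_extents 3919616 4096   # prints 956
```

The 3840 KiB left over after 956 whole extents is the "not usable 3.75 MB" that pvdisplay reports.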
Process to expand the data device
Initial RAID5 configuration:
root@xen-80-31:~# mdadm --detail /dev/md1
/dev/md1:
Version : 00.90.03
Creation Time : Thu Jan 8 04:33:16 2009
Raid Level : raid5
Array Size : 3919616 (3.74 GiB 4.01 GB)
Used Dev Size : 1959808 (1914.20 MiB 2006.84 MB)
Raid Devices : 3
Total Devices : 3
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Thu Jan 8 17:52:46 2009
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : 8fd9e7d9:0dae82af:b836248b:2f509f91
Events : 0.4
Number   Major   Minor   RaidDevice   State
0        8       3       0            active sync   /dev/sda3
1        8       19      1            active sync   /dev/sdb3
2        8       35      2            active sync   /dev/sdc3
Add the new 2GB device to the RAID5 — it will show up as a “spare” device initially:
root@xen-80-31:~# mdadm --add /dev/md1 /dev/sdd3
mdadm: added /dev/sdd3
root@xen-80-31:~# mdadm --detail /dev/md1
/dev/md1:
Version : 00.90.03
Creation Time : Thu Jan 8 04:33:16 2009
Raid Level : raid5
Array Size : 3919616 (3.74 GiB 4.01 GB)
Used Dev Size : 1959808 (1914.20 MiB 2006.84 MB)
Raid Devices : 3
Total Devices : 4
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Thu Jan 8 17:52:46 2009
State : clean
Active Devices : 3
Working Devices : 4
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 64K
UUID : 8fd9e7d9:0dae82af:b836248b:2f509f91
Events : 0.4
Number   Major   Minor   RaidDevice   State
0        8       3       0            active sync   /dev/sda3
1        8       19      1            active sync   /dev/sdb3
2        8       35      2            active sync   /dev/sdc3
3        8       51      -            spare         /dev/sdd3
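Before growing, it can be worth confirming that the new partition really did register as a spare. A minimal sketch, assuming the `Spare Devices : N` line format shown above. `spare_count` and `has_spare` are hypothetical helper names, not mdadm features.

```shell
#!/bin/sh
# Read `mdadm --detail` output on stdin and print the "Spare Devices" count.
spare_count() {
    awk -F: '/Spare Devices/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Succeed only if the output on stdin reports at least one spare.
has_spare() {
    n=$(spare_count)
    [ "${n:-0}" -ge 1 ]
}

# Usage (needs root and a real array):
#   mdadm --detail /dev/md1 | has_spare && echo "spare present, OK to grow"
```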
Grow the RAID5 to include the new device. By this point the VG already holds one 1GB LV, test1, as lvs shows:
root@xen-80-31:~# lvs
LV VG Attr LSize Origin Snap% Move Log Copy%
test1 testvg -wi-a- 1.00G
root@xen-80-31:~# mdadm --grow /dev/md1 --raid-devices=4
mdadm: Need to backup 384K of critical section..
mdadm: ... critical section passed.
/proc/mdstat while the reshape is running:
md1 : active raid5 sdd3[3] sda3[0] sdc3[2] sdb3[1]
3919616 blocks super 0.91 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
[=>...................] reshape = 7.9% (156192/1959808) finish=1.3min speed=22313K/sec
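The progress line carries enough information to estimate the remaining time yourself. A small sketch, assuming the line format shown above (block counts in 1K blocks, speed in K/sec). `reshape_eta_seconds` is a made-up helper name.

```shell
#!/bin/sh
# Read /proc/mdstat-style text on stdin and print the estimated seconds
# remaining for a running reshape, computed as (total - current) / speed.
reshape_eta_seconds() {
    sed -n 's|.*reshape *= *[0-9.]*% *(\([0-9]*\)/\([0-9]*\)).*speed=\([0-9]*\)K/sec.*|\1 \2 \3|p' |
    while read -r cur total speed; do
        # Skip the estimate if the kernel reports a speed of zero.
        [ "$speed" -gt 0 ] && echo $(( (total - cur) / speed ))
    done
}

# Usage on a live system:
#   reshape_eta_seconds < /proc/mdstat
```

For the line above this gives about 80 seconds, in line with the reported finish=1.3min.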
While the reshape is running, the VG stays active and usable. For example, we can create a new LV:
root@xen-80-31:~# lvcreate -L 1g -n test2 testvg
Logical volume "test2" created
root@xen-80-31:~# lvs
LV VG Attr LSize Origin Snap% Move Log Copy%
test1 testvg -wi-a- 1.00G
test2 testvg -wi-a- 1.00G
After the reshape completes:
root@xen-80-31:~# mdadm --detail /dev/md1
/dev/md1:
Version : 00.90.03
Creation Time : Thu Jan 8 04:33:16 2009
Raid Level : raid5
Array Size : 5879424 (5.61 GiB 6.02 GB)
Used Dev Size : 1959808 (1914.20 MiB 2006.84 MB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Thu Jan 8 18:44:32 2009
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : 8fd9e7d9:0dae82af:b836248b:2f509f91
Events : 0.1336
Number   Major   Minor   RaidDevice   State
0        8       3       0            active sync   /dev/sda3
1        8       19      1            active sync   /dev/sdb3
2        8       35      2            active sync   /dev/sdc3
3        8       51      3            active sync   /dev/sdd3
The new space is not yet reflected in the PV or VG, so grow the PV:
root@xen-80-31:~# pvresize /dev/md1
Physical volume "/dev/md1" changed
1 physical volume(s) resized / 0 physical volume(s) not resized
root@xen-80-31:~# pvdisplay
--- Physical volume ---
PV Name /dev/md1
VG Name testvg
PV Size 5.61 GB / not usable 1.44 MB
Allocatable yes
PE Size (KByte) 4096
Total PE 1435
Free PE 923
Allocated PE 512
PV UUID h6qBlQ-RCy3-YeE9-zQXw-j1oa-bg8K-2JULo1
root@xen-80-31:~# vgdisplay
--- Volume group ---
VG Name testvg
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 4
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 2
Open LV 0
Max PV 0
Cur PV 1
Act PV 1
VG Size 5.61 GB
PE Size 4.00 MB
Total PE 1435
Alloc PE / Size 512 / 2.00 GB
Free PE / Size 923 / 3.61 GB
VG UUID iOVFKf-8iSV-k1VK-G37u-Ivns-9uqx-vZCduc
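Pulling the steps together, here is a dry-run sketch that only prints the commands used in this walkthrough and does not execute anything. `print_grow_plan` is a hypothetical helper, and the device names in the example call are the ones from our test box.

```shell
#!/bin/sh
# Print (do not run) the commands for growing a RAID5-backed PV.
# Arguments: md device, new member partition, new total member count.
print_grow_plan() (
    md=$1
    part=$2
    count=$3
    echo "mdadm --add $md $part"
    echo "mdadm --grow $md --raid-devices=$count"
    echo "# wait until /proc/mdstat no longer shows a reshape"
    echo "pvresize $md"
)

print_grow_plan /dev/md1 /dev/sdd3 4
```

Printing rather than running keeps the sketch safe to try anywhere. The commands themselves need root and a real array.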
In another test, growing a 3 x 200GB array by one more 200GB partition took around 150 minutes on our system, so the reshaping process is not super speedy. Our test showed that you can still perform I/O against the back-end data store (RAID5) while it is reshaping, but that I/O competes with the reshape for the same disks, so it is probably best to keep it to a minimum until the reshape finishes.
UPDATE: We repeated the test with 3 x 700GB partitions and added a 4th 700GB partition. The reshape took about 8.5 hours with no external I/O against the LVM/RAID5 device.
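Those timings line up with the reshape speed /proc/mdstat reported on the small test array. A back-of-the-envelope sketch, assuming decimal gigabytes and that the reshape walks through one member's worth of blocks (which is what the per-device counter in /proc/mdstat suggests). `reshape_rate_mbs` is a made-up name.

```shell
#!/bin/sh
# Average reshape rate in MB/s (decimal), given the size of one member
# partition in GB and the elapsed time in minutes.
reshape_rate_mbs() {
    awk -v gb="$1" -v min="$2" 'BEGIN { printf "%.1f\n", gb * 1000 / (min * 60) }'
}

reshape_rate_mbs 200 150   # 200GB members, 150 minutes: prints 22.2
reshape_rate_mbs 700 510   # 700GB members, 8.5 hours:   prints 22.9
```

Both runs work out to roughly 22 to 23 MB/s, so on this hardware reshape time scales more or less linearly with the member partition size.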