Expanding a Linux LVM PV and underlying RAID5

January 8, 2009

Background

Normally, we run our servers on RAID1, adding a pair of disks at a time. Since we also use DRBD on top of our LVM LVs, we have 3 servers in play (3-way DRBD replication), which means adding new disks in groups of 6, and that gets sort of spendy.

So, with our new servers, we are looking at switching from the RAID1s (grouped into a single VG) to a single RAID5 under the LVM PV (still a single VG). Linux RAID5s can be expanded on the fly, which lets us grow the data disk by adding only 1 disk at a time (3 across the 3 servers). The bulk of the application servers are not really I/O intensive, so we’re not too worried about the RAID5 performance hit.

Here’s the setup on our test of this process:

  • Ubuntu 8.04 LTS Server
  • AMD 64 x2 5600+ with 8GB
  • 3 x 1TB SATA2 drives, 2 x 750GB SATA2 drives
  • Xen’ified 2.6.18 kernel with Xen 3.3.1-rc4
  • Linux RAID, LVM2, etc, etc.

For this test we made the RAID member partitions on the data disks only 2GB, just so the sync doesn’t take *forever*.

Disk /dev/sde: 750.1 GB, 750156374016 bytes
255 heads, 63 sectors/track, 91201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00046fcd

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1               1        1217     9775521   fd  Linux raid autodetect
/dev/sde2            1218        1704     3911827+  82  Linux swap / Solaris
/dev/sde3            1705        1946     1959930   fd  Linux raid autodetect
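
The other data disks carry matching RAID partitions. One way to clone a layout like this onto another disk (we didn’t capture this step in the transcript, and /dev/sdX below is just a placeholder for the target disk) is sfdisk’s dump/restore:

# Dump the partition table of the already-partitioned disk...
sfdisk -d /dev/sde > sde.parts
# ...and replay it onto another data disk (double-check the target device first!)
sfdisk /dev/sdX < sde.parts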

The initial RAID 5 configuration:

md1 : active raid5 sda3[0] sdc3[2] sdb3[1]
      3919616 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
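
For reference, an md1 like this would have been created with something along these lines (we didn’t capture the original create command, so treat this as a reconstruction from the chunk size and member list above):

# Reconstructed create command for a 3-member RAID5 with 64k chunks
mdadm --create /dev/md1 --level=5 --chunk=64 --raid-devices=3 \
    /dev/sda3 /dev/sdb3 /dev/sdc3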

LVM configuration:

root@xen-80-31:~# pvdisplay /dev/md1
  --- Physical volume ---
  PV Name               /dev/md1
  VG Name               testvg
  PV Size               3.74 GB / not usable 3.75 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              956
  Free PE               956
  Allocated PE          0
  PV UUID               h6qBlQ-RCy3-YeE9-zQXw-j1oa-bg8K-2JULo1

root@xen-80-31:~# vgdisplay testvg
  --- Volume group ---
  VG Name               testvg
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               3.73 GB
  PE Size               4.00 MB
  Total PE              956
  Alloc PE / Size       0 / 0
  Free  PE / Size       956 / 3.73 GB
  VG UUID               iOVFKf-8iSV-k1VK-G37u-Ivns-9uqx-vZCduc
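
Likewise, the PV and VG shown above boil down to the usual two commands (again a reconstruction, not a captured transcript):

# Label the md device as an LVM PV, then build the VG on top of it
pvcreate /dev/md1
vgcreate testvg /dev/md1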

Process to expand the data device

Initial RAID5 configuration:

root@xen-80-31:~# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Thu Jan  8 04:33:16 2009
     Raid Level : raid5
     Array Size : 3919616 (3.74 GiB 4.01 GB)
  Used Dev Size : 1959808 (1914.20 MiB 2006.84 MB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Thu Jan  8 17:52:46 2009
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 8fd9e7d9:0dae82af:b836248b:2f509f91
         Events : 0.4

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
       2       8       35        2      active sync   /dev/sdc3

Add the new 2GB device to the RAID5 — it will show up as a “spare” device initially:

root@xen-80-31:~# mdadm --add /dev/md1 /dev/sdd3
mdadm: added /dev/sdd3
root@xen-80-31:~# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Thu Jan  8 04:33:16 2009
     Raid Level : raid5
     Array Size : 3919616 (3.74 GiB 4.01 GB)
  Used Dev Size : 1959808 (1914.20 MiB 2006.84 MB)
   Raid Devices : 3
  Total Devices : 4
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Thu Jan  8 17:52:46 2009
          State : clean
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 8fd9e7d9:0dae82af:b836248b:2f509f91
         Events : 0.4

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
       2       8       35        2      active sync   /dev/sdc3

       3       8       51        -      spare   /dev/sdd3
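
As a quick sanity check (not captured in the output above), the new member’s superblock can be examined directly, and /proc/mdstat will flag it as a spare:

# The new partition should report md1's UUID and a spare role
mdadm --examine /dev/sdd3
# ...and show up with an (S) marker here
cat /proc/mdstat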

Grow the RAID5 to include the new device (with a quick look first at the LVs already in the VG):

root@xen-80-31:~# lvs
  LV    VG     Attr   LSize Origin Snap%  Move Log Copy%
  test1 testvg -wi-a- 1.00G
root@xen-80-31:~# mdadm --grow /dev/md1 --raid-devices=4
mdadm: Need to backup 384K of critical section..
mdadm: ... critical section passed.
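
If you want that critical section stashed somewhere recoverable in case the box dies mid-reshape, mdadm can be pointed at an explicit backup file; a variant of the same grow (which we did not run) would be:

# Same grow, but with the critical-section backup written to a file
# that lives on a device OUTSIDE of md1
mdadm --grow /dev/md1 --raid-devices=4 --backup-file=/root/md1-grow.backup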

/proc/mdstat:

md1 : active raid5 sdd3[3] sda3[0] sdc3[2] sdb3[1]
      3919616 blocks super 0.91 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      [=>...................]  reshape =  7.9% (156192/1959808) finish=1.3min speed=22313K/sec

While the reshape is running, the VG is still active:

root@xen-80-31:~# lvcreate -L 1g -n test2 testvg
  Logical volume "test2" created
root@xen-80-31:~# lvs
  LV    VG     Attr   LSize Origin Snap%  Move Log Copy%
  test1 testvg -wi-a- 1.00G
  test2 testvg -wi-a- 1.00G
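
We didn’t capture it in the transcript, but you can go further and put a filesystem on the new LV and write to it while the reshape is still chugging along; for example:

# Hypothetical: format, mount, and scribble on test2 during the reshape
mkfs.ext3 /dev/testvg/test2
mkdir -p /mnt/test2
mount /dev/testvg/test2 /mnt/test2
dd if=/dev/zero of=/mnt/test2/scratch bs=1M count=100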

After the reshape completes:

root@xen-80-31:~# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Thu Jan  8 04:33:16 2009
     Raid Level : raid5
     Array Size : 5879424 (5.61 GiB 6.02 GB)
  Used Dev Size : 1959808 (1914.20 MiB 2006.84 MB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Thu Jan  8 18:44:32 2009
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 8fd9e7d9:0dae82af:b836248b:2f509f91
         Events : 0.1336

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
       2       8       35        2      active sync   /dev/sdc3
       3       8       51        3      active sync   /dev/sdd3

The new space is not yet reflected in the PV or VG, so grow the PV:

root@xen-80-31:~# pvresize /dev/md1
  Physical volume "/dev/md1" changed
  1 physical volume(s) resized / 0 physical volume(s) not resized
root@xen-80-31:~# pvdisplay
  --- Physical volume ---
  PV Name               /dev/md1
  VG Name               testvg
  PV Size               5.61 GB / not usable 1.44 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              1435
  Free PE               923
  Allocated PE          512
  PV UUID               h6qBlQ-RCy3-YeE9-zQXw-j1oa-bg8K-2JULo1

root@xen-80-31:~# vgdisplay
  --- Volume group ---
  VG Name               testvg
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               5.61 GB
  PE Size               4.00 MB
  Total PE              1435
  Alloc PE / Size       512 / 2.00 GB
  Free  PE / Size       923 / 3.61 GB
  VG UUID               iOVFKf-8iSV-k1VK-G37u-Ivns-9uqx-vZCduc
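
From here the new extents get used like any others; for example (hypothetical, assuming test1 carries an ext3 filesystem), growing an existing LV and the filesystem on it:

# Hand 1GB of the new extents to test1, then grow the filesystem
# (ext3 can be grown online; run e2fsck first if it is unmounted)
lvextend -L +1G /dev/testvg/test1
resize2fs /dev/testvg/test1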

In another test, growing a 3 x 200GB array by one more 200GB partition took around 150 minutes on our system, so the reshape process is not super speedy. Even though our test showed that you can still perform I/O against the backing data store (the RAID5) while it is reshaping, it is probably best to keep I/O to a minimum.

UPDATE: We repeated the test with 3 x 700GB partitions and added a 4th 700GB partition; the reshape took about 8.5 hours with no external I/O against the LVM/RAID5 device.
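
The reshape rate is also throttled by the kernel’s md rebuild limits; if the box is otherwise idle, raising the floor may help (the 50000 below is just an example value, not something we benchmarked):

# Current limits, in KB/s per device (defaults are usually 1000 / 200000)
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
# Example: raise the minimum so the reshape isn't throttled under other I/O
sysctl -w dev.raid.speed_limit_min=50000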


Shrinking a Linux LVM on top of an md RAID1

December 23, 2008

If I don’t start blogging some of the stuff I’m doing, I’ll never remember how I did it in the first place!  So, time to come out from the darkness and see if I can keep up the discipline to post more.

Problem:  One of our servers has a Linux md RAID1 device (md1) that kept falling out of sync, especially after a reboot.  Near the end of the re-sync, it would pause, fail, and start over.  Entries from /var/log/messages:

Dec 22 11:56:52 xen-33-18-02 kernel: md: md1: sync done.
Dec 22 11:56:56 xen-33-18-02 kernel: ata1: soft resetting port
Dec 22 11:56:56 xen-33-18-02 kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec 22 11:56:56 xen-33-18-02 kernel: ata1.00: configured for UDMA/133
Dec 22 11:56:56 xen-33-18-02 kernel: ata1: EH complete
:
Dec 22 11:56:56 xen-33-18-02 kernel: SCSI device sda: 1465149168 512-byte hdwr sectors (750156 MB)
Dec 22 11:56:56 xen-33-18-02 kernel: sda: Write Protect is off
Dec 22 11:56:56 xen-33-18-02 kernel: SCSI device sda: drive cache: write back
Dec 22 11:57:04 xen-33-18-02 kernel: ata1.00: configured for UDMA/133
Dec 22 11:57:04 xen-33-18-02 kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Dec 22 11:57:04 xen-33-18-02 kernel: sda: Current: sense key: Aborted Command
Dec 22 11:57:04 xen-33-18-02 kernel:     Additional sense: No additional sense information
Dec 22 11:57:04 xen-33-18-02 kernel: end_request: I/O error, dev sda, sector 1465143264
Dec 22 11:57:04 xen-33-18-02 kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
:
Dec 22 11:57:14 xen-33-18-02 kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Dec 22 11:57:14 xen-33-18-02 kernel: sda: Current: sense key: Medium Error
Dec 22 11:57:14 xen-33-18-02 kernel:     Additional sense: Unrecovered read error - auto reallocate failed
:

Then, the RAID1 re-sync would start all over again … only to fail a few hours later … and again.

Since there appeared to be some bad spots near the end of sda, the solution seemed to be to shrink the partition a bit to avoid them.

Our md RAID1 (md1) consists of 2 x 694GB partitions on sda3 and sdb3.  On top of md1 lives an LVM PV, VG, and lots of LVs.  The PV is not 100% allocated, so we have some wiggle room to shrink everything.
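
Before shrinking anything, it’s worth confirming how much wiggle room there actually is; a couple of quick checks (names per our setup) show the free extents on the PV and in the VG:

# How much unallocated space is left on the PV / in the VG?
pvdisplay /dev/md1 | grep -i free
vgdisplay datavg | grep -i free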

Here’s the procedure we followed:

  1. Shrink the LVM PV
    • “pvdisplay” showed the PV at 693.98 GB, so we just trimmed it down to an even 693GB.
    • use “pvresize” to shrink the PV
    • # pvresize -v --setphysicalvolumesize 693G /dev/md1
          Using physical volume(s) on command line
          Archiving volume group "datavg" metadata (seqno 42).
          /dev/md1: Pretending size is 1453326336 not 1455376384 sectors.
          Resizing physical volume /dev/md1 from 177658 to 177407 extents.
          Resizing volume "/dev/md1" to 1453325952 sectors.
          Updating physical volume "/dev/md1"
          Creating volume group backup "/etc/lvm/backup/datavg" (seqno 43).
          Physical volume "/dev/md1" changed
          1 physical volume(s) resized / 0 physical volume(s) not resized
  2. Shrink the md RAID1 device
    • If you do NOT do this and shrink the hard drive partitions beneath the md device … you lose your superblock!  Cool, eh?
    • Check the size of the existing md
    • # mdadm --verbose --detail /dev/md1
      /dev/md1:
              Version : 00.90.03
        Creation Time : Mon Sep 29 23:41:30 2008
           Raid Level : raid1
           Array Size : 727688192 (693.98 GiB 745.15 GB)
        Used Dev Size : 727688192 (693.98 GiB 745.15 GB)
         Raid Devices : 2
        Total Devices : 1
      Preferred Minor : 1
          Persistence : Superblock is persistent
      
          Update Time : Mon Dec 22 22:57:31 2008
                State : clean, degraded
       Active Devices : 1
      Working Devices : 1
       Failed Devices : 0
        Spare Devices : 0
      
                 UUID : d584a880:d77024a2:9ae023b4:27ec0db5
               Events : 0.5501090
      
          Number   Major   Minor   RaidDevice State
             0       8        3        0      active sync   /dev/sda3
             1       0        0        1      removed
    • Hey, what a coincidence … the Array Size is the same as the PV size was! 🙂  Now we need to calculate the new size.  The new PV size is 1453325952 sectors, which is 726662976 1KiB blocks (md sizes are in kibibytes, so even *you* can divide by 2!!).
    • Shrink the md:
    • # mdadm --verbose --grow /dev/md1 --size=726662976
      # mdadm --verbose --detail /dev/md1
      /dev/md1:
         :
           Array Size : 726662976 (693.00 GiB 744.10 GB)
    • Woot! Getting closer!
  3. Now, technically, we should shrink the partition, too.  Here’s where I ran into a bit of trouble (ok, I’m too lazy to do the math).  “fdisk” on Linux doesn’t seem to want to let you specify a size in blocks or sectors, so you have to keep shrinking the ending cylinder number until you get in the range of the new block size.  I’ll leave this as an exercise for the reader (a rough sketch of the math is at the end of this post), and feel free to post a comment with the actual procedure.  🙂

    After these steps were completed, we were able to perform the raid1 re-sync successfully.
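
For what it’s worth, here is a rough sketch of the step-3 math (assuming the 255-head/63-sector geometry fdisk reports, i.e. 8225280 bytes per cylinder, plus about 128KB of head-room for the 0.90 superblock at the end of the partition):

# New md size in 1KiB blocks (from the mdadm --grow --size above)
NEW_KB=726662976
# Pad for the md 0.90 superblock and convert to bytes
BYTES=$(( (NEW_KB + 128) * 1024 ))
# Round up to whole fdisk cylinders (255 heads * 63 sectors * 512 bytes)
CYLS=$(( (BYTES + 8225280 - 1) / 8225280 ))
echo "sda3 needs at least $CYLS cylinders"
# Ending cylinder = starting cylinder of sda3 + CYLS - 1; re-check with
# fdisk -l that the resulting block count is still >= NEW_KB before writing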