expanding RAID and LVM

- May 2008

Copyright 2008 by Robert Forsman <raid@thoth.purplefrog.com>

I have a house file server. I am routinely adding hard drives to it as I fill up the others. The main consumer of space is fingerprint-proof backups of my DVD collection. I'd rather not expose the disks to the hazards of handling*, so I rip them to hard drive and the originals go to live on the shelf.

* - You can find articles where people claim that a DVD can be damaged by the bending it endures when being removed from an unreasonably tight DVD case. There are also claims of DVD rot where the reflective layer corrodes, probably due to contaminants from the manufacturing process. I'm hoping to be able to enjoy my purchases long after the original physical media fails.

Since I really don't want to go to the hassle of re-ripping the collection when one of the hard drives fails (and it is a question of "when", not "if"), I use RAID5. To allow my filesystems to grow and move, I use LVM on top of the RAIDs. Recent linux kernels support growing a RAID5 in place, but my house file server isn't running anything that recent.

chinese fire drill

Let's see what happens when we want to add a 4th hard drive to a 3-drive system modelled loosely on the state of my array a year ago.

partitioning
partition       sda (300G)   sdb (400G)   sdc (500G)   sdd (750G)
 1  /boot       64M          64M          64M          64M
 2  swap        512M         512M         512M         512M
 3  root        4G           4G           4G           4G
 4  extended    295G         395G         495G         745G
 5              64G          64G          64G          128G
 6              64G          64G          64G          128G
 7              64G          64G          64G          64G
 8              64G          64G          64G          64G
 9              39G          64G          64G          64G
10                           64G          64G          64G
11                           11G          64G          64G
12                                        47G          64G
13                                                     64G
14                                                     11G

In the days of 300G drives, 64G chunks seemed like a good building block. Now that 750G drives are common and 1TB drives are on the shelf at Fry's, 128G seems a little more plausible. You can still fit 7 of those in a 1TB drive, which seems a little excessive, but reconstructing a 128G slice of a RAID already takes a couple of hours, and I don't have a fast enough machine to make me comfortable moving to 256G chunks.

In this scenario there are four RAID5 arrays (/dev/md[3567]), each made of three 64G chunks, plus two mirrors (/dev/md[01]). All six md devices are physical volumes allocated to the mg20 volume group.

allocation
partition       sda (300G)   sdb (400G)   sdc (500G)   sdd (750G)
 5              MD3          MD3          MD3          128G
 6              MD5          MD5          MD5          128G
 7              MD6          MD6          MD6          64G
 8              MD7          MD7          MD7          64G
 9              vg20         MD0          MD0          64G
10                           MD1          MD1          64G
11                           vg20         vg20         64G
12                                        vg20         64G
13                                                     64G
14                                                     11G

For educational purposes, this is how you could replicate this setup:

mdadm --create -l 1 -n 3 /dev/md0 /dev/sdb9 /dev/sdc9 missing
mdadm --create -l 1 -n 3 /dev/md1 /dev/sdb10 /dev/sdc10 missing
mdadm --create -l 5 -n 3 /dev/md3 /dev/sd[abc]5
mdadm --create -l 5 -n 3 /dev/md5 /dev/sd[abc]6
mdadm --create -l 5 -n 3 /dev/md6 /dev/sd[abc]7
mdadm --create -l 5 -n 3 /dev/md7 /dev/sd[abc]8
vi /etc/mdadm.conf
pvcreate /dev/md[013567] /dev/sda9 /dev/sdb11 /dev/sdc11 /dev/sdc12
vgcreate mg20 /dev/md[013567]
vgcreate vg20 /dev/sda9 /dev/sdb11 /dev/sdc11 /dev/sdc12

I use mirrors for data that I really don't want to lose (mail archives, CVS repository). I use RAID5 for stuff that would be inconvenient to lose (DVD backups). Even mirrors will not protect your data from an errant rm -rf or a software error, so make backups to another machine; if that machine can be in another state, even better. The mg20 volume group is all-RAID: any volume created in that VG will be resistant to hardware failure.
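
For what it's worth, even a dumb nightly rsync to another box covers most of the "oops" cases. A minimal sketch, assuming a second machine reachable as otherbox with space under /backups (both names are made up for illustration):

rsync -a --delete /home/ otherbox:/backups/fileserver/home/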

The vg20 volume group has no RAID components. If one of the drives dies, any logical volume with extents on the failed drive will be destroyed. I use vg20 for backups of other computers in the house. If a drive fails, the other computer is probably still fine.

philosophy for the new drive

By creating two 128G partitions on the new 750 I have seriously cramped my style. Incorporating them into a RAID where the other partitions are 64G would be a massive waste of space. However, since hard drives are getting ridiculously large the 64G chunk size is becoming unwieldy. I will be buying more drives and they will be 750G or larger, so I will have more 128G partitions.

I want the finished product to have the same amount of mirrored space. I mostly want to expand the RAID5 space so I can rip the next DVDs I buy. Let's look at the goal:

goal
partition       sda (300G)   sdb (400G)   sdc (500G)   sdd (750G)
 5              MD4          MD4          MD4          128G
 6              MD8          MD8          MD8          128G
 7              MD9          MD9          MD9          MD9
 8              MD10         MD10         MD10         MD10
 9              vg20         MD2          MD2          MD2
10                           MD1          MD0          MD8
11                           vg20         vg20         MD4
12                                        vg20         MD1
13                                                     MD0
14                                                     vg20

You'll notice that MD1 and MD0 need to be shuffled around to make room for the MD2 RAID5. Also, all the old 3-drive RAID5s are being replaced by 4-drive RAID5s. The fact that /dev/sdd[56] are 128G kind of damages the aesthetics, but defective aesthetics are nothing new in the world of computers.

Achieving this goal requires a significant amount of gymnastics, which is the point of this article.

The rough sequence you will find useful is to move elements of existing mirrors out of the way to allow you to create new RAID5 partitions. Then you can pvmove the data off the old RAID5s and recycle their partitions into new larger RAID5s.

Why you should always make a 3-drive mirror

You may have noticed that when I made the mirrors, I made them 3-drive mirrors with one drive missing (yes, you literally type "missing" where the 3rd partition would go on the command line). If you had made a 2-drive mirror, then to change where one of the partitions lived you would have to fail that partition to make room for the new one, leaving the mirror with a single copy of your data. I'm not comfortable living in that uncertain state: an inopportune drive failure would result in the destruction of data and send you digging through your backups. Since we have a "degraded" 3-drive system, we can add the new partitions first and shuffle things around with much less danger:

mdadm /dev/md0 -a /dev/sdd13
mdadm /dev/md1 -a /dev/sdd12

At this point you need to leave the computer alone for an hour or three. It will be busy reading data from the old partitions in each RAID and copying it to the new partitions. The kernel is smart enough to do one reconstruction at a time. If you make the mistake of doing pvmoves at the same time as a RAID reconstruction, you will cause the operating system to thrash the heads on the hard drive as it deals with two separate subsystems each imposing massive IO loads on different sections of the disk. That's great for triggering infant mortality (if you believe in that for modern hard drives), but it will turn a 2-hour operation into a 20-hour operation.

A simple while sleep 60; do cat /proc/mdstat; done should keep you apprised of the progress of the mirror "reconstruction". The kernel is even kind enough to provide you with an estimated time of completion for each in-progress reconstruction.
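
If you would rather have a loop that simply blocks until the rebuild finishes (the same trick the RAID-resizing section uses later), something like this works:

while grep -Eq 'recovery|resync' /proc/mdstat; do sleep 60; done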

intermediate
partition       sdb (400G)   sdc (500G)   sdd (750G)
 9              MD0          MD0          64G
10              MD1          MD1          64G
11              vg20         vg20         64G
12                           vg20         MD1
13                                        MD0
14                                        11G

Once your mirrors are fully active on 3 drives, you can deactivate the old partitions.

mdadm /dev/md0 -f /dev/sdc9 -r /dev/sdc9
mdadm /dev/md1 -f /dev/sdc10 -r /dev/sdc10

We had a choice of deactivating /dev/sdc9 or /dev/sdb9 from MD0. We intend to put MD0 on /dev/sdc10. If we had deactivated /dev/sdb9 then we would be copying data from sdd13 and sdc9 onto sdc10. Copying data from sdc9 to sdc10 would cause head thrashing.
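
If you want to double-check which members are active and in sync before (or after) kicking one out, mdadm --detail gives a per-device view, for example:

mdadm --detail /dev/md0
mdadm --detail /dev/md1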

intermediate
partition       sdb (400G)   sdc (500G)   sdd (750G)
 9              MD0          -            64G
10              MD1          -            64G
11              vg20         vg20         64G
12                           vg20         MD1
13                                        MD0
14                                        11G

Now we move MD0 into /dev/sdc10:

mdadm /dev/md0 -a /dev/sdc10

Run cat /proc/mdstat until the copying is done, and then remove /dev/sdb9:

mdadm /dev/md0 -f /dev/sdb9 -r /dev/sdb9

intermediate
partition       sdb (400G)   sdc (500G)   sdd (750G)
 9              -            -            64G
10              MD1          MD0          64G
11              vg20         vg20         64G
12                           vg20         MD1
13                                        MD0
14                                        11G

MD1 and MD0 are in their new homes. Assuming there are no kernel bugs, the data is intact, and you didn't even have to take the logical volumes off-line. We have also freed up the 3 partitions needed to create /dev/md2.

mdadm --create -l 5 -n 3 /dev/md2 /dev/sd[bcd]9
vi /etc/mdadm.conf

Again, you should wait for the operating system to finish "reconstructing" the RAID.

intermediate
partition       sda (300G)   sdb (400G)   sdc (500G)   sdd (750G)
 5              MD3          MD3          MD3          128G
 6              MD5          MD5          MD5          128G
 7              MD6          MD6          MD6          64G
 8              MD7          MD7          MD7          64G
 9              vg20         MD2          MD2          MD2
10                           MD1          MD0          64G
11                           vg20         vg20         64G
12                                        vg20         MD1
13                                                     MD0
14                                                     11G

Reviewing our TODO list, we find that we still need to replace MD3, MD5, MD6, and MD7 with 4-drive RAID5s. If I had a modern version of linux, I might be able to expand them in place, but since I do not, I have an opportunity to exercise the LVM layer of my setup. Since the new MD2 is the same size as MD3, we can easily ask LVM to relocate all the data.

pvcreate /dev/md2
vgextend mg20 /dev/md2
pvmove /dev/md3 /dev/md2

This command actually stays in the foreground and monitors the progress of the LVM layer. pvmove does sometimes crash, but it seems to checkpoint its progress (or maybe the kernel continues on behind the scenes), so you can just run pvmove with no arguments to resume watching the progress.

If there were not enough room on /dev/md2 to fit all of the /dev/md3 data, there is a syntax to pvmove to relocate a subset of the extents. We will touch on that later in the process.

Once the pvmove from /dev/md3 is complete we can deallocate those partitions and build the replacement /dev/md4 RAID.

vgreduce mg20 /dev/md3
pvremove /dev/md3
mdadm --stop /dev/md3
mdadm --misc --zero-superblock /dev/sd[abc]5
mdadm --create -l 5 -n 4 /dev/md4 /dev/sd[abc]5 /dev/sdd11
vi /etc/mdadm.conf
pvcreate /dev/md4
vgextend mg20 /dev/md4

Be super-careful with the --zero-superblock command. If you use it on the wrong partitions, bad things will happen.
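
A sanity check worth doing before zeroing anything: examine the superblock to confirm the partition really belonged to the array you just dismantled, and make sure it is not a member of any running array.

mdadm --examine /dev/sda5
cat /proc/mdstat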

intermediate
partition       sda (300G)   sdb (400G)   sdc (500G)   sdd (750G)
 5              MD4          MD4          MD4          128G
 6              MD5          MD5          MD5          128G
 7              MD6          MD6          MD6          64G
 8              MD7          MD7          MD7          64G
 9              vg20         MD2          MD2          MD2
10                           MD1          MD0          64G
11                           vg20         vg20         MD4
12                                        vg20         MD1
13                                                     MD0
14                                                     11G

Now that /dev/md4 is part of the mg20 volume group we can relocate /dev/md5's data onto it and then recycle MD5 into MD8.

pvmove /dev/md5 /dev/md4
vgreduce mg20 /dev/md5
pvremove /dev/md5
mdadm --stop /dev/md5
mdadm --misc --zero-superblock /dev/sd[abc]6
mdadm --create -l 5 -n 4 /dev/md8 /dev/sd[abc]6 /dev/sdd10
vi /etc/mdadm.conf
pvcreate /dev/md8
vgextend mg20 /dev/md8

intermediate
partition       sda (300G)   sdb (400G)   sdc (500G)   sdd (750G)
 5              MD4          MD4          MD4          128G
 6              MD8          MD8          MD8          128G
 7              MD6          MD6          MD6          64G
 8              MD7          MD7          MD7          64G
 9              vg20         MD2          MD2          MD2
10                           MD1          MD0          MD8
11                           vg20         vg20         MD4
12                                        vg20         MD1
13                                                     MD0
14                                                     11G

Now we are in an interesting position. MD4 was larger than MD5, so it will have some unallocated physical extents. You can see how many extents are free with the pvdisplay command. Here's a sample output from my laptop:

File descriptor 4 left open
  --- Physical volume ---
  PV Name               /dev/sda6
  VG Name               vg80
  PV Size               227.37 GB / not usable 825.00 KB
  Allocatable           yes 
  PE Size (KByte)       4096
  Total PE              58206
  Free PE               38119
  Allocated PE          20087
  PV UUID               cPQtZ4-GloJ-FRMg-3B6j-42C8-Cpo0-X2ZH8K

Notice the Free PE count. Physical extents are the units of allocation for LVM; on this PV they are 4096K chunks. I am not sure what would happen if you had physical volumes with different PE sizes. Since the MD5 RAID had 128G of space and the MD4 RAID has 192G, even if MD5 had been full there would still be 64G of space remaining on MD4, which would show up as more than 16,000 extents under Free PE. If you wanted to "compact" your PVs you could issue the following command to fill up the rest of MD4 with extents from MD6.

pvmove /dev/md6:0-16300 /dev/md4

Just replace 16300 with the number of free extents on /dev/md4 minus 1. That should fill /dev/md4 (unless MD4 had some holes in its allocation map, but that's a topic for advanced readers). You can then move the other half of MD6 into the fresh MD8.

pvmove /dev/md6 /dev/md8
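
As an aside, you don't have to count extents by hand for the ranged pvmove above. A quick sketch that parses the human-readable pvdisplay output shown earlier (I have not bothered to make this robust):

FREE=$(pvdisplay /dev/md4 | awk '/Free PE/ {print $3}')
pvmove /dev/md6:0-$(($FREE - 1)) /dev/md4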

Now deallocate MD6 and build MD9.

vgreduce mg20 /dev/md6
pvremove /dev/md6
mdadm --stop /dev/md6
mdadm --misc --zero-superblock /dev/sd[abc]7
mdadm --create -l 5 -n 4 /dev/md9 /dev/sd[abcd]7
vi /etc/mdadm.conf
pvcreate /dev/md9
vgextend mg20 /dev/md9

intermediate
partition       sda (300G)   sdb (400G)   sdc (500G)   sdd (750G)
 5              MD4          MD4          MD4          128G
 6              MD8          MD8          MD8          128G
 7              MD9          MD9          MD9          MD9
 8              MD7          MD7          MD7          64G
 9              vg20         MD2          MD2          MD2
10                           MD1          MD0          MD8
11                           vg20         vg20         MD4
12                                        vg20         MD1
13                                                     MD0
14                                                     11G

The only step left is to empty MD7 and recycle it into MD10. Even though MD8 got the last half of MD6's data, it probably still has enough space to absorb all of MD7.

pvmove /dev/md7 /dev/md8

Again we do the RAID5 recycle dance:

vgreduce mg20 /dev/md7
pvremove /dev/md7
mdadm --stop /dev/md7
mdadm --misc --zero-superblock /dev/sd[abc]8
mdadm --create -l 5 -n 4 /dev/md10 /dev/sd[abcd]8
vi /etc/mdadm.conf
pvcreate /dev/md10
vgextend mg20 /dev/md10

intermediate
partition       sda (300G)   sdb (400G)   sdc (500G)   sdd (750G)
 5              MD4          MD4          MD4          128G
 6              MD8          MD8          MD8          128G
 7              MD9          MD9          MD9          MD9
 8              MD10         MD10         MD10         MD10
 9              vg20         MD2          MD2          MD2
10                           MD1          MD0          MD8
11                           vg20         vg20         MD4
12                                        vg20         MD1
13                                                     MD0
14                                                     11G

At this point we can accomplish our goal by adding /dev/sdd14 to vg20. While we're at it, let's just throw /dev/sdd5 and 6 in there as well.

pvcreate /dev/sdd14 /dev/sdd[56]
vgextend vg20 /dev/sdd14 /dev/sdd[56]

We've added over 250G to vg20 (the non-RAID volume group) and have 384G of new usable RAID space (plus 128G of checksum).

a note about mirrors

You will notice that when I moved /dev/md0 I actually moved both of the partitions. If I wanted to avoid that I could have built the /dev/md2 RAID5 from /dev/sdb9, sdc10, and sdd9. That would irritate me because I like my RAID5s to span partitions with identical numbers. When we look at MD4 and MD8 in the finished product we realize that I'm irritated anyway.

The only guidance I can offer you is that when you create mirrors, don't put them on partitions with the same number. That will make it easier to create a RAID5 that is "pretty".

Then again, you might not care about "pretty".

aftermath

To be fair, this dance probably took about 1 day, with about an hour of it requiring the operator's attention. This is probably why people pay big money to Netapp and EMC. I assume their software handles crap like this automatically.

In the free software world I'm sure someone somewhere has at least an experimental system for managing this kind of thing automatically and by the time you get around to reading this article it might be ready for early adopters.

alternatives

It is possible to add a disk to a RAID5 array with modern kernels and tools. Based on http://scotgate.org/?p=107 here is what I think the MD3 RAID adjustment would look like:

mdadm /dev/md3 -a /dev/sdd11
mdadm --grow /dev/md3 --raid-devices=4
pvresize /dev/md3

This would save you the trouble of destroying the RAID arrays and building replacements with more disk. You would still have to use the pvresize command to make the physical volumes fill out their enlarged RAID devices.

Another result would be fragmentation, which is uninteresting to sysadmins of modern operating systems. If you object to fragmentation on aesthetic grounds, you can use pvmove to make your logical volumes occupy contiguous extents. lvdisplay -m can show you the physical extents allocated to each logical volume.
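
As a sketch of what that defragmentation pass might look like, assuming a logical volume named video in mg20 whose extents are scattered across /dev/md4 (the LV name is made up for illustration):

lvdisplay -m /dev/mg20/video
pvmove -n video /dev/md4 /dev/md9

The first command shows which PVs and extent ranges "video" currently occupies; the second moves only video's extents off /dev/md4 and onto /dev/md9.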

resizing partitions

It is technically possible to resize partitions. It does require nerves of steel, serious attention to detail, and a piece of paper. The general strategy is to copy down or print out the start and end blocks from your partition table.

alexandria thoth # fdisk -l /dev/sda

Disk /dev/sda: 300.0 GB, 300069052416 bytes
255 heads, 63 sectors/track, 36481 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1           2       16033+  83  Linux
/dev/sda2               3          35      265072+  82  Linux swap / Solaris
/dev/sda3              36         166     1052257+  83  Linux
/dev/sda4             167       36481   291700237+   5  Extended
/dev/sda5             167        8325    65537136   fd  Linux raid autodetect
/dev/sda6            8326       16484    65537136   fd  Linux raid autodetect
/dev/sda7           16485       24643    65537136   fd  Linux raid autodetect
/dev/sda8           24644       32803    65545168+  fd  Linux raid autodetect
/dev/sda9           32804       36481    29543503+  8e  Linux LVM

I think it is possible to resize a partition containing a PV without data loss, but I have not experimented with it. You would use the pvresize tool to expand into the new space. Resizing a RAID is covered later in this document and is probably not worth the hassle.

If you wished to merge /dev/sda5 and 6 into a single 128G partition, you would first evacuate all data from those partitions. If they are physical volumes you should pvmove all the data off them and then vgreduce and pvremove. If they are RAID members you must pvmove the data off the enclosing RAID device, vgreduce, pvremove, and mdadm --stop the RAID array.

You will also want to deactivate any filesystems that live on the disk after the partition that you will be altering. This often means unmounting all logical volumes and deactivating all volume groups ( vgchange -a n ). Also deactivate all RAIDs that have pieces on that disk ( mdadm --stop /dev/mdwhatever ).
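
Putting those two paragraphs together, here is a rough sketch of the preparation for merging /dev/sda5 and /dev/sda6, assuming they are still members of /dev/md3 and /dev/md5 as in the original layout and that the rest of mg20 has room for their extents (the mount point is made up; unmount whatever you actually have there). I have not run this exact sequence, so treat it as an outline rather than a recipe:

pvmove /dev/md3
pvmove /dev/md5
vgreduce mg20 /dev/md3 /dev/md5
pvremove /dev/md3 /dev/md5
mdadm --stop /dev/md3
mdadm --stop /dev/md5
umount /mnt/dvdrips
vgchange -a n mg20
vgchange -a n vg20
mdadm --stop /dev/md6
mdadm --stop /dev/md7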

Following that, use fdisk to edit the partition table, creating a replacement /dev/sda5 with a start equivalent to the old sda5 and an end equivalent to the old sda6. Before you write the partition table to disk, print it out and make sure it looks exactly how you want it to. Partitions that are not being consolidated may end up with different numbers, but their start/end should be the same.

Command (m for help): p

Disk /dev/sda: 300.0 GB, 300069052416 bytes
255 heads, 63 sectors/track, 36481 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1           2       16033+  83  Linux
/dev/sda2               3          35      265072+  82  Linux swap / Solaris
/dev/sda3              36         166     1052257+  83  Linux
/dev/sda4             167       36481   291700237+   5  Extended
/dev/sda5           16485       24643    65537136   fd  Linux raid autodetect
/dev/sda6           24644       32803    65545168+  fd  Linux raid autodetect
/dev/sda7           32804       36481    29543503+  8e  Linux LVM
/dev/sda8             167       16484   131074303+  fd  Linux raid autodetect

If the fact that the partitions are out of order bothers you, go ahead and delete and recreate the partitions that come after the resized one. Initial experiments indicate that LVM and mdadm do not care about the numbering: they use UUIDs stored inside the partitions and can find their pieces no matter how you number them.

You can now write the partition table to disk, but it is very common that the running kernel will be unable to re-read the new partition table while the disk is in use.

Anything that references partitions that have been renumbered (sda7,8,9 became 5,6,7 ; while sda5 should not be referenced by anything any more) will have to be updated. If you're using these partitions purely for LVM then I don't know of any files that reference partitions. LVM scans hard drives and picks up anything that has the right ID and a PV header. For RAIDs, your /etc/mdadm.conf should be identifying things with a uuid=, which means there are no partition names. If your mdadm.conf does use partition names, you will have to adjust them. If you are using any of the partitions for a regular filesystem, you'll have to update /etc/fstab.
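
If your mdadm.conf does not already identify arrays by UUID, mdadm can generate suitable ARRAY lines for you; review them before merging them into the real file:

mdadm --detail --scan

Each line of output should look something like "ARRAY /dev/mdX ... UUID=...", one per running array.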

If you need to reboot you can do it now. The freshly booted kernel will read your new partition table. If you screwed anything up, be ready to do some hardcore troubleshooting. A printout of your old partition table will probably save your butt.

(If the repartitioned disk only had LVM and RAID on it, you probably do not need a reboot. A reboot is only required if the kernel had a locked copy of the partition table in RAM for its own safety, which I think only happens if there are normal filesystems currently mounted from that drive.)
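
If the kernel did refuse the new table but nothing on that disk is actually in use, you can usually get it re-read without a reboot; either of these should do it, assuming the disk is /dev/sda and you have parted or util-linux installed:

partprobe /dev/sda
blockdev --rereadpt /dev/sda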

Resizing a RAID

It is not actually practical to expand all the partitions of a RAID5, but I will show you how it is done anyway.

The problem is that the RAID superblock lives at the end of each partition (which makes it easier for LILO and grub to boot from a RAID1 mirror; they don't even realize it is part of a RAID). As a result, each partition resize wrecks that element of the RAID (because the superblock ends up in the middle of the expanded partition, not at the end) and it must be reconstructed. During this reconstruction phase you are vulnerable to a disk failure at the same time you are thrashing the bejeezus out of several hard drives.

Let us imagine a RAID5 called /dev/md0 built from /dev/sda5, /dev/sdb5, and /dev/sdc5. Let us also imagine that you have emptied out the partitions after each of them in preparation for expansion.

# mdadm --stop /dev/md0
# fdisk /dev/sda
     expand /dev/sda5.
# mdadm --assemble /dev/md0
# mdadm /dev/md0 -a /dev/sda5
# while grep recovery /proc/mdstat; do sleep 60; done
     this should take a while.  Large RAIDs can take hours.
# cat /proc/mdstat
     review this output to make sure the RAID is in a good state. 
# mdadm --stop /dev/md0
# fdisk /dev/sdb
     expand /dev/sdb5.
# mdadm --assemble /dev/md0
# mdadm /dev/md0 -a /dev/sdb5
# while grep recovery /proc/mdstat; do sleep 60; done
# cat /proc/mdstat
# mdadm --stop /dev/md0
# fdisk /dev/sdc
     expand /dev/sdc5.
# mdadm --assemble /dev/md0
# mdadm /dev/md0 -a /dev/sdc5
# while grep recovery /proc/mdstat; do sleep 60; done
# cat /proc/mdstat
# mdadm --grow /dev/md0 -z max
     the array must be assembled and running for --grow to work;
     this tells md to use all of the newly enlarged partitions.
# cat /proc/mdstat

Have I talked you out of it yet?

anecdotes about the SansDigital TR4U