expanding RAID and LVM

- May 2008

Copyright 2008 by Robert Forsman <raid@thoth.purplefrog.com>

I have a house file server. I am routinely adding hard drives to it as I fill up the others. The main consumer of space is fingerprint-proof backups of my DVD collection. I'd rather not expose the disks to the hazards of handling*, so I rip them to hard drive and the originals go to live on the shelf.

* - You can find articles where people claim that a DVD can be damaged by the bending it endures when being removed from an unreasonably tight DVD case. There are also claims of DVD rot where the reflective layer corrodes, probably due to contaminants from the manufacturing process. I'm hoping to be able to enjoy my purchases long after the original physical media fails.

Since I really don't want to go to the hassle of re-ripping the collection when one of the hard drives fails (and it is a question of "when", not "if"), I use RAID5. To allow my filesystems to grow and move, I use LVM on top of the RAIDs. Recent linux kernels support growing a RAID5 in place, but my house file server isn't running anything that recent.

chinese fire drill

Let's see what happens when we want to add a 4th hard drive to a 3-drive system modelled loosely on the state of my array a year ago.

partitioning
partition       sda (300G)   sdb (400G)   sdc (500G)   sdd (750G)
 1  /boot       64M          64M          64M          64M
 2  swap        512M         512M         512M         512M
 3  root        4G           4G           4G           4G
 4  extended    295G         395G         495G         745G
 5              64G          64G          64G          128G
 6              64G          64G          64G          128G
 7              64G          64G          64G          64G
 8              64G          64G          64G          64G
 9              39G          64G          64G          64G
10                           64G          64G          64G
11                           11G          64G          64G
12                                        47G          64G
13                                                     64G
14                                                     11G

In the days of 300G drives, 64G chunks seemed like a good building block. Now that 750G drives are common and 1TB drives are on the shelf at Fry's, 128G seems a little more plausible. You can still fit 7 of those in a 1TB drive, which seems a little excessive, but reconstructing a 128G slice of a RAID already takes a couple of hours, and I don't have a fast enough machine to make me comfortable moving to 256G chunks.

In this scenario there are four RAID5 arrays (/dev/md[3567]), each made of three 64G chunks, plus two mirrors (/dev/md[01]). All six md devices are physical volumes allocated to the mg20 volume group.

allocation
partition       sda (300G)   sdb (400G)   sdc (500G)   sdd (750G)
 5              MD3          MD3          MD3          128G
 6              MD5          MD5          MD5          128G
 7              MD6          MD6          MD6          64G
 8              MD7          MD7          MD7          64G
 9              vg20         MD0          MD0          64G
10                           MD1          MD1          64G
11                           vg20         vg20         64G
12                                        vg20         64G
13                                                     64G
14                                                     11G

For educational purposes, this is how you could replicate this setup:

mdadm --create -l 1 -n 3 /dev/md0 /dev/sdb9 /dev/sdc9 missing
mdadm --create -l 1 -n 3 /dev/md1 /dev/sdb10 /dev/sdc10 missing
mdadm --create -l 5 -n 3 /dev/md3 /dev/sd[abc]5
mdadm --create -l 5 -n 3 /dev/md5 /dev/sd[abc]6
mdadm --create -l 5 -n 3 /dev/md6 /dev/sd[abc]7
mdadm --create -l 5 -n 3 /dev/md7 /dev/sd[abc]8
vi /etc/mdadm.conf
pvcreate /dev/md[013567] /dev/sda9 /dev/sdb11 /dev/sdc11 /dev/sdc12
vgcreate mg20 /dev/md[013567]
vgcreate vg20 /dev/sda9 /dev/sdb11 /dev/sdc11 /dev/sdc12

I use mirrors for data that I really don't want to lose (mail archives, CVS repository). I use RAID5 for stuff that would be inconvenient to lose (DVD backups). Even mirrors will not protect your data from an errant rm -rf or a software error, so make backups to another machine; if that machine can be in another state, even better. The mg20 volume group is all-RAID: any volume created in that VG will be resistant to hardware failure.
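
For what it's worth, even a dumb nightly rsync to another box covers most of the "oops" cases. A minimal sketch, assuming a second machine reachable as otherbox with space under /backups (both names are made up for illustration):

rsync -a --delete /home/ otherbox:/backups/fileserver/home/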

The vg20 volume group has no RAID components. If one of the drives dies, any logical volume with extents on the failed drive will be destroyed. I use vg20 for backups of other computers in the house. If a drive fails, the other computer is probably still fine.

philosophy for the new drive

By creating two 128G partitions on the new 750 I have seriously cramped my style. Incorporating them into a RAID where the other partitions are 64G would be a massive waste of space. However, since hard drives are getting ridiculously large the 64G chunk size is becoming unwieldy. I will be buying more drives and they will be 750G or larger, so I will have more 128G partitions.

I want the finished product to have the same amount of mirrored space. I mostly want to expand the RAID5 space so I can rip the next DVDs I buy. Let's look at the goal:

goal
partition       sda (300G)   sdb (400G)   sdc (500G)   sdd (750G)
 5              MD4          MD4          MD4          128G
 6              MD8          MD8          MD8          128G
 7              MD9          MD9          MD9          MD9
 8              MD10         MD10         MD10         MD10
 9              vg20         MD2          MD2          MD2
10                           MD1          MD0          MD8
11                           vg20         vg20         MD4
12                                        vg20         MD1
13                                                     MD0
14                                                     vg20

You'll notice that MD1 and MD0 need to be shuffled around to make room for the MD2 RAID5. Also, all the old 3-drive RAID5s are being replaced by 4-drive RAID5s. The fact that /dev/sdd[56] are 128G kind of damages the aesthetics, but defective aesthetics are nothing new in the world of computers.

Achieving this goal requires a significant amount of gymnastics, which is the point of this article.

The rough sequence you will find useful is to move elements of existing mirrors out of the way to allow you to create new RAID5 partitions. Then you can pvmove the data off the old RAID5s and recycle their partitions into new larger RAID5s.

Why you should always make a 3-drive mirror

You may have noticed that when I made the mirrors, I made them 3-drive mirrors with one drive missing (yes, you literally type "missing" where the 3rd partition would go on the command line). If you had made a 2-drive mirror, then to change where one of the partitions lived you would have to fail that partition to make room for the new one, leaving the mirror with a single copy of your data. I'm not comfortable living in that uncertain state: an inopportune drive failure would result in the destruction of data and send you digging through your backups. Since we have a "degraded" 3-drive system, we can add the new partitions first and shuffle things around with much less danger:

mdadm /dev/md0 -a /dev/sdd13
mdadm /dev/md1 -a /dev/sdd12

At this point you need to leave the computer alone for an hour or three. It will be busy reading data from the old partitions in each RAID and copying it to the new partitions. The kernel is smart enough to do one reconstruction at a time. If you make the mistake of doing pvmoves at the same time as a RAID reconstruction, you will cause the operating system to thrash the heads on the hard drive as it deals with two separate subsystems each imposing massive IO loads on different sections of the disk. That's great for triggering infant mortality (if you believe in that for modern hard drives), but it will turn a 2-hour operation into a 20-hour operation.

A simple while sleep 60; do cat /proc/mdstat; done should keep you apprised of the progress of the mirror "reconstruction". The kernel is even kind enough to provide you with an estimated time of completion for each in-progress reconstruction.
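
If you would rather have a loop that simply blocks until the rebuild finishes (the same trick the RAID-resizing section uses later), something like this works:

while grep -Eq 'recovery|resync' /proc/mdstat; do sleep 60; done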

intermediate
partition       sdb (400G)   sdc (500G)   sdd (750G)
 9              MD0          MD0          64G
10              MD1          MD1          64G
11              vg20         vg20         64G
12                           vg20         MD1
13                                        MD0
14                                        11G

Once your mirrors are fully active on 3 drives, you can deactivate the old partitions.

mdadm /dev/md0 -f /dev/sdc9 -r /dev/sdc9
mdadm /dev/md1 -f /dev/sdc10 -r /dev/sdc10

We had a choice of deactivating /dev/sdc9 or /dev/sdb9 from MD0. We intend to put MD0 on /dev/sdc10. If we had deactivated /dev/sdb9 then we would be copying data from sdd13 and sdc9 onto sdc10. Copying data from sdc9 to sdc10 would cause head thrashing.
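
If you want to double-check which members are active and in sync before (or after) kicking one out, mdadm --detail gives a per-device view, for example:

mdadm --detail /dev/md0
mdadm --detail /dev/md1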

intermediate
partition       sdb (400G)   sdc (500G)   sdd (750G)
 9              MD0          -            64G
10              MD1          -            64G
11              vg20         vg20         64G
12                           vg20         MD1
13                                        MD0
14                                        11G

Now we move MD0 into /dev/sdc10:

mdadm /dev/md0 -a /dev/sdc10

Run cat /proc/mdstat until the copying is done, and then remove /dev/sdb9:

mdadm /dev/md0 -f /dev/sdb9 -r /dev/sdb9

intermediate
partition       sdb (400G)   sdc (500G)   sdd (750G)
 9              -            -            64G
10              MD1          MD0          64G
11              vg20         vg20         64G
12                           vg20         MD1
13                                        MD0
14                                        11G

MD1 and MD0 are in their new homes. Assuming there are no kernel bugs, the data is intact, and you didn't even have to take the logical volumes off-line. We have also freed up the 3 partitions needed to create /dev/md2.

mdadm --create -l 5 -n 3 /dev/md2 /dev/sd[bcd]9
vi /etc/mdadm.conf

Again, you should wait for the operating system to finish "reconstructing" the RAID.

intermediate
partition       sda (300G)   sdb (400G)   sdc (500G)   sdd (750G)
 5              MD3          MD3          MD3          128G
 6              MD5          MD5          MD5          128G
 7              MD6          MD6          MD6          64G
 8              MD7          MD7          MD7          64G
 9              vg20         MD2          MD2          MD2
10                           MD1          MD0          64G
11                           vg20         vg20         64G
12                                        vg20         MD1
13                                                     MD0
14                                                     11G

Reviewing our TODO list, we find that we still need to replace MD3, MD5, MD6, and MD7 with 4-drive RAID5s. If I had a modern version of linux, I might be able to expand them in place, but since I do not, I have an opportunity to exercise the LVM layer of my setup. Since the new MD2 is the same size as MD3, we can easily ask LVM to relocate all the data.

pvcreate /dev/md2
vgextend mg20 /dev/md2
pvmove /dev/md3 /dev/md2

This command actually stays in the foreground and monitors the progress of the LVM layer. pvmove does sometimes crash, but it seems to checkpoint its progress (or maybe the kernel continues on behind the scenes), so you can just run pvmove with no arguments to resume watching the progress.

If there were not enough room on /dev/md2 to fit all of the /dev/md3 data, there is a syntax to pvmove to relocate a subset of the extents. We will touch on that later in the process.

Once the pvmove from /dev/md3 is complete we can deallocate those partitions and build the replacement /dev/md4 RAID.

vgreduce mg20 /dev/md3
pvremove /dev/md3
mdadm --stop /dev/md3
mdadm --misc --zero-superblock /dev/sd[abc]5
mdadm --create -l 5 -n 4 /dev/md4 /dev/sd[abc]5 /dev/sdd11
vi /etc/mdadm.conf
pvcreate /dev/md4
vgextend mg20 /dev/md4

Be super-careful with the --zero-superblock command. If you use it on the wrong partitions, bad things will happen.
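
A sanity check worth doing before zeroing anything: examine the superblock to confirm the partition really belonged to the array you just dismantled, and make sure it is not a member of any running array.

mdadm --examine /dev/sda5
cat /proc/mdstat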

intermediate
partition       sda (300G)   sdb (400G)   sdc (500G)   sdd (750G)
 5              MD4          MD4          MD4          128G
 6              MD5          MD5          MD5          128G
 7              MD6          MD6          MD6          64G
 8              MD7          MD7          MD7          64G
 9              vg20         MD2          MD2          MD2
10                           MD1          MD0          64G
11                           vg20         vg20         MD4
12                                        vg20         MD1
13                                                     MD0
14                                                     11G

Now that /dev/md4 is part of the mg20 volume group we can relocate /dev/md5's data onto it and then recycle MD5 into MD8.

pvmove /dev/md5 /dev/md4
vgreduce mg20 /dev/md5
pvremove /dev/md5
mdadm --stop /dev/md5
mdadm --misc --zero-superblock /dev/sd[abc]6
mdadm --create -l 5 -n 4 /dev/md8 /dev/sd[abc]6 /dev/sdd10
vi /etc/mdadm.conf
pvcreate /dev/md8
vgextend mg20 /dev/md8

intermediate
partition       sda (300G)   sdb (400G)   sdc (500G)   sdd (750G)
 5              MD4          MD4          MD4          128G
 6              MD8          MD8          MD8          128G
 7              MD6          MD6          MD6          64G
 8              MD7          MD7          MD7          64G
 9              vg20         MD2          MD2          MD2
10                           MD1          MD0          MD8
11                           vg20         vg20         MD4
12                                        vg20         MD1
13                                                     MD0
14                                                     11G

Now we are in an interesting position. MD4 was larger than MD5, so it will have some unallocated physical extents. You can see how many extents are free with the pvdisplay command. Here's a sample output from my laptop:

File descriptor 4 left open
  --- Physical volume ---
  PV Name               /dev/sda6
  VG Name               vg80
  PV Size               227.37 GB / not usable 825.00 KB
  Allocatable           yes 
  PE Size (KByte)       4096
  Total PE              58206
  Free PE               38119
  Allocated PE          20087
  PV UUID               cPQtZ4-GloJ-FRMg-3B6j-42C8-Cpo0-X2ZH8K

Notice the Free PE count. Physical extents are the units of allocation for LVM; on this PV they are 4096K chunks. I am not sure what would happen if you had physical volumes with different PE sizes. Since the MD5 RAID had 128G of space and the MD4 RAID has 192G, even if MD5 had been full there would still be 64G of space remaining on MD4, which would show up as more than 16,000 extents under Free PE. If you wanted to "compact" your PVs you could issue the following command to fill up the rest of MD4 with extents from MD6.

pvmove /dev/md6:0-16300 /dev/md4

Just replace 16300 with the number of free extents on /dev/md4 minus 1. That should fill /dev/md4 (unless MD4 had some holes in its allocation map, but that's a topic for advanced readers). You can then move the other half of MD6 into the fresh MD8.

pvmove /dev/md6 /dev/md8
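
As an aside, you don't have to count extents by hand for the ranged pvmove above. A quick sketch that parses the human-readable pvdisplay output shown earlier (I have not bothered to make this robust):

FREE=$(pvdisplay /dev/md4 | awk '/Free PE/ {print $3}')
pvmove /dev/md6:0-$(($FREE - 1)) /dev/md4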

Now deallocate MD6 and build MD9.

vgreduce mg20 /dev/md6
pvremove /dev/md6
mdadm --stop /dev/md6
mdadm --misc --zero-superblock /dev/sd[abc]7
mdadm --create -l 5 -n 4 /dev/md9 /dev/sd[abcd]7
vi /etc/mdadm.conf
pvcreate /dev/md9
vgextend mg20 /dev/md9

intermediate
partition       sda (300G)   sdb (400G)   sdc (500G)   sdd (750G)
 5              MD4          MD4          MD4          128G
 6              MD8          MD8          MD8          128G
 7              MD9          MD9          MD9          MD9
 8              MD7          MD7          MD7          64G
 9              vg20         MD2          MD2          MD2
10                           MD1          MD0          MD8
11                           vg20         vg20         MD4
12                                        vg20         MD1
13                                                     MD0
14                                                     11G

The only step left is to empty MD7 and recycle it into MD10. Even though MD8 got the last half of MD6's data, it probably still has enough space to absorb all of MD7.

pvmove /dev/md7 /dev/md8

Again we do the RAID5 recycle dance:

vgreduce mg20 /dev/md7
pvremove /dev/md7
mdadm --stop /dev/md7
mdadm --misc --zero-superblock /dev/sd[abc]8
mdadm --create -l 5 -n 4 /dev/md10 /dev/sd[abcd]8
vi /etc/mdadm.conf
pvcreate /dev/md10
vgextend mg20 /dev/md10

intermediate
partition       sda (300G)   sdb (400G)   sdc (500G)   sdd (750G)
 5              MD4          MD4          MD4          128G
 6              MD8          MD8          MD8          128G
 7              MD9          MD9          MD9          MD9
 8              MD10         MD10         MD10         MD10
 9              vg20         MD2          MD2          MD2
10                           MD1          MD0          MD8
11                           vg20         vg20         MD4
12                                        vg20         MD1
13                                                     MD0
14                                                     11G

At this point we can accomplish our goal by adding /dev/sdd14 to vg20. While we're at it, let's just throw /dev/sdd5 and 6 in there as well.

pvcreate /dev/sdd14 /dev/sdd[56]
vgextend vg20 /dev/sdd14 /dev/sdd[56]

We've added over 250G to vg20 (the non-RAID volume group) and have 384G of new usable RAID space (plus 128G of checksum).

a note about mirrors

You will notice that when I moved /dev/md0 I actually moved both of the partitions. If I wanted to avoid that I could have built the /dev/md2 RAID5 from /dev/sdb9, sdc10, and sdd9. That would irritate me because I like my RAID5s to span partitions with identical numbers. When we look at MD4 and MD8 in the finished product we realize that I'm irritated anyway.

The only guidance I can offer you is that when you create mirrors, don't put them on partitions with the same number. That will make it easier to create a RAID5 that is "pretty".

Then again, you might not care about "pretty".

aftermath

To be fair, this dance probably took about 1 day, with about an hour of it requiring the operator's attention. This is probably why people pay big money to Netapp and EMC. I assume their software handles crap like this automatically.

In the free software world I'm sure someone somewhere has at least an experimental system for managing this kind of thing automatically and by the time you get around to reading this article it might be ready for early adopters.

alternatives

It is possible to add a disk to a RAID5 array with modern kernels and tools. Based on http://scotgate.org/?p=107 here is what I think the MD3 RAID adjustment would look like:

mdadm /dev/md3 -a /dev/sdd11
mdadm --grow /dev/md3 --raid-devices=4
pvresize /dev/md3

This would save you the trouble of destroying the RAID arrays and building replacements with more disk. You would still have to use the pvresize command to make the physical volumes fill out their enlarged RAID devices.

Another result would be fragmentation, which is uninteresting to sysadmins of modern operating systems. If you object to fragmentation on aesthetic grounds, you can use pvmove to make your logical volumes occupy contiguous extents. lvdisplay -m can show you the physical extents allocated to each logical volume.
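
As a sketch of what that defragmentation pass might look like, assuming a logical volume named video in mg20 whose extents are scattered across /dev/md4 (the LV name is made up for illustration):

lvdisplay -m /dev/mg20/video
pvmove -n video /dev/md4 /dev/md9

The first command shows which PVs and extent ranges "video" currently occupies; the second moves only video's extents off /dev/md4 and onto /dev/md9.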

resizing partitions

It is technically possible to resize partitions. It does require nerves of steel, serious attention to detail, and a piece of paper. The general strategy is to copy down or print out the start and end blocks from your partition table.

alexandria thoth # fdisk -l /dev/sda

Disk /dev/sda: 300.0 GB, 300069052416 bytes
255 heads, 63 sectors/track, 36481 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1           2       16033+  83  Linux
/dev/sda2               3          35      265072+  82  Linux swap / Solaris
/dev/sda3              36         166     1052257+  83  Linux
/dev/sda4             167       36481   291700237+   5  Extended
/dev/sda5             167        8325    65537136   fd  Linux raid autodetect
/dev/sda6            8326       16484    65537136   fd  Linux raid autodetect
/dev/sda7           16485       24643    65537136   fd  Linux raid autodetect
/dev/sda8           24644       32803    65545168+  fd  Linux raid autodetect
/dev/sda9           32804       36481    29543503+  8e  Linux LVM

I think it is possible to resize a partition containing a PV without data loss, but I have not experimented with it. You would use the pvresize tool to expand into the new space. Resizing a RAID is covered later in this document and is probably not worth the hassle.

If you wished to merge /dev/sda5 and 6 into a single 128G partition, you would first evacuate all data from those partitions. If they are physical volumes you should pvmove all the data off them and then vgreduce and pvremove. If they are RAID members you must pvmove the data off the enclosing RAID device, vgreduce, pvremove, and mdadm --stop the RAID array.

You will also want to deactivate any filesystems that live on the disk after the partition that you will be altering. This often means unmounting all logical volumes and deactivating all volume groups ( vgchange -a n ). Also deactivate all RAIDs that have pieces on that disk ( mdadm --stop /dev/mdwhatever ).
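
Putting those two paragraphs together, here is a rough sketch of the preparation for merging /dev/sda5 and /dev/sda6, assuming they are still members of /dev/md3 and /dev/md5 as in the original layout and that the rest of mg20 has room for their extents (the mount point is made up; unmount whatever you actually have there). I have not run this exact sequence, so treat it as an outline rather than a recipe:

pvmove /dev/md3
pvmove /dev/md5
vgreduce mg20 /dev/md3 /dev/md5
pvremove /dev/md3 /dev/md5
mdadm --stop /dev/md3
mdadm --stop /dev/md5
umount /mnt/dvdrips
vgchange -a n mg20
vgchange -a n vg20
mdadm --stop /dev/md6
mdadm --stop /dev/md7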

Following that, use fdisk to edit the partition table, creating a replacement /dev/sda5 with a start equivalent to the old sda5 and an end equivalent to the old sda6. Before you write the partition table to disk, print it out and make sure it looks exactly how you want it to. Partitions that are not being consolidated may end up with different numbers, but their start/end should be the same.

Command (m for help): p

Disk /dev/sda: 300.0 GB, 300069052416 bytes
255 heads, 63 sectors/track, 36481 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1           2       16033+  83  Linux
/dev/sda2               3          35      265072+  82  Linux swap / Solaris
/dev/sda3              36         166     1052257+  83  Linux
/dev/sda4             167       36481   291700237+   5  Extended
/dev/sda5           16485       24643    65537136   fd  Linux raid autodetect
/dev/sda6           24644       32803    65545168+  fd  Linux raid autodetect
/dev/sda7           32804       36481    29543503+  8e  Linux LVM
/dev/sda8             167       16484   131074303+  fd  Linux raid autodetect

If the fact that the partitions are out of order bothers you, go ahead and delete and recreate the partitions that come after the resized one. Initial experiments indicate that LVM and mdadm do not care about the numbering: they use UUIDs stored inside the partitions and can find their pieces no matter how you number them.

You can now write the partition table to disk, but it is very common that the running kernel will be unable to re-read the new partition table while the disk is in use.

Anything that references partitions that have been renumbered (sda7,8,9 became 5,6,7 ; while sda5 should not be referenced by anything any more) will have to be updated. If you're using these partitions purely for LVM then I don't know of any files that reference partitions. LVM scans hard drives and picks up anything that has the right ID and a PV header. For RAIDs, your /etc/mdadm.conf should be identifying things with a uuid=, which means there are no partition names. If your mdadm.conf does use partition names, you will have to adjust them. If you are using any of the partitions for a regular filesystem, you'll have to update /etc/fstab.
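
If your mdadm.conf does not already identify arrays by UUID, mdadm can generate suitable ARRAY lines for you; review them before merging them into the real file:

mdadm --detail --scan

Each line of output should look something like "ARRAY /dev/mdX ... UUID=...", one per running array.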

If you need to reboot you can do it now. The freshly booted kernel will read your new partition table. If you screwed anything up, be ready to do some hardcore troubleshooting. A printout of your old partition table will probably save your butt.

(If the repartitioned disk only had LVM and RAID on it, you probably do not need a reboot. A reboot is only required if the kernel had a locked copy of the partition table in RAM for its own safety, which I think only happens if there are normal filesystems currently mounted from that drive.)
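
If the kernel did refuse the new table but nothing on that disk is actually in use, you can usually get it re-read without a reboot; either of these should do it, assuming the disk is /dev/sda and you have parted or util-linux installed:

partprobe /dev/sda
blockdev --rereadpt /dev/sda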

Resizing a RAID

It is not actually practical to expand all the partitions of a RAID5, but I will show you how it is done anyway.

The problem is that the RAID superblock lives at the end of each partition (which makes it easier for LILO and grub to boot from a RAID1 mirror; they don't even realize it is part of a RAID). As a result, each partition resize wrecks that element of the RAID (because the superblock ends up in the middle of the expanded partition, not at the end) and it must be reconstructed. During this reconstruction phase you are vulnerable to a disk failure at the same time you are thrashing the bejeezus out of several hard drives.

Let us imagine a RAID5 called /dev/md0 built from /dev/sda5, /dev/sdb5, and /dev/sdc5. Let us also imagine that you have emptied out the partitions after each of them in preparation for expansion.

# mdadm --stop /dev/md0
# fdisk /dev/sda
     expand /dev/sda5.
# mdadm --assemble /dev/md0
# mdadm /dev/md0 -a /dev/sda5
# while grep recovery /proc/mdstat; do sleep 60; done
     this should take a while.  Large RAIDs can take hours.
# cat /proc/mdstat
     review this output to make sure the RAID is in a good state. 
# mdadm --stop /dev/md0
# fdisk /dev/sdb
     expand /dev/sdb5.
# mdadm --assemble /dev/md0
# mdadm /dev/md0 -a /dev/sdb5
# while grep recovery /proc/mdstat; do sleep 60; done
# cat /proc/mdstat
# mdadm --stop /dev/md0
# fdisk /dev/sdc
     expand /dev/sdc5.
# mdadm --assemble /dev/md0
# mdadm /dev/md0 -a /dev/sdc5
# while grep recovery /proc/mdstat; do sleep 60; done
# cat /proc/mdstat
# mdadm --grow /dev/md0 -z max
     the array must be assembled and running for --grow to work;
     this tells md to use all of the newly enlarged partitions.
# cat /proc/mdstat

Have I talked you out of it yet?

anecdotes about the SansDigital TR4U