
Setting up the storage on my new machine, I just ran into something really interesting: what seems to be deliberate, usable, and useful, but completely undocumented, functionality in the MD RAID layer.
It's possible to create RAID devices with the initial array having 'missing' slots, and then add the devices for those missing slots later. RAID1 lets you have one or more missing, RAID5 only one, RAID6 one or two, and RAID10 up to half of the total. That functionality is documented both in the kernel's Documentation/md.txt and in the mdadm manpage.
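For example (a sketch only; the device names here are placeholders, so substitute your own), a three-disk RAID5 can be created with its last member absent:
# mdadm --create /dev/md5 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 missing
The array comes up degraded immediately, and the question below is what to do when the missing disk finally arrives.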
What isn't documented is how, when you later add devices, to get them to take up the 'missing' slots rather than remain as spares. There's nothing in md(7), mdadm(8), or Documentation/md.txt, and nothing I tried with mdadm could do it either, leaving only the sysfs interface to the RAID device.
Documentation/md.txt does describe the sysfs interface in detail, but seems to have some omissions and outdated material - the code has moved on, but the documentation hasn't caught up yet.
So, below the jump, I present my small HOWTO on creating a RAID10 with missing devices and how to later add them properly.
MD with missing devices HOWTO
We're going to create /dev/md10 as a RAID10, starting with two missing devices. In the example here, I use 4 loopback devices of 512MiB each: /dev/loop[1-4], but you should just substitute your real devices.
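If you want to follow along without touching real disks, something like the following should set the loop devices up first (a rough sketch; the backing-file paths under /tmp are just my choice):
# for i in 1 2 3 4 ; do dd if=/dev/zero of=/tmp/md-test.$i bs=1M count=512 ; losetup /dev/loop$i /tmp/md-test.$i ; done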
# mdadm --create /dev/md10 --level 10 -n 4 /dev/loop1 missing /dev/loop3 missing -x 0
mdadm: array /dev/md10 started.
# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4]
md10 : active raid10 loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/2] [U_U_]
# mdadm --manage --add /dev/md10 /dev/loop2 /dev/loop4
mdadm: added /dev/loop2
mdadm: added /dev/loop4
# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4]
md10 : active raid10 loop4[4](S) loop2[5](S) loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/2] [U_U_]
Notice that the two new devices have been added as spares [denoted by the "(S)"], and that the array remains degraded [denoted by the underscores in the "[U_U_]"]. Time to break out the sysfs interface.
# cd /sys/block/md10/md/
# grep . dev-loop*/{slot,state}
dev-loop1/slot:0
dev-loop2/slot:none
dev-loop3/slot:2
dev-loop4/slot:none
dev-loop1/state:in_sync
dev-loop2/state:spare
dev-loop3/state:in_sync
dev-loop4/state:spare
Now a short foray into explaining how MD RAID sees component devices. For an array with N devices total, there are slots numbered from 0 to N-1. If all the devices are present, there are no empty slots. The presence or absence of a device in each slot is what the /proc/mdstat display shows: [U_U_] means we have devices in slots 0 and 2, and nothing in slots 1 and 3. The mdstat output does include slot numbers after each device in the listing line: md10 : active raid10 loop4[4](S) loop2[5](S) loop3[2] loop1[0]. loop4 and loop2 are in slots 4 and 5, both spare; loop1 and loop3 are in slots 0 and 2. The slot numbers at or beyond the array size (4 and 5 here) seem to be extraneous; I'm not sure whether they are just an mdadm abstraction or exist only in the kernel internals.
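(As an aside, you don't need sysfs just to inspect the slot assignments; at least in the mdadm versions I've used, the RaidDevice column of the detail output is the slot number, and spares are marked as such.)
# mdadm --detail /dev/md10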
Now we want to fix up the array by promoting both spares into the missing slots. This is the first place where Documentation/md.txt is really wrong. The description of the slot sysfs node says: "This can only be set while assembling an array." That is actually wrong: we CAN write to it and fix our array.
# echo 1 >dev-loop2/slot
# echo 3 >dev-loop4/slot
# grep . dev-loop*/slot
dev-loop1/slot:0
dev-loop2/slot:1
dev-loop3/slot:2
dev-loop4/slot:3
# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4]
md10 : active raid10 loop4[4] loop2[5] loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/2] [U_U_]
The slot numbers have changed in sysfs, but the mdstat output no longer matches them at all. The spare marker "(S)" has also vanished. Now we can follow the sysfs documentation and force a rebuild using the sync_action node.
In theory, the mdadm monitor daemon, if running, should have detected that the array was degraded and had valid spares and started the rebuild itself, but it didn't, and I don't know why. Perhaps another bug to trace down later.
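(For reference, the daemon I mean is mdadm's monitor mode, started with something along these lines; the mail address is a placeholder.)
# mdadm --monitor --scan --daemonise --mail root
In any case, kicking the resync off by hand works fine: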
# echo repair >sync_action
(wait a moment)
# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4]
md10 : active raid10 loop4[4] loop2[5] loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/2] [U_U_]
      [=============>.......]  recovery = 65.6% (344064/524224) finish=0.1min speed=22937K/sec
The slot numbers still aren't what we set them to, but the array is still busy rebuilding.
# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4]
md10 : active raid10 loop4[3] loop2[1] loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/4] [UUUU]
Now that the rebuild is complete, the slot numbers have flipped to their correct values.
Bonus: regular maintenance ideas
While we can regularly check individual disks with the daemon part of smartmontools, issuing short and long disk tests, there is also a way to check entire arrays for consistency.
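(On the smartmontools side, the relevant piece is smartd; a sketch of an /etc/smartd.conf line, using the scheduling regex from the smartd.conf examples for a daily short test at 2am and a long test on Saturdays at 3am, would be something like this.)
DEVICESCAN -a -m root -s (S/../.././02|L/../../6/03)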
The only way to check a whole array with mdadm is to force a rebuild, but that isn't really a nice proposition if it picks a disk that was about to fail as one of the 'good' disks. sysfs to the rescue again: there is a non-destructive way to test an array, and you only need to switch to repair mode if an issue is found.
# echo check >sync_action
(wait a moment)
# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4]
md10 : active raid10 loop4[3] loop2[1] loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/4] [UUUU]
      [============>........]  check = 62.8% (660224/1048448) finish=0.0min speed=110037K/sec
Either make a cronjob to do it, or put the functionality in mdadm. You can safely issue the check command to multiple md devices at once; the kernel will ensure that it doesn't simultaneously check arrays that share the same disks.
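Here is a minimal cron sketch, assuming a monthly pass is enough; the schedule, the /etc/cron.d/md-check path, and the mismatch_cnt report a day later are just my own choices. The first job starts a check of every md array on the 1st at 04:00, and the second mails back each array's mismatch count.
# cat /etc/cron.d/md-check
0 4 1 * *  root  for md in /sys/block/md*/md ; do echo check > $md/sync_action ; done
0 4 2 * *  root  for md in /sys/block/md*/md ; do echo "$md: $(cat $md/mismatch_cnt) mismatches" ; done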
Upstream it!
Date: 2008-09-08 03:21 pm (UTC)
Doesn't mdadm do this already?
Date: 2008-09-11 03:03 am (UTC)
mythtv test $ mdadm --create -n 2 -l 1 /dev/md2 /dev/loop1 missing
mdadm: array /dev/md2 started.
mythtv test $ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 loop1[0]
102336 blocks [2/1] [U_]
unused devices: &lt;none&gt;
mythtv test $ mdadm --add /dev/md2 /dev/loop2
mdadm: added /dev/loop2
mythtv test $ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 loop2[1] loop1[0]
102336 blocks [2/2] [UU]
unused devices: &lt;none&gt;
mythtv test $
Re: Doesn't mdadm do this already?
Date: 2008-09-11 03:42 am (UTC)
# for i in 0 1 2 3 ; do dd if=/dev/zero of=/block.$i bs=1M count=128 ; losetup /dev/loop${i} /block.$i ; done ;
# mdadm --create -n 4 -l 10 /dev/md99 missing /dev/loop1 missing /dev/loop3
# mdadm --add /dev/md99 /dev/loop0
# grep '(S)' /proc/mdstat
md99 : active raid10 loop1[4](S) loop4[3] loop2[1]
I've done this in a real world server.
Date: 2008-11-08 03:24 am (UTC)
http://bugzilla.kernel.org/show_bug.cgi?id=11967
Helluva workaround though..
Date: 2009-05-17 11:46 am (UTC)
I fear that this server might have too old a version of sysfs in it.
Date: 2009-05-18 08:42 pm (UTC)
2.6.3 was released in Feb 2004.
Date: 2009-05-18 09:34 pm (UTC)
I wonder if there is a live distro out there with a sufficiently recent kernel, and raid support that I could use to do the trick above? If I understand correctly, once I've gotten the right values into the superblocks, rebuilt the array and resynched, I should be able to boot up on my old kernel and have things still work.
Then I can look into upgrading to a newer kernel.
Date: 2009-05-18 09:44 pm (UTC)
Given the danger level here, if you have a spare disk, I'd suggest imaging your array components using dd as a precautionary measure. Given the age of the machine, you could probably capture all 4 components onto a single modern disk.
Date: 2009-05-19 06:45 am (UTC)
Then again, despite a tight budget, it may be worthwhile to buy a 750GB or 1TB drive for the purpose.
Date: 2009-05-24 05:21 am (UTC)
So, now I just gotta rebuild everything.
You've been a great help, thanks!
no space left on device ??
Date: 2009-06-11 09:49 am (UTC)
I'm trying to "revive" a missing raid6 following your procedure and other ones.
But when I try to 'echo 1 >dev-sda1/slot' I get a message telling me that there's no space left on the device, and a write error.
Any idea why I cannot write to this file?
This is kernel 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64 x86_64 x86_64 GNU/Linux
Re: no space left on device ??
Date: 2009-06-11 10:35 am (UTC)
Finally, I recovered the RAID6.
# echo -n 1 >dev-sda1/slot
(the first disk was out, so sda1 is in slot 1, not 0)
The rest of the disks automatically occupied the remaining slots.
# echo -n clean >array_state
And the RAID6 is running again.
Now I'm trying to mount the fs stored in lvm...