[personal profile] robbat2

Setting up the storage on my new machine, I just ran into something really interesting: what seems to be deliberate, usable, and useful, but completely undocumented, functionality in the MD RAID layer.

It's possible to create RAID devices with 'missing' slots in the initial array, and then add devices for those missing slots later. RAID1 lets you have one or more missing, RAID5 only one, RAID6 one or two, RAID10 up to half of the total. That functionality is documented in both the kernel's Documentation/md.txt and the mdadm manpage.

What isn't documented is how, when you later add devices, to get them to take up the 'missing' slots rather than remain as spares. Nothing in md(7), mdadm(8), or Documentation/md.txt covers it. Nothing I tried with mdadm could do it either, leaving only the sysfs interface for the RAID device.

Documentation/md.txt does describe the sysfs interface in detail, but seems to have some omissions and outdated material - the code has moved on, but the documentation hasn't caught up yet.

So, below the jump, I present my small HOWTO on creating a RAID10 with missing devices and how to later add them properly.

MD with missing devices HOWTO

We're going to create /dev/md10 as a RAID10, starting with two missing devices. In the example here, I use 4 loopback devices of 512MiB each: /dev/loop[1-4], but you should just substitute your real devices.
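If you want to follow along with loop devices instead of real disks, something like this should prepare them (the /block.N backing-file paths are just an example):

# for i in 1 2 3 4 ; do dd if=/dev/zero of=/block.$i bs=1M count=512 ; losetup /dev/loop$i /block.$i ; done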

# mdadm --create /dev/md10 --level 10 -n 4 /dev/loop1 missing /dev/loop3 missing -x 0
mdadm: array /dev/md10 started.
# cat /proc/mdstat 
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/2] [U_U_]
# mdadm --manage --add /dev/md10 /dev/loop2 /dev/loop4
mdadm: added /dev/loop2
mdadm: added /dev/loop4
# cat /proc/mdstat 
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop4[4](S) loop2[5](S) loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/2] [U_U_]

Notice that the two new devices have been added as spares [denoted by the "(S)"], and that the array remains degraded [denoted by the underscores in "[U_U_]"]. Now it's time to break out the sysfs interface.

# cd /sys/block/md10/md/
# grep . dev-loop*/{slot,state}
dev-loop1/slot:0
dev-loop2/slot:none
dev-loop3/slot:2
dev-loop4/slot:none
dev-loop1/state:in_sync
dev-loop2/state:spare
dev-loop3/state:in_sync
dev-loop4/state:spare

Now a short foray into how MD RAID sees component devices. For an array with N devices total, there are slots numbered 0 to N-1. If all the devices are present, there are no empty slots. The presence or absence of a device in each slot is shown by the /proc/mdstat display: [U_U_] means we have devices in slots 0 and 2, and nothing in slots 1 and 3. The mdstat output also includes a slot number after each device in the listing line: md10 : active raid10 loop4[4](S) loop2[5](S) loop3[2] loop1[0]. loop4 and loop2 show up in slots 4 and 5, both spare; loop1 and loop3 are in slots 0 and 2. The slot numbers greater than the number of devices in the array seem to be extraneous; I'm not sure whether they are just an mdadm abstraction or exist only in the kernel internals.
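To make the mapping concrete for this array, the current slot layout is:

slot 0: loop1 (in_sync) -> U
slot 1: missing         -> _
slot 2: loop3 (in_sync) -> U
slot 3: missing         -> _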

Now we want to fix up the array by promoting both spares into the missing slots. This is the first place where Documentation/md.txt is really wrong. The description of the slot sysfs node claims: "This can only be set while assembling an array." That is actually wrong - we CAN write to it and fix our array.

# echo 1 >dev-loop2/slot
# echo 3 >dev-loop4/slot
# grep . dev-loop*/slot
dev-loop1/slot:0
dev-loop2/slot:1
dev-loop3/slot:2
dev-loop4/slot:3
# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop4[4] loop2[5] loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/2] [U_U_]

The slot numbers have changed in both the mdstat output and sysfs, but they no longer match each other at all. The spare marker "(S)" has also vanished. Now we can follow the sysfs documentation and force a rebuild using the sync_action node.

In theory, the mdadm daemon, if running, should have detected that the array was degraded and had valid spares, and acted on it - but it didn't, and I don't know why. Perhaps another bug to trace down later.
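For reference, the monitor is normally started along these lines (flags from memory; check mdadm(8) and your distro's init script):

# mdadm --monitor --scan --daemonise --mail=root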

# echo repair >sync_action 
(wait a moment)
# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop4[4] loop2[5] loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/2] [U_U_]
      [=============>.......]  recovery = 65.6% (344064/524224) finish=0.1min speed=22937K/sec

The slot numbers still aren't what we set them to, but the array is busy rebuilding.

# cat /proc/mdstat 
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop4[3] loop2[1] loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/4] [UUUU]

Now that the rebuild is complete, the slot numbers have flipped to their correct values.

Bonus: regular maintenance ideas

While we can regularly check individual disks with the daemon part of smartmontools, issuing short and long disk self-tests, there is also a way to check an entire array for consistency.
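The per-disk side is just the usual smartctl self-tests (the device name is only an example):

# smartctl -t short /dev/sda
# smartctl -t long /dev/sda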

The only way to do it with mdadm is to force a rebuild, but that isn't really a nice proposition if it picks a disk that was about to fail as one of the 'good' disks. sysfs to the rescue again: there is a non-destructive way to test an array, and you only need to promote to repair mode if there is an issue.

# echo check >sync_action 
(wait a moment)
# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop4[3] loop2[1] loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/4] [UUUU]
      [============>........]  check = 62.8% (660224/1048448) finish=0.0min speed=110037K/sec

Either make a cronjob to do it, or put the functionality in mdadm. You can safely issue the check command to multiple md devices at once; the kernel will make sure it doesn't simultaneously check arrays that share the same disks.
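As a rough sketch, a monthly cron.d entry like this would do it (the path and schedule are only examples):

# cat /etc/cron.d/mdcheck
30 2 1 * * root for f in /sys/block/md*/md/sync_action ; do echo check > $f ; done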

Upstream it!

Date: 2008-09-08 03:21 pm (UTC)
From: [identity profile] spyderous.livejournal.com
You should submit that to lkml for inclusion into the docs!

Doesn't mdadm do this already?

Date: 2008-09-11 03:03 am (UTC)
From: (Anonymous)
mdadm seems to hot add things in for me on raid1:

mythtv test $ mdadm --create -n 2 -l 1 /dev/md2 /dev/loop1 missing
mdadm: array /dev/md2 started.
mythtv test $ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 loop1[0]
102336 blocks [2/1] [U_]

unused devices: <none>
mythtv test $ mdadm --add /dev/md2 /dev/loop2
mdadm: added /dev/loop2
mythtv test $ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 loop2[1] loop1[0]
102336 blocks [2/2] [UU]

unused devices: <none>
mythtv test $

Re: Doesn't mdadm do this already?

Date: 2008-09-11 03:42 am (UTC)
From: [identity profile] robbat2.livejournal.com
Ok, testing with RAID1 appears to auto-rebuild, but RAID10 does not.

# for i in 0 1 2 3 ; do dd if=/dev/zero of=/block.$i bs=1M count=128 ; losetup /dev/loop${i} /block.$i ; done ;
# mdadm --create -n 4 -l 10 /dev/md99 missing /dev/loop1 missing /dev/loop3
# mdadm --add /dev/md99 /dev/loop0
# grep '(S)' /proc/mdstat
md99 : active raid10 loop1[4](S) loop4[3] loop2[1]

I've done this in a real world server.

Date: 2008-10-04 03:19 am (UTC)
From: (Anonymous)
I migrated a real-world server from JBOD to RAID5 by installing 2 hard drives the same size as the primary and initializing them as a RAID5 with one disk missing. I then copied the data across, verified it, remounted the filesystem, and hot-added the old primary to the new RAID5.
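Roughly, that sequence looks like this (device names and filesystem type are hypothetical; sdb1/sdc1 are on the new drives, sda1 holds the old JBOD data):

# mdadm --create /dev/md0 --level 5 -n 3 /dev/sdb1 /dev/sdc1 missing
# mkfs.ext3 /dev/md0
# mount /dev/md0 /mnt/new
# rsync -a /data/ /mnt/new/   (copy, then verify)
# mdadm --add /dev/md0 /dev/sda1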

(no subject)

Date: 2008-11-08 03:24 am (UTC)
From: (Anonymous)
Looks like a known bug.

http://bugzilla.kernel.org/show_bug.cgi?id=11967

Helluva workaround, though.

(no subject)

Date: 2008-11-08 06:56 am (UTC)
From: [identity profile] robbat2.livejournal.com
Fun - upstream dismissed it as a bug originally, and I just moved to 3ware hardware for myself instead.

(no subject)

Date: 2009-05-17 11:46 am (UTC)
From: [personal profile] swestrup
This looks like exactly the solution to the problem I'm currently dealing with, only there is no 'md' node in my /sys/block/md10 dir. All I find there are files called 'dev', 'range', 'size' and 'stat'.

I fear that this server might have too old a version of sysfs in it.

(no subject)

Date: 2009-05-17 06:55 pm (UTC)
From: [identity profile] robbat2.livejournal.com
What kernel is on that server?

(no subject)

Date: 2009-05-17 07:22 pm (UTC)
From: [personal profile] swestrup
It's running a 3.6.3 Mandriva kernel.

(no subject)

Date: 2009-05-17 07:30 pm (UTC)
From: [identity profile] robbat2.livejournal.com
uname -a please, the distro version says nothing.

(no subject)

Date: 2009-05-18 12:48 pm (UTC)
From: [personal profile] swestrup
Oops. Sorry. Typo above. I was trying to say 2.6.3 kernel. Specifically uname -a says '2.6.3-4mdk'

(no subject)

Date: 2009-05-18 08:42 pm (UTC)
From: [identity profile] robbat2.livejournal.com
Ok, that's absolutely ancient. What was the build date in the uname -a string?
2.6.3 was released in Feb 2004.

(no subject)

Date: 2009-05-18 09:34 pm (UTC)
From: [personal profile] swestrup
The machine isn't running right now, but I think it was bought in 2000 and probably hasn't been upgraded since around 2005. That's when Mandrake became Mandriva, and the upgrade was known to be problematic, so it was never done.

I wonder if there is a live distro out there with a sufficiently recent kernel and RAID support that I could use to do the trick above. If I understand correctly, once I've gotten the right values into the superblocks, rebuilt the array, and resynced, I should be able to boot up on my old kernel and have things still work.

Then I can look into upgrading to a newer kernel.

(no subject)

Date: 2009-05-18 09:44 pm (UTC)
From: [identity profile] robbat2.livejournal.com
Grab one of the weekly Gentoo ISOs suitable for your architecture. x86 and amd64 presently use a 2.6.28 kernel in the latest ISOs.

Given the danger level here, if you have a spare disk, I'd suggest imaging your array components using dd as a precautionary measure. Given the age of the machine, you could probably capture all of the components onto a single modern disk.
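Something like this, once per component (paths are only an example):

# dd if=/dev/sda of=/mnt/backup/sda.img bs=1M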

(no subject)

Date: 2009-05-19 06:45 am (UTC)
From: [personal profile] swestrup
Alas, it's an array of three 200GB drives. I don't happen to have anything here that's big enough to image all the parts.

Then again, despite a tight budget, it may be worthwhile to buy a 750GB or 1TB drive for the purpose.

(no subject)

Date: 2009-05-24 03:56 am (UTC)
From: [personal profile] swestrup
Well, I finally got some new drives and imaged the parts of the old ones, but I am now stuck: writing to the slot files for the spares just gets me a write error.

(no subject)

Date: 2009-05-24 03:57 am (UTC)
From: [personal profile] swestrup
Oh, and this is with a 2.6.28 kernel from a Gentoo weekly, like you suggested.

(no subject)

Date: 2009-05-24 04:38 am (UTC)
From: [identity profile] robbat2.livejournal.com
Did you review Documentation/md.txt as per my suggestion? Can you include the contents of the various files (use grep . as above)?

(no subject)

Date: 2009-05-24 05:21 am (UTC)
From: [personal profile] swestrup
I dunno what changed, but I disassembled and reassembled the array a few times, tried it again, and it just worked.

So, now I just gotta rebuild everything.

You've been a great help, thanks!

no space left on device ??

Date: 2009-06-11 09:49 am (UTC)
From: (Anonymous)
Very interesting post.
I'm trying to 'revive' a RAID6 with missing devices following your procedure and some others.
But when I try to 'echo 1 >dev-sda1/slot' I get a message telling me there's no space left on the device, and a write error.
Any idea why I cannot write to this file?
This is kernel 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64 x86_64 x86_64 GNU/Linux

Re: no space left on device ??

Date: 2009-06-11 10:35 am (UTC)
From: (Anonymous)
It was a silly thing. Just use echo -n.

Finally, I recovered the RAID6.
# echo -n 1 >dev-sda1/slot (the first disk was the one that was out, so sda1 goes in slot 1, not 0)
The rest of the disks automatically occupied the rest of the slots.
# echo -n clean >array_state
And the RAID6 is running again.
Now I'm trying to mount the fs stored in lvm...
