[personal profile] robbat2

Setting up the storage on my new machine, I just ran into something really interesting: what seems to be deliberate, usable, and useful, but completely undocumented, functionality in the MD RAID layer.

It's possible to create RAID devices with 'missing' slots in the initial array, and then add devices for those missing slots later. RAID1 lets you have one or more missing, RAID5 only one, RAID6 one or two, RAID10 up to half of the total. That functionality is documented in both the kernel's Documentation/md.txt and the mdadm manpage.

What isn't documented is how, when you later add devices, to get them to take up the 'missing' slots rather than remain as spares. Nothing in md(7), mdadm(8), or Documentation/md.txt covers it. Nothing I tried with mdadm could do it either, leaving only the sysfs interface for the RAID device.

Documentation/md.txt does describe the sysfs interface in detail, but seems to have some omissions and outdated material - the code has moved on, but the documentation hasn't caught up yet.

So, below the jump, I present my small HOWTO on creating a RAID10 with missing devices and how to later add them properly.

MD with missing devices HOWTO

We're going to create /dev/md10 as a RAID10, starting with two missing devices. In the example here, I use 4 loopback devices of 512MiB each: /dev/loop[1-4], but you should just substitute your real devices.
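If you want to follow along with loop devices instead of real disks, something like this should prepare them (the /block.N backing-file paths are just an example):

# for i in 1 2 3 4 ; do dd if=/dev/zero of=/block.$i bs=1M count=512 ; losetup /dev/loop$i /block.$i ; done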

# mdadm --create /dev/md10 --level 10 -n 4 /dev/loop1 missing /dev/loop3 missing -x 0
mdadm: array /dev/md10 started.
# cat /proc/mdstat 
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/2] [U_U_]
# mdadm --manage --add /dev/md10 /dev/loop2 /dev/loop4
mdadm: added /dev/loop2
mdadm: added /dev/loop4
# cat /proc/mdstat 
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop4[4](S) loop2[5](S) loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/2] [U_U_]

Notice that the two new devices have been added as spares [denoted by the "(S)"], and that the array remains degraded [denoted by the underscores in "[U_U_]"]. Now it's time to break out the sysfs interface.

# cd /sys/block/md10/md/
# grep . dev-loop*/{slot,state}
dev-loop1/slot:0
dev-loop2/slot:none
dev-loop3/slot:2
dev-loop4/slot:none
dev-loop1/state:in_sync
dev-loop2/state:spare
dev-loop3/state:in_sync
dev-loop4/state:spare

Now a short foray into how MD RAID sees component devices. For an array with N devices total, there are slots numbered 0 to N-1. If all the devices are present, there are no empty slots. The presence or absence of a device in each slot is shown by the /proc/mdstat display: [U_U_] means we have devices in slots 0 and 2, and nothing in slots 1 and 3. The mdstat output also includes a slot number after each device in the listing line: md10 : active raid10 loop4[4](S) loop2[5](S) loop3[2] loop1[0]. loop4 and loop2 show up in slots 4 and 5, both spare; loop1 and loop3 are in slots 0 and 2. The slot numbers greater than the number of devices in the array seem to be extraneous; I'm not sure whether they are just an mdadm abstraction or exist only in the kernel internals.
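To make the mapping concrete for this array, the current slot layout is:

slot 0: loop1 (in_sync) -> U
slot 1: missing         -> _
slot 2: loop3 (in_sync) -> U
slot 3: missing         -> _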

Now we want to fix up the array by promoting both spares into the missing slots. This is the first place where Documentation/md.txt is really wrong. The description of the slot sysfs node claims: "This can only be set while assembling an array." That is actually wrong - we CAN write to it and fix our array.

# echo 1 >dev-loop2/slot
# echo 3 >dev-loop4/slot
# grep . dev-loop*/slot
dev-loop1/slot:0
dev-loop2/slot:1
dev-loop3/slot:2
dev-loop4/slot:3
# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop4[4] loop2[5] loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/2] [U_U_]

The slot numbers have changed in both the mdstat output and sysfs, but they no longer match each other at all. The spare marker "(S)" has also vanished. Now we can follow the sysfs documentation and force a rebuild using the sync_action node.

In theory, the mdadm daemon, if running, should have detected that the array was degraded and had valid spares, and acted on it - but it didn't, and I don't know why. Perhaps another bug to trace down later.
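For reference, the monitor is normally started along these lines (flags from memory; check mdadm(8) and your distro's init script):

# mdadm --monitor --scan --daemonise --mail=root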

# echo repair >sync_action 
(wait a moment)
# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop4[4] loop2[5] loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/2] [U_U_]
      [=============>.......]  recovery = 65.6% (344064/524224) finish=0.1min speed=22937K/sec

The slot numbers still aren't what we set them to, but the array is busy rebuilding.

# cat /proc/mdstat 
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop4[3] loop2[1] loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/4] [UUUU]

Now that the rebuild is complete, the slot numbers have flipped to their correct values.

Bonus: regular maintenance ideas

While we can regularly check individual disks with the daemon part of smartmontools, issuing short and long disk self-tests, there is also a way to check an entire array for consistency.
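The per-disk side is just the usual smartctl self-tests (the device name is only an example):

# smartctl -t short /dev/sda
# smartctl -t long /dev/sda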

The only way to do it with mdadm is to force a rebuild, but that isn't really a nice proposition if it picks a disk that was about to fail as one of the 'good' disks. sysfs to the rescue again: there is a non-destructive way to test an array, and you only need to promote to repair mode if there is an issue.

# echo check >sync_action 
(wait a moment)
# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop4[3] loop2[1] loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/4] [UUUU]
      [============>........]  check = 62.8% (660224/1048448) finish=0.0min speed=110037K/sec

Either make a cronjob to do it, or put the functionality in mdadm. You can safely issue the check command to multiple md devices at once; the kernel will make sure it doesn't simultaneously check arrays that share the same disks.
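As a rough sketch, a monthly cron.d entry like this would do it (the path and schedule are only examples):

# cat /etc/cron.d/mdcheck
30 2 1 * * root for f in /sys/block/md*/md/sync_action ; do echo check > $f ; done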

Upstream it!

Date: 2008-09-08 03:21 pm (UTC)
From: [identity profile] spyderous.livejournal.com
You should submit that to lkml for inclusion into the docs!

Doesn't mdadm do this already?

Date: 2008-09-11 03:03 am (UTC)
From: (Anonymous)
mdadm seems to hot add things in for me on raid1:

mythtv test $ mdadm --create -n 2 -l 1 /dev/md2 /dev/loop1 missing
mdadm: array /dev/md2 started.
mythtv test $ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 loop1[0]
102336 blocks [2/1] [U_]

unused devices: <none>
mythtv test $ mdadm --add /dev/md2 /dev/loop2
mdadm: added /dev/loop2
mythtv test $ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 loop2[1] loop1[0]
102336 blocks [2/2] [UU]

unused devices: <none>
mythtv test $

Re: Doesn't mdadm do this already?

Date: 2008-09-11 03:42 am (UTC)
From: [identity profile] robbat2.livejournal.com
Ok, testing with RAID1 appears to auto-rebuild, but RAID10 does not.

# for i in 0 1 2 3 ; do dd if=/dev/zero of=/block.$i bs=1M count=128 ; losetup /dev/loop${i} /block.$i ; done ;
# mdadm --create -n 4 -l 10 /dev/md99 missing /dev/loop1 missing /dev/loop3
# mdadm --add /dev/md99 /dev/loop0
# grep '(S)' /proc/mdstat
md99 : active raid10 loop1[4](S) loop4[3] loop2[1]

I've done this in a real world server.

Date: 2008-10-04 03:19 am (UTC)
From: (Anonymous)
I migrated a real-world server from JBOD to RAID5 by installing 2 hard drives the same size as the primary and initializing them as a RAID5 with one disk missing. I then copied the data across, verified it, remounted the filesystem, and hot-added the old primary to the new RAID5.
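Roughly, that sequence looks like this (device names and filesystem type are hypothetical; sdb1/sdc1 are on the new drives, sda1 holds the old JBOD data):

# mdadm --create /dev/md0 --level 5 -n 3 /dev/sdb1 /dev/sdc1 missing
# mkfs.ext3 /dev/md0
# mount /dev/md0 /mnt/new
# rsync -a /data/ /mnt/new/   (copy, then verify)
# mdadm --add /dev/md0 /dev/sda1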

(no subject)

Date: 2008-11-08 03:24 am (UTC)
From: (Anonymous)
Looks like a known bug.

http://bugzilla.kernel.org/show_bug.cgi?id=11967

Helluva workaround, though.

(no subject)

Date: 2008-11-08 06:56 am (UTC)
From: [identity profile] robbat2.livejournal.com
Fun - upstream dismissed it as a bug originally, and I just moved to 3ware hardware for myself instead.

(no subject)

Date: 2009-05-17 11:46 am (UTC)
From: [personal profile] swestrup
This looks like exactly the solution to the problem I'm currently dealing with, only there is no 'md' node in my /sys/block/md10 dir. All I find there are files called 'dev', 'range', 'size' and 'stat'.

I fear that this server might have too old a version of sysfs in it.

(no subject)

Date: 2009-05-17 06:55 pm (UTC)
From: [identity profile] robbat2.livejournal.com
What kernel is on that server?

(no subject)

Date: 2009-05-17 07:22 pm (UTC)
From: [personal profile] swestrup
It's running a 3.6.3 Mandriva kernel.

(no subject)

Date: 2009-05-17 07:30 pm (UTC)
From: [identity profile] robbat2.livejournal.com
uname -a please, the distro version says nothing.

(no subject)

Date: 2009-05-18 12:48 pm (UTC)
From: [personal profile] swestrup
Oops. Sorry. Typo above. I was trying to say 2.6.3 kernel. Specifically uname -a says '2.6.3-4mdk'

(no subject)

Date: 2009-05-18 08:42 pm (UTC)
From: [identity profile] robbat2.livejournal.com
Ok, that's absolutely ancient. What was the build date in the uname -a string?
2.6.3 was released in Feb 2004.

(no subject)

Date: 2009-05-18 09:34 pm (UTC)
From: [personal profile] swestrup
The machine isn't running right now, but I think it was bought in 2000 and probably hasn't been upgraded since around 2005. That's when Mandrake became Mandriva, and the upgrade was known to be problematic, so it was never done.

I wonder if there is a live distro out there with a sufficiently recent kernel and RAID support that I could use to do the trick above. If I understand correctly, once I've gotten the right values into the superblocks, rebuilt the array, and resynced, I should be able to boot up on my old kernel and have things still work.

Then I can look into upgrading to a newer kernel.

(no subject)

Date: 2009-05-18 09:44 pm (UTC)
From: [identity profile] robbat2.livejournal.com
Grab one of the weekly Gentoo ISOs suitable for your architecture. x86 and amd64 presently use a 2.6.28 kernel in the latest ISOs.

Given the danger level here, if you have a spare disk, I'd suggest imaging your array components using dd as a precautionary measure. Given the age of the machine, you could probably capture all of the components onto a single modern disk.
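Something like this, once per component (paths are only an example):

# dd if=/dev/sda of=/mnt/backup/sda.img bs=1M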

(no subject)

Date: 2009-05-19 06:45 am (UTC)
From: [personal profile] swestrup
Alas, it's an array of three 200GB drives. I don't happen to have anything here that's big enough to image all the parts.

Then again, despite a tight budget, it may be worthwhile to buy a 750GB or 1TB drive for the purpose.

(no subject)

Date: 2009-05-24 03:56 am (UTC)
From: [personal profile] swestrup
Well, I finally got some new drives and imaged the parts of the old ones, but I am now stuck: writing to the slot files for the spares just gets me a write error.

(no subject)

Date: 2009-05-24 03:57 am (UTC)
From: [personal profile] swestrup
Oh, and this is with a 2.6.28 kernel from a Gentoo weekly, like you suggested.

(no subject)

Date: 2009-05-24 04:38 am (UTC)
From: [identity profile] robbat2.livejournal.com
Did you review Documentation/md.txt as per my suggestion? Can you include the contents of the various files (use grep . as above)?

(no subject)

Date: 2009-05-24 05:21 am (UTC)
From: [personal profile] swestrup
I dunno what changed, but I disassembled and reassembled the array a few times, tried it again, and it just worked.

So, now I just gotta rebuild everything.

You've been a great help, thanks!

no space left on device ??

Date: 2009-06-11 09:49 am (UTC)
From: (Anonymous)
Very interesting post.
I'm trying to 'revive' a RAID6 with missing devices following your procedure and some others.
But when I try to 'echo 1 >dev-sda1/slot' I get a message telling me there's no space left on the device, and a write error.
Any idea why I cannot write to this file?
This is kernel 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64 x86_64 x86_64 GNU/Linux

Re: no space left on device ??

Date: 2009-06-11 10:35 am (UTC)
From: (Anonymous)
It was a silly thing. Just use echo -n.

Finally, I recovered the RAID6.
# echo -n 1 >dev-sda1/slot (the first disk was the one that was out, so sda1 goes in slot 1, not 0)
The rest of the disks automatically occupied the rest of the slots.
# echo -n clean >array_state
And the RAID6 is running again.
Now I'm trying to mount the fs stored in lvm...
