Enable reconstruct-write mode



b) Add some code to put the array into a mode where all writes are "reconstruct writes" vs. "read-modify-writes".  This requires all the drives to be spun up during the clearing process, but would probably let step 3 run 3x faster.

 

Tom, now you've touched on a point I've had in my head for some time... I was actually about to write a new topic on the roadmap forum just to suggest such a mode!  My main interest in it is not just this case (improving zeroing speed) but normal array usage: in situations where high write speed would be useful, e.g. when needing to write huge amounts of data to a disk in the array, we could simply switch to that mode.  Sure, it has the downside of needing all disks spinning, but it could still be a really great feature IMO for when we need it.  I guess this could bring array writes close to full HDD speed (i.e. similar to parity-sync speed), right?

 

Maybe this should actually be moved/copied to a different topic as a separate enhancement request?

 

The code in the driver is already there and there's a tunable called "md_write_method" to enable it, but I took out the ability to configure it long ago, though it would be very easy to put back in.  At the time this was implemented we were still mostly using PCI controllers and dual-channel IDE drives, and I found that as the array width increased, the speed benefit decreased sharply.  I think with as little as 4 or 5 drives in the array the speed benefit disappeared, and beyond that writes actually got slower as the array width increased.  But maybe the situation has changed now, so I'll hook that control back up and run some tests...

 

Thanks Tom, that sounds very good to me.  If such a mode works as I expect, I guess it could largely remove the main disadvantage of unRAID vs. other storage solutions... write performance... sure, with the downside of needing all HDDs spinning, but still very useful for when we really need it!  If there are no other downsides (?) I really think you could give it more emphasis than just a 'tunable', but you can think about that in the future, after it's fully tested, etc...

 

I have one question though.  Imagine we are running the array in that mode, on a system with 3 data disks + parity, and we want to write something to disk1... In my understanding it will need to read the data from disk2 and disk3, then write the intended data to disk1 and write parity calculated from the data it now knows for all the data disks... right?  My question is exactly this: what happens if, during this operation, a bad block is found on disk2 or disk3 for example?  Will it automatically handle that and read the existing parity data so that the new parity still accounts for the data that was on disk2?  My fear, as you can understand, is that such a condition could lead to parity being updated without taking into account the good data that was on disk2, making it impossible to recover.

 

Btw, wouldn't it be better to move Reply #39, Reply #41, and this post into a new topic just for this enhancement request?  It's not directly related to "Remove Drive Without Losing Parity" - that feature would surely benefit greatly from it, but not exclusively - and the further discussion this may need would only bloat this topic... maybe a title like: Allow using the array in "reconstruct writes" mode for improved write performance


I have one question though, imagine we are running array on that mode, on a system with 3 data disks + parity, then we want to write something to disk1... in my understanding it will need to read data from disk2 and disk3, then just write intended data on disk1 and parity - calculated from data it knows of all data disks... right? my question exactly is: what would happen if during this operation a bad block is found for example on disk2 or disk3? will it automatically handle it and read existing parity data to calculate proper parity including the data that was on disk2? my fear is, as you can understand, if such condition couldn't eventually lead to parity being updated without taking in account the good data that was on disk2, leading to impossible to recover it.

No - it does not work like that!

 

The sequence is:

- Read the sector from the parity disk and from the target data disk (i.e. two reads)

- Write the new data to the data disk

- Calculate the new parity from the two sectors just read and the new data, and write it to the parity drive

 

This means that the only drives involved are the parity drive and the target data drive.  All other drives can remain spun down.  The parity drive will always be spun up and used when writing (unless writing to cache).  It is this requirement for 2 reads and 2 writes for each new data update that is the cause of writing being much slower than reading.  It also means that you want your parity drive to be as fast as possible as it is always involved in any updates that are not going to the cache drive.
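
To make the arithmetic concrete, here is a tiny Python sketch of the XOR maths behind that sequence - purely illustrative, not the actual md driver code, and the function name is just made up for the example:

# Illustrative only: read-modify-write parity update with single (XOR) parity.
# 'old_parity' and 'old_data' are the two sectors just read; 'new_data' is what we want to write.
def rmw_parity(old_parity: int, old_data: int, new_data: int) -> int:
    # "Subtract" the old data from parity and fold in the new data (XOR is its own inverse).
    return old_parity ^ old_data ^ new_data

# Check against a full recalculation over 3 data "sectors" (one byte each):
d1, d2, d3 = 0x11, 0x22, 0x33
parity = d1 ^ d2 ^ d3
new_d1 = 0x55
assert rmw_parity(parity, d1, new_d1) == new_d1 ^ d2 ^ d3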

 


@itimpi

But where is that speed benefit coming from then?

b) Add some code to put the array into a mode where all writes are "reconstruct writes" vs. "read-modify-writes".  This requires all the drives to be spun up during the clearing process, but would probably let step 3 run 3x faster.

Not sure, but I suspect that reading from all drives (which can be done in parallel) followed by a single write ends up being faster.  However that is just a guess.
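
As a rough back-of-the-envelope illustration (my own counting in Python, nothing measured), the per-stripe disk operations in each mode look something like this:

# Illustrative operation count per stripe written, for n_data data disks + 1 parity disk.
def ops_per_stripe(n_data: int, mode: str) -> dict:
    if mode == "read-modify-write":
        # Read target + parity, then write target + parity: the read and the later write
        # land on the same two spindles, so each of those drives does two operations.
        return {"reads": 2, "writes": 2, "drives touched": 2, "ops on busiest drive": 2}
    # reconstruct-write: read every *other* data disk, then write target + parity.
    # More data moves in total, but each drive does only one operation, so they can all overlap.
    return {"reads": n_data - 1, "writes": 2, "drives touched": n_data + 1, "ops on busiest drive": 1}

print(ops_per_stripe(3, "read-modify-write"))
print(ops_per_stripe(3, "reconstruct-write"))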


I have one question though.  Imagine we are running the array in that mode, on a system with 3 data disks + parity, and we want to write something to disk1... In my understanding it will need to read the data from disk2 and disk3, then write the intended data to disk1 and write parity calculated from the data it now knows for all the data disks... right?  My question is exactly this: what happens if, during this operation, a bad block is found on disk2 or disk3 for example?  Will it automatically handle that and read the existing parity data so that the new parity still accounts for the data that was on disk2?  My fear, as you can understand, is that such a condition could lead to parity being updated without taking into account the good data that was on disk2, making it impossible to recover.

No - it does not work like that!

 

The sequence is:

- Read the sector from the parity disk and from the target data disk (i.e. two reads)

- Write the new data to the data disk

- Calculate the new parity from the two sectors just read and the new data, and write it to the parity drive

 

This means that the only drives involved are the parity drive and the target data drive.  All other drives can remain spun down.  The parity drive will always be spun up and used when writing (unless writing to cache).  It is this requirement for 2 reads and 2 writes for each new data update that is the cause of writing being much slower than reading.  It also means that you want your parity drive to be as fast as possible as it is always involved in any updates that are not going to the cache drive.

 

I know that, but my understanding is that this is how it works in the current "read-modify-write" mode, i.e. what we have today.  What I described in my post is how I'm guessing it will work in the "reconstruct writes" mode (and I would also like Tom's confirmation on that, please?).  That's why all drives will need to spin, and that's why I have great hope that it should be able to reach write speeds similar to parity-sync speeds (since it will be a similar operation, with no multiple operations on the same disk in that mode - which is what slows things down, as you said).

 

In sum, that mode may bring a huge improvement in write performance, hopefully with the only downside being that all drives need to be spinning :)


I have one question though.  Imagine we are running the array in that mode, on a system with 3 data disks + parity, and we want to write something to disk1... In my understanding it will need to read the data from disk2 and disk3, then write the intended data to disk1 and write parity calculated from the data it now knows for all the data disks... right?  My question is exactly this: what happens if, during this operation, a bad block is found on disk2 or disk3 for example?  Will it automatically handle that and read the existing parity data so that the new parity still accounts for the data that was on disk2?  My fear, as you can understand, is that such a condition could lead to parity being updated without taking into account the good data that was on disk2, making it impossible to recover.

In this case, if a drive "other" than the one being written suffers an unrecoverable read error, then the operation reverts to a "read-modify-write" that includes Parity.  Also, in this case, the drive which suffered the read error is not re-written - though at this point in the state machine when writes are issued we can reconstruct and write it, so maybe that should be added.
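
Roughly, that behaviour could be sketched like this in Python (illustrative only, not the real md driver state machine - all the names here are made up):

# Illustrative fallback: a read error on one of the "other" data disks makes the
# stripe fall back to a read-modify-write that includes parity.
def write_stripe(target, parity, others, read, write, new_data):
    other_data = []
    for disk in others:
        block = read(disk)
        if block is None:                       # unrecoverable read error on an "other" disk
            old_parity, old_data = read(parity), read(target)
            write(parity, old_parity ^ old_data ^ new_data)
            write(target, new_data)
            # Note: the disk that had the read error is NOT re-written here.
            return "read-modify-write (fallback)"
        other_data.append(block)
    new_parity = new_data                       # normal reconstruct-write path
    for block in other_data:
        new_parity ^= block
    write(parity, new_parity)
    write(target, new_data)
    return "reconstruct-write"

# Toy demo with an in-memory "array" holding one byte per disk:
disks = {"d1": 0x10, "d2": 0x20, "d3": 0x30}
disks["p"] = disks["d1"] ^ disks["d2"] ^ disks["d3"]
print(write_stripe("d1", "p", ["d2", "d3"],
                   read=lambda d: disks[d], write=disks.__setitem__, new_data=0x7f))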


I have one question though.  Imagine we are running the array in that mode, on a system with 3 data disks + parity, and we want to write something to disk1... In my understanding it will need to read the data from disk2 and disk3, then write the intended data to disk1 and write parity calculated from the data it now knows for all the data disks... right?  My question is exactly this: what happens if, during this operation, a bad block is found on disk2 or disk3 for example?  Will it automatically handle that and read the existing parity data so that the new parity still accounts for the data that was on disk2?  My fear, as you can understand, is that such a condition could lead to parity being updated without taking into account the good data that was on disk2, making it impossible to recover.

No - it does not work like that!

 

The sequence is:

- Read the sector from the parity disk and from the target data disk (i.e. two reads)

- Write the new data to the data disk

- Calculate the new parity from the two sectors just read and the new data, and write it to the parity drive

 

This means that the only drives involved are the parity drive and the target data drive.  All other drives can remain spun down.  The parity drive will always be spun up and used when writing (unless writing to cache).  It is this requirement for 2 reads and 2 writes for each new data update that is the cause of writing being much slower than reading.  It also means that you want your parity drive to be as fast as possible as it is always involved in any updates that are not going to the cache drive.

You are describing the current "read-modify-write" sequence, but this thread is talking about a proposal to let you choose "reconstruct-write".

 

In reconstruct-write, we read all the "other" data disks, but not parity.  We then calculate parity and write it along with the new data.  In a 4-drive example: Parity, Disk1, Disk2, Disk3, suppose you write Disk1.  So we read Disk2 and Disk3.  When those reads complete we calculate NewParity = NewDisk1 ^ OldDisk2 ^ OldDisk3 and then schedule writes to Parity and Disk1.
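
Spelled out as a quick Python check (illustrative only, not driver code), the reconstruct-write result is the same parity the read-modify-write path would produce:

# 4-drive example: Parity, Disk1, Disk2, Disk3 - writing Disk1; one byte stands in for a sector.
old_d1, old_d2, old_d3 = 0x10, 0x20, 0x30
old_parity = old_d1 ^ old_d2 ^ old_d3

new_d1 = 0x7f
new_parity = new_d1 ^ old_d2 ^ old_d3              # reconstruct-write: never reads old parity
assert new_parity == old_parity ^ old_d1 ^ new_d1  # same answer as the read-modify-write path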

 

The speedup happens when we are writing a large file.  In this case the reads of the "other" data disks and the writes to the target data disk and parity end up pipelined and running in parallel.  The potential problem is this (besides having to have all disks spun up): as the array width increases you are using more bus and memory bandwidth in order to calculate parity.  Eventually the volume of data being transferred hits a bottleneck and the operation reaches its maximum speed.  In the "old days" with PCI controllers, IDE disks and slow RAM this happened after a relatively small array width.  I suspect the situation has improved greatly though.
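
A crude way to picture that bottleneck (my own toy model in Python, no real numbers behind it): the bytes that have to cross the bus per megabyte written stay constant for read-modify-write, but grow with array width for reconstruct-write.

# Toy model: bytes over the controller/bus per 1 MB of new data, ignoring caching and overlap.
def bus_bytes_per_mb(n_data: int, mode: str) -> int:
    mb = 1 << 20
    if mode == "read-modify-write":
        return 4 * mb                  # read data + parity, write data + parity
    return (n_data - 1 + 2) * mb       # reconstruct-write: (n_data - 1) reads + 2 writes

for n in (3, 5, 10, 20):
    print(n, bus_bytes_per_mb(n, "read-modify-write") >> 20, bus_bytes_per_mb(n, "reconstruct-write") >> 20)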

 

I just released 5.0.2, so I'll see about getting a 5.0.3 out that has this tunable so you guys can do some testing.

 

 

Also, in this case, the drive which suffered the read error is not re-written - though at this point in the state machine when writes are issued we can reconstruct and write it, so maybe that should be added.

Just out of curiosity: is that reconstruct/rewrite procedure currently done for data disks in any other situation where a read error is found?  When the read error happens on the data disk we are actually trying to read?  During a parity check?  I guess that may even be important to ensure HDDs eventually "sort out" bad sectors, i.e. reallocate them to spare sectors with the good data written back, right?

 

Thanks for the explanations, will be waiting for 5.0.3 :)


Also, in this case, the drive which suffered the read error is not re-written - though at this point in the state machine when writes are issued we can reconstruct and write it, so maybe that should be added.

Just out of curiosity: is that reconstruct/rewrite procedure currently done for data disks in any other situation where a read error is found?  When the read error happens on the data disk we are actually trying to read?  During a parity check?  I guess that may even be important to ensure HDDs eventually "sort out" bad sectors, i.e. reallocate them to spare sectors with the good data written back, right?

 

Thanks for the explanations, will be waiting for 5.0.3 :)

 

Yes:

- in a normal read: if disk read error, read parity and all "other" disks, reconstruct data, write to disk that had read error.

- in a normal write: same thing

- in a parity check: the read error disk is written and a syslog entry is generated

- in a parity sync: just a system log entry is generated, but operation continues - this is like a 2-disk failure
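
Summarised as a little lookup, purely for reference (my wording, based only on the list above):

# What happens on an unrecoverable read error, per operation type (summary of the above).
READ_ERROR_HANDLING = {
    "normal read":  "reconstruct from parity + other disks, rewrite the failing sector",
    "normal write": "reconstruct from parity + other disks, rewrite the failing sector",
    "parity check": "rewrite the disk that had the read error, log a syslog entry",
    "parity sync":  "log only and continue (parity not valid yet - like a 2-disk failure)",
}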


Also, in this case, the drive which suffered the read error is not re-written - though at this point in the state machine when writes are issued we can reconstruct and write it, so maybe that should be added.

Just out of curiosity: is that reconstruct/rewrite procedure currently done for data disks in any other situation where a read error is found?  When the read error happens on the data disk we are actually trying to read?  During a parity check?  I guess that may even be important to ensure HDDs eventually "sort out" bad sectors, i.e. reallocate them to spare sectors with the good data written back, right?

 

Thanks for the explanations, will be waiting for 5.0.3 :)

 

Yes:

- in a normal read: if disk read error, read parity and all "other" disks, reconstruct data, write to disk that had read error.

- in a normal write: same thing

- in a parity check: the read error disk is written and a syslog entry is generated

Is a parity error also reported? Is a correction written? If so, what is written?

- in a parity sync: just a system log entry is generated, but operation continues - this is like a 2-disk failure

What is written? Is this reported as a parity error?


Isn't this something an SSD acting as a write cache would solve?

 

I thought this Btrfs cache pool idea was the solution to the poor write speeds?

Yes, among other things.  Adding the ability to force reconstruct-write mode requires all drives to be spun up, and for large arrays there are diminishing returns, to the point where it could be slower beyond a certain array width.  The width at which this occurs will vary depending on the server hardware.  But my testing so far has shown promising results, and this feature is enabled, on an experimental basis, in 5.0.3.


How are you implementing the cache pool idea?

 

Will all writes be sent to the SSD first until nearly full and then sent to the disk during idle periods?

 

Essentially I'd like it to be completely transparent to the source machine and any disk can benefit.

 

Problem with the cache drive is that it's the files themselves that move, rather than caching at the block level! (Permission issues, and TM doesn't like it!)


How are you implementing the cache pool idea?

 

Will all writes be sent to the SSD first until nearly full and then sent to the disk during idle periods?

The cache "disk" is a subvolume of the cache pool.

 

Essentially I'd like it to be completely transparent to the source machine and any disk can benefit.

 

Problem with the cache drive is that it's the files themselves that move, rather than caching at the block level! (Permission issues, and TM doesn't like it!)

a) What permission issues?

b) Come to think of it, I probably haven't tested TM against a share with cache disk enabled... what happens?


My first tests with reconstruct-write mode look good, see:

 

root@unRAID:~# df -h

Filesystem            Size  Used Avail Use% Mounted on

tmpfs                128M  672K  128M  1% /var/log

/dev/sda1              16G  764M  15G  5% /boot

/dev/md1              1.9T  1.6T  281G  85% /mnt/disk1

/dev/md2              1.9T  427G  1.5T  23% /mnt/disk2

/dev/sdc1            1.9T  78G  1.8T  5% /mnt/cache

shfs                  3.7T  2.0T  1.7T  54% /mnt/user0

shfs                  5.5T  2.1T  3.5T  38% /mnt/user

 

root@unRAID:~# mdcmd set md_write_method 0

root@unRAID:~# dd if=/dev/zero bs=10M of=/mnt/disk2/test.tmp count=256 conv=fdatasync

256+0 records in

256+0 records out

2684354560 bytes (2.7 GB) copied, 81.3735 s, 33.0 MB/s

root@unRAID:~# dd if=/dev/zero bs=10M of=/mnt/disk2/test.tmp count=256 conv=fdatasync

256+0 records in

256+0 records out

2684354560 bytes (2.7 GB) copied, 71.5807 s, 37.5 MB/s

root@unRAID:~# dd if=/dev/zero bs=10M of=/mnt/disk2/test.tmp count=256 conv=fdatasync

256+0 records in

256+0 records out

2684354560 bytes (2.7 GB) copied, 80.7667 s, 33.2 MB/s

root@unRAID:~# dd if=/dev/zero bs=10M of=/mnt/disk2/test.tmp count=256 conv=fdatasync

256+0 records in

256+0 records out

2684354560 bytes (2.7 GB) copied, 80.4009 s, 33.4 MB/s

root@unRAID:~# dd if=/dev/zero bs=10M of=/mnt/disk2/test.tmp count=256 conv=fdatasync

256+0 records in

256+0 records out

2684354560 bytes (2.7 GB) copied, 76.4149 s, 35.1 MB/s

 

root@unRAID:~# mdcmd set md_write_method 1

root@unRAID:~# dd if=/dev/zero bs=10M of=/mnt/disk2/test.tmp count=256 conv=fdatasync

256+0 records in

256+0 records out

2684354560 bytes (2.7 GB) copied, 27.2661 s, 98.5 MB/s

root@unRAID:~# dd if=/dev/zero bs=10M of=/mnt/disk2/test.tmp count=256 conv=fdatasync

256+0 records in

256+0 records out

2684354560 bytes (2.7 GB) copied, 29.6694 s, 90.5 MB/s

root@unRAID:~# dd if=/dev/zero bs=10M of=/mnt/disk2/test.tmp count=256 conv=fdatasync

256+0 records in

256+0 records out

2684354560 bytes (2.7 GB) copied, 40.6462 s, 66.0 MB/s

root@unRAID:~# dd if=/dev/zero bs=10M of=/mnt/disk2/test.tmp count=256 conv=fdatasync

256+0 records in

256+0 records out

2684354560 bytes (2.7 GB) copied, 29.802 s, 90.1 MB/s

root@unRAID:~# dd if=/dev/zero bs=10M of=/mnt/disk2/test.tmp count=256 conv=fdatasync

256+0 records in

256+0 records out

2684354560 bytes (2.7 GB) copied, 27.44 s, 97.8 MB/s

 

Also, in some quick tests over the network with Teracopy, I could get ~77MB/s vs ~26MB/s copying to disk1 on the array (85% full though).

 

Though I'm finding it hard to do reliable write-speed benchmarks at the filesystem level, as I always get different speed values - that's why I repeated the dd test 5 times.  I see the same behaviour when writing to the cache disk (also not an empty filesystem), so I'm guessing the filesystem is allocating data at different physical positions on the HDD each time, which would obviously give speed differences?  I guess trying it on a disk with a clean filesystem may help - if I remember correctly, when I built my system and my disks were empty I could get consistent write speeds with a similar test.  Anyway, it's enough to get an idea of the difference, nearly 3x; if someone knows a way to do a more reliable write benchmark on a non-empty filesystem, just let me know.

 

Sure, my array is currently very small with just 2 data disks (the one in my sig), but I may soon add the cache/hot-spare disk to the array and re-test with a 3-disk array, as I may not need a cache disk anymore with this mode available anyway :)


How are you implementing the cache pool idea?

 

Will all writes be sent to the SSD first until nearly full and then sent to the disk during idle periods?

The cache "disk" is a subvolume of the cache pool.

 

Essentially I'd like it to be completely transparent to the source machine and any disk can benefit.

 

Problem with the cache drive is that it's the files themselves that move, rather than caching at the block level! (Permission issues, and TM doesn't like it!)

a) What permission issues?

b) Come to think of it, I probably haven't tested TM against a share with cache disk enabled... what happens?

 

It just falls over.  TM over a network isn't that reliable even with Apple's Time Capsules.

 

This was a while back with b6 so it may be fixed by now, not had time to check!


In reconstruct-write, we read all the "other" data disks, but not parity.  We then calculate parity and write it along with the new data.  In a 4-drive example: Parity, Disk1, Disk2, Disk3, suppose you write Disk1.  So we read Disk2 and Disk3.  When those reads complete we calculate NewParity = NewDisk1 ^ OldDisk2 ^ OldDisk3 and then schedule writes to Parity and Disk1.

 

The speedup happens when we are writing a large file.  In this case the reads of the "other" data disks and the writes to the target data disk and parity end up pipelined and running in parallel.  The potential problem is this (besides having to have all disks spun up): as the array width increases you are using more bus and memory bandwidth in order to calculate parity.  Eventually the volume of data being transferred hits a bottleneck and the operation reaches its maximum speed.  In the "old days" with PCI controllers, IDE disks and slow RAM this happened after a relatively small array width.  I suspect the situation has improved greatly though.

 

This is where the fastest parity disk is going to shine.

I bet those with Areca controllers using RAID0 and write caching are going to see some nice benefits.

 

While we are still limited to the slowest array disk for reading, the parallel reads along with the burst of fast parity writes help.

I've seen this when monitoring the difference between a parity check and a parity generate.

 

This is an exciting change!


[thinking out loud for discussion purposes]

 

I can't decide if this is useful or not for those with a cache drive. 

 

Obviously it doesn't help with writes to the cache which is already fast too, and the mover happens at night when I'm not around to care about speed.  But I suppose there is something to be said for getting the mover done as soon as possible.  But not sure why. 

 

I suppose one MO might be to change the mover function so that it runs much more frequently (like every hour, or X idle minutes after a write operation); that way writes to the cache are more akin to writes directly to the array (for data-protection purposes), and having the mover operate in reconstruct-write mode would certainly make sense.

 

Beyond that, I could see wanting to turn this on temporarily while doing a bunch of file-structure manipulation, especially moving stuff from one share to another over Windows.  Though MC or FTP make that a very fast operation.

 

Otherwise, for most other mundane write operations I guess staying in read-modify-write would be better.

 

Here is a question / idea: What about this being a per-disk or per-share setting?  That way my /BACKUP share (with huge writes happening at night) always operates in fast mode but my /TORRENTS folder (with long term small writes limited by d/l speed) does not.  It would really suck to keep my array spinning all the time when pulling in a torrent.

 

Thoughts???


[thinking out loud for discussion purposes]

 

I can't decide if this is useful or not for those with a cache drive. 

 

Obviously it doesn't help with writes to the cache which is already fast too, and the mover happens at night when I'm not around to care about speed.  But I suppose there is something to be said for getting the mover done as soon as possible.  But not sure why. 

 

I suppose one MO might be to change the mover function so that it runs much more frequently (like every hour, or X idle minutes after a write operation); that way writes to the cache are more akin to writes directly to the array (for data-protection purposes), and having the mover operate in reconstruct-write mode would certainly make sense.

 

Beyond that, I could see wanting to turn this on temporarily while doing a bunch of file-structure manipulation, especially moving stuff from one share to another over Windows.  Though MC or FTP make that a very fast operation.

 

Otherwise, for most other mundane write operations I guess staying in read-modify-write would be better.

 

Here is a question / idea: What about this being a per-disk or per-share setting?  That way my /BACKUP share (with huge writes happening at night) always operates in fast mode but my /TORRENTS folder (with long term small writes limited by d/l speed) does not.  It would really suck to keep my array spinning all the time when pulling in a torrent.

 

Thoughts???

The mode requires all drives to be spun up.  The first bottleneck will be the speed of the slowest drive - it can't run faster than that.  As array width increases, the next bottleneck will probably be PCI bus utilization, governed by your disk controller and the m/b south-bridge chip.  The next bottleneck will probably be memory bandwidth.  The last bottleneck is CPU speed - but one of the other bottlenecks will almost certainly hit first.

 

Easiest way to code this is to have a flag which, if set, says (at the driver level), "if a write request is received and all 'other' array disks are spun up, then do reconstruct-write, else do read-modify-write".  ('other' means all drives besides parity and target disk)

 

This way if you want the mode in effect you need to spin up all the drives.
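
In pseudocode it would be little more than this (an illustrative Python sketch of the rule as described, not the actual driver code):

# Illustrative decision rule for the proposed flag.
def choose_write_method(md_write_method_enabled: bool, other_disks_spun_up: bool) -> str:
    # 'other_disks_spun_up' means every array disk besides parity and the target is spinning.
    if md_write_method_enabled and other_disks_spun_up:
        return "reconstruct-write"
    return "read-modify-write"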


[scratching head] sorry if I'm missing a nuance to your first points here Tom ... ok yeah got what you're saying about bottlenecks. 

 

But for anyone using a cache drive I'm trying to understand when and why I might use this.  Because I'm just not sure what the use case is except when you are doing operations that do not use the cache drive OR ...

 

where you want the mover to operate faster.

 

Which does make the second part of your comment interesting, in so much as what I hear is: it might be worth adding an option to the mover script to spin up all drives before moving.  By your design suggestion, that would mean fast writes happen FTW.

 

Similarly then I guess I would say if it were possible to add a config option for a share to spin up all drives before writing, then once again your design suggestion would mean fast writes happen FTW.

 

How hard would those sorts of configurable options be?  I imagine the mover would be easy since it is just a script.  I'm not sure what it would take to designate some shares as "start all drives before writing" and others as "not".

 

Again these are all just out loud thoughts on how such a fast write feature could benefit those running with a cache drive.  For everyone not running with a cache drive I can easily see the benefit.


A cache drive is still valuable for larger array widths.  For a small array, if you don't mind all your disks spinning, there is maybe less benefit.  The obvious question: how many drives constitute a "small array"?  That will depend on those bottleneck factors, which depend on your exact h/w config.

 

Making all disks spin up on-demand when writing to a top-level share is "hard".  Need to think about this more.


Ahh now I understand your implication. 

 

"hard" is what I figured.  I'm not sure if it is worth it, but it seemed like a reasonable idea as I thought about the two extremes my usage; single large (50gb) writes happening once a day or less versus many small writes happening through out the day.

 

Thinking out loud again... how/when will the driver decide which mode to use for a write?  That is to say, what happens if I start writing a 50GB file and in the middle of that all the drives get spun up?  Will the write mode change?  If so, maybe it would be feasible to have a service/script/daemon/whatever watch for writes to a given share, spin up the other drives, and then have the write mode switch as a result?


But for anyone using a cache drive I'm trying to understand when and why I might use this.  Because I'm just not sure what the use case is except when you are doing operations that do not use the cache drive OR ...

 

where you want the mover to operate faster.

 

I don't see much of a benefit if you have a fast cache drive.

 

For me the benefit "may" be in my main file server, which happens to be small, but is also where I do the most read/write operations.  I'm constantly managing MP3s all day long: adding/changing tags, moving files around, importing artwork.

 

When I want to work on the array, I like the idea of spinning up the drives and having them operate as fast as possible.  So far I've done some kernel tuning that lets me burst up to 60MB/s for small directories of operations.  When I work on a larger directory, that slows down over time to about 35MB/s... which still isn't too shabby.

 

If the new reconstruct-write mode bumps up my write speeds over 60MB/s for a longer duration, it's a big benefit to me.

 

It may even enable me to rip music directly to the file server instead of the local drive, which would save me time in the long run.

 

Since my current server is an HP micro server, I do not have the luxury of a cache drive yet.
