Company RAID failure: first response before rebuild

The worst RAID decisions usually happen in a hurry: a Warsaw office cannot open invoices, a NAS starts beeping, VMware loses a datastore, and somebody clicks Rebuild because the panel offers it. Before that click, stop and record the array state. Disk order and metadata may be more valuable than another restart.

First rule after a RAID alert

Pause virtual machines, databases, backup jobs and file shares. Label the drives by bay order, write down the exact message and avoid any rebuild or resync until the failure pattern is clear.

What not to do in the first minutes

Do not start Rebuild, Resync, Initialize or Repair on the original array.
Do not move disks between bays to “test” whether the NAS sees them.
Do not replace a second drive because the first rebuild failed.
Do not accept controller or NAS prompts if you do not know what metadata they will write.
Do not run ad hoc surface scans or full-disk tests on drives that may still be needed for imaging; each pass adds hours of read stress to a disk that may already be failing.

What to record before anything changes

Take photos of the bay order, drive labels and screen messages. Export logs if the NAS or controller allows it without writing to the array. Record the RAID level, number of disks, device model, serial numbers, volume names and the time when symptoms started.

For company environments, also list the services that used the array: accounting, SQL databases, ERP, file shares, surveillance recordings, VMware, Hyper-V, backup repositories or user profiles. This helps set priorities before diagnostics begin.
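The items above can be kept in one structured record so that nothing depends on memory later. The following is only a minimal sketch: the class names, field names and the checks in problems() are assumptions made for illustration, not a format required by any NAS vendor or laboratory.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DiskRecord:
    bay: int          # physical bay number as labelled on the enclosure
    serial: str       # serial number printed on the drive label
    model: str = ""
    note: str = ""    # e.g. "marked failed by NAS", "replaced last week"


@dataclass
class IncidentRecord:
    device_model: str
    raid_level: str
    symptoms_started: str                      # timestamp or free text
    messages: list[str] = field(default_factory=list)   # exact error texts
    volumes: list[str] = field(default_factory=list)
    services: list[str] = field(default_factory=list)   # e.g. "SQL", "VMware"
    disks: list[DiskRecord] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return gaps that would make the record ambiguous later."""
        issues = []
        bays = [d.bay for d in self.disks]
        serials = [d.serial for d in self.disks]
        if len(set(bays)) != len(bays):
            issues.append("duplicate bay number")
        if len(set(serials)) != len(serials):
            issues.append("duplicate serial number")
        if any(not s.strip() for s in serials):
            issues.append("missing serial number")
        if not self.messages:
            issues.append("no error message recorded")
        return issues

    def to_json(self) -> str:
        """Serialise the record so it can be stored away from the array."""
        return json.dumps(asdict(self), indent=2)
```

Saving the JSON on a laptop or phone, never on the affected volume, keeps the record available even if the enclosure stops responding.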

Why rebuild is not a universal rescue button

A rebuild is designed for a known failure that stays within the array's redundancy: one disk on RAID 5, up to two on RAID 6, with the remaining drives healthy. Real incidents are often messier. On RAID 5, a second disk may have unreadable sectors. On RAID 6, metadata may be stale after a power event. A NAS may mark the wrong disk as failed after a controller or firmware problem.

If the rebuild reads unstable disks for hours, it can stress the remaining members and write a new, incorrect state. In recovery work, an untouched inconsistent array is usually easier to analyse than an array that has been rebuilt several times in the wrong order.
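A rough calculation shows why a long rebuild read is a risk in itself. The sketch below assumes an unrecoverable read error rate of 1 in 10^14 bits, a figure often quoted on consumer drive datasheets, and assumes errors are independent. Both are simplifying assumptions: healthy drives often do better than the datasheet figure, and drives with pending sectors do much worse. The function name is hypothetical.

```python
import math


def p_read_error(bytes_to_read: float, ure_per_bit: float = 1e-14) -> float:
    """Probability of at least one unrecoverable read error.

    Assumes independent errors at a fixed per-bit rate (an idealised model).
    Computes 1 - (1 - p)^n with log1p/expm1 to stay accurate for tiny p.
    """
    bits = bytes_to_read * 8
    return -math.expm1(bits * math.log1p(-ure_per_bit))


# Example: a 4-disk RAID 5 of 4 TB drives has lost one member,
# so a rebuild must read the three survivors in full.
risk = p_read_error(3 * 4e12)
print(f"{risk:.2f}")  # about 0.62 under the stated assumptions
```

The exact number matters less than its order of magnitude: under these assumptions a full read of the surviving members is far from guaranteed to succeed, which is why imaging each disk separately comes before any rebuild.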

Common scenarios in Polish offices

A QNAP or Synology NAS reports a degraded volume after one drive drops out, but another disk already has pending sectors.
A server restart hides a volume because controller metadata or disk order no longer matches the previous state.
VMware or Hyper-V cannot see a datastore even though the physical disks still spin.
A backup NAS fails during a restore, leaving the company with neither production data nor a verified backup.
Someone starts a rebuild on the wrong replacement disk because labels and bay order were not preserved.

If QNAP or Synology is involved

Do not rely only on the colour of the LED or a single dashboard label. Export or photograph the storage pool status, RAID group state, disk serial numbers and recent system events if the panel is still responsive. If the NAS suggests repair, migration or expansion, pause until you know whether it will write new metadata.

If VMware, Hyper-V or a SAN is involved

A missing datastore is not just a file-copy problem. VMFS, VHDX chains, snapshots and thin-provisioned volumes need consistency across metadata and data blocks. Keep the hypervisor powered down or isolated from the affected storage so it does not keep retrying writes while the array state is unknown.

How to protect disk order during transport

Label every drive before it leaves the enclosure: bay number, serial number and original position. Pack disks separately so labels remain readable. In RAID recovery, a correct disk-order photo can save hours of reverse engineering and may prevent a wrong reconstruction attempt.
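To see why disk order matters, consider a deliberately simplified model: parity-free striping in fixed-size chunks, as in RAID 0. Real RAID 5 and RAID 6 layouts add rotating parity and differ between vendors, so the functions below are an illustrative sketch with hypothetical names, not a reconstruction tool.

```python
def stripe(data: bytes, n_disks: int, chunk: int) -> list[bytes]:
    """Split data across n_disks in round-robin chunks (no parity)."""
    disks = [bytearray() for _ in range(n_disks)]
    for i in range(0, len(data), chunk):
        disks[(i // chunk) % n_disks] += data[i:i + chunk]
    return [bytes(d) for d in disks]


def assemble(disks: list[bytes], chunk: int) -> bytes:
    """Reassemble striped data, reading the disks in the order given."""
    out = bytearray()
    offset = 0
    while any(offset < len(d) for d in disks):
        for d in disks:
            out += d[offset:offset + chunk]
        offset += chunk
    return bytes(out)


data = b"AAAABBBBCCCCDDDDEEEEFFFF"
disks = stripe(data, n_disks=3, chunk=4)

# Correct order restores the original byte stream.
print(assemble(disks, 4) == data)                      # True

# Swapping the first two disks scrambles every stripe.
print(assemble([disks[1], disks[0], disks[2]], 4))     # b'BBBBAAAACCCCEEEEDDDDFFFF'
```

With only three disks there are six possible orders; with eight there are 40,320, before parity rotation and chunk size are even considered. A bay-order photo removes that search entirely.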

How to brief management and users

Tell the team not to reconnect shares, restart services or copy “just the most important folder” from the damaged volume. For management, separate two decisions: business continuity and data preservation. A temporary workaround is useful only if it does not write to the failed array.

What the laboratory reconstructs from

The lab does not need the NAS panel to “look healthy”. It needs stable images of member disks, metadata fragments, parity layout, stripe size and the sequence of events. That is why the original state matters more than a rushed attempt to make the enclosure mount again.
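The role of parity can be shown with the XOR relation that single-parity RAID relies on: the parity block is the XOR of the data blocks in a stripe, so any one missing block equals the XOR of all the others. The sketch below works on a single stripe with made-up two-byte blocks; the function name is an assumption, and real arrays also need the parity rotation and stripe size mentioned above.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-sized blocks together, byte by byte."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same size")
    return bytes(reduce(xor, column) for column in zip(*blocks))


# One stripe on a hypothetical 4-disk array: three data blocks plus parity.
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xff\x00"
parity = xor_blocks([d0, d1, d2])

# The disk holding d1 is lost: recover it from the survivors.
print(xor_blocks([d0, d2, parity]) == d1)       # True

# If one survivor is stale, the result is wrong with no error raised.
stale_d2 = b"\x00\x00"
print(xor_blocks([d0, stale_d2, parity]) == d1)  # False
```

The second case is the reason a rebuild on an inconsistent array is dangerous: the arithmetic always produces an answer, and the controller writes it to disk whether or not the inputs were current.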

When the failed array is also the backup repository

Backup repositories deserve extra caution because they often contain compressed, deduplicated or versioned data. If a restore job was running during the failure, note which source and destination were involved. Do not start a new backup run until you know whether it would overwrite the last usable restore point.

Safe first-response checklist for IT and management

Stop all writes to the affected volume.
Photograph the front of the enclosure and label every disk before removal.
Save screenshots of controller, NAS and hypervisor messages.
Write down what changed: power loss, update, disk replacement, noise, slow access, failed backup.
Decide business priorities: which folders, databases or virtual machines must be recovered first.
Prepare a contact person who knows the infrastructure and can answer technical questions.

When the case should go to a laboratory

If the array holds accounting data, production files, SQL databases, virtual machines or a whole-company backup, do not treat it as a casual repair. A laboratory workflow starts by imaging member disks and reconstructing the array from copies, not by experimenting on the originals.

This is especially important for a degraded or offline RAID, for VMware, Hyper-V or SAN storage, and during the first 24 hours after a server or NAS failure.

What to send with the case description

Send the RAID level, number of drives, device model, disk order, error messages, last known good state and actions already taken. If the array contains personal data or business-critical files, describe the priority folders and any legal or operational deadline. Clear context helps avoid diagnosis based on guesswork.