VMware ESXi cannot see the datastore after a restart – LUN recovery [Enterprise case]

Restarting a VMware server after a power outage is the kind of event that can paralyze an entire IT infrastructure. Imagine bringing the server back up only to discover that the datastore is gone and none of the virtual machines can start. Alarms in vSphere point to missing VMDK files, while stress and panic keep rising. Incidents like this are every IT administrator’s nightmare, especially when they are responsible for business-critical corporate data.

In this article, we take a detailed look at what can cause this situation and how to regain access to the LUN effectively so that normal operations can be restored as quickly as possible.

In our case study, we show how we recovered 40 virtual machines for a financial corporation in just 48 hours. Thanks to detailed analysis, expert tools and a well-planned procedure, we not only saved the data but also minimized financial losses that could easily have reached hundreds of thousands of złoty. Are you ready to see how this operation unfolded and learn how to prepare for a similar event in your own environment? Read on to better understand VMFS recovery and effective datastore incident handling in VMware. When dealing with arrays and NAS systems, the safest route is to go straight to professional RAID data recovery rather than attempting blind rebuilds.

Why does the datastore disappear after a VMware server restart? Causes and symptoms

Restarting a VMware server after a power outage can lead to a situation in which the datastore disappears or becomes unavailable. The causes vary, but the most common are corruption of the LUN mapping on the SAN array, multipathing errors and RAID controller failures. If the VMFS partition table is damaged, the host can no longer interpret the volume layout and fails to mount the datastore.

As a result, the datastore may appear in vSphere as unknown or may not appear at all, while the virtual machine icons turn grey and cannot be started.

Symptoms of datastore loss after a restart

Symptoms of datastore issues are usually clear — administrators may notice warnings in ESXi logs such as Lost access to volume /vmfs/volumes/[UUID]. Even though the LUN may still be visible and online on the SAN array, the ESXi host may refuse to mount it, which means no access to the virtual machines and their VMDK files. In such a situation, it is critical not to take rash actions such as rescanning the HBA or removing the LUN, because this can destabilize the metadata and make the situation worse.
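As a minimal sketch of triaging such an incident, the snippet below scans vmkernel log text for "lost access" warnings and pulls out the affected volume UUIDs. The regex and the sample log line are illustrative assumptions — the exact message wording varies between ESXi versions, so a real triage script would need to match the format of your host's logs.

```python
import re

# Sketch of the vmkernel warning emitted when an ESXi host loses a VMFS
# volume. The exact wording differs between ESXi versions; this pattern
# matches the common "Lost access to volume <uuid> (<label>)" form.
LOST_ACCESS = re.compile(
    r"Lost access to volume\s+(?P<uuid>[0-9a-f-]+)\s+\((?P<label>[^)]+)\)"
)

def find_lost_volumes(log_text: str) -> list[tuple[str, str]]:
    """Return (uuid, datastore label) pairs for every 'lost access' event."""
    return [(m.group("uuid"), m.group("label"))
            for m in LOST_ACCESS.finditer(log_text)]

if __name__ == "__main__":
    # Hypothetical log line for illustration only.
    sample = (
        "2024-01-10T03:12:44Z vmkernel: WARNING: Lost access to volume "
        "5f3a2b1c-8d9e0f12-3456-0050569a1b2c (datastore-finance) due to "
        "connectivity issues.\n"
    )
    for uuid, label in find_lost_volumes(sample):
        print(f"datastore {label} (UUID {uuid}) reported lost")
```

A read-only scan like this is safe to run during an incident, unlike a rescan or remount, because it touches only the logs.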

Step by step to LUN recovery: our proven procedure

In these cases, RAID/NAS reconstruction is based on rebuilding the actual layout and parameters of the array, not on guesswork.

Step by step: datastore recovery

To recover a lost datastore effectively, you need to act methodically and carefully. The first step is to analyze and secure the current state of the infrastructure. It is important to stop all repair attempts immediately, because accidental actions may cause further damage. Next, we verify the condition of the SAN array and confirm that the LUN is in an optimal state. We also recommend creating a full ESXi configuration backup using vSphere CLI and, if possible, taking a snapshot of the LUN on the array.
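The "secure the current state first" rule above can be sketched as a read-only imaging pass: the source is never opened for writing, every chunk is hashed so the image can later be verified, and an unreadable region is padded with zeros instead of aborting the whole copy. This is an illustrative sketch, not our production imaging tooling; the error-handling path in particular is a simplified assumption.

```python
import hashlib

def image_device(src_path: str, dst_path: str, chunk_size: int = 1 << 20) -> str:
    """Copy a device or image file in fixed-size chunks, returning its SHA-256.

    The source is only ever opened read-only. Chunks that cannot be read are
    replaced with zeros and skipped (simplified stand-in for real bad-sector
    handling), so one damaged region does not abort the whole image.
    """
    digest = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            try:
                chunk = src.read(chunk_size)
            except OSError:
                # Unreadable region: pad with zeros and seek past it.
                chunk = b"\x00" * chunk_size
                src.seek(src.tell() + chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            dst.write(chunk)
    return digest.hexdigest()
```

All subsequent diagnostics and repair attempts then run against the image, never the original LUN.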

The next step is deep diagnostics, including disk imaging and RAID structure analysis. This allows us to reconstruct the array in a virtual environment and locate the beginning of the VMFS partition. If we find any damage to the partition table, we use specialist tools to repair VMFS according to its exact version.
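Locating the beginning of a lost partition in a raw image typically comes down to a signature scan: stepping through the image sector by sector and looking for a known on-disk magic value. The sketch below uses the GPT header signature b"EFI PART" as a concrete example; locating a VMFS volume itself works the same way but against the filesystem's own metadata signatures, which are outside the scope of this sketch.

```python
GPT_SIGNATURE = b"EFI PART"  # GPT header magic, normally found at LBA 1
SECTOR = 512                 # classic logical sector size

def find_signature(image: bytes, signature: bytes) -> list[int]:
    """Return the byte offset of every sector that starts with `signature`."""
    hits = []
    for off in range(0, len(image) - len(signature) + 1, SECTOR):
        if image[off:off + len(signature)] == signature:
            hits.append(off)
    return hits
```

If the scan finds the structure at an unexpected offset, that difference tells you how far the partition start has shifted — information the repair step then uses instead of guessing.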

It is extremely important to carry out all of these actions with caution in order to minimize the risk of further data loss during recovery. In cases of serious damage, we rely on proven methods that have repeatedly allowed us to restore virtual machines to operational condition efficiently.

Case study: how we recovered 40 virtual machines in 48 hours

Faced with a crisis, the financial company found itself in a situation that could have crippled its operations. After a planned VMware server restart, two of the five datastores disappeared, making it impossible to start forty virtual machines. In a rushed attempt to save the situation, the in-house administrators rescanned the storage adapter, which unfortunately damaged the partition table. Knowing how critical speed is in incidents like this, our team started emergency actions immediately.

The first four hours after the report were spent collecting the drives from the client’s office in Warsaw. We then focused on imaging 24 drives, each with a capacity of 2 TB. The next phase was a virtual reconstruction of the RAID 10 array and VMFS repair, which took another eight hours.
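The virtual RAID 10 reconstruction mentioned above can be illustrated in miniature: a RAID 10 volume stripes data across mirrored pairs, so once the real pair layout and stripe size are recovered from the controller metadata, the logical volume can be reassembled from the disk images in software. The function below is a simplified sketch under the assumption that disks[2i] and disks[2i+1] form a mirror pair; real arrays can use other member orderings.

```python
def rebuild_raid10(disks: list, stripe: int) -> bytes:
    """Reassemble a RAID 10 volume from an even-length list of disk images.

    disks[2i] and disks[2i+1] are assumed to be a mirrored pair (in practice
    the layout and stripe size must come from the controller metadata, not
    guesswork). A failed disk may be passed as None; its mirror is used.
    """
    pairs = []
    for a, b in zip(disks[0::2], disks[1::2]):
        if a is None and b is None:
            raise ValueError("both members of a mirror pair are missing")
        pairs.append(a if a is not None else b)

    out = bytearray()
    n_stripes = len(pairs[0]) // stripe
    for s in range(n_stripes):      # walk stripe rows down the volume...
        for disk in pairs:          # ...and across the mirrored pairs
            out += disk[s * stripe:(s + 1) * stripe]
    return bytes(out)
```

Because each pair holds two copies of its stripes, the reconstruction tolerates one missing disk per pair — which is exactly why imaging every member before any rebuild attempt matters so much.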

We divided the entire process into staged tasks to maximize efficiency. In the end, after 48 hours, we restored all 40 virtual machines, reducing downtime to only eight hours and saving the company an estimated PLN 500,000 in losses.