Success Story #3

The Midnight Switch Run

5pm Friday. vSAN switch dead. Production at risk. Dell says 48 hours. We drove to three suppliers across the Bay Area and had the client back online by 1am.

8 hrs: Total resolution time
3: Suppliers visited
11pm: Switch located
1am: Production restored
The Situation

Friday at 5pm. The worst time.

There's a reason IT problems seem to happen on Friday afternoons. Hardware doesn't actually know what day it is—but a Friday-afternoon failure means you're racing the weekend: vendors close, suppliers shut down, normal support channels disappear.

When this client's Cisco Nexus switch died, their vSAN cluster lost network redundancy and fell back to a single path. That's not an immediate catastrophe, but it's one failure away from one. Every minute they ran like that was a gamble.

Dell's official answer: 48 hours minimum for replacement hardware. That meant running at risk through the weekend. Unacceptable.

The Timeline

8 hours across the Bay

5:12 PM

The Failure

Client's vSAN monitoring alerts go red. One of two 10Gb switches in the cluster has failed. Production VMs are limping on a single path. One more failure and everything goes down.

5:30 PM

Assessment

On-site. Switch is dead—no lights, no response. This is a Cisco Nexus 3048. Enterprise hardware. Not something you grab at Best Buy.

5:45 PM

The Dell Call

We call Dell (who supplied the original hardware). Next-day delivery isn't available. The earliest they can get a replacement: Monday morning. 48+ hours away.

6:00 PM

Option B

We start calling every enterprise hardware supplier, VAR, and reseller in the Bay Area. Most are closed. Some don't have compatible switches. A few might.

7:30 PM

First Stop: San Jose

A VAR in San Jose says they have 'something compatible.' We drive down. It's a different model with the wrong port count. No good.

9:00 PM

Second Stop: Oakland

Another lead in Oakland. They have a Nexus 3000 series switch—but it's the 3064. Different config. Might work, might not. Too risky.

11:15 PM

Third Stop: Fremont

Last chance: a smaller VAR in Fremont staying late because they're shipping to a data center tomorrow. They have a 3048. Same model. Same firmware capability. We buy it on the spot.

12:00 AM

Installation

Back at the client site. Switch racked, cabled, configured. vSAN rebuild begins. We watch the data resync (more on what that looks like just after the timeline).

1:15 AM

Production Online

vSAN healthy. Both paths active. Full redundancy restored. Eight hours from failure to full recovery.
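
A quick aside on "watching the data resync": once the new switch is in, vSAN starts moving data to restore full redundancy, and it reports how many bytes are still left to move. You poll that number until it hits zero. On a reasonably recent ESXi build, the check looks something like this (a sketch; these debug commands exist only on newer releases, and output format varies by version):

    # Totals: how many objects are resyncing and how many bytes remain
    esxcli vsan debug resync summary get

    # Per-object detail, if you want to see exactly what's still degraded
    esxcli vsan debug resync list

When the bytes-to-sync figure reaches zero and the health checks go green, you have real redundancy again—not just a second switch in the rack.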

The Technical Details

Why this was complicated

Why This Was Critical

vSAN requires network redundancy for data resilience. With one switch down, a single cable failure would have taken down the entire cluster—and every VM running on it.
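
To make "one failure away" concrete, here's roughly the triage you'd run from an ESXi host in that state. This is a sketch: the vmkernel interface name (vmk1) and the peer address are placeholders, and exact output varies by ESXi release.

    # Overall vSAN cluster state and membership
    esxcli vsan cluster get

    # Which vmkernel interfaces carry vSAN traffic
    esxcli vsan network list

    # Physical NIC link state: a dead switch shows its uplinks as down
    esxcli network nic list

    # Prove the surviving path works end to end with jumbo frames
    # (8972 bytes of payload + IP/ICMP headers = one 9000-byte frame, unfragmented)
    vmkping -I vmk1 -d -s 8972 192.168.110.12

With one switch dead, every host shows exactly one live uplink. That's the single thread production was hanging by.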

The Hardware Challenge

The Cisco Nexus 3048 isn't commodity hardware. It's a specific enterprise switch with specific features. You can't drop in just any 10Gb switch and expect the vSAN network to come back up cleanly.

Configuration Requirements

The replacement switch needed the same VLAN configuration, jumbo frames, and port channels as the original to integrate with the existing vSAN setup. Get the config wrong and you don't get a warning light; you get a partitioned or degraded cluster.
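
For illustration, here's a stripped-down sketch of the kind of NX-OS configuration that has to match. The VLAN ID, port-channel number, and interface numbers are invented for the example, and note that on the Nexus 3048, jumbo frames are typically enabled switch-wide through a network-qos policy rather than per port:

    ! Jumbo frames: the N3K sets L2 MTU via a system-wide network-qos policy
    policy-map type network-qos jumbo
      class type network-qos class-default
        mtu 9216
    system qos
      service-policy type network-qos jumbo

    ! Hypothetical vSAN VLAN
    vlan 110
      name vsan-traffic

    ! Uplink port channel toward the surviving switch
    interface port-channel10
      switchport mode trunk
      switchport trunk allowed vlan 110

    interface Ethernet1/49
      description uplink member
      channel-group 10 mode active

    ! Host-facing port for an ESXi vSAN uplink
    interface Ethernet1/1
      switchport mode trunk
      switchport trunk allowed vlan 110
      spanning-tree port type edge trunk

None of this is exotic on its own. The point is that all of it had to match the existing fabric exactly—at midnight, on hardware bought hours earlier.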

The Risk

Every hour the client operated on single-path connectivity was an hour where one more failure—a cable, a port, anything—would have been catastrophic.

Lessons Learned

What made this work

Vendor SLAs have limits

Next-business-day is great until Friday at 5pm. Enterprise support contracts don't always mean instant solutions. Sometimes you need to find your own.

Relationships matter

We knew which VARs to call because we've worked with them before. Those relationships—and their willingness to stay late—made this possible.

Know your hardware

Understanding exactly what was needed (not just 'a 10Gb switch') let us quickly evaluate options and avoid wasting time on incompatible equipment.

Spare parts are insurance

After this incident, the client keeps a spare switch on-site. The cost of one spare switch is nothing compared to eight hours of scrambling.

The Morning After

Monday came and went normally.

When employees came in Monday morning, they had no idea anything had happened. No outages. No data loss. No "we're working on it" emails. Just normal Monday.

That's the goal, really. When IT is done right, nothing happens. The drama stays in the server room. The midnight drives stay between us and the client. Everyone else just works.

Dell's replacement switch arrived Monday afternoon. We kept it as the hot spare. The client now has redundancy for their redundancy. And they know that "48 hours" isn't always the final answer.

Need IT support that won't quit?

When vendors say 48 hours and you need 4, that's when experience matters. Let's talk about your infrastructure.