As you might know, vINCEPTION is my term for what many others call "nested" virtualization. It's the peculiar moment when you realise VMware software can eat itself: you can run VMware ESXi in a virtual machine on top of VMware Fusion, Workstation or even VMware ESXi itself. I've been experimenting with running a nested version of VSAN in my home lab, primarily because I want to be able to run my own private version of EVO:RAIL in a nested environment.
As you probably/hopefully know by now, EVO:RAIL is a physical 2U appliance housing 4 independent server nodes. The EVO:RAIL is delivered by partners in a highly controlled process, so it's not like I could just slap the binaries that make up EVO:RAIL (which I have private access to from our buildweb) onto my existing homelab servers and expect it all to work. The EVO:RAIL team have worked very hard with our Qualified Partners to ensure consistency of experience; it's the partnership between a software vendor and hardware vendors that delivers the complete package.
Nonetheless we can, and do, have EVO:RAIL running in a nested environment (with some internal tweaks), and it's a sterling bit of work by one of our developers, Wit Riewrangboonya. I'm now responsible for maintaining, improving and updating our HOL, and if I'm honest I do feel very much like I'm standing on the shoulders of giants. If you have not checked out the EVO:RAIL HOL, it's over here: HOL-SDC-1428 VMware EVO:RAIL Introduction. Anyway, I wanted to go through the process of reproducing that environment on my homelab, mainly so I could absorb and understand what needed to be done to make it all work. And that's what inspired this blogpost. It turns out the problem I was experiencing had nothing to do with EVO:RAIL. It was a VSAN issue, and specifically a mistake I had made in the configuration of the vESXi nodes…
I managed to get the EVO:RAIL part working beautifully. The trouble was the VSAN component was not working as expected. I kept on getting "Failed to join the host in VSAN Cluster" on my 2nd nested EVO:RAIL appliance. Not being terrifically experienced with EVO:RAIL (I'm in Week 8) or VSAN (I'm into chapter 4 of Duncan & Cormac's book), I was a bit flummoxed.
I wasn't initially sure if this was a problem with EVO:RAIL, a VSAN networking issue (multicast and all that) or some special requirement needed in my personal lab to make it work (like some obscure VMX file entry that everyone else but me knows about). Looking back, there's some logic here that would have stopped me barking up the wrong tree. For instance, if the first 4 nodes (01-04) successfully joined and formed a VSAN cluster, then why wouldn't nodes 05-08? As I was working in a nested environment, I was concerned that perhaps I wasn't meeting the network requirements properly. This blogpost was very useful in convincing me this was NOT the case. But I'm referencing it because it's a bloody good troubleshooting article for situations where it is indeed the network!
http://blogs.vmware.com/vsphere/2014/09/virtual-san-networking-guidelines-multicast.html
You could kinda understand me thinking it was network related; after all, status messages on the host would appear to indicate this as a fact:
But this was merely a symptom, not a cause. The hosts COULD communicate with each other, but only once osfsd starts. No osfsd, no VSAN communication. That was indicated by the fact that the VSAN service, whilst enabled, had not started.
And after all, the status on the VSAN cluster clearly indicated that networking was not an issue. If it was, the network status would state a "misconfiguration"…
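As an aside, you don't have to rely on the Web Client to see this per-host state. Below is a rough pyVmomi sketch of the kind of check you could run against vCenter to see whether VSAN is enabled on each host and what its cluster membership health looks like. The vCenter hostname, the credentials and the exact fields printed are assumptions about my lab, not part of the EVO:RAIL tooling.

```python
# Sketch: list per-host VSAN enablement and cluster membership health via pyVmomi.
# vcenter.lab.local / credentials are placeholders for my nested lab.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only: skip certificate validation
si = SmartConnect(host="vcenter.lab.local",
                  user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
content = si.RetrieveContent()

# Walk every host in the inventory
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.HostSystem], True)
for host in view.view:
    vsan = host.configManager.vsanSystem
    if vsan is None or not vsan.config.enabled:
        print("%s: VSAN not enabled" % host.name)
        continue
    status = vsan.QueryHostStatus()  # per-host view of VSAN cluster membership
    print("%s: health=%s state=%s members=%d"
          % (host.name, status.health, status.nodeState.state,
             len(status.memberUuid or [])))

view.Destroy()
Disconnect(si)
```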
As an experiment I set up the first nested EVO:RAIL appliance, and then tried building the 2nd appliance on my own as if it were just another bunch of servers. I got pretty much exactly the same error. That discounted, in my mind, the idea that this issue had anything to do with the EVO:RAIL Configuration engine; the source of my problem lay elsewhere.
Of course, a resolution had been staring me in the face from way back. Whenever you get errors like this, Google is your friend. In fact (believe it or not) I would go so far as to say I love really cryptic and obtuse error messages. A search on "Failed to start osfsd (return code 1)" is likely to yield more specific results than some useless generic error message like "Error: An error has occurred". This took me to this community thread, which is quite old. It dates back six months or more, and is about some of the changes to VSAN introduced at GA. I must admit I did NOT read it closely enough.
https://communities.vmware.com/thread/473367?start=0&tstart=0
It led me to Cormac Hogan's VSAN Part 14 – Host Memory Requirements, where I read the following:
At a minimum, it is recommended that a host has at least 6GB of memory. If you configure a host to contain the maximum number of disks (7HDDs x 5 disk groups), then we recommend that the host contains 32GB of memory.
Sure enough, following this link to the online pubs page confirmed the same (not that I EVER doubted the Mighty Cormac Hogan for a second!).
A quick check of my vNested environment revealed that nodes 01-04 had only 5GB of RAM assigned to them, and inexplicably I'd configured nodes 05-08 with just 4GB of RAM. I'd failed to meet the minimum pre-reqs. Of course, you can imagine my response to this: Total FacePalm.
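If you want to save yourself the same facepalm, a quick sanity check like the sketch below would have caught it. Again, this is a pyVmomi sketch against my lab vCenter; the "node01" through "node08" naming and the credentials are assumptions about my environment, and the 6GB figure is the documented VSAN host minimum quoted above.

```python
# Sketch: flag any nested ESXi VM configured below the 6GB VSAN host minimum.
# VM naming (node01..node08) and vCenter details are placeholders for my lab.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

MIN_MB = 6 * 1024  # 6GB minimum per VSAN host

ctx = ssl._create_unverified_context()  # lab only: skip certificate validation
si = SmartConnect(host="vcenter.lab.local",
                  user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.VirtualMachine], True)
for vm in view.view:
    if not vm.name.lower().startswith("node"):  # my nested ESXi VMs
        continue
    mem_mb = vm.config.hardware.memoryMB
    verdict = "OK" if mem_mb >= MIN_MB else "BELOW VSAN MINIMUM"
    print("%s: %dMB - %s" % (vm.name, mem_mb, verdict))

view.Destroy()
Disconnect(si)
```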
Well, you live and don't learn. Always read the pre-reqs and RTFM before jumping in with both boots, especially if you're deviating from the normal config.