Contoso Labs Series - Table of Contents
In the prior post, we discussed how our major decision revolved around storage. The choices we made there would dictate the cost of the rest of the solution, so we had to make a decision there first, and carefully at that. Once we knew we'd be using Scale-Out File Servers and a converged fabric, we had to evaluate what that meant for our network.
Speeds and Feeds
Our first inclination was to get excited by all the possibilities and performance that the SOFS+SMB 3.0 stack could enable. How awesome is your life when you technologies like SMB Multichannel and SMB Direct (RDMA) at your disposal? Once our budget numbers came back, we realized we had to scale back our ambitions a little. While SMB Direct is amazing, it requires special RDMA-enabled 40GbE, 10GbE, or Infiniband cards. That presented a few serious problems.
- Dual port, RDMA-capable cards are expensive. Well over $500 a piece for a 10GbE example, with limited options because it is newer technology. Do the math for one per 288 nodes, and you see a tremendous cost.
- High speed NICs require drastically more capable and expensive switching and routing infrastructure. 48-port 10GbE switches cost 2-4x more than their 1Gbe counterparts. SFP+ cables often cost 5-10x or more of an equivalent length CAT-6 cable. When you need a thousand of them? Ouch.
- On top of that, because our compute nodes are Gen6 servers, we learned that it's actually frighteningly easy to overwhelm the PCIe bus in some RDMA scenarios. Our network could be TOO FAST FOR OLDER PCIe. What a great problem to have.
We decided to do the math to see if we could get away with using the existing connectivity in our scenario. The compute nodes currently have four 1GbE ports each. Frankly, that's not very much. 4Gbit of throughput, split between management, storage, AND tenant traffic is almost criminally low. We had to debate long and hard about whether we could get away with it. After long deliberation, we decided that our users and workload could manage with this limited throughput, since there won't be heavy stress being generated during steady-state. Our worst case scenario is boot storms, which we think we can manage. However, we're acknowledging there's serious risk here. Until we get users on and load test, we can't be sure what kind of user base we can support. Just like our customers, sometimes we need to learn useful things by doing them and seeing what happens.
In all honesty: This is not an architecture we'd recommend in any production environment meant for serious workloads. It does not provide enough room for I/O growth, and could too easily produce bottlenecks. On the flip side, we think it's an excellent dev/lab design that could make good use of older hardware, and provide a good experience. Keep that in mind as you read more of our architecture in the future.