Structured testing of data centres
09 May 2016
A new data centre is expected to operate continuously for the design life of the facility. Even the shortest outage of the power or cooling systems can be disastrous for the IT services operating in the data centre.
Equipment failure and replacement strategies must be included in the design and implementation. Dave Wolfenden of Heatload explains how thorough testing of the data centre before it is handed over helps ensure the new facility lives up to expectations.
Any form of testing needs to take a structured approach. It doesn't matter whether you are testing software, hardware, a data centre or doing the MOT on a car. Without a structured approach it is easy to miss things that could later turn out to be a major challenge.
ASHRAE (American Society of Heating, Refrigerating and Air-Conditioning Engineers) defines five levels of testing:
• Level 1 – Factory acceptance testing
• Level 2 – Field component verification
• Level 3 – System construction verification
• Level 4 – Site acceptance testing
• Level 5 – Integrated System Testing (IST)
Level 1 to 3 testing begins with individual components at the manufacturing plant, such as fully testing a UPS. As construction progresses, systems are tested together to ensure that all of the components operate as a complete system, as designed. Typically, this level of testing uses large load banks located outside the IT space. For example, the load banks test the load capacity of chillers, generators, transformers and UPS units, individually and as a system.
Level 4 and 5 testing brings together all of the power and cooling systems supporting the data centre, as well as life safety, security and so on. This testing requires the load to be located within the IT space, with load banks that are much more granular in size, location and capacity. The load banks should replicate the final layout and capacity of the IT equipment.
There is a temptation to fill the data centre with large space heaters, each of 50 kW or more. This type of load is fine for testing the total capacity of the room. However, it is not suitable for room validation or for testing the IT layout. Larger capacity units are likely to bypass the power distribution, connecting directly to the PDUs. This type of load is also unsuitable for data centres that have some or all of the IT racks deployed.
For optimum testing, Heatload recommends that the heat load be sized and distributed to replicate the IT layout, and that it be connected to the power distribution.
Typically, Heatload is engaged during levels 4 and 5 of data centre testing, teaming up with a number of partners who are able to provide load banks in support of levels 1 to 3.
Test the Computational Fluid Dynamics (CFD) model
It is likely that the design has been tested using Computational Fluid Dynamics (CFD) modelling tools. Level 5 testing should include room validation that proves the model. This will give the end user of the facility full confidence. Once the data centre is operational, it is unlikely that further heat load testing can be carried out, so changes in the IT infrastructure should be modelled to fully understand their impact on the data centre. The best way of proving the model is to fully replicate the heat load layout. This means using a heat load that replicates that of the model and fully monitoring the testing using temperature sensors. Ideally the data centre should be flooded with temperature sensors, with multiple sensors at the front and rear of every rack location.
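As an illustration of how a model might be checked against the sensor data, the short Python sketch below flags every sensor point where the measured temperature differs from the CFD prediction by more than a tolerance. The `Reading` structure, the `compare_to_model` function and the 2 °C default tolerance are assumptions made for this example; they do not describe any particular sensor or CFD product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    rack: str       # rack location label, e.g. "A01" (hypothetical naming)
    position: str   # "front" (inlet) or "rear" (exhaust)
    temp_c: float   # temperature in degrees Celsius


def compare_to_model(predicted, measured, tolerance_c=2.0):
    """Return the sensor points where measurement and model disagree.

    Each result is (rack, position, predicted, measured, delta), where
    delta is measured minus predicted. tolerance_c is an assumed value;
    the acceptable deviation is a project decision.
    """
    predicted_by_point = {(r.rack, r.position): r.temp_c for r in predicted}
    deviations = []
    for m in measured:
        key = (m.rack, m.position)
        if key not in predicted_by_point:
            continue  # the model has no value for this sensor point
        delta = m.temp_c - predicted_by_point[key]
        if abs(delta) > tolerance_c:
            deviations.append(
                (m.rack, m.position, predicted_by_point[key], m.temp_c, round(delta, 2))
            )
    return deviations
```

Any point returned by the function is a candidate for investigation: either the model needs correcting or the installation differs from the design.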
Recent integration between the sensor manufacturer’s capture software and the CFD modelling software allows real-time modelling using real data. The original predictive model can then be compared with the real-time one. During 2016, Heatload is looking to offer this as a service.
Test at different IT capacities
The temptation is to test the data centre only at 100 percent IT capacity, on the assumption that IT has specified the load correctly and that migration of IT equipment will be rapid. This is not always the case. IT may have over-specified the load, and the migration plan might be too aggressive for the end users. The result can be a very slow ramp-up of IT load, which may never reach the design load. If the design team has assumed that the ramp-up will be rapid and that the specified IT capacity is correct, it may have designed the data centre in a way that causes significant operational problems. In some circumstances, if the IT load is very low, the UPS and generators will not hold the load, or cooling cannot be managed effectively. Lightly loaded systems can also waste massive amounts of energy.
It is optimistic to expect a recently commissioned data centre to go from zero to 100 percent capacity at the flick of a switch. Heatload recommends starting with a load of no more than 10 percent and allowing the facility to stabilise before moving on to the next incremental step.
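A stepped test plan of this kind can be sketched in a few lines of Python. The example assumes equal steps all the way up to the design load and treats the room as stable when recent temperature samples stay within a narrow band; both the step size after the first step and the 0.5 °C band are assumptions for illustration, not Heatload figures.

```python
def load_steps(design_load_kw, step_percent=10):
    """Return target loads in kW, rising in equal steps to the design load.

    step_percent is a whole-number percentage (assumed equal for every step).
    """
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be between 1 and 100")
    steps = []
    percent = step_percent
    while percent < 100:
        steps.append(design_load_kw * percent / 100)
        percent += step_percent
    steps.append(float(design_load_kw))  # always finish at the full design load
    return steps


def is_stable(recent_temps_c, max_spread_c=0.5):
    """Assumed stability rule: recent samples vary by no more than max_spread_c."""
    return len(recent_temps_c) >= 2 and (
        max(recent_temps_c) - min(recent_temps_c) <= max_spread_c
    )
```

For a 500 kW design load, `load_steps(500)` yields ten targets from 50 kW to 500 kW; the test team would hold each one until a check such as `is_stable` passes before applying the next.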
Really test your data centre
Exhaustive testing is crucial for a data centre. Test everything. When the power fails and the generators are starting, does the emergency lighting come on, avoiding panic and accidents, or is there a period of pitch black? The design may have several minutes of autonomy in the UPS and batteries to maintain the IT systems, but does the cooling continue at a sufficient level to prevent overheating? Test cause and effect: for example, when the fire alarms are set off, do you want the security doors to unlock or remain locked? Should the cooling be shut down, or just the fresh air ventilation closed off?
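One way to keep cause-and-effect testing structured is to record the agreed responses in a matrix and compare what was observed on site against it. The Python sketch below is a minimal example of that idea; the function name, the event and system labels, and the expected states used with it are placeholders, since the correct response to each event is a decision for the design team.

```python
def check_cause_and_effect(expected, observed):
    """Compare observed system states with the agreed cause-and-effect matrix.

    Both arguments map a cause (e.g. "mains_failure") to a dict of
    system -> state. Returns (cause, system, expected_state, observed_state)
    for every mismatch, including effects that were never recorded.
    """
    failures = []
    for cause, effects in expected.items():
        for system, want in effects.items():
            got = observed.get(cause, {}).get(system, "not recorded")
            if got != want:
                failures.append((cause, system, want, got))
    return failures
```

An empty result means every recorded response matched the matrix; anything else is a list of items to investigate before handover.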
Move the heat load around
Over the design life of the data centre, IT will go through several technology refreshes; this could result in hotter or cooler systems being deployed. As space is occupied, the original layout may no longer be valid. The technical limitations of IT systems may dictate a higher density layout than expected. Several customers test their data centres with different layouts:
• a uniform load, with heat load spread evenly throughout the data centre
• a varied load to try to represent the predicted IT load layout, with different racks having differing heat load capacities
• an uneven, worst-case load, with the majority of the heat load in one or more rows of racks and very little heat load in the rest of the data centre
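The three layouts can be expressed as per-rack load plans that all add up to the same total, which makes it easy to see how much load each rack position must carry. The Python sketch below assumes a simple grid of rows and racks; the function names, the relative weights for the varied layout and the 90 percent share used for the worst case are illustrative assumptions only.

```python
def uniform_layout(total_kw, rows, racks_per_row):
    """Spread the total heat load evenly over every rack."""
    per_rack = total_kw / (rows * racks_per_row)
    return [[per_rack] * racks_per_row for _ in range(rows)]


def varied_layout(total_kw, weights):
    """Scale a grid of relative rack densities so the loads sum to total_kw."""
    weight_sum = sum(sum(row) for row in weights)
    return [[total_kw * w / weight_sum for w in row] for row in weights]


def worst_case_layout(total_kw, rows, racks_per_row, hot_rows=1, hot_share=0.9):
    """Put hot_share of the load in the first hot_rows rows, the rest elsewhere.

    hot_share=0.9 is an assumed figure for "the majority of the heat load".
    """
    if not 0 < hot_rows < rows:
        raise ValueError("hot_rows must be at least 1 and less than rows")
    hot_per_rack = total_kw * hot_share / (hot_rows * racks_per_row)
    cold_per_rack = total_kw * (1 - hot_share) / ((rows - hot_rows) * racks_per_row)
    return [
        [hot_per_rack if r < hot_rows else cold_per_rack] * racks_per_row
        for r in range(rows)
    ]


def total_load(layout):
    """Sum the load of every rack in a layout, in kW."""
    return sum(sum(row) for row in layout)
```

With 400 kW over four rows of ten racks, the uniform plan gives 10 kW per rack, while the worst-case plan puts about 36 kW on each rack in the first row.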
During testing, problems will be discovered that require a fix. The temptation is to apply the fix, complete the failed test and move on to the next test. The risk is that the fix may have inadvertently invalidated previous tests. It is imperative that the fix be fully investigated and that any previous tests that could be affected be re-run.