AWS Case Study | Fed Gov | PROTECTED Environment

PROTECTED Environment with NetApp

Industry: Federal Government

The customer faced several urgent issues in their on-premises IT environment that would soon have a significant impact on their ability to run mission-critical business applications. Of particular concern were their on-premises NetApp storage volumes, which were fast approaching 700TB and due to hit a physical capacity limit within six months.

Hitting this limit would severely restrict the customer’s ability to perform new fraud investigations, one of the core tenets of the organisation’s mission. In addition, the compute and networking hardware in the environment was reaching end of life and in need of refresh. Due to delays with ordering and receiving new hardware during the pandemic, an on-premises solution was ruled out because it could not address these issues in time.

 

The Solution

 

CyberCX had previously worked with the customer on the design and implementation of a PROTECTED-grade AWS environment. The secure Landing Zone/Control Tower environment was already running multiple production workloads and was fully integrated with the customer’s security ecosystem, which included federated IDAM, SIEM and dynamic threat detection/monitoring. A project was initiated to leverage this existing environment for the migration of the customer’s core digital forensics platform. This included a range of compute (EC2, EKS and Lambda) and database (RDS) instances, as well as dynamic high-performance Amazon WorkSpaces to be used for investigations.

To satisfy the capacity, availability and performance requirements of the existing on-premises solution, we chose Amazon FSx for NetApp ONTAP (FSxN), the cloud-native NetApp storage service. This AWS managed service is a fully featured NetApp solution that allows for secure, efficient and reliable replication of data via the standard SnapMirror feature. By establishing a native Storage Virtual Machine (SVM) peering relationship between the on-prem and cloud-hosted volumes, we were able to manage all cluster volumes centrally from both the NetApp console and the CLI.

We deployed the cloud-hosted FSxN volumes in a Multi-AZ (scale-up) configuration in active/standby mode. Due to throughput limits on VPN tunnel connections to the AWS Transit Gateway, we configured an individual throughput capacity of 1024MBps, which comes with in-memory caching of 128GB, NVMe caching of 1900 and a baseline SSD IOPS of 40000.
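The sizing described above can be sketched as the request parameters for an FSx for ONTAP file system. This is a minimal sketch rather than the configuration actually used: the function name, subnet IDs and storage capacity are hypothetical, and the choice of the `MULTI_AZ_1` deployment type and user-provisioned IOPS is an assumption, since the source only states Multi-AZ, 1024MBps and 40000 IOPS. The dictionary follows the shape accepted by the boto3 FSx `create_file_system` call, but no AWS call is made here.

```python
def build_fsx_ontap_request(storage_capacity_gib, subnet_ids, preferred_subnet_id):
    """Build keyword arguments for boto3's fsx.create_file_system (not called here).

    Assumptions (not from the case study): MULTI_AZ_1 deployment type,
    user-provisioned SSD IOPS, and caller-supplied capacity and subnets.
    """
    subnet_ids = list(subnet_ids)
    # A Multi-AZ file system spans exactly two subnets: preferred and standby.
    if len(subnet_ids) != 2:
        raise ValueError("Multi-AZ deployments need exactly two subnets")
    if preferred_subnet_id not in subnet_ids:
        raise ValueError("preferred subnet must be one of subnet_ids")

    return {
        "FileSystemType": "ONTAP",
        "StorageCapacity": storage_capacity_gib,  # SSD tier size in GiB (hypothetical)
        "SubnetIds": subnet_ids,
        "OntapConfiguration": {
            "DeploymentType": "MULTI_AZ_1",       # assumed; active/standby pair
            "ThroughputCapacity": 1024,           # MBps, as stated in the text
            "PreferredSubnetId": preferred_subnet_id,
            "DiskIopsConfiguration": {
                "Mode": "USER_PROVISIONED",       # assumed provisioning mode
                "Iops": 40000,                    # as stated in the text
            },
        },
    }


# Placeholder values for illustration only.
request = build_fsx_ontap_request(
    storage_capacity_gib=10240,
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],
    preferred_subnet_id="subnet-aaaa1111",
)
```

In practice the resulting dictionary would be passed as `boto3.client("fsx").create_file_system(**request)`, or expressed as the equivalent CloudFormation resource to match the Infrastructure as Code approach used elsewhere in the project.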

Once we enabled multipath secure VPN tunnels over a dedicated 10Gbps Direct Connect, we were able to stream data in parallel across an end-to-end encrypted link whilst maintaining application availability at source. One of the key benefits of SnapMirror is that it can compress and deduplicate data, which maximises replication efficiency and reduces storage redundancy. With this configuration, we were able to achieve close to the theoretical disk throughput of 1024MBps per VPN tunnel. A total of 576TB was securely and successfully transferred in just over 16 days.
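As a back-of-envelope check on these figures, the short calculation below converts the stated volume and duration into an average rate, and compares it with the time a sustained 1024MBps stream would need. It assumes decimal units (1TB = 1,000,000MB) and treats 576TB as the amount of data moved; the function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MB_PER_TB = 1_000_000  # decimal units, an assumption about how the figures are quoted


def average_rate_mb_per_s(total_tb, days):
    """Average transfer rate in MB/s over the whole migration window."""
    return total_tb * MB_PER_TB / (days * SECONDS_PER_DAY)


def days_at_rate(total_tb, rate_mb_per_s):
    """Days needed to move total_tb at a constant rate in MB/s."""
    return total_tb * MB_PER_TB / rate_mb_per_s / SECONDS_PER_DAY


avg = average_rate_mb_per_s(576, 16)   # roughly 417 MB/s averaged over 16 days
floor = days_at_rate(576, 1024)        # roughly 6.5 days at a sustained 1024 MB/s

print(f"average rate: {avg:.1f} MB/s")
print(f"time at a sustained 1024 MB/s: {floor:.1f} days")
```

Averaged over the full window, the transfer works out to about 36TB per day, so the near-peak throughput quoted above describes what individual streams achieved rather than the mean rate across all 16 days.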

 

Figure: End-to-end encryption in transit

 

Once all data had been replicated, we were able to establish effective storage tiering policies to optimise both performance (for regularly accessed and cached data) and cost (for archive and infrequently accessed data), leveraging the benefits of native FSx, S3 and Glacier AWS services.
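The FSx side of such a policy can be illustrated with the volume-level tiering settings that FSx for ONTAP exposes. This is a sketch under stated assumptions: the access-pattern labels (`hot`, `mixed`, `archive`) and the mapping itself are hypothetical and not taken from the project, and the 31-day cooling period is simply the service default for the `AUTO` policy. The returned dictionary matches the `TieringPolicy` structure used in the FSx volume API; the S3 and Glacier parts of the design are not covered.

```python
def tiering_policy(access_pattern):
    """Map a (hypothetical) access-pattern label to an FSx for ONTAP TieringPolicy."""
    if access_pattern == "hot":
        # Keep all data on the SSD tier for regularly accessed volumes.
        return {"Name": "NONE"}
    if access_pattern == "mixed":
        # Move blocks to the capacity pool once they have been cold for 31 days.
        return {"Name": "AUTO", "CoolingPeriod": 31}
    if access_pattern == "archive":
        # Send data straight to the lower-cost capacity pool tier.
        return {"Name": "ALL"}
    raise ValueError(f"unknown access pattern: {access_pattern!r}")


# Example: policies for three illustrative volume categories.
policies = {name: tiering_policy(name) for name in ("hot", "mixed", "archive")}
```

The design choice here is to decide tiering per volume, so that active investigation data stays on SSD while closed cases fall back to cheaper storage.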

Whilst the storage replication was taking place, we built out the compute and database infrastructure in a dedicated account across multiple AZs to improve the availability and recoverability of the new solution. This was done using Infrastructure as Code (IaC) and CI/CD, leveraging CloudFormation and the customer’s existing DevOps tooling.

After integration, performance and usability testing was complete, the production cutover to the new solution was implemented through a simple DNS change outside of business hours. Aside from an initial issue with populating the cache for ongoing investigations, and some performance tuning related to tiering efficiency, users reported that the responsiveness and performance of the new system were excellent, with no noticeable degradation of service between the cloud and on-prem platforms.

This solution was fully integrated into the customer’s established security ecosystem and was delivered to the PROTECTED standard. 

Figure: High-level design

 

The Result

 

This project was successfully delivered on time and on budget within a 16-week period with no outages to production or loss of data. We were able to avoid hitting capacity for the on-premises storage arrays and move towards a scalable and performant cloud-based solution. Additional benefits included a significantly reduced recovery time in the event of systems/datacentre failure as well as the ability to spin up and down environments on demand. 
