TCP makes communication easy: it handles things like reliability and packet ordering. But it is a general-purpose communication protocol and is not particularly well optimized for HPC applications, for several reasons.

First, TCP runs in the kernel of the OS. That is ideal for sharing a network connection among all the processes running on your box, but sharing requires overhead, and this introduces latency and variability, which is too costly when microseconds matter and latency consistency is paramount.

Second, TCP is designed to deal with the internet, where things happen in milliseconds. Within the datacenter, packet delivery time is measured in microseconds, and as a result TCP can introduce orders of magnitude of latency variability. This simply won't work for HPC workloads.

Finally, TCP handles general networking conditions quite well, but it is not well suited to the challenges that HPC applications introduce. One of these is a condition known as incast, where many servers try to send a lot of data to one single server at the same time, effectively overwhelming it. Incast can happen during the synchronization phase shown in the image below: the surge of traffic overloads the receiver, which drops packets, and the whole exchange becomes inefficient.
TCP also overreacts to packet loss and does not recover nearly fast enough, and that is the main reason behind the variability.
Jobs running on HPC clusters synchronize frequently, and to solve this problem AWS built EFA, a networking stack designed for the most demanding applications and optimized to take advantage of the AWS datacenter network and the Nitro controller. EFA starts with a communication library that you install on your instance. This library enables your application to hand messages directly to the Nitro controller, completely bypassing the kernel and TCP, with no resource usage on your instance. The hard work is done on the Nitro controller, where AWS implemented a transmission protocol called SRD. This is a networking protocol that they designed internally using deep knowledge of how their network operates and performs; that knowledge allowed them to take maximum advantage of excess throughput in the network, and to detect packet loss or delay in microseconds and retransmit several orders of magnitude faster than TCP. Because all of this is done on the Nitro controller, there is no resource consumption on the main server.
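As a quick sanity check of this kernel-bypass path, assuming the EFA software has been installed on an EFA-enabled instance (AWS distributes it via the aws-efa-installer package, which provides the kernel driver and a libfabric build), libfabric's `fi_info` utility should report the EFA provider. This is a sketch; the exact module name and output vary by driver and libfabric version:

```shell
# Confirm the EFA kernel module is loaded (module name assumed to be "efa").
lsmod | grep efa

# Ask libfabric which endpoints the EFA provider exposes; if this prints
# "provider: efa" entries, the userspace-to-Nitro path is available.
fi_info -p efa
```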
The HPC application therefore gets to use all of those resources for its running code while still having maximum network performance and scalability. EFA is designed for very demanding networked applications, so of course it deals with the problems listed above, such as incast.
The following example shows the variability in latency for existing TCP usage compared with the new EFA and SRD, and it demonstrates the improvement in variability: under incast, the fastest sender on the TCP side ends up slower than the senders on the EFA side when SRD is used.
The current HPC software stack on AWS EC2 looks like the following:
The following is the new stack with EFA, where the userspace libraries were built and contributed by AWS to work with the new EFA drivers:
SRD quickly detects the incast condition and adjusts within microseconds to give each stream its fair share.
EFA is a networking adapter designed to support user-space network communication, initially offered in the Amazon EC2 environment. The first release of EFA supports datagram send/receive operations and does not support connection-oriented or read/write operations. EFA supports unreliable datagrams (UD) as well as a new unordered, scalable reliable datagram protocol (SRD). SRD provides support for reliable datagrams and more complete error handling than typical RD, but, unlike RD, it does not support ordering or segmentation. A new queue pair type, IB_QPT_SRD, is added to expose SRD to applications.
User verbs are supported via a dedicated userspace libfabric provider.
Kernel verbs and in-kernel services are initially not supported. EFA-enabled EC2 instances have two different devices allocated, one for ENA (netdev) and one for EFA; the two are separate PCI devices with no in-kernel communication between them.
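A minimal way to see these two separate devices from inside an EFA-enabled instance is via `lspci`; this is a sketch, and the exact PCI device strings vary by instance type and driver version:

```shell
# The ENA netdev and the EFA adapter show up as distinct PCI devices.
lspci | grep -i "elastic network"   # ENA device
lspci | grep -i "elastic fabric"    # EFA device

# The EFA device is additionally registered with the RDMA subsystem:
ls /sys/class/infiniband/
```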
How to attach an EFA to instances during EC2 instance creation:
# aws ec2 run-instances --count 4 --region us-east-1 --image-id ami-ABCD \
    --instance-type c5n.18xlarge --placement GroupName=ABCD \
    --network-interfaces "DeleteOnTermination=true,DeviceIndex=0,SubnetId=subnet-ABCD,InterfaceType=efa,Groups=sg-ABCD"
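To confirm that the EFA interface was actually created, `describe-network-interfaces` supports an `interface-type` filter with the value `efa`. The region and the `--query` projection below are illustrative; the IDs are the same placeholders used in the launch command:

```shell
# List all network interfaces of type "efa" in the region.
aws ec2 describe-network-interfaces \
    --region us-east-1 \
    --filters Name=interface-type,Values=efa \
    --query "NetworkInterfaces[].{Id:NetworkInterfaceId,Subnet:SubnetId,Status:Status}"
```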
1. Subnet-local communication
2. Must have both an “allow all traffic within SG” Ingress and Egress rule
3. 1 EFA ENI per Instance
4. EFA ENIs can only be added at instance launch or to a stopped instance
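The self-referencing security group rules required by item 2 can be created as sketched below. `sg-ABCD` is the placeholder from the launch command; `--protocol -1` means all protocols, and pointing `--source-group` at the same group makes the rule self-referencing (the egress command is assumed to accept the same shorthand parameters as the ingress one):

```shell
# Allow all traffic between members of the same security group, both directions.
aws ec2 authorize-security-group-ingress \
    --group-id sg-ABCD --protocol -1 --source-group sg-ABCD

aws ec2 authorize-security-group-egress \
    --group-id sg-ABCD --protocol -1 --source-group sg-ABCD
```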
Scalable Reliable Datagram (SRD):
New protocol designed for AWS’s unique datacenter network
a. Network aware multipath routing
b. Guaranteed delivery
c. Orders of magnitude lower tail latency
d. No ordering guarantees
Implemented as part of AWS's third-generation Nitro chip
EFA exposes SRD as a reliable datagram interface
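This mapping can be inspected with libfabric's `fi_info` utility. As a sketch (the `-t` endpoint-type flag spelling is assumed from the fi_info man page), the EFA provider should report both an SRD-backed reliable datagram endpoint and a plain UD endpoint:

```shell
# Reliable datagram endpoint (backed by SRD on EFA hardware):
fi_info -p efa -t FI_EP_RDM

# Unreliable datagram (UD) endpoint:
fi_info -p efa -t FI_EP_DGRAM
```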
NOTE: Capabilities like EFA and SRD are closely tied to AWS placement groups; it is only through placement groups that AWS can deliver these benefits. Finally, if you would like to read more, plenty of material on this topic is available online.