Switches and Servers
All devices in a Hedgehog Fabric are divided into two groups: switches and servers, represented by the corresponding `Switch` and `Server` objects in the API. These objects define all of the participants of the Fabric and their roles in the Wiring Diagram, together with `Connection` objects (see Connections).
Switches
Switches are the main building blocks of the Fabric. They are represented by `Switch` objects in the API. These objects consist of basic metadata such as name, description, role, serial, and management port MAC, as well as port group speeds, port breakouts, ASN, IP addresses, and more. Additionally, a `Switch` contains a reference to a `SwitchProfile` object that defines the switch model and capabilities. More details can be found in the Switch Profiles and Port Naming section.
In order for the Fabric to manage a switch, either the `serial` or the management port `mac` must be defined in the switch's YAML document.
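As a sketch, a switch identified by its serial might look like the following (the API version, `boot` field, profile name, and role value are assumptions based on the Hedgehog wiring API; all names are illustrative):

```yaml
apiVersion: wiring.githedgehog.com/v1beta1
kind: Switch
metadata:
  name: leaf-01
  namespace: default
spec:
  profile: dell-s5248f-on   # illustrative SwitchProfile reference
  role: server-leaf
  boot:
    serial: ABC1234567        # either the serial ...
    # mac: 0c:20:12:fe:3d:22  # ... or the management port mac
```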
RDMA over Converged Ethernet (RoCE) version 2
RDMA over Converged Ethernet (RoCE) allows RDMA communication over conventional Ethernet devices. RoCE isn't available on every switch; check the switch catalog for `RoCE: true`. Enabling RoCE on a switch requires the switch to reboot in order to configure the hardware and associated queues. Once a switch is in RoCE mode, the port breakouts cannot be changed.
Warning
Users are advised to set the port breakouts as desired, and confirm the link is up before enabling RoCE.
RoCE Lossless Mode
When enabling RoCE on a switch, the buffers inside the switch are configured to be lossless, and ingress traffic is classified based on the DSCP value inside the IP packet header.
| Purpose | DSCP Values | Traffic Class |
|---|---|---|
| RDMA | 24 | 3 |
| RDMA | 26 | 3 |
| Congestion Notification | 48 | 6 |
| Unknown | all others | 0 |
The counters associated with the traffic classes are viewable using the `kubectl fabric inspect` command. Users are advised to test traffic and track the counters to ensure that proper end host configuration is achieved. RDMA-enabled software often bypasses the host software stack, which means that configuration with utilities such as `nft`, `iptables`, and `iproute2` will not affect RDMA traffic leaving the host.
When RoCE traffic is using VXLAN, the inner packet DSCP information is copied to the outer packet at the time of encapsulation. Likewise, the outer DSCP information is copied to the inner packet when the packet is decapsulated. This process preserves the traffic classification even through a VXLAN tunnel.
RoCE QPN Hashing Mode
RoCE traffic adds another input for hashing of traffic to ensure load sharing. The `ecmp.roceQPN` option enables the use of the queue pair number as part of the hashing calculation. It is recommended that RoCE users also enable this `ecmp` setting.
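A sketch of how these options might appear in a `Switch` spec (the `roce` and `ecmp.roceQPN` field names are taken from the text above; the API version and switch name are illustrative assumptions):

```yaml
apiVersion: wiring.githedgehog.com/v1beta1
kind: Switch
metadata:
  name: leaf-01
spec:
  roce: true        # reboots the switch into RoCE (lossless) mode
  ecmp:
    roceQPN: true   # include the queue pair number in ECMP hashing
```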
Switch Groups
A `SwitchGroup` is just a marker at this point and doesn't have any configuration options.
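Since it carries no configuration, a `SwitchGroup` is only a named object. A minimal sketch (the API version is an assumption based on the Hedgehog wiring API; the name is illustrative):

```yaml
apiVersion: wiring.githedgehog.com/v1beta1
kind: SwitchGroup
metadata:
  name: eslag-1
  namespace: default
spec: {}
```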
Redundancy Groups
Redundancy groups are used to define the redundancy between switches. A redundancy group is a regular `SwitchGroup` used by multiple switches, and it can currently be either MCLAG or ESLAG (EVPN MH / ESI). A switch can only belong to a single redundancy group.
MCLAG is only supported for pairs of switches, while ESLAG is supported for up to 4 switches. Multiple types of redundancy groups can be used in the fabric simultaneously.
Connections with types `mclag` and `eslag` are used to define server connections to switches. They are only supported if the switch belongs to a redundancy group with the corresponding type.
In order to define an MCLAG or ESLAG redundancy group, you need to create a `SwitchGroup` object and assign it to the switches using the `redundancy` field.
Example of a switch configured for ESLAG:
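A hedged sketch of such a configuration (the API version and the exact `redundancy` field shape are assumptions based on the Hedgehog wiring API; names are illustrative):

```yaml
apiVersion: wiring.githedgehog.com/v1beta1
kind: SwitchGroup
metadata:
  name: eslag-1
spec: {}
---
apiVersion: wiring.githedgehog.com/v1beta1
kind: Switch
metadata:
  name: leaf-01
spec:
  redundancy:
    group: eslag-1   # the SwitchGroup acting as the redundancy group
    type: eslag
```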
And an example of a switch configured for MCLAG:
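The MCLAG case differs only in the redundancy `type` (same assumptions as above; names are illustrative, and MCLAG groups contain exactly two switches):

```yaml
apiVersion: wiring.githedgehog.com/v1beta1
kind: SwitchGroup
metadata:
  name: mclag-1
spec: {}
---
apiVersion: wiring.githedgehog.com/v1beta1
kind: Switch
metadata:
  name: leaf-01
spec:
  redundancy:
    group: mclag-1
    type: mclag
```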
In the case of MCLAG, a special connection with type `mclag-domain` is required; it defines the peer and session links between the two switches. For more details, see Connections.
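A rough sketch of what an `mclag-domain` connection might look like (the `mclagDomain`, `peerLinks`, and `sessionLinks` field names, the port naming, and the API version are all assumptions; see the Connections section for the authoritative shape):

```yaml
apiVersion: wiring.githedgehog.com/v1beta1
kind: Connection
metadata:
  name: leaf-01--mclag-domain--leaf-02
spec:
  mclagDomain:
    peerLinks:         # data-plane links between the MCLAG pair
      - switch1:
          port: leaf-01/E1/1   # illustrative port names
        switch2:
          port: leaf-02/E1/1
    sessionLinks:      # control-plane (session) links
      - switch1:
          port: leaf-01/E1/2
        switch2:
          port: leaf-02/E1/2
```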
Servers
Regular workload server:
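A hedged sketch of a regular workload `Server` object (the API version is an assumption based on the Hedgehog wiring API; the name and description are illustrative):

```yaml
apiVersion: wiring.githedgehog.com/v1beta1
kind: Server
metadata:
  name: server-01
spec:
  description: "Regular workload server"
```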