RTGWG Working Group                                              P. Huo
Internet Draft                                                  G. Chen
Intended status: Informational                                ByteDance
Expires: February 23, 2025                                       C. Lin
                                                   New H3C Technologies
                                                                 H. Dai
                                                              ByteDance
                                                        August 23, 2024


          A OSF Framework for Artificial Intelligence (AI) Network
                     draft-hcl-rtgwg-osf-framework-01


Abstract

   This document describes a framework for Artificial Intelligence (AI)
   networks. In particular, the document identifies a set of AI network
   components, describes their interactions, and exemplifies the
   workflow of the control and data planes.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on February 23, 2025.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in


hcl, et al.                  Expires February 23, 2025        [Page 1]

Internet-Draft       A OSF Framework for AI Network        August 2024
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents


   1. Introduction...................................................3
      1.1. Requirements Language.....................................3
      1.2. Terminology...............................................3
   2. OSF Framework and Components...................................4
      2.1. Framework Overview........................................4
      2.2. OSF Functional Components.................................5
      2.3. OSF-TM....................................................5
      2.4. OSF-Ingress...............................................5
      2.5. OSF-Egress................................................5
      2.6. OSF-Forwarder.............................................6
      2.7. OSF-CFC...................................................6
   3. Deployment Considerations......................................6
   4. OSF Framework Workflow.........................................6
      4.1. OSF Topology Manage.......................................6
      4.2. Load Balancing in OSF Packet Transmission.................7
      4.3. OSF Congestion Control....................................8
         4.3.1. Credit-based Flow Control............................9
         4.3.2. Congestion Control Based on Link Quality Detection..10
      4.4. Rapid Link Failure Switchover in OSF.....................11
   5. Security Considerations.......................................11
   6. IANA Considerations...........................................11
   7. References....................................................11
      7.1. Normative References.....................................11
      7.2. Informative References...................................11
   Authors' Addresses...............................................12
























1. Introduction

   With the widespread application of Artificial Intelligence (AI), the
   demand for AI networks is increasing. As described in [I-D.draft-
   hcl-ai-network-problem-00], the number of model parameters used in
   AI training keeps growing. To meet the demands of large-scale AI
   training, AI training networks typically adopt a distributed cluster
   approach, which places the following new requirements on the
   network:

   o AI training networks need a new load balancing method that
      mitigates the uneven load caused by burst traffic and spreads
      traffic as evenly as possible across the available paths.

   o AI training networks require a new congestion control mechanism
      that quickly detects congestion where it occurs, communicates
      the congestion state, and then performs congestion control
      globally. Global, end-to-end congestion control is more
      efficient than congestion control performed only locally.

   o AI training networks need fast fault recovery.

   This document proposes an AI network architecture to meet the new
   requirements for AI networks.

1.1. Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

1.2. Terminology

   AI: Artificial Intelligence

   OSF: Open Scheduled Fabric

   OSF-TM: OSF Topology Manager

   OSF-Ingress: OSF Ingress router

   OSF-Egress: OSF Egress router

   OSF-Forwarder: OSF Forwarder Router

   OSF-CFC: OSF Credit-based Flow Control

2. OSF Framework and Components

2.1. Framework Overview

   A high-level view of the OSF framework, without expanding the
   functional entities in the network, is illustrated in Figure 1.



         +----------------------------------+
         |         Management Plane         |
         +----------------------------------+
         |           Control Plane          |
         +----------------------------------+
                         /\
                         ||
                         \/
         +----------------------------------+
         |           Data Plane             |
         +----------------------------------+
                        Figure 1: OSF Interactions

   The OSF network is divided into the following layers:

   o OSF Management Layer: Mainly responsible for monitoring,
      configuring, and maintaining OSF devices.

   o OSF Control Layer: Mainly responsible for maintaining OSF network
      topology information, congestion detection, and fast switching.

   o OSF Data Layer: Mainly responsible for encapsulating, forwarding,
      and decapsulating packets based on the routing information issued
      by the control layer, and for submitting packets to the
      authorization mechanism (see Section 4.3) to achieve congestion
      control.


















2.2. OSF Functional Components

                        +--------+
                        |OSF-TM  |
   Management           +--------+
   Plane
   -------------------------------------------------------
                             ^
   Control                   |
   Plane                     V
                         +--------+
                         |OSF-TM  |
                       /-+        +--
                      /  |        |  \
                     /   +---+----+   \
                    /        |         \
     +----+        /         |          \          +----+
     |Host|       /          |           \         |Host|
     +-+--+      /           |            \        +-+--+
       |        /            |             \         |
       | +--------+      +---+-----+      +-+------+ |
       +-+OSF-    +------+OSF-     +------+OSF-    +-+
         |Ingress |      |Forwarder|      |Egress  |
         |        |      |         |      |        |
         |OSF-TM  |      |OSF-TM   |      |OSF-TM  |
   ------|--------|------|---------|------|--------|--------
   Data  |OSF-CFC |      |OSF-CFC  |      |OSF-CFC |
   Plane |        |      |         |      |        |
         +--------+      +---------+      +--------+



2.3. OSF-TM

   OSF-TM discovers and maintains the OSF network topology. It collects
   node information, information about the internal connections between
   interfaces, and external interface information for each node, and
   uses this information to generate the OSF topology.

   OSF-TM is also responsible for maintaining the quality of all links,
   so that the best path for packet transmission can be selected based
   on link quality and network congestion can be avoided.

2.4. OSF-Ingress

   OSF-Ingress is the entry point for OSF data packets. It performs
   path selection and load balancing based on OSF-MS, encapsulates the
   packets, and sends them toward the OSF-Egress.

2.5. OSF-Egress

   OSF-Egress is the exit point for OSF data packets. It decapsulates
   and reorders the packets and delivers them to the recipient.

2.6. OSF-Forwarder

   OSF-Forwarder sits between OSF-Ingress and OSF-Egress. It forwards
   packets based on destination interface information, without
   inspecting the content of the packets.

2.7. OSF-CFC

   OSF-CFC operates at the data layer and is used for congestion
   control during OSF packet forwarding. For details of congestion
   control, refer to Section 4.3.

3. Deployment Considerations

   The OSF-TM and OSF-MS components at the control layer can operate in
   a centralized processing mode, a distributed processing mode, or a
   hybrid mode.

   In the centralized mode, all topology information, network quality
   information, and metric information are maintained centrally.

   In the distributed mode, all topology information and network
   quality information are maintained in a distributed manner across
   devices and synchronized.

   In the hybrid mode, stable information such as topology information
   is maintained in a distributed manner, while information that
   changes frequently, such as network quality information, is
   maintained centrally to reduce flooding in the network. Network
   metric information is then generated from the network quality
   information and is also maintained centrally.

4. OSF Framework Workflow

4.1. OSF Topology Manage

   In an OSF network, the OSF topology can be generated through a
   topology discovery protocol for use in load balancing across
   multiple paths during data forwarding.

   During load balancing, paths with poor link quality need to be
   excluded. OSF-TM is responsible for maintaining quality information
   for each link. Link quality information is detected and reported by
   the switches, and link state is synchronized through a link-state
   protocol.

   In the event of a link failure, the switches nearest the failure
   point need to detect it first and quickly notify the head end, which
   then performs global traffic scheduling.

   OSF-TM can be deployed in a distributed manner or in a centralized
   manner.

            +-------------+           +-------------+
            |             |           |             |
            |Spine1       |           |Spine2       |
            +--+--+--+--+-+           +-+--+---+--+-+
               /  |  |  x              /  /    |   \
              /   |  |   \            /  /     +    \
             / +--(--(----(----------+  /     /      \
            / /   |  |   +-(-----------+     /        \
           / /    |   \  |  +---------------)---+      \
         ++ /     ++   +-)---------+       /     \      \
        /  /        \    |          \     /       \      \
     +-+--+--+     +-+---+-+       +-+---+-+     +-+------+-+
     |       |     |       |       |       |     |          |
     |Leaf1  |     |Leaf2  |       |Leaf3  |     |Leaf4     |
     +-+---+-+     +-+---+-+       +-+---+-+     +-+-----+--+
       |   |         |   |           |   |         |     |
       H1  H2        H3  H4          H5  H6        H7    H8


   As shown in the figure, topology discovery is performed between
   devices to dynamically maintain the network topology.

   Each device maintains the link quality with its neighbors, and the
   overall link quality for all links is ultimately maintained by OSF-
   TM.

   When OSF-Ingress sends packets to OSF-Egress, path selection is
   based on the topology and link quality information.

   For example, in the diagram, when H1 sends a packet to H7 and
   network congestion occurs between Spine1 and Leaf4, Spine1 detects
   the congestion and notifies Leaf1. Leaf1 then reselects the path,
   changing the forwarding path to H1->Leaf1->Spine2->Leaf4->H7 based
   on the new information.
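
   The reselection in this example can be sketched as follows. This is
   a minimal illustration under stated assumptions, not part of the
   framework: the two-spine path enumeration, the congestion scale and
   threshold, and the names candidate_paths, path_congestion, and
   select_path are all hypothetical.

```python
# Hypothetical sketch: leaf-spine path selection that avoids congested
# links. Topology, threshold, and names are illustrative assumptions.

SPINES = ["Spine1", "Spine2"]
CONGESTION_THRESHOLD = 0.8  # assumed cut-off; the draft defines none


def candidate_paths(src_leaf, dst_leaf):
    """Enumerate the leaf -> spine -> leaf paths."""
    return [(src_leaf, spine, dst_leaf) for spine in SPINES]


def path_congestion(path, link_congestion):
    """Rate a path by its worst link (links are undirected here)."""
    links = zip(path, path[1:])
    return max(link_congestion.get(frozenset(link), 0.0)
               for link in links)


def select_path(src_leaf, dst_leaf, link_congestion):
    """Pick the least congested path below the threshold, or None."""
    usable = [
        p for p in candidate_paths(src_leaf, dst_leaf)
        if path_congestion(p, link_congestion) < CONGESTION_THRESHOLD
    ]
    if not usable:
        return None
    return min(usable, key=lambda p: path_congestion(p, link_congestion))


# Congestion level per link (0.0 to 1.0), as OSF-TM might hold it.
link_congestion = {}
before = select_path("Leaf1", "Leaf4", link_congestion)

# Spine1 reports congestion on its link to Leaf4; Leaf1 reselects.
link_congestion[frozenset(("Spine1", "Leaf4"))] = 0.95
after = select_path("Leaf1", "Leaf4", link_congestion)
```

   In this sketch, the path chosen before the congestion report goes
   through Spine1, and the path chosen afterwards goes through Spine2,
   matching the H1->Leaf1->Spine2->Leaf4->H7 path described above.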

4.2. Load Balancing in OSF Packet Transmission

   Traditional load balancing typically hashes on the five-tuple of
   each packet. In AI networks, however, there are few flows and each
   flow carries a large volume of traffic, so per-flow hashing can
   lead to imbalanced loads.

   OSF-Ingress needs to dynamically calculate the bandwidth for each
   path based on the path bandwidth maintained in OSF-TM, while
   disregarding paths with excessively high congestion levels.
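
   One possible reading of this calculation is sketched below: traffic
   is shared in proportion to the available bandwidth of each path, and
   paths above a congestion limit are skipped. The PathState fields,
   the congestion scale, the 0.8 limit, and the path_shares function
   are assumptions made for illustration; this document does not define
   them.

```python
# Hypothetical sketch: per-path traffic shares at OSF-Ingress.
from dataclasses import dataclass


@dataclass
class PathState:
    name: str
    available_gbps: float  # assumed to be supplied by OSF-TM
    congestion: float      # assumed scale: 0.0 (idle) to 1.0 (full)


def path_shares(paths, congestion_limit=0.8):
    """Return {path name: traffic share}.

    Shares are proportional to available bandwidth. Paths at or above
    the congestion limit, or with no bandwidth left, are skipped.
    """
    usable = [p for p in paths
              if p.congestion < congestion_limit
              and p.available_gbps > 0]
    total = sum(p.available_gbps for p in usable)
    if total == 0:
        return {}
    return {p.name: p.available_gbps / total for p in usable}


shares = path_shares([
    PathState("ECMP1", available_gbps=300.0, congestion=0.10),
    PathState("ECMP2", available_gbps=100.0, congestion=0.30),
    PathState("ECMP3", available_gbps=400.0, congestion=0.95),
])
```

   Here ECMP3 is excluded because of its congestion level, and the
   remaining traffic is split three to one between ECMP1 and ECMP2.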

   OSF-Ingress performs load balancing for packets destined for the
   same interface, which allows packets to be aggregated. Oversized
   packets can be fragmented into smaller segments for transmission to
   ensure a more balanced load. After aggregation or fragmentation,
   the packets need to be numbered in order and then sent sequentially
   over the ECMP links.


   The intermediate OSF-Forwarder devices forward the packets based on
   the final destination interface information, ultimately sending the
   packets to OSF-Egress. They do not need to inspect the content of
   the packets, reorder packets, or process packet fragmentation and
   aggregation.

   The OSF-Forwarder needs to process authorization requests through
   the OSF-CFC component. For details of the OSF-CFC handling process,
   refer to Section 4.3.1.

   When packets arrive at OSF-Egress, they need to be put back in
   sequence before being handed over to the receiver. If a received
   packet is an aggregate of multiple original packets, it is
   separated into the original packets before delivery. If the
   received packets are fragments of a large packet, they are
   reassembled into the complete packet before delivery.
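
   The fragmentation and reassembly steps could look like the sketch
   below. The fragment layout (packet identifier, index, total count,
   data) and the names fragment and reassemble are assumptions; this
   document does not specify an encapsulation.

```python
# Hypothetical sketch: fragment an oversized packet at OSF-Ingress and
# reassemble it at OSF-Egress. The fragment layout is an assumption.


def fragment(packet_id, data, max_size):
    """Split data into chunks of at most max_size bytes.

    Each fragment is (packet_id, index, total, chunk).
    """
    chunks = [data[i:i + max_size]
              for i in range(0, len(data), max_size)] or [b""]
    return [(packet_id, index, len(chunks), chunk)
            for index, chunk in enumerate(chunks)]


def reassemble(fragments):
    """Rebuild the data of one packet, or return None if incomplete."""
    if not fragments:
        return None
    total = fragments[0][2]
    by_index = {index: chunk
                for _pid, index, _total, chunk in fragments}
    if len(by_index) != total:
        return None  # still waiting for fragments
    return b"".join(by_index[i] for i in range(total))


frags = fragment(7, b"abcdefghij", max_size=4)

# Fragments may arrive in any order over different paths.
restored = reassemble([frags[2], frags[0], frags[1]])
incomplete = reassemble(frags[:2])
```

   The sketch returns the original data once all three fragments are
   present and returns None while a fragment is still missing.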

   As shown in the diagram below, for three packets p1, p2, and p3 of
   the same flow, OSF-Ingress no longer performs hash-based route
   selection. Instead, it sequentially selects the optimal ECMP paths
   for all packets, ensuring the maximum utilization of the bandwidth
   across all paths. To ensure sequential delivery at the receiving
   end, packets need to be numbered and sorted by OSF-Ingress before
   sending. The OSF-Forwarder devices along the path are responsible
   for forwarding the packets to OSF-Egress. At OSF-Egress, to ensure
   that the receiver can receive packets in the original order, the
   received packets need to be sorted before being delivered to the
   receiver. In the diagram, the order of the packets received by OSF-
   Egress is p3->p2->p1. After sorting by OSF-Egress, the packets are
   delivered to the receiver in the original order p1->p2->p3.

   If the packets have undergone fragmentation or aggregation at OSF-
   Ingress, they also need to be reconstructed into the original
   packets by OSF-Egress before being delivered to the receiver. This
   document does not specify the specific format and encapsulation of
   packet numbering.

                       +------------+
      +--------+-->p1  | ECMP1      +-   +-->p3 +---------+-->p1
      |OSF-    |       ============== \ /       |OSF-     |
      |        +-->p2  | ECMP2      +------->p2 |         +-->p2
      |Ingress |       ============== /\        |Egress   |
      |        +-->p3  | ...        +-  +--->p1 |         +-->p3
      +--------+       ==============           +---------+
                       | ECMPn      |
                       +------------+
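
   The numbering and reordering shown in the diagram can be sketched
   as follows. The sequence-number scheme is an assumption, since
   packet numbering is left unspecified here, and plain round-robin
   stands in for the selection of the optimal ECMP path. The names
   spray and ReorderBuffer are hypothetical.

```python
# Hypothetical sketch: per-packet spraying at OSF-Ingress and
# reordering at OSF-Egress. Numbering scheme and names are assumed.
from itertools import cycle


def spray(payloads, paths):
    """Number packets in order and assign paths round-robin."""
    path_iter = cycle(paths)
    return [(seq, next(path_iter), payload)
            for seq, payload in enumerate(payloads)]


class ReorderBuffer:
    """Holds out-of-order packets and releases them in sequence."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}

    def receive(self, seq, payload):
        """Accept one packet; return payloads now deliverable."""
        self.pending[seq] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered


sent = spray(["p1", "p2", "p3"], ["ECMP1", "ECMP2", "ECMP3"])

# Packets arrive in the order p3, p2, p1, as in the diagram.
egress = ReorderBuffer()
delivered = []
for seq, _path, payload in reversed(sent):
    delivered.extend(egress.receive(seq, payload))
```

   Nothing is delivered until p1 arrives, after which p1, p2, and p3
   are released together in their original order.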


4.3. OSF Congestion Control

   Network congestion severely degrades network performance, so it
   needs to be kept to a minimum. OSF-CFC implements end-to-end
   congestion control to reduce the likelihood of network congestion.

   Congestion control includes both active and passive control.

   Passive congestion control typically measures the current network
   state and provides feedback so that the sending end can react
   quickly to congestion. It controls network congestion by allocating
   rates based on the measurement information.

   Active congestion control aims to prevent congestion by sending
   data only when the network has sufficient available bandwidth. This
   document mainly describes the implementation of active congestion
   control.

   OSF adopts a passive congestion control mechanism at the control
   layer, which involves detecting the link state and adjusting the
   forwarding path to control network congestion.

   At the data layer, OSF adopts an active congestion control mechanism
   by deploying an authorization mechanism internally. Before sending
   packets, a scheduling process is initiated. The packet is only sent
   after successful scheduling, thereby implementing proactive
   congestion control and achieving deterministic and precise
   congestion control to strike the optimal balance between congestion
   and link load.

4.3.1. Credit-based Flow Control

   As shown in the diagram, before OSF-Ingress at the data layer sends
   packets to OSF-Egress, it first sends Credit Request messages along
   the path to request credit. Upon receiving the Credit Request, the
   OSF-Forwarder reserves bandwidth on the receiving interface and then
   continues to send Credit Request messages downstream to request
   credit, until all devices on the packet transmission path receive
   the Credit Request and respond with Credit ACK, reserving the
   receiving bandwidth.

         +--------+           +---------+          +---------+
         |OSF-    |           |OSF-     |          |OSF-     |
         |Ingress +-----------+Forwarder+----------+Egress   |
         +--------+           +---------+          +---------+
                   1: Credit             2:Credit
                      Request             Request
                   --------->            -------->

                   4:Credit              3: Credit
                     ACK                    ACK
                  <----------            <---------

                   5: Send Pkt           6: Send Pkt
                  ------------>          ----------->



   When transmitting within the OSF network, congestion control can be
   autonomously conducted by the data layer.

   It is recommended to deploy an authorization mechanism internally,
   where packets are scheduled before being sent. This proactive
   congestion control approach aims to achieve deterministic and
   precise congestion control by maintaining the optimal balance
   between congestion and link loads.

   Because network congestion significantly degrades network
   performance, it should be prevented where possible. OSF-CFC
   implements end-to-end congestion control to minimize incidents of
   network congestion.

   The workflow of OSF-CFC is as follows:

   Step 1: The OSF-Ingress sender initiates an authorization request
   when sending a packet, requesting the required queue bandwidth
   resources along the transmission path.

   Step 2: The next-hop node receives the authorization request and
   reserves queue bandwidth resources on the local egress interface. If
   resources are insufficient, it issues an authorization rejection and
   processing continues at Step 6.

   Step 3: If the destination interface is not local, resources are
   reserved on the egress interface and the authorization request is
   further forwarded to the next-hop node.

   Step 4: The final authorization request reaches the destination
   node, which responds with an authorization reply.

   Step 5: Upon receiving the authorization reply, the initiator sends
   the packet to the next-hop node.

   Step 6: Upon receiving an authorization rejection, the reserved
   resources on the local egress interface are released. If the device
   is not the initiator of the authorization request, the rejection is
   forwarded to the initiator.

   Step 7: Upon receiving an authorization rejection, the initiator
   releases the reserved resources on the local egress interface and
   notifies OSF-MS of the network congestion message.
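
   The steps above can be sketched as a hop-by-hop reservation with
   rollback on rejection. This is a simplified model under stated
   assumptions: each node is reduced to a single counter of free
   egress bandwidth, message exchange is replaced by function calls,
   and the names Node and request_credit are hypothetical. Releasing
   credit after the packet has been sent is not modelled.

```python
# Hypothetical sketch of the OSF-CFC workflow: reserve bandwidth hop
# by hop and roll back on rejection. Units and names are assumed.


class Node:
    def __init__(self, name, egress_capacity):
        self.name = name
        self.free = egress_capacity  # unreserved egress bandwidth

    def reserve(self, amount):
        """Reserve bandwidth; return False if resources are short."""
        if amount > self.free:
            return False
        self.free -= amount
        return True

    def release(self, amount):
        self.free += amount


def request_credit(path, amount):
    """Request credit along the path.

    Returns (True, None) if every node granted the request, or
    (False, rejecting node name) after upstream reservations have
    been released.
    """
    reserved = []
    for node in path:                  # Steps 1-3: request goes downstream
        if not node.reserve(amount):   # Step 2: insufficient resources
            for upstream in reversed(reserved):  # Steps 6-7: roll back
                upstream.release(amount)
            return False, node.name
        reserved.append(node)
    return True, None                  # Steps 4-5: reply reaches initiator


ingress = Node("OSF-Ingress", 100)
forwarder = Node("OSF-Forwarder", 40)
egress = Node("OSF-Egress", 100)
path = [ingress, forwarder, egress]

first = request_credit(path, 30)   # granted on every hop
second = request_credit(path, 30)  # forwarder has only 10 left
```

   In this run, the first request is granted and the second is
   rejected by the forwarder, after which the bandwidth that the
   ingress node reserved for the second request is released again.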

4.3.2. Congestion Control Based on Link Quality Detection

   At the control layer, all devices measure network link state and
   explicitly feed it back. OSF-Ingress then selects paths based on
   this network state information, excluding paths that show high
   congestion.

   The process involves link state detection, link state announcement,
   and path selection. All of these processes are completed within the
   OSF-TM component.

4.4. Rapid Link Failure Switchover in OSF

   As the size of the AI network grows, the number of network cards and
   optical modules increases, leading to a corresponding rise in the
   probability of failures.

   Current failure switchover mechanisms typically involve nearby
   devices handling path switching, but lack a global path scheduling
   method for rapid failure switchover. For OSF networks, it is
   essential for the control plane to incorporate a fast failure
   detection and notification mechanism, enabling nodes near the point
   of failure to swiftly detect issues and promptly notify the OSF-MS
   component. This allows the OSF-MS component to quickly adjust paths
   on a global scale in response to the detected failure.
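
   A global adjustment of this kind could, for instance, remove every
   path that crosses the failed link from the path sets of all ingress
   and egress pairs at once. The sketch below assumes a path table
   keyed by (ingress, egress) pair and a hypothetical function named
   prune_failed_link; neither is defined by this document.

```python
# Hypothetical sketch: global path adjustment after a link-failure
# notification. The path table layout and names are assumptions.


def prune_failed_link(path_table, failed_link):
    """Return a new path table without paths over failed_link.

    path_table maps (ingress, egress) to a list of paths, where each
    path is a tuple of node names. failed_link is an unordered pair
    of node names.
    """
    failed = frozenset(failed_link)
    return {
        pair: [p for p in paths
               if all(frozenset(hop) != failed
                      for hop in zip(p, p[1:]))]
        for pair, paths in path_table.items()
    }


path_table = {
    ("Leaf1", "Leaf4"): [("Leaf1", "Spine1", "Leaf4"),
                         ("Leaf1", "Spine2", "Leaf4")],
    ("Leaf2", "Leaf4"): [("Leaf2", "Spine1", "Leaf4"),
                         ("Leaf2", "Spine2", "Leaf4")],
    ("Leaf1", "Leaf2"): [("Leaf1", "Spine1", "Leaf2"),
                         ("Leaf1", "Spine2", "Leaf2")],
}

# A node near the failure reports that Spine1-Leaf4 is down.
updated = prune_failed_link(path_table, ("Spine1", "Leaf4"))
```

   After the update, both pairs that end at Leaf4 keep only their
   Spine2 path, while the Leaf1 to Leaf2 pair is unaffected.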

5. Security Considerations

   TBD.

6. IANA Considerations

   This document does not request any IANA allocations.


7. References

7.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119,
             DOI 10.17487/RFC2119, March 1997,
             <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
             2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
             May 2017, <https://www.rfc-editor.org/info/rfc8174>.

7.2. Informative References

   TBD







Authors' Addresses

   PengFei Huo
   ByteDance
   China
   Email: huopengfei@bytedance.com


   Gang Chen
   ByteDance
   China
   Email: chengang.gary@bytedance.com



   Changwang Lin
   New H3C Technologies
   China
   Email: linchangwang.04414@h3c.com


   Huichen Dai
   ByteDance
   China
   Email: daihuichen@bytedance.com

























