I have been selected as the Routing Directorate reviewer for this draft. The Routing Directorate seeks to review all routing or routing-related drafts as they pass through IETF last call and IESG review, and sometimes on special request. The purpose of the review is to provide assistance to the Routing ADs. For more information about the Routing Directorate, please see http://trac.tools.ietf.org/area/rtg/trac/wiki/RtgDir Although these comments are primarily for the use of the Routing ADs, it would be helpful if you could consider them along with any other IETF Last Call comments that you receive, and strive to resolve them through discussion or by updating the draft. Document: draft-rtg-dt-encap-02 Reviewer: Jamal Hadi Salim Review Date: 6/30/15 (later than requested, sorry) Intended Status: Informational WG LC End Date: unknown Summary: The document has significant good work and recommendations for encapsulation design. Many years of experience in issues found with encapsulation deployments are discussed. There are times where I lost track of what the document was about because issues were being discussed without making recommendations on what is needed from an encapsulation perspective to deal with those issues; on the other hand, a good read is section 18, which mentions an issue and in the same breath suggests how a design should handle said issue. The document needs at least one more pass. I have some minor concerns about this document that I believe are resolvable. Annotated comments attached. cheers, jamal > > > > RTGWG E. Nordmark (ed) > Internet-Draft Arista Networks > Intended status: Informational A. Tian > Expires: November 22, 2015 Ericsson Inc. > J. Gross > VMware > J. Hudson > Brocade Communications Systems, > Inc. > L. Kreeger > Cisco Systems, Inc. > P. Garg > Microsoft > P. Thaler > Broadcom Corporation > T. Herbert > Google > May 21, 2015 > > > Encapsulation Considerations > draft-rtg-dt-encap-02 > [...] > 2. Overview [..] > > [I-D.wijnands-bier-architecture], and > [I-D.wijnands-mpls-bier-encapsulation]. We assume the reader has > some basic familiarity with those proposed encapsulations. The > Related Work section points at some prior work that relates to the > encapsulation considerations in this document. > > Encapsulation protocols typically have some unique information that > they need to carry. In some cases that information might be modified > along the path and in other cases it is constant. The in-flight > modifications has impacts on what it means to provide security for > the encapsulation headers. > o NVO3 carries a VNI Identifier edge to edge which is not modified. > There has been OAM discussions in the WG and it isn't clear > whether some of the OAM information might be modified in flight. > o SFC carries service meta-data which might be modified or > unmodified as the packets follow the service path. SFC talks of Being a little picky, how about: "SFC carries service meta-data which might be modified as the packets follow the service path." > some loop avoidance mechanism which is likely to result in > modifications for for each hop in the service chain even if the > meta-data is unmodified. > o BIER carries a bitmap of egress ports to which a packet should be > delivered, and as the packet is forwarded down different paths > different bits are cleared in that bitmap. > > Even if information isn't modified in flight there might be devices > that wish to inspect that information.
For instance, one can > envision future NVO3 security devices which filter based on the > virtual network identifier. > > The need for extensibility is different across the protocols > o NVO3 might need some extensions for OAM and security. > o SFC is all about carrying service meta-data along a path, and > different services might need different types and amount of meta- > data. > o BIER might need variable number of bits in their bitmaps, or other > future schemes to scale up to larger network. > The extensibility needs and constraints might be different when > considering hardware vs. software implementations of the > encapsulation headers. NIC hardware might have different constraints > than switch hardware. > [...] [..] > > 6. Terminology > > The capitalized keyword MUST is used as defined in > http://en.wikipedia.org/wiki/Julmust > Missing the context on what looks like a high calorie delicious drink. and should that be https?;-> > TBD: Refer to existing documents for at least NVO3 and SFC > terminology. We use at least the VNI ID in this document. > > > 7. Entropy > > In many cases the encapsulation format needs to enable ECMP in > unmodified routers. Those routers might use different fields in TCP/ > UDP packets to do ECMP without a risk of reordering a flow. > > The common way to do ECMP-enabled encapsulation over IP today is to > add a UDP header and to use UDP with the UDP source port carrying > entropy from the inner/original packet headers as in LISP [RFC6830]. > The total entropy consists of 14 bits in the UDP source port (using > the ephemeral port range) plus the outer IP addresses which seems to > be sufficient for entropy; using outer IPv6 headers would give the > option for more entropy should it be needed in the future. > > In some environments it might be fine to use all 16 bits of the port > range. However, middleboxes might make assumptions about the system > ports or user ports. But they should not make any assumptions about > the ports in the Dynamic and/or Private Port range, which have the > two MSBs set to 11b. > > The UDP source port might change over the lifetime of an encapsulated > flow, for instance for DoS mitigation or re-balancing load across > ECMP. Shouldnt the above statement bear a little more discussion/comment? What happens to packet ordering then? > > There is some interaction between entropy and OAM and extensibility > mechanism. It is desirable to be able to send OAM packets to follow > the same path as network packets. Hence OAM packets should use the > same entropy mechanism as data packets. While routers might use > information in addition the entropy field and outer IP header, they > can not use arbitrary parts of the encapsulation header since that > might result in OAM frames taking a different path. Likewise if > routers look past the encapsulation header they need to be aware of > the extensibility mechanism(s) in the encapsulation format to be able > to find the inner headers in the presence of extensions; OAM frames > might use some extensions e.g. for timestamps. > [..] > Note that in the proposed BIER encapsulation > [I-D.wijnands-mpls-bier-encapsulation], there is an an 8-bit field > which specifies an entropy value that can be used for load balancing > purposes. This entropy is for the BIER forwarding decisions, which > is independent of any outer delivery ECMP between BIER routers. Thus > it is not part of the delivery ECMP discussed in this section. 
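Coming back to the source-port entropy text above, just to check my reading of it: a minimal sketch, assuming a flow hash computed over the inner headers is already available (the function and field names here are mine, not the draft's), of how an encapsulator would fold that hash into the UDP source port while staying inside the Dynamic/Private range with the two MSBs set to 11b.

   #include <stdint.h>

   /* Illustration only: keep the two MSBs at 11b (ports 49152-65535)
    * so middleboxes see a Dynamic/Private port, leaving 14 bits of
    * entropy taken from the inner flow hash. */
   static uint16_t entropy_source_port(uint32_t inner_flow_hash)
   {
       return (uint16_t)(0xC000u | (inner_flow_hash & 0x3FFFu));
   }

If that matches the intent, then together with the outer addresses (and the IPv6 flow label where available) the amount of entropy seems fine to me.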
> [Note: For any given bit in BIER (that identifies an exit from the > BIER domain) there might be multiple immediate next hops. The > BIER entropy field is used to select that next hop as part of BIER > processing. The BIER forwarding process may do equal cost load > balancing, but the load balancing procedure MUST choose the same > path for any two packets have the same entropy value.] "... two packets that have the same ..." > > In summary: > o The entropy is associated with the transport, that is an outer IP > header or MPLS. > o In the case of IP transport use >=14 bits of UDP source port, plus > outer IPv6 flowid for entropy. > Looks like a typo. <=14 bits? > > 8. Next-protocol indication > [..] > > Secondly, the encapsulation needs to indicate the type of its > payload, which is in scope for the design of the encapsulation. We > have existing protocols which use Ethernet types (such as GRE). Here > each encapsulation header can potentially makes its own choices > between: > o Reuse Ethernet types - makes it easy to carry existing L2 and L3 > protocols including IPv6, IPv6, and Ethernet. Disadvantages are > that it is a 16 bit number and we probably need far less than 100 > values, and the number space is controlled by the IEEE 802 RAC > with its own allocation policies. If i understood correctly what "reuse" implies: you are suggesting a new super-ethertype whose content space will carry an additional type semantic so you never have to go back to IEEE? > o Reuse IP protocol numbers - makes it easy to carry e.g., ESP in > addition to IP and Etnernet but brings in all existing protocol Run a spell checker "Ethernet" above.. > numbers many of which would never be used directly on top of the > encapsulation protocol. IANA managed eight bit values, presumably > more difficult to get an assigned number than to get a transport > port assignment. > o Define their own next-protocol number space, which can use fewer > bits than an Ethernet type and give more flexibility, but at the > cost of administering that numbering space (presumably by the > IANA). > > Thirdly, if the IETF ends up defining multiple encapsulations at > about the same time, and there is some chance that multiple such > encapsulations can be combined in the same packet, there is a > question whether it makes sense to use a common approach and > numbering space for the encapsulation across the different protocols. > A common approach might not be beneficial as long as there is only > one way to indicate e.g., SFC inside NVO3. > > Many Internet protocols use fixed values (typically managed by the > IANA function) for their next-protocol field. That facilitates > interpretation of packets by middleboxes and e.g., for debugging > purposes, but might make the protocol evolution inflexible. Our > collective experience with MPLS shows an alternative where the label > can be viewed as an index to a table containing processing > instructions and the table content can be managed in different ways. Would it not be useful to provide a reference here? Just reading this has questions popping for me - who populates this tag-indexed table of instructions and could interop be impacted? > Encapsulations might want to consider the tradeoffs between such more > flexible versus more fixed approaches. > > In summary: > o Would it be useful for the IETF come up with a common scheme for > encapsulation protocols? If not each encapsulation can define its > own scheme. > In my view it would be hard to come up with a ring to rule them all. 
There are cases where simple is good enough and asking someone to carry a christmas tree is the wrong answer. And, yes, there are cases where (to quote Mencken) the answer is clear, simple and wrong (especially in one-off-use-cases which then are refactored to fit into square pegs). My suggestion is to not be too clever in answering the question above. > > 9. MTU and Fragmentation > > A common approach today is to assume that the underlay have > sufficient MTU to carry the encapsulated packets without any > fragmentation and reassembly at the tunnel endpoints. That is > sufficient when the operator of the ingress and egress have full > control of the paths between those endpoints. And it makes for > simpler (hardware) implementations if fragmentation and reassembly > can be avoided. > [..] > Encapsulations could also define an optional tunnel fragmentation and > reassembly mechanism which would be useful in the case when the > operator doesn't have full control of the path, or when the protocol > gets deployed outside of its original intended context. Such a > mechanism would be required if the underlay might have a path MTU > which makes it impossible to carry at least 1518 bytes (if offering > Ethernet service), or at least 1280 (if offering IPv6 service). The > use of such a protocol mechanism could be triggered by receiving a > PTB. But such a mechanism might not be implemented by all > encapsulators and decapsulators. [Aerolink is one example of such a > protocol.] > Reference to Aerolink and the sins committed would be useful. I googled aerolink and found references of some radio thing running over IP. Given IP provides the fragmention service above, why is aerolink not capable of this mechanism? I think there's a simple answer; just reading this didnt help. > Depending on the payload carried by the encapsulation there are some > additional possibilities: > [..] > In summary: > o In some deployments an encapsulation can assume well-managed MTU > hence no need for fragmentation and reassembly related to the > encapsulation. > o Even so, it makes sense for ingress to track any ICMP packet too > big addressed to ingress to be able to log any MTU > misconfigurations. > o Should an encapsulation protocol be depoyed outside of the spell checker: deployed? > original context it might very well need support for fragmentation > and reassembly. > > > 10. OAM > > The OAM area is seeing active development in the IETF with > discussions (at least) in NVO3 and SFC working groups, plus the new > LIME WG looking at architecture and YANG models. > > The design team has take a narrow view of OAM to explore the > potential OAM implications on the encapsulation format. > > In terms of what we have heard from the various working groups there > seem to be needs to: > o Be able to send out-of-band OAM messages - that potentially should > follow the same path through the network as some flow of data > packets. > * Such OAM messages should not accidentally be decapsulated and > forwarded to the end stations. > * Be able to add OAM information to data packets that are > encapsulated. Discussions have been around Add a semicolon so it reads "Discussions have been around:" and then more indentation is needed for the next two bullets below to fit under above bullet. > * Using a bit in the OAM to synchronize sampling of counters > between the encapsulator and decapsulator. > * Optional timestamps, sequence numbers, etc for more detailed > measurements between encapsulator and decapsulator. 
> o Usable for both proactive monitoring (akin to BFD) and reactive > checks (akin to traceroute to pin-point a failure) > > To ensure that the OAM messages can follow the same path the OAM > messages need to get the same ECMP (and LAG hashing) results as a > given data flow. An encapsulator can choose between one of: > > [..] > > > o Limit ECMP hashing to not look past the UDP header i.e. the > entropy needs to be in the source/destination IP and UDP ports > o Make OAM packets look the same as data packets i.e. the initial > part of the OAM payload has the inner Ethernet, IP, TCP/UDP > headers as a payload. (This approach was taken in TRILL out of > necessity since there is no UDP header.) Any OAM bit in the > encapsulation header must in any case be excluded from the > entropy. > Does it make sense to have inband OAM info? I.e., carried alongside the data (sure, a request for a path trace doesn't fit, but inband health info may fit); in such a case OAM info could be carried in something like a TLV. > There can be several ways to prevent OAM packets from accidentally > being forwarded to the end station using: > o A bit in the frame (as in TRILL) indicating OAM > o A next-protocol indication with a designated value for "none" or > "oam". > This assumes that the bit or next protocol, respectively, would not > affect entropy/ECMP in the underlay. However, the next-protocol > field might be used to provide differentiated treatement of packets > based on their payload; for instance a TCP vs. IPsec ESP payload > might be handled differently. Based on that observation it might be > undesirable to overload the next protocol with the OAM drop behavior, > resulting in a preference for having a bit to indicate that the > packet should be forwarded to the end station after decapsulation. [..] > > 11. Security Considerations > > Different encapsulation use cases will have different requirements > around security. For instance, when encapsulation is used to build > overlay networks for network virtualization, isolation between > virtual networks may be paramount. BIER support of multicast may > entail different security requirements than encapsulation for > unicast. > > In real deployment, the security of the underlying network may be > considered for determining the level of security needed in the > encapsulation layer. However for the purposes of this discussion, we > assume that network security is out of scope and that the underlying > network does not itself provide adequate or as least uniform security > mechanisms for encapsulation. I found the above paragraph awkward to read. How about simplifying: "This document assumes that the underlying network does not itself provide adequate or at least uniform security mechanisms for encapsulation. The authors understand that the underlying network security could provide useful input into the security needs of the encapsulation layer but ignore it here to keep the discussion focused." > > There are at least three considerations for security: > o Anti-spoofing/virtual network isolation > o Interaction with packet level security such as IPsec or DTLS So would IPsec not be considered "underlying network security"? > o Privacy (e.g., VNI ID confidentially for NVO3) > Confidentiality is one - but what about integrity of the VNI? > This section uses a VNI ID in NVO3 as an example. A SFC or BIER > encapsulation is likely to have fields with similar security and > privacy requirements. > > 11.1.
Encapsulation-specific considerations > > Some of these considerations appear for a new encapsulation, and > others are more specific to network virtualization in datacenters. > o New attack vectors: > * DDOS on specific queued/paths by attempting to reproduce the > 5-tuple hash for targeted connections. > * Entropy in outer 5-tuple may be too little or predictable. > * Leakage of identifying information in the encapsulation header > for an encrypted payload. > * Vulnerabilities of using global values in fields like VNI ID. > o Trusted versus untrusted tenants in network virtualization: > * The criticality of virtual network isolation depends on whether > tenants are trusted or untrusted. In the most extreme cases, > tenants might not only be untrusted but may be considered > hostile. So would confidentiality then become a requirement to address this? It is more readable to make suggestions on each issue on what needs to be done. > * For a trusted set of users (e.g. a private cloud) it may be > sufficient to have just a virtual network identifier to provide > isolation. Packets inadvertently crossing virtual networks > should be dropped similar to a TCP packet with a corrupted port > being received on the wrong connection. > * In the presence of untrusted users (e.g. a public cloud) the > virtual network identifier must be adequately protected against > corruption and verified for integrity. This case may warrant > keyed integrity. Ok, i guess integrity does show up here; should have mentioned it earlier? > o Different forms of isolation: > * Isolation could be blocking all traffic between tenants (or > except as allowed by some firewall) > * Could also be about performance isolation i.e. one tenant can > overload the network in a way that affects other tenants > * Physical isolation of traffic for different tenants in network > may be required, as well as required restrictions that tenants > may have on where their packets may be routed. > o New attack vectors from untrusted tenants: > * Third party VMs with untrusted tenants allows internally borne > attacks within data centers > * Hostile VMs inside the system may exist (e.g. public cloud) > * Internally launched DDOS > * Passive snooping for mis-delivered packets > * Mitigate damage and detection in event that a VM is able to > circumvent isolation mechanisms [...] [..] > 11.4. In summary: > > o Encapsulations need extensibility mechanisms to be able to add > security features like cookies and secure hashes protecting the > encapsulation header. > o NVO3 proably has specific higher requirements relating to > isolation for network virtualization, which is in scope for the > NVO3 WG/ "remove the "/" > o Our collective IETF experience is that succesful protocols get > deployed outside of the original intended context, hence the > initial assumptions about the threat model might become invalid. > That needs to be considered in the standardization of new > encapsulations. So whats the recommendation here? Over-engineer in case something is needed later? > > > 12. QoS > > In the Internet architecture we support QoS using the Differentiated > Services Code Points (DSCP) in the formerly named Type-of-Service > field in the IPv4 header, and in the Traffic-Class field in the IPv6 Its been at least a decade since the change, do you really need to say "formerly named ToS"? > header. The ToS and TC fields also contain the two ECN bits. Provide a cross-reference to section 13 for ECN? > > We have existing specifications how to process those bits. 
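As an aside on "existing specifications how to process those bits": to make that concrete, my mental model of the common case at the encapsulator is simply propagating the inner markings outward - the uniform DSCP model of RFC 2983 and the RFC 6040 normal mode for ECN. The sketch below is mine, not the draft's (the struct and names are invented); the more interesting decapsulation-side combination rules of RFC 6040 are deliberately not shown.

   #include <stdint.h>

   /* Invented representation of the DSCP/ECN bits of an IP header. */
   struct tos_bits {
       uint8_t dscp;   /* 6 bits */
       uint8_t ecn;    /* 2 bits */
   };

   /* Encapsulation side only: copy inner DSCP (RFC 2983 uniform
    * model) and inner ECN (RFC 6040 normal mode) into the outer
    * header.  RFC 6040 also defines a compatibility mode (outer ECN
    * set to Not-ECT) and the decapsulation rules, not shown here. */
   static void set_outer_tos(const struct tos_bits *inner,
                             struct tos_bits *outer)
   {
       outer->dscp = inner->dscp;
       outer->ecn  = inner->ecn;
   }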
See > [RFC2983] for diffserv handling, which specifies how the received > DSCP value is used to set the DSCP value in an outer IP header when > encapsulating. (There are also existing specifications how DSCP can > be mapped to layer2 priorities.) > [..] > > 13. Congestion Considerations > > Additional encapsulation headers does not introduce anything new for > Explicit Congestion Notification. It is just like IP-in-IP and IPsec > tunnels which is specified in [RFC6040] in terms of how the ECN bits > in the inner and outer header are handled when encapsulating and > decapsulating packets. Thus new encapsulations can more or less > include that by reference. > There are additional considerations around carrying non-congestion > controlled traffic. These details have been worked out in > [I-D.ietf-mpls-in-udp]. As specified in [RFC5405]: "IP-based traffic > is generally assumed to be congestion-controlled, i.e., it is assumed > that the transport protocols generating IP-based traffic at the > sender already employ mechanisms that are sufficient to address > congestion on the path Consequently, a tunnel carrying IP-based "." needed between "path" and "Consequently" > traffic should already interact appropriately with other traffic > sharing the path, and specific congestion control mechanisms for the > tunnel are not necessary". Those considerations are being captured > in [I-D.ietf-tsvwg-rfc5405bis]. > [..] > > One could make the encapsulation header be extensible to that it can > carry sufficient information to be able to measure resource usage, > delays, and congestion. The suggestions in the OAM section about a > single bit for counter synchronization, and optional timestamps > and/or sequence numbers, could be part of such an approach. There > might also be additional congestion-control extensions to be carried > in the encapsulation. Overall this results in a consideration to be > able to have sufficient extensibility in the encapsulation to be > handle to handle potential future developments in this space. > get rid of "to be handle" so it reads: "...extensibility in the encapsulation to handle ..." > Coarse measurements are likely to suffice, at least for circuit- > breaker-like purposes, see [I-D.wei-tsvwg-tunnel-congestion-feedback] > and [I-D.briscoe-conex-data-centre] for examples on active work in > this area via use of ECN. [RFC6040] Appendix C is also relevant. > The outer ECN bits seem sufficient (at least when everything uses > ECN) to do this course measurements. Needs some more study for the > case when there are also drops; might need to exchange counters > between ingress and egress to handle drops. > > Circuit breakers are not sufficient to make a network with different > congestion control when the goal is to provide a predictable service > to different tenants. The fallback would be to rate limit different > traffic. > > In summary: > o Leverage the existing approach in [RFC6040] for ECN handling. > o If the encapsulation can carry non-IP, hence non-congestion > controlled traffic, then leverage the approach in > [I-D.ietf-mpls-in-udp]. > o "Watch this space" for circuit breakers. > Hopefully coming soon ;-> > > 14. Header Protection > > Many UDP based encapsulations such as VXLAN [RFC7348] either > discourage or explicitly disallow the use of UDP checksums. The > reason is that the UDP checksum covers the entire payload of the > packet and switching ASICs are typically optimized to look at only a > small set of headers as the packet passes through the switch. 
In > these case, computing a checksum over the packet is very expensive. > (Software endpoints and the NICs used with them generally do not have > the same issue as they need to look at the entire packet anyways.) > [..] > verify that checksum or, if incapable, drop the packet. The > assumption is that configuration and/or control-plane capability > exchanges can be used when different receiver have different checksum > validation capabilities. > > In summary: > o Encapsulations need extensibility to be able to add checksum/CRC > for the encapsulation header itself. > o When the encapsulation has a checksum/CRC, include the IPv6 > pseudo-header in it. > o The checksum/CRC can potentially be avoided when cryptographic > protection is applied to to the encapsulation. > get rid of one of the "to" > > 15. Extensibility Considerations > > Protocol extensibility is the concept that a networking protocol may > be extended to include new use cases or functionality that were not > part of the original protocol specification. Extensibility may be > used to add security, control, management, or performance features to > a protocol. A solution may allow private extensions for > customization or experimentation. > [..] > > In some cases it might be more appropriate to define a new inner > protocol which can carry the new functionality instead of extending > the outer protocol. Examples where this works well is in the IP/ > transport split, where the earlier architecture had a single NCP Is a ref for NCP needed? > protocol which carried both the hop-by-hop semantics which are now in > IP, and the end-to-end semantics which are now in TCP. Such a split > is effective when different nodes need to act upon the different > information. Applying this for general protocol extensibility > through nesting is not well understood, and does result in longer > header chains. Furthermore, our experience with IPv6 extension > headers [RFC2460] in middleboxes indicates that the approach does not "...indicates that the header chaining approach does not" Is this bad experience documented somewhere? A reference or some clarification would help. > help with middlebox traversal. > > Many protocol definitions include some number of reserved fields or > bits which can be used for future extension. VXLAN is an example of > a protocol that includes reserved bits which are subsequently being > > [..] > > Extending a protocol header with new fields can be done in several > ways. > o TLVs are a very popular method used in such protocols as IP and > TCP. Depending on the type field size and structure, TLVs can > offer a virtually unlimited range of extensions. A disadvantage > of TLVs is that processing them can be verbose, quite complicated, > several validations must often be done for each TLV, and there is I think if you make such strong comments you need to quantify them. A TLV is a formal structure with well defined characteristics. You could write efficient code to parse, identify and validate TLVs. How is it verbose to process etc? > no deterministic ordering for a list of TLVs. TCP serves as an The reason deterministic ordering would matter is if there's dependencies between the TLVs. If that is a huge need, then the document needs to provide a sample space or explanation why that is important. > example of a protocol where TLVs have been successfully used (i.e. > required for protocol operation). 
IP is an example of a protocol > that allows TLVs but are rarely used in practice (router fast > paths usually that assume no IP options). Note that TCP TLVs are > implemented in software as well as (NIC) hardware handling various > forms of TCP offload. > o Extension headers are closely related to TLVs. These also carry > type/value information, but instead of being a list of TLVs within > a single protocol header, each one is in its own protocol header. The main difference seems to be in the fact that in a list of header extensions, the current extension describes the next; whereas in TLVs there is no such relationship; otherwise the T in TLV is an extension header. One imposes ordering, the other doesnt really. > IPv6 extension headers and SFC NSH are examples of this technique. > Similar to TLVs these offer a wide range of extensibility, but > have similarly complex processing. Another difference with TLVs > is that each extension header is idempotent. This is beneficial > in cases where a protocol implements a push/pop model for header > elements like service chaining, but makes it more difficult group > correlated information within one protocol header. > [..] > o Flag-fields are a non-TLV like method of extending a protocol > header. The basic idea is that the header contains a set of > flags, where each set flags corresponds to optional field that is > present in the header. GRE is an example of a protocol that > employs this mechanism. The fields are present in the header in > the order of the flags, and the length of each field is fixed. > Flag-fields are simpler to process compared to TLVs, having fewer > validations and the order of the optional fields is deterministic. > A disadvantage is that range of possible extensions with flag- > fields is smaller than TLVs. Qualify with "much smaller" maybe? > > The requirements for receiving unknown or unimplemented extensible > elements in an encapsulation protocol (flags, TLVs, optional fields) > need to be specified. There are two parties to consider, middle > boxes and terminal endpoints of encapsulation (at the decapsulator). > [..] > For handling unknown options at terminal nodes, there are two > possibilities: drop packet or accept while ignoring the unknown > options. Many Internet protocols specify that reserved flags must be > set to zero on transmission and ignored on reception. L2TP is > example data protocol that has such flags. GRE is a notable > exception to this rule, reserved flag bits 1-5 cannot be ignored > [RFC2890]. For TCP and IPv4, implementations must ignore optional > TLVs with unknown type; however in IPv6 if a packet contains an > unknown extension header (unrecognized next header type) the packet > must be dropped with an ICMP error message returned. The IPv6 > options themselves (encoded inside the destinations options or hop- > by-hop options extension header) have more flexibility. There bits sub/There/The > in the option code are used to instruct the receiver whether to > ignore, silently drop, or drop and send error if the option is > unknown. Some protocols define a "mandatory bit" that can is set > with TLVs to indicate that an option must not be ignored. > Conceptually, optional data elements can only be ignored if they are > idempotent and do not alter how the rest of the packet is parsed or > processed. > > Depending on what type of protocol evolution one can predict, it > might make sense to have an way for a sender to express that the "... have a way..." > > > [...] > > 16. 
Layering Considerations > [...] > The layering also has some implications for middleboxes. > o A device on the path between the ingress and egress is allowed to > transparently inspect all layers of the protocol stack and drop or > forward, but not transparently modify anything but the layer in > which they operate. What this means is that an IP router is > allowed modify the outer IP ttl and ECN bits, but not the > encapsulation header or inner headers and payload. And a BIER > router is allowed to modify the BIER header. > o Alternatively such a device can become visible at a higher layer. > E.g., a middlebox could become an decapsulate + function + > encapsulate which means it will generate a new encapsulation > header. "a middlebox could first decapsulate, perform some function then encapsulate; which means it will generate a new encapsulation header." > > The design team asked itself some additional questions: > o Would it make sense to have a common encapsulation base header > (for OAM, security?, etc) and then followed by the specific > information for NVO3, SFC, BIER? Given that there are separate > proposals and the set of information needing to be carried > differs, and the extensibility needs might be different, it would > be difficult and not that useful to have a common base header. > o With a base header in place, one could view the different > functions (NVO3, SFC, and BIER) as different extensions to that > base header resulting in encodings which are more space optimal by > not repeating the same base header. The base header would only be > repeated when there is an additional IP (and hence UDP) header. > That could mean a single length field (to skip to get to the > payload after all the encapsulation headers). That might be > technically feasible, but it would create a lot of dependencies > between different WGs making it harder to make progress. Compare > with the potential savings in packet size. > Agreed. > > 17. Service model > > The IP service is lossy and subject to reordering. In order to avoid > a performance impact on transports like TCP the handling of packets > is designed to avoid reordering packets that are in the same > transport flow (which is typically identified by the 5-tuple). But > across such flows the receiver can see different ordering for a given > sender. That is the case for a unicast vs. a multicast flow from the > same sender. > > There is a general tussle between the desire for high capacity > utilization across a multipath network and the import on packet where you say "import" did you mean: "importance" or "impact"? > ordering within the same flow (which results in lower transport > protocol performance). That isn't affected by the introduction of an > encapsulation. However, the encapsulation comes with some entropy, > and there might be cases where folks want to change that in response > to overload or failures. For instance, might want to change UDP "For instance, one might want ..." > source port to try different ECMP route. Such changes can result in > packet reordering within a flow, hence would need to be done > infrequently and with care e.g., by identifying packet trains. > Is there a reference to work which says quiet periods (which i am implicitly reading that in the text above) can be used to change the hash selection? I would think that one needs to closely observe packet trends to make such a decision. So please provide some ref to some scholarly or engineering work. 
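On my own question just above: the way I would read "identifying packet trains" is something like the sketch below - only re-pick the entropy when the flow has been idle for longer than the worst-case difference in path delay, so there are no packets in flight that could be overtaken. That is my interpretation, not something the draft states, and the names and the threshold value are invented for illustration; a pointer to whatever work the authors had in mind would still be welcome.

   #include <stdbool.h>
   #include <stdint.h>

   struct encap_flow {
       uint16_t src_port;      /* current UDP source port (entropy)  */
       uint64_t last_tx_ns;    /* when the previous packet was sent  */
   };

   /* Assumed to exceed the maximum delay skew between the old and the
    * new ECMP path, so earlier packets of the train have drained. */
   #define IDLE_GAP_NS (50ull * 1000 * 1000)   /* 50 ms, made up */

   static void maybe_repick_entropy(struct encap_flow *f,
                                    uint64_t now_ns,
                                    uint16_t candidate_port)
   {
       if (now_ns - f->last_tx_ns > IDLE_GAP_NS)
           f->src_port = candidate_port;   /* safe point to re-balance */
       f->last_tx_ns = now_ns;
   }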
> There might be some applications/services which are not able to > handle reordering across flows. The IETF has defined pseudo-wires > [RFC3985] which provides the ability to ensure ordering (implemented > using sequence numbers and/or timestamps). > What are you recommending? To use techniques defined in RFC3985? > Architectural such services would make sense, but as a separate layer > on top of an encapsulation protocol. They could be deployed between > ingress and egress of a tunnel which uses some encaps. Potentially > the tunnel control points at the ingress and egress could become a > platform for fixing suboptimal behavior elsewhere in the network. > That would clearly be undesirable in the general case. However, > handling encapsulation of non-IP traffic hence non-congestion- > controlled traffic is likely to be required, which implies some > fairness and/or QoS policing on the ingress and egress devices. > > But the tunnels could potentially do more like increase reliability > (retransmissions, FEC) or load spreading using e.g. MP-TCP between > ingress and egress. > > > 18. Hardware Friendly > > Hosts, switches and routers often leverage capabilities in the > hardware to accelerate packet encapsulation, decapsulation and > forwarding. > > Some design considerations in encapsulation that leverage these > hardware capabilities may result in more efficiently packet > processing and higher overall protocol throughput. > > While "hardware friendliness" can be viewed as unnecessary > considerations for a design, part of the motivation for considering > this is ease of deployment; being able to leverage existing NIC and > switch chips for at least a useful subset of the functionality that > the new encapsulation provides. The other part is the ease of > implementing new NICs and switch/router chips that support the > encapsulation at ever increasing line rates. > > [disclaimer] There are many different types of hardware in any given > network, each maybe better at some tasks while worse at others. We > would still recommend protocol designers to examine the specific > hardware that are likely to be used in their networks and make > decisions on a case by case basis. > > Some considerations are: > o Keep the encap header small. Switches and routers usually only > read the first small number of bytes into the fast memory for > quick processing and easy manipulation. The bulk of the packets > are usually stored in slow memory. A big encap header may not fit > and additional read from the slow memory will hurt the overall > performance and throughput. > o Put important information at the beginning of the encapsulation > header. The reasoning is similar as explained in the previous > point. If important information are located at the beginning of > the encapsulation header, the packet may be processed with smaller > number of bytes to be read into the fast memory and improve > performance. > o Avoid full packet checksums in the encapsulation if possible. > Encapsulations should instead consider adding their own checksum > which covers the encapsulation header and any IPv6 pseudo-header. > The motivation is that most of the switch/router hardware make > switching/forwarding decisions by reading and examining only the > first certain number of bytes in the packet. Most of the body of > the packet do not need to be processed normally.
If we are > concerned of preventing packet to be misdelivered due to memory > errors, consider only perform header checksums. Note that NIC > chips can typically already do full packet checksums for TCP/UDP, > while adding a header checksum might require adding some hardware > support. > o Place important information at fixed offset in the encapsulation > header. Packet processing hardware may be capable of parallel > processing. If important information can be found at fixed > offset, different part of the encapsulation header may be > processed by different hardware units in parallel (for example > multiple table lookups may be launched in parallel). It is easier > for hardware to handle optional information when the information, > if present, can be found in ideally one place, but in general, in > as few places as possible. That facilitates parallel processing. > TLV encoding with unconstrained order typically does not have that > property. > o Limit the number of header combinations. In many cases the > hardware can explore different combinations of headers in > parallel, however there is some added cost for this. > I think this section is well done. In regard to TLVs, I now understand a little more where the earlier comments come from (IMO: you will need to add a pointer to this section from the earlier TLV discussion). Having said that, let's weigh the pros and cons: pro: TLVs are very flexible - they almost give you future-proofness in terms of extensibility. con: harder to parallelize in hardware. I think the pro side should be driving things. I would say to the hardware folks - get busy now! I am still unsure why this is hard to do in hardware given all the benefits. At the expense of getting tomatoes thrown at me: sounds like there's an extra parsing step in hardware to find each individual TLV's "fixed offset", and after that you can parallelize.
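To put some code behind my claim above that parsing, identifying and validating TLVs is not that onerous in software, here is a minimal sketch (the 1-byte type / 1-byte length encoding is hypothetical, not tied to any of the proposed encapsulations):

   #include <stddef.h>
   #include <stdint.h>

   /* Walk a list of hypothetical TLVs: 1-byte type, 1-byte length of
    * the value, then the value.  Returns 0 on success, -1 on a
    * malformed list (which a receiver would drop). */
   static int walk_tlvs(const uint8_t *buf, size_t len)
   {
       size_t off = 0;

       while (off + 2 <= len) {
           uint8_t type = buf[off];
           uint8_t vlen = buf[off + 1];

           if (off + 2 + vlen > len)
               return -1;                /* truncated TLV */

           switch (type) {
           case 0:                       /* e.g. padding: skip */
               break;
           default:
               /* Unknown type: ignore, or drop if the protocol marks
                * it mandatory (the section 15 discussion). */
               break;
           }
           off += 2u + vlen;             /* offset of the next TLV
                                            depends on this one */
       }
       return (off == len) ? 0 : -1;
   }

The last line of the loop is also where the hardware pain comes from: the offset of TLV n is a serial function of the lengths of TLVs 0..n-1, which is exactly the "fixed offset" property this section says parallel pipelines want.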
When using > encapsulation over UDP there are at least two checksums that may be > of interest: the encapsulated packet's transport checksum, and the > UDP checksum in the outer header. > > 18.1.2.1. Transmit checksum offload [...] > 18.1.3. Segmentation offload > > Segmentation offload refers to techniques that attempt to reduce CPU > utilization on hosts by having the transport layers of the stack > operate on large packets. In transmit segmentation offload, a > transport layer creates large packets greater than MTU size (Maximum > Transmission Unit). It is only at much lower point in the stack, or > possibly the NIC, that these large packets are broken up into MTU > sized packet for transmission on the wire. Similarly, in receive > segmentation offload, small packets are coalesced into large, greater > than MTU size packets at a point low in the stack receive path or > possibly in a device. The effect of segmentation offload is that the > number of packets that need to be processed in various layers of the > stack is reduced, and hence CPU utilization is reduced. > What is the recommendation for the protocol design? > 18.1.3.1. Transmit Segmentation Offload > > Transmit Segmentation Offload (TSO) is a NIC feature where a host > provides a large (larger than MTU size) TCP packet to the NIC, which > in turn splits the packet into separate segments and transmits each > one. This is useful to reduce CPU load on the host. > > The process of TSO can be generalized as: > o Split the TCP payload into segments which allow packets with size > less than or equal to MTU. > o For each created segment: > 1. Replicate the TCP header and all preceding headers of the > original packet. > 2. Set payload length fields in any headers to reflect the length > of the segment. > 3. Set TCP sequence number to correctly reflect the offset of the > TCP data in the stream. > 4. Recompute and set any checksums that either cover the payload > of the packet or cover header which was changed by setting a > payload length. > > Following this general process, TSO can be extended to support TCP > encapsulation UDP. For each segment the Ethernet, outer IP, UDP > header, encapsulation header, inner IP header if tunneling, and TCP > headers are replicated. Any packet length header fields need to be > set properly (including the length in the outer UDP header), and > checksums need to be set correctly (including the outer UDP checksum > if being used). > > To facilitate TSO with encapsulation it is recommended that optional > fields should not contain values that must be updated on a per > segment basis-- for example an encapsulation header should not > include checksums, lengths, or sequence numbers that refer to the > payload. If the encapsulation header does not contain such fields > then the TSO engine only needs to copy the bits in the encapsulation > header when creating each segment and does not need to parse the > encapsulation header. Thanks - that was crystal clear. > > 18.1.3.2. Large Receive Offload > > Large Receive Offload (LRO) is a NIC feature where packets of a TCP > connection are reassembled, or coalesced, in the NIC and delivered to > the host as one large packet. This feature can reduce CPU > utilization in the host. > > LRO requires significant protocol awareness to be implemented > correctly and is difficult to generalize. Packets in the same flow > need to be unambiguously identified. 
In the presence of tunnels or > network virtualization, this may require more than a five-tuple match > (for instance packets for flows in two different virtual networks may > have identical five-tuples). Additionally, a NIC needs to perform > validation over packets that are being coalesced, and needs to > fabricate a single meaningful header from all the coalesced packets. > > The conservative approach to supporting LRO for encapsulation would > be to assign packets to the same flow only if they have identical > five-tuple and were encapsulated the same way. That is the outer IP > addresses, the outer UDP ports, encapsulated protocol, encapsulation > headers, and inner five tuple are all identical. Another excellent section. > > 18.1.3.3. In summary: > > In summary, for NIC offload: > o The considerations for using full UDP checksums are different for > NIC offload than for implementations in forwarding devices like > routers and switches. > o Be judicious about encapsulations that change fields on a per- > packet basis, since such behavior might make it hard to use TSO. > >
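One last aside, connecting the section 14 summary ("include the IPv6 pseudo-header" in a header-only checksum) with the checksum offload discussion: a minimal sketch of what such a header-only checksum could look like, using the standard Internet ones-complement sum. The layout (a caller-supplied 40-byte pseudo-header plus the encapsulation header) is my own illustration, not anything from the draft.

   #include <stddef.h>
   #include <stdint.h>

   /* Standard ones-complement accumulation as used by IP/TCP/UDP. */
   static uint32_t csum_add(uint32_t sum, const uint8_t *p, size_t len)
   {
       while (len > 1) {
           sum += ((uint32_t)p[0] << 8) | p[1];
           p += 2;
           len -= 2;
       }
       if (len)
           sum += (uint32_t)p[0] << 8;    /* pad odd byte with zero */
       return sum;
   }

   static uint16_t csum_fold(uint32_t sum)
   {
       while (sum >> 16)
           sum = (sum & 0xFFFFu) + (sum >> 16);
       return (uint16_t)~sum;
   }

   /* Checksum over the IPv6 pseudo-header (source and destination
    * addresses, upper-layer length, zero, next header - 40 bytes) and
    * the encapsulation header only, not the payload. */
   static uint16_t encap_header_csum(const uint8_t pseudo_hdr[40],
                                     const uint8_t *encap_hdr,
                                     size_t encap_len)
   {
       uint32_t sum = csum_add(0, pseudo_hdr, 40);
       sum = csum_add(sum, encap_hdr, encap_len);
       return csum_fold(sum);
   }

Nothing deep there; it just makes the point that the cost is bounded by the encapsulation header size rather than the packet size, which is what makes it palatable for switch/router hardware per section 18.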