This document has been reviewed as part of the transport area review team's ongoing effort to review key IETF documents. These comments were written primarily for the transport area directors, but are copied to the document's authors and WG to allow them to address any issues raised and also to the IETF discussion list for information. When done at the time of IETF Last Call, the authors should consider this review as part of the last-call comments they receive. Please always CC tsv-art@ietf.org if you reply to or forward this review.

This document is a useful update to previous work, and it’s good to see this methodology being refreshed. Thanks to the authors and the WG for their work on this!

From a transport perspective, I’m concerned that some things are over-specified (details of TCP implementations) and others are underspecified (how throughput is measured, how loss and delay are tested). There are some nods towards non-TCP transports (UDP port configuration and HTTP/3 tests), but these are inconsistent. I’d like to see transports (TCP/UDP/QUIC/other) be treated more consistently throughout the document, particularly since non-TCP traffic will become increasingly relevant for the devices these tests are targeting.

Section 4.1 says: “Testbed configuration MUST ensure that any performance implications that are discovered during the benchmark testing aren't due to the inherent physical network limitations such as the number of physical links and forwarding performance capabilities (throughput and latency) of the network devices in the testbed.” This seems like a hard thing to make normative. It’s not about interoperability, but rather about the necessary conditions for getting useful measurements. I’d suggest a non-normative phrasing that explains the consequence of not checking this.
Overall, I suggest reviewing your use of normative language, and preferring non-normative language when you are just describing facts rather than necessary actions on the behalf of an implementation or test setup.

Also in section 4.1: can (DUT/SUT) be defined in the earlier sentence, when the terms are first introduced? It would also be good to explain/define “fail-open” behavior. I assume I know what you mean, but there can be various types of failing open.

In the criteria to match in Figure 3, the Transport layer discusses filtering on TCP/UDP ports, and the IP layer discusses filtering on addresses. There is no mention, however, of how other IP transport protocols (such as SCTP, which is neither TCP nor UDP) would be filtered, or of how a security device with separate rules for different transport protocols running over UDP (like QUIC and SCTP-over-UDP) could treat them. This may not be standard practice, but it may be interesting and useful to point to.

The client configuration section 4.3.1.1 details TCP stack configuration, but does not address other transports. Discussing QUIC seems like it will be relevant soon.

Overall, for this section, I am struck that there’s a lot of detail that seems over-specified, with lots of normative language. For example, the TCP connection MUST end with a three- or four-way handshake. What if there’s a RST? I don’t understand what we’re requiring of these TCP implementations apart from being a functional and compliant TCP implementation. How much of this is actually required? Also, RFC 793 is not provided as a link, but is just a text reference.

In Section 4.3.1.3, there are problems with the references to QUIC. QUIC is not an acronym for “Quick UDP Internet Connections”; that expansion should be removed. Also, the text contrasts HTTP/2 and QUIC. If you are comparing to HTTP/3, please reference the draft (and soon-to-be RFC) for HTTP/3, not QUIC. It also doesn’t make sense to say “if you are testing HTTP/3, you MUST use QUIC”.
HTTP/3 is definitionally HTTP over QUIC, so this normative requirement is unnecessary, just as it would be unnecessary to say that HTTP/2 MUST run over TCP. The client section mentions HTTP/3, but HTTP/3 is not mentioned for the server (Section 4.3.2.3).

Should the tests include results when introducing synthetic loss or delay to better emulate realistic network conditions? I understand many of the recommendations to remove uncontrolled delay and loss (in section 5), but various transport properties will change with loss and delay (and congestion, and buffering), which will influence performance of the DUT. Ideally, these DUTs would perform well in those conditions as well.

Section 6.3 is again quite TCP-centric, and should be expanded for other transports.

For metrics like “HTTP throughput” (section 7.7), how is throughput being rigorously defined? Is this measuring TCP bytes, TLS stream bytes, or actual application payload bytes?
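To illustrate why the choice of byte count matters, here is a minimal sketch showing how the same transfer yields noticeably different “throughput” figures depending on which layer is counted. All sizes and overheads below are assumed example values for illustration, not measurements or requirements from the document.

```python
# Hypothetical example: one transfer, three different "throughput" numbers,
# depending on whether we count application payload bytes, TLS stream bytes,
# or bytes on the wire. The overhead figures are illustrative assumptions.

PAYLOAD_BYTES = 1_000_000   # HTTP body bytes delivered to the application
TLS_OVERHEAD = 0.02         # assumed ~2% TLS record/framing overhead
TCP_IP_HEADERS = 40         # IPv4 (20) + TCP (20) header bytes per segment
MSS = 1460                  # assumed maximum segment size
ELAPSED_S = 1.0             # elapsed transfer time in seconds

tls_bytes = PAYLOAD_BYTES * (1 + TLS_OVERHEAD)
segments = -(-int(tls_bytes) // MSS)              # ceiling division
wire_bytes = tls_bytes + segments * TCP_IP_HEADERS

def mbps(nbytes: float) -> float:
    """Convert a byte count over ELAPSED_S seconds to Mbit/s."""
    return nbytes * 8 / ELAPSED_S / 1e6

print(f"payload goodput: {mbps(PAYLOAD_BYTES):.2f} Mbit/s")
print(f"TLS stream rate: {mbps(tls_bytes):.2f} Mbit/s")
print(f"wire throughput: {mbps(wire_bytes):.2f} Mbit/s")
```

Even with these modest assumed overheads, the three numbers differ by several percent; with small objects or many short connections the gap widens, so the methodology should state explicitly which count it reports.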