[re-sent with attempts to preserve lists, indents, etc]

This document has been reviewed as part of the transport area review team's ongoing effort to review key IETF documents. These comments were written primarily for the transport area directors, but are copied to the document's authors and WG to allow them to address any issues raised and also to the IETF discussion list for information. When done at the time of IETF Last Call, the authors should consider this review as part of the last-call comments they receive. Please always CC tsv-art@ietf.org if you reply to or forward this review.

==============================================================================

This draft is a useful contribution. It was good to see that the ability to monitor delay percentiles had been included. This review was written against draft-10, but then checked against draft-11 via the diff.

I'm afraid I have 14 comments, some of which are fairly major (e.g. #3, #4 & #12). It should be borne in mind that this implies that everything else in the draft is just fine :) However, all reviews tend to look fairly negative, because they necessarily focus on points of concern and disagreement.

==1. Granularity of Units==

   Measurement interval ("measurement-interval"):  Specifies the
      performance measurement interval, in seconds.

Making the minimum granularity 1 s seems too coarse. 1 s is quite a common interval for certain metrics (e.g. utilization), so it seems wrong to also make it the minimum. However, I wouldn't fight hard if you disagree.

   typedef percentile {
     type decimal64 {
       fraction-digits 2;

Two decimal digits don't seem enough for percentiles. I suggest at least three, so that five-9s percentiles can be specified (for instance, at a packet rate of 100,000 pkt/s, a 99.999%-ile delay statistic would imply that only 1 pkt/s is above the percentile).

Is there a reason why the default percentiles are 10% and 90%? I think these defaults would be rarely used. If this is just an arbitrary choice, 1% and 99% might be better choices.
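To make both suggestions concrete, something like the following would allow five-9s percentiles and saner defaults. This is only a minimal sketch; the leaf names are my guesses at the draft-11 structure, not exact quotes:

   typedef percentile {
     type decimal64 {
       fraction-digits 3;
       range "0..100";
     }
     description
       "Percentage value; three fraction digits allow a
        five-9s percentile (99.999) to be expressed.";
   }

   leaf low-percentile {
     type percentile;
     default "1.000";
     description
       "Low percentile to report; 1% rather than 10%.";
   }
   leaf high-percentile {
     type percentile;
     default "99.000";
     description
       "High percentile to report; 99% rather than 90%.";
   }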
==2. Recursion==

I don't think this draft precludes a VPN over a VPN over a physical network (e.g. a CVLAN over an SVLAN). However, it doesn't mention the possibility either. An example with multiple layers of VPNs would be useful.

==3. Is the definition of TP adequate to determine different types of loss?==

I could not work out whether this YANG model would enable an operator to identify whether losses are due to:
 * tail loss (buffer full),
 * selective early discard (AQM),
 * or discards due to transmission errors.

I read RFC 8345, which was cited as the reference for the definitions of link and TP. However, the definition of a TP is 'a physical or logical port or, more generally, an interface', which is not specific enough to determine exactly where the TP of a physical or logical link is. Is a TP before or after the output buffer? Before or after the input buffer? Before or after packet error checking? Also, is the TP of a VPN before or after encap? Is it before or after decap?

Whether per-class-id PM statistics include a packet or not will depend heavily on the answers to these questions, because encap and decap can alter the class of a packet if the operator is using the pipe model, where the DSCP is not copied to and from the outer header on encap and decap [RFC2983].

==4. Per traffic descriptor PM statistics?==

"§4.4 Link & TP PM Augmentation" includes the following:

   | +--ro one-way-pm-statistics-per-class* [class-id]

Is there a reference for how class-id can be used? The description says:

   "The class-id is used to identify the class of service.  This
    identifier is internal to the administration."

but that seems unworkable. The administration of a VPN is often separate from the administration of the network it runs over, so how does one administration know what the other means by each class-id? The whole point of OAM standardization is to ensure that network administration doesn't need to be complemented by ad hoc manual techniques, which are prone to human error.

Ideally, rather than class-id being just an opaque string, it would be associated with a generic traffic descriptor (filter) that could also use wildcards, for example:

   dst_addr==unicast
   next_header!=udp   # in IPv6; equivalent to protocol ID in IPv4
   DSCP==0b000???
   ECN==0b?1

Is there anything like this in YANG already? If not, it surely needs to be created for this PM draft. I presume it would have to encompass Ethernet, MPLS, etc., not just IP.
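As a rough illustration of the sort of thing I mean (all names hypothetical, assuming the usual import of ietf-inet-types as 'inet'; a real design would also have to cover Ethernet and MPLS fields):

   grouping traffic-descriptor {
     description
       "Generic filter identifying the packets that a per-class
        statistics entry counts.  Absent leaves match anything.";
     leaf dst-addr-type {
       type enumeration {
         enum any       { description "Any destination."; }
         enum unicast   { description "Unicast destinations."; }
         enum multicast { description "Multicast destinations."; }
         enum anycast   { description "Anycast destinations."; }
       }
       default "any";
     }
     leaf next-header {
       type uint8;
       description
         "IPv6 next-header (equivalently, IPv4 protocol) to match.";
     }
     leaf next-header-negate {
       type boolean;
       default "false";
       description
         "If true, match packets NOT matching 'next-header',
          e.g. next_header!=udp.";
     }
     leaf dscp {
       type inet:dscp;
       description "DSCP bits to match.";
     }
     leaf dscp-mask {
       type uint8 {
         range "0..63";
       }
       description
         "Which bits of 'dscp' are significant, allowing
          wildcards such as 0b000???.";
     }
     leaf ecn {
       type uint8 {
         range "0..3";
       }
       description "IP-ECN field bits to match.";
     }
     leaf ecn-mask {
       type uint8 {
         range "0..3";
       }
       description
         "Which bits of 'ecn' are significant, allowing
          wildcards such as 0b?1.";
     }
   }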
==5. Why only unicast and non-unicast?==

   +--ro inbound-unicast?       yang:counter64
   +--ro inbound-non-unicast?   yang:counter64

It seems rather arbitrary to pick this particular traffic filter in the TP part of the YANG tree in Fig 7. First, it raises the question of why no distinction can be made between anycast and multicast. But, more generally, it raises the question of why more general traffic filters are not possible, as described in my previous point (which would allow unicast vs. non-unicast to be specified as just one of many possible filters).

==6. ECN marking==

An operator might be just as interested in ECN marking as in loss (Congestion Experienced in the IP-ECN field [RFC3168], or the equivalent in MPLS [RFC5129]). But it does not appear in the model. Also, an operator might be interested in loss statistics for ECN-capable packets separately from non-ECN-capable ones, for instance, to detect overload when loss of ECN-capable packets is introduced [RFC7567; §4.2.1]. Again, stats per generic traffic descriptor, as described above, would support this without needing a specific set of statistics.

==7. Overload events==

An operator is likely to want alarm notifications when loss or ECN marking exceeds a configurable threshold.
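As a straw man (all names hypothetical), a notification along the following lines, together with a corresponding configurable threshold leaf per TP or per class, would cover both loss and ECN marking:

   notification pm-threshold-exceeded {
     description
       "Generated when a monitored ratio exceeds its configured
        threshold over one measurement interval.";
     leaf metric {
       type enumeration {
         enum packet-loss { description "Lost packets."; }
         enum ecn-marked  { description "CE-marked packets."; }
       }
       mandatory true;
     }
     leaf measured-ratio {
       type decimal64 {
         fraction-digits 5;
       }
       units "percent";
       description "Ratio measured over the interval.";
     }
     leaf configured-threshold {
       type decimal64 {
         fraction-digits 5;
       }
       units "percent";
       description "The threshold that was exceeded.";
     }
   }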
==8. Utilization per class-id?==

The inbound-octets and outbound-octets counters in Fig 7 allow utilization to be determined for a TP. But why is there not (AFAICT) an equivalent ability to monitor the utilization of a link per class-id?

==9. Example test-case==

FWIW, while reviewing, I had in mind some performance monitoring requirements in one of my own recent drafts. I am not suggesting you cite them, but you might want to use them as a test case, to see whether this YANG model could define all the required statistics:
https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-aqm-dualq-coupled-25#section-2.5.2

==10. Fragmentation statistics?==

Given that the encap process for a VPN can lead to packet-too-big (PTB) errors or fragmentation, wouldn't it be useful to monitor a PTB error count and/or a count of fragments created during encap? Or perhaps this is more fault diagnostics than performance monitoring (nonetheless, it impacts performance).

==11. Extensibility, backward and forward compatibility==

There is no discussion of extensibility in the draft. What are the implications (if any) of adding more metrics or new topology features in future? Consider cases like:
 * old device and new management system;
 * new device and old management system;
 * some devices in a network support updated elements of the model and others don't.

This might just require citing an RFC that defines what is meant to happen when two different versions of a YANG model interact, e.g. default ignore?
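I believe the relevant rules are the module update rules in §11 of RFC 7950: a new revision may add optional data nodes, which an old client simply never requests and an old server simply does not implement. A sketch of what a backward-compatible revision might look like (hypothetical names and revision date, assuming the usual import of ietf-yang-types as 'yang'):

   revision 2099-01-01 {
     description
       "Added ECN-marking counters.  The new leaf is optional
        state data, so servers implementing an older revision
        remain valid, and clients must treat its absence as
        'not supported'.";
   }

   leaf inbound-ecn-marked-packets {
     type yang:counter64;
     config false;
     description
       "Inbound packets received with the ECN field set to CE.
        Absent if the device cannot count them.";
   }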
==12. Security Considerations specific to this YANG model==

§6 gives useful generic security advice for the protocols used to access the content of this model (NETCONF, RESTCONF). And it states the (fairly obvious) implications of not protecting the main subsets of content with write or read access controls. However, it doesn't state how protecting each main subset impacts access to the other subsets. For instance:
 * Is read access to VPN PM statistics possible without read access to VPN topology?
 * Do some write operations on VPN topology require read access to the underlying network topology?
 * Is read access to VPN topology possible without read access to the underlying network topology?
 * Is read access to underlying network PM statistics advisable in order to diagnose VPN performance issues?
 * etc.

On this last point, it is common for an overlay to conceal occasional errors in lower layers (e.g. HARQ conceals lower-layer losses from higher layers; link protection conceals lower-layer link failures from higher layers). However, once lower-layer errors exceed a certain point, it becomes impossible to conceal them, resulting in potentially catastrophic failure. If a higher layer is not monitoring underlying errors that are being concealed, it will not know to initiate appropriate remedial actions (which might include dual-homing, so that it can switch to an alternative network operator). I learned insights like this by reading the outputs of the Resilinets project [ https://resilinets.org/ ], which is well worth a look if you haven't already.

==13. Normative references==

There seem to be an excessively large number of normative references. I suggest they are checked. I didn't do this systematically myself, but I happened to notice that the following seem to be cited informatively, not normatively:

[RFC9182] [ITU-T-Y-1731]
These are cited as example statistics that _can_ be collected, which seems informative.

[RFC6241] [RFC8040]
NETCONF and RESTCONF are given as example ways of accessing this YANG model, which seems informative - if another protocol were used, it would not alter this spec.

[RFC6242] [RFC8446]
Although SSH is mandatory to implement if NETCONF is used, and TLS is mandatory to implement if RESTCONF is used, neither NETCONF nor RESTCONF is mandatory to implement /for this YANG model/.

etc.

==14. Nits==

§1. Intro

   The module can be used to monitor and manage network performance on
   the topology level or the service topology between VPN sites, in
   particular.

Delete 'in particular' (I don't know what purpose it's serving here)?

§3. Model Usage

   Before using the model, the controller needs to establish complete
   topology visibility of the network and VPN.

Requiring /complete/ visibility seems brittle and unnecessary. E.g. if there's a partial failure of the management plane, there's no reason to stop monitoring the performance of the parts that are still visible.

s/Protocol(STAMP)/Protocol (STAMP)/

   The performance monitoring information reflecting the quality of the
   network or VPN service (e.g., end-to-end network performance data
   between source node and destination node in the network or between
   VPN sites)

Pls make it clear that 'end-to-end' is being used in the Bell-head sense (not the Net-head sense, as in 'the end-to-end principle' or 'end-to-end protocol', which means 'application end-point to application end-point').

   The measurement and report intervals that are associated with these
   performance data usually depend on the configuration of the specific
   measurement method or collection method or various combinations.
   This document defines a network-wide measurement interval to align
   measurement requirements for networks or VPN services.

The second sentence makes it sound like there has to be one interval network-wide. This seems inconsistent with the first sentence, which says different intervals are required depending on the specifics of each metric.

§3.2 Collecting Data On Demand

s/Data On-demand/Data On Demand/
There is no hyphen in 'Data On Demand', unlike 'on-demand data', where it is a compound adjective. [ https://www.grammarbook.com/punctuation/hyphens.asp ]

   To obtain a snapshot of a large amount of performance data...  For
   example, a specified "link-id" of a VPN can be used as a filter in a
   RESTCONF GET request to retrieve per-link VPN PM data.

Surely one link isn't an example of a _large_ amount of data.

Figure 3.
The mappings using strings of ':'s in the ASCII art didn't work for me. I genuinely thought that VN1, VN3, N1 and N2 were all intended to be mapped to each other in some weird way. Only when I read the text did I work out that these are two mappings that cross without intersecting. A gap on either side of the intersection for one of the lines might work better.

§4.1 Layering Relationship

   Apart from the association between the VPN topology and the underlay
   topology, VPN Network PM YANG module can also provide the
   performance status of the underlay network and VPN services.  For
   example, the module can provide...

Performance status is the primary purpose of the PM module, so it's odd to describe it as if it's incidental to the topology associations ('can also provide'). You might mean that the performance statements are all optional, so theoretically none needs to exist. If that's what you were trying to say, it needs saying explicitly. The second 'can provide' is fine, because it's an example.

§4.3 Node Level PM Augmentation

   For VPN service performance monitoring, the module defines one
   attribute: "role":

The role attribute is solely about topology, so it seems wrong to say it's for performance monitoring.

§4.4 Link & TP Level PM Augmentation

   ... Lists p
   erformance measurement statistics...

   ...Indicat
   es the abstract link...

(Inappropriate line breaks)

§6 Security Considerations

s/It is thus important to control read access/It thus might be important to control read access/
(Rationale: for consistency with the 'may be' in the previous sentence.)

Perhaps the point should also be made that there is a trade-off between confidentiality and monitoring performance properly (see the earlier point about catastrophic failures due to higher layers concealing lower-layer performance problems).

Each time the word 'access' is used, pls specify which type. For example:
 * In the first three bullets, s/Unauthorized access/Unauthorized write access/
 * In the last three bullets, s/Unauthorized access/Unauthorized read access/

Also, it would have been preferable to write repetitive bullets like these as a table that covers all the cases. Otherwise, all the repetition just makes it hard to identify the differences between the bullets. E.g.:

   Unauthorized access to the following subtrees could have the
   following impacts:

   +--------+----------------------+------------------+
   | Access | Node                 | Potential impact |
   +--------+----------------------+------------------+
   | /nw:networks/nw:network/nw:network-types          |
   | write  | service type         | render invalid   |
   | write  | VPN identifier       | render invalid   |
   | write  | VPN service topology | render invalid   |
   | write  | any of the above     | disable VPN PM   |
   +--------+----------------------+------------------+
   | /nw:networks/nw:network/nw:node                   |
   | etc....