I am by no means an expert in this area, so please take this into account while reading my comments.

Content comments:

* The document assumes "human" communication, i.e., text originates at human speed and politeness is used to resolve concurrency conflicts. This seems to be a fair assumption for the considered use cases, but what happens if it is not met? Can systems or RTP mixers detect and handle such situations gracefully, or is the idea that any resulting "jerkiness" must be accepted if senders misbehave?

* The solution does not provide end-to-end security, since the mixer must be trusted with access to the texts in order to do the mixing. This is mentioned in the security considerations and in section 2, where alternatives are considered. The reason for not selecting a solution providing end-to-end security is given in section 1.2. Is there work planned to address this issue, i.e., to complement this solution with one providing end-to-end security?

* Perhaps the statement in section 4.2.6 that the mixing method for multi-party unaware endpoints is not RECOMMENDED should be repeated in the security considerations? It seems there are serious limitations, in particular in how the presentation is created, which can make it impossible to detect masquerade attacks. Yes, masquerading is mentioned, but from an outside security point of view it feels like there was a strong security solution that was discarded due to lack of implementation support, there is a somewhat OK solution (but one not able to provide end-to-end security), and there is a pretty ugly solution to accommodate endpoints with no support for the other solution. If this is a fair summary, perhaps explaining this clearly in the security considerations would be a good thing.

* I am confused about Figures 5 and 6, since the mixed identities of the sources are shown once in square brackets and once in parentheses.
Are labels like [Alice] or [Bob] not inserted by the mixer? If so, why would the format on the endpoint be different? Is the idea that endpoints try to parse the mixed text in order to render it differently? Or was the idea to show that different mixers can use different styles to generate labels, i.e., I should not really compare Figures 5 and 6?

Editorial comments:

* I suggest citing [T140] when you first refer to it in the Introduction:

OLD
A requirement related to multi-party sessions from the presentation level standard T.140 for real-time text is: "The display of text from

NEW
A requirement related to multi-party sessions from the presentation level standard T.140 [T140] for real-time text is: "The display of text from

* as defined -> are defined, and missing full stop

OLD
The terms SDES, CNAME, NAME, SSRC, CSRC, CSRC list, CC, RTCP, RTP-mixer, RTP-translator as defined in [RFC3550]

NEW
The terms SDES, CNAME, NAME, SSRC, CSRC, CSRC list, CC, RTCP, RTP-mixer, RTP-translator are defined in [RFC3550].

* Add reference(s) to WebRTC in the terminology section?