Alert Management Working Group

Chairperson: Louis Steinberg/IBM

CURRENT MEETING REPORT

Reported by Lee Oattes

AGENDA

   o Introduction
   o Discussion of draft flow control document
   o Preliminary discussion of alert-generation document
     (note: this was shelved due to a lack of time)

ATTENDEES

    1. Bierbaum, Neal/vitam6!bierbaum@vitam6
    2. Carter, Glen/gcarter@ddn1.dca.mil
    3. Cohn, George/geo@ub.com
    4. Cook, John/cook@chipcom.com
    5. Denny, Barbara/denny@sri.com
    6. Easterday, Tom/tom@nisca.ircc.ohio-state.edu
    7. Edwards, David/dle@cisco.com
    8. Fedor, Mark/fedor@nisc.nyser.net
    9. Hunter, Steven/hunter@ccc.mfecc.llnl.gov
   10. Kincl, Norman/kincl@iag.hp.com
   11. Malkin, Gary/gmalkin@proteon.com
   12. Oattes, Lee/oattes@utcs.utoronto.ca
   13. Paw, Edison/esp@esd.ecom.com
   14. Replogle, Joel/replogle@ncsa.uiuc.edu
   15. Salo, Tim/tjs@msc.umn.edu
   16. Sheridan, Jim/jsherida@ibm.com
   17. Taft, Vladimir/vtaft@hpinddf.hp.com
   18. Waldbusser, Steve/sw0l@andrew.cmu.edu
   19. Wintringham, Dan/danw@osc.edu
   20. Steinberg, Louis/louiss@ibm.com

MINUTES

1. The meeting of the Alert Management Working Group began with an
   introduction from the Chairperson (Lou Steinberg).

2. A discussion of several independent implementations of feedback/pin
   and polled, logged alerts led to an agreement to adopt these
   mechanisms in some form.

3. The following questions were answered by discussion and consensus:

   (a) Can we have a read-only alerts_enabled MIB object, by limiting
       the transmission rate of alerts (no shutoff) and not using
       feedback?

       No.  We need a total shutoff mechanism in case a number of
       alert generators are "screaming" all at once.  The total
       traffic might be too much for the manager, and this "stable"
       situation cannot improve (while a disabling mechanism would
       tend to be self-correcting).  Total shutoff implies the use of
       a resettable, read-write MIB object.
       An automated, timer-based reset mechanism was discussed, but it
       was felt that such a system might tend to synchronize the
       resets of multiple generators and could still lead to an
       over-reporting condition.

   (b) Might an automated reset of alerts_enabled from the manager
       station create a "blast-off-blast-off..." alert traffic
       pattern?

       Yes, but such a manager would still tend to get only as much
       traffic as he could handle.  A re-enable would only be sent
       when the manager isn't swamped (i.e., is capable of sending
       one).  A manager experiencing such a traffic pattern should
       readjust his window prior to setting alerts_enabled TRUE.

   (c) When pin disables alerts due to the generation of many similar
       alerts (e.g., link flapping), might we also lose an unrelated
       alert from the same system prior to resetting alerts_enabled?

       Yes, but the rate-limiting (as opposed to shutoff) technique
       has the same problem; the probability of sending a single,
       specific alert is much lower than the probability of sending
       any one of many identical alerts.  This problem is minimized by
       using polled, logged alerts along with feedback/pin (alerts
       could still be lost if the log is overwritten).

   (d) Should we allow the implementation to decide if alerts are
       totally disabled or limited to a maximum rate?

       No.  Implementations should be consistent, since this affects
       the way we manage our alert generators.

   (e) Can the alert log in polled, logged alerts be overfilled?

       Yes, but the standard suggests that a manager should attempt to
       keep the log empty by removing known alerts.  If an individual
       implementation has no mechanism for removing old alerts (no
       set), then the log must wrap when full and the manager might
       lose alerts.

   (f) If using the SNMP get-next, do we want the oldest logged
       element first, or the newest first?

       Clearly the manager wants the oldest first if a full log will
       wrap...this gives him the best chance to see the oldest alert
       (in a full log) before losing it.  No real consensus here.
       It seems as though this should be implementation specific,
       since it only applies to SNMP, and since the log, actually
       being a table, makes this a question of "are new table entries
       added at the table top or bottom?".

   (g) Can we shrink the log size by stripping out only the
       "important" information from each alert?

       We can, but this is something we decided we shouldn't do.  It
       requires a different parser at the manager (it can't be run
       through the alert parser), and we did not know how to decide
       what information might be needed (it varied with the protocol
       and alert type).

   (h) How about only logging alerts, and sending an "alert logged"
       alert for each new log entry?  The manager gets the
       asynchronous "alert logged" notice and reads the alert log to
       determine what happened.

       While this is an interesting concept, it was felt that it might
       tend to aggravate some of the other logging problems (e.g., if
       the log is filled and not over-writing, the only chance of
       getting the alert information is from the asynchronous
       alert...this removes the asynchronous alert information and
       replaces it with "see the log" information).

   (i) A discussion of the CPU cycles and memory needed for keeping a
       log followed.  Since the log size might be settable (to 0), it
       was felt that systems could allow managers to disable logging.
       It was also felt that the performance and memory hits were not
       large, but numbers to confirm this were not available.

4. The following were decided by vote:

   (a) Feedback/Pin

       MIB objects:

           alerts_enabled    read/write   mandatory
           window (time)     read/write   optional
           max_alerts        read/write   optional

       Do not include alert counters as MIB objects for this document.
       Individual implementors will decide if they need total dropped
       and/or sent counts, but not everybody likes the idea of adding
       more counters as (even optional) MIB objects.

       Do not optionally allow a reduced-rate mode on the
       over-reporting condition...require total shutoff of
       asynchronous alerts, for the reasons given in the earlier
       discussion.
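The feedback/pin behavior described above can be sketched roughly as
follows.  This is an illustrative model only, not an implementation
discussed at the meeting; the class name AlertPin and the method
try_send are hypothetical, while alerts_enabled, window, and
max_alerts mirror the MIB objects listed in the vote.

```python
# Hypothetical sketch of feedback/pin: count alerts in a sliding time
# window and, when the count exceeds max_alerts, totally disable
# asynchronous alerts (no reduced-rate mode) until the manager sets
# alerts_enabled back to TRUE.
import time
from collections import deque

class AlertPin:
    def __init__(self, window=60.0, max_alerts=10):
        self.alerts_enabled = True    # read/write MIB object (mandatory)
        self.window = window          # read/write MIB object (optional), seconds
        self.max_alerts = max_alerts  # read/write MIB object (optional)
        self._times = deque()         # send times inside the current window

    def try_send(self, alert, now=None):
        """Return True if the alert may be sent, False if pinned shut."""
        if not self.alerts_enabled:
            return False
        now = time.monotonic() if now is None else now
        self._times.append(now)
        # Drop send times that have aged out of the window.
        while self._times and self._times[0] <= now - self.window:
            self._times.popleft()
        if len(self._times) > self.max_alerts:
            # Over-reporting condition: total shutoff.  Only a manager
            # writing alerts_enabled = True re-enables sending.
            self.alerts_enabled = False
            return False
        return True
```

A manager that re-enables a pinned generator without widening the
window may see the "blast-off-blast-off..." pattern noted in item
3(b), which is why the minutes suggest readjusting the window first.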
   (b) Polled, Logged Alerts

       Remove the time field from the table, as most alerts are time
       stamped and the information in an alert should be defined by
       the protocol...not us.
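The polled, logged alert mechanism discussed in items 3(e)-(f) might
look like the following sketch: a fixed-size log that wraps
(overwrites its oldest entry) when full, answers SNMP-style get-next
queries oldest-entry-first, and lets a manager remove known entries to
keep the log near-empty.  The class AlertLog and its method names are
hypothetical illustrations, not anything specified by the group.

```python
# Hypothetical sketch of a polled, logged alert table.  Entries carry
# no separate time field, per vote 4(b): the alert itself is assumed
# to be time stamped by its own protocol.
from collections import OrderedDict

class AlertLog:
    def __init__(self, size=8):
        self.size = size           # settable log size; 0 disables logging
        self._log = OrderedDict()  # index -> alert, oldest entry first
        self._next = 1

    def append(self, alert):
        if self.size == 0:
            return
        if len(self._log) >= self.size:
            self._log.popitem(last=False)  # full: wrap, dropping the oldest
        self._log[self._next] = alert
        self._next += 1

    def get_next(self, index=0):
        """SNMP-style get-next: first entry after `index`, oldest first."""
        for i, alert in self._log.items():
            if i > index:
                return i, alert
        return None

    def remove(self, index):
        """Manager deletes a known alert to keep the log empty."""
        self._log.pop(index, None)
```

Returning the oldest entry first reflects the leaning in item 3(f):
when a full log wraps, that ordering gives the manager the best chance
of reading an old entry before it is overwritten.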