Incidents are events outside normal operations that disrupt operational processes. An incident can be a relatively minor event, such as running out of disk space on a desktop machine, or a major disruption, such as a breach of database security and the loss of private and confidential customer information. Incident management is a set of policies and processes for responding to incidents, the goals of which are to:
- Restore normal operations as quickly as possible
- Track information about incidents for further analysis
- Support problem management by analyzing patterns of incidents
Incident management begins with defining what constitutes an incident, categorizing those incidents, and measuring their occurrences.
Characteristics of Incidents
Something as generalized as "any event outside of normal operations" covers quite a large space of possible events. By focusing on just those that are so disruptive that they cause a call to the Help desk or other IT support services, you can limit the discussion to a manageable domain.
Within this domain of incidents, you can categorize incidents by several characteristics:
- Cause of problem
- Asset or assets causing the incident
- Role of personnel experiencing disruption
- Resolution method
The cause of a problem covers a wide range of topics and is examined in detail later in this section.
Incidents should be categorized by severity; at the very least a three-point scale of minor, moderately severe, and severe should be used. For each level of severity, IT organizations should define acceptable resolution times, escalation procedures, and reporting procedures. For example, minor incidents, such as password resets, should not consume too much time or resources from the Help desk. A security breach, however, should immediately escalate, trigger reporting to management and executives, and require rapid resolution.
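The severity levels and their associated policies can be written down as a small lookup table. The following Python sketch is one possible representation. The class names, resolution targets, and notification roles are all hypothetical, and each IT organization would set its own.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    """The minimum three-point scale described in the text."""
    MINOR = 1
    MODERATE = 2
    SEVERE = 3


@dataclass(frozen=True)
class SeverityPolicy:
    target_resolution: timedelta  # acceptable time to resolve
    escalate_immediately: bool    # bypass the normal Help desk queue
    notify: tuple                 # roles that receive a report


# Hypothetical values for illustration only.
POLICIES = {
    Severity.MINOR: SeverityPolicy(timedelta(hours=8), False, ()),
    Severity.MODERATE: SeverityPolicy(timedelta(hours=4), False, ("IT manager",)),
    Severity.SEVERE: SeverityPolicy(
        timedelta(hours=1), True, ("IT manager", "executives")
    ),
}


def policy_for(severity: Severity) -> SeverityPolicy:
    """Look up the resolution, escalation, and reporting rules for a level."""
    return POLICIES[severity]
```

With a table like this, a password reset classified as Severity.MINOR gets a relaxed target and no reporting, while a security breach classified as Severity.SEVERE escalates at once and is reported to management and executives.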
The asset or assets causing an incident are important dimensions for tracking incident trends. If a particular version of a desktop application is causing an inordinate number of support calls, IT managers should be able to detect this during problem management procedures. (There is more information about problem management later in this chapter.)
Just as assets involved in incidents should be tracked, so should the users encountering the disruptions. If many people from a single department are generating a large number of Help desk calls, there might be a problem with training or with an application specific to that department.
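Tracking incidents by asset and by department amounts to counting incident records along each dimension. The sketch below assumes a simple list of (asset, department) pairs. The sample records and the helper name top_sources are hypothetical.

```python
from collections import Counter

# Hypothetical incident log: one (asset, department) pair per Help desk call.
incidents = [
    ("WordProcessor 2.1", "Accounting"),
    ("WordProcessor 2.1", "Sales"),
    ("WordProcessor 2.1", "Accounting"),
    ("Laptop NIC", "Accounting"),
    ("WordProcessor 2.0", "Sales"),
]


def top_sources(records, index, n=1):
    """Return the n most frequent values at the given position in each record."""
    return Counter(record[index] for record in records).most_common(n)


by_asset = top_sources(incidents, 0)       # which asset generates the most calls
by_department = top_sources(incidents, 1)  # which department generates the most calls
```

In this sample, one application version and one department stand out, which is the kind of signal that problem management would then investigate.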
The method for resolving an incident should also be tracked. This data can help determine guidelines for selecting the appropriate response to an incident. For example, suppose data about resolution methods reveals that most OS problems that require more than 2 hours to solve eventually require reinstallation. Given that finding, the support desk could institute a policy that OS errors that cannot be resolved within 2 hours will be addressed by formatting the OS drive and restoring it from an image backup. Together, these characteristics are especially useful for measuring incident rates and analyzing trends.
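The 2-hour policy in this example is a simple decision rule, and it can be sketched in a few lines of Python. The function name, the category label "os", and the returned action strings are assumptions made for illustration.

```python
from datetime import timedelta

# Threshold taken from the example policy in the text.
REIMAGE_THRESHOLD = timedelta(hours=2)


def next_action(category: str, time_spent: timedelta) -> str:
    """Suggest the next step for an unresolved incident.

    OS incidents that have reached the threshold are reimaged;
    everything else continues through normal troubleshooting.
    """
    if category == "os" and time_spent >= REIMAGE_THRESHOLD:
        return "reimage from backup"
    return "continue troubleshooting"
```

Encoding the rule this way keeps the threshold in one place, so it can be adjusted if later resolution data suggests a different cutoff.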
Defining the cause of a problem can be more difficult than it seems at first because there are sometimes multiple pre-conditions that must be in place for an incident to occur. Consider a few examples. Password resets are one of the most common incidents reported to Help desks. The causes of this type of incident include users allowing passwords to expire and forgetting passwords -- especially when users are expected to remember passwords to multiple systems while not reusing passwords. All of these causes can factor in a single password reset incident.
In another example, an employee is saving a document to a network drive when the save operation fails. An error message is displayed stating the network drive cannot be found. Because the employee had been saving the document regularly, something must have occurred since the last save operation. After the user has contacted the service desk, the service desk technician tests several possible causes and determines that the problem is a failed network interface device. In this case, determining the exact cause of the failure is not relevant unless the problem occurs repeatedly; hardware has well-known mean times between failures (MTBFs) and further root cause analysis is not likely to help reduce these types of incidents.
The final example is more complex. A security breach results in a large number of customer account and credit card numbers being exposed to attackers. The causes could include:
- Improperly configured firewalls that allow traffic on a port that should have been closed
- An unpatched database listener (a program that accepts requests to connect to the database) that is vulnerable to known attacks
- Access controls within the database that do not adequately limit read access to sensitive data
- Vulnerability in a database management system that allows for escalation of privileges
- Lax OS privileges that allow execute privileges on database administration tools
- Poorly designed applications that use overprivileged database accounts
A database breach is a case in which a series of vulnerabilities must be in place for a successful attack to occur. Had one of the vulnerabilities been compensated for with adequate countermeasures, the attack would not have occurred as it did. For example, had the access controls on database tables and views been sufficiently restrictive, the attacker could not query the sensitive data even though he or she had made it through network, OS, and database authentication security measures.
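The reasoning above, that the attack succeeds only when every layer of defense fails, can be expressed as a small check. This is a deliberately simplified sketch: the layer names are hypothetical labels loosely based on the causes listed earlier, and real attacks do not reduce to a set of independent yes/no controls.

```python
def breach_succeeds(controls_held: dict) -> bool:
    """Return True only when no layer stopped the attack.

    Each value is True if that control held and False if it was
    missing or misconfigured.
    """
    return not any(controls_held.values())


# Hypothetical scenario in which every layer has a weakness.
all_layers_failed = {
    "firewall": False,
    "listener_patched": False,
    "table_access_controls": False,
}

# Same scenario, but the table and view access controls are restrictive.
one_layer_held = dict(all_layers_failed, table_access_controls=True)
```

Here breach_succeeds(all_layers_failed) is True, while breach_succeeds(one_layer_held) is False, mirroring the point that a single adequate countermeasure would have stopped the attack as it occurred.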
The general categories of incident causes that cut across these examples include:
- Improper documentation
- Insufficient user training
- Configuration errors
- Previously unknown bugs
- Known but unpatched vulnerabilities
- Unexpected changes in operating loads
- User error
Determining the cause of incidents is essential to understanding both how to resolve the problems and how many resources to commit to reducing the likelihood of those problems in the future.
Of all the topics in service support, the most time could be spent on resolving incidents; in fact, it could be the topic of a very long book. The problem with resolving incidents is that there are so many types and each can require a customized response. In some ways, resolving incidents is like cooking -- there is a different recipe for every dish, and there is a different response to every incident. At the same time, general principles can be found that apply to a broad range of challenges, whether culinary or technical. The general principles for resolving incidents include:
- The time, effort, and resources committed to incident resolution must be commensurate with the impact of the incident.
- Responses should be formalized with well-defined procedures that are more frameworks than strict, precise sets of steps. Formulating precise, step-by-step procedures for every type of incident would be too time consuming to be practical.
- All incidents and their responses should be documented. In some cases, this can be as trivial as incrementing a count of simple incidents, such as password resets, or as complex as a detailed report describing a security breach.
- As with other service support operations, coordinate incident resolution information with other asset information.
Consider examples from the extremes of resolving incidents. Password resets are one of the simplest types of incidents to resolve. Many organizations now use self-service methods to address them. One could attempt to drive down the number of password resets, but after a certain point, the economics do not justify the effort to do so because the marginal cost of resetting a password with a self-service system is small. As the next section on trend analysis will show, password vulnerabilities could become a factor in broader security management issues in which the costs of poor password management grow much higher.
Security incidents are some of the most costly. According to the FBI/Computer Security Institute (CSI) Computer Crime and Security Survey, 639 respondents reported a total loss of almost $43 million due to virus attacks and more than $31 million due to unauthorized access. Individual incidents can be extremely costly. For example, 40 million credit card accounts were compromised at CardSystems Solutions, a credit card processor, causing it to lose major credit card customers.
The above tip is excerpted from Chapter 5, "Implementing System Management Services, Part 1: Deploying Service Support" of The Definitive Guide to Service-Oriented Systems Management by Dan Sullivan. Get a copy of this ebook at Realtime Publishers.
About the author: Dan Sullivan is Chief Technology Officer of Redmont Corporation. Dan's 17 years of IT experience include engagements in enterprise content management, data warehousing, database design, natural language processing and artificial intelligence. Dan has developed significant expertise in all phases of the system development lifecycle and in a broad range of industries, including financial services, manufacturing, government, retail, gas and oil production, power generation, and education. In addition to authoring various books, articles and columns, Dan is the leader of The Realtime Messaging and Web Security Community where he posts to his Messaging and Web Security weblog and produces his expert podcast.
This was first published in February 2007