Part 3 – Incident Handling

Introduction:

As discussed in Part 1 – Incident Detection and Part 2 – Incident Classification, identifying an incident and accurately classifying it by category and severity are the first steps in an Incident Response process. Next comes the core of the process: Incident Handling. Readers may have known or used Incident Handling processes and procedures for a long time, but if we were to compare them side by side, they would all be similar in purpose but different in execution. Consider Incident Handling to be like an organization’s signature – “Unique and cannot be replicated easily”.

With this post, we are trying to provide our “unique signature” regarding Incident handling.

Pre-requisites:

Before actually getting the work started, it is important to define the foundational blocks. Without these pre-requisites, a structured Incident response will be difficult. These pre-requisites are:

  1. Responder Groups – People who “do the analysis and investigation” on the ground are called responder groups. They need to be defined as part of the CSIRT governance function. In smaller organizations, this can be one or two people. They are typically analysts, reverse engineers, forensic experts etc. They are the first line of defence.
  2. Resolver Groups – People who “do the remediation” are the resolver groups. Typically, this group gets into action mostly post-incident. However, these groups also assist during the incident investigation from an infrastructure angle. They typically comprise Network teams, Server teams, Application teams etc.
  3. Management Groups – People who are the “top brass in the organization” are the management groups. These groups are very important and need to be activated if the incident impact is going to be enterprise-wide. They typically comprise the ISM (info-sec manager), CISO, CIO, CTO etc.
  4. External Communication Groups – People outside of the IT department like Legal, HR, Crisis team, regulators etc. are called external communication groups. These groups take care of interacting with external agencies like law enforcement, media, shareholders etc.
  5. Communication Protocols – Defining “how” to communicate among the various groups is very important because this can’t be established during a live cyber incident. Some of the protocols can be email, phone call, template forms, ticketing systems, encrypted communication lines etc.
  6. Service Level Agreements (if any) – As organizations mature in operating a CSIRT, they want to track the performance of their people and processes. This can be done using SLA metrics by defining the “time to respond” and the “time to resolve” or, in ITIL terms, “Response SLA” and “Resolution SLA”.
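
To make the two SLA metrics concrete, here is a minimal sketch of how “time to respond” and “time to resolve” could be measured for one incident. The severity names, the target durations and the `IncidentTimes` fields are assumptions chosen for illustration; real targets come from your own SLA definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical SLA targets per severity; the values are illustrative only.
SLA_TARGETS = {
    "high": {"response": timedelta(minutes=15), "resolution": timedelta(hours=4)},
    "medium": {"response": timedelta(hours=1), "resolution": timedelta(hours=24)},
    "low": {"response": timedelta(hours=4), "resolution": timedelta(hours=72)},
}


@dataclass
class IncidentTimes:
    severity: str
    opened: datetime          # when the incident was raised
    first_response: datetime  # when a responder picked it up
    resolved: datetime        # when the incident was closed


def sla_report(inc):
    """Compute both SLA metrics and compare them with the targets."""
    targets = SLA_TARGETS[inc.severity]
    time_to_respond = inc.first_response - inc.opened
    time_to_resolve = inc.resolved - inc.opened
    return {
        "time_to_respond": time_to_respond,
        "time_to_resolve": time_to_resolve,
        "response_sla_met": time_to_respond <= targets["response"],
        "resolution_sla_met": time_to_resolve <= targets["resolution"],
    }


example = IncidentTimes(
    severity="high",
    opened=datetime(2024, 1, 10, 9, 0),
    first_response=datetime(2024, 1, 10, 9, 10),
    resolved=datetime(2024, 1, 10, 14, 30),
)
report = sla_report(example)
```

In this made-up example the response SLA is met (10 minutes against a 15 minute target) while the resolution SLA is missed (5.5 hours against 4 hours).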

Incident Analysis:

Every qualified and classified incident (Part 1 and Part 2 of the CSIRT function) has to be analysed on its merits. While the skeleton of the analysis is the same, the content and the context differ from organization to organization. In general, every analysis starts with the following two questions:

  1. What do we know? – The answer to this question typically lies in gathering the details regarding the incident. The details can be as follows:
    • Victim user/machine details like user name, machine name, IP address etc.
    • Logs that triggered the incident. The logs are typically from the SIEM or the point products themselves.
    • Attacker information from the logs like attacker IP address, domain etc.
    • Attack pattern, if it is a signature alert from IDS/IPS/WAF etc.

  2. What do we not know? – This is everything else about the incident that we are yet to investigate or determine. This is the perfect jump-off point for the investigation. Some of the most common items in this list are as follows:
    • Forensic analysis of the machine, like disk analysis and memory analysis.
    • Attack vector synthesis
    • Static and dynamic analysis of malcode, reversed binaries etc.
    • Impact and spread of the attack in terms of data stolen, machines compromised, monetary impact etc.

Once the “known and the unknown” are identified, listing the course of action becomes easy. This primarily assists in a timely and coordinated response. In this post, we will not be discussing the individual tools used in analysis; instead, we will be talking about the overall process involved.
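
The known/unknown split can be kept as a simple worksheet per incident. The sketch below is one possible shape, not a prescribed format: the class name, the field names and the sample values (including the documentation-range IP addresses) are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisWorksheet:
    """Tracks what is known and unknown about one incident (illustrative)."""
    incident_id: str
    known: dict = field(default_factory=dict)    # fact name -> value
    unknown: list = field(default_factory=list)  # open questions

    def course_of_action(self):
        """Each open question becomes an investigation task."""
        return [f"Investigate: {item}" for item in self.unknown]

    def resolve(self, item, finding):
        """Move an item from unknown to known once it has been determined."""
        self.unknown.remove(item)
        self.known[item] = finding


# Example values are made up; the IP addresses come from documentation ranges.
sheet = AnalysisWorksheet(
    incident_id="INC-0001",
    known={"victim_ip": "192.0.2.10", "attacker_ip": "203.0.113.5"},
    unknown=["attack vector", "machines compromised"],
)
sheet.resolve("attack vector", "phishing attachment")
tasks = sheet.course_of_action()
```

As the investigation progresses, items move from the unknown list to the known facts, and whatever remains unknown is the outstanding course of action.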

Incident Communication:

Once the incident analysis is under way, there is bound to be a constant flow of information coming from the responder groups. Communicating this to the appropriate stakeholders is key to effective Incident Handling. Different organizations have different communication protocols and, as mentioned above in the Pre-requisites section, they need to be defined in advance. This can be done along these lines:

  • Establish a communication protocol – Who to call? What is the number to call? What times to call?
  • Primary and Secondary contact persons
  • Communication template – Email, Report, SMS, Ticket updates, calls etc.
  • Timelines – For example, first update within 30 minutes, second update within 1 hour of the first update etc.
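
The timeline item above can be turned into concrete deadlines once an incident is declared. This is a small sketch under one stated assumption: each deadline is chained from the previous deadline rather than from the time the previous update was actually sent. The function name and the interval list are hypothetical.

```python
from datetime import datetime, timedelta

# Intervals from the example above: first update within 30 minutes,
# second update within 1 hour of the first update.
UPDATE_INTERVALS = [timedelta(minutes=30), timedelta(hours=1)]


def update_deadlines(declared_at, intervals=UPDATE_INTERVALS):
    """Return the latest acceptable time for each successive update.

    Assumption: each deadline is measured from the previous deadline,
    which gives the worst-case schedule.
    """
    deadlines = []
    due = declared_at
    for gap in intervals:
        due = due + gap
        deadlines.append(due)
    return deadlines


deadlines = update_deadlines(datetime(2024, 1, 10, 9, 0))
# First update due by 09:30, second update due by 10:30.
```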

In our opinion, incident communication is one of the most under-rated aspects of incident handling and getting this right is important.

Post-Analysis:

Once the analysis is complete, a decision needs to be made based on the collected facts. The decision can be to continue with the Incident containment function or move directly to the Incident Recovery function.

Part 4 – Incident Containment

Introduction

Every incident requires careful investigation and response. One of the strategies most often used by CSIRT teams is Incident Containment. By definition, Incident Containment is a function that helps limit and prevent further damage, while ensuring that forensic evidence that may later be needed for legal action against the attackers is not destroyed.

Firstly, Containment is a strategy:

Organizations usually think of containment as a process step to follow during Incident Response. In our opinion, however, Incident Containment should be a strategy. Once a containment strategy is defined, the respective tools and technologies can be selected to fulfil it, and the process pieces will follow. Containment strategies can be defined based on the focus area in the IT infrastructure: the perimeter, the extended perimeter, the internal tier, the endpoint, or a combination of these. The strategy depends mostly on understanding your IT infrastructure and making the best use of it. That is why it is not the same for every organization, and rightly so. A few examples of such containment strategies are listed below:

Examples of Perimeter & Extended Perimeter Strategy – Stop the outbound communication from the infected machine, block inbound traffic, IDS/IPS filters, Web Application Firewall policies, null route DNS, fail-over to backup link, switch to secondary data centre etc.

Examples of Internal Networks Strategy – Switch-based VLAN isolation, router-based segment isolation, port blocking, IP or MAC address blocking, ACLs etc.

Examples of Endpoint Strategy – Disconnecting the laptop/desktop, powering off the servers, blocking rules in the desktop firewall, HIPS etc.

Based on these examples, you can get an idea of what each of the strategies looks like. It is also important to map them to the “Incident Categories” defined in Part 2 – Incident Classification for which they are effective, thereby making it easier to define processes and procedures specific to those categories. It is also imperative to define which strategy is “Short Term” and which is “Long Term”.

What is Short Term Containment? – Typically, short term containment is a break-fix or quick-heal measure. The objective of short term containment is to prevent the asset or the user from causing further damage in the organization. It is akin to the quarantine mechanism in AV software, where the threat is not removed but its potential to cause further damage is quelled. Everyone reading this post has probably implemented short term containment at some point in their CSIRT life. Remember “pull the plug”, “block the MAC”, “disable the user” etc. However, it is important to note that this does not fix the real reason the incident happened. It also does not stop the incident from recurring on a different asset in the organization. This is where long term containment comes into play.

What is Long Term Containment? – Long term containment is an enterprise-wide fix that is a step short of complete remediation of an incident’s root cause or attack vector. The objective of long term containment is to stop other users or assets in the organization from being impacted by the same incident. Input to long term containment comes from the Incident Handling phase, where the appropriate investigations have been done and the possible attack vectors or infection methods have been identified. Until full-fledged enterprise-wide remediation is carried out, steps like a WAF behavioural policy, a custom Snort signature to block the attack pattern, a HIPS policy for system lock down etc. can be considered long term containment strategies.
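
A containment catalogue that tags each strategy by infrastructure layer, by term and by incident category could be sketched as below. The strategy names are taken from the examples above, but the category names (`malware`, `web_attack`, `unauthorized_access`) and the mapping of strategies to categories are assumptions for illustration; use the categories you defined in Part 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainmentStrategy:
    name: str
    layer: str         # "perimeter", "internal" or "endpoint"
    term: str          # "short" or "long"
    categories: tuple  # incident categories it is effective for (hypothetical)


# Illustrative catalogue; the category mapping is an assumption.
CATALOGUE = [
    ContainmentStrategy("Block outbound traffic from infected machine",
                        "perimeter", "short", ("malware",)),
    ContainmentStrategy("Switch-based VLAN isolation",
                        "internal", "short", ("malware", "unauthorized_access")),
    ContainmentStrategy("Disconnect the laptop/desktop",
                        "endpoint", "short", ("malware",)),
    ContainmentStrategy("Custom IDS signature for the attack pattern",
                        "perimeter", "long", ("malware", "web_attack")),
    ContainmentStrategy("WAF behavioural policy",
                        "perimeter", "long", ("web_attack",)),
    ContainmentStrategy("HIPS policy for system lock down",
                        "endpoint", "long", ("malware",)),
]


def strategies_for(category, term):
    """List the catalogue entries that apply to a category and term."""
    return [s.name for s in CATALOGUE
            if category in s.categories and s.term == term]


short_term_malware = strategies_for("malware", "short")
long_term_web = strategies_for("web_attack", "long")
```

Keeping the catalogue as data rather than as prose makes it easy to look up, during an incident, which pre-agreed options exist for a given category.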

Validating the Strategy: Once a strategy is identified and categorized, it has to be tested for effectiveness in the field. Such validation cannot happen during a live incident, so it has to be done in advance, covering the efficacy of the strategy, the timeliness of execution, the responsible parties, potential pitfalls etc. This validation also paves the way for planning the process steps required for the containment plans to work. It can be done using simulations and test runs of incidents, which help fine-tune the strategy and the co-ordination of the teams.

Monitoring Effectiveness: Now that you have a validated Incident Containment strategy, the next step is to ensure that the strategy is effective against the attack vector once it is applied. This is where monitoring the attack vector, the targeted victims, outbound traffic from the victims etc. becomes an important measure of effectiveness. This can be a simple monitoring rule in a SIEM product with a forward-looking time frame, or it could be a completely monitored network segment.
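
The idea of a monitoring rule with a forward-looking time frame can be illustrated without any particular SIEM product. The sketch below assumes log events are available as dictionaries with `time`, `src`, `direction` and `dst` keys; that schema, the function name and the seven-day default window are all hypothetical.

```python
from datetime import datetime, timedelta


def containment_violations(events, victims, contained_at,
                           watch_for=timedelta(days=7)):
    """Return outbound events from contained victims seen after containment.

    Any hit inside the watch window suggests the containment is not
    holding. The event schema used here is an assumption.
    """
    watch_end = contained_at + watch_for
    return [
        e for e in events
        if e["src"] in victims
        and e["direction"] == "outbound"
        and contained_at <= e["time"] <= watch_end
    ]


contained_at = datetime(2024, 1, 10, 12, 0)
events = [
    # Before containment: expected, not a violation.
    {"time": datetime(2024, 1, 10, 11, 30), "src": "192.0.2.10",
     "direction": "outbound", "dst": "203.0.113.5"},
    # After containment from a contained victim: a violation.
    {"time": datetime(2024, 1, 10, 13, 0), "src": "192.0.2.10",
     "direction": "outbound", "dst": "203.0.113.5"},
    # After containment but from a host that is not a known victim.
    {"time": datetime(2024, 1, 10, 13, 5), "src": "192.0.2.77",
     "direction": "outbound", "dst": "198.51.100.9"},
]
violations = containment_violations(events, {"192.0.2.10"}, contained_at)
```

An empty result over the whole watch window is the evidence that the containment held; a non-empty one is a cue to revisit the strategy.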

In our opinion, a validated containment strategy, a detailed containment plan and an effective monitoring routine together make Incident Containment whole and meaningful. The next step after containment is Incident Recovery.

Part 5 – Incident Recovery

Introduction

Incidents can’t be avoided entirely; however, the damage can be greatly minimized by a mature Incident Detection and Response function. In the CSIRT Series, we have been looking in detail at the various functions that make up a good IR process framework. Incident Containment and Incident Recovery are complementary processes. While containment is aimed at stopping the spread of a breach, recovery is all about getting back on our feet by reverting to a “known good state”. The term “known good state” is, in our opinion, quite ambiguous: it may apply to a single machine or to an entire network. We see the recovery process, or getting back to a “known good state”, as a combination of three sub-steps:

  1. Pre-Recovery – Forensic evidence collection is, in our opinion, a pre-recovery step. It is a critical process for collecting and maintaining evidence that may be required to pursue future legal action.
  2. Recovery from Backup – Ensure that systems or networks are returned to the pre-breach state.
  3. Post-Recovery – As a post-recovery step, remediation of the threat vector is crucial. This is the process that ensures the infection or threat vector is no longer an issue.

Let us look at each of these sub-steps in detail.

Pre-Recovery – In cases that call for legal action, it is important that we clearly document how all evidence has been collected, preserved and handled so that it is admissible in court. This is called Forensic Evidence Collection. It is key to note that legal requirements vary from region to region and jurisdiction to jurisdiction, and a forensic analyst should be aware of that. It is recommended to have some of the team members obtain computer forensics training and certification so that they can handle the entire process end to end. However, it is not uncommon to get professional third-party help for conducting forensic evidence collection and investigations during an incident. Forensics is a field in itself, and detailing all the process steps here would be impractical. Hence, we have tried to give a succinct summary of what forensics entails:

    1. Determine legal issues regarding the incident that may cause an impact
    2. Determine technology and processes within the scope of the forensic analysis
    3. Identify evidence from the infected machine or person. The evidence can be electronic or physical.
    4. Document and collect the identified evidence following the chain of custody
    5. Perform forensic investigation and analysis.

Once the incident forensic process has been initiated, it is possible that the incident may need to be reclassified based on the results. If so, the entire recovery and remediation process takes on a different complexion. For example, a malicious code incident was originally triaged and classified as a medium security incident. The forensic analysis reveals that the malicious code has installed hidden back-door processes that can now be traced to additional systems that were not originally identified as being affected. The incident should be reclassified from a medium security incident to a high security incident.
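
The reclassification example above can be expressed as a simple rule. The sketch below assumes a three-level severity scale and a hypothetical rule of raising severity by one level whenever forensics finds affected systems that were not originally identified; your classification scheme from Part 2 may call for different criteria. The host names are made up.

```python
# Hypothetical three-level scale, ordered from lowest to highest.
SEVERITY_LEVELS = ["low", "medium", "high"]


def reclassify(current, originally_affected, forensically_affected):
    """Raise severity one level if forensics finds additional systems.

    Returns the (possibly unchanged) severity and the set of systems
    that were not part of the original triage.
    """
    newly_found = set(forensically_affected) - set(originally_affected)
    if not newly_found:
        return current, newly_found
    index = SEVERITY_LEVELS.index(current)
    escalated = SEVERITY_LEVELS[min(index + 1, len(SEVERITY_LEVELS) - 1)]
    return escalated, newly_found


severity, extra_hosts = reclassify(
    "medium",
    originally_affected=["ws-014"],
    forensically_affected=["ws-014", "srv-db-02", "ws-031"],
)
```

Here the back-door traced to two additional systems moves the incident from medium to high, mirroring the example in the text.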

Recovery from Backup – If the incident fits the criteria of high severity and/or high impact, the CSIRT team should determine whether IT business continuity, disaster recovery and/or backup restoration procedures should be initiated. The reason this is limited to high severity and high impact incidents is purely practical. The goal of the recovery phase is to safely put the impacted systems back into production. To complete the recovery process, the following three steps have to be followed:

  • Validation of the recovered systems – Involves asking the user base whether the system is operating properly, or verifying with profiling tools that the ports and services of the system are consistent with its expected profile
  • Restoring Operations – Involves placing the system into full production, allowing it to interact fully with other devices on the network
  • Monitoring – Involves checking systems for back-doors or any other issues which may have escaped previous detection. If possible, host-based and network-based monitoring should be used to confirm that the attacker did not leave any back-doors on the system
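
The validation step, checking that ports and services are consistent, boils down to comparing an observed profile against a known baseline. The sketch below assumes both profiles are already available as sets of `(port, service)` pairs; how they are gathered is left to whichever profiling tool you use, and the sample values are made up.

```python
def compare_profile(baseline, observed):
    """Compare observed (port, service) pairs with the expected baseline."""
    baseline, observed = set(baseline), set(observed)
    return {
        "unexpected": sorted(observed - baseline),  # possible back-doors
        "missing": sorted(baseline - observed),     # services not restored
    }


# Illustrative profiles; real ones would come from a profiling tool.
baseline_profile = {(22, "sshd"), (443, "nginx")}
observed_profile = {(22, "sshd"), (443, "nginx"), (4444, "unknown")}

diff = compare_profile(baseline_profile, observed_profile)
system_validated = not diff["unexpected"] and not diff["missing"]
```

In this example the unexpected listener on port 4444 means the system should not be returned to production until it has been explained.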

Ultimately, when services are restored, the system should have an effective defence against future attacks of the same nature. Any access methods which may have been used to conduct such an attack should be corrected. When restoring services, systems or data from archived backups, the type of attack, the data affected and, most importantly, the timeline in which the attack initially took place should be taken into consideration. This information should have been discovered and documented as part of the forensic analysis. This step in the recovery process is critical so that vulnerabilities, malware or corrupted data are not re-introduced into the operating environment. Depending on the severity of the incident, a full system rebuild may be required in order to re-establish the integrity of the system.
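
The role of the attack timeline in choosing a backup can be shown with a short sketch. It assumes the forensic analysis has produced a timestamp for the start of the compromise and that backup times are known; the function name and dates are hypothetical.

```python
from datetime import datetime


def latest_clean_backup(backup_times, compromise_start):
    """Pick the most recent backup taken before the compromise began.

    Returns None when every backup postdates the compromise, in which
    case a full system rebuild may be the safer option.
    """
    candidates = [t for t in backup_times if t < compromise_start]
    return max(candidates) if candidates else None


# Made-up weekly backups and a compromise start time from forensics.
backups = [
    datetime(2024, 1, 1, 2, 0),
    datetime(2024, 1, 8, 2, 0),
    datetime(2024, 1, 15, 2, 0),
]
chosen = latest_clean_backup(backups, datetime(2024, 1, 10, 16, 45))
```

Note that a backup predating the compromise is free of the attacker’s changes but may still carry the original vulnerability, which is why the post-recovery remediation below is still needed.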

Post-Recovery – Once the recovery is completed, incident remediation steps should be followed. Most of the time, the threat vector will be a system vulnerability or a network vulnerability. For such vectors, available patches or system updates should be applied. System hardening techniques may also need to be applied, and core deployment images may need to be updated to prevent the weakness from being introduced elsewhere in the organization. In the case of a non-vulnerability-related vector, the root cause should be identified and appropriate fixes implemented.

Conclusion

It is important to have a well-defined and smoothly functioning recovery capability in the CSIRT team. Without recovery capabilities, the probability of a security incident or issue recurring persists.