
Big Data – What you need to know?

Big Data is the buzzword in IT circles nowadays. The major reason for this is the exploding “Netizen” base: today everything happens online, and online data is estimated in zettabytes. The wealth of information one can carve out of this data is undeniably attractive to organizations for marketing and sales. Organizations like Google, Yahoo, Facebook and Amazon process several petabytes of data on a daily basis, and many more are moving towards being able to collect, store and make sense of Internet data to further their interests. That is where “Big Data” has caught the imagination of people around the world. But what is Big Data, and how can you jump on this bandwagon? Fret not, for in this blog post you are going to find out all about it. The structure of this post will be typical of the “What you need to know?” series posted at Infosecnirvana.com. So let's get started!

What is Data?
Data is anything that provides value, in a structured or unstructured format. It is the lowest level of abstraction in computing terms, because below it there are only binary digits. Data is typically stored in file systems.

Introducing File Systems
File systems are the basis of storing and accessing data on a hardware device. A file system is an abstraction layer of software/firmware that gives you the ability to store data in a structured format, remember that structure and, when queried, retrieve the data as quickly as possible. There are two major and common types of file systems – disk based (local access) and network based (remote access). To give a simple example, FAT is a Windows disk-based file system, whereas NFS is a network-based file system.

Even though both types continue to dominate the IT space, more and more relevance is being given to network-based file systems for obvious reasons: distributed data storage, redundancy, fault tolerance and so on. This is the basis of “Big Data” tools and technologies.

Introducing DFS
Distributed File Systems (DFS) are network-based file systems that allow data to be shared across multiple machines and multiple networks. This makes it possible for multiple users on multiple machines to share files and storage resources. The client machines do not have direct access to the storage disk itself (as in a disk-based file system), but interact with the data through a file system protocol. One classic example of a DFS is Microsoft SMB, where Windows machines act as SMB clients and access a common SMB share on a file server. But SMB suffers from issues of scalability and fault tolerance. This is where systems like the Google File System – GFS (used by Google in its search engine) and the Hadoop Distributed File System – HDFS (used by Yahoo and others) come into prominence. These file systems provide a mechanism to effectively manage big data collection, storage and processing across multiple machine nodes.

Introducing HDFS:

The Hadoop Distributed File System, or HDFS for short, is similar to the other distributed file systems discussed above, yet significantly different as well. HDFS can be deployed on commodity hardware, is highly fault tolerant and is very capable of handling large data sets. HDFS was originally developed as part of the Apache Nutch project, an open-source search engine akin to Google's. Some of the most prominent software distributions built around HDFS are Apache Hadoop, Greenplum, Cloudera etc.

In this post, we will be looking at Log Collection and Management using the Hadoop Platform.

Apache Hadoop: The Apache Hadoop architecture, in a nutshell, consists of the following components:

  • HDFS follows a master/slave architecture
  • The master server is called the NameNode
  • The slave servers are called DataNodes
  • Data is replicated across the nodes
  • Interface language – Java

Installing Hadoop: Installing Apache Hadoop is not a very easy task, but it is not too complex either. An understanding of the hardware requirements, operating system requirements and the Java programming language will help you install Apache Hadoop without any issues. Hadoop can be installed either as a single-node installation or as a cluster installation. For this post, we will look only at the single-node installation steps:

  1. Install Oracle Java on your machine (the steps here assume Ubuntu)
  2. Install the OpenSSH server
  3. Create a Hadoop group and a Hadoop user, and set up key-based SSH login for that user
  4. Download the latest distribution of Hadoop from http://www.apache.org/dyn/closer.cgi
  5. Installation is just extracting the Hadoop files into a folder and editing a few property files
  6. Provide the location of the Java home (JAVA_HOME) in hadoop/conf/hadoop-env.sh
  7. Create a working folder in the Hadoop user's home directory, e.g. /home/<hadoop-user>/tmp
  8. Add the relevant details about the host and the working folder using the following configuration elements in hadoop/conf/core-site.xml:
    <!-- In: conf/core-site.xml -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/<hadoop-user>/tmp</value>
      <description>A base for other temporary directories.</description>
    </property>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:54310</value>
      <description>The name of the default file system. A URI whose
      scheme and authority determine the FileSystem implementation. The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class. The uri's authority is used to
      determine the host, port, etc. for a filesystem.</description>
    </property>
  9. Then we need to edit the hadoop/conf/mapred-site.xml using a text editor and add the following configuration values (like core-site.xml)
    <!-- In: conf/mapred-site.xml -->
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
      <description>The host and port that the MapReduce job tracker runs
      at. If "local", then jobs are run in-process as a single map
      and reduce task.</description>
    </property>
  10. Open hadoop/conf/hdfs-site.xml using a text editor and add the following configurations:
    <!-- In: conf/hdfs-site.xml -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <description>Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.</description>
    </property>
  11. Before running the Hadoop installation, the most important step is to format the NameNode (the master server). This is critical because without a formatted NameNode, the DataNodes cannot be set up. In a single-node installation the NameNode and DataNode reside on the same host, whereas in a cluster installation the NameNode and DataNodes reside on different hosts. To format the NameNode using the Hadoop commands, run: /hadoop/bin/hadoop namenode -format
  12. To start the Hadoop instance, run ./start-all.sh from hadoop/bin (or ./start-dfs.sh followed by ./start-mapred.sh). Once Hadoop starts up, querying the running Java processes (for example with the jps command) should show the following Hadoop components:
    NameNode
    DataNode
    SecondaryNameNode
    JobTracker
    TaskTracker
  13. If you have successfully completed all the steps above, you now have a Hadoop single-node instance running on your machine.

Getting Data in/out of Hadoop:

Once the installation is complete, the next thing we need to worry about is getting data in and out of the Hadoop file system. Typically, to get data into the system we need an API interface into HDFS, usually a Java or HTTP API. Tools like Fluentd and Flume help in getting data in and out of Hadoop; both tools have plugins for receiving HTTP data, streaming data and syslog data as well.
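To make this concrete, below is a minimal sketch of copying a file into and out of HDFS using the Hadoop Java API. It assumes the single-node instance configured above is running with fs.default.name set to hdfs://localhost:54310; the file paths are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * A minimal sketch of moving a file in and out of HDFS with the Java API.
 * Assumes the single-node instance described above is running; paths are illustrative.
 */
public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the single-node instance configured in core-site.xml.
        conf.set("fs.default.name", "hdfs://localhost:54310");

        FileSystem fs = FileSystem.get(conf);

        // Copy a local log file into HDFS...
        fs.copyFromLocalFile(new Path("/var/log/syslog"), new Path("/logs/syslog"));

        // ...and copy it back out again.
        fs.copyToLocalFile(new Path("/logs/syslog"), new Path("/tmp/syslog.copy"));

        fs.close();
    }
}

Tools like Flume and Fluentd essentially automate this kind of write path for streaming and syslog sources, so you rarely have to hand-roll it for log collection.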

MapReduce: Hadoop and Big Data discussions are incomplete without talking about MapReduce. MapReduce is a software framework for processing data as key/value pairs, and a job is split into two kinds of tasks. Map tasks take the input data, split it into smaller chunks and emit intermediate key/value pairs; reduce tasks then aggregate the values for each key into the final output. This framework is the powerhouse of Hadoop because it is built with parallelism in mind: map and reduce tasks can both run in parallel on several machines without compromising on speed, CPU or memory resources. The JobTracker is the central master that schedules and tracks the jobs and their tasks, whereas the TaskTrackers on the DataNodes simply provide the processing resources.
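The canonical word-count job illustrates this split nicely. The sketch below is a lightly trimmed version of that standard example (the job driver/configuration code is omitted): the map task emits a (word, 1) pair for every word it sees, and the reduce task sums those counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * A minimal sketch of the Map and Reduce halves of the classic word-count job,
 * to illustrate the key/value flow described above. Driver code is omitted.
 */
public class WordCountSketch {

    /** Map task: split each input line into words and emit (word, 1). */
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    /** Reduce task: sum the counts emitted for each word. */
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

On the single-node installation above both task types run locally; on a cluster the very same code runs in parallel across the TaskTrackers.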

Finally, using Hadoop: Now that we know what drives Hadoop and how to get it installed, the easiest way forward is to start using it. Several example MapReduce jobs written in Java are available to aid learning. There are also several related projects working to make the Hadoop ecosystem more scalable and mature. Some of them are:

  • HBase, a Bigtable-like structured storage system on top of Hadoop HDFS
  • Apache Pig, a high-level data-flow language and execution framework for parallel computation, built on top of Hadoop Core
  • Hive, a data warehouse infrastructure which allows SQL-like ad hoc querying of data (in any format) stored in Hadoop
  • ZooKeeper, a high-performance coordination service for distributed applications
  • Hama, a distributed computing framework similar to Google's Pregel, based on BSP (Bulk Synchronous Parallel) computing techniques, for massive scientific computations
  • Mahout, scalable machine learning algorithms implemented on Hadoop

Conclusion: I hope this post helped you understand the basic concepts of Big Data and set up a Hadoop single-node installation to play with. Please do post your thoughts on how Big Data is playing a major role in your organisation.

APT – What you need to know?

APT – Advanced Persistent Threat – is the latest buzzword in the industry. Everyone in the security industry, professionals and businesses alike, wants to jump on the bandwagon called APT. Security product vendors are all gearing up to cater to “APT”, and their current product lines or future releases address APT in some form or other. The fever has now spread to IT management as well, and they want their security teams to detect and prevent APT. But even though the InfoSec public has caught on, how much thought have we put into understanding the magnitude of the problem at hand? Is it enough to just jump onto something without understanding it fully, or do we need more educated and intelligent decision making?

Let us find out more in this post!

As always, I would like to start by defining APT. This is key because once the definition is clear, all we need to do is align our thinking to it. Then I will list the flaws in our current approach to security. Finally, I will try to list as many possible solutions to the problem at hand as I can.

Defining APT:
Simply put, an APT is a security threat to the enterprise (or even the end user, for that matter) that is advanced enough in execution that traditional security filters cannot catch it outright, and persistent enough that it keeps moving from one compromised target to another, evading detection.

Is it a technology of the future? No, it is not. APT is nothing but a threat we are not trained to see. One of the main reasons APT has been so successful in many organizations is that we have an outdated security strategy. For example, say we are keen on tracking data exfiltration from a compromised machine. How do we do it today?

  • To start with, we look at Data Loss Prevention (DLP) solutions and see which vendor is the market leader
  • Then we implement the DLP solution with basic policies for generic data loss (PDF, Word documents, XLS, source code, credit card numbers, PAN, PII etc.)
  • We fine-tune the DLP policies specifically for our enterprise and implement detection and prevention capabilities
  • We log the data from the DLP solution to a SIEM and alert when something of interest happens
  • In addition, or instead, IDS/IPS rules are implemented to identify data loss traffic based on regexes, file names etc.
  • In some cases we also look at traffic going to blacklisted domains and IPs
I am sure all, or at least the majority, of organizations do this to identify data exfiltration. But can all those organizations say they are safe against APT? The answer is a sad no. The reason: the known (a policy or signature of what is bad) is a drop, while the unknown (where APT operates) is an ocean. The threat landscape has evolved to exploit the unknown, but we have not evolved to detect and respond to it. What is the solution to this problem?

There are several solutions being proposed by people in the industry. In my opinion, one of the most important is behavior profiling and anomaly detection.

Now, what is behavior profiling?

Behavior Profiling – Every network, and every segment of a network, has a behavior profile that is deemed normal. Today, how many of us know what our network segments look like in terms of the connections they accept and deny, the traffic flowing within the segment, the protocols that are most and least used, the packet sizes that flow, the outbound and inbound communications that happen, and who is and is not supposed to have access in and out? I seriously doubt many do. We are more concerned with getting the system up and providing the service it is meant to provide; we seldom think about the security profile of the segment. Once we profile a segment, we can identify several anomalies.
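To make the idea concrete, here is a minimal, hypothetical sketch of the simplest form of behavior profiling: learn a per-host baseline of daily outbound volume from historical data and flag large deviations as anomalies. The class and field names are purely illustrative and not tied to any product.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * A minimal, hypothetical sketch of behavior profiling: build a per-host
 * baseline of daily outbound bytes and flag large deviations as anomalies.
 * Names and thresholds are illustrative only.
 */
public class BehaviorProfiler {

    // Baseline statistics (mean and standard deviation) per host.
    static class Baseline {
        final double mean;
        final double stdDev;
        Baseline(double mean, double stdDev) { this.mean = mean; this.stdDev = stdDev; }
    }

    private final Map<String, Baseline> baselines = new HashMap<>();

    /** Learn a baseline from historical daily outbound byte counts for one host. */
    public void learn(String host, List<Long> dailyOutboundBytes) {
        double mean = dailyOutboundBytes.stream().mapToLong(Long::longValue).average().orElse(0);
        double variance = dailyOutboundBytes.stream()
                .mapToDouble(b -> (b - mean) * (b - mean)).average().orElse(0);
        baselines.put(host, new Baseline(mean, Math.sqrt(variance)));
    }

    /** Flag today's count as anomalous if it deviates more than 3 sigma from the baseline. */
    public boolean isAnomalous(String host, long todaysOutboundBytes) {
        Baseline b = baselines.get(host);
        if (b == null) return false;                  // no profile yet, nothing to compare against
        double deviation = Math.abs(todaysOutboundBytes - b.mean);
        return deviation > 3 * Math.max(b.stdDev, 1); // guard against a zero standard deviation
    }

    public static void main(String[] args) {
        BehaviorProfiler profiler = new BehaviorProfiler();
        profiler.learn("10.1.1.25", List.of(120_000_000L, 110_000_000L, 130_000_000L, 125_000_000L));
        // A sudden 2 GB outbound day stands out against the ~120 MB baseline.
        System.out.println(profiler.isAnomalous("10.1.1.25", 2_000_000_000L)); // true
    }
}

In practice the profile would cover many more dimensions (ports, protocols, destinations, access times), but the principle is the same: model normal first, then alert on deviation.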

Let us now take the same example of Data Exfiltration and see how Behavior profiling would help:

  1. We would have complete details about where sensitive data resides: the VLAN, the server, the folder, the file, the DB tables etc.
  2. For the sensitive machine/network/data, we would know who has access and who does not
  3. We would also track who has a copy of that data – on what machine, and where it resides (desktop, laptop, mobile) etc.
  4. The teams and individuals that use the data are also profiled, giving us the subset of people handling that sensitive data
  5. Any theft of that data would be through one of the above actors/entities
  6. Tracking the activity of each of their machines over time would give us a normal behavior profile
  7. Digital markers can also be placed on such sensitive data by the corporation to track data use and flow
  8. We can also track the periodicity of data access, the time of access, and changes to the data through these digital markers
  9. Any deviation from the normal behavior profile is a potential data exfiltration action and needs to be investigated
  10. Behavior profiles thus created can be used in addition to signature-based detection

This requires intimate coordination with various teams and a deep understanding of what your network does and what it is supposed to do. While this is the most logical approach, it is also the most challenging to implement, and thus the most rewarding as well. Behavior profiling has been used in the intelligence community for a long time, but the technology community has yet to embrace it. Enterprise data is becoming critical, and with threats like APT, our fundamentals are being questioned.

This approach can help after the fact, but to prevent the occurrence in the first place a long-term solution is needed. From a long-term perspective, the only solution is to build networks and applications (operating systems as well as apps) from the ground up to treat security as an embedded characteristic and not an add-on feature.

What are your thoughts on APT? How do you think we should change our security thinking and technology to combat it? Sound off below!

SIEM Use Cases – What you need to know?

My previous post, “Adopting SIEM – What you need to know”, is a better starting point if you are new to SIEM and want to implement it in your organization. If you already use, manage or implement a SIEM, then read on.
To start with, SIEM tools take a lot of effort to implement. Once implemented, they need to be cared for like babies. If that care is not given, within a few months you will be staring at a million-dollar museum artifact. There are two parts to this care:

  1. Making sure that the systems are updated regularly, not only with patches and configuration changes but also with the content put into them.
  2. Second, and most important, making the SIEM relevant to the current threat landscape.

Anyone who has worked on a SIEM for some time will agree with me that administration is generally easier than keeping the system relevant to the threat landscape. Before people hit me with “administration is also a pain”, I would offer in defense that most SIEM products come with documentation that gives a fair amount of information on how to install, update, upgrade and operate the system. Translating the threat landscape into nuts and bolts for SIEM purposes, however, is the biggest challenge, and there are no guides that can help you do that.

In this blog post, my attempt is to make this translation as easy as possible. In SIEM parlance, we call the translation a Use Case. If Use Cases are well defined, implementing them, responding to them and managing them becomes easier. Such Use Cases eventually become the cornerstone on which a SOC (Security Operations Center) is built. As usual, I would like to start by defining a Use Case, run through its stages and finally wrap up with an example. So here we go.

Use Case Definition: A Use Case, by definition, is a logical, actionable and reportable component of an event management system (SIEM). It can be a rule, report, alert or dashboard which solves a set of needs or requirements.

A Use Case is actually “developed”, and this development is a complete process, not just a simple task. Like a mini project, it has several stages. The stages involved in Use Case development are as follows:

  • The first stage is “Requirements” definition. The requirement can be any of the following high-level categories and is unique to every company:
    1. Business
    2. Compliance
    3. Regulatory
    4. Security
  • Once the requirements are finalized, the next stage would be to “Define the scope” of the requirement. This would typically mean the IT Infrastructure that needs to be protected and is a high priority for the specific requirement.
  • Once the scope is finalized, we can sit down and list the “Event Sources” required to implement the Use Case. These are the log data, configuration data, alert data etc. coming out of the IT systems within the scope defined above.
  • The next stage is to ensure that the event sources go through a “Validation” phase before use. Many times we have an event source, but the data required to trigger an event is not available. This needs to be fixed before we proceed with Use Case development.
  • Post validation, we need to “Define the Logic”. This is where we define exactly what and how much data is needed to alert, along with the attack vector we would like to detect.
  • Use Case “Implementation and Testing” is the next stage. This is where we actually configure the SIEM to do what it does best – correlation and alerting. During implementation, the desired output is also defined. The output can be one of the following:
    1. Report
    2. Real Time Notification
    3. Historical Notification
  • Once implementation is done, we need to “Define Use Case Response” procedures. These procedures help you to make the Use Case Operational.
  • Finally, Use Case “Maintenance” is an ongoing process to keep the Use Case relevant by appropriate tuning.

Now that we have defined in detail the Use Case Development methodology, it is time to take an example and see how this actually looks in Real Life Implementation terms.

The Requirement: Outbound Spam Detection.
The Scope: Mail Infrastructure, End User Machine, Security Detection Infrastructure
The Event Source:
  • IDS/IPS at Network and Host – Signature Based Detection
  • Mail Hygiene or Mail Filtering Tools – Signature Based Detection
  • Events from Network Devices – Traffic Anomaly Based Detection
  • Events from End User Detection tools – Signature and Traffic Anomaly Based Detection

The Event Validation: The devices logging to the SIEM should be normalized and parsed properly. Typically, SIEM products allow content development based on their native field mappings (through parsing). If the fields are not mapped, the SIEM does a poor job of event triggering and alerting. The required fields for the above Use Case would typically be source IP, source user ID, email addresses, target IP, host information for source and target, event names for spam detection, and port and protocol for SMTP-based traffic detection.

Use Case Logic Flow: The logic definition is unique to each environment and needs to be defined accordingly. The logic can be either signature based or behavior based. You can restrict it to a certain subset of data (based on the event sources above) or expand it to be more generic. Some samples are given below, and a minimal sketch of the first rule follows the list:
  • One machine making outbound port 25 connections at a rate of 10 per minute
  • Spam signatures originating from the same source (from IDS/IPS, mail filter etc.) with the same destination public domain
  • Constant SYN scans on port 25 from a single source, etc.
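As mentioned above, here is a minimal, hypothetical sketch of the first sample rule: counting outbound port 25 connections per source within a one-minute window and firing when a threshold of 10 is crossed. The event fields and class names are illustrative and not taken from any specific SIEM product.

import java.util.HashMap;
import java.util.Map;

/**
 * A minimal, hypothetical sketch of the first sample rule above: alert when a
 * single source opens more than 10 outbound port 25 connections in one minute.
 * Field and class names are illustrative, not from any SIEM product.
 */
public class OutboundSpamRule {

    /** A normalized connection event, as produced by the parsing/field-mapping stage. */
    static class ConnectionEvent {
        final String sourceIp;
        final int destinationPort;
        final long timestampMillis;
        ConnectionEvent(String sourceIp, int destinationPort, long timestampMillis) {
            this.sourceIp = sourceIp;
            this.destinationPort = destinationPort;
            this.timestampMillis = timestampMillis;
        }
    }

    private static final int THRESHOLD = 10;
    private static final long WINDOW_MILLIS = 60_000;

    // Per-source counters for the current one-minute window.
    private final Map<String, Integer> counts = new HashMap<>();
    private long windowStart = 0;

    /** Feed events in time order; returns true when the rule fires for this event's source. */
    public boolean onEvent(ConnectionEvent e) {
        if (e.destinationPort != 25) {
            return false;                                // only outbound SMTP is of interest here
        }
        if (e.timestampMillis - windowStart >= WINDOW_MILLIS) {
            counts.clear();                              // roll over to a new one-minute window
            windowStart = e.timestampMillis;
        }
        int count = counts.merge(e.sourceIp, 1, Integer::sum);
        return count > THRESHOLD;                        // fire once the per-source count exceeds the threshold
    }
}

In a real SIEM this logic would be expressed as a correlation rule with aggregation and threshold settings rather than code, but the parameters being configured are exactly these: the field to group by, the window and the threshold.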

Implementation and Testing: Once the logic is defined, configuring the SIEM and tuning the implementation to trigger more accurately is the next phase. After implementation of the Use Case, we need several iterations of incident analysis along with data collection to ensure that the Use Case is doing what it is intended to do. This is done at the SIEM level and may involve aggregation, threshold adjustments, logic tightening etc.

Use Case Response: After implementation, the Use Case needs to be made a valuable resource by defining a Use Case Response. This is the stage where you define what action needs to be taken and how it needs to be taken. You can look at Episode 4 of my Security Investigation Series to get an idea of how to investigate spam cases. Other Security Investigation Series articles are located here – Security Investigation Series.

SIEM Use Cases are really the starting point for good incident detection. If you want to run a SOC, having well-defined SIEM Use Cases will ease management and increase the efficiency of operations. This post is my humble attempt to simplify and standardize Use Case development for SIEM implementations.

As always, I would love to hear comments and thoughts on this topic.