Galleries

Big Data – What you need to know?

Big Data is the buzzword in IT Circle nowadays. The major reason for this is the exploding “Netizen” base. Today Everything is happening Online and Online Data is estimated in zettabytes. The wealth of information one can carve from Online data is undeniably attractive for several organizations for marketing and sales. Organizations like Google, Yahoo, Facebook, Amazon etc process several Petabytes of data on a daily basis. Many more organizations are moving towards being able to collect, store and make sense of data in the Internet to further their interests.That is where “Big Data” has caught the imagination of people around the world. But What is Big Data and How can I jump into this bandwagon. Fret not, for in the blog post, you are going to find all about it.  The structure of this blog will be typical of a What you need to know? series posted at Infosecnirvana.com. So lets get started!!!

What is Data?
Data is anything that provides value in a structured or unstructured format. It is the lowest level of abstraction in Computing terms because after this, it is binary digits only. Data is typically stored in File Systems

Introducing File Systems
File Systems are the basis of storing and accessing data from a hardware device. It is nothing but an abstraction layer of software/firmware that gives you the capability to store data in a structured format, remember the structure and when queried, help retrieve it as quickly as possible. There are 2 major and common types of File Systems – Disk Based (local access) and Network Based (remote access). To give a simple example, FAT is a Windows Disk based File System wheres NFS is a Network based File System.

Even though both the file systems continue to dominate IT space, more and more relevance is given to Network based File Systems for obvious reasons like Distributed Data storage, redundancy, fault tolerance capabilities etc. This is the basis of “Big Data Tools and Technologies”.

Introducing DFS
Distributed File Systems are Network based File Systems that allow data to be shared across multiple machines across multiple networks. This makes it possible for multiple users on multiple machines to share files and storage resources. The client machines don’t have direct access to the Storage disk itself (as in a Disk based file system), but are able to interact with the Data using a File System protocol. One classic example of DFS is Microsoft SMB where All Windows machines are SMB Clients and access a common SMB Share on the File Server. But SMB suffers from issues pertaining to scalability and fault tolerance. This is where systems like Google File System – GFS (Google uses this in their search engine) and Hadoop Distributed File System – HDFS (Yahoo and others) come into prominence. What these File Systems do is provide a mechanism to effectively manage big data collection, storage and processing across multiple machine nodes.

Introducing HDFS:

Hadoop Distributed File Systems or shortly HDFS is similar to the other DFS file systems talked above, however it is significantly different as well. HDFS can be deployed on Commodity Hardware, is Highly Fault Tolerant and is very capable of handling large data sets. Originally HDFS was developed as part of the Apache NUTCH Project for an alternate Search Engine akin to Google. Some of the most prominent software players for HDFS are “Apache Hadoop”, “Greenplum”, Cloudera etc.

In this post, we will be looking at Log Collection and Management using the Hadoop Platform.

APACHE Hadoop: The Apache Hadoop architecture in a Nutshell consists of the following components:

  • HDFS is a Master Slave Architecture
  • Master Server is called a NameNode
  • Slave Servers are called DataNodes
  • Underlying Data Replication across Nodes
  • Interface Language – Java

Installing Hadoop: Installation of Apache Hadoop is not a very easy task, but at the same time it is not too complex either. Understanding of the Hardware Requirements, Operating System Requirements and Java Programming Language can help you install Apache Hadoop without any issues. Installing Hadoop can be either a Single Node Installation or a Cluster Installation. For this post, we will look at only Single Node Installation steps:

  1. Install Oracle Java on your machine – Ubuntu
  2. Install OpenSSH Server
  3. Create a Hadoop Group and Hadoop User and set Key Based Login for SSH
  4. Download the Latest Distribution of Hadoop from http://www.apache.org/dyn/closer.cgi
  5. Installation is just extracting the Hadoop files into a folder and editing some property files
  6. Provide the location for the JAVA home in the following file location- hadoop/conf/hadoop-env.sh
  7. Create a working folder in Hadoop User Home Directory /home//tmp
  8. Add the relevant details about the host and the home directory following configuration elements in /hadoop/conf/core-site.xml
    conf/core-site.xml —>
    
    hadoop.tmp.dir
    /home//tmp
    A base for other temporary directories.
    
    fs.default.name
    hdfs://localhost:54310
    The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri’s scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri’s authority is used to
    determine the host, port, etc. for a filesystem.
  9. Then we need to edit the hadoop/conf/mapred-site.xml using a text editor and add the following configuration values (like core-site.xml)
    conf/mapred-site.xml —>
    
    mapred.job.tracker
    localhost:54311
    The host and port that the MapReduce job tracker runs
    at. If “local”, then jobs are run in-process as a single map
    and reduce task.
  10. Open hadoop/conf/hdfs-site.xml using a text editor and add the following configurations:
    conf/hdfs-site.xml —>
    
    dfs.replication
    1
    Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
  11. Before running the Hadoop Installation, the most important step is to format the NameNode or the Master Server. This is critical because, Without the NameNode, the DataNodes will not be setup. In a Single Node Installation, NameNode and DataNodes will reside on the same host, where as in Cluster Installation, NameNodes and DataNodes will reside on different hosts. In order to format the NameNode using Hadoop commands, Run the following command – /hadoop/bin/hadoop namenode -format
  12. In order to start the Hadoop Instance, from hadoop/bin run ./start-dfs.sh and Running the commands will start up Hadoop and when you query the Java Process, you should be able to see the following components of Hadoop Running:
    NameNode
    DataNode
    SecondaryNameNode
    JobTracker
    TaskTracker
  13. If you have successfully completed till this, then you now have a Hadoop Single Node Instance running on your machine.

Getting Data in/out of Hadoop:

Once the installation is completed, the next thing we need to worry about is getting data in and out of Hadoop File System. Typically in order to get the data into the system, we need a API interface into HDFS. This typically is a JAVA or HTTP API. Tools like FluentD, Flume etc help in getting data in and out of Hadoop. Both the tools have plugins for receiving HTTP data, Streaming data and Syslog Data as well.

MapReduce: Hadoop and Big data discussions are incomplete without talking about MapReduce. MapReduce is a software policy framework that maps Input data based on a map file and outputs data in key value pairs. These are two different jobs when it comes to actual processing. One is the Map Task that splits the data into smaller chunks and there is the Reduce Job that generates a Key Value combination for each of the smaller data chunks. This framework is the powerhouse for Hadoop because, this is built with parallelism in mind. Map Tasks and Reduce Tasks can both be run parallel on several machines without compromising on speed, cpu and memory resources. The NameNode is the central master that tracks the Maps and the Jobs where as the DataNodes are just providing processing resource.

Finally, Using Hadoop: Now that we know what drives Hadoop and how to get Hadoop installed, the easiest thing would be to start using them. Several examples for MapReduce jobs using Java are available to aid in learning. There are several related projects running to make the Hadoop Ecosystem more scalable and mature. Some of them are:

  • HBase, a Bigtable-like structured storage system for Hadoop HDFS
  • Apache Pig is a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.
  • Hive a data warehouse infrastructure which allows sql-like adhoc querying of data (in any format) stored in Hadoop
  • ZooKeeper is a high-performance coordination service for distributed applications.
  • Hama, a Google’s Pregel-like distributed computing framework based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations.
  • Mahout, scalable Machine Learning algorithms using Hadoop

Conclusion: Hope this post helped you in understanding the basic concepts of Big Data and also to setup a Hadoop Single Node Installation to play with. Please do post your thoughts on how Big Data is playing a major role in your organisations.

ArcSight CORR 6.0 – Install and Migration

ArcSight (now HP) Enterprise Security Manager (ESM) is the premiere security event manager that analyzes and correlates every other event in order to support the Security Team or analysts in every aspect of security event monitoring, from compliance and risk management to security intelligence and operations. There have been several versions of ArcSight ESM released over a period in time. Their latest version is ArcSight CORR 6.0. At InfoSecNirvana.com we have got a copy of the latest version and we will be writing a multi-part post on how to Install, Migrate from Older versions to 6.0 and some basic walk around.

In this Part 1 post, we shall cover about the installation of ArcSight CORR (Correlation Optimized Retention and Retrieval), a proprietary data storage and retrieval framework that receives and processes events at high rates, and performs high-speed searches; the latest ArcSight ESM by HP. With the ArcSight CORR, Oracle database is now eliminated.

CORR components:

  • ArcSight Manager
  • CORR Engine
  • ArcSight Console
  • ArcSight Web
  • Management Console
  • Smart Connectors

Requirements:

System: This completely depends on the EPS that you expect to receive. InfoSecNirvana has been working on getting a PoC for this and the below configuration was used:
A VMWare box with 8 cores, 32GB Ram, 256GB SSD HDD, 2TB WD 7200 RPM SATA HDD (Note: for production, there might be/recommend a higher configuration. Check with ArcSight manuals on the same)

OS: Red Hat Enterprise Linux Server release 6.2 x64, installed with xfsprogs-3.1.1-6.el6.x86_64 rpm; this is required to convert some of the ext4 file systems to xfs filesystems. XFS Partition is the most apt format for us to fully utilize the performance enhancements coming with CORR. Typically, I would recommend /opt/ to be formatted with XFS and maximum storage can be allocated to this partition. This is crucial because, the very first step of installation would verify whether the entire /opt/ directory is in XFS. When using VMWare with LVM, we faced some issues during the installation and ArcSight Support could not help us with this. However, when raw devices were mounted as /OPT/ we did not face any issues.

Storage: Please allocate the required storage (calculate based on Number of Devices, Events per second, Average Event Size and Retention period). Remember, CORR is like an ESM with a built in Logger. You can still use a Logger for long term retention if that is what you prefer so that ESM will be lean and mean.

Permissions: The installation has to be done using a Non-Root account. This account can be a service account named”arcsight”. This account should have RWX permissions on the /opt/ directory. Make sure this is satisfied.

Misc: /TMP/ partition should have at least 3GB space. /home/arcsight also should have a minimum of 5GB free space. This is crucial again because, the INSTALL DIR log files are written in these location and if sufficient space is not allocated the installation fails.

The CORR package: Get the CORR installation package and the license from HP ArcSight. This can be obtained from your sales representative with HP/ArcSight.

CORR Installation:
The installation is pretty straightforward and is just a series of clicks. I have given most of the screenshots below just as a reference. Obviously, if you have already installed ArcSight Software, you would not even need this. Once done, you would be able to install the Console to access CORR and play around.



Once the installation is completed, we would want to test the following before we call the install as complete:

  1. Validate the Log Files in the Manager Install Logs and find out if there are any warnings and errors. Generally, this is a best practice to ensure valid installation.
  2. Install the Console and try to connect to ESM, with the default user name and password (mentioned in the install guide). First time when you connect, A certificate import of the Manager happens. If you use a self-signed certificate make sure you note down the parameters used to create cause this will help in future migrations, troubleshooting or recovery.
  3. After connecting to the console, you are ready to go.

Migrations from Existing Installs – Migrating from earlier versions to this CORR instance is tricky, because you are migrating from a DB back end to a NON-DB back end. I will be posting a followup of this post in PART 2 that will detail the migration procedure from 4.X to 5.X.

Stay Tuned to InfoSecNirvana.com for more!!!

SIEM – The Good, The Bad and The Ugly – Part 2

In Part 1 of the post, we discussed about the several shortcomings of SIEM that has risen over the years. These problems need to be addressed if we need to progress further in our maturity with SIEM technology. Let me start with the main capability required for a SIEM to function – Log Management. This is where majority of the problems are.

Log Management: The problems plaguing Log Management in a Client/Server model are way too many to comprehend. Centralized log Management is the solution, however we don’t have a sound Log Management solution today that addresses these needs. Let us see what problems are there in Log Management and how we could solve this. Log Management is broadly divided into four parts:

  1. Log Collection
  2. Log Categorization
  3. Log Storage
  4. Log Management (Making it easily searchable across large sets)

Log Collection – Client-less and Standardized: In my view, ideal solution would be that, the source devices should not have any clients installed on them, should not need special formatting, should not be taxed in terms of processing. Standard method of data collection should be set as the norm. From a higher order, an RFC or a Standard should be floated that standardizes all the Log Data from every IT device. This standardization would help in two things – One to improve the Overall Logging and Auditing capabilities of devices (Client Less – Out Of Box) in a standard format and the other is to Improve the Security Consciousness of the Application development teams. Think in terms of Logging being a part of standard protocol suite. What this would do is help interoperability between devices, be it Log Sources, Log Collectors, Log Managers etc along with bringing together all the fragmented parts of the existing log management under one umbrella. For example, Every IDS vendor today supports SNORT formatting. Similarly, all SIEM vendors should support a standard log Collection and processing so that interoperability, migration between vendors etc becomes more easy. I know CEF is one of the standards, but I am not sure everyone adopts that today.

Log Categorization – Security and Non-Security: When large volumes of Log data comes in, there is a need to separate the list of Security events from normal Non-Security events. The Log Standard also should specify categorization of Events clearly as Security and Non-Security. For every device class, the Security Events should be listed out and only those logs should be collected by SIEM for correlation and incident management. Many organizations collect way too much Log data (up to 100K) but effectively use only 10% of it for Security purposes. That means the remaining 80-90% is Non-Security related Logs. And this junk is the one eating up Terra bytes of space. There is a huge disparity between the Log Management tier and the SIEM tier. Data Collection is always more and more, however SIEM processing is more focussed. This is where the categorization helps so that SIEM receives only Security events to process.

Log Storage and Log Management – Streamlined and consistent data sets: The moment we start collecting standard data and properly categorizing them, the storage, indexing and retrieval becomes easier. This space we are good at and should continue to improve. Things like using Big Data Storage technologies instead of relying on Oracle or SQL or MySQL as the backend limits the capabilities when handling big data sets. Storage has to be streamlined and new indexing and searching capabilities should be thrown into the future development of Log Management tools and solutions.

Security Incident and Event Management (SIEM) – Once the Log Management problems are sorted out, the SIEM problems become easier to solve. One of the major pain points in using SIEM was the client-server architecture. When Log Collection becomes Client-less, the SIEM solutions need not focus on building Log Collection clients and instead focus their energies on better correlation and intelligence data mining. Some sweeping changes that can be brought into SIEM world are as below:

  • SIEM should be an inference engine, a correlation engine and a Data mining engine only. This will bring more value in the intelligence piece of Log Data Mining rather than just parsing and doing some basic alerting defined. Remember, as I always say, Security is an Intelligence Function and not an Operations Functions
  • SIEM should be able to focus only on Security events for Alerting, reporting and investigation. This is where the Log Standardization and Categorization plays a big role. If SIEM were to process only Security Data, we would never be hitting more than 5K in most enterprises.
  • SIEM should be a fast and agile product and not rely on backend DB queries, reports and stats usually driven using Oracle or SQL. This is something that some vendors are starting to explore. I know Novell e-Sentinel has this capability for a long time, HP is now trying to do similar thing with CORR. This is in my opinion the right way forward
  • The SIEM should also become a more Active tool instead of being a Passive tool as it is today. What I mean by this is, a SIEM should be able to respond to threats in a comprehensive way. It should be able to alert, do basic ITIL Service Management Integration out of the box and also if need be, execute boxed responses for alerts. This is helpful because, many a times the rules written in SIEM are basic and over a period of time become repetitive. Such repeatable alert responses can be automated into a Workflow and the system should be capable of becoming self-sufficient.
  • SIEM should get better at Large Data Set Mining. Today, SIEM Management consoles are “code heavy” in the front end and “CPU heavy” on the backend. This kinda reduces the efficiency of SIEM technologies in perform real-time correlation. A radical change is needed in the way SIEM applications are developed to make the application lightning fast.
  • SIEM should also provide flexibility in terms of customization of Incident Detection and Response templates, writing visualization rules (where you are able to chart Attack Vectors, Vulnerable points, Network Maps etc), trending Historical Data in a way where the system automatically detects a pattern of similar issues in the past and so on and so forth. In short, add “Intelligence as well as Learning Capabilities” to SIEM.

Again, I am not sure I have covered everything in terms of possible improvements to existing problems, but at least grazed on a few.

What else do you think can help improve the SIEM capabilities? Comment on below.