Introduction to Blockchains & What It Means to Big Data

“Arguably the most significant development in information technology over the past few years, blockchain has the potential to change the way that the world approaches big data, with enhanced security and data quality just two of the benefits afforded to businesses using Satoshi Nakamoto’s landmark technology.”

What is a Blockchain?

Blockchain is a distributed database system that acts as an “open ledger” to store and manage transactions. Each record in the database is called a block and contains details such as the transaction timestamp as well as a link to the previous block, typically a cryptographic hash of that block. Because changing one block breaks the link held by the next, it is extremely difficult for anyone to alter records retrospectively without being detected. And because the same transaction is recorded across multiple, distributed database systems, the technology is secure by design.

With the above in mind, blockchain is immutable – information remains in the same state for as long as the network exists.
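
The linking described above can be illustrated with a minimal Python sketch. It assumes SHA-256 over a JSON encoding of each block, and the helper names (block_hash, append_block, is_valid) are hypothetical. It is a toy model of the idea, not the format used by Bitcoin or any other network.

```python
import hashlib
import json


def block_hash(block):
    """Return the SHA-256 hex digest of a block's canonical JSON form."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain, timestamp, transaction):
    """Append a block that stores the hash of the block before it."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({
        "timestamp": timestamp,
        "transaction": transaction,
        "prev_hash": prev_hash,
    })


def is_valid(chain):
    """Check that every block still points at the real hash of its predecessor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
append_block(chain, 1, "Alice pays Bob 5")
append_block(chain, 2, "Bob pays Carol 2")
append_block(chain, 3, "Carol pays Dave 1")
valid_before = is_valid(chain)  # all links are intact

chain[0]["transaction"] = "Alice pays Bob 500"  # retroactive edit
valid_after = is_valid(chain)  # the second block's stored hash no longer matches
```

Editing an early block changes its hash, so the next block's stored link stops matching and the check fails. In a real network, many independent copies of the chain would also have to be rewritten for the change to stick.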

Blockchain and Big Data

When you talk about blockchain in the context of Bitcoin, the connection to Big Data seems a little tenuous. What if, instead of Bitcoin, the blockchain was a ledger for other financial transactions? Or business contracts? Or stock trades?

The financial services industry is starting to take a serious look at blockchain technology. Oliver Bussmann, CIO of UBS, says that blockchain technology could “pare transaction processing time from days to minutes.”

The business imperative in financial services for blockchain is powerful. Imagine blockchains of that magnitude. Huge data lakes of blocks that contain the full history of every financial transaction, all available for analysis. Blockchain provides for the integrity of the ledger, but not for the analysis. That’s where Big Data and accompanying analysis tools will come into play.

Opportunities for Big Data Analytics

Recently, a consortium of 47 Japanese banks signed up with a blockchain startup called Ripple to facilitate money transfers between bank accounts using blockchain. The main reason behind the move is to perform real-time transfers at a significantly lower cost. Traditional real-time transfers were expensive partly because of the risks involved. Double-spending (a form of transaction failure in which the same security token is used twice) is a real problem with real-time transfers. With blockchains, that risk is largely avoided. Big data analytics makes it possible to identify patterns in consumer spending and to spot risky transactions much more quickly than is currently possible. This reduces the cost of real-time transactions.
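
As a rough illustration of the double-spending check, here is a short Python sketch. The Ledger class and the token IDs are hypothetical. A real blockchain network reaches this decision through consensus among many nodes, not through a single in-memory set.

```python
class Ledger:
    """Toy ledger that refuses to spend the same token twice."""

    def __init__(self):
        self.spent = set()   # token IDs that have already been consumed
        self.entries = []    # accepted (token_id, payer, payee) records

    def transfer(self, token_id, payer, payee):
        """Record the transfer, or reject it if the token was already spent."""
        if token_id in self.spent:
            return False
        self.spent.add(token_id)
        self.entries.append((token_id, payer, payee))
        return True


ledger = Ledger()
first = ledger.transfer("token-1", "alice", "bob")      # accepted
second = ledger.transfer("token-1", "alice", "carol")   # rejected: token reused
third = ledger.transfer("token-2", "alice", "carol")    # accepted
```

Only the first use of each token is recorded, so the second attempt to spend token-1 never reaches the ledger.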

In industries outside banking, too, the main driver for adopting blockchain technologies has been security. Across healthcare, retail and public administration, organizations have started experimenting with blockchain as a way to handle data and prevent hacking and data leaks. In healthcare, a technology such as blockchain can make sure that multiple “signatures” are sought at every level of data access. This can help prevent a repeat of events such as the 2015 attack that led to the theft of over 100 million patient records.

Possibilities in Real-Time Analytics

Until now, real-time fraud detection has been little more than a pipe dream, and banking institutions have relied on technologies that identify fraudulent transactions only in retrospect. Because the blockchain holds a database record for every single transaction, it gives institutions a way to mine for patterns in real time, if need be.
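
One simple way to picture this kind of pattern mining is a rolling check that flags amounts far above the recent average. The sketch below is illustrative only. The window size, the threshold and the sample amounts are hypothetical, and production fraud systems use far richer signals.

```python
from collections import deque
from statistics import mean, pstdev


def flag_outliers(amounts, window=5, threshold=3.0):
    """Yield (index, amount) for amounts far above the recent rolling average."""
    recent = deque(maxlen=window)
    for i, amount in enumerate(amounts):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # Skip the test when recent amounts are identical (zero spread).
            if sigma > 0 and (amount - mu) / sigma > threshold:
                yield i, amount
        recent.append(amount)


# Hypothetical transaction amounts, for illustration only.
stream = [20, 22, 19, 21, 20, 23, 950, 21, 20]
flagged = list(flag_outliers(stream))
```

Each amount is compared only with the few transactions that preceded it, so the check can run as records arrive instead of waiting for a batch job.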

But all of these possibilities also raise questions about privacy, which runs directly counter to the reason blockchain and Bitcoin became popular in the first place. Several industry experts have expressed concerns that a technology that can provide a record of every transaction can be exploited for everything “from customer profiling to other less benign reasons”.

From another perspective, however, blockchains greatly improve transparency in data analytics. Unlike earlier systems, the blockchain design rejects any input that it cannot verify or that it deems suspicious. As a result, analysts in industries such as retail deal only with data that is completely transparent. In other words, the customer behavior patterns that blockchain systems identify are likely to be a good deal more accurate than those available today.

Uncovering Transactional Data

The data within the blockchain is predicted to be worth trillions of dollars as it continues to make its way into banking, micropayments, remittances, and other financial services. In fact, the blockchain ledger could be worth up to 20% of the total big data market by 2030, producing up to $100 billion in annual revenue. To put this into perspective, this potential revenue surpasses what Visa, Mastercard, and PayPal currently generate combined. Big data analytics will be crucial in tracking these activities and helping organizations that use the blockchain make more informed decisions.

Data intelligence services are emerging to help financial institutions, governments, and all kinds of organizations delve into who they might be interacting with on the blockchain and uncover “hidden” patterns.

Uncovering Social Data

As the popularity of bitcoin advanced in 2014 and 2015, the virtual currency began to fluctuate heavily as a result of real-world events and the general public’s sentiment about the technology. These fluctuations are proof that the virtual currency has several characteristics that make it ideal for social data predictions.

According to Rick Burgess of Freshminds: “Using social data to predict consumer behavior is nothing new, and many traders have been looking to include social metrics into their trading algorithms. However, because there are so many factors involved in pricing most financial instruments, it can be extremely difficult to predict how markets will change.”
Fortunately, bitcoin users and social media users tend to align quite well, and it may be beneficial to use them both for data analysis, as he further explains:

  • Bitcoin users tend to be in the same demographic as social media users, and so their attitudes, opinions, and sentiment towards bitcoin are well documented.
  • The value of bitcoins and other cryptocurrencies is determined almost solely by market demand, because the number of coins on the market is predictable and the coins are not tied to any physical goods.
  • Bitcoins are predominantly traded by individuals rather than large institutions.
  • Events that affect Bitcoin’s value are disseminated first and foremost on social media.

Data analysts are now mining social data for insights into key cryptocurrency trends. This, in turn, helps organizations uncover powerful demographic information and link bitcoin’s performance to world events.

Uncovering New Forms of Data Monetization

According to Bill Schmarzo, CTO of Dell EMC Services, blockchain technology also “has the potential to democratize the sharing and monetization of data and analytics by removing the middleman from facilitating transactions.” In the business world, this gives consumers stronger negotiating power with companies. It allows consumers to control who has access to their data through the blockchain. They could then demand pricing discounts in exchange for revealing data on their personal consumption of a company’s product or service.

Schmarzo also explains how the blockchain may lead to new forms of data monetization because it has the following big data ramifications:

  • All parties involved in a transaction have access to the same data. This accelerates data acquisition and sharing, and improves the quality of both the data and the analytics built on it.
  • A detailed register of all transactions is kept in a single “file” or blockchain. This provides a complete overview of a transaction from start to finish, eliminating the need for multiple systems.
  • Individuals can manage and control their personal data without the need for a third-party intermediary or centralized repository.

Ultimately, the blockchain could become a key enabler of data monetization by creating new marketplaces where companies and individuals can share, sell, and offer their data and analytical insights directly with each other.

Spearheaded by the large-scale adoption of bitcoin, blockchain technologies are gaining ground throughout the business and financial worlds. The fast and secure transactions they facilitate could potentially revolutionize traditional data systems. According to a survey, only one-third of decision makers trust their company’s data. With blockchain technologies, this trust can be considerably strengthened, and real applications will become much more commonplace.


HDFS vs. HBase : All you need to know

The sudden increase in the volume of data, from the order of gigabytes to zettabytes, has created the need for a more organized file system for storing and processing data. Demand from the data market has brought Hadoop into the limelight, making it one of the biggest players in the industry. The Hadoop Distributed File System (HDFS), Hadoop’s well-known file system, and HBase, Hadoop’s database, are among the most topical and advanced data storage and management systems available in the market.

What are HDFS and HBase?

HDFS is fault-tolerant by design and supports rapid data transfer between nodes even during system failures. HBase is a non-relational, open source Not-Only-SQL database that runs on top of Hadoop. In terms of the CAP (Consistency, Availability, and Partition Tolerance) theorem, HBase is a CP system: it favors consistency and partition tolerance over availability.

HDFS is most suitable for performing batch analytics. However, one of its biggest drawbacks is its inability to perform real-time analysis, the trending requirement of the IT industry. HBase, on the other hand, can handle large data sets but is not appropriate for batch analytics. Instead, it is used to read and write data in Hadoop in real time.

Both HDFS and HBase are capable of handling structured, semi-structured and unstructured data. HDFS lacks an in-memory processing engine, which slows down data analysis, since the analysis runs through plain old MapReduce. HBase, by contrast, boasts an in-memory processing engine that drastically increases the speed of reads and writes.

HDFS is very transparent in its execution of data analysis. HBase, on the other hand, is a NoSQL database in tabular format: it keeps rows sorted by row key and fetches values by looking them up under those keys.
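
The key-based access pattern can be pictured with a small Python sketch. The SortedKeyStore class below is a hypothetical toy model of a table ordered by row key. It is not the HBase client API, and the row keys and values are made up for illustration.

```python
import bisect


class SortedKeyStore:
    """Toy in-memory table that keeps rows ordered by row key."""

    def __init__(self):
        self._keys = []   # row keys, kept sorted
        self._rows = {}   # row key -> value

    def put(self, key, value):
        """Insert or overwrite a row; new keys go in at their sorted position."""
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        """Random read of a single row by key."""
        return self._rows.get(key)

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


store = SortedKeyStore()
store.put("user#0042", {"city": "Zurich"})
store.put("user#0007", {"city": "Lyon"})
store.put("user#0100", {"city": "Porto"})
store.put("user#0042", {"city": "Basel"})  # random write: overwrite one row

row = store.get("user#0042")
nearby = store.scan("user#0001", "user#0050")
```

Because the keys stay sorted, a single row can be read or overwritten directly, and a range of neighbouring keys can be scanned without touching the rest of the table.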

Enhanced Understanding with Use Cases for HDFS & HBase

Use Case 1 – Cloudera optimization for European bank using HBase

HBase is ideally suited for real-time environments, and this is best demonstrated by the example of our client, a renowned European bank. To derive critical insights from application and web server logs, we implemented a solution using Apache Storm and Apache HBase together. Given the huge velocity of the data, we opted for HBase over HDFS, as HDFS does not support real-time writes. The results were striking: query time fell from 3 days to 3 minutes.

Use Case 2 – Analytics solution for global CPG player using HDFS & MapReduce

With our global beverage client, the primary objective was to perform batch analytics to gain SKU-level insights, which involved recursive and sequential calculations. The HDFS and MapReduce frameworks were better suited to this than complex Hive queries on top of HBase. MapReduce was used for data wrangling and to prepare data for subsequent analytics. Hive was used for custom analytics on top of the data processed by MapReduce. The results were impressive: the time taken to generate custom analytics dropped drastically, from 3 days to 3 hours.
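
For readers unfamiliar with the pattern, the map and reduce steps behind this kind of SKU-level batch job can be sketched in plain Python. The function names and sales records below are hypothetical. An actual Hadoop job would distribute the same phases across a cluster and read its input from HDFS.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (sku, units) pair per sales record."""
    for record in records:
        yield record["sku"], record["units"]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single total."""
    return {sku: sum(units) for sku, units in groups.items()}


# Hypothetical sales records, for illustration only.
sales = [
    {"sku": "COLA-330", "units": 12},
    {"sku": "LEMON-500", "units": 4},
    {"sku": "COLA-330", "units": 8},
]
totals = reduce_phase(shuffle(map_phase(sales)))
```

The whole input is read once and aggregated in bulk, which matches the write-once, read-many batch workloads that HDFS is designed for.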

To offer a reasonable comparison between HDFS and HBase, the following points are worth emphasizing:

HDFS | HBase
HDFS is a Java-based file system utilized for storing large data sets. | HBase is a Java-based Not-Only-SQL database.
HDFS has a rigid architecture that does not allow changes. It doesn’t facilitate dynamic storage. | HBase allows for dynamic changes and can be utilized for standalone applications.
HDFS is ideally suited for write-once and read-many times use cases. | HBase is ideally suited for random write and read of data that is stored in HDFS.
