Almost from the moment Hadoop was first introduced, organizations have sought to replace their expensive existing data warehouse solutions with it, taking advantage of Hadoop’s distributed nature and the fact that it uses commodity hardware that is cheaply and readily available. In the vast majority of cases this has been ill-advised, and the projects have ended badly. Hadoop, and in particular the distributed file system at the heart of Hadoop, HDFS, were not designed for the speed and updateability requirements of data warehouses.
The Problem of Immutability
The main problem behind replacing an existing Data Warehouse with Hadoop is a seemingly innocent concept called immutability. Quite simply, data in HDFS cannot be changed, it can only be overwritten or logically appended. This makes Change Data Capture and other data warehouse concepts difficult to implement. When Hadoop is used for archival or analytical purposes this is not usually an issue, but Data Warehouses alter existing data tables regularly, anywhere from daily to multiple times a second.
Workarounds using existing Hadoop tools were invented to address the issue, but they are cumbersome, slow, and unreliable. Many who have attempted these workarounds have concluded that the efforts required, and the shortcomings that result, are simply not worth the investment.
To make a Hadoop cluster based on HDFS and Hive function as a data warehouse, two main approaches evolved: partitioning based on time and/or the “Four-Step Method.”
Time-based partitioning separates the data and stores each month, week, or other time unit of data as a separate file. If changes are detected in a record for the month of July in the originating table, for example, the entire month is overwritten in HDFS. While somewhat effective at keeping data current, this method is needlessly cumbersome. Why overwrite an entire month or week just to update a single record? This is a waste of resources.
The Four-Step Method takes a different, much more complex approach. New data from the originating system is used to create a temporary table, which is then joined to the existing table in a view, which is then used to overwrite the existing data in HDFS at the end of the current time period. Confused? You’re not alone.
Cloudera recognized the issues that the immutability of data in HDFS caused for their customers, and set to work to develop an alternative that would store data in such a way that would allow updates and changes in an expedient fashion. The result is Kudu, an efficient, updateable file store that acts much like a relational database in everyday use.
Commands like INSERT, UPDATE, and DELETE that are unavailable in HDFS and Hive (or didn’t work like their counterparts in relational databases), now function as expected and scale to previously unimagined sizes of data. Kudu even ups the ease of use by including support for an UPSERT command that automatically detects whether a record should be updated or inserted based on the primary key. Which brings us to the next point…
Keys to the Kingdom
Another notable difference between Hadoop and relational databases has been the lack of keys in Hive. Keys are…well, key…to the success of Change Data Capture (CDC), the process by which data in the Data Warehouse is continually updated. Kudu’s support of and requirement for a primary key in each table mean that there is now a mechanism in Hadoop for preventing duplicate data during updates. This was a glaring omission before.
It can’t be all sunshine and roses, can it? Since Kudu is a relatively new technology there are limitations that must be considered when using it as a data warehouse, some of which include the following:
While Kudu requires primary keys it doesn’t support foreign keys, so linking between tables in the way a relational database would is not yet possible.
As of version 1.3, Kudu doesn’t support creating tables from files or the LOAD DATA INFILE command. The fastest way to bulk load data is to bring them into HDFS, create a Hive table over them, and then create the Kudu table through a CREATE AS SELECT statement. Another alternative is to interface directly with the API, bypassing ODBC or JDBC drivers. The API works with C++, Java, and now even includes bindings for Python.
The number of supported data types for columns in Kudu is relatively small and does not include either DATE or TIMESTAMP. These must be stored as strings or integers.
Now, Why Would I Do This?
Now that immutability, updates, and speed aren’t the limiting issues they previously were, the question remains: why use Hadoop as a Data Warehouse? This usually boils down to one of two things: price and/or scale. Our clients cite excessive licensing costs for traditional data warehouses and limitations of the hardware used in those solutions, along with the seemingly limitless scaling capability of Hadoop, as reasons for wanting a Hadoop data warehouse. Compared to other popular solutions Hadoop scales past petabytes of data and does so inexpensively in terms of both software and hardware.