Big Data is in and of itself a new phenomenon, and keeping track of all the emergent technologies in the field can be dizzying. With that in mind, here are a few technologies to keep an eye on as they mature and evolve throughout 2013!
1. Cloudera Impala
While Impala isn’t the newest project on this list, it’s definitely one to watch in 2013. Cloudera’s approach to real-time data processing on Hadoop is inspired: Hadoop’s disk-based storage isn’t normally known for real-time ad-hoc query capability, but Impala builds on some great open-source technologies to deliver it. It uses the same metadata and syntax as Apache Hive, but it generates significantly less CPU load and makes better use of hardware resources than Hive generally does. Depending on configuration, Impala can be many times faster than Hive for the same type of query. It’s not a replacement for data warehousing, but it serves as a complement and can be used side by side with MapReduce and Hive.
|Source: Cloudera Impala|
2. Trevni
Trevni serves as a complement to Impala, and it’s extremely promising: it’s one of the newer projects out there, but it’s already causing a buzz in the big data arena. Trevni is a columnar binary storage format intended for use with Cloudera Impala, and it has quite lofty ambitions: Cloudera’s Impala team hopes that, once finished and properly implemented, Trevni could achieve speeds equal to those outlined in Google’s Dremel paper while actually exceeding the SQL functionality described there. Trevni’s pairing of rich SQL functionality with Dremel-like speeds definitely makes it, along with Impala, a technology to watch in 2013!
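To see why a columnar layout helps analytic queries, here is a minimal plain-Python sketch. It is only an illustration of the general idea, not Trevni's actual file format or API; the sample records and the helper names (`to_columns`, `sum_row_store`, `sum_column_store`) are made up for this example.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# All names and data are illustrative; this is not Trevni's format or API.

rows = [
    {"user": "ana", "country": "BR", "bytes": 120},
    {"user": "bo", "country": "US", "bytes": 300},
    {"user": "cy", "country": "US", "bytes": 80},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(records, field):
    """Row layout: every whole record is visited just to read one field."""
    return sum(record[field] for record in records)


def sum_column_store(columns, field):
    """Column layout: only the single column the query needs is touched."""
    return sum(columns[field])


columns = to_columns(rows)
total_from_rows = sum_row_store(rows, "bytes")
total_from_columns = sum_column_store(columns, "bytes")
```

Both functions return the same total, but the columnar version never looks at `user` or `country`. On disk, that difference means an aggregate over one column can skip most of the file, which is the broad intuition behind columnar formats.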
3. Spark
True to its name, Spark is a cluster computing system that sets out to make data analytics as fast as possible, both to write and to run. Spark provides primitives for in-memory cluster computing: your job can load the required data into memory and query it repeatedly, much faster than disk-based systems like Hadoop MapReduce. Part of Spark’s appeal also lies in its clean APIs in Scala, Python, and Java; you can also use it to query big datasets interactively from the Scala and Python shells. Spark’s flexibility and speed definitely make it an open-source technology to watch in 2013!
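The load-once, query-many-times pattern can be sketched in plain Python on a single machine. This is deliberately not Spark's API: `CachedDataset` is a hypothetical helper invented for this example, and the log file is a throwaway temp file. It only shows why keeping a dataset in memory pays off when you run several queries over it.

```python
import os
import tempfile


class CachedDataset:
    """Hypothetical helper: reads a file once, then answers queries from memory."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0   # how many times the file was actually read
        self._lines = None    # in-memory copy, filled on first use

    def _load(self):
        # Touch the disk only if the data is not already cached in memory.
        if self._lines is None:
            with open(self.path) as handle:
                self._lines = [line.rstrip("\n") for line in handle]
            self.disk_reads += 1
        return self._lines

    def count_matching(self, word):
        """Count the lines that contain the given word."""
        return sum(1 for line in self._load() if word in line)


# Create a small sample log file (illustrative data only).
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("ERROR disk full\nINFO started\nERROR timeout\n")
    log_path = tmp.name

logs = CachedDataset(log_path)
errors = logs.count_matching("ERROR")  # first query reads the file
infos = logs.count_matching("INFO")    # second query is served from memory
os.remove(log_path)
```

After both queries, `logs.disk_reads` is still 1. A system that rereads its input from disk for every job would pay that cost each time, which is the gap Spark's in-memory primitives aim to close across a whole cluster.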
4. Apache HCatalog
HCatalog is something that Big Data administrators and developers have pined for: a complete table and storage management service for data created with Hadoop. It’s possibly one of the least talked-about technologies on this list, no doubt due to its infancy; nevertheless, it’s most definitely worth watching. As a metadata management layer that keeps to its open-source philosophy and works across all of your data, it’s been hailed as a godsend by some beleaguered Big Data proponents. Metadata management is a specific problem that nonetheless needs a strong solution, and HCatalog is shaping up to be that solution for data stored in HDFS.
|Source: Apache HCatalog|
5. Data-Driven Documents
|Source: Data-Driven Documents|
2013 is an exciting time for big data, especially considering the shift to real-time data processing in combination with the considerable data warehousing abilities of current Hadoop deployments. Make sure to look at these open-source technologies in 2013. Who knows, they may even be useful for your enterprise deployments in the next year!