It’s safe to say that ‘big data’ is the buzzword du jour, and that is unlikely to change. Many of the major players in the tech industry are leveraging big data to spectacular and varied effect, and though the term may be overused, it describes one of the most important developments in the industry. What’s more, the contemporaneous rise of open source software means that many of the most exciting big data technologies are open source, with strong communities developing around them. In this article, we’ll run through a few that you should be aware of, what they do, and what separates them from the pack.
You can’t really talk about open source and big data without mentioning Hadoop. In the world of big data it is a real giant, and it is easily the most famous and widely used platform on this list. Named after a toy elephant belonging to its creator’s son, Hadoop is an open source framework for large-scale, distributed data processing. It’s flexible in that it can work with multiple sources of data, and it is the basis for many other open source big data technologies.
Cascading is an open source abstraction layer for Hadoop. It lets users define data-processing workflows and run them on Hadoop clusters using JVM-based languages. Applications include machine learning, web content mining, and ad targeting, among many others. A key advantage is that it hides the complexity of MapReduce.
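To give a sense of what is being hidden, here is a minimal sketch of the map, shuffle, and reduce steps for a simple word count. It is written in plain Python and runs on a single machine, so it is an illustration of the idea only, not Hadoop or Cascading code. The function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every word in one line of input.
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    # Group emitted values by key, as a framework would do
    # between the map and reduce steps.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Collapse all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    # Wire the three steps together (single process, illustration only).
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "open source data tools"])
print(counts)
```

Even for a task this small, the logic has to be split into separate steps with a grouping stage in between. An abstraction layer lets you describe the workflow as a whole and leaves that plumbing to the framework.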
R’s minimalist name belies its immense power. A statistical programming language, R is fast becoming a standard tool for statistical analysis. It has a large and extremely active community, which produces new and innovative approaches on an almost daily basis. As such, it has a lot of momentum behind it, and it’s something you really need to be aware of if you work in this area.
When a company like Facebook backs something, it’s usually time to take notice. Scribe was developed by Facebook as a way of aggregating log data from a large number of servers. Released as open source in 2008, Scribe is now used by Facebook to manage the tens of billions of messages generated by its near-ubiquitous social network.
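The basic idea of log aggregation can be sketched in a few lines of Python. This is a toy, in-memory illustration and not Scribe’s actual API: the aggregate_logs function, the server names, and the pairing of each message with a category are all assumptions made for the example.

```python
from collections import defaultdict


def aggregate_logs(server_batches):
    # Merge (category, message) entries from many servers into one
    # central store keyed by category. Hypothetical helper, not Scribe code.
    central = defaultdict(list)
    for server, entries in server_batches.items():
        for category, message in entries:
            central[category].append(f"{server}: {message}")
    return dict(central)


# Made-up sample data standing in for logs from two servers.
batches = {
    "web-01": [("access", "GET /home 200"), ("error", "timeout talking to db")],
    "web-02": [("access", "GET /login 200")],
}
store = aggregate_logs(batches)
print(store["access"])
```

A real aggregator also has to cope with network failures and sheer volume, which is where a dedicated tool earns its keep.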
D3 is a JavaScript visualisation library, which many consider to be a revolutionary step forward in how we present and interact with data. It runs in modern web browsers, and allows data to be presented through HTML, SVG, and CSS as tables, charts, dashboards, and other forms, whilst updating the page to reflect and integrate new data. D3 is quickly becoming an industry standard in data visualisation.
ElasticSearch is an open source search server that has been adopted by organisations like StumbleUpon and Mozilla. It provides near-real-time search, and it is this feature that has been largely responsible for its success. While platforms like Solr are fast under certain circumstances, ElasticSearch is also highly scalable and easy to integrate with other systems.
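Search servers of this kind are generally built around an inverted index, which maps each term to the documents that contain it. The sketch below is a toy, in-memory version in Python, intended only to show why a newly added document can be found almost immediately. The TinyIndex class and its methods are hypothetical and are not part of ElasticSearch.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each term to the set of document ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Index every whitespace-separated term; the document is
        # searchable as soon as this method returns.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # Return ids of documents that contain every term in the query.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = TinyIndex()
index.add(1, "open source search server")
index.add(2, "open source log aggregation")
print(index.search("open source"))
```

A production system layers analysis, relevance ranking, and distribution across many machines on top of this basic structure.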