Today, the quantity of data produced by companies as a whole is increasing at a staggering rate. In 2008, 0.5 zettabytes of data were produced, which represents 500 million terabytes. In 2011, there were 2.5 zettabytes, and Cisco estimates that in 2020 there will be more than 35 zettabytes of data. Moreover, this explosion of data is accompanied by a diversification of data types, in particular unstructured data such as text files, music, video, etc.
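As a quick check on the figures above, the unit conversion and the growth rate they imply can be sketched in a few lines of Python. This is only an illustration: it assumes decimal (SI) units, where 1 zettabyte is 10^21 bytes and 1 terabyte is 10^12 bytes, and the function names are made up for this example.

```python
# Assumption: decimal (SI) units, 1 ZB = 10**21 bytes and 1 TB = 10**12 bytes.
ZB = 10**21
TB = 10**12


def zettabytes_to_terabytes(zb):
    """Convert a quantity in zettabytes to terabytes (hypothetical helper)."""
    return zb * ZB / TB


def annual_growth_rate(start, end, years):
    """Compound annual growth rate implied by two data points (hypothetical helper)."""
    return (end / start) ** (1 / years) - 1


# 0.5 ZB expressed in terabytes.
print(zettabytes_to_terabytes(0.5))

# Growth rate implied by going from 0.5 ZB (2008) to 2.5 ZB (2011).
print(annual_growth_rate(0.5, 2.5, 2011 - 2008))
```

Under these assumptions, 0.5 zettabytes does correspond to 500 million terabytes, and going from 0.5 to 2.5 zettabytes in three years implies a compound growth of roughly 71% per year.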
All this data has to be stored and managed, but above all it needs specific infrastructure and software to be processed effectively. The final goal is to produce a useful result that the company can use to guide its business.
Big data touches every sector: marketing, to gain a better knowledge of the customer; the media and advertisers, to recommend the best content to the right user; scientific research, to find a correlation between millions of treatments; etc. All sectors are confronted with largely the same 5 challenges. According to a presentation by Infochimps: 80% had a problem finding talent, 76% had difficulty finding the right tools for their business, 75% did not have enough time, 73% had trouble understanding the different platforms and 72% needed to be more educated on the topic.
Every Big Data solution is composed of 3 main parts: the infrastructure, the middleware that enables us to manage and process all the data in a distributed manner, and the analytic software that runs on top of the middleware. The analytic software runs algorithms such as machine learning and data mining to produce a result that is useful for the business.
There are several software solutions for a big data environment. Different kinds of technologies exist: NoSQL, MPP and Hadoop. There are several Hadoop distributions that come fully packaged, much like a Linux distribution. The best-known ones are MapR, Cloudera and Hortonworks. These distributions are preconfigured and the installation is more or less automated.
The real challenge
As we can see in Figure 1, the growth of data is not linear but exponential, and since 2000 we have seen an explosion of unstructured data coming, for example, from social networks. Unstructured data can be text, video files, voice, etc.
The real issue is that if we want to keep the same performance as the data grows, the infrastructure needs to grow disproportionately faster. The relation between the size of the data and the infrastructure needed is exponential. This introduces a large cost, which is why different architectures have been developed so that performance scales linearly with the size of the infrastructure.
As we can see in Figure 2, the relation between the size of the data and performance is not linear for a relational database. One of the main challenges of big data is to make this curve more linear, in order to reduce the cost of the infrastructure and to obtain better performance.
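To make the difference between the two scaling behaviours concrete, here is a toy model in Python. It is not based on measurements: the parameters (10 TB handled per node in the linear case, a requirement that doubles every 20 TB in the exponential case) are invented purely for illustration, as are the function names.

```python
import math


def linear_nodes(data_tb, tb_per_node=10.0):
    """Nodes needed when infrastructure scales linearly with data size.

    tb_per_node is a made-up illustrative parameter.
    """
    return math.ceil(data_tb / tb_per_node)


def exponential_nodes(data_tb, base_nodes=1.0, doubling_tb=20.0):
    """Nodes needed when the requirement doubles every `doubling_tb` terabytes.

    base_nodes and doubling_tb are made-up illustrative parameters.
    """
    return math.ceil(base_nodes * 2 ** (data_tb / doubling_tb))


# Compare both models as the data size doubles.
for data_tb in (20, 40, 80, 160):
    print(data_tb, linear_nodes(data_tb), exponential_nodes(data_tb))
```

With these invented parameters, 160 TB requires 16 nodes under the linear model but 256 under the exponential one, which illustrates why architectures with linear scaling are so much cheaper as the data grows.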
If you want more information on the different technologies, have a look at the following articles: