Is the Golden Era of Hadoop Passing You By?
Four years ago, a developer and I were doing a proof of concept (POC) for a pharmaceutical client to see if we could run a complicated analytic on Hadoop. Under the client's existing paradigm, the analytic took no less than two weeks to complete from start to finish. We ran the POC on AWS using EMR (Amazon's managed Hadoop service). The files we were pulling were large claims files, 4-10GB each. We tried using Amazon's large data transfer tool. Long story short, we kept getting errors and called Amazon up; tech support told us the tool could not handle files larger than 1GB. Maybe they had a different idea of what "big data" meant at the time, but outside of pulling incremental log files on a repeating schedule, few people had ever used the tool.
At that time, the Hadoop installation process was difficult at best, security was a nice-to-have, you could expect to hit at least one bug a day, and finding a competent Hadoop administrator was like finding a needle in a haystack. Fast forward to today and solid Hadoop platform software is available and geared toward everyday enterprise use. We now have security, concurrency, and big-data-as-a-service platforms that can be stood up in 30 minutes.
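To make that "30 minutes" point concrete, here is a minimal sketch of standing up a managed Hadoop cluster with the AWS SDK for Python (boto3). The cluster name, region, release label, and instance sizing are illustrative assumptions, not values from the original engagement:

```python
import boto3

# Assumed region and sizing -- adjust for your own workload.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="poc-cluster",                     # hypothetical cluster name
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",      # default EMR roles must exist
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])
```

A few minutes after this call returns, the cluster is bootstrapping on its own; compare that to the multi-week, hand-tuned installs of four years ago.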
More importantly, however, we now know the most efficient way to use the technology. Instead of persisting all the data on the cluster, we persist it on cheap object storage (Amazon S3 or Azure Data Lake Store) and use the cluster only for interim data processing, as sketched below. This approach lets organizations realize the savings that were originally promised: cheap storage finally meets distributed processing power. Hadoop is great at source data provisioning and batch integration, and for routine user access we can employ satellite marts built on traditional relational technologies (SQL Server, Redshift, etc.). The kinks in ancillary tools that make use of the parallel processing, such as SAS and RStudio, have been worked out.
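Here is a minimal PySpark sketch of that storage/compute split, assuming a Spark-on-Hadoop cluster with S3 access; the bucket, paths, and the member_id aggregate are hypothetical placeholders, not the client's actual analytic:

```python
from pyspark.sql import SparkSession

# Hypothetical bucket and paths -- substitute your own.
RAW = "s3://my-claims-bucket/raw/claims/"        # persisted source files
OUT = "s3://my-claims-bucket/curated/claims/"    # results land back on S3

spark = SparkSession.builder.appName("claims-batch").getOrCreate()

# Read the large claims files straight off S3; the cluster holds only
# the in-flight working set, never the system of record.
claims = spark.read.option("header", "true").csv(RAW)

# Interim processing on the cluster (an illustrative aggregate).
summary = (claims.groupBy("member_id")
                 .count()
                 .withColumnRenamed("count", "claim_count"))

# Write results back to cheap storage; the cluster can then be torn down.
summary.write.mode("overwrite").parquet(OUT)

spark.stop()
```

Because nothing durable lives on the cluster, it can be terminated the moment the batch finishes, which is where the promised savings actually show up.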
All this is to say that Hadoop is ready and capable of providing low-cost, large-scale data storage and processing platforms that play nicely with end-user tools and analytics. The lifespan of any technology is getting shorter, and the trick is knowing when to dive in and when to get out. Now is the time to get in, and no doubt in four to five years we will be going through the cycle again with something new. This truly is the Golden Era of Hadoop, and it will be over as quickly as it began. Don't let it pass you by.
…and as for the POC analytic, we got it down to 20 minutes.