The Democratization of Data Science with the Arrival of Apache Spark

As an emerging field, data science has seen rapid growth over the span of just a few short years. With Harvard Business Review referring to the data scientist role as the “sexiest job of the 21st century” in 2012 and job postings for the role growing 57 percent in the first quarter of 2015, enterprises are increasingly seeking out talent to help bolster their organizations’ understanding of their most valuable assets: their data.

The growing demand for data scientists reflects a larger business trend – a shifting emphasis from the zeros and ones to the people who help manage the mounds of data on a daily basis. Enterprises are sitting on a wealth of information but are struggling to derive actionable insights from it, in part due to its sheer volume but also because they don’t have the right talent on board to help.

The problem enterprises now face isn’t capturing data; it’s finding and retaining top talent to make sense of that data in meaningful ways. Luckily, there’s a new technology on the horizon that can help democratize data science and widen access to the insights it unearths.

Data Science Scarcity & Competition

The talent pool for data scientists is notoriously scarce. According to McKinsey & Company, by 2018, the United States alone may face a 50 to 60 percent gap between supply and demand for “deep analytic talent, i.e., people with advanced training in statistics or machine learning.” Data scientists possess an essential blend of business acumen, statistical knowledge and technological prowess, rendering them as difficult to train as they are invaluable to the modern enterprise.

Moreover, banks and insurance companies face an added struggle in hiring top analytics talent, with the allure of Silicon Valley beckoning top performers away from organizations perceived as less inclined to innovate. This perception issue hinders banks’ and insurance companies’ ability to remain competitive in hiring and retaining data scientists.

As automation and machine learning grow increasingly sophisticated, however, there’s an opportunity for banks and insurance companies to harness the power of data science without hiring formally trained data scientists. One technology that embodies these advances in automation is Apache Spark, which is poised to shift the paradigm of data science, allowing more and more enterprises to tap into insights culled from their own data.

Spark Disrupts & Democratizes Data Science

Data science requires three pillars of knowledge: statistical analysis, business intelligence and technological expertise. Spark does the technological heavy lifting by understanding and processing data at a scale that most people aren’t comfortable working with. It handles the distribution and categorization of the data, removing the burden from individuals and automating the process. By allowing enterprises to load data into clusters and query it on an ongoing basis, the platform is particularly adept at machine learning and automation – a crucial component in any system intended to analyze mass quantities of data.
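To make the idea of "handling the distribution" a little more concrete, here is a toy, single-machine sketch in plain Python of the split-then-combine pattern that a platform like Spark automates across a cluster. It is not Spark's API: the function names, the sample transactions and the use of threads as stand-ins for cluster nodes are all assumptions made purely for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split records into roughly equal chunks (hypothetical stand-in for
    spreading data across the nodes of a cluster)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def partial_totals(chunk):
    """Work done independently on one chunk: total amount per category."""
    totals = Counter()
    for category, amount in chunk:
        totals[category] += amount
    return totals


def total_by_category(records, num_partitions=4):
    """Compute per-category totals by combining the per-chunk results."""
    chunks = partition(records, num_partitions)
    # Threads play the role of worker nodes in this illustration only.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_totals, chunks))
    combined = Counter()
    for part in partials:
        combined.update(part)  # adds the partial totals together
    return dict(combined)


# Illustrative, made-up card transactions: (category, amount in dollars).
transactions = [
    ("groceries", 40), ("travel", 300), ("groceries", 25),
    ("fuel", 60), ("travel", 120), ("fuel", 35),
]
result = total_by_category(transactions, num_partitions=3)
print(result)
```

The analyst only writes the per-chunk logic and the question being asked; the splitting, scheduling and merging are the parts a distributed engine takes off their hands.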

Spark was created in the labs of UC Berkeley and has quickly taken the analytics world by storm, with two main business propositions: the freedom to model data without hiring data scientists, and the power to leverage analytics models that are already built and ready for use in Spark today. Together, these two attributes allow enterprises to move faster on analytics endeavors with a modern, open-source technology.

The arrival of Spark signifies a world of possibility for companies that are hungry for the business value data science can provide but are finding it difficult to hire and keep deep analytic talent on board. The applications of Spark are seemingly endless, from cybersecurity and fraud detection to genomics modeling and actuarial analytics.

What Spark Means for Enterprises

Not only will Spark enable businesses to hire non-traditional data scientists, such as actuaries, to effectively perform the role, but it will also open a world of possibilities in terms of actual business strategy.

Banks, for example, have been clamoring for Spark from the get-go, in part because of Spark’s promise to help banks bring credit card authorizations back in-house. For over two decades, credit card authorizations have been outsourced, since it was more efficient and far less dicey to centralize the authorization process.

The incentive to bring this business back in-house is huge, however, with estimated cost savings of tens to hundreds of millions annually. With Spark, the authorization process could be automated in-house – a huge financial boon to banks. The adoption of Spark allows enterprises to effectively leverage data science and evolve their business strategies accordingly.

The Adoption of Spark & Hadoop

Moreover, Spark works seamlessly with the Hadoop distributions sitting on EMC’s storage platforms. As I noted in my last post, Hadoop adoption among enterprises has been incredible, and Hadoop is quickly becoming the de facto standard for storing and processing terabytes or even petabytes of data.

By leveraging Spark and existing Hadoop platforms in tandem, enterprises are well-prepared to solve the ever-increasing data and analytics challenges ahead.

About the Author: Keith Manthey

Keith is the CTO with a passion for High Performance Computing, Financial Services, and Analytics for Dell EMC. He brings more than 24 years of Identity Fraud Analytics, high performance computing, and Financial Systems experience. Keith holds numerous patents in high performance computing and analytics and is an advisory board member of the University of Georgia’s Management of Information Systems School.