Is Apache Spark today’s Hadoop?

Exclusive interview with Romeo Kienzler, Chief Data Scientist of the IBM Watson IoT worldwide team

With businesses generating data at an enormous rate today, many Big Data processing alternatives such as Apache Hadoop, Spark, Flink, and more have emerged in the last few years. Among them, Apache Spark has gained a lot of popularity of late, as it offers ease of use and sophisticated analytics, and helps you process data with speed and efficiency.

Romeo Kienzler, Chief Data Scientist in the IBM Watson IoT worldwide team, has been helping clients all over the world find insights from their IoT data using Apache Spark. An Associate Professor for Artificial Intelligence at Swiss University of Applied Sciences, Berne, he is also a member of the IBM Technical Expert Council and the IBM Academy of Technology, IBM’s leading brains trust.

In this interview, Romeo talks about his new book on Apache Spark and Spark’s evolution from just a data processing framework into a solid, all-encompassing platform for real-time processing, streaming analytics and distributed Machine Learning.

Key Takeaways

  • Apache Spark has evolved to become a full-fledged platform for batch processing and real-time stream processing.
  • Its in-memory computing capabilities allow for efficient streaming analytics, graph processing, and machine learning.
  • It gives you the ability to work with your data at scale, without worrying whether it is structured or unstructured.
  • Popular frameworks like H2O and DeepLearning4J are using Apache Spark as their preferred platform for distributed AI, Machine Learning, and Deep Learning.

Full-length Interview

As a data scientist and an assistant professor, you must have used many tools both for your work and for research. What are some key criteria one must evaluate while choosing a big data analytics solution? What are your go-to tools and where does Spark rank among them?

  1. Scalability. Make sure you can use a cluster to accelerate execution of your processes.
  2. TCO (total cost of ownership). How much do I have to pay for licensing and deployment? Consider using open source (but keep maintenance in mind). Also, consider the cloud.

I’ve shifted completely away from non-scalable environments like R and Python pandas. I’ve also shifted away from Scala for prototyping. I’m using Scala only for mission-critical applications which have to be maintained for the long term. Otherwise, I’m using Python. I’m trying to stay completely on Apache Spark for everything I’m doing, which is feasible since Spark supports:

  • SQL
  • Machine Learning
  • Deep Learning

The advantage is that everything I’m doing is scalable by definition, and once I need to, I can scale without changing code.

What does the road to mastering Apache Spark look like? What are some things that users may not have known about Apache Spark? Can readers look forward to learning about some of them in your new book: Mastering Apache Spark, second edition?

Scaling on very large clusters is still tricky with Apache Spark because at a certain point scale-out is not linear anymore. So, a lot of tweaking of the various knobs is necessary. Also, the Spark API is slightly more tedious than that of R or Python pandas, so it takes some energy to really stick with it and not go back to “the good old RStudio”.
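The non-linear scale-out described above can be illustrated with Amdahl's law, a textbook model that is not part of the interview. The sketch below assumes a hypothetical job in which 95% of the work parallelizes perfectly; that fraction and the node counts are made-up illustration values, and real Spark jobs are also affected by shuffles, data skew, and scheduling overhead that this model ignores.

```python
def amdahl_speedup(parallel_fraction: float, nodes: int) -> float:
    """Ideal speedup on `nodes` workers when only `parallel_fraction` of the job scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


# Hypothetical job: 95% of the runtime parallelizes, 5% stays serial.
PARALLEL_FRACTION = 0.95

for nodes in (1, 10, 100, 1000):
    speedup = amdahl_speedup(PARALLEL_FRACTION, nodes)
    efficiency = speedup / nodes  # share of the ideal linear speedup actually achieved
    print(f"{nodes:>5} nodes: speedup {speedup:6.2f}x, efficiency {efficiency:6.1%}")
```

Under this model the speedup can never exceed 1 / 0.05 = 20x, however many nodes are added, which is one way to picture why adding hardware eventually stops paying off.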

Next, I think the strategic shift from RDDs to DataFrames and Datasets was a disruptive but necessary step. In the book, I try to justify this step and first explain how the new API and the two related projects, Tungsten and Catalyst, work. Then I show how things like machine learning, streaming, and graph processing are done in the traditional, RDD-based way as well as in the new DataFrames- and Datasets-based way.

What are the top 3 data analysis challenges that never seem to go away even as time and technology keep changing? How does Spark help alleviate them?

  1. Data quality. Data is often noisy and in bad formats. I spend the majority of my time improving it through various methodologies. Apache Spark helps me to scale. SparkSQL and SparkML pipelines introduce a standardized framework for doing so.
  2. Unstructured data preparation. A lot of data is unstructured in the form of text. Apache Spark allows me to pre-process vast amounts of text and create tiny mathematical representations out of it for downstream analysis.
  3. Instability of technology. Every six months there is a new hype which seems to make everything you’ve learned redundant. For example, there exist various scripting languages for big data. SparkSQL ensures that I can use my already acquired SQL skills now and in the future.
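The “tiny mathematical representations” of text mentioned in point 2 can be pictured with the hashing trick, the general idea behind feature hashers such as Spark ML's `HashingTF`. The sketch below is plain Python rather than Spark code: the tokenizer, the choice of CRC32 as the hash function, and the 16-bucket vector size are assumptions made for illustration, and they do not reproduce Spark's own implementation.

```python
import re
import zlib
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hashed_term_frequencies(text: str, num_features: int = 16) -> List[int]:
    """Map a document of any length to a fixed-size vector of token counts."""
    vector = [0] * num_features
    for token in tokenize(text):
        # CRC32 is stable across runs, unlike Python's built-in hash() for strings.
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        vector[bucket] += 1
    return vector


doc = "Spark makes big data simple, and Spark scales."
vector = hashed_term_frequencies(doc)
print(vector)       # 16 integers, regardless of document length
print(sum(vector))  # equals the number of tokens in the document
```

Distinct words can land in the same bucket, so the vector is a lossy summary; a larger `num_features` reduces such collisions at the cost of a bigger vector.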

How is the latest Apache Spark 2.2.0 a significant improvement over the previous version?

The most significant change, in my opinion, was labeling Structured Streaming as GA (generally available) rather than experimental. Otherwise, there have been “only” minor improvements, mainly in performance: 72 to be precise, all documented in JIRA since Spark is an Apache project. The most significant improvement between versions 1.6 and 2.0 was whole-stage code generation in Tungsten, which is also covered in this book.
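Whole-stage code generation is easier to picture with a toy example. The sketch below is plain Python, not Spark or Tungsten code: it contrasts a chain of per-row operators with a single fused function that is generated as source text and compiled at run time. The query (keep `x > 2`, compute `x * 10`, then sum), the function names, and the use of `exec` are all assumptions chosen for illustration; Spark itself generates Java code for the JVM.

```python
from typing import Callable, Iterable, Iterator


def filter_op(rows: Iterable[int], predicate: Callable[[int], bool]) -> Iterator[int]:
    """Operator-at-a-time style: one generator per operator, one call per row."""
    for row in rows:
        if predicate(row):
            yield row


def map_op(rows: Iterable[int], func: Callable[[int], int]) -> Iterator[int]:
    """Apply `func` to every row coming from the upstream operator."""
    for row in rows:
        yield func(row)


def run_operator_chain(rows: Iterable[int]) -> int:
    """Evaluate sum(x * 10 for x > 2) by chaining separate operators."""
    return sum(map_op(filter_op(rows, lambda x: x > 2), lambda x: x * 10))


def generate_fused_function(predicate_src: str, projection_src: str) -> Callable[[Iterable[int]], int]:
    """Emit source for one tight loop that does filter, map and sum together."""
    source = (
        "def fused(rows):\n"
        "    total = 0\n"
        "    for x in rows:\n"
        f"        if {predicate_src}:\n"
        f"            total += {projection_src}\n"
        "    return total\n"
    )
    namespace: dict = {}
    # Compile the generated text; only trusted, hard-coded snippets are passed in here.
    exec(source, namespace)
    return namespace["fused"]


data = range(1, 6)  # 1, 2, 3, 4, 5
fused = generate_fused_function("x > 2", "x * 10")
print(run_operator_chain(data))  # 120
print(fused(data))               # 120
```

Both versions compute the same result; the fused version simply avoids handing each row from one operator to the next, which is the overhead that code generation is meant to remove.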

Streaming analytics has become mainstream. What role did Apache Spark play in leading this trend?  

Actually, Apache Spark takes it to the next level by introducing the concept of continuous applications. With Apache Spark, the streaming and batch APIs have been unified, so you no longer have to care what type of data you are running your queries on. You can even mix and match, for example by joining a structured stream, a relational database, a NoSQL database, and a file in HDFS within a single SQL statement. Everything is possible.

Mastering Apache Spark was first published back in 2015. Big data has greatly evolved since then. What does the second edition of Mastering Apache Spark offer readers today in this context?

Back in 2015, Apache Spark was just another framework within the Hadoop ecosystem. Now, Apache Spark has grown to be one of the largest open source projects on this planet! Apache Spark is the new big data operating system, like Hadoop was back in 2015. AI and Deep Learning are the most important trends and, as explained in this book, frameworks like H2O, DeepLearning4J and Apache SystemML are using Apache Spark as their big data operating system to scale.

I think I’ve done a very good job in taking real-life examples from my work and finding a good open data source or writing a good simulator to give hands-on experience in solving real-world problems. So in the book, you should find a recipe for all the current data science problems you find in the industry.  

2015 was also the year when Apache Spark and IBM Watson chose to join hands. As the Chief Data Scientist for IBM Watson IoT, give us a glimpse of what this partnership is set to achieve.

This partnership underpins IBM’s strong commitment to open source. Not only is IBM contributing to Apache Spark, IBM also creates new open source projects on top of it. The most prominent example is Apache SystemML, which is also covered in this book. The next three years are dedicated to deep learning and AI, and IBM’s open source contributions will help the Apache Spark community to succeed. Another prominent example is PowerAI, where IBM outperformed all state-of-the-art deep learning technologies for image recognition.

For someone just starting out in the field of big data and analytics, what would your advice be?  

I suggest taking a Machine Learning course from one of the leading online training vendors. Then take a Spark course (or read my book). Finally, try to do everything yourself. Participate in Kaggle competitions and try to replicate papers.

Amey Varangaonkar

Data Science Enthusiast. A massive science fiction and Manchester United fan. Loves to read, write and listen to music.
