Managed Big Data: DataBricks, Spark as a Service

The title accompanying this blog post is quite the mouth full. This blog post will explain why you should be using Spark. If a use case would make sense, then we will introduce you to the DataBricks product, which is available on Azure. Being recognised as a Leader in the Magic Quadrant, emphasizes the operational use and vision of DataBricks. 

What is Spark and what ‘need’ did it fulfil?

If you’ve been around for several years in the ‘Data and Analytics’ domain, you’ll probably remember the hype on ‘Big Data’. The most common scenario where companies switch to Big Data is if they do not have the analytical power to perform a workload on one (and only one) server. Such a workload could be ‘categorising your documents by using Natural Language Processing’, ‘creating segmentations on all website visitors’, ‘training a very accurate prediction model’, … In the most used sense of the word ‘Big Data’, processing is done by in a distributed way, i.e. by coordinating several machines at once.
Spark is an open-source analytical engine which allow technical users to setup a distributed system, thus allowing companies to tackle their Big Data projects. By default, Spark also gives you the ability to capture streaming events, provides a set of machine learning algorithms and allows for working with graph databases.

We now know Spark, what is DataBricks?

While Spark is great at what it does, it is hard to maintain and configure, hard to spin up and spin down, hard to add servers to your cluster and remove them. DataBricks addresses this problem and provides ‘Spark as a Service’ while also adding enterprise-required features. As the majority of the DataBricks product team has also created the core of Spark, they also made API and performance improvements to the analytical engine they provide you with. As such, we believe that DataBricks is the most enterprise-ready Big Data and Data Science platform.

Get started in minutes instead of days:

Setting-up a databricks custer with several nodes is easy. The wizard ‘New Cluster’ lets you pick your cluster size and what kind of compute it requires: RAM? GPU? Delta Optimized? 

The feature ‘Autoscaling’ allows DataBricks to add-to or reduce the amount of workers within your cluster based on your workload. If the queries sent to your cluster require large amounts of processing power, then other servers will be added to your cluster so you’ll be provided the results faster!

Delta: ACID Transactions on Data Lakes

An absolute game change that DataBricks has brought to the market, is Delta (Lake). Data lakes are (often) a large collection of files, files which are ‘immutable’ and thus cannot be changed. Delta, on the other hand, enforces ACID properties on your Data Lake. Adding ACID properties (being Atomicity, Consistency, Isolation and Durability) to a data lake allows for other analytical scenario’s that involve inserts, updates and deletes.

The DataBricks Notebook Experience

If you haven’t worked with analytical notebooks (Jupyter, Azure Data Studio or DataBricks), you’re missing out! Being able to write documentation and code within the same document is a big step forward. It helps you make clear to the readers why you created the queries and guiding them in their first steps in analytics. As opposed to other notebooks, DataBricks can connect to version control (Git, TFS, …) and allows you to combine both R, Python, SQL and Scala in the same notebook.

Managing your Data Engineers and Data Scientist

DataBricks allows applying security on folders and workbooks using Azure Active Directory; workbooks which contain sensitive data can only be seen by a specific security group. If you combine this with Azure Data Lake Storage Gen2, which allows applying security on data-folders, you’ll have an enterprise-ready data science environment. 

Why shouldn’t I use DataBricks?

Delta Lake will not support operational processes like any OLTP system due to its limited concurrency to a single file. Thus, we as Lytix and Cubis, often use it for Data Warehousing scenario’s which require little concurrency.
As your data is divided over several partitions across many servers and results of queries/calculations should pass through your head node, do not expect that distributed computing will provide you with a snappy behaviour! The technology is used for analytical workloads on large amounts of data and will backfire if you would use it any other way.

Our XTL Framework

In addition to all features DataBricks provides you with, Lytix/Cubis has a framework which simplifies development and implements best-practices. This framework uses Azure Data Factory as orchestrator and as a result provides you with a data layer that is easily accessible by any reporting tool (Power BI, SAP Analytics Cloud, …). As our logging components are automatically wrapped around business transformations, time spend on data engineering is significantly reduced. Using ‘Delta’, we even provide the ‘Data Warehouse on Spark/DataBricks’ solution.

If you want help/guidance on your DataBricks or Azure Platform implementation, do not hesitate to reach out.