Sparkube Tutorial
Learn how to start a multidimensional cube on Spark

What is Sparkube?

Sparkube is a visualization tool for analyzing Apache Spark datasets as multidimensional cubes. It examines your dataset, creates hierarchies and measures for each column, and launches a server to expose the cube. This OLAP cube can then be manipulated like a pivot table in Microsoft Excel, or through ActiveViam's analytical platform, ActiveUI. The Sparkube server acts as an XMLA server, converting end-user actions into Spark transformations and actions.

Sparkube allows data scientists to perform projection operations along the dimensions effortlessly and in the graphical tools they love.

Getting started!

This tutorial will teach you how to launch a Sparkube server on top of your Spark cluster.

Deployment platform: Databricks

Set-up

In your workspace, create a new library by uploading the Sparkube jar, then attach it to the cluster(s) on which you want to run Sparkube.

Update the Databricks security group in your AWS account to allow incoming Sparkube connections. You will need to specify which IP addresses are allowed to connect to Sparkube. You or your admin only need to complete this step once:

  • From your AWS console, go to the VPC dashboard and find the Databricks security group. Its name ends with a suffix similar to -worker-unmanaged, for example dbc-fb3asdddd3-worker-unmanaged.
  • Edit the security group and add an inbound TCP rule allowing traffic on port 6006 to the worker machines. The source can be the single IP address of your machine or a range.
  • Make sure your laptop and office network allow outbound TCP traffic on port 6006.
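If you prefer the command line, the same inbound rule can be added with the AWS CLI. The security group ID and source CIDR below are placeholders; substitute your own values:

```shell
# Allow inbound TCP traffic on port 6006 for the Databricks worker security group.
# sg-0123456789abcdef0 and 203.0.113.0/24 are placeholder values.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 6006 \
  --cidr 203.0.113.0/24
```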

First Cube

You can build your first cube with any dataset. In this tutorial we will use this small dataset, a record of online retail invoices that provides interesting hierarchies and measures.

In the data menu, you can upload it from your computer as a new Table:

Start by creating a dataset as usual (for instance from a CSV file):

val retailDataset = spark.read
  .format("csv")
  .option("header", "true")        // first line contains the column names
  .option("inferSchema", "true")   // infer column types so numeric columns become measures
  .load("/FileStore/tables/online_retail_dataset-92e8e.csv")

Note that when you import your dataset from a CSV file, you have to infer the schema explicitly, otherwise all your columns will be of type string and your cube won't have any measures.
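If you would rather not rely on schema inference, you can also declare the schema explicitly with Spark's StructType API. The column names and types below assume the online retail example; adjust them to your own file:

```scala
import org.apache.spark.sql.types._

// Explicit schema: numeric columns (IntegerType, DoubleType) will become measures.
val retailSchema = StructType(Seq(
  StructField("InvoiceNo", StringType),
  StructField("Description", StringType),
  StructField("Quantity", IntegerType),
  StructField("UnitPrice", DoubleType),
  StructField("Country", StringType)
))

val retailDataset = spark.read
  .format("csv")
  .option("header", "true")
  .schema(retailSchema)   // no inference pass over the file
  .load("/FileStore/tables/online_retail_dataset-92e8e.csv")
```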

Import Sparkube dependencies:

import com.activeviam.sparkube._
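Optionally, you can enrich the dataset before exposing it. For instance, assuming the Quantity and UnitPrice columns of the retail example, you could derive the amount of each invoice line (a sketch; any Spark transformation works here):

```scala
import org.apache.spark.sql.functions.col

// Derive a new numeric column; it will appear as an extra measure in the cube.
val enrichedDataset = retailDataset
  .withColumn("Amount", col("Quantity") * col("UnitPrice"))
```

You would then pass enrichedDataset to fromDataset() instead of retailDataset.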

You can do any transformation you want on your dataset. When you are ready you can expose it as a multidimensional cube:

new Sparkube()
.fromDataset(retailDataset)
.withName("Retail")
.expose()

If the launch was successful, you should get a success message with some links to connect to your new OLAP cube.
That's it! Your cube is ready, you can now use it from Excel or ActiveUI with the provided links.

Visualization in ActiveUI

Once your cube is exposed, you can access it in your browser at the link given by expose, for example http://ec2-52-215-8-241.eu-west-1.compute.amazonaws.com:6006/ui (or http://localhost:9090/ui if you're running locally). All your actions in the UI will be converted by Sparkube into Spark queries.

In the right panel you will find the Data Explorer. Every string column of your dataset has been converted into a Hierarchy:

All numerical columns have been turned into measures. Because Spark performs all the aggregations, Sparkube creates one measure for each Spark aggregation function (sum, average, min, max, variance...).

Query the cube in ActiveUI

In ActiveUI, click on "Pivot Table" to create your first view. You should see a table with a single cell: the number of rows in your dataset. "Contributors.COUNT" is the default measure; it simply counts the number of elements. You can drag and drop another measure such as "Consumption.SUM" to get the total consumption in your dataset.

You can drag-and-drop any hierarchy on "Row" or "Column". This will aggregate per member of your hierarchy. In the example dataset you can try grouping by "Regions" on rows to get the consumption for each of the 12 French regions. If you add "Field" on the columns you will get a table with the consumption per region and per field.

Visualization in Excel

In Excel you can access your cube as a Microsoft Analysis Services server. Data is not loaded into Excel; all the computation is done by Spark, and Excel is only used as a user interface.

Once exposed, your cube is available at the link given by expose, for example http://ec2-52-215-8-241.eu-west-1.compute.amazonaws.com:6006/xmla (or http://localhost:9090/xmla if you're running locally).

You can load the cube by clicking on Data > Get Data > From Database > From Analysis Services.

Enter your cube address and leave the login fields blank; no login is required.

All the cubes you have exposed will appear here. The name shown is the one you provided in the withName() method. Select the cube you want to explore and click “Finish”.

That's it, you now have your data in Spark exposed as a multidimensional pivot table in Excel.

You can read more about Analysis Services at support.office.com.

Query the cube in Excel

In this part of the tutorial you will learn how to query the cube containing your data. All your multidimensional Excel queries will be converted by Sparkube into Spark queries.

In your pivot table you can see that every string column of your dataset has been converted into a Hierarchy.

All numerical columns have been turned into measures. Because Spark performs all the aggregations, Sparkube creates one measure for each Spark aggregation function (sum, average, min, max, variance...).

You can drag-and-drop any measure on "values". This will aggregate on all the lines of your dataset. If you use the example dataset about electricity you can use Consumption.SUM to get the total consumption of France in 2015.

You can drag-and-drop any hierarchy on "Row" or "Column". This will aggregate per member of your hierarchy. In the example dataset you can try grouping by "Regions" on rows to get the consumption for each of the 12 French regions. If you add "Field" on the columns you will get a table with the consumption per region and per field.

Any multidimensional query is possible, even complex ones. For example, you can ask for the 10 cities with the highest consumption: select a hierarchy on Rows, then click the Row Labels filter symbol > Value Filters > Top 10.

There are many possible multidimensional queries, and Sparkube lets you run them from the Excel UI with the power of a Spark cluster.

You now know all the basics of Sparkube. We invite you to try it with different and bigger datasets. The rest of this tutorial covers more advanced features and explains in more depth how Sparkube works.

Cube and server management

When you create a cube you can store it in a variable to use it later.

val cube = new Sparkube()
cube.fromDataset(df).expose()

You can use this to rename or delete your cube.

cube.rename(oldName, newName)
cube.unexpose()

You can also stop the whole server:

stopSparkubeServer()

If you want, you can start the server on a specific port. This is useful if the default port (9090) is already in use. You need to call this before exposing your cube.

Sparkube.startSparkubeServer(8080)

Any cube exposed after that (by calling the expose function) will be added to this new server.

Hierarchy Creation

Sparkube keeps a list of every distinct member of each hierarchy in order to optimize requests. This list is built at the first request on the hierarchy; we call this lazy loading of hierarchies. It can take time and require memory.
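Conceptually, building the member list of a hierarchy corresponds to a Spark query like the following (an illustrative sketch, not Sparkube's actual implementation; the Country column assumes the retail example):

```scala
// Distinct members of the "Country" hierarchy, computed on first use.
val countryMembers = retailDataset
  .select("Country")
  .distinct()
  .collect()
```

This is why the first query touching a large hierarchy can be noticeably slower than subsequent ones.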

Slicing Hierarchies

When you create your cube you can use the slicing hierarchy option. A slicing hierarchy is a hierarchy that is never aggregated across all of its members. For example, currency is usually a slicing hierarchy because you don't want to aggregate euros and dollars together; you want to filter on one of them instead.

You can define a slicing hierarchy like this:

new Sparkube().fromDataset(df)
.withName("My first dataset")
.withSlicingHierarchy("Currency")
.expose()