Sparkube Tutorial
Learn how to start a multidimensional cube on Spark

Getting started!

This tutorial will teach you how to create a Sparkube server on top of Spark. You will then be able to perform advanced data visualization by connecting it to software like Microsoft Excel. The Sparkube server will act as an XMLA server, converting end-user actions into Spark transformations and actions.

Development OS: Linux, Windows

Installation

In order to use Sparkube you only need a working Spark instance. Spark 2.0 is the minimum supported version, but we recommend always using the latest one.
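You can check which version your shell is running; in spark-shell the SparkSession is pre-created as spark:

// Returns the Spark version string, e.g. "2.4.8".
spark.version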

The Sparkube JAR provides the API to build a Sparkube server on top of a dataset. Add the JAR to your application as you usually add external JARs: for instance, copy it into your jars folder (SPARK_HOME/jars), or pass it to spark-shell or spark-submit with the --jars option:

spark-shell --jars sparkube.jar

Importing the JAR may be done differently if you are using Zeppelin or Jupyter.
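For example, in a notebook where you build the session yourself, one option is to point the standard spark.jars setting at the file. A minimal sketch, assuming a hypothetical path:

import org.apache.spark.sql.SparkSession

// spark.jars is a standard Spark setting: a comma-separated list of JARs
// shipped to the driver and executors. The path below is only an example.
val spark = SparkSession.builder()
  .appName("sparkube-tutorial")
  .config("spark.jars", "/path/to/sparkube.jar")
  .getOrCreate()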

First Cube

You can build your first cube with any dataset. In this tutorial we will use this small dataset: the French electricity consumption per city in 2015, which provides interesting hierarchies and measures.

Start by creating a dataset as usual (for instance from a CSV file):

val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("path/to/file.csv")

Note that when you import your dataset from a CSV file, you have to infer the schema explicitly, otherwise all your columns will be of string type and your cube won't have any measures.
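Alternatively, you can cast columns explicitly after loading instead of relying on inference. A minimal sketch, where "Consumption" is a column name assumed from the example dataset:

// Cast a column to a numeric type so Sparkube can derive measures from it.
// "Consumption" is an assumed column name; adapt it to your file.
val typed = df.withColumn("Consumption", df("Consumption").cast("double"))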

Import the Sparkube dependencies:

import com.activeviam.sparkube._

You can do any transformation you want on your dataset. When you are ready, you can expose it as a multidimensional cube like this:

new Sparkube().fromDataset(df)
  .withName("My first dataset")
  .expose()

That's it! Your cube is ready; you can now use it from Excel or ActiveUI.
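As mentioned above, any standard Spark transformation can be applied before calling expose(). Here is a sketch that filters out empty rows and renames a column; the column names are assumptions, not taken from the actual dataset:

// Clean the dataset before exposing it: drop rows with no consumption value
// and give the column a friendlier name. Column names are assumed.
val prepared = df
  .filter(df("Conso").isNotNull)
  .withColumnRenamed("Conso", "Consumption")

new Sparkube().fromDataset(prepared)
  .withName("Prepared dataset")
  .expose()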

Note that Sparkube will keep a list of every distinct member for each column in order to optimize requests. Building these lists can take time during cube creation and also requires memory. See Hierarchy Creation for more details.
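If a column has very many distinct values and you do not need it as a hierarchy, dropping it before exposing reduces both creation time and memory. A sketch, where "Address" stands for a hypothetical high-cardinality column:

// Dropping a high-cardinality column means one less member list to build.
// "Address" is a hypothetical column name.
val lighter = df.drop("Address")
new Sparkube().fromDataset(lighter)
  .withName("My lighter dataset")
  .expose()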

On Windows, Windows Defender can consume a lot of CPU while running Spark with Sparkube. If you want to improve performance, you can disable Windows Defender.

Visualization in ActiveUI

Once your cube is exposed you can access it in your browser at http://localhost:9090/ui (by default the server is started on port 9090). All your actions in the UI will be converted by Sparkube into Spark queries.

In the right panel you will find the Data Explorer, where every string column of your dataset has been converted into a hierarchy.

All numerical columns have been turned into measures. Because Spark does all the aggregations, Sparkube has created one measure for each Spark aggregation function (sum, average, min, max, variance...).
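For a column such as "Consumption" from the example dataset, these measures roughly correspond to the standard Spark aggregation functions. An illustrative sketch, not what Sparkube literally executes:

import org.apache.spark.sql.functions._

// One aggregation per generated measure, for a single column.
df.agg(
  sum("Consumption"),
  avg("Consumption"),
  min("Consumption"),
  max("Consumption"),
  variance("Consumption")
).show()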

Query the cube in ActiveUI

In ActiveUI click on "Pivot Table" to create your first view. You should see a table with a single cell: the number of rows in your dataset. "Contributors.COUNT" is the default measure, which simply counts the number of elements. You can drag and drop another measure such as "Consumption.SUM" to get the total consumption in your dataset.
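In Spark terms, "Contributors.COUNT" roughly corresponds to the plain row count:

// The default measure is simply the number of rows in the dataset.
df.count()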

You can drag-and-drop any hierarchy on "Row" or "Column". This will aggregate per member of your hierarchy. In the example dataset you can try grouping by "Regions" on rows to get the consumption for each of the 12 French regions. If you add "Field" on the columns you will get a table with the consumption per region and per field.
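Such a view roughly corresponds to a grouped aggregation in Spark. A sketch, with column names assumed from the example dataset:

import org.apache.spark.sql.functions._

// Rough equivalent of "Regions" on rows, "Field" on columns,
// with Consumption.SUM as the measure. Column names are assumed.
df.groupBy("Region", "Field")
  .agg(sum("Consumption"))
  .show()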

Connect Spark to Excel

In Excel you can access your cube as a Microsoft Analysis Services server. Data will not be loaded into Excel and all the computation will be done by Spark; Excel is only used as a user interface.

By default your cube is available at http://localhost:9090/xmla

You can load the cube by clicking on Data > Get Data > From Database > From Analysis Services.

Enter your cube address and leave the login blank; no login is required.

All the cubes you have exposed will appear here. The name shown is the name you provided in the withName() method. Select the one you want to explore and click on "Finish".

That's it, you now have your data in Spark exposed as a multidimensional pivot table in Excel.

You can read more about Analysis Services at support.office.com.

Query the cube in Excel

In this part of the tutorial you will learn how to query the cube containing your data. All your multidimensional Excel queries will be converted by Sparkube into Spark queries.

In your pivot table you can see that every string column of your dataset has been converted into a hierarchy.

All numerical columns have been turned into measures. Because Spark does all the aggregations, Sparkube has created one measure for each Spark aggregation function (sum, average, min, max, variance...).

You can drag-and-drop any measure on "Values". This will aggregate over all the rows of your dataset. If you use the example dataset about electricity, you can use Consumption.SUM to get the total consumption of France in 2015.

You can drag-and-drop any hierarchy on "Row" or "Column". This will aggregate per member of your hierarchy. In the example dataset you can try grouping by "Regions" on rows to get the consumption for each of the 12 French regions. If you add "Field" on the columns you will get a table with the consumption per region and per field.

Any multidimensional query is possible, even complex ones. For example you can ask for the 10 cities with the highest consumption. To do that, select a hierarchy on Rows, then click on the Row Labels filter symbol > Value Filters > Top 10.
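For the curious, this top-10 query roughly maps to an ordered, limited aggregation in Spark. A sketch, with column names assumed from the example dataset:

import org.apache.spark.sql.functions._

// Rough equivalent of "Top 10 cities by Consumption.SUM".
df.groupBy("City")
  .agg(sum("Consumption").as("totalConsumption"))
  .orderBy(desc("totalConsumption"))
  .limit(10)
  .show()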

There are lots of possible multidimensional queries, and Sparkube lets you run them through the Excel UI with the power of a Spark cluster.

You now know all the basics of Sparkube. We invite you to try it with different and bigger datasets. The rest of this tutorial explains more advanced features that will help you understand in more depth how Sparkube works.

Cube and server management

When you create a cube you can store it in a variable to use it later.

val cube = new Sparkube()
cube.fromDataset(df).expose()

You can use this to rename or delete your cube.

cube.rename(oldName, newName)  // renames the cube from oldName to newName
cube.unexpose()                // removes the cube from the server

You can also stop the whole server:

stopSparkubeServer()

You can also create the server on a specific port. This is useful if the default port (9090) is already in use. You need to call this before exposing your cube:

Sparkube.startSparkubeServer(8080)

Any cube exposed after that (by calling the expose function) will be added to this new server.

Hierarchy Creation

Sparkube keeps a list of every distinct member for each hierarchy in order to optimize requests. Creating these lists is done when calling cube.expose(). It can take time and also requires memory.
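Conceptually, building the member list of one hierarchy is similar to running a distinct query on the corresponding column. A sketch, using a column name from the example dataset:

// Roughly what collecting the members of one hierarchy amounts to.
df.select("Region").distinct().collect()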

Slicing Hierarchies

When you create your cube you can use the slicing hierarchy option. A slicing hierarchy is a hierarchy that will not be aggregated across all its members. For example, currency is usually a slicing hierarchy because you don't want to aggregate euros and dollars together, but rather filter on one of them.

You can define a slicing hierarchy like this:

new Sparkube().fromDataset(df)
  .withName("My first dataset")
  .withSlicingHierarchy("Currency")
  .expose()