Cassandra User Defined Functions and Aggregates
Overview
User Defined Functions (UDF) and Aggregates (UDA) have seen a number of improvements in Cassandra version 3.x. In particular the sand boxing of UDF code makes this functionality safer in a production environment and has led us to include Java UDF support in our Cassandra 3.x managed service offering.
UDF/UDAs allow the execution of user provided code on the server side (Coordinator Node). This code will be simple with no dependencies and only using input parameters that come from table data. These functions are sandboxed with a custom security manager that does not allow access to things like the file system or cassandra internals where there is potential for malicious use. Java and Javascript are supported out of the box. Some languages that are compatible with JSR223 are also supported but may require additional libraries to be installed (See the Read Me files in the lib/jsr223 of your Cassandra install). The custom security manager does the best sandboxing with pure Java. Also Java will perform best with the least invocation latency. So sticking to pure Java is recommended.
Use Cases
Generally the use cases are where there is advantage of running code on the cassandra nodes.
- Performance benefit from reducing network usage. This is the case where reducing the amount of data returned to the client is significant to overall performance. Which may mean performance benefits of UDFs are more easily seen on larger clusters.
- Can simplify/clean client side code.
- Can be used to provide functionality that is familiar to SQL users like group or distinct.
- Can be used to perform pre-aggregation for spark jobs.
Setting Up
There are two aspects to setting up UDF/UDA. Enabling UDF in the cassandra configuration and setting user permissions.
As UDF can be misused it is turned off by default. To enable it the following needs to be set in the cassandra.yaml:
It is recommended to only use pure Java so leave the following as false:
Instaclustr provisioned clusters will have both of these settings set as recommended.
In order to run a UDF as part of a query a role needs execute permission on the functions. e.g.
You will also need CREATE or ALTER permissions in order to add and replace functions within a keyspace. e.g.
Throughout the rest of this blog I will be using examples based around a football dataset. Here is the table creation for reference.
You can also find the CQL used at this GitHub page.
User Defined Functions
UDF’s are added to a keyspace using the CREATE FUNCTION statement. It takes the form:
Where:
- Arguments types are CQL types. Refer to the below table.
- CALLED|RETURNS NULL defines the behaviour when a null value is encountered. RETURNS NULL simply returns NULL. CALLED will call the function with a null value. Meaning that in this case your code needs to handle nulls appropriately.
- Language is the name of the language used for the body of the function e.g. Java
- Body will consist of the custom code for the function
- Return type must be valid cql type refer to below table.
CQL | JAVA |
boolean | java.lang.Boolean |
int | java.lang.Integer |
bigint | java.lang.Long |
float | java.lang.Float |
double | java.lang.Double |
inet | java.net.InetAddress |
text | java.lang.String |
ascii | java.lang.String |
timestamp | java.util.Date |
uuid | java.util.UUID |
timeuuid | java.util.UUID |
varint | java.math.BigInteger |
decimal | java.math.BigDecimal |
blob | java.nio.ByteBuffer |
list<E> | java.util.List<E> where E is also a type from this list |
set<E> | java.util.Set<E> where E is also a type from this list |
map<K,V> | java.util.Map<K,V> where K and V is also a types from this list |
(user type) | com.datastax.driver.core.UDTValue |
(tuple type) | com.datastax.driver.core.TupleValue |
Example
Here we create a simple function that will give us the margin of a football game. Notice the Math and Integer libraries are referenced directly.
Aggregates
Aggregates provide a combined result based on all the rows matching the query. Cassandra already has a number of built in aggregates which are in the system keyspace these are:
- Count
- Min
- Max
- Avg
- Sum
These of course can be combined with functions for practical benefit.
Examples
We can calculate the number of goals scored for a team:
We can use the max and margin function to find greatest winning margin for the season.
User Defined Aggregates
You can also create custom aggregates. These will utilise a user defined state function and an optional final function. You can add them to your keyspace with create statements that have the following syntax:
SFUNC The state function that is called once for every row returned. The return value of the state function becomes the state parameter for the next call.
STYPE The type of the state parameter, which must be valid CQL type.
FINALFUNC an optional function called once after the state function has been called for every row. The input is the return of the last state function call.
INITCOND sets the initial value for the state passed to the first state function call. The default value is null.
Examples
In this example we create an aggregate for calculating the total goals scored per team. This does not require a final function. It simply returns the map from the last call to the state function.
Here you can see the UDA being used in a query. We get back single row with a map showing the teams and their total goals scored across the season.
In the following example we create mode functionality to find the most common scores. Here a final function is required where we traverse the map to find the mode result. It is worth noting that care should be taken when using logic with loops. You can imagine the cost of expensive calculations on large partitions.
Schema
If you require details on the functions and aggregates in your schema, you can query the system schema tables as follows:
Conclusions
So UDF and UDAs are a useful addition in your CQL tool belt. They enable you to enrich the out of the box functionality of Cassandra. However as they can be misused, appropriate care must be taken when using them. They are not a substitute for well designed data models. Consideration should be given to the overhead on the coordinator node from their use. They are best used in queries that hit small partitions. For large or multiple partitions we would stick to using Spark. Expect more to come on UDFs as further features are planned that build on this functionality such as functional indexes.
===========================================================
No comments:
Post a Comment