Divolte

Divolte Collector is a high-performance, scalable server for collecting clickstream data into HDFS and onto Kafka topics. It uses a JavaScript tag on the client side to gather user interaction data, similar to many other web tracking solutions. Divolte Collector can be used as the foundation to build anything from basic web analytics dashboarding to real-time recommender engines and banner optimization systems.

Site integration

You integrate Divolte Collector into your site by simply including a small piece of JavaScript.

This takes care of logging all page views and exposes a JavaScript module. This module can be used to interact with Divolte Collector in the browser and log custom events.

Scalable

Divolte Collector pushes data to Hadoop HDFS and Kafka topics. Data is written to HDFS as complete Avro files, while Kafka messages contain serialized Avro records.

Divolte Collector itself is effectively stateless; you can deploy multiple collectors behind a load balancer for availability and scalability.

Structured data in Avro

To preserve the sanity of developers and data scientists alike, all data should come with a schema. CSV is not a schema. The common log format is not a schema. JSON is not a schema. A schema defines which fields exist and what their types are. Using a schema allows you to inspect data without making assumptions about which fields are available.

Divolte Collector uses Apache Avro for storing data. Avro requires a schema for all data, yet it allows for full flexibility through schema evolution.
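To make the idea of a schema with evolution concrete, here is a minimal sketch in plain Python. It does not use an Avro library and is not how Divolte Collector works internally; it only mimics the way a record schema names its fields and types, and how a default value lets data written before a field existed still be read. The record name Click and the fields timestamp, location and referer are hypothetical.

```python
# Avro-style record schema written as a plain dict (illustration only).
# The record and field names are hypothetical, not Divolte's default schema.
SCHEMA_V2 = {
    "type": "record",
    "name": "Click",
    "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "location", "type": "string"},
        # Added in a later schema version; the default keeps old data readable.
        {"name": "referer", "type": ["null", "string"], "default": None},
    ],
}

# Small subset of Avro primitive type names mapped to Python types.
PY_TYPES = {"long": int, "string": str, "null": type(None)}


def conform(record, schema):
    """Return a copy of record that matches schema, or raise ValueError."""
    out = {}
    for field in schema["fields"]:
        name = field["name"]
        declared = field["type"]
        types = declared if isinstance(declared, list) else [declared]
        if name in record:
            value = record[name]
        elif "default" in field:
            # Field missing from older data: fall back to the schema default.
            value = field["default"]
        else:
            raise ValueError(f"missing field: {name}")
        if not any(isinstance(value, PY_TYPES[t]) for t in types):
            raise ValueError(f"wrong type for field: {name}")
        out[name] = value
    return out


# A record written before 'referer' existed still conforms to the new schema.
old_record = {"timestamp": 1422000000000, "location": "/home"}
print(conform(old_record, SCHEMA_V2))
```

Because the schema states which fields exist and what their types are, a consumer can reject malformed records up front instead of guessing at the shape of the data.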

Through a special feature of Divolte Collector called mapping, you can map any part of incoming events onto any field in your schema. Mapping also allows for complex constructs such as mapping fields conditionally or setting values based on URL patterns or other incoming event data.
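The concept of conditional mapping can be sketched in a few lines of Python. Divolte Collector's real mappings are written in its own mapping configuration, not in Python, so treat this purely as an illustration of the idea: a field is set only when the page URL matches a pattern. The URL patterns, the event key location and the record fields pageType and productId are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical rules: a URL path pattern and the value to store in the
# hypothetical 'pageType' field when the pattern matches.
PAGE_TYPE_RULES = [
    (re.compile(r"^/products/(?P<id>[^/]+)$"), "product"),
    (re.compile(r"^/basket/?$"), "basket"),
]


def map_event(event):
    """Map an incoming event (a dict) onto a flat record (also a dict)."""
    path = urlparse(event["location"]).path
    record = {"location": event["location"], "pageType": None, "productId": None}
    for pattern, page_type in PAGE_TYPE_RULES:
        match = pattern.match(path)
        if match:
            record["pageType"] = page_type
            # Only the product rule defines an 'id' group; others yield None.
            record["productId"] = match.groupdict().get("id")
            break  # first matching rule wins
    return record


print(map_event({"location": "https://shop.example.com/products/42?ref=home"}))
```

The first matching rule wins, and events that match no rule keep the conditional fields empty.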

User agent parsing

Know what this means? Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36

Divolte Collector parses user agents into readable fields such as operating system, device type, and browser name. Parsing relies on a database of known user agent strings, which Divolte Collector can update on the fly without requiring a restart.
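As a rough illustration of what "readable fields" means, the sketch below pulls an operating system and browser out of the example string with two regular expressions. This is a toy: Divolte Collector uses a database of known user agent strings rather than hand-written patterns, and the output field names here are hypothetical.

```python
import re

UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36")


def parse_user_agent(ua):
    """Extract a few readable fields from a user agent string (toy version)."""
    fields = {"os": None, "os_version": None,
              "browser": None, "browser_version": None}

    os_match = re.search(r"Mac OS X (\d+(?:[_.]\d+)*)", ua)
    if os_match:
        fields["os"] = "OS X"
        fields["os_version"] = os_match.group(1).replace("_", ".")

    # Chrome also advertises "Safari", so look for Chrome first.
    chrome = re.search(r"Chrome/([\d.]+)", ua)
    safari = re.search(r"Version/([\d.]+).*Safari/", ua)
    if chrome:
        fields["browser"] = "Chrome"
        fields["browser_version"] = chrome.group(1)
    elif safari:
        fields["browser"] = "Safari"
        fields["browser_version"] = safari.group(1)
    return fields


print(parse_user_agent(UA))
```

Even this small case shows why a maintained database is preferable: the string names Mozilla, AppleWebKit, Chrome and Safari, yet only one of them is the actual browser.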

IP to geolocation lookup

If enabled, Divolte Collector will perform on-the-fly ip2geo lookups using databases provided by MaxMind. You can use either the light version of their database, which is downloadable for free, or their more accurate subscription database, which comes with a commercial license.

Note that we cannot enable this feature by default, because redistribution of the MaxMind database is restricted. Configuration is simple, however: just put the path of the database file in the Divolte Collector configuration.

If you have a subscription license to the MaxMind database, Divolte Collector will reload the database as updates appear without a restart.

Fast

Tracking code shouldn't keep the browser spinning any longer than necessary.

Divolte Collector was built with performance in mind. It relies on the high-performance Undertow HTTP implementation and has a clean internal threading model with zero shared state and a high level of immutability. Everything is non-blocking, which results in little contention under normal operation.

Custom events

Log anything from the browser. As with other web tracking tools you can fire custom events from your pages using JavaScript. Whether it's an add-to-basket, checkout or product image zoom, just add a custom event if you want to track it.

Custom events can have parameters in the form of arbitrary JavaScript objects and these are easily mapped onto your Avro records. They are part of your own schema. You can extract top-level object members directly by name or use JSONPath expressions to extract values, arrays or complete objects from the event payload.
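The difference between picking a top-level member by name and using a path expression can be shown with a small Python sketch. The extract function below is a hypothetical helper that understands only a tiny JSONPath-like subset ('$', '.name' and '[index]'); it is not Divolte Collector's implementation and not a full JSONPath engine. The event payload and its keys are made up for illustration.

```python
import re

# A custom event payload as it might arrive from the browser (hypothetical).
event_parameters = {
    "productId": "309125",
    "quantity": 2,
    "basket": {
        "items": [
            {"sku": "A1", "price": 9.95},
            {"sku": "B2", "price": 4.50},
        ]
    },
}

# One path step: either ".name" or "[index]".
_STEP = re.compile(r"\.([A-Za-z_]\w*)|\[(\d+)\]")


def extract(payload, path):
    """Resolve a tiny JSONPath-like subset: '$', '.name' and '[index]'."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    value, pos = payload, 1
    while pos < len(path):
        step = _STEP.match(path, pos)
        if step is None:
            raise ValueError(f"unsupported path syntax at offset {pos}")
        name, index = step.groups()
        value = value[name] if name is not None else value[int(index)]
        pos = step.end()
    return value


# Top-level member, directly by name.
print(event_parameters["productId"])
# A single nested value, a complete array, and a complete object via a path.
print(extract(event_parameters, "$.basket.items[1].sku"))
print(extract(event_parameters, "$.basket.items"))
print(extract(event_parameters, "$.basket"))
```

Whichever form is used, the extracted value still has to fit the type of the schema field it is mapped onto.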

Hadoop ecosystem

Divolte Collector is not opinionated about the best way to process or use your data. By writing data as Avro records, you are free to use any framework of your choice for working with your data.

Perform offline processing of the clickstream data using Cloudera Impala, Apache Hive, Apache Flink, Apache Spark, Apache Pig or plain old MapReduce. Anything that understands Avro will work.

For near real-time processing, you can consume Divolte Collector's messages from Kafka using plain Kafka consumers, Spark Streaming or Storm.

Open source

Divolte Collector is released under the Apache License, Version 2.0.

It's never a good idea to be locked in to a vendor for your data collection. Similarly, sending your clickstream data to cloud providers can present issues. Better to take control and free yourself from data ownership issues, closed formats and license or service fees for obtaining your own data.

References

E-retail: Better customer relations by using machine learning algorithms


Data and technology are widespread and readily available. The true challenge of becoming DataDriven lies within your organization, the way you work, and the skills of your team members.

Rob Dielemans
Managing Director, GoDataDriven