Divolte Collector pushes data to Hadoop HDFS and Kafka topics. Data is written to HDFS as complete Avro files, while Kafka messages contain serialized Avro records.
Divolte Collector itself is effectively stateless; you can deploy multiple collectors behind a load balancer for availability and scalability.
Structured data in Avro
To preserve the sanity of developers and data scientists alike, all data should come with a schema. CSV is not a schema. The common log format is not a schema. JSON is not a schema. A schema defines which fields exist and what their types are. Using a schema allows you to inspect data without making assumptions about which fields are available.
Divolte Collector uses Apache Avro for storing data. Avro requires a schema for all data, yet it allows for full flexibility through schema evolution.
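As a sketch of what such a schema looks like, here is a minimal Avro record for clickstream events. The field names here are illustrative, not the schema Divolte Collector ships with:

```json
{
  "namespace": "io.divolte.examples",
  "type": "record",
  "name": "ClickEvent",
  "fields": [
    { "name": "timestamp",  "type": "long" },
    { "name": "remoteHost", "type": "string" },
    { "name": "location",   "type": ["null", "string"], "default": null },
    { "name": "referer",    "type": ["null", "string"], "default": null }
  ]
}
```

Because new fields can be added with a default value, later versions of the schema remain backward compatible with data written under earlier versions; this is what makes schema evolution practical in Avro.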
Through a special feature of Divolte Collector called mapping, you can map any part of incoming events onto any field in your schema. Mapping also allows for complex constructs such as mapping fields conditionally or setting values based on URL patterns or other incoming event data.
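Mappings are written in a small Groovy-based DSL. A minimal sketch, assuming a schema with `timestamp`, `remoteHost`, `location`, and `referer` fields (the field names are assumptions for illustration):

```groovy
mapping {
  // Map values extracted from the incoming event
  // onto fields of the Avro schema.
  map timestamp()  onto 'timestamp'
  map remoteHost() onto 'remoteHost'
  map location()   onto 'location'
  map referer()    onto 'referer'
}
```

The DSL also supports conditional mapping and value extraction (for example, matching parts of the request URL against a pattern before assigning them to a field); consult the mapping documentation for the exact constructs available in your version.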
User agent parsing
Know what this means? Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36
Divolte Collector parses user agent strings into readable fields such as operating system, device type, and browser name. Parsing is driven by a database of known user agent strings, which Divolte Collector can update on the fly without requiring a restart.
IP to geolocation lookup
If enabled, Divolte Collector will perform on-the-fly ip2geo lookups using databases provided by MaxMind. You can use either the light version of their database, which is a free download, or their more accurate subscription database, which comes with a commercial license.
Note that we cannot enable this feature by default, as redistribution of the MaxMind database is restricted. Configuration, however, is simple: just put the path of the database file in the Divolte Collector configuration.
If you have a subscription license to the MaxMind database, Divolte Collector will reload the database as updates appear without a restart.
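As a sketch, enabling the lookup comes down to a single entry in the HOCON configuration file. The exact key path varies between Divolte Collector versions, and the database path below is an assumption for illustration:

```
divolte {
  global {
    mapper {
      // Path to a MaxMind database file (free GeoLite2 or commercial GeoIP2).
      // Key name may differ by version; check the configuration reference.
      ip2geo_database = "/etc/divolte/GeoLite2-City.mmdb"
    }
  }
}
```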
Tracking code shouldn't keep the browser spinning any longer than necessary.
Divolte Collector was built with performance in mind. It relies on the high-performance Undertow HTTP server and has a clean internal threading model with zero shared state and a high degree of immutability. Everything is non-blocking, which results in little contention under normal operation.
Divolte Collector is not opinionated about the best way to process or use your data. By writing data as Avro records, you are free to use any framework of your choice for working with your data.
Perform offline processing of the clickstream data using Cloudera Impala, Apache Hive, Apache Flink, Apache Spark, Apache Pig or plain old MapReduce. Anything that understands Avro will work.
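For example, Hive (and tools that share its metastore, such as Impala) can query the Avro files directly by declaring an external table over the HDFS directory Divolte Collector writes to. The paths and table name below are illustrative assumptions:

```sql
-- Expose Divolte Collector's Avro output as an external Hive table.
-- Both the data location and the schema URL are illustrative.
CREATE EXTERNAL TABLE clicks
STORED AS AVRO
LOCATION 'hdfs:///divolte/published'
TBLPROPERTIES ('avro.schema.url' = 'hdfs:///divolte/ClickEvent.avsc');
```

Pointing the table at the Avro schema file means the table definition evolves along with the schema, rather than being maintained by hand.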
For near real-time processing, you can consume Divolte Collector's messages from Kafka using plain Kafka consumers, Spark Streaming or Storm.
Divolte Collector is released under the Apache License, Version 2.0.
It's never a good idea to be locked in to a vendor for your data collection. Similarly, sending your clickstream data to cloud providers can present issues. Better to take control and free yourself from data ownership issues, closed formats, and license or service fees for accessing your own data.
Data and technology are widespread and readily available. The true challenge of becoming data-driven lies within your organization, the way you work, and the skills of your team members.