I’m a Data Scientist and Solutions Architect at JustGiving, leading the delivery and data engineering of RAVEN, our AWS-based data science and analytics platform, and the production deployment of Apache Spark, Scala, and Python solutions that integrate with our mobile application and website.
I find that people often think of big data solutions in terms of volume: the size of the logs, the amount of unstructured data, the number of rows, and so on. You can think of this as a big data lake that holds everything, on which you run reporting, metrics calculations, machine learning algorithms, and graph analytics. This covers most traditional use cases, but long-running ETL or Spark jobs typically run on a schedule, so the data is not always up to date.
Another important area, however, is streaming analytics, which works slightly differently: you analyse or run calculations on one or many streams of events directly. I like to think of it as the data driving the code, rather than the code querying the data as in non-streaming or batch solutions. Stream processing is gaining more and more adoption for use cases such as real-time dashboards, fraud detection, and operational issue notification. For example, an analytics system could process click events at checkout in an e-commerce application and create an aggregate for each minute of the day, allowing users and systems to react in the best way.
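To make the per-minute aggregation concrete, here is a minimal, self-contained Python sketch. The event shape and function names are my own illustration, not code from the post; it simply buckets checkout click events by the minute of their timestamp and counts them:

```python
from collections import Counter
from datetime import datetime, timezone

def minute_bucket(ts: float) -> str:
    """Truncate a Unix timestamp to its minute, e.g. '2016-05-01T12:28'."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M")

def aggregate_by_minute(events):
    """Count events per minute bucket; each event is a dict with a 'timestamp' field."""
    return dict(Counter(minute_bucket(e["timestamp"]) for e in events))

# Example: three checkout clicks, the first two within the same minute.
events = [
    {"timestamp": 1462105680.0, "page": "checkout"},
    {"timestamp": 1462105685.0, "page": "checkout"},
    {"timestamp": 1462105745.0, "page": "checkout"},
]
```

A streaming system applies the same idea continuously: each arriving event increments the counter for its minute bucket as soon as it is seen.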
Traditionally, this would have gone through a data warehouse or a NoSQL database, with data pipeline code that was custom built or based on third-party software. In addition, the reports and dashboards would only be refreshed every few hours, or as part of an overnight ETL job. Using a streaming analytics architecture, we can provide analysis of events typically within one minute or less.
To that end, I’ve recently written a post on the AWS Big Data Blog on how, in a few lines of Python code, you can do streaming analytics in AWS without needing to run and maintain an Apache Spark cluster or an EC2 server. I discuss how you can create a clusterless, or serverless, solution using only Amazon Kinesis, DynamoDB, AWS Lambda, and CloudWatch.
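The Lambda side of such a pipeline can be sketched as follows. This handler is my own illustration, not the code from the blog post: it decodes the base64-encoded records that Lambda receives from a Kinesis stream and tallies events per minute. In the full solution these counts would then be persisted to DynamoDB with boto3, which is omitted here to keep the sketch self-contained:

```python
import base64
import json
from collections import Counter
from datetime import datetime, timezone

def lambda_handler(event, context):
    """Aggregate Kinesis click events per minute.

    `event` follows the Kinesis-to-Lambda event format: each record's
    payload in event['Records'][i]['kinesis']['data'] is a base64-encoded
    JSON document, assumed here to carry a Unix 'timestamp' field.
    """
    counts = Counter()
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        minute = datetime.fromtimestamp(
            payload["timestamp"], tz=timezone.utc
        ).strftime("%Y-%m-%dT%H:%M")
        counts[minute] += 1
    # In the full solution, increment these per-minute counters in a
    # DynamoDB table here (e.g. boto3 UpdateItem with an ADD expression),
    # so dashboards can read near-real-time aggregates.
    return dict(counts)
```

Because Lambda invokes the handler for each batch of Kinesis records, the aggregation happens as the data arrives, with no cluster to provision or keep running.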
You will benefit from:
- Lower costs – you only pay when the code is executed and do not need to maintain an always-on cluster
- AWS security – you do not need to manage any keys or passwords
- High availability and scalability – the Lambda functions are executed across multiple Availability Zones and scale with the data volume in Kinesis