By James Wing on 2017-05-24

The latest Apache NiFi release, 1.2.0, is now available in the AWS Marketplace from BatchIQ in Community and Professional Editions. You can read about the release in the official Apache NiFi Release Notes for 1.2.0. NiFi 1.2.0 is another big release with new features and improvements.

For me, the biggest new feature is the introduction of the Record concept for structured data in NiFi. NiFi has always had support for various structured data formats like JSON, Avro, CSV, XML, etc. But that support has been uneven, and it has led to some bizarre data gymnastics when trying to perform common conversions and transformations. Records bring a framework for more standardized operations on structured data, starting with schemas and common operations like splitting and format conversions. I expect this to be a growth area for NiFi, and it should make life much easier for everyone working with structured and semi-structured data.

More...
By James Wing on 2017-02-20
Amazon S3 Ingest with Apache NiFi

Everybody loves S3 data storage. While there are as many different ways to deliver content to S3 as there are users, there are some common patterns for solving S3 content delivery with Apache NiFi. This article describes, compares, and contrasts three patterns for S3 data delivery in NiFi with respect to typical design concerns:

  • Security and exposing an API
  • Tracking the latest changes
  • Reading from multiple buckets
  • Working across AWS accounts
  • Availability of the solution

More...
By James Wing on 2017-02-14
Processing S3 Notifications with Apache NiFi

Receiving and processing S3 event notifications is a common pattern for processing files written to S3. Event notifications convert S3 from just a key/value store to a stream of events. Handling these events is the best way to perform low-latency processing of S3 objects. This pattern consists of several components:

  • S3 bucket for data collection
  • SQS queue to receive S3 event notifications
  • Apache NiFi to process the notifications and incoming data files
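
As a rough illustration of what the notification-processing step has to do, the sketch below parses the JSON body of an S3 event notification and pulls out the bucket and object key for each record. It is a minimal Python sketch rather than a NiFi flow, and it assumes the notification is delivered directly from S3 to SQS (not wrapped in an SNS envelope). The function name `extract_s3_objects` and the sample bucket and key are hypothetical. Object keys arrive URL-encoded in notifications, so they are decoded before use.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(message_body):
    """Return (event_name, bucket, key) tuples from an S3 event notification body."""
    event = json.loads(message_body)
    results = []
    # Messages without a "Records" list (for example, a test event) yield nothing.
    for record in event.get("Records", []):
        s3_info = record.get("s3", {})
        bucket = s3_info.get("bucket", {}).get("name")
        key = s3_info.get("object", {}).get("key")
        if bucket is None or key is None:
            continue
        # Keys are URL-encoded in notifications; spaces arrive as "+".
        results.append((record.get("eventName"), bucket, unquote_plus(key)))
    return results


# Hypothetical sample message, trimmed to the fields used above.
sample_message = json.dumps({
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "incoming/my+file.csv", "size": 1024},
            },
        }
    ]
})

for event_name, bucket, key in extract_s3_objects(sample_message):
    print(event_name, bucket, key)
```

In a flow, each extracted bucket and key pair would then drive a fetch of the object itself.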

More...
By James Wing on 2017-01-16
Processing AWS CloudTrail Events with Apache NiFi

Processing events from AWS CloudTrail is a vital security activity for many AWS users. CloudTrail reports on important security events like user logins and role assumption, "management events" from API calls that can change the security and structure of your account, and recently "data events" from more routine data access to S3.

Apache NiFi is a great platform for processing a stream of CloudTrail events. It provides low-latency handling and the flexibility to process events directly while also supporting a wide variety of downstream processing and reporting options:

  • Enrich events with account information, IP geolocation, machine learning algorithms, etc.
  • Identify events that require immediate action
  • Route events to storage and query systems like S3, ElasticSearch, Kinesis, etc.

More...
By James Wing on 2016-12-16
Amazon Athena Ingest with Apache NiFi

Amazon Athena is a recently launched service that provides interactive SQL queries over your data in S3. Athena uses the Hive Metastore to define your data structure, and Presto for processing queries. According to Amazon's marketing copy, "there’s no need for complex ETL jobs to prepare your data for analysis". Sounds great!

Notice, however, the fine print -- it doesn't say "no ETL", just no "complex ETL". My first Athena project was to query the summarized tweets I stored up from the earlier S3 ingest project. This did not work because I had stored the data in Avro format, which is not yet supported by Athena. Another data set I tried had a mix of CSV data files with JSON metadata, which Athena rejected.

What ETL do we need for Athena? Apache NiFi is a great tool for building an ingest pipeline to the Amazon Athena query service, and to other AWS data tools. In this post, I will explain how to set up a data set in S3 for Athena using Apache NiFi. Then we'll revisit the Twitter to S3 flow, optimizing for Athena.

More...
By James Wing on 2016-12-13

The latest Apache NiFi release, 1.1.0, is now available in the AWS Marketplace from BatchIQ in community and professional editions. You can read about the release in the official Apache NiFi Release Notes for 1.1.0. NiFi 1.1.0 is a big release with many improvements and fixes beyond the already big 1.0.0 release.

As always, I would like to call out a few exciting new features:

  • CloudWatch - New PutCloudWatchMetric processor for writing custom CloudWatch metrics, perfect for monitoring your NiFi flow.
  • ListS3 - ListS3 processor has improved performance handling large S3 buckets.
  • NiFi UI - Visual indications of backpressure and queue size.

Anyone migrating from an earlier version of NiFi should take a look at our upgrade guidance.

More...
By James Wing on 2016-10-31
Amazon S3 Ingest with Apache NiFi

A frequent goal for an Apache NiFi flow is to ingest data into S3 object storage. You might be using S3 as a Data Lake. Maybe S3 is an intermediate destination, awaiting another pipeline to Redshift or HDFS. S3 has become a default parking lot for data because S3 is general-purpose, cheap, accessible, and reliable. But it isn't always pretty: the accumulated data frequently exemplifies the "Data Swamp" label.

Just saving data on S3 doesn't make it an analytic data store. But it could be. In this post, we'll explore how Apache NiFi can help you get your S3 data storage into proper shape for analytic processing with EMR, Hadoop, Drill, and other tools.

More...
By James Wing on 2016-09-12
Using NiFi for Database Extract

In an earlier post, I wrote about using Apache NiFi to ingest data into a relational database. Today, we'll reverse the polarity of the stream, and show how to use NiFi to extract records from a relational database for ingest into something else -- a different database, Hadoop on EMR, text files, anything you can do with NiFi. Apache NiFi has a couple of built-in processors for extracting database data into NiFi FlowFiles, and we'll look at the pros and cons of each in building your database flow. Finally, we'll walk through a sample flow that reads database records incrementally.
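
To make "reading records incrementally" concrete, here is a minimal Python sketch of one common approach: remember the largest value seen in an increasing column (a high-water mark) and fetch only rows beyond it on the next pass. This illustrates the general idea under stated assumptions and is not necessarily how the sample flow or any NiFi processor does it. The `events` table, its columns, and the `fetch_new_rows` helper are all hypothetical, and an in-memory SQLite database stands in for a real relational database.

```python
import sqlite3


def fetch_new_rows(conn, last_seen_id):
    """Fetch rows with id greater than last_seen_id; return them with the new mark."""
    cursor = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
        (last_seen_id,),
    )
    rows = cursor.fetchall()
    # Keep the old mark when nothing new arrived.
    new_mark = rows[-1][0] if rows else last_seen_id
    return rows, new_mark


# Hypothetical table used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",)])

# First pass reads everything, starting from a mark of 0.
first_batch, mark = fetch_new_rows(conn, 0)

# A later pass picks up only the row added since the previous one.
conn.execute("INSERT INTO events (payload) VALUES (?)", ("c",))
second_batch, mark = fetch_new_rows(conn, mark)

print(first_batch, second_batch, mark)
```

The high-water mark has to be persisted between runs for this to work across restarts; the sketch keeps it in a local variable for brevity.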

More...
By James Wing on 2016-08-15

The latest NiFi release, 0.7.0, is now available in the AWS Marketplace as Apache NiFi provided by BatchIQ. You can read about the release in the official Apache NiFi Release Notes for 0.7.0. 0.7.0 is expected to be the last significant release before the big 1.0 release later this summer.

As always, I would like to call out a few exciting new AWS-related features:

  • DynamoDB - New processors for reading and writing to DynamoDB
  • Credentials - New options for providing AWS credentials including named profiles
  • ListS3 - New ListS3 processor for iterating over objects in an S3 bucket

AMI Updates

The BatchIQ AMI has also been improved:

  • NiFi-Init included
  • NiFi configuration now backed up daily

Anyone migrating from an earlier version of NiFi should take a look at our upgrade guidance.

More...
By James Wing on 2016-08-11
Amazon Elastic MapReduce Ingest with Apache NiFi

Amazon Elastic MapReduce (EMR) is a great managed Hadoop offering that allows clusters to be both easily deployed and easily dissolved. EMR can be used to set up long-lived clusters or run scripted jobs priced by the hour. But getting your data into an EMR cluster can be challenging, for a number of reasons:

  • Network access to EMR clusters is tightly restricted by default
  • Data formats and patterns in Hadoop are specialized
  • Hadoop clusters host a number of components -- HDFS, HBase, Hive, Spark, etc. -- each with their own configuration

In this post, we'll go over some of the basics for connecting Apache NiFi to an EMR cluster.

More...