Flume is a distributed system for collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. It is a project of the Apache foundation and part of the wider Hadoop ecosystem.
A Flume agent is a Java Virtual Machine (JVM) process that hosts the sources, channels, and sinks. A source listens for events, which can be produced by a web server, a write to a log file, an application, the arrival of a packet on a network port, etc. When such an event is received, the source stores it into one or more channels.
A channel can be compared to a FIFO (First In, First Out) queue. Three common channel types are:
- MemoryChannel: all events are stored in memory (RAM) for the best performance, at the cost of losing them if the agent fails.
- FileChannel: all events are stored on disk to prevent any loss in case the host fails.
- JDBCChannel: events are stored in a relational database.
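The trade-off between the memory and file channels shows up directly in an agent's configuration. The sketch below contrasts the two; the agent name `a1`, the channel name `c1`, and the directory paths are assumptions for illustration, not values from the text:

```properties
# Memory channel: fastest, but events are lost if the agent crashes
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# File channel alternative: durable, events survive a restart
# a1.channels.c1.type = file
# a1.channels.c1.checkpointDir = /var/flume/checkpoint
# a1.channels.c1.dataDirs = /var/flume/data
```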
As we can see in the figure “flume architecture”, the sink consumes the events stored in the channels, taking several events at a time when possible, and sends them to their final destination. The destination can be HDFS, a database, or anything else you put in place. There are several types of sink, for example:
- HDFSSink: writes events to HDFS.
- HBaseSink: writes events to HBase, a NoSQL database.
- AvroSink: serializes events with Apache Avro and forwards them over the network, typically to the Avro source of another Flume agent in order to chain agents into multi-hop flows.
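The source → channel → sink pipeline described above is wired together in a single agent configuration file. This is a minimal sketch, assuming an agent named `a1`, a netcat source on port 44444, and an HDFS namenode at `namenode:8020`; all of these names and addresses are illustrative assumptions:

```properties
# Declare the agent's components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for newline-separated events on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

# Sink: write batches of events to HDFS, partitioned by date
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Wire source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Note that a source can write to several channels (`channels`, plural), while a sink drains exactly one (`channel`, singular), which matches the fan-out topology the architecture allows.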