21 Aug 2018

What is the Objective?

The objective is the live transfer of data from a source to a destination. There can be various types of sources and various types of destinations. The use case we consider here has a web application server as the source. Once data is read from the source, we do two things: we store one copy of the data in a database, and we run real-time processing on another copy of the same data.
The idea behind saving one copy in the database is to retain a golden copy of the data in case something goes wrong.
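To make the two-copy idea concrete, here is a minimal sketch, not the actual pipeline. It assumes an in-memory SQLite database as a stand-in for MySQL and a simple per-request-type counter as a stand-in for the real-time processing. The `fan_out` function, the `raw_events` table and the sample records are all hypothetical names invented for illustration.

```python
import sqlite3
from collections import Counter


def fan_out(records, db_conn, stats):
    """Send every record to both sinks: the golden copy and the live processor."""
    for record in records:
        # Sink 1: persist the raw record untouched (the golden copy).
        db_conn.execute("INSERT INTO raw_events (payload) VALUES (?)", (record,))
        # Sink 2: process the same record in real time (here: count by first token).
        stats[record.split()[0]] += 1
    db_conn.commit()


# Assumption: SQLite in memory stands in for the MySQL destination.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (id INTEGER PRIMARY KEY, payload TEXT)")

stats = Counter()
# Hypothetical sample records: "<request type> <path> <response code>".
fan_out(["GET /home 200", "POST /login 302", "GET /about 200"], conn, stats)

stored = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
print(stored, dict(stats))
```

The point of the design is that the stored copy is never modified by the processing step, so the processing logic can be changed or replayed later from the golden copy.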

Components Used

  • Spark Streaming 2.3.1
  • Kafka 0.10.X
  • Hadoop 2.7.3 (YARN)
  • Java 1.8
  • Scala 2.11
  • Python 3.6
  • MySQL 5.7

What are the bottlenecks?

  1. The primary bottleneck we have come across is incorrect network configuration of the cluster.
    1. Sometimes this is due to incorrect TCP or UDP protocol definitions on the port where data arrives or from which data is read.
  2. Another issue is Spark running out of memory. This memory issue can take various forms and can occur at various levels.
     Note: Spark always considers memory at two levels: the memory it is actually going to use, and the memory it keeps as a buffer. The buffer is not used, but that amount of memory must still be available. This can be tuned, but we will come to that in another blog post.
  3. It can also be that the application master is unable to accept the job because memory is not available.
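One plausible reading of the "buffer" memory mentioned in the note above is the executor memory overhead that Spark requests from YARN on top of the executor heap. That reading is an assumption, not something the note states. Under it, the rough sketch below estimates the container size per executor using Spark's documented defaults for `spark.executor.memoryOverhead` (10% of executor memory, with a 384 MB minimum). The function names are hypothetical, and rounding by the YARN scheduler is ignored.

```python
MIN_OVERHEAD_MB = 384    # documented minimum for executor memory overhead
OVERHEAD_FACTOR = 0.10   # documented default fraction of executor memory


def yarn_container_mb(executor_memory_mb, overhead_mb=None):
    """Estimate the memory YARN must grant for one executor container."""
    if overhead_mb is None:
        # Default overhead: 10% of the heap, but never less than 384 MB.
        overhead_mb = max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FACTOR))
    return executor_memory_mb + overhead_mb


def fits_in_yarn(executor_memory_mb, yarn_max_allocation_mb):
    """Check the estimate against yarn.scheduler.maximum-allocation-mb."""
    return yarn_container_mb(executor_memory_mb) <= yarn_max_allocation_mb


print(yarn_container_mb(2048))   # heap of 2 GB
print(yarn_container_mb(4096))   # heap of 4 GB
print(fits_in_yarn(8192, 8192))  # an 8 GB heap against an 8 GB YARN limit
```

A request that looks as if it fits exactly, such as an 8 GB executor on a cluster whose maximum allocation is 8 GB, is rejected once the overhead is added. That is one way a job can fail to get resources even though memory appears to be available.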

How does it work?

    • At the source we have a file-based system where we store the logs generated by the web application. These logs contain information related to:
      • Response Time
      • Request Type
      • Response Code
      • Map ID - This represents the unique ID of a printer (you can ignore it; it's a business requirement)
    • At the intermediate stage, we have a Spark cluster running on YARN.
    • At the destination, we have a database in which the data is stored.
      • We also have a Python-based dashboard that shows analysis of the live data stream.
    • You can go into more detail and even see the sample code I have prepared for this page on this link.
      I could tell you at length about each detail, but I don’t think that would be of any use until you try and fail. Moreover, you can find the theoretical details about this scenario on many web pages.
      You can reach out to me if you need any help.
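As a starting point for experimenting, the log fields listed above can be parsed and summarised in a few lines of plain Python. This is only a sketch: the space-separated line layout, the `LogEntry` type, the function names and the sample lines are all assumptions made for illustration, and the real job runs on Spark rather than in a single process.

```python
from collections import defaultdict
from typing import NamedTuple


class LogEntry(NamedTuple):
    response_time_ms: float
    request_type: str
    response_code: int
    map_id: str


def parse_line(line):
    # Assumed layout: "<response_time_ms> <request_type> <response_code> <map_id>"
    response_time, request_type, response_code, map_id = line.split()
    return LogEntry(float(response_time), request_type, int(response_code), map_id)


def mean_response_time_by_code(entries):
    """Average response time per response code, as a dashboard might show it."""
    totals = defaultdict(lambda: [0.0, 0])  # code -> [sum of times, count]
    for entry in entries:
        totals[entry.response_code][0] += entry.response_time_ms
        totals[entry.response_code][1] += 1
    return {code: total / count for code, (total, count) in totals.items()}


# Hypothetical sample log lines.
sample = ["120 GET 200 P-001", "80 GET 200 P-002", "300 POST 500 P-001"]
summary = mean_response_time_by_code(parse_line(line) for line in sample)
print(summary)
```

In a streaming job the same parse step would run on each micro-batch, with the summary recomputed or updated per batch before it reaches the dashboard.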

      Happy Streaming!!!!