Apache Mesos and Its Benefits in Spark Streaming and Big Data in General
Date: February 5, 2018

Situation Overview

Nowadays there is a plethora of big data tools and frameworks, each with a unique set of advantages. This is particularly pronounced in the case of streaming data. Inevitably, most organizations end up using several of these systems for their data analytics needs. Before they know it, they have a mosaic of tools that must somehow coordinate and liaise with the big data governance platform the organization is using (e.g. Spark).

Problems Involved

This excessive diversity of tools makes coordination among them a very challenging task. Conventional systems like YARN, the resource manager of Hadoop, are insufficient for all this. YARN has a series of limitations, such as the fact that it only recognizes certain kinds of jobs (e.g. MapReduce) and not others. In addition, due to the monolithic design of its scheduler, Hadoop needs to own the whole cluster in order to perform a coordination role, and workloads cannot stem from mixed applications. What's more, due to its batch orientation, certain frameworks (e.g. Kafka) need to be managed separately.

Implications of These Problems

All this creates a lot of overhead in the big data system and increases the risk of something going wrong and delaying the whole pipeline. This procedural discord translates into higher costs in a big data project, and the specialist working on the data governance system is not a cheap resource either. In addition, as new tools become available, adopting them may be tricky because they may not integrate well with the default scheduler of the big data platform, potentially causing further issues and delays. However, not adopting a new system may not be a sound move strategically, since many of these novel tools bring efficiency gains that would not otherwise be possible.

Apache Mesos as a Potential Solution

On the bright side, this whole situation could be handled by a scheduling system like Apache Mesos. This big data tool acts like a dispatcher of sorts, taking care of the scheduling of the various processes involved in a pipeline, regardless of where they come from. Tasks are isolated because each one runs in a container with its own set of resources. This isolation is one of the things that make Mesos fault-tolerant.

When deployed to a computer cluster, Mesos comprises four main parts: ZooKeeper (a coordination service that began as a Hadoop subproject and is used here to elect the leading master), the Mesos masters (one active leader plus standby instances, which decide how the cluster's resources are offered), the Mesos agents (called slaves in older versions; the worker nodes that report their available resources to the master and run the tasks), and the frameworks (the components responsible for bridging the Mesos layer with your applications, such as Spark's scheduler and executors). All of that can be accessed through an API, which is available in various programming languages, such as C++, Python, and Java.
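To make the division of labor between these parts more concrete, here is a minimal toy sketch in Python of a resource offer cycle. It does not use the Mesos API at all: the class names, the `run_offer_cycle` function, and the node and task names are all hypothetical, and the real master uses a fairness-based allocator rather than the simple in-order loop assumed here.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """A worker node (agent, formerly slave) reporting its free resources."""
    name: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    """A unit of work with the resources its container needs."""
    name: str
    cpus: float
    mem_mb: int


@dataclass
class Framework:
    """A framework scheduler holding tasks that still need to be placed."""
    name: str
    pending: list = field(default_factory=list)

    def on_offer(self, agent):
        """Accept the pending tasks that fit within one agent's offer."""
        accepted = []
        cpus, mem = agent.cpus, agent.mem_mb
        for task in list(self.pending):  # iterate over a copy while removing
            if task.cpus <= cpus and task.mem_mb <= mem:
                accepted.append(task)
                self.pending.remove(task)
                cpus -= task.cpus
                mem -= task.mem_mb
        return accepted


def run_offer_cycle(agents, frameworks):
    """Toy master: offer each agent's free resources to the frameworks in order.

    Returns a list of (framework, task, agent) placements.
    """
    placements = []
    for agent in agents:
        for fw in frameworks:
            for task in fw.on_offer(agent):
                # Launching a task consumes part of the agent's resources.
                agent.cpus -= task.cpus
                agent.mem_mb -= task.mem_mb
                placements.append((fw.name, task.name, agent.name))
    return placements


def demo():
    """Two frameworks sharing two agents (all names are made up)."""
    agents = [Agent("agent-1", 4, 8192), Agent("agent-2", 2, 4096)]
    spark = Framework("spark", [Task("stream-executor", 2, 4096),
                                Task("batch-executor", 2, 4096)])
    kafka = Framework("kafka", [Task("broker", 2, 2048)])
    return run_offer_cycle(agents, [spark, kafka])
```

Calling `demo()` places both hypothetical Spark tasks on `agent-1` and the Kafka task on `agent-2`. The point of the sketch is that the master only hands out offers, while each framework decides for itself which tasks to launch, which is what lets mixed workloads share one cluster.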

As a result of all this, Mesos is able to offer resilience, scalability, and continuous processing to your streaming projects. This translates into a more efficient pipeline and a lower risk of issues in it. Naturally, its effect on the bottom line is similarly positive.
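For a Spark Streaming job, using Mesos is largely a matter of pointing Spark at the cluster through its master URL. The sketch below builds a `spark-submit` command in Python for client mode, assuming the Mesos masters are located via ZooKeeper. The helper names, the host names, and the `streaming_job.py` file are hypothetical; the `mesos://zk://` URL form and the `spark.cores.max` setting come from Spark's own documentation on running on Mesos, and should be checked against the Spark version you use.

```python
def mesos_master_url(zk_hosts, zk_path="/mesos"):
    """Build a Spark master URL that finds the leading Mesos master via ZooKeeper."""
    return "mesos://zk://" + ",".join(zk_hosts) + zk_path


def spark_submit_command(app, zk_hosts, max_cores):
    """Return the spark-submit argument list for running `app` on Mesos in client mode."""
    return [
        "spark-submit",
        "--master", mesos_master_url(zk_hosts),
        "--deploy-mode", "client",
        # Cap the cores this job may take so other frameworks keep their share.
        "--conf", f"spark.cores.max={max_cores}",
        app,
    ]


if __name__ == "__main__":
    # Hypothetical ZooKeeper hosts and application file, for illustration only.
    cmd = spark_submit_command(
        "streaming_job.py",
        ["zk1.example.com:2181", "zk2.example.com:2181"],
        max_cores=4,
    )
    print(" ".join(cmd))
```

Capping the cores is a deliberate choice in this sketch: a long-running streaming job that takes every offer it receives would leave little room for the other tools the cluster is meant to share.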

Next Steps

Data Science Partnership (DSP) stays on top of developments like this one, as well as many other technologies related to data science. DSP can help you fill the gap in Mesos expertise in your organization by providing the right human resources, contract-based or in-house, to undertake the implementation of such a system. Feel free to contact Chris Wright at cwright@dsp.ai to learn more about how DSP can help your organization take advantage of this promising data science technology.


Zacharias Voulgaris

Zach is the Chief Technical Officer at Data Science Partnership. He studied Production Engineering and Management at the Technical University of Crete, shifted to Computer Science through a Master's in Information Systems & Technology (City University of London), and then to Data Science through a PhD on Machine Learning (University of London). He has worked at Georgia Tech as a Research Fellow, at an e-marketing startup in Cyprus as an SEO manager, and as a Data Scientist at both Elavon (GA) and G2 (WA). He was also a Program Manager at Microsoft, on a data analytics pipeline for Bing.

