Monday, December 7, 2015

How can I use docker to containerize my data analytics app: A general overview

I recently answered a question on Quora about Dockerizing a data analytics application, which prompted me to start a series of posts on Docker.

This is the first post in that series. Unlike most series, I will spend very little time on the preliminary details of Docker and its purpose. Instead, I will dive straight into the design and architecture aspects of dockerizing a system.

First things first
Docker is a very lightweight application engine that deploys VM-like containers. Containers share system-level resources, which allows easy deployment and multi-tenancy, but each one has its own network and process space, as well as a layered union mount file system. It is all written in Go. The three main components are:
·        docker client,
·        docker daemon or server (REST API),
·        docker containers.
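
To make the client/daemon split concrete, here is a minimal sketch of a client talking to the daemon's REST API directly. It assumes a Linux host where the daemon listens on the Unix socket /var/run/docker.sock (a common default, but your setup may differ); the class and function names are made up for illustration.

```python
import http.client
import json
import socket

# Assumed location of the daemon socket; adjust for your host.
DOCKER_SOCKET = "/var/run/docker.sock"


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that goes over a Unix domain socket instead of TCP."""

    def __init__(self, socket_path):
        # The host name is only used for the Host header here.
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self.socket_path)
        self.sock = sock


def list_containers(socket_path=DOCKER_SOCKET):
    """Ask the daemon for its running containers (GET /containers/json)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", "/containers/json")
        response = conn.getresponse()
        body = response.read()
        if response.status != 200:
            raise RuntimeError(f"daemon returned HTTP {response.status}")
        return json.loads(body)
    finally:
        conn.close()
```

On a host with a running daemon, and with permission to read the socket, calling list_containers() should return a list of dictionaries, one per running container. The docker client is essentially doing this kind of request on your behalf.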

How big is the docker container?
The docker container rootfs (loosely speaking, the operating system FS layer) and tmpfs can be of any size, depending on the service you are dockerizing. A small Python app can be under 1MB, while a full-blown service can range around 16G or whatever it is configured to be.
The docker image can be a few hundred megabytes, if that is what you are asking. Usually it takes fractions of a second to launch a container from an image.
Can I put a data analytics product in it?
Like I said, yes, you can. But let's break the problem down here.
·        Services - Docker centers around Service Oriented Architecture, or SOA. How will you reorganize your application into micro-level, self-sufficient services that can communicate with each other? Let's say you have a web app. You need the web engine server (the one hosting your WARs) to be dockerized, and there are plenty of examples on the internet of how to do this. You surely have a database instance, and that can go in a container too. Then say you have a few daemons running for something: you need to make a call on where to put them. In short, the key design principle is to identify the services to dockerize. Then maybe start by writing your own Dockerfile for one component and get the ball rolling from there.
·        Networking - Docker solves port conflicts in multi-tenancy by mapping ports dynamically. Each docker container has configurable, statically mapped ports exposed to the user that map to physical ports on the host (a process abstracted by docker). Docker containers also have IPs assigned that are not discoverable outside the host. If your services are not colocated on one host, you might also need the host IPs, or you might need to configure the docker containers with unique discoverable IPs.
·        The data - Docker does not work with all filesystems, so depending on how your files and other data are stored, this might become a tall order. But in general, you can expose a volume on the host, or even dockerize a volume, and make it available to dockerized services. So yes, data can be ported too.
·        Handling - You might want a resource manager like YARN to allocate containers. ZooKeeper or Consul can take care of failover. Consul has built-in support for configuration management too.
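
To tie the services, networking and data points together, here is a rough sketch in Python that describes each service once and turns it into a `docker run` command with port mappings (-p) and host volumes (-v). The service names, the host path and the choice of the tomcat and postgres images are assumptions for illustration only, not a recommendation for your stack.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """One dockerized service of the application (hypothetical structure)."""
    name: str
    image: str
    ports: dict = field(default_factory=dict)    # host port -> container port
    volumes: dict = field(default_factory=dict)  # host path -> container path


def docker_run_command(service):
    """Build the `docker run` argument list for a service."""
    cmd = ["docker", "run", "-d", "--name", service.name]
    for host_port, container_port in service.ports.items():
        cmd += ["-p", f"{host_port}:{container_port}"]
    for host_path, container_path in service.volumes.items():
        cmd += ["-v", f"{host_path}:{container_path}"]
    cmd.append(service.image)
    return cmd


# Illustrative layout: a web tier and a database with its data on the host.
services = [
    Service("analytics-web", "tomcat", ports={8080: 8080}),
    Service(
        "analytics-db",
        "postgres",
        ports={5432: 5432},
        volumes={"/srv/analytics/pgdata": "/var/lib/postgresql/data"},
    ),
]

if __name__ == "__main__":
    for service in services:
        print(" ".join(docker_run_command(service)))
```

The script only prints the commands, so you can review them before running anything. Keeping the service list in one place like this is also a convenient starting point if you later hand allocation over to a resource manager.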