Sep 7, 2014

Virtualization and Hadoop: Whys, Whats and Hows

The idea of this article came after a series of conversations with Hortonwork’s Adam Muise and Aaron Weibe, reflecting on my own experience and things I’ve heard from some other Hadoop practitioners.

Let me start with an outrageous statement: I believe that in the Big Data context Virtual Machines offer very little or no benefits beyond improved security.

Let’s start with looking at the reasons why people use virtualization at all, then talk about security aspects, virtualization techniques and then I’ll try to summarize this experience along with some practical recommendations.

Virtualization: The Why

1. Resource Utilization

Underused CPU and memory resources are a direct loss. The idea is that you can cramp disparate services into the same physical box to have better chances that it’s being used all the time. While utilization used to be a 100%-valid concern with Hadoop-1, nowadays it’s not. YARN addresses this problem, and if you stay within the YARN framework, no additional technology is required. Another aspect of this problem is cross-domain resource utilization. The question is: if I use my Hadoop-2 cluster only one week a month, can I run something else than Hadoop on part of the nodes the rest of the time? YARN is obviously of no help in this situation, so a form of virtualization would be required.

2. Deployment Convenience

Instead of reconfiguring a system each time a new application is installed, it’s possible to just boot different pre-built OS images as virtual machines. When its mission is accomplished, a virtual machine gets discarded and the physical host is ready for running a new app. This is invaluable in a development environment. Consider a situation when you need to try SparkR, then say H2O on Hadoop, then some another weird machine learning library, etc. Many of these things would require installing dependencies. How to manage all this stuff? Even though both Cloudera and Hortonworks offer cluster management tools, those are — putting it mildly — not extremely helpful with OS dependency management. At this moment virtualization is still the best way to achieve the button-push deployment experience.

3. Security

One of the most lucrative promises of virtualization is zero-maintenance process isolation. If your processes run in different physical boxes, they cannot poke each other’s eyes, period. It seems that we can reproduce this effect (to certain extent) by running processes in virtual machines. Some people might say that Hadoop is integrated with Kerberos, so what’s the problem? Kerberos is simply not enough as the apps run on the same hardware, sharing the same memory, file handles and buffers of the same instance of an OS kernel. Besides, if you need multitenancy, only true virtualization is the answer.

Security is Isolation

Let’s gloss over different levels of isolation and their corresponding threat models.

0. No isolation

This assumes a fully cooperative security model. If somebody, even not maliciously, does something bad or just stupid, everybody gets hurt. There are no security threats in this world.

1. Imposed isolation

That’s what you see in most operating systems. You have got users, resources and permissions. If you’re lucky, you get Access Control Lists (ACLs). Access to shared resources is limited according to the permission settings. This type of isolation is enough when you assume no serious insider threats. That is, you kinda can hide things from people, and most of them will not violently try to break the locks. Think of lockers in a gym, this is it.

2. Implicit isolation

Imagine reinforced concrete walls between different security compartments. Resources are not only shared, they are separated by design — ideally, at the physical level. All users are adversaries and as such are considered security threats. This level of isolation is a requirement in public multitenancy systems.


Now let’s briefly look at different virtualization techniques and their suitability for Hadoop.

0. Bare metal

Boot to Hadoop. This level does not exist in reality, I made it up. When I tell people this idea, they typically go like “this might be interesting, but it’s not feasible”. Anyways, with bare-metal Hadoop the goal is not to eliminate the operating system completely, but rather to make the barrier between the hardware and the app as thin as possible. That would mean replacing some system-level services with Hadoop-specific ones and possibly bringing things like HDFS and YARN down to the OS Core. The benefit is obvious – uncompromised performance. The drawback is that it can be an unsurmountable task. There’s no isolation at this level, which means no multitenancy at all.

1. Operating System

An OS is an abstraction of hardware, so to some extent it’s a valid form of virtualization. At any point of time you can wipe out an OS with all the apps and install it over again on the same hardware. At this level you get some security, very simplistic user-based isolation model and reasonably good performance. This is what people typically mean when they say “bare metal”. Although possible, isolation at the OS level is absolutely not adequate to public multitenancy. Shared hosting lessons were learned and nobody is doing it anymore.

2. Containers

Solaris Zones, LXC-derivatives, etc. The idea is to isolate process groups at the OS kernel level, effectively creating security compartments. Such a compartment, along with its associated resources forms a container. Containers are assumed a reasonably safe technology, although their fundamental isolation level — and this is important — is the same as permission-based security. This is so, because containers, in essence, are a mechanism of restricting access to shared resources. Containers are susceptible to all the same attacks as regular OS users. Although containers are currently not popular in the Hadoop world, things are changing and some Hadoop vendors directly recommend using containers instead of Virtual Machines.

3. Virtual Machines

Instead of fiddling with OS-level security mechanisms, you basically say “OK, lets create a virtual machine with a virtual CPU and totally separate memory within my host machine, and run the app inside that VM”. Because resources are [mostly] not shared, VMs achieve much better level of isolation. Of course, it’s more complicated than that and there are at least three different types of VM-based virtualization, but roughly this is the idea. Virtual Machines is what you use when you run Hadoop on AWS or Rackspace.

So, why not use Virtual Machines all the time? There are two major problems with this approach:

  • VM’s are expensive. Expect 10-15%% CPU overhead comparing to “bare metal” Hadoop.

  • VM’s are unpredictable, especially when running in parallel. The same task would take 30 minutes today and 45 minutes tomorrow simply because of all the funny ways the VM’s OS interacts with the host OS.

  • If more than one VM is running on the same box, the IO performance will suffer because of the host OS I/O system overload. And it will hit hard because Hadoop is I/O-bound almost all the time.

Use cases and recommendations

Here is what I’ve seen working or heard from other people that it works.

Safe environment, stable load

It’s simple and straight — go bare metal. You will not be sorry about that, seriously. If you need another security compartment, build another cluster.

Safe environment, varying load

Either rely on YARN for resource management or consider using containers, especially if you ever need to switch node roles. Use one container per physical box to not harm the performance.

Unsafe environment

Only Virtual Machines are up to the task. If you cannot trust your users, simply don’t use anything except VMs. Prepare to pay the performance tax.

Containers: What to Use

Unfortunately, neither Hortonworks nor Cloudera currently offer a turn-on container-based solution. Fortunately, it’s not that difficult to build from scratch. A quick search would bring a few Hadoop-on-Docker projects such as docker-hoya.

I don’t see why it should not be possible to achieve the same with private cloud solutions such as RedHat’s Openshift or ActiveState’s Stackato.


We looked at some basic virtualization concepts in the Big Data context. Full virtualization technologies should be avoided if possible. Containers play nicer with Hadoop than VMs, although there are not so many out of the box Hadoop containerization solutions.

Nov 9, 2013

Installing SBCL 1.1+ on RHEL/CentOS systems

The version of SBCL available on RedHat Enterprise Linux 6.4 (and CentOS) is 1.0.38, which is quite old. If your project requires a newer SBCL, it has to be installed manually.

Although offers some Linux binaries, those are incompatible with RHEL/CentOS 6.4. Compiling from the sources, unfortunately, is the only option.

This tutorial assumes a 64-bit system (x86_64). Compiling SBCL on a 32-bit platform might or might not work — I never tried it.

  1. The first step is to make sure you can compile programs:

    sudo yum groupinstall "Development Tools"
  2. Then enable EPEL, this is necessary for the next step:

    sudo rpm -Uvh epel-release-6*.rpm 
  3. Now, let’s install the old SBCL. We need it because SBCL’s Lisp compiler is written in Lisp, so it requires a working Lisp compiler to compile itself. This older SBCL binary can be safely removed later.

    sudo yum install -y sbcl.x86_64
  4. Download SBCL source code. At the time of writing this post the latest version was 1.1.13:

    tar xfj sbcl-1.1.13-source.tar.bz2
    cd sbcl-1.1.13
  5. Compile the sources. Expect to see a lot of diagnostic messages.

  6. Install the compiled binary. The warnings about missing doc directory can be safely ignored. By default, the binary is installed in /usr/local/bin:

    sudo sh
  7. Make sure it works. You should see “SBCL 1.1.13” in the response:

    sbcl --version
  8. Remove the old SBCL:

    sudo yum remove -y sbcl
  9. Optional: install Quicklisp. This is not strictly necessary, but having a CPAN-like Lisp package manager around will definitely make your life easier:

    sbcl --load quicklisp.lisp \
         --eval '(quicklisp-quickstart:install)' \
         --eval '(ql:add-to-init-file)' \
         --eval '(quit)' 

Enjoy your new SBCL.

Aug 29, 2013
"Catching Salmon" by Daniela Ivanova -

"Catching Salmon" by Daniela Ivanova -

Jun 27, 2013

MySQL Connector: inherited transactions

This is what I have noticed today: if a process opens an MySQL connection and then forks, the child process not just inherits the open connection, but also the transaction state. The current transaction becomes shared between the child and the parent. That is, if the child process rolls back, the parent also gets a roll back. Also, as it is the same transaction, a lock set by one process has no effect on another.

Here is a proof of concept:

Create and populate a database before running this script:

create database mytest;
grant all on mytest.* to ''@'localhost';
flush privileges;
create table foo(a int);
insert into foo (a) values (0);

import time
from multiprocessing import Process
import _mysql

reconnect = False  # change to true to make the child process block (it should)

conn = _mysql.connect("localhost", user="mike", db="mytest", passwd="")

def sub():
    if reconnect:
        sub_conn = _mysql.connect("localhost", user="mike", db="mytest", passwd="")
        sub_conn = conn
    print "SUB: start", sub_conn.thread_id()
    print "SUB: do this to get the number of connections -> sudo lsof | grep mysql.sock"
    sub_conn.query('select * from foo for update')
    if not reconnect:
        print "SUB: NOT BLOCKED, sleeping for 30 sec to hold the conneciton open"
    print "SUB: result", sub_conn.use_result().fetch_row()
    print "SUB: end"

print "HOST: start", conn.thread_id()
conn.query('select * from foo for update')
print "HOST: result", conn.use_result().fetch_row()

process = Process(target=sub)
print "HOST: start sub"
print "HOST: sub joined"

print "HOST: end"

When reconnect is set to False, the parent’s thread id will be the same as in the child. The reason why is that MySQL uses server-side thread ids as connection identifiers. Here’s the mysql_thread_id function (mysql-connector-c-6.1.0-src/libmysql/libmysql.c:1070):

ulong STDCALL mysql_thread_id(MYSQL *mysql)
  return (mysql)->thread_id;

And this is how it is set in CLI_MYSQL_REAL_CONNECT (mysql-connector-c-6.1.0-src/sql-common/client.c:3613):

server_version_end= end= strend((char*) net->read_pos+1);

The direct consequence is that children processes, created for example using the multiprocessing module, must close the inherited MySQL connections and then reopen them to avoid surprises.

When I discovered it, I immediately thought about Django management commands splitting workload between children.

Open questions:

  1. Are Celery tasks affected by this? — probably yes.
  2. What happens when two processes sharing a transaction update data at the same time?
Jun 6, 2013

Multimethods in Python

So, PEP-443 aka Single-dispatch generic functions has made it into Python. There is a nice writeup of the singledispatch package features by Łukasz Langa.

Although I’m glad that Python is evolving in the right direction, I can’t see how single dispatch alone could be enough. In essence, PEP-443 defines a way of dynamically extending existing types with externally defined generic functions. Which is nice, of course, but too limited.

What is really interesting is multiple dispatch. There are a few packages bringing multimethods to Python; all of them are overcomplicated to my taste.

Here’s my take on it. I will not talk much, better show you the code.

This is the complete implementation:


import operator
from collections import OrderedDict

class DuplicateCondition(Exception): pass

class NoMatchingMethod(Exception): pass

class defmulti(object):
    def __init__(self, predicate):
        self.registry = OrderedDict()
        self.predicate = predicate

    def __call__(self, *args, **kw):
        method = self.dispatch(*args, **kw)
        return method(*args, **kw)

    def dispatch(self, *args, **kw):
        for condition, method in self.registry.items():
            if self.predicate(condition, *args, **kw):
                return method
        return self.notfound

    def notfound(self, *args, **kw):
        raise NoMatchingMethod()

    def when(self, condition):
        if condition in self.registry:
            raise DuplicateCondition()
        def deco(fn):
            self.registry[condition] = fn
            return fn
        return deco

    def default(self, fn):
        self.notfound = fn
        return fn

    def typedispatch(cls):
        return cls(lambda type, first, *rest, **kw: isinstance(first, type))

And here’s how to use it:

import types
from multidispatch import defmulti, NoMatchingMethod

# Exhibit A: Dispatch on the type of the first parameter.
#            Equivalent to `singledispatch`.

cupcakes = defmulti.typedispatch()

def str_cupcakes(ingredient):
    return "Delicious {0} cupcakes".format(ingredient)

def int_cupcakes(number):
    return "Integer cupcakes, anyone? I've got {0} of them.".format(number)

def any_cupcakes(thing):
    return ("You can make cupcakes out of ANYTHING! "
            "Even out of {0}!").format(thing)

print cupcakes("bacon")
print cupcakes(4)
print cupcakes(cupcakes)

# Exhibit B: dispatch on the number of args, no default

def jolly(num, *args):
    return len(args) == num

def single(a):
    return "For {0}'s a jolly old fellow!".format(a)

def couple(a, b):
    return "{0} and {1} are such a jolly couple!".format(a, b)

print jolly("Lukasz")
print jolly("Fish", "Chips")
    jolly("Good", "Bad", "Ugly")
except NoMatchingMethod:
    print "Noo! Angel Eyes!"
Apr 17, 2013

I will never CNAME my root domain again.

I will never CNAME my root domain again.

I will never CNAME my root domain again.


Apr 13, 2013

Hello Tumblr

Good bye, Posterous. Rest in peace.

Jun 6, 2012

A simple callback chain macro for elisp

The Problem

As usual, it started with a tiny piece of ugly code:

(bd-create-stage datafile-id
                 (lambda (stage-id)
                   (bd-insert-rows stage-id 
                                   [[10 20 30] [40 50 60]]
                                   (lambda (stage-id)
                                     (bd-commit-stage stage-id 

The snippet above is basically a callback chain. When bd-create-stage finishes its work, it calls the first lambda, which calls bd-insert-rows with the second lambda as its callback argument and so on, until it all stops at the ignore function.

I wanted to rewrite it as something like this:

(=> datafile-id
    (bd-create-stage it next)
    (bd-insert-rows  it [[1 2 3 4 5] [6 7 8 9 0]] next)
    (bd-commit-stage it next))

Where the it variable would represent the current callback’s parameter and next would refer to the next callback in the chain. As with the -> macro, I wanted explicit anaphoric variables.

The Idea

Each line in the snippet above could be wrapped in a lambda, lust like this:

(=> datafile-id
    (lambda (next it)
      (bd-create-stage it next))
    (lambda (next it)
      (bd-insert-rows it [[1 2 3 4 5] [6 7 8 9 0]] next)
    (lambda (next it)
      (bd-commit-stage it next))))

Then it should somehow call each function in the list with the consequent function as the first parameter and the result of execution of the previous function as the second parameter.

The Solution

This function chaining thing looks a lot like a binary function fold:

(defun chain2 (f1 f2)
  (apply-partially f1 f2))

(defun chain (&rest fns)
  (if fns
      (reduce #'chain2 fns :from-end t)

Applying chain to a function list creates a new function taking one parameter and passing it through the whole function list, much like the -> macro does.

In fact, this is enough to start working on the macro.

(defmacro => (initial &rest forms)
  `(funcall ,(build-form-chain forms) ,initial))

The build-form-chain function wraps each form into a lambda and then chains them together:

(defun build-form-chain (forms)
  `(apply #'chain 
          (list ,@(mapcar #'build-form-link forms) #'ignore)))

At the end it adds ignore as a terminator. The terminator is necessary because the last callback’s result is almost always ignored.

The build-form-link's implementation is trivial:

(defun build-form-link (form)
  `(lambda (next it) ,form))

Done! Here’s the full source for your convenience:

(defun chain2 (f1 f2)
  (apply-partially f1 f2))

(defun chain (&rest fns)
  (if fns
      (reduce #'chain2 fns :from-end t)

(defun build-form-link (form)
  `(lambda (next it) ,form))

(defun build-form-chain (forms)
  `(apply #'chain 
          (list ,@(mapcar #'build-form-link forms) #'ignore)))

(defmacro => (initial &rest forms)
  `(funcall ,(build-form-chain forms) ,initial))

Now let’s see how the macro expands:

ELISP> (macroexpand
     '(=> datafile-id
          (bd-create-stage it next)
          (bd-insert-rows  it [[1 2 3 4 5] [6 7 8 9 0]] next)
          (bd-commit-stage it next)))

(funcall (apply (function chain) 
                (list (lambda (next it) 
                        (bd-create-stage it next))
                      (lambda (next it)
                        (bd-insert-rows it [[1 2 3 4 5] [6 7 8 9 0]] next))
                      (lambda (next it)
                        (bd-commit-stage it next))
                      (function ignore)))

Exactly as intended.

This macro covers 95% of my callback chaining needs. For the rest 5% there is the all-powerful deferred.el library.

May 9, 2012

Thread operator in Elisp


ELISP> (-> 1
           (+ 2 it)
           (* 3 it))
ELISP> (macroexpand
           '(-> 1
                (+ 2 it)
                (* 3 it)))
(let* ((it 1) (it (+ 2 it)) (it (* 3 it))) it)


(defmacro -> (arg &rest forms)
  `(let* ((it ,arg) .
      ,(mapcar (lambda (form) `(it ,form))

The Long Story

When I see code like this, I frown:

(defun bd-search (api-key query callback)
  (send-request "GET"
            (format "search?%s"
                (make-query-string `(("api_key" . ,api-key)
                             ("query" . ,query))))

It’s a very simple case, yet the parameter list is already at the fourth level of indentation. When it gets really ugly I usually wrap the whole thing into a let statement and start moving inner parts into variables.

What I have noticed, however, is that almost always constructs like this are sequential by their nature, in other words the output of the innermost statement serves as input for the statement one level up, and so on and so forth. This is the very reason why Clojure had its thread operator macro since beginning.

Remembering that, I started literally morphing my bd-search function into something more prettier. I came up with this variant:

  (-> `(("api_key" . ,api-key)
    ("query" . ,query))
      (make-query-string it)
      (format "search?%s" it)
      (send-request "GET" it callback)))

Then I put together the -> macro and that was it.

I decided to make the macro anaphoric instead of implicitly injecting an extra parameter as in Clojure. This allowed me to put the threaded parameter at any place, not just at the beginning or at the end of the parameter list.

Aug 27, 2011

How much can be done in four hours

Today I had an awesome day at the first OpenDataBC hackathon which took place at Mozilla Labs Vancouver.

Tara Gibbs pitched this wonderful idea of consolidating shelter availability data and displaying it on a few window displays, so the homeless people living DTES would not waste their time going from one shelter to another just to find a free spot.

This doesn’t solve all the problems of course, but it does solve a little yet very annoying one.

So… At 11:30 we had nothing but an idea. We discussed possible approaches for a while, then came David Eaves and suggested using Twitter as a message queue service.

At approximately 12:00 we still had nothing but a piece of paper covered with boxes and arrows, then we started coding. Tara did the frontend, I was busy hacking the backend and the Twitter stuff.

Four hours later we had a fully functional, production ready system -

How it is supposed to work:

  1. Shelters tweet their availability data (they all have internet access)
  2. VanShelter monitors — each of them independently — receive Twitter updates and
  3. Refresh their displays when something changes.

For displays we can use cheap LCD monitors, probably even donated. The software will run on those amazing Raspberry thingies -, $25 each. This brings the full cost of installing 10 displays down to $250+.

Thank you Tara and David. Also, thank you Jeff and all the people who made this hackathon possible.

« To the past Page 1 of 2