Saturday, March 18, 2017

Classifying Dogs vs Cats on a Regular Laptop with 2GB GPU and 90% Accuracy

The machine learning ecosystem has evolved a lot in recent years.
I am amazed that I could run a fairly sophisticated experiment, classifying dogs vs cats with 90% accuracy, on my regular laptop.
It has an NVIDIA GPU with 2GB of memory and 8GB of RAM.
As recently as 2012, the state-of-the-art result for dogs vs cats classification was 80%.

I ran it based on an excellent course provided by (
The competition is organized by Kaggle:

Here's an overview of the approach taken to achieve 90% accuracy.
First, retrieve VGG16, a publicly available model that researchers prepared for the ImageNet image recognition competition. Then remove its last layer and replace it with a two-class (yes/no) layer for recognizing cats vs dogs. The remaining layers are set as non-trainable. Finally, run the training process on the resulting model.

The main library used here is Keras with the TensorFlow backend.

Full code is available on website. Here is an overview of the most important parts.
Training code:

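import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
# Register a TensorFlow session built from this config with Keras
set_session(tf.Session(config=config))

# Import our class, and instantiate
import vgg16; reload(vgg16)
from vgg16 import Vgg16
vgg = Vgg16()

batch_size = 16  # assumed value; pick one that fits in GPU memory
path = "data/dogscats/"
#path = "data/dogscats/sample/"
batches = vgg.get_batches(path+'train', batch_size=batch_size)
val_batches = vgg.get_batches(path+'valid', batch_size=batch_size)
# Replace the last layer, train for one epoch, then save the weights
vgg.finetune(batches)
vgg.fit(batches, val_batches, nb_epoch=1)
vgg.model.save_weights('vgg2.h5')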

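The code uses the vgg.finetune call to replace the last layer of the model. Here's what it looks like:

        model = self.model
        model.pop()  # remove the original output layer
        for layer in model.layers: layer.trainable=False  # freeze the remaining layers
        model.add(Dense(num, activation='softmax'))  # new output layer with num classes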

Next, it trains the model and saves the resulting weights to the vgg2.h5 file.
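
The idea of freezing the pretrained layers and training only a new final layer can be illustrated without Keras. The following is a minimal pure-Python sketch, not the course's code: frozen_features is a hypothetical stand-in for the frozen VGG16 layers, the new "head" is a single logistic unit, and the data set is a made-up toy problem.

```python
import math
import random

def frozen_features(point):
    """Stand-in for the frozen pretrained layers: a fixed mapping that is never trained."""
    x, y = point
    return [math.tanh(x + y), math.tanh(x - y), 1.0]  # the constant 1.0 acts as a bias input

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, feats):
    """Output of the new two-class head for one feature vector."""
    return sigmoid(sum(w * f for w, f in zip(weights, feats)))

def log_loss(weights, data):
    """Mean cross-entropy of the head over (features, label) pairs."""
    total = 0.0
    for feats, label in data:
        p = min(max(predict(weights, feats), 1e-12), 1.0 - 1e-12)
        total -= math.log(p) if label == 1 else math.log(1.0 - p)
    return total / len(data)

def train_head(data, epochs=200, lr=0.5):
    """Full-batch gradient descent that updates only the head's weights."""
    weights = [0.0] * len(data[0][0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for feats, label in data:
            err = predict(weights, feats) - label
            for i, f in enumerate(feats):
                grad[i] += err * f / len(data)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights

rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
# Toy two-class labels; the features are computed once because the extractor is frozen
data = [(frozen_features(p), 1 if p[0] + p[1] > 0 else 0) for p in points]

initial_loss = log_loss([0.0, 0.0, 0.0], data)
head = train_head(data)
final_loss = log_loss(head, data)
accuracy = sum((predict(head, f) > 0.5) == (label == 1) for f, label in data) / float(len(data))
print("loss %.3f -> %.3f, accuracy %.2f" % (initial_loss, final_loss, accuracy))
```

Because the feature extractor never changes, only three numbers are trained here. The same reasoning is why fine-tuning a single layer on top of VGG16 is cheap compared to training the whole network.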

I had to make a few tweaks to the model's TensorFlow device placement so that it could fit in GPU memory. The last few layers were placed on the CPU. Here's the code:

        model = self.model = Sequential()
        model.add(Lambda(vgg_preprocess, input_shape=(3,224,224), output_shape=(3,224,224)))

        # The convolutional blocks run on the GPU
        with tf.device('/gpu:0'):
            self.ConvBlock(2, 64)
            self.ConvBlock(2, 128)
            self.ConvBlock(3, 256)
            self.ConvBlock(3, 512)
            self.ConvBlock(3, 512)

        # The final layers run on the CPU so that the model fits in GPU memory
        with tf.device('/cpu:0'):
            model.add(Dense(1000, activation='softmax'))

        fname = 'vgg16.h5'
        model.load_weights(get_file(fname, self.FILE_PATH+fname, cache_subdir='models'))

Here's the result of the training process:

23000/23000 [==============================] - 2103s - loss: 0.5482 - acc: 0.8676 - val_loss: 0.4194 - val_acc: 0.9060

The training process completed in 35 minutes with 90% accuracy on the validation set.
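
As a quick sanity check of those numbers, the log line can be converted into minutes and throughput. This is only arithmetic on the values printed above; the variable names are mine.

```python
samples = 23000   # images processed in the epoch, from the log line
seconds = 2103    # wall-clock time reported in the log line

minutes = seconds / 60.0
images_per_second = samples / float(seconds)

print("%.2f minutes, %.2f images per second" % (minutes, images_per_second))
```

That works out to roughly 35 minutes and about 11 images per second on this laptop.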

I'm very positively surprised that such powerful machine learning tools are available these days and can run on regular computers. Moreover, the approach presented in the course is very interesting and resembles the natural evolution of intelligence by adding new layers.

Sunday, March 13, 2016

Learning the Sine Function Using a Neural Network

Recently I stumbled upon an excellent demo of a 2 layer neural network written by Florian Muellerklein:

It is written in Python using numpy and focuses on digit recognition based on an sklearn dataset.
I decided to play around with it and add a visualization of the process of learning a sine function.
I used matplotlib to create the animation.

The neural network implementation is typical. It uses a standard gradient descent procedure with some optimizations such as momentum and regularization (and also random initialization).
If you are interested in understanding exactly how gradient descent works, I highly recommend an article by Matt Mazur:

The network architecture I used was as follows:

  • 1 input neuron (the x parameter of the sine function)
  • 60 hidden neurons
  • 1 output value (the function result)
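
To make the architecture concrete, here is a minimal pure-Python sketch of a 1-60-1 network trained on the sine function. It is not the code linked in this post: it uses plain full-batch gradient descent without momentum or regularization, a tanh hidden layer, and hyperparameters (learning rate, number of samples, number of epochs) that I picked for illustration.

```python
import math
import random

def forward(x, w1, b1, w2, b2):
    """Return (hidden activations, network output) for a single input x."""
    h = [math.tanh(w1[j] * x + b1[j]) for j in range(len(w1))]
    return h, sum(w2[j] * h[j] for j in range(len(w1))) + b2

def train_sine_network(hidden=60, epochs=300, lr=0.005, samples=32, seed=0):
    """Fit y = sin(x) on [-pi, pi]; returns (initial_loss, final_loss, predict)."""
    rng = random.Random(seed)
    xs = [-math.pi + 2 * math.pi * i / (samples - 1) for i in range(samples)]
    ys = [math.sin(x) for x in xs]

    # Random initialization of the weights
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]
    b2 = 0.0

    def loss():
        return sum((forward(x, w1, b1, w2, b2)[1] - y) ** 2 for x, y in zip(xs, ys)) / samples

    initial = loss()
    for _ in range(epochs):
        g_w1, g_b1, g_w2, g_b2 = [0.0] * hidden, [0.0] * hidden, [0.0] * hidden, 0.0
        for x, y in zip(xs, ys):
            h, out = forward(x, w1, b1, w2, b2)
            err = 2.0 * (out - y) / samples          # d(loss)/d(out)
            g_b2 += err
            for j in range(hidden):
                g_w2[j] += err * h[j]
                back = err * w2[j] * (1.0 - h[j] ** 2)  # backpropagate through tanh
                g_w1[j] += back * x
                g_b1[j] += back
        # Gradient descent step on every parameter
        for j in range(hidden):
            w1[j] -= lr * g_w1[j]
            b1[j] -= lr * g_b1[j]
            w2[j] -= lr * g_w2[j]
        b2 -= lr * g_b2
    return initial, loss(), (lambda x: forward(x, w1, b1, w2, b2)[1])

initial, final, predict = train_sine_network()
print("mean squared error: %.4f -> %.4f" % (initial, final))
```

Plotting predict(x) after every few epochs is essentially what the animation does.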

Here is a link to source code:

It converges pretty well. Here's an animation showing the convergence to the sine function over consecutive learning iterations (10 learning iterations per frame):

Machine learning is a fascinating domain that has been advancing rapidly in recent years, mainly in visual object recognition. I hope that it keeps this pace in the future (or even exceeds it!).

Saturday, May 16, 2015

Asymptotic Benchmark in Java

Most of the time when we analyze the performance of programs, we either use Big O notation or run performance tests that measure execution time for a single input size N.

Both of these methods have disadvantages.
Big O notation is theoretical (it is not verified against the code).
Time-based performance tests are not very reliable (it's hard to make assertions on them) and don't show the actual asymptotic function behind the programs.

Here I present a slightly different approach, which combines those 2 techniques.
It uses explicit instruction counting, samples programs at different input sizes and tries to guess the asymptotic function behind them.
It also plots the result as a chart using HTML and the Google Visualization JavaScript library.

Note that it's more of a toy for visualization than something one could use in the industry for real world projects.

First we start by implementing an example program (Binary Counter), which we want to measure. It'll extend a base class Benchmark, which looks like this (full source code is available on github:

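public abstract class Benchmark {
    private int instructionCount = 0;

    // Called by benchmark implementations for every counted instruction
    protected void incrementInstructionCount() {
        instructionCount++;
    }

    public void resetInstructionCount() {
        instructionCount = 0;
    }

    public int getInstructionCount() {
        return instructionCount;
    }

    // n is the input size supplied by the benchmark framework
    protected abstract Object run(int n);
}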

It has an abstract method "run", which takes the input size as an argument. The benchmark framework provides this value for each execution.
The Binary Counter implementation looks like this:

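public class BinaryCounter extends Benchmark {
    protected Object run(int n) {
        int[] counter = new int[64];
        for (int i = 0; i < n; i++) {
            int p = 0;
            // Clear the trailing ones, counting each bit reset
            while (counter[p] == 1) {
                counter[p++] = 0;
                incrementInstructionCount();
            }
            // Set the first zero bit
            counter[p] = 1;
            incrementInstructionCount();
        }
        return counter;
    }
}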

It is a 64-bit binary counter, which starts from 0 and is incremented by one N times. Theoretically it should run in O(n) time. We'll verify that.
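
The O(n) claim is an amortized bound: a single increment may clear many bits, but over n increments the total number of bit writes stays below 2n. The following Python sketch checks this, under the assumption that each bit write counts as one instruction. The function name count_bit_flips is mine and is not part of the framework.

```python
def count_bit_flips(n, width=64):
    """Increment a binary counter n times and count every bit that is written."""
    counter = [0] * width
    flips = 0
    for _ in range(n):
        p = 0
        while counter[p] == 1:   # clear the trailing ones
            counter[p] = 0
            p += 1
            flips += 1
        counter[p] = 1           # set the first zero bit
        flips += 1
    return flips

for n in (1000, 10000, 100000):
    flips = count_bit_flips(n)
    print(n, flips, flips / float(n))
```

Every increment sets exactly one bit, and a bit can only be cleared after it has been set, so the total is n sets plus fewer than n clears. The ratio of writes to n therefore approaches 2 from below.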

We need to run it through the framework using the following code:

public class Main {
    public static void main(String[] args) {
        BenchmarkRunner benchmarkRunner = new BenchmarkRunner();
        benchmarkRunner.addFormatter(new TextResultFormatter(new PrintWriter(System.out)));
        benchmarkRunner.addFormatter(new HtmlResultFormatter(new File("out")));

        for (Benchmark benchmark : new Benchmark[]{
                new BinaryCounter(),
                ) {
  , 1);

Here's the chart generated for Binary Counter along with guessed function behind it:

In this case the guessed function is linear, which is what we expected. For an input size of 5.6 million, the instruction count was 11 million.

The guessing process iterates over all samples and minimizes the Mean Squared Error.
It is implemented in the BenchmarkRunner class.
You can add more functions by implementing the Function interface. Here's an example of an N log N function:

public class NLogNFunction implements Function {
    public float eval(float x) {
        return (float) (Math.log(x) * x);
    }
}
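
The guessing step can be sketched in a few lines of Python. This is my own simplified reading of the idea, not the BenchmarkRunner code: for each candidate function f it assumes a model of the form c * f(n), fits the scale c by least squares, and picks the candidate with the smallest mean squared error. The names CANDIDATES and guess_function are hypothetical.

```python
import math

CANDIDATES = {
    "N": lambda n: float(n),
    "N log N": lambda n: n * math.log(n),
    "N^2": lambda n: float(n) * n,
}

def guess_function(samples):
    """samples: list of (input_size, instruction_count) pairs.

    Returns (name, scale, mse) of the best-fitting candidate."""
    best = None
    for name, f in CANDIDATES.items():
        # Least-squares scale for the model count = c * f(n)
        num = sum(f(n) * count for n, count in samples)
        den = sum(f(n) ** 2 for n, _ in samples)
        c = num / den
        mse = sum((c * f(n) - count) ** 2 for n, count in samples) / len(samples)
        if best is None or mse < best[2]:
            best = (name, c, mse)
    return best

# Made-up samples that grow like 2 * n, similar to the Binary Counter
linear_samples = [(n, 2 * n) for n in (10, 100, 1000, 10000)]
print(guess_function(linear_samples))
```

Sampling several input sizes matters here: with a single input size every candidate can be scaled to fit perfectly, so the guess would be meaningless.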

Here are some more benchmarks: MergeSort (expected asymptotic function is N log N) and NestedLoop (N^2).

Implementing reliable performance tests is still not a well-solved problem. I think that instruction counting is an idea one can try in real-world projects. It can be used to test the performance of isolated chunks of code within unit tests, and it gives predictable results.

Sunday, November 23, 2014

Handling requests Asynchronously in Java using Jersey 2.13 and Glassfish 4.1

The recent Jersey 2.x releases include a new API for asynchronous request processing, along with an excellent example, server-async-standalone-webapp.
In this article, I'll show how to run it and what results we can get by leveraging async processing.

First, we need to download Glassfish 4.1 Web profile (
Then start it using this command:

./asadmin start-domain

Next, we need to download the Jersey examples bundle and compile the server-async-standalone example by running 'mvn package'.

Then we can deploy it using the following command:

./asadmin deploy <path>/jersey/examples/server-async-standalone/webapp/target/server-async-standalone-webapp.war

Now we are ready to log in to the Glassfish Admin Console (localhost:4848) and see the status of the deployed application:

Next, we can run a client GUI application, which allows us to run tests against the server. Go to 'server-async-standalone/client' and run 'mvn exec:java'.
The application looks like below. I ran the sync vs async test with 100 requests and the response time improved from 20 secs to 1.2 secs.



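The code for sync is as follows:

    public String syncEcho(@PathParam("echo") final String echo) {
        try {
            // Simulate a long-running operation on the request thread
            Thread.sleep(SLEEP_TIME_IN_MILLIS);
        } catch (final InterruptedException ex) {
            throw new ServiceUnavailableException();
        }
        return echo;
    }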


And for async:

    private static final ExecutorService TASK_EXECUTOR = Executors.newCachedThreadPool();
    public void asyncEcho(@PathParam("echo") final String echo, @Suspended final AsyncResponse ar) {
        TASK_EXECUTOR.submit(new Runnable() {
            public void run() {
                try {
                    // Sleep on a background thread instead of the request thread
                    Thread.sleep(SLEEP_TIME_IN_MILLIS);
                } catch (final InterruptedException ex) {
                    ar.cancel();
                }
                ar.resume(echo);
            }
        });
    }

Note that in the Async example, the number of background threads is unbounded (TASK_EXECUTOR is a cached thread pool without a limit). So in reality it won't reduce the amount of resources (threads) consumed by the server in Async mode.
In Sync mode, each request holds an HTTP worker thread for the whole time it is being processed. The default number of HTTP worker threads in Glassfish is 5, so only 5 requests are handled at a time. This explains why we get a response time of around 20 secs for 100 requests. Here is one of those worker threads sleeping in syncEcho:

"http-listener-1(4)" daemon prio=6 tid=0x000000000cc7d000 nid=0xcf8 waiting on condition [0x00000000110ad000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.glassfish.jersey.examples.server.async.LongRunningEchoResource.syncEcho(
at sun.reflect.GeneratedMethodAccessor437.invoke(Unknown Source)
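
The 20 second figure can be reproduced with a small scheduling simulation. This Python sketch assumes that each request blocks its worker thread for 1 second, which is consistent with the measurements but is my assumption, not something stated above. The function total_time is a hypothetical helper.

```python
import heapq

def total_time(requests, workers, service_time):
    """Time to finish all requests when each one blocks a worker for service_time."""
    free_at = [0.0] * workers            # min-heap: the time at which each worker is free
    heapq.heapify(free_at)
    for _ in range(requests):
        start = heapq.heappop(free_at)   # the earliest available worker takes the request
        heapq.heappush(free_at, start + service_time)
    return max(free_at)

print(total_time(100, 5, 1.0))     # sync mode: 5 HTTP worker threads
print(total_time(100, 100, 1.0))   # one thread per request, as with the cached thread pool
```

With 5 workers the requests are served in 20 consecutive batches. With one thread per request they all sleep in parallel, which is in line with the 1.2 secs measured in Async mode once overhead is added.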

Note that this is just an example, which illustrates how to enable Async processing. In real-world scenarios there can be a third-party service that is invoked using Jersey Client on the server side. If that service slows down, the whole HTTP thread pool on the server side can be exhausted, which prevents other requests from being processed.
The suggested solution in this case would be to switch to the Async Jersey Client and an Async service implementation.

Async server processing can also help when there are many slow clients, for example 20k mobile clients that send payloads in chunks with 1 second delays. This scenario can easily bring down Sync server-side implementations.

Async processing is gaining momentum on the server side nowadays. There are multiple technologies that enable it, among them NodeJS, Netty, Akka and Play Framework. Jersey Async processing is one of them.

Saturday, July 5, 2014

Implementing a Distributed Counter Service Using Hazelcast, Jersey 2 and Guice

Modern web services are often required to scale horizontally in order to handle growing load.
Here I present an example distributed Counter service, which uses a modern technology stack consisting of Hazelcast, Jersey 2 and Guice.

The sample is based on an excellent example posted by Piersy and can be downloaded from here:

The counter is just a value shared across all participating nodes within a cluster. All nodes can query it and increase its value by a specified delta. Note that Hazelcast handles synchronization within the distributed environment. In order to update the state atomically, we use a Hazelcast EntryProcessor.

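public class CounterService {
    // 'map' is the Hazelcast IMap<String, Integer> that holds the counter

    public int increase(final int delta) {
        // Runs the processor atomically on the node that owns "counterKey"
        return (Integer) map.executeOnKey("counterKey", new CounterEntryProcessor(delta));
    }
}

public class CounterEntryProcessor implements EntryProcessor<String, Integer>, EntryBackupProcessor<String, Integer> {
    private final int delta;

    public CounterEntryProcessor(int delta) {
        this.delta = delta;
    }

    public Integer process(Map.Entry<String, Integer> entry) {
        int newValue = entry.getValue() + delta;
        entry.setValue(newValue);  // store the updated counter in the map
        return newValue;
    }

    // getBackupProcessor() and processBackup(), required by the interfaces, are not shown
}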
The service is just a Guice singleton. It has an operation called 'increase', which takes a delta and creates an EntryProcessor job. Hazelcast submits the job to the node that owns the value at that time, and that node updates it atomically.
Hazelcast is a library that implements Java collections in a distributed fashion. It handles replication, cluster membership and distributed locking.
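
To show why the read-modify-write has to run as one atomic step, here is a single-process Python analogy. It is not Hazelcast: LocalMap is a hypothetical stand-in whose execute_on_key method runs a function on the current value under a lock, which mimics the guarantee the entry processor gives for a key.

```python
import threading

class LocalMap(object):
    """Single-process stand-in for a map that can run a function atomically on a key."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def execute_on_key(self, key, processor):
        # The whole read-modify-write happens under one lock, so no update is lost
        with self._lock:
            new_value = processor(self._data.get(key, 0))
            self._data[key] = new_value
            return new_value

def increase(counter_map, delta):
    return counter_map.execute_on_key("counterKey", lambda value: value + delta)

def run_concurrent_increments(threads=8, per_thread=1000):
    """Increase the counter from several threads and return its final value."""
    counter_map = LocalMap()

    def worker():
        for _ in range(per_thread):
            increase(counter_map, 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return increase(counter_map, 0)   # a delta of 0 just reads the value

print(run_concurrent_increments())
```

If the callers instead read the value, added the delta locally and wrote it back, two concurrent callers could overwrite each other's update. Sending the operation to the data avoids that.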

The web service is a simple JAX-RS REST resource. The following code invokes the Guice service layer from the resource implementation:

public class CounterResource {

    private CounterService service;

    public CounterResource(CounterService service) {
        this.service = service;
    }

    public String increase(String delta) {
        return "" + service.increase(Integer.parseInt(delta));
    }
}

We also have a test case, which exercises the whole stack by sending a request to the service. The whole stack is very lightweight. It runs on Jersey with embedded Tomcat. The test case completes within a few seconds.

In order to run the example, you need to build the distribution first, then start nodes in separate shells and send POST requests with curl to manipulate the counter state. Here's a sample interaction:

./ 8090
./ 8092
Jul 05, 2014 10:20:06 AM com.hazelcast.cluster.ClusterService
INFO: []:5702 [dev] [3.2.3]

Members [2] {
        Member []:5701
        Member []:5702 this

Jul 05, 2014 10:20:08 AM com.hazelcast.core.LifecycleService
INFO: []:5702 [dev] [3.2.3] Address[]:5702 is STARTED

$ curl -d '1' -H 'Content-Type: text/plain;' http://localhost:8092/webapp/api/counter
$ curl -d '1' -H 'Content-Type: text/plain;' http://localhost:8090/webapp/api/counter
$ curl -d '5' -H 'Content-Type: text/plain;' http://localhost:8090/webapp/api/counter

I think that Hazelcast is a very good step forward in distributed computing. For an everyday programmer, it provides a set of primitives for implementing a distributed system. It is very easy to integrate into an existing project, whether that is a JavaEE or a standalone app.

Saturday, June 14, 2014

Running hadoop 2.2.0 wordcount example under Windows

Recently I went to Hadoop Summit in San Jose. The conference was quite interesting (excluding a few boring talks). I found out that HortonWorks is pushing hard to bring Hadoop into the enterprise environment with Hadoop 2.x and YARN. I love this idea, since there seems to be no good standard for distributed containers in Java these days (forget about JEE clustering).

Surprisingly enough, it looks like Hadoop 2.2.0 is supported natively on Windows, which IMO is a great achievement and a sign of the platform getting more mature.
In this article I show how to run a simple WordCount example in Hadoop 2.2.0 under Windows.

First of all, you need to compile the Hadoop 2.2.0 distribution, which takes a lot of time (and sometimes requires tweaking the pom files). I uploaded a precompiled version here. You need to edit the Windows environment variables: add the bin directory to the path and set a HADOOP_HOME variable pointing to the installation directory.

Then you need to format the namenode and run the example. The following commands do that:

E:\test>hdfs namenode -format


starting yarn daemons

E:\test>hdfs dfs -mkdir /input

E:\test>hdfs dfs -copyFromLocal file1.txt input

E:\test>hdfs dfs -cat /input/words.txt

E:\test>yarn jar E:\hadoop-2.2.0\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.2.0.jar wordcount /input/words.txt /output
14/06/14 14:29:47 INFO Configuration.deprecation: is deprecated. Instead, use dfs.metrics.session-id
14/06/14 14:29:47 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/06/14 14:29:48 INFO input.FileInputFormat: Total input paths to process : 1

E:\test>hdfs dfs -cat  /output/part-r-00000
abc1    1
abc2    3
abc3    1
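
For readers new to Hadoop, the logic of the wordcount job can be sketched in plain Python. This is only an in-memory illustration of the map and reduce steps, not the Hadoop example itself, and sample_lines is a hypothetical input chosen to produce the same counts as above.

```python
from collections import Counter

def map_phase(lines):
    """Emit a (word, 1) pair for every word, like the mapper does."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Sum the emitted counts per word, like the reducer does."""
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return counts

sample_lines = ["abc1 abc2", "abc2 abc3 abc2"]  # hypothetical file contents
for word, count in sorted(reduce_phase(map_phase(sample_lines)).items()):
    print("%s\t%d" % (word, count))
```

In the real job, Hadoop runs the map step in parallel over the input splits and groups the pairs by word before the reduce step.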

I hope that Hadoop 2.x gains wide adoption in enterprise environments, since the industry needs a next-gen standard for distributed apps.

Saturday, August 25, 2012

How to create a native Java App

Recently, I stumbled upon JCGO, an interesting project that translates Java 1.4 code into C.
In this article, I show how to create a native Windows app out of a small Java app.

The Java app I will use is NetCat ( You can download precompiled executable, netcat.exe, from

So the first step is to download all dependencies. I will use MinGW, MinGW GCC, jcgo-lib-1_14.tar.gz, jcgo-src-1_14.tar.bz2, classpath-0.93, and the Java sources for the app with its dependent libraries, including commons cli 1.2. You need to put all of this in the same directory, so that it has a structure like this:


Then, you need to run the Java to C translator using this command:

jcgo.exe -sourcepath netcat/src -sourcepath commons-cli-1.2-src/src/java netcat.NetCat -d out


Analysis pass...
Output pass...
Writing class tables...
Creating main file...
Parsed: 293 java files (2699 KiB). Analyzed: 3067 methods.
Produced: 640 c/h files (3769 KiB).
Contains: 1490 java methods, 4119 normal and 288 indirect calls.
Done conversion in 1 seconds. Total heap size: 36572 KiB.

The next step is to compile it into the final executable. The following command does this:

gcc -DJCGO_INET -DJCGO_NOFP -DJCGO_WIN32 -DJCGO_THREADS -I src/include/ -I src/include/boehmgc/ -I src/native/ out/Main.c -o netcat.exe libs/x86/mingw/libgcmt.a -lws2_32

I used some switches that are suitable for this particular app. For example, by default JCGO doesn't use multithreading or networking; these have to be enabled explicitly.

And that's it. Now you can try out the app like this:

$ netcat.exe -p 80
Connecting to port 80
HTTP/1.0 302 Found
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=2f3085ac38771e98:FF=0:TM=1345885031:LM=1345885031:S=8A-IkreMgCogMsey; expires=Mon, 25-Aug-2014 08:57:11 GMT; path=/;
Set-Cookie: NID=63=O_QZ4bDrzYNiiE0DY8RT-34c_pGt_OZagP3gzrzqCAx_Xo2kO7s9zVrUOx7FVz4TyAEY7Wx9UhglYZSX9UHSdzT7c9mUKzfkJFp5lk5FyfiMIcKITLhgSX4__3QwEYBS; expires=Sun, 24-Feb-2013 08:57:11 GMT; path=/;; HttpOnly
P3P: CP="This is not a P3P policy! See for more info."
Date: Sat, 25 Aug 2012 08:57:11 GMT
Server: gws
Content-Length: 218
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<H1>302 Moved</H1>
The document has moved
<A HREF="">here</A>.

I like the approach of translating Java code into C because, compared to other tools that generate C++ code, it is more suitable for embedded devices. For example, it is possible to generate code for iOS, because Objective-C is a superset of C.

One feature I would like to see, though, is the ability to use reference counting instead of a full GC. This is because one of the advantages of C over Java is that it doesn't have GC pauses. The programmer would then have to make sure that there are no cycles in orphaned object structures.
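
The problem with cycles can be demonstrated in CPython, which combines reference counting with a separate cycle collector. The sketch below is only an analogy and has nothing to do with JCGO's runtime: it switches the cycle collector off, so only reference counting is active, and checks whether an orphaned cycle is freed.

```python
import gc
import weakref

class Node(object):
    pass

def cycle_survives_refcounting():
    """Return (alive with reference counting only, alive after a full collection)."""
    gc.disable()                   # leave only reference counting active
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a    # two objects that reference each other
        probe = weakref.ref(a)     # lets us observe whether 'a' has been freed
        del a, b                   # the cycle is now orphaned
        alive_without_gc = probe() is not None
        gc.collect()               # the cycle collector finds and frees it
        alive_after_gc = probe() is not None
        return alive_without_gc, alive_after_gc
    finally:
        gc.enable()

print(cycle_survives_refcounting())
```

With reference counting alone the two objects keep each other alive forever, which is exactly the kind of leak the programmer would have to rule out by hand.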

Update: Ivan Maidansky, an author of JCGO, has posted some interesting comments regarding this article. In particular, he is aware of some apps in the Apple Store that use this kind of translation. Also, reference counting is discouraged due to multithreading issues. These comments can be found here: