Machine learning algorithms are notoriously CPU intensive. With PHP 7 being touted as twice as fast as its predecessor, is it now practical for simple machine learning tasks? In this post, it is compared to both Java and Python to show why speed isn't everything.
Despite years of flak, PHP still dominates the server-side scripting language market. Rightly or wrongly, part of its success is that it is a forgiving language that is incredibly accessible. In the right hands, a developer can quickly build all manner of web applications - in the wrong hands, one can quickly make a mess! This flexibility not only has cost at the code level. Internally, PHP has to do a lot of juggling and this is where processor time is lost.
Each minor version of PHP 5 has made small incremental improvements to performance, but PHP 7’s refactored core is something we haven’t seen before. Zend’s infographic shows big improvements for many frameworks and CMSs. These traditional uses for PHP typically hit databases - ideal for PHP 7’s optimised data structures. Most of the time spent on machine learning problems is actual computation so will there be similar returns even when not juggle data types?
The motivation for this post was my wish to run simple classification tasks natively in PHP. In the past, I have used PHP/Java Bridge but it has always felt a bit wobbly. Now I prefer creating a simple web service for PHP to query. This also means the brain can be moved off the web server to a more suitable machine. This disconnect still feels like overkill for simple problems like spam filtering and it adds a point of failure. Would it not be nice to be able to query the model natively rather than through an API? Does PHP 7 now make this feasible?
To measure PHP 7’s grunt, we will be using the k-nearest neighbours algorithm - probably the most basic classifier there is. It is like sticking a pin in a scatter chart and determining the class by the colours of the surrounding points. In the example below, we assume this point belongs to the red pluses because of the three nearest, two were of this class.
K-nearest neighbour is known as a lazy classifier - there is no model per se - all the work is done when we come to classify a new instance. That is why this classifier is ideal for this experiment. To keep things simple, we’re only looking for the closest neighbour (k = 1). The script will have to calculate the distance to every point in our dataset in n-dimensional space to find the closest match. This means lots of floating point operations and a more efficient language implementation means more CPU time is spent working on our problem rather than itself.
The script loads the Iris dataset from a CSV and then runs leave-one-out cross validation continuously for thirty seconds. The more iterations classified in this time - the better the performance of that language implementation. We don’t care about accuracy so we will be discarding the answer and simply move to the next instance.
The source code for each language is available on GitHub. I have tried to code each in the style of that language rather than a straight port from PHP. Sorry if it’s not the way you would implement it. I am sure there are a myriad of ways to squeeze more performance from each test but we are trying to imitate how someone would reasonably use the languages.
To keep things fair, the benchmarks will be run on the same single-core virtual machine. We are not interested in memory use but the VM only has 1GB of memory available, not that a 150-instance dataset should make much of a dent in that.
##Results Running the benchmarks multiple times showed there is a small variance from one run to the next. For this reason, the exact figures have been omitted from the charts. Feel free to run the code yourself if exact numbers for a toy problem matter to you.
These implementations of the PHP and Python languages are running simply as interpreters. Java can be run in an interpretive only mode but no one would do that in the real world.
Here we see PHP 5.6 and Python performed similarly but PHP 7 stands out with a 35% improvement on its predecessor. The Python 3 implementation is marginally slower than Python 2 for reasons beyond the scope of this article.
Unlike Zend’s benchmarks where PHP 7 sometimes outperforms HHVM, our benchmarks show HHVM is crunching nearly 30% more instances than PHP 7. If your PHP is CPU-intensive and you’re not using strange extensions, HHVM could help you. PyPy clearly dwarfs HHVM but like HHVM, there are compatibility issues that could drastically slow your development down instead. More on this in the discussion below.
Adding the Java benchmark result to this graph makes it useless. Java is just in another league. Outside VirtualBox, it is pushing nearly one million instances per second!
It is clear that any dynamically-typed language, compiled or not, will be vastly slower than something like Java. Hopefully, you’re not surprised by that. Once HotSpot kicks in, the Java code is much closer to metal.
But as fun as performance benchmarks are, it is clear there is more to the story than speed, or we would all be using assembly language. Anyone who has dabbled with Java will know how it can sometimes feel like walking through molasses compared to dynamically-typed scripting languages. This reminds me of something a lecturer pointed out during the first year of my undergraduate course: developer time is vastly more expensive than CPU time, and this gap is only going to widen further. Using a language or implementation just because it performs better in a benchmark is a form of premature optimisation.
If we use human time as the primary measure of efficiency, it makes sense to use the best performing language that will meet the requirements and that the team can maintain. Unlike the benchmarks above, in the real world one would use a machine learning library and this is where PHP falls flat on its face. There just isn’t a viable machine learning library for PHP. Any advantage one would gain from using the common language of the team goes straight out of the window when they start writing mountains of code from scratch. A Python team would have the luxury of using NumPy and scikit-learn - both of which are pre-compiled so there is a significant performance boost there too.
Until there is a viable machine learning toolkit for PHP, don’t even bother. Even if PHP 7 was faster than the competition, the development time will not be made back in all but the most exceptional circumstances. It’s all about using the right tool for the job and PHP doesn’t even have any tools to offer.
So to answer the question: Is PHP now suitable for machine learning? Practically, no - not until there is a viable library available. For now, if you want to keep your codebase heterogeneous, use a web service like Google’s Prediction API or Amazon Machine Learning.