Archive for the 'Python' Category

08
Oct
13

Multi-threading Comparison between Python and Java

I find myself unprepared this week for what I was going to write about, partly because of playing Dwarf Fortress after having had a long respite, and having been away for a good portion of the weekend. So I don’t have any screen captures done for the game I was planning on writing about.

Instead, I’ll be talking today about multi-threading on both Java and Python and why in general Java is better at it.

The story begins with a project I had at work, where I was asked to speed up a program written in python. The code was written in a single thread and I was told that it was a sequential type of algorithm that the coder thought couldn’t be multi-threaded. This sounded like a challenge to me so I got a copy of the code to look at and began optimising.

The code was designed to solve a problem within an area of mathematics, something related to topological groups. I didn’t really understand the complete idea behind what he was trying to do; after all, the person who wrote the code was an honours student, and being an engineer I didn’t do anything harder than second-year maths, and not in this field. The basic gist is that he was trying to work out the rate of usual growth and geodesic growth of a particular word problem, and whether there is a pattern to the growth rates. If you understand this you probably know more than I do!

So I had some code and began investigating. I couldn’t see any obvious way to make a huge improvement to his algorithm, which isn’t surprising, but I did find some areas of implementation that I could fix to make the program faster.

The code used string variables, and it used them _a lot_. After profiling the code I found that it spent a good portion of its time in the string length function (len). By changing the code to call this function less (keeping track of the length in a variable) I managed to speed up the code by quite a bit. The saving comes from avoiding a function call in the interpreter inside the hot loops rather than from the length calculation itself: unlike strlen in C, which traverses the string each time, len on a Python string is a constant-time operation because the length is stored in the string object.
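To give a rough idea of what keeping track of the length in a variable looks like, here is a minimal sketch. The function `extend_words` and its arguments are hypothetical and are not taken from the original program; the only point is that each word travels with its length, so the inner loop never calls len.

```python
def extend_words(words, generators, max_length):
    """Extend each (word, length) pair by one generator symbol.

    Hypothetical example: the length is carried alongside the word and
    updated by simple addition, so len() is never called in the loop.
    """
    result = []
    for word, length in words:
        if length >= max_length:
            continue  # already at the limit, nothing to extend
        for symbol in generators:
            result.append((word + symbol, length + 1))
    return result


if __name__ == "__main__":
    start = [("a", 1), ("ab", 2)]
    print(extend_words(start, "ab", 2))
```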

I found a few other implementation details relating to strings, but quickly ran into a wall of not being able to make the program much faster. So my attention turned to changing the code to take advantage of multiple cores. Now having become more acquainted with the code I noted that there were situations where one function was called many times on many strings in a data set. This data set unfortunately grows rather quickly (exponential I think) so later iterations of the algorithm could take a long time single threaded. So I decided to go about processing more than one string at the same time in many threads.

This unfortunately led me to find a problem with using Python for any kind of number crunching. After doing some research I found out about something called the Global Interpreter Lock (GIL). In the standard CPython implementation, this lock means that within one instance of Python only one thread can be executing in the interpreter at any time. Because the code runs almost exclusively in the interpreter, it wasn’t really going to be possible to do all the work within one instance of Python. The threading model in Python works well for things like I/O, where a thread may not need the interpreter for a while and it can be used by another thread.

So being a Unix user I thought of using multiple instances of python and using OS pipes to communicate between all the different instances. This is called multi-processing rather than multi-threading, but it still allowed me to make use of many more processors and in theory it could have even been extended to multiple machines, but as it turned out this wasn’t necessary and wouldn’t have helped.
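As a rough illustration of the idea, the sketch below starts several Python instances and talks to them over OS pipes. It is written for Python 3 rather than the 2.7 I was using, and the names `WORKER_SOURCE`, `_run_worker` and `run_in_workers` are hypothetical. The worker only reports the length of each string, standing in for the real per-string function, which isn't shown here.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker program: reads one word per line on stdin and
# writes "<word> <length>" per line on stdout.
WORKER_SOURCE = r'''
import sys
for line in sys.stdin:
    word = line.rstrip("\n")
    print(word, len(word))
'''


def _run_worker(chunk):
    """Start one Python instance, pipe a chunk of words to it, parse the reply."""
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    output, _ = proc.communicate("".join(word + "\n" for word in chunk))
    results = {}
    for line in output.splitlines():
        word, length = line.rsplit(" ", 1)
        results[word] = int(length)
    return results


def run_in_workers(words, worker_count=2):
    """Split words across worker processes and merge their results.

    Threads in the parent only wait on pipe I/O, which the GIL does not
    prevent; the actual work happens in the separate processes.
    """
    chunks = [words[i::worker_count] for i in range(worker_count)]
    merged = {}
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        for partial in pool.map(_run_worker, chunks):
            merged.update(partial)
    return merged


if __name__ == "__main__":
    print(run_in_workers(["ab", "abc", "a"], worker_count=2))
```

Note that every string has to be written to a pipe as text and every result read back and parsed, so the amount of data crossing the pipes grows with the data set. The sketch also assumes the words contain no newlines.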

The problem I faced after much testing was that the main Python process was spending way more time on I/O than pretty much anything else. Part of the problem was the large volume of data that needed to traverse the pipes in order for each worker process to run. I tried several different techniques but couldn’t get an improvement. This is when I decided a port to Java would be a good idea.

The port to Java was pretty straightforward for the code that makes up the main part of the algorithm, so I took the opportunity to do a simple test of which one has better single-threaded performance. I ran the Python version of the code using pipes to both Java and Python processing instances in turn. I found that, with equivalent code, Java was about 50% faster just for the function I had put into the separate process. This looked promising and was useful in debugging some of the Java code. Given that Python is interpreted this shouldn’t be surprising.

I moved to Java because its threading model is much better. Firstly, it actually supports having more than one VM thread running at a time, so it is possible to do real processing on multiple threads. Secondly, it has a bunch of handy mechanisms to help you get the maximum processing done. Synchronized blocks, locks, and the notify/wait system allow for thread-safe programming without having to poll for events, thus saving processing time. The main advantage over Python in this case is not having to do the I/O to get data to the processing threads, and being able to multi-thread more sections of the code.

Now that the Java port is complete, the performance stats speak for themselves. It is significantly faster than the Python implementation, taking about 4 days to do what would take weeks even with the multi-process Python implementation. I do have to note that I was using the 2.7 version of Python, so this does not reflect the performance or programming model in the new Python 3 implementation.

17
Jul
12

Python Script: Clean up out of date packages in pkgsrc.

Having updated some packages on my NetBSD system, I had some duplicate copies of the built binaries in my pkgsrc tree. This of course is pretty much just a waste of disk space so I went about writing a script to clean up the extra package tarballs.

I made a script that would examine the list of files and find ones with the same package name (with the version info chopped off). It would then compare the files by date and then delete the older files. The main problem with this script is packages like apache and samba. Those have multiple packages within pkgsrc that are simply different versions of the same software. Unfortunately these different versions share the same package name as far as the package management system is concerned, which means this script could delete a binary package file you wanted to keep. For example, apache-1.3x versus apache-2.x.x.
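The grouping step can be sketched roughly as follows. This is not the actual clean-packages.py code. The function names are hypothetical, and it assumes package files are named name-version.tgz and that the caller supplies each file's modification time.

```python
from collections import defaultdict


def package_base_name(filename):
    """Chop off the extension and version, e.g. 'bison-2.5.tgz' -> 'bison'.

    Assumes the version is everything after the last hyphen.
    """
    stem = filename[:-len(".tgz")] if filename.endswith(".tgz") else filename
    name, _, _version = stem.rpartition("-")
    return name or stem


def stale_packages(files_with_mtime):
    """Return every file except the newest one for each package name.

    files_with_mtime is a list of (filename, modification_time) pairs.
    """
    groups = defaultdict(list)
    for filename, mtime in files_with_mtime:
        groups[package_base_name(filename)].append((mtime, filename))
    stale = []
    for entries in groups.values():
        entries.sort()  # oldest first, newest last
        stale.extend(filename for _mtime, filename in entries[:-1])
    return sorted(stale)


if __name__ == "__main__":
    files = [
        ("apache-1.3.42.tgz", 100),
        ("apache-2.2.22.tgz", 200),
        ("python27-2.7.3.tgz", 150),
    ]
    print(stale_packages(files))
```

In this sketch the older apache file is reported as stale even though it is a different major version, which is exactly the pitfall described above. The python27 file is safe because the series number is part of the name.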

Fortunately I wouldn’t imagine many people installing more than one copy of packages like samba and apache, but it is something to be aware of as it does affect some other packages as well. So far it hasn’t been a problem for me. I’ve noted some packages like python are fine as the version has become a part of the package name (eg. python27 for the 2.7.x series of python). So it would be a good idea to have a quick look in your package directory.

After removing the extra tarball files, the script goes and checks all the symlinks in the packages tree to make sure they are still good, and removes the link if it is broken.
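The symlink check can be done along these lines. Again, this is only a sketch under my own naming (`remove_broken_symlinks` is hypothetical), not the script itself. A link counts as broken when it is a symlink whose target no longer exists.

```python
import os


def remove_broken_symlinks(root):
    """Walk the tree under root and delete symlinks whose target is missing.

    Returns the list of removed link paths.
    """
    removed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() looks at the link itself; exists() follows it, so
            # a link that is present but does not resolve is broken.
            if os.path.islink(path) and not os.path.exists(path):
                os.remove(path)
                removed.append(path)
    return removed
```

It would be called with the path to the packages directory, for example remove_broken_symlinks("/usr/pkgsrc/packages"), though where your packages live depends on your pkgsrc configuration.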

Note you will need the common.py file in addition to this one (clean-packages.py) for the common utility functions. This was in my last post containing the build script.
Continue reading ‘Python Script: Clean up out of date packages in pkgsrc.’

16
Jul
12

Python Scripts for building in pkgsrc packages in NetBSD

I’ve been a little busy of late, so sorry this post is coming in rather late.

As you may have already read, I have been learning python for work. So I decided to use it to try to solve some problems I had been having on my SparcStation.

There aren’t any binary packages for my old sparc available so I often take it upon myself to build packages from source using the pkgsrc system. There are a few problems in doing this: Firstly the old machine doesn’t have a lot of disk space, so I do not wish to keep the build dependencies installed all the time (bison, m4, digest etc…). Secondly the machine is quite slow, so when building a package I do not wish to rebuild ones that are already built. Normally I would have to perform the installation and removal manually as the pkgsrc system doesn’t seem to detect previously built/downloaded packages. So I decided to make a python script that would automate the process, so I could ask the machine to build and install something, and not have to interact with it unless there was a problem or it finished.

In starting the script I realized I could also automate a few other pkgsrc functions such as updating all the installed packages. Because there was a lot of commonality between what I was writing I made a common module that all the scripts could use to perform some basic functions that I would need on a regular basis.

Here is the common.py code…
Continue reading ‘Python Scripts for building in pkgsrc packages in NetBSD’





