Recently I was browsing Stack Overflow looking at questions asked about Java. One theme that kept recurring was the performance of the various data structures in the Java API; in particular, the Vector and ArrayList classes were frequently compared. So I decided I would have a go at working out which data structures perform well.
The first thing I needed to decide was what kind of test was fair and under what circumstances I would run the tests. I made a small test framework in Java so I could apply exactly the same algorithm to each data structure. I built the program to perform tests one after another with increasing data sizes to see how the data size affected the results. I wasn’t worried about factors such as the garbage collector and other tasks running on the OS, as these would affect all the tests equally and I could take some precautions to minimise their effect. I’ll make my code available in the code section when I’ve finished tidying it up a bit.
I wrote code to test several major aspects of a data structure, recording how long (in milliseconds) it took to create the structure, insert into it, remove an item by index, and remove an item by its value. I was just using System.currentTimeMillis() to measure elapsed time, so small-scale accuracy can’t be assured; if you want to be more accurate you could run a profiler on the same code. I chose not to require the data to be ordered within the collections, as the tests do not depend on ordering within the data structure.
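To give an idea of what the harness looks like, here is a minimal sketch of the approach. The TestTarget interface and ArrayListTarget class are illustrative names for this post, not the exact code I’ll be publishing:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative interface: each data structure under test gets a small wrapper like this.
interface TestTarget {
    void create(int capacity);       // build an empty store
    void insert(int value);          // add one item
    void removeByIndex(int index);   // remove the item at a position
    void removeByValue(int value);   // remove the first matching item
}

// Example wrapper for ArrayList; Vector, LinkedList and the primitive array get their own.
class ArrayListTarget implements TestTarget {
    private List<Integer> data;

    public void create(int capacity)     { data = new ArrayList<Integer>(capacity); }
    public void insert(int value)        { data.add(value); }
    public void removeByIndex(int index) { data.remove(index); }
    public void removeByValue(int value) { data.remove(Integer.valueOf(value)); }
}

public class Benchmark {
    public static void main(String[] args) {
        int size = 100000;
        TestTarget target = new ArrayListTarget();

        // Time creation and insertion together.
        long start = System.currentTimeMillis();
        target.create(size);
        for (int i = 0; i < size; i++) {
            target.insert(i);
        }
        long insertTime = System.currentTimeMillis() - start;

        // Time removal by index, emptying the store from the back.
        start = System.currentTimeMillis();
        for (int i = size - 1; i >= 0; i--) {
            target.removeByIndex(i);
        }
        long removeTime = System.currentTimeMillis() - start;

        System.out.println("insert: " + insertTime + " ms, remove by index: " + removeTime + " ms");
    }
}
```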
I tested four different data structures, plus a dummy test with no data store to check that the test algorithm itself wasn’t adding extra overhead. I tested an implementation based on a standard primitive array and LinkedList, as well as the Vector and ArrayList classes. I’ve graphed some of the results at the bottom of this post. The first result I found very interesting is that with 1000 or fewer records there are no significant differences between the data structure types. So if you aren’t storing large amounts of data, there is little to gain by picking one data type over another.
Interestingly, the LinkedList data type had some of the worst performance, getting comparatively worse as the data size increased, particularly for the remove-by-index and remove-by-value operations. In theory it should have similar performance to the others, so this is a bit confusing; perhaps it is all the extra reference handling (removing by index still means walking the list node by node to reach the position) that makes it significantly slower. It’s no surprise that it isn’t faster than the other data types, as the operations it does well are adding and removing at either end of the list. This means LinkedList is best suited to implementing structures such as a stack or queue, where it gives you a dynamically sized structure that is still relatively fast, as in the sketch below.
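As a rough illustration of where LinkedList does fit, here is a small sketch of using it as a FIFO queue, where all the work happens at the ends of the list:

```java
import java.util.LinkedList;

public class QueueExample {
    public static void main(String[] args) {
        // LinkedList adds and removes at either end in constant time,
        // which is exactly what a FIFO queue needs.
        LinkedList<String> queue = new LinkedList<String>();

        queue.addLast("first");   // enqueue at the tail
        queue.addLast("second");
        queue.addLast("third");

        while (!queue.isEmpty()) {
            // Dequeue from the head; no index lookups or element shifting required.
            System.out.println(queue.removeFirst());
        }
    }
}
```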
Vector and ArrayList showed some slight differences in performance but are for the most part pretty similar. This is partly because they use the same underlying storage mechanism: a plain array that grows as needed. The main difference is that Vector is synchronized and therefore thread safe. The results seem to indicate that the cost of acquiring the synchronization lock becomes a smaller factor the larger the data set, and even on smaller data sizes the difference wasn’t incredibly large. On a faster machine with a different JVM the difference between the two was even smaller, looking almost identical. So the difference between these classes will largely depend on what kind of system you are running on, and in most cases will not be huge. Again, if the data size is small it doesn’t really matter which one you use.
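Because both classes implement the same List interface, switching between them is a one-line change, so it is easy to try both against your own workload. A quick sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

public class ListSwap {
    public static void main(String[] args) {
        // Same interface, so the surrounding code doesn't change.
        // Vector synchronizes every call; ArrayList does not.
        List<Integer> synced   = new Vector<Integer>();
        List<Integer> unsynced = new ArrayList<Integer>();

        fill(synced);
        fill(unsynced);
    }

    private static void fill(List<Integer> list) {
        for (int i = 0; i < 1000; i++) {
            list.add(i);
        }
    }
}
```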
Interestingly, the data type that did the best was the standard primitive array implementation. It was significantly faster than all the other data structures! The simple answer is that primitive arrays in Java require fewer operations to manage when converted into machine code by the JVM compared to the other methods of storage. In this case it was also faster for removal by index because the array did not bother keeping items in order when removing an item: I simply moved the item at the end into the space being vacated and then decremented the size. This is a common practice for lists of items in games where the order in the array is not important.
Classes such as Vector, ArrayList and LinkedList preserve the order of items within them whether you require it or not. This carries an intrinsic cost when removing items, which at worst is O(n) per removal. When the order of the items isn’t a concern, any item can be removed in constant time, as I have done in this primitive array implementation, sketched below.
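Here is a simplified sketch of the swap-with-last removal trick described above; it captures the idea rather than my exact test code:

```java
public class IntStore {
    private int[] items;
    private int size;

    public IntStore(int capacity) {
        items = new int[capacity];
        size = 0;
    }

    public void add(int value) {
        // Grow the backing array if it is full.
        if (size == items.length) {
            int[] bigger = new int[Math.max(1, items.length * 2)];
            System.arraycopy(items, 0, bigger, 0, size);
            items = bigger;
        }
        items[size++] = value;
    }

    // Constant-time removal: overwrite the removed slot with the last
    // element and shrink the logical size. Order is NOT preserved.
    public void removeByIndex(int index) {
        items[index] = items[size - 1];
        size--;
    }

    // Removal by value is still O(n) to find the item,
    // but there is no shifting afterwards.
    public void removeByValue(int value) {
        for (int i = 0; i < size; i++) {
            if (items[i] == value) {
                removeByIndex(i);
                return;
            }
        }
    }

    public int size() {
        return size;
    }
}
```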
So which data storage technique should you favour? There isn’t a simple answer, as it will depend on the particular problem. However, it does seem that you will most often get the best results when you create a data structure tailored to your problem; in this particular instance a simple array was best. You can construct pretty much any type of data structure on top of a primitive array, so many solutions can be implemented with better performance than the classes in the API. In my experience the Vector class is fine for most things and easy to use, but when I’ve needed speed I often use a primitive array, or another type of data structure altogether, such as a HashMap, which is useful for building a cache.
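For example, here is a very small sketch of the kind of HashMap-backed cache I mean. The class and method names are just illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class ResultCache {
    private final Map<String, String> cache = new HashMap<String, String>();

    public String lookup(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;            // fast path: constant-time lookup on average
        }
        String computed = expensiveComputation(key);
        cache.put(key, computed);     // remember it for next time
        return computed;
    }

    // Stand-in for whatever costly work the cache is avoiding.
    private String expensiveComputation(String key) {
        return key.toUpperCase();
    }
}
```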
So why do the data store classes in the API exist in the first place? Simply put: convenience and generality. When performance or data size are not going to cause problems, it’s easier to use a pre-built class to store your data. The Java storage classes are also implemented in a generic way so that they can be used for any purpose; this has a performance cost, but you benefit by having data structures you can use for many different data types and situations. If you want some of that convenience with better performance, you could use the Trove library for data storage. Trove is faster than the Java API classes but doesn’t offer the same generality. Check it out here.
[Graph: 1,000,000 records on the slower machine]
[Graph: 100,000 records on the slower machine]
[Graph: 1,000 items on the faster machine]
[Graph: 100,000 items on the faster machine]