MattHicks.com

Programming on the Edge

Multithreaded Unzip?

Published by Matt Hicks under , on Sunday, January 13, 2008
A co-worker and I were discussing the possibilities of performance gain doing extraction of ZIP files using multiple threads rather than the typically single-threading extraction that the majority (if not all) mainstream archive extraction utilities use and given Java's great ZIP file support built-in it seemed rather trivial to give this a shot, thus, UberZip was born.

UberZip is my simple little sample Java command-line application to extract files where you can specify the number of threads to utilize during extraction. My #1 biggest headache is extracting Eclipse from the ZIP file you get, so I decided that would be an ideal test of my little program. I downloaded the Eclipse J2EE bundle (3.1.1), which is a happy 132 meg with thousands of files.

I need to test further with a machine that actually can better utilize multiple threads, but here are the stats for my AMD Athlon 64 3200+ (Hyperthreaded, so I actually get a very slight benefit) running on Windows XP:

14.7 seconds with 1 thread

13.4 seconds with 5 threads

11.6 seconds with 30 threads

Anything above 30 threads seems to actually create more overhead than gain.

Now, to be fair I matched this up against the fastest unzipper I am aware of, 7zip (http://www.7-zip.org). It took right at about 14 seconds to unzip the file inside 7zip, which seems to fall nearly perfectly in-line with the single-threaded execution of UberZip.

For those of you that would be interested in taking a look at this very simple example, I have committed the source to my public repository:

svn://captiveimagination.com/public/uberzip/trunk

Further, if you'd simply rather download the JAR or EXE I have uploaded copies of it for those of you that would like to unzip files "uber" fast. :)

uberzip.exe

uberzip.jar

After taking this to try out on my work machine (Dual Xeon dual-core 3.2 GHz processors + 4 gig of memory = 8 processors in Windows) I got the following stats:

7zip - 25 seconds
UberZip (1 thread) - 22.17 seconds
UberZip (5 threads) - 19.5 seconds
UberZip (30 threads) - 21.6 seconds

Oddly it would seem about 5 threads is the "sweet-spot" for this machine. Though some of the results are a bit strange and it would seem on this machine the hard drives don't perform quite as well as my home machine it is obvious that at some level of configuration multithreading you can gain some good performance on the single-threaded applications out there.

Perhaps someone else will find this information useful and turn UberZip into a product people can use. ;)

9 comments:

David Rudder said... @ October 29, 2008 at 8:52 AM

The SVN seems to be down. I get a 404. I'd love to comment on the code. Can you email me a zip (haha) of the code to drig@noses.org? At your convenience, of course!

Have you benchmarked the code when the zip file is hosted on a shared drive? Off a local disk, the bottleneck will largely be disk I/O speed, which really doesn't benefit much from threading. But, in theory, thread-aware network file sharing, like NFS and SMB, should perform *much* better when threaded. In theory, of course.

Jacob said... @ December 13, 2012 at 1:55 AM

I just wanted to let your know that UberZip is still impressive almost 5 years later. I just tried it on an i7-2720QM with a Samsung 830 SSD (using PGP full-disk encryption).

7zip: ~25 seconds
Windows 7: ~57 seconds
UberZip(1 thread): 22.4 seconds
UberZip(8 threads): 6.7 seconds
UberZip(16 threads): 5.9 seconds

Source file was a 400 MB zip of RTC client.

Petteri said... @ January 24, 2013 at 12:54 AM

> here are the stats for my AMD Athlon 64 3200+ (Hyperthreaded [...])
I have no idea what were you on when you wrote this.

>Dual Xeon dual-core 3.2 GHz processors + 4 gig of memory = 8 processors in Windows [...]
> Oddly it would seem about 5 threads is the "sweet-spot" for this machine
It's not odd. It's natural. Server boards have slower memory but as you mentioned it's probably your hard drive that's the bottle neck there.

The logic behind n+1 threads (number of cores+1) is to better utilize the CPU by always having something for it to do. Depending on the CPU it might also work better with just n or n+2.

Matt Hicks said... @ January 24, 2013 at 8:38 AM

Jacob, good to know that even though this is code that's a few years old it still performs well. :)

Petteri, thanks for the feedback regarding CPUs.

drinehart said... @ July 3, 2013 at 4:28 PM

Does uberzip also support password-protected zip files?

Duane

Matt Hicks said... @ July 3, 2013 at 5:41 PM

Duane, no, unfortunately not. UberZip was meant as a prototype and I never expected it to be as fast as it is.

Nick Wilton said... @ November 27, 2013 at 12:31 AM

I'm playing around with UberZip on an c3.8xlarge Windows 2008R2 instance on Amazon Web Services (32 vCPU's Intel Xeon E5-2680).

Unzipping a single file containing 136,012 unique files (20.3Gb uncompressed; 1.7Gb zipped).

7Zip: 10 mins 8.2 secs.
UberZip (16 threads): 2 mins 22.5 secs.
UberZip (32 threads): 2 mins 7.9 secs.
UberZip (64 threads): 2 mins 19.9 secs.


I also verified the unzipped content to check for corruption. It was perfect!

Well done.

Matt Hicks said... @ November 27, 2013 at 7:47 AM

Nick, That's some pretty impressive results. I've been thinking a lot recently of resurrecting this project. I wrote this so long ago I'm sure there are some significant optimizations I could make to make it run even faster. I just recently created a project for multi-threaded backup (https://github.com/outr/outrbackup). It is significantly faster than any other backup solution I've seen to-date if you're interested in such a thing.

Nick Wilton said... @ November 27, 2013 at 5:43 PM

Hi Matt,

I'm sure there is a real opportunity for a small command-line high-performance unzip utility. It would be good to remove the requirement to have a JRE installed too.

-Nick

Post a Comment