How do you copy 60m files?

Apart from telling someone else to do it, that is


Sysadmin blog Recently I copied 60 million files from one Windows file server to another. Tools to move files from one system to another are integrated into every operating system, and there are third-party options too. The tasks they perform are so common that we tend to ignore their limits.

Many systems administrators are guilty of not knowing exactly where the limits of their file management tools lie - at least until they run up against them.

But I knew that using Windows Explorer in Windows XP/Server 2003 would be lunacy: the first permissions, filename or path-length problem and the copy grinds to a halt.

If I were moving files from one server to another, a halt would be only a minor annoyance: the files moved so far are already on the destination server, so you restart the transfer with the faulting file omitted. Irritating and time consuming, but not difficult.

Copying is another matter entirely. The file manager must handle exceptions. If it doesn't, then you have to do a lot of checking to find out what has been copied and what hasn’t.

FTP is one of my favourite ways to handle this problem. A good FTP client is designed with all sorts of abnormal situations in mind. It has bandwidth and threading controls, a transfer queue, the ability to resume failed transfers, and it reconnects to the target server if the connection is lost.

Decades after it was created, FTP still remains one of the best ways to move files. But no graphical FTP client I could find would cope with 60 million files. Filezilla blew up somewhere around one million. WS-FTP managed a few more. None of them were capable of more than about four million files.
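The same resume-and-retry machinery the GUI clients offer is also available from the command line. A hedged sketch using lftp's mirror mode (the server name, credentials and paths here are hypothetical, and this is illustration rather than a tool I tested at the 60-million-file scale):

```shell
# Hypothetical host, credentials and paths, for illustration only.
# mirror walks the remote tree; --continue resumes partially
# transferred files, --parallel moves several files at once, and
# lftp reconnects automatically if the connection drops.
lftp -u user,password ftp://fileserver.example.com \
     -e "mirror --continue --parallel=4 /share/files /mnt/dst; quit"
```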

My next attempt was to package them into a ball on the originating server and unpack them on the destination. No good. Neither Windows Server 2003's native zip utility nor WinZip, 7-Zip or WinRAR was up to it. Somewhere between four and ten million files, all of them threw an exception and died.
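For comparison, archive-and-unpack is the natural approach on Linux, where tar copes with deep trees and large file counts far more gracefully. A minimal sketch of the idea, using hypothetical local paths standing in for the two servers:

```shell
# Hypothetical paths standing in for the originating server's tree.
mkdir -p /tmp/origin/very/deep/directory
echo "payload" > /tmp/origin/very/deep/directory/file.txt

# Pack the whole tree into one archive...
tar -czf /tmp/bundle.tar.gz -C /tmp/origin .

# ...and unpack it at the destination, preserving the layout.
mkdir -p /tmp/destination
tar -xzf /tmp/bundle.tar.gz -C /tmp/destination
```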

Knowing that Windows 7/Server 2008 R2's Windows Explorer is more advanced than the Server 2003 version, I tried using a third server to move the files from A to B via C (which was Server 2008 R2). It is much better at handling exceptions, but it too fell apart at around four million files.

Consulting my flash keyfob, I started trying my sysadmin tools. XXCopy, FastCopy, TeraCopy and Beyond Compare all made valiant, but ultimately ineffective, attempts. Of the GUI tools I tried, only Richcopy was able to handle the load. Richcopy is a free multi-threaded file management application written by Ken Tamaru at Microsoft. Increasingly it is my fallback for handling odd or exceptional file transfer scenarios.

I wanted to give several command-line tools a go as well. XCopy and Robocopy most likely would have been able to handle the file volume but - like Windows Explorer - they are bound by the fact that NTFS can store files with longer names and deeper paths than CMD can handle. I tried ever more complicated batch files, with various loops in them, in an attempt to deal with the path-depth issues. I failed.
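For the record, the sort of invocation I mean looks like this. The share names are hypothetical; the flags are standard Robocopy options:

```shell
# Hypothetical UNC paths. /E copies subdirectories including empty
# ones, /ZB falls back to backup mode when access is denied, /R:1
# and /W:1 stop a single bad file stalling the run for hours, /NP
# suppresses per-file progress, and /LOG keeps an audit trail.
robocopy \\serverA\share \\serverB\share /E /ZB /R:1 /W:1 /NP /LOG:C:\copy.log
```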

What worked brilliantly was using a Linux virtual machine. A simple default CentOS 5.5 install was able to mount SMB shares on both the originator and destination machines. From there, the command-line tool cp was able to succeed where every tool except Richcopy failed.
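The setup was roughly this. In the real job the source and destination directories were CIFS mounts of the two Windows servers (the server names and credentials below are hypothetical); here local directories stand in so the copy step itself can be demonstrated:

```shell
# On the real servers the mounts looked something like this
# (hypothetical names and credentials):
#   mount -t cifs //serverA/share /mnt/src -o ro,username=admin
#   mount -t cifs //serverB/share /mnt/dst -o username=admin
# Local stand-in directories so the copy can be demonstrated:
mkdir -p /tmp/srv-a/users/jdoe/docs /tmp/srv-b
echo "report" > /tmp/srv-a/users/jdoe/docs/q1.txt

# -a recurses and preserves permissions and timestamps; cp walks
# the tree one file at a time, so it never needs to hold a
# 60-million-entry listing in memory.
cp -a /tmp/srv-a/. /tmp/srv-b/
```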

The Linux command line tool cp, while slower than Richcopy, copies in a linear fashion. It copies each file in sequence, leading to no fragmentation on the destination server.

Richcopy can not only handle large quantities of files, it can also multi-thread the copy, and so it finishes several hours faster than using a Linux server as an intermediary. The disadvantage is that the resulting file system on the destination server is heavily fragmented. You could restrict Richcopy to a single thread, but then it is no faster than cp.

Richcopy's speed advantage is not so great that there would be time to defragment an NTFS partition holding 60 million files before cp had finished. So the best way to move 60 million files from one Windows server to another turns out to be: use Linux.
