DOCX disaster recovery: How I rescued my wife from XM-HELL

Something strange in your closing tag, who you gonna call?

Sysadmin blog: What do you do when a critical Word document won’t open? Even in today’s world of versioned documents, it is entirely possible for corruption to squeak in and go unnoticed, wrecking your entire version history.

But all is not lost. My wife had this happen to her; here’s how we solved it.

Real world example

In my case, Word wouldn’t open an important file, dying instead with the error “the name in the end tag of the element must match the element type in the start tag”. Translated from Microsoftese: “The word processor that created this document made an XML boo-boo, and Word is going to refuse to read this document now.”

The most common XML boo-boos that word processors make are either saving tags out of order (the most famous example being Microsoft Word’s oMath tags error) or opening a tag but not closing it. Today’s issue was the latter. The wife was using an old version of LibreOffice Writer (v4.1) and had made several changes to hyperlinks in one area of the document. Writer got confused somehow and opened a hyperlink tag, but didn’t put in any information about where it was hyperlinking to and didn’t close the tag.

What should be noted here is that Writer and Word behave very differently with this broken file. Writer will open the file, but simply stop processing the document around where the XML stops making sense. Word will vomit that error and die. Both are useful.

Fixing the error

To understand how Microsoft’s DOCX (and similar OOXML formats) can go awry, we need to understand a little about the structure of these documents. What’s important to note is that OOXML documents are actually zip files, with all the goodies packed up in various XML files like creamy filling. This generally means that if we can rename the file to something.zip and open it in 7-Zip, we can fix the document.
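If you would rather poke at the archive from a script than from an archive manager, here is a minimal sketch using Python's standard zipfile module. The function names are my own and the file name in the usage note is hypothetical; nothing here comes from the original fix.

```python
import zipfile


def list_parts(source):
    """Return the names of every file packed inside a DOCX.

    `source` can be a path or a file-like object. No rename to .zip is
    needed because zipfile looks at the content, not the extension.
    """
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def read_part(source, part_name):
    """Return the raw bytes of one part, e.g. 'word/document.xml'."""
    with zipfile.ZipFile(source) as archive:
        return archive.read(part_name)
```

Calling list_parts("report.docx") on a typical document should show entries such as word/document.xml alongside a handful of other XML parts.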

What we need to know is what XML file within the DOCX is causing the problem and where to look for it. When Word blew up trying to open the DOCX it gave up that information. “Location: Part: /word/document.xml, Line: 2, Column: 12464.”

[Screenshot: the Word DOCX error dialog. Caption: The error in question]

If you want the exact position of the error but don’t have Word handy, open each XML in Firefox. Firefox will let you know when one of the XMLs isn’t parsing properly, and will tell you what line and character, just like Word.
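The same check can be scripted. The sketch below runs every XML part in the archive through Python's built-in expat parser and collects the parts that fail, with the line and column the parser reports. The function name is my own, and expat's idea of the column may not match Word's or Firefox's exactly, so treat the numbers as a pointer to the neighbourhood rather than gospel.

```python
import zipfile
from xml.parsers import expat


def find_xml_errors(source):
    """Parse each XML part in a DOCX and report the ones that fail.

    Returns a list of (part_name, line, column, message) tuples.
    `line` is 1-based; `column` is the 0-based offset expat reports.
    """
    errors = []
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            # Only the XML-ish parts are worth parsing; skip images etc.
            if not name.endswith((".xml", ".rels")):
                continue
            parser = expat.ParserCreate()
            try:
                parser.Parse(archive.read(name), True)
            except expat.ExpatError as exc:
                message = expat.ErrorString(exc.code)
                errors.append((name, exc.lineno, exc.offset, message))
    return errors
```

An empty list means every part parsed cleanly, which is also a cheap sanity check to run on a freshly saved file.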

Writer also gives you the information by displaying the last parseable information in the document before it stops making sense. In this case, the last words Writer could parse were “The website needs to be able to generate invoices and log when a payment is received.” Opening the XML file in Internet Explorer will provide the same info.

So, I have two options for solving this:

  1. Open document.xml, go to “line 2, character 12464”.
  2. Open the document and search for “The website needs to be able to generate invoices and log when a payment is received.”

In either case, I have to take a look at the XML tags and see what isn’t coded properly.

Here we have a problem: the entire document is on line 2! Because there is an XML error, most development environments won’t reflow the XML for you until you find and kill the error. With everything on one long line, finding and killing the error is a lot harder than it should be.
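One way around the one-long-line problem is to skip the editor entirely and slice out just the neighbourhood of the reported position. This is a small sketch, assuming the reported column is 1-based (tools differ on this); the function name and the default radius are my own choices.

```python
def context_at(xml_text, line, column, radius=80):
    """Return the text surrounding a (line, column) position.

    `line` and `column` are treated as 1-based, which is an assumption:
    some tools count columns from 0, so the result may be off by one.
    """
    target = xml_text.splitlines()[line - 1]
    index = column - 1
    start = max(0, index - radius)
    return target[start:index + radius]
```

For the error in this article the call would be context_at(text, 2, 12464), where text is the decoded content of document.xml.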

Cheating

There are two ways to cheat. The first is to use Visual Studio or Visual Studio Express. Simply open the XML file, select all, copy the contents and then paste them into a new XML document. Visual Studio will indent the XML you are pasting, flowing it over as many lines as is necessary rather than keeping it all on one line.

The line and character references from above are now meaningless, but the last parseable content clue is still valid. CTRL + F will bring up the search dialogue box; enter the relevant bit of content and voilà, you are smack in the middle of a properly indented bit of bad code.
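If Visual Studio isn't to hand, a crude textual reflow does much the same job. The sketch below puts each tag on its own line without parsing anything, so it works on malformed XML, and then finds the lines containing the last parseable content. It does not indent, and both function names are my own. One caveat: Word and Writer can split a sentence across several runs, so searching for a short distinctive phrase is likely to work better than the full sentence.

```python
import re


def reflow(xml_text):
    """Break the text between adjacent tags, one newline per '><' join.

    Purely textual: no parsing is attempted, so broken XML is fine.
    """
    return re.sub(r">\s*<", ">\n<", xml_text)


def find_lines(reflowed_text, needle):
    """Return the 1-based line numbers that contain `needle`."""
    return [
        number
        for number, line in enumerate(reflowed_text.splitlines(), start=1)
        if needle in line
    ]
```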

The other way is to open the XML file in Chrome. Chrome will lie to you about the line and column where the error is, so ignore that. Most importantly, Chrome will tell you what the offending tag is. Novel concept. In this case, the offending tag is a hyperlink and Chrome is expecting it to close before the <p> tag.

Interestingly enough, right after character 12464 there is a <w:p> tag. Armed with this information I can reasonably deduce that the first hyperlink tag before character 12464 in document.xml is somehow at fault.
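That deduction can be roughly automated with a stack: push every opening tag, pop on every matching closing tag, and stop at the first closing tag that doesn't match what is on top. The tag left on top is the prime suspect. This is a minimal sketch of that idea, not what Chrome or Word do internally; the regex is a simplification that ignores CDATA sections and attribute values containing ">".

```python
import re

# Matches <name ...>, </name> and <name .../>; skips <?xml ...?> and comments.
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.\-]*)[^<>]*?(/?)>")


def first_mismatch(xml_text):
    """Return details of the first closing tag that doesn't match.

    Returns None if every closing tag matches its opener, otherwise a
    dict naming the unexpected closing tag and the still-open tag
    (name, character offset) that was on top of the stack.
    """
    stack = []
    for match in TAG.finditer(xml_text):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue  # <w:br/> opens and closes in one go
        if not closing:
            stack.append((name, match.start()))
        elif stack and stack[-1][0] == name:
            stack.pop()
        else:
            return {
                "unexpected_close": name,
                "close_offset": match.start(),
                "unclosed_open": stack[-1] if stack else None,
            }
    return None
```

The offset in unclosed_open is a character index into the text, so it can be used to slice out the offending tag for inspection before deleting anything.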

I deleted the tag, saved the XML, put it back into the zip file and renamed it to .docx. Open the file in either Word or Writer and, as if by magic, the document is whole again. Days of work have been saved.
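The repacking step can be scripted too. Zip archives can't be edited in place with Python's zipfile module, so this sketch copies every part into a new archive and swaps in the repaired bytes for the one part that changed. The function name is my own, and the file names in the usage note are hypothetical.

```python
import zipfile


def replace_part(src, dst, part_name, new_bytes):
    """Copy the archive at `src` to `dst`, replacing one part.

    `src` and `dst` can be paths or file-like objects. Every other part
    is copied across unchanged, keeping its original name and
    compression type.
    """
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for item in zin.infolist():
            if item.filename == part_name:
                data = new_bytes
            else:
                data = zin.read(item.filename)
            zout.writestr(item, data)
```

Something like replace_part("broken.docx", "fixed.docx", "word/document.xml", repaired) writes a new file and leaves the broken original untouched, which seems wise when days of work are at stake.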

Insidious error

The class of XML error described above is absolutely insidious. If you are the type of writer who obsessively saves documents, you are only digging your own grave. So long as the instance of the word processor that caused the error is open, the document will look and behave perfectly normally.

You could have created the error on page two, kept the document open ever since and saved manually (or automatically) on a regular basis. So long as you never close the word processor you’ll just keep saving corrupted versions of the file with more and more data after the corruption point.

Even if you are saving to a cloud-based versioning file storage repository, it will only keep a maximum number of versions around. If you’re anything like me, you can keep documents open for weeks at a time; that error on page two can easily become embedded in every single version of the document saved for weeks, overwhelming the maximum version history and wiping out any chance of reverting.

I tried dozens of websites and tools that promised to be able to fix corrupted DOCX files. None of them worked. I eventually stumbled upon a blog post by developer Asaf Benyamin that set me on the right course. Here’s hoping that none of you ever have to put this knowledge to the test for yourselves. ®
