DOCX disaster recovery: How I rescued my wife from XM-HELL

Something strange in your closing tag, who you gonna call?

Sysadmin blog What do you do when a critical Word document won’t open? Even in today’s world of versioned documents, it is entirely possible for corruption to squeak in and go unnoticed, wrecking your entire version history.

But all is not lost. My wife had this happen to her; here’s how we solved it.

Real world example

In my case, Word wouldn’t open an important file, dying instead with the error “the name in the end tag of the element must match the element type in the start tag”. Translated from Microsoftese: “The word processor that created this document made an XML boo-boo, and Word is going to refuse to read this document now.”

The two most common XML boo-boos that word processors make are saving tags out of order (the most famous example being Microsoft Word’s oMath tags error) and opening a tag but not closing it. Today’s issue was the latter. The wife was using an old version of LibreOffice Writer (v4.1) and had made several changes to hyperlinks in one area of the document. Writer got confused somehow and opened a hyperlink tag, but didn’t put in any information about where it was hyperlinking to and didn’t close the tag.

What should be noted here is that Writer and Word behave very differently with this broken file. Writer will open the file, but simply stop processing the document around where the XML stops making sense. Word will vomit that error and die. Both are useful.

Fixing the error

To understand how Microsoft’s DOCX (and similar OOXML formats) can go awry, we need to understand a little about the structure of these documents. What’s important to note is that OOXML documents are actually zip files, with all the goodies packed up in various XML files like creamy filling. This generally means that if we can rename the file to something.zip and open it in 7-Zip, we can fix the document.
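If you would rather not rename anything, the same peek inside the container can be sketched in Python with the standard zipfile module. This is a minimal sketch, not part of the original rescue: the function name list_parts and the tiny in-memory stand-in archive are assumptions made purely for illustration.

```python
import io
import zipfile


def list_parts(docx_bytes):
    """Return the names of the files packed inside an OOXML container."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.namelist()


# Build a tiny stand-in "DOCX" in memory (hypothetical content, not a real
# Word document) so the sketch runs without any file on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")

parts = list_parts(buffer.getvalue())
print(parts)
```

For a real file you would read the bytes from disk first, for example with open(path, "rb").read(), and pass them in.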

What we need to know is what XML file within the DOCX is causing the problem and where to look for it. When Word blew up trying to open the DOCX it gave up that information. “Location: Part: /word/document.xml, Line: 2, Column: 12464.”

The error in question

If you want the exact position of the error but don’t have Word handy, open each XML in Firefox. Firefox will let you know when one of the XMLs isn’t parsing properly, and will tell you what line and character, just like Word.
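The "open each XML and see which one fails to parse" step can also be scripted. What follows is a rough sketch using Python's standard library parser rather than Firefox or Word; the function name find_broken_parts and the sample archive are assumptions for illustration, and the parser's column count may start from a different number than Word's.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def find_broken_parts(docx_bytes):
    """Map each unparseable XML part to (line, column, parser message)."""
    problems = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for name in archive.namelist():
            # Only the XML-based parts are of interest here.
            if not name.endswith((".xml", ".rels")):
                continue
            try:
                ET.fromstring(archive.read(name))
            except ET.ParseError as err:
                line, column = err.position
                problems[name] = (line, column, str(err))
    return problems


# Hypothetical stand-in archive: one healthy part, and one part with a
# hyperlink element that is opened but never closed.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<doc><p><hyperlink></p></doc>")

problems = find_broken_parts(buffer.getvalue())
for name, (line, column, message) in problems.items():
    print(name, line, column, message)
```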

Writer also gives you the information by displaying the last parseable information in the document before it stops making sense. In this case, the last words Writer could parse were “The website needs to be able to generate invoices and log when a payment is received.” Opening the XML file in Internet Explorer will provide the same info.

So, I have two options for solving this:

  1. Open document.xml, go to “line 2, character 12464”.
  2. Open the document and search for “The website needs to be able to generate invoices and log when a payment is received.”

In either case, I have to take a look at the XML tags and see what isn’t coded properly.

Here we have a problem: the entire document is on line 2! Because there is an XML error, most development environments won’t reflow the XML for you until you find and kill the error. With everything on one long line, finding and killing the error is a lot harder than it should be.
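One way around the one-giant-line problem is to skip the editor altogether and print only the text around the reported position. This is a minimal sketch: the helper name context_at is hypothetical, and it assumes the reported line and column both count from 1, which I have not verified for Word.

```python
def context_at(xml_text, line, column, radius=40):
    """Return the text surrounding a 1-based (line, column) position."""
    target = xml_text.splitlines()[line - 1]
    index = column - 1  # convert to a 0-based offset within the line
    start = max(index - radius, 0)
    return target[start:index + radius]


# Stand-in data: a short first line, then one long second line with a
# marker character "X" at column 101.
sample = "<?xml version='1.0'?>\n" + "a" * 100 + "X" + "b" * 100
snippet = context_at(sample, line=2, column=101, radius=3)
print(snippet)
```

With the real file, the call would use the position from the error message, something like context_at(text, 2, 12464).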

Cheating

There are two ways to cheat. The first is to use Visual Studio or Visual Studio Express. Simply open the XML file, select all, copy the contents and then paste them into a new XML document. Visual Studio will indent the XML you are pasting, flowing it over as many lines as necessary rather than keeping it all on one line.

The line and character references from above are now meaningless, but the last parseable content clue is still valid. CTRL + F will bring up the search dialogue box. Enter the relevant bit of content and voilà, you are smack in the middle of a properly indented bit of bad code.
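If Visual Studio isn't to hand, a crude reflow can be approximated with a regular expression, since it doesn't care whether the XML is well formed. This is only a sketch for viewing and searching: the names reflow and lines_containing are made up for this example, the output has no indentation, and because it rewrites whitespace between tags the reflowed text should not be saved back into the document.

```python
import re


def reflow(xml_text):
    """Put a line break between adjacent tags so each lands on its own line."""
    return re.sub(r">\s*<", ">\n<", xml_text)


def lines_containing(xml_text, phrase):
    """Return (line number, line) pairs of the reflowed text that hold phrase."""
    reflowed = reflow(xml_text).splitlines()
    return [(n, line) for n, line in enumerate(reflowed, 1) if phrase in line]


# Stand-in one-line document (hypothetical, far shorter than the real thing).
sample = "<doc><p>Hello</p><p>log when a payment is received.</p></doc>"
hits = lines_containing(sample, "payment is received")
print(hits)
```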

The other way is to open the XML file in Chrome. Chrome will lie to you about the line and column where the error is, so ignore that. Most importantly, Chrome will tell you what the offending tag is. Novel concept. In this case, the offending tag is a hyperlink and Chrome is expecting it to close before the <p> tag.

Interestingly enough, right after character 12464 there is a <w:p> tag. Armed with this information I can reasonably deduce that the first hyperlink tag before character 12464 in document.xml is somehow at fault.
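The "first hyperlink tag before character 12464" hunt amounts to a backwards string search. Here is a small sketch of that idea; last_tag_before is a hypothetical helper, the sample string is invented, and a plain prefix match like this would also hit any longer tag name that merely starts with the same text.

```python
def last_tag_before(xml_text, tag, offset):
    """Return the index of the last '<tag' that starts before offset, or -1."""
    return xml_text.rfind("<" + tag, 0, offset)


# Stand-in fragment: a hyperlink is opened inside a paragraph and never
# closed before the paragraph ends.
sample = "<w:p><w:hyperlink><w:r/></w:p><w:p>"
error_offset = sample.index("</w:p>")  # where a parser would start complaining
found = last_tag_before(sample, "w:hyperlink", error_offset)
print(found, sample[found:found + 13])
```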

I deleted the tag, saved the XML, put it back into the zip file and renamed it to .docx. Open the file in either Word or Writer and, as if by magic, the document is whole again. Days of work have been saved.
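Putting the repaired XML back can be scripted too. The sketch below copies every part of the container into a new archive and swaps in the fixed one; replace_part and the sample contents are assumptions for illustration, and I did the actual repair by hand with a zip tool rather than with this code.

```python
import io
import zipfile


def replace_part(docx_bytes, part_name, new_content):
    """Return a copy of the container with one part's content replaced."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename == part_name:
                data = new_content
            else:
                data = src.read(info.filename)
            # Reusing the original entry keeps its name and metadata.
            dst.writestr(info, data)
    return output.getvalue()


# Hypothetical broken container and a hand-fixed document part.
broken = io.BytesIO()
with zipfile.ZipFile(broken, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<doc><p><hyperlink></p></doc>")

fixed = replace_part(broken.getvalue(), "word/document.xml", b"<doc><p></p></doc>")
with zipfile.ZipFile(io.BytesIO(fixed)) as archive:
    repaired_xml = archive.read("word/document.xml")
    repaired_names = archive.namelist()
print(repaired_names, repaired_xml)
```

Writing the returned bytes to a new .docx file, rather than over the original, leaves the broken copy around in case the fix needs another pass.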

Insidious error

The class of XML error described above is absolutely insidious. If you are the type of writer who obsessively saves documents, you are only digging your own grave. So long as the instance of the word processor that caused the error is open, the document will look and behave perfectly normally.

You could have created the error on page two, kept the document open ever since then, saving manually (or automatically) on a regular basis. So long as you never close the word processor you’ll just keep saving corrupted versions of the file with more and more data after the corruption point.

Even if you are saving to a cloud-based versioning file storage repository, it will only keep a maximum number of versions around. If you’re anything like me, you can keep documents open for weeks at a time; that error on page two can easily become embedded in every single version of the document saved for weeks, overwhelming the maximum version history and wiping out any chance of reverting.

I tried dozens of websites and tools that promised to be able to fix corrupted DOCX files. None of them worked. I eventually stumbled upon a blog post by developer Asaf Benyamin that set me on the right course. Here’s hoping that none of you ever have to put this knowledge to the test for yourselves. ®
