DOCX disaster recovery: How I rescued my wife from XM-HELL
Something strange in your closing tag, who you gonna call?
Sysadmin blog What do you do when a critical Word document won’t open? Even in today’s world of versioned documents, it is entirely possible for corruption to squeak in and go unnoticed, wrecking your entire version history.
But all is not lost. My wife had this happen to her; here’s how we solved it.
Real world example
In my case, Word wouldn’t open an important file, dying instead with the error “the name in the end tag of the element must match the element type in the start tag”. Translated from Microsoftese: “The word processor that created this document made an XML boo-boo, and Word is going to refuse to read this document now.”
The most common kinds of XML boo-boo that word processors make are saving tags out of order (the most famous example being Microsoft Word’s oMath tags error) and opening a tag but not closing it. Today’s issue was the latter. The wife was using an old version of LibreOffice Writer (v4.1) and had made several changes to hyperlinks in one area of the document. Writer got confused somehow and opened a hyperlink tag, but didn’t actually put in any information as to where it was hyperlinking to, and didn’t close the tag.
What should be noted here is that Writer and Word behave very differently with this broken file. Writer will open the file, but simply stop processing the document around where the XML stops making sense. Word will vomit that error and die. Both are useful.
Fixing the error
To understand how Microsoft’s DOCX (and similar OOXML formats) can go awry, we need to understand a little about the structure of these documents. What’s important to note is that OOXML documents are actually zip files, with all the goodies packed up in various XML files like creamy filling. This generally means that if we rename the file to something.zip and open it in 7-Zip, we can get at the XML inside and fix the document.
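If you would rather not rename anything, the same peek inside the container can be done from a script. This is a minimal sketch using Python’s standard zipfile module; the function names list_parts and read_part are made up for illustration, and word/document.xml is assumed to be the part you care about.

```python
import zipfile


def list_parts(source):
    """Return the names of every part stored inside a DOCX (zip) container.

    `source` may be a file path or a file-like object opened in binary mode.
    """
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def read_part(source, part_name="word/document.xml"):
    """Return the raw bytes of one part, by default the main document body."""
    with zipfile.ZipFile(source) as archive:
        return archive.read(part_name)
```

Calling list_parts("report.docx") on a hypothetical file would return the names of the XML parts packed inside it, with no renaming required.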
What we need to know is what XML file within the DOCX is causing the problem and where to look for it. When Word blew up trying to open the DOCX it gave up that information. “Location: Part: /word/document.xml, Line: 2, Column: 12464.”
The error in question
If you want the exact position of the error but don’t have Word handy, open each XML in Firefox. Firefox will let you know when one of the XMLs isn’t parsing properly, and will tell you what line and character, just like Word.
Writer also gives you the information by displaying the last parseable information in the document before it stops making sense. In this case, the last words Writer could parse were “The website needs to be able to generate invoices and log when a payment is received.” Opening the XML file in Internet Explorer will provide the same info.
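If neither Word nor a browser is to hand, a short script can produce the same kind of position report. What follows is a sketch only, assuming Python’s standard expat parser; find_xml_errors is a hypothetical name, and the column it reports may not line up exactly with the number Word gives you.

```python
import zipfile
import xml.parsers.expat


def find_xml_errors(source):
    """Parse every XML part in a DOCX and report the ones that fail.

    Returns a list of (part_name, line, column, message) tuples.
    `source` may be a file path or a binary file-like object.
    """
    problems = []
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            # OOXML parts of interest end in .xml or .rels; skip images etc.
            if not name.endswith((".xml", ".rels")):
                continue
            parser = xml.parsers.expat.ParserCreate()
            try:
                parser.Parse(archive.read(name), True)
            except xml.parsers.expat.ExpatError as err:
                message = xml.parsers.expat.ErrorString(err.code)
                problems.append((name, err.lineno, err.offset, message))
    return problems
```

An empty list means every part parsed cleanly, in which case the problem probably lies somewhere other than malformed XML.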
So, I have two options for solving this:
- Open document.xml, go to “line 2, character 12464”.
- Open the document and search for “The website needs to be able to generate invoices and log when a payment is received.”
In either case, I have to take a look at the XML tags and see what isn’t coded properly.
Here we have a problem: the entire document is on line 2! Because there is an XML error, most development environments won’t reflow the XML for you until you find and kill the error. With everything on one long line, finding and killing the error is a lot harder than it should be.
There are two ways to cheat. The first is to use Visual Studio or Visual Studio Express. Simply open the XML file, select all, copy the contents and then paste them into a new XML document. Visual Studio will indent the XML you are pasting, flowing it over as many lines as necessary rather than keeping it all on one line.
The line and character references from above are now meaningless, but the last parseable content clue is still valid. CTRL + F will bring up the search dialogue box, enter the relevant bit of content and voilà, you are smack in the middle of a properly indented bit of bad code.
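The reflow-then-search cheat does not strictly need Visual Studio. Here is a rough sketch in Python that breaks the one-line XML apart at tag boundaries without parsing it (so the error does not get in the way) and then looks for the last readable bit of content. The names reflow and find_phrase are invented for this example. Treat the reflowed text as something to read, not something to save back into the document.

```python
def reflow(xml_text):
    """Split one-line XML into roughly one tag per line, without parsing it.

    Only a newline is inserted between adjacent tags ("><"), so text that
    sits between an opening and closing tag stays on a single line.
    """
    return xml_text.replace("><", ">\n<").splitlines()


def find_phrase(lines, phrase):
    """Return the 1-based numbers of the reflowed lines containing `phrase`."""
    return [number for number, line in enumerate(lines, start=1)
            if phrase in line]
```

One caveat: a word processor may split a sentence across several text runs, so if the full sentence is not found, searching for a shorter fragment of it is worth a try.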
The other way is to open the XML file in Chrome. Chrome will lie to you about the line and column where the error is, so ignore that. Most importantly, Chrome will tell you what the offending tag is. Novel concept. In this case, the offending tag is a hyperlink and Chrome is expecting it to close before the <p> tag.
Interestingly enough, right after character 12464 there is a <w:p> tag. Armed with this information I can reasonably deduce that the first hyperlink tag before character 12464 in document.xml is somehow at fault.
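The deduction above can also be automated, more or less. The sketch below is a deliberately naive tag-stack scanner in Python, not a real XML parser: it assumes no comments or CDATA sections and no “>” characters inside attribute values. The name first_mismatch is hypothetical. It walks the tags in order and reports the first opening tag that is still open when a different closing tag turns up.

```python
import re

# Matches opening, closing and self-closing tags; skips <?xml ...?> and <!...>.
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.\-]*)[^<>]*?(/?)>")


def first_mismatch(xml_text):
    """Return (open_tag, open_offset, close_tag, close_offset) for the first
    mismatch found, or None if every tag is properly paired.

    open_tag is None for a stray closing tag; close_tag is None when a tag
    is still open at the end of the text. Offsets are 0-based characters.
    """
    stack = []
    for match in TAG.finditer(xml_text):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue  # <w:br/> opens and closes itself
        if not closing:
            stack.append((name, match.start()))
            continue
        if not stack:
            return (None, None, name, match.start())
        open_name, open_offset = stack.pop()
        if open_name != name:
            return (open_name, open_offset, name, match.start())
    if stack:
        open_name, open_offset = stack[-1]
        return (open_name, open_offset, None, len(xml_text))
    return None
```

On a file broken the way this one was, the first element of the result would be the unclosed hyperlink tag, and the second would be the character offset at which to go looking for it.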
I deleted the tag, saved the XML, put it back into the zip file and renamed it to .docx. Open the file in either Word or Writer and, as if by magic, the document is whole again. Days of work have been saved.
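For those who prefer to script the repacking step, here is a minimal sketch in Python using the standard zipfile module. The function name replace_part is made up for illustration. It copies every part into a new archive and swaps in the repaired XML for one of them; writing to a new file rather than over the original is my own precaution, so the broken copy survives if the fix turns out to be wrong.

```python
import zipfile


def replace_part(source, target, part_name, new_bytes):
    """Copy a DOCX to `target`, replacing one part with `new_bytes`.

    `source` and `target` may be file paths or binary file-like objects.
    Every other part is copied across unchanged, keeping its original
    compression method.
    """
    with zipfile.ZipFile(source) as src, zipfile.ZipFile(target, "w") as dst:
        for info in src.infolist():
            if info.filename == part_name:
                data = new_bytes
            else:
                data = src.read(info.filename)
            dst.writestr(info.filename, data,
                         compress_type=info.compress_type)
```

A call such as replace_part("broken.docx", "fixed.docx", "word/document.xml", repaired_bytes), with hypothetical file names, leaves the original untouched and produces a new file that can be opened straight away.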
The class of XML error described above is absolutely insidious. If you are the type of writer who obsessively saves documents, you are only digging your own grave. So long as the instance of the word processor that caused the error is open, the document will look and behave perfectly normally.
You could have created the error on page two and kept the document open ever since, saving manually (or automatically) on a regular basis. So long as you never close the word processor, you’ll just keep saving corrupted versions of the file with more and more data after the corruption point.
Even if you are saving to a cloud-based file storage service with versioning, it will only keep a limited number of versions around. If you’re anything like me, you can keep documents open for weeks at a time; that error on page two can easily become embedded in every single version of the document saved during those weeks, exhausting the version history and wiping out any chance of reverting.
I tried dozens of websites and tools that promised to be able to fix corrupted DOCX files. None of them worked. I eventually stumbled upon a blog post by developer Asaf Benyamin that set me on the right course. Here’s hoping that none of you ever have to put this knowledge to the test for yourselves. ®