Last night, I had the pleasure of reading Daniel Stenberg's blog post about URL Standards. It led me to the discussion happening on the WHATWG URL spec: "It's not immediately clear that "URL syntax" and "URL parser" conflict". As you can expect, the debate is inflammatory on both sides, borderline hypocritical on some occasions, and full of arguments I have seen over the last 20 years of following discussions around Web development.
This post has no intent to be the right way to talk about it. It's more a collection of impressions I had when reading the thread, with my baggage of ex-W3C staff, Web agency work, ex-Opera and now-Mozilla Web Compatibility work.
"Le chat a bon dos". French expression to basically say we are in the blaming game in that thread. Maybe not that useful.
What is happening?
- Deployed Web Content: Yes, there is a lot of broken content out there, and some of it will never be fixed, whatever effort you put into it. That's normal, and it is not broken per se. Think about abandoned editions of old dictionaries with mistakes in them. History and the fabric of time. What should happen? When a mistake is frequent enough, it is interesting to have a part of the parsing algorithm recover from it. The decision then becomes what "frequent enough" means. And that opens a new debate in itself, because it depends on countries, market shares, specific communities: everything a society can provide in terms of economy, social behavior, history, etc.
- Browsers: We can also often read in that thread that it's not the browsers' fault, it's because of the Web content. Well, that's not entirely true either. When a browser recovers from a previously-considered-broken pattern found on the Web, it just entrenches the pattern. Basically, it's not an act of saying "we need to be compatible with the deployed content" (aka not our fault). That would be a false pretense. It's an implementation decision which further drags the once-broken pattern into the normal patterns of the Web, a standardization process (a kind of jurisprudence). So basically it's about recognizing that this term, this pattern, is now part of the bigger picture. There's no such thing as saying: "It is good for people who decide to be compatible with browsers" (read "Join us or go to hell, I don't want to discuss with you."). There's a form of understandable escapism here, to hide a responsibility and to hide the burden of creating a community. It would be more exact to say "Yes, we make the decision that the Web should be this and not anything else." It doesn't make the discussion easier, but it's closer to the point of the power play in place.
- $BROWSER lord: In the discussion, the $BROWSER is Google's Chrome. A couple of years ago, it was IE. Again, saying Chrome has no specific responsibility is another escapism. The same way Safari has a lot of influence on the mobile Web, Chrome, by its market share, currently creates a tide which strongly influences the Web content and its patterns out there. I can guarantee that it's easier now for Chrome to be stricter with regard to syntax than it is for Edge or Firefox. Opera had to give up its rendering engine (Presto) because of this and switched to Blink.
There are different schools of thought for Web specifications:
- Standards defining a syntax considered ideal, leaving implementations free to recover with their own strategy when content is broken.
- Standards defining how to recover from all the possible ways content is mixed up. The intent is often to recover from a previously stricter syntax, but in the end it just defines, and expands, the possibilities.
- Standards defining a different policy for parsing and for producing, with certain nuances in between (a kind of Postel's law).
I sway between these three schools all the time. I don't like number 2 at all, but for survival it is sometimes necessary. My preferred way is number 3: a clear strict syntax for producing content, and a recovery technique for parsing. And when possible, I would prefer a sanitizer version of Postel's law.
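To make number 3 a bit more concrete, here is a minimal sketch of the split between a strict producing side and a liberal parsing side. Everything in it is an assumption for illustration: a toy scheme://host/path grammar, my own function names, and a single recovery rule. It is not the WHATWG URL algorithm, nor any browser's implementation.

```python
# Sketch of "school 3": strict when producing, liberal when parsing.
# Toy grammar and names are assumptions for illustration only.
import re

STRICT_URL = re.compile(
    r"^(?P<scheme>[a-z][a-z0-9+.-]*)://(?P<host>[^/\s]+)(?P<path>/[^\s]*)?$"
)

def produce(scheme: str, host: str, path: str = "/") -> str:
    """Producing side: be strict, refuse to emit anything non-conforming."""
    url = f"{scheme}://{host}{path}"
    if not STRICT_URL.match(url):
        raise ValueError(f"refusing to produce a malformed URL: {url!r}")
    return url

def parse(text: str) -> dict:
    """Parsing side: be liberal, recover from one common breakage (extra slashes)."""
    recovered = re.sub(r"^([a-z][a-z0-9+.-]*):/{2,}", r"\1://", text.strip())
    match = STRICT_URL.match(recovered)
    if match is None:
        raise ValueError(f"cannot recover {text!r}")
    return match.groupdict()

# produce("http", "example.com")     -> "http://example.com/"
# parse("http://////example.com/a")  -> {'scheme': 'http', 'host': 'example.com', 'path': '/a'}
```

The point is only the asymmetry: produce() refuses to emit anything it cannot guarantee, while parse() tries to repair before giving up, which is roughly the spirit of Postel's law.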
What did he say btw?
The implementation of a protocol must be robust. Each implementation must expect to interoperate with others created by different individuals. While the goal of this specification is to be explicit about the protocol there is the possibility of differing interpretations. In general, an implementation should be conservative in its sending behavior, and liberal in its receiving behavior. That is, it should be careful to send well-formed datagrams, but should accept any datagram that it can interpret (e.g., not object to technical errors where the meaning is still clear).
Then in RFC 1122, section 1.2.2, the Robustness Principle:
At every layer of the protocols, there is a general rule whose application can lead to enormous benefits in robustness and interoperability [IP:1]:
"Be liberal in what you accept, and conservative in what you send"
Software should be written to deal with every conceivable error, no matter how unlikely; sooner or later a packet will come in with that particular combination of errors and attributes, and unless the software is prepared, chaos can ensue. In general, it is best to assume that the network is filled with malevolent entities that will send in packets designed to have the worst possible effect. This assumption will lead to suitable protective design, although the most serious problems in the Internet have been caused by unenvisaged mechanisms triggered by low-probability events; mere human malice would never have taken so devious a course!
Adaptability to change must be designed into all levels of Internet host software. As a simple example, consider a protocol specification that contains an enumeration of values for a particular header field -- e.g., a type field, a port number, or an error code; this enumeration must be assumed to be incomplete. Thus, if a protocol specification defines four possible error codes, the software must not break when a fifth code shows up. An undefined code might be logged (see below), but it must not cause a failure.
The second part of the principle is almost as important: software on other hosts may contain deficiencies that make it unwise to exploit legal but obscure protocol features. It is unwise to stray far from the obvious and simple, lest untoward effects result elsewhere. A corollary of this is "watch out for misbehaving hosts"; host software should be prepared, not just to survive other misbehaving hosts, but also to cooperate to limit the amount of disruption such hosts can cause to the shared communication facility.
The important point in the discussion of Postel's law is that he is talking about software behavior, not specifications. The new school of thought for Web standards is to create specifications which are "software-driven", not "syntax-driven". And that's why you can read such entrenched debates about the technology.
My sanitizer version of Postel's law would be something along the lines of:
- Be liberal in what you accept
- Be conservative in what you send
- Make conservative what you accepted (aka fix it)
Basically, when you receive something broken and there is a clear path for fixing it, do it. Normalize it. In the debated version, about accepting http://////, it would be:
- parse it as http://////
- communicate it to the next step as http://
- possibly with an optional notification that it has been recovered.
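As a sketch of what that sanitizer step could look like in code, under the assumption that the only recovery needed here is collapsing the run of slashes after the scheme (the function name and the use of a Python warning as the "notification" are mine, not anything from the spec or the thread):

```python
# Sketch of the sanitizer step: accept the broken form, pass on the
# normalized form, and optionally signal that a recovery happened.
# Illustration only; the regex and names are assumptions, not spec text.
import re
import warnings

def sanitize_scheme_slashes(url: str, notify: bool = True) -> str:
    # Collapse any run of slashes right after the scheme down to exactly two.
    fixed = re.sub(r"^([a-zA-Z][a-zA-Z0-9+.-]*):/{2,}", r"\1://", url)
    if notify and fixed != url:
        warnings.warn(f"recovered {url!r} as {fixed!r}")
    return fixed

# sanitize_scheme_slashes("http://////")  ->  "http://"  (with a warning)
```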
Otsukare!