In this post, you will learn how to reduce the bandwidth cost for both the user and the server owner. But first we need to understand the issue a bit (if you already know all about it, you can jump to the tips at the end of the blog post).
Well-Known Location Doom Empire
A long time ago, in 1994, a badly behaved spidering program led to the introduction of robots.txt, which was quickly adopted by WebCrawler, Lycos and other search engines of the time. Web clients could now first inspect the robots.txt at the root of the Web site and skip indexing the sections that the site declared "not welcome" to Web spiders. This file was put at the root of the Web site, http://example.org/robots.txt. It is called a "well-known location" resource: the HTTP client expects to find something at that address when doing an HTTP GET.
Since then, unfortunately, many of these resources have been created. The issue is that they impose on server owners certain names that the owners might have wanted to use for something else. Let's say, as a Web site owner, I decide to create a Web page /contact at the root of my Web site. One day, a powerful company decides that it would be cool if everyone had a /contact with a dedicated format. I am then forced to adjust my own URI space to avoid a conflict with this new de facto popular practice. We usually say that it clutters the Web site namespace.
What are the other common resources which have been created since robots.txt?
- 1994: /robots.txt
- 1999: /favicon.ico
- 2002: /w3c/p3p.xml
- 2005: /sitemap.xml
- 2008: /crossdomain.xml
- 2008: /apple-touch-icon.png, /apple-touch-icon-precomposed.png
- 2011: /humans.txt
Note that if you would like to create a new well-known resource in the future, RFC 5785 (Defining Well-Known Uniform Resource Identifiers (URIs)) has been proposed specifically to address this issue: it reserves the /.well-known/ path prefix for such resources.
Bandwidth Waste
In terms of bandwidth, why could it be an issue? These files are requested most of the time by autonomous Web clients. When an HTTP client requests a resource which is not available on the HTTP server, the server sends back a 404 response. These responses can be very simple, light text or a full HTML page with a lot of code.
Google estimated that the waste of bandwidth generated by missing apple-touch-icon files on mobile was 3% to 4%. This means that the server is sending useless bits on the wire (a cost for the site owner) and the client is receiving them (a cost for the owner of the mobile device).
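To get a feeling for what your own server sends for these missing files, a small audit script can help. This is a minimal sketch using only the Python standard library; the function names `response_size` and `audit` are made up for illustration, and the list of paths is simply the one given above.

```python
import http.client

# The well-known locations listed earlier in this post.
WELL_KNOWN_PATHS = [
    "/robots.txt",
    "/favicon.ico",
    "/w3c/p3p.xml",
    "/sitemap.xml",
    "/crossdomain.xml",
    "/apple-touch-icon.png",
    "/apple-touch-icon-precomposed.png",
    "/humans.txt",
]


def response_size(host, path, port=80, timeout=10):
    """Send GET path to host and return (status code, body size in bytes)."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        body = response.read()
        return response.status, len(body)
    finally:
        conn.close()


def audit(host, port=80):
    """Map each well-known path to the (status, body size) the server returns."""
    return {path: response_size(host, path, port) for path in WELL_KNOWN_PATHS}
```

Calling `audit("example.org")` returns a dictionary in which every 404 with a large body size is a candidate for a lighter error response. Only the body is measured here, not the response headers.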
Is there a way to fix that? Maybe.
Let's Hack Around It
So instead of carrying the burden of putting every resource in place for each client, what about sending a very light 404 answer targeted at the Web clients that are requesting the resources we do not have on our own server?
Let's say, for the purpose of the demo, that only favicon.ico and robots.txt are available on your Web site. We then need to send a specialized light 404 for the rest of the possible resources.
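Independently of any particular server, the decision itself is tiny. Here is a minimal sketch in Python of the selection logic, assuming the demo setup above; the names `MISSING_WELL_KNOWN`, `PLAIN_404`, `FANCY_404` and `not_found_body` are hypothetical, and the HTML body is a stand-in for a real error page.

```python
# Well-known paths that this (hypothetical) site does not provide.
# robots.txt and favicon.ico exist in the demo, so they are not listed.
MISSING_WELL_KNOWN = frozenset({
    "/humans.txt",
    "/crossdomain.xml",
    "/w3c/p3p.xml",
    "/apple-touch-icon.png",
    "/apple-touch-icon-precomposed.png",
})

# Light error for autonomous clients, richer error for humans.
PLAIN_404 = b"NOT FOUND"
FANCY_404 = (
    b"<!DOCTYPE html>\n"
    b"<title>Not Found</title>\n"
    b"<p>Sorry, this page does not exist. Try the home page.</p>\n"
)


def not_found_body(path):
    """Return (content type, body) of the 404 response for a missing path."""
    if path in MISSING_WELL_KNOWN:
        return "text/plain", PLAIN_404
    return "text/html; charset=utf-8", FANCY_404
```

A request for /humans.txt gets the 9-byte plain text body, while a mistyped page URL still gets the HTML page meant for humans.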
Apache
With Apache, we can use the Location directive. This must be defined in the server configuration file httpd.conf or the virtual host configuration file. It cannot be defined in .htaccess.
<VirtualHost *:80>
    DocumentRoot "/somewhere/over/the/rainbow"
    ServerName example.org
    <Directory "/somewhere/over/the/rainbow">
        # Here some options
        # And your common 404 file
        ErrorDocument 404 /fancy-404.html
    </Directory>
    # your customized errors
    #<Location /robots.txt>
    #    ErrorDocument 404 /plain-404.txt
    #</Location>
    #<Location /favicon.ico>
    #    ErrorDocument 404 /plain-404.txt
    #</Location>
    <Location /humans.txt>
        ErrorDocument 404 /plain-404.txt
    </Location>
    <Location /crossdomain.xml>
        ErrorDocument 404 /plain-404.txt
    </Location>
    <Location /w3c/p3p.xml>
        ErrorDocument 404 /plain-404.txt
    </Location>
    <Location /apple-touch-icon.png>
        ErrorDocument 404 /plain-404.txt
    </Location>
    <Location /apple-touch-icon-precomposed.png>
        ErrorDocument 404 /plain-404.txt
    </Location>
</VirtualHost>
Here I commented out the robots.txt and favicon.ico blocks, since both files exist in the demo, but you can adjust this to your own needs and decide which requests get the light error.
The plain-404.txt is a very simple text file with just NOT FOUND inside, and the fancy-404.html is an HTML file helping humans to understand what is happening and inviting them to find their way on the site. The result is quite cool.
For a classic mistake, let's say a request to http://example.org/foba6365djh, we receive the HTML error.
GET /foba6365djh HTTP/1.1
Host: example.org

HTTP/1.1 404 Not Found
Content-Length: 1926
Content-Type: text/html; charset=utf-8
Date: Wed, 30 Jul 2014 05:30:33 GMT
ETag: "f7660-786-4e55273ef8a80;4ff4eb6306700"
Last-Modified: Sun, 01 Sep 2013 13:30:02 GMT

<!DOCTYPE html>
…
And then for a request to, let's say, http://example.org/crossdomain.xml, we get the plain light error message.
GET /crossdomain.xml HTTP/1.1
Host: example.org

HTTP/1.1 404 Not Found
Content-Length: 9
Content-Type: text/plain
Date: Wed, 30 Jul 2014 05:29:11 GMT

NOT FOUND
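The two Content-Length values above give a rough idea of the saving. The following back-of-the-envelope sketch in Python counts only the response bodies, not the headers; the function name `body_bytes_saved` and the request count in the usage note are hypothetical, not measurements.

```python
FANCY_404_BYTES = 1926  # Content-Length of the HTML error shown above
PLAIN_404_BYTES = 9     # Content-Length of the plain text error shown above


def body_bytes_saved(requests):
    """Body bytes saved when `requests` 404s use the plain error instead."""
    return requests * (FANCY_404_BYTES - PLAIN_404_BYTES)
```

Each light response saves 1917 bytes of body, so for an assumed 100,000 requests to missing well-known files, `body_bytes_saved(100_000)` gives 191,700,000 bytes, on both the server side and the client side.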
nginx
It is probably possible to do it for nginx too. Be my guest, I'll link your post from here.
Otsukare.