Some ideas and contexts around auto-discovering webcompat issues.
Recently Brian Grinstead asked me:
Are you familiar with this?
to which I answered: Yes, since 2018. And I remembered the challenges, so it is probably worth doing a bit of history on identifying webcompat issues. The objectives are often:
- How to massively test websites and their different renderings across browsers?
- How to reduce human time spent on manually testing the site?
- Can we discover the type of issues?
I have been doing webcompat work since October 2010 (when I started working at Opera Software with the amazing Opera devrel team). There's no perfect technique, but there are a couple of things you can try.
Screenshot Comparison
We often associate webcompat issues with sites which do not look the same in two different browsers. It's a simplistic approximation, but it can help with some types of webcompat issues.
- Mobile versus desktop
Some websites will adjust their content depending on the user agent string. They will either deliver specific content, or redirect to a domain which is friendly for mobile or desktop. This can be detected directly on the homepage of the website. You could quickly identify whether a site sends the same design/content to Firefox Android, Safari iOS or a Blink browser on Android. This is less and less meaningful, as many websites in the last ten years have switched to responsive design, where the content automatically adjusts depending on the size of the screen.
- Rendering Issues
This is slightly more complex. There might be multiple issues with regards to rendering; I'll talk about the caveats later. This could potentially identify a wrong color, a wrong position of the boxes, a difference in details such as scrollbars or border radius, etc.
With a simple list of URLs and the WebDriver API, it is possible to fetch websites in Gecko, WebKit and Blink and take a screenshot with each of them. It becomes very easy to test the top 1000 websites in a specific locale. You can then quickly discriminate visually which screenshots are different.
But we said we wanted to be more effective. We can use a bit of maths for this. Let's call s1 and s2 the screenshots we want to compare; then we can use a simple library like difflib in Python to compute the similarity of the images.
import difflib

def diff_ratio(s1, s2):
    # s1 and s2 are the raw bytes of the two screenshots
    s = difflib.SequenceMatcher(None, s1, s2)
    return s.quick_ratio()
Then it becomes easy to define the diff_ratio which is acceptable for the series of tests we run. Once a threshold is fixed, this will identify the sites with potential issues. It will not identify the type of issue, and it will not provide a diagnosis.
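To give an idea of how the pieces could fit together, here is a minimal sketch assuming Selenium WebDriver with local Firefox and Chrome installs (and their drivers); the URL list and the 0.95 threshold are placeholders which would need tuning against real data.

import difflib

from selenium import webdriver

THRESHOLD = 0.95  # arbitrary cut-off, to be tuned on real screenshots

def diff_ratio(s1, s2):
    # same diff_ratio as above, comparing the raw screenshot bytes
    s = difflib.SequenceMatcher(None, s1, s2)
    return s.quick_ratio()

def screenshot(driver, url):
    driver.get(url)
    return driver.get_screenshot_as_png()

urls = ["https://example.com/", "https://example.org/"]  # placeholder list
firefox = webdriver.Firefox()
chrome = webdriver.Chrome()

for url in urls:
    ratio = diff_ratio(screenshot(firefox, url), screenshot(chrome, url))
    if ratio < THRESHOLD:
        # renderings are suspiciously different: flag for manual review
        print(f"{url}: possible webcompat issue (ratio={ratio:.2f})")

firefox.quit()
chrome.quit()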
And the method has some limitations which are interesting to understand if we want to be effective in pre-filtering the issues.
Some Limitations Around Screenshot Comparison
The screenshots might be different, but that doesn't necessarily mean there is a webcompat issue. Here are some cases:
- Anti-Tracking Mechanisms
Every browser has its own strategy with regards to tracking protection. These are cases of browsers breaking websites on purpose to reduce user fingerprinting. Hence screenshots of the same site might give different results.
- A/B Testing
Some sites run A/B scenarios in search of a more profitable user experience. They will send two different versions of the site to different users. If one browser is in one pool and the other browser in another pool at the moment of the tests, the screenshots will be different.
- Android/iOS banner for apps
Testing the rendering between a browser on iOS and a browser on Android will create different results, as the banner for apps will display and link to different stores.
- Dynamic Content (News Sites/Social Networks)
There's a big category of websites where the content changes or rotates between each reload. Carousels, ads, news articles, user posts, etc. are all likely to modify the screenshots between two queries in the same browser.
- Tier 1
Some sites provide a different experience to different browsers. This one is more subtle to deal with, as it relies on a business decision. Compare for example the results of Google Search on Firefox Android and Google Chrome. Google Chrome definitely receives the Tier 1 experience; other browsers receive different content. The diagnosis here is not technical, but about business priorities.
Quick Summary About autowebcompat
autowebcompat, which Brian was mentioning, is a nice project from Marco Castelluccio to attempt to auto-detect web compatibility issues. Basically the code tries to learn whether screenshots for a similar set of interactions in two different browsers create the same end result. The silver lining being that if there is a difference, there is probably something to understand better. The project used the issues already reported on webcompat.com. In that sense it is already biased by the fact that the issues have already been identified as being different. But it makes it possible to train a model to learn what creates a webcompat issue.
Training A Bot To Identify Valid Issues
Recently, Ksenia (Mozilla Webcompat team) adjusted BugBug to make it work on GitHub. It helped the webcompat team to move away from the old ML classifier to the BugBug infrastructure.
It identifies already reported issues and closes the ones which have similar features to previous invalid bugs. Invalid here means not a webcompat issue: some sites are broken in all browsers, and that doesn't create a webcompat issue.
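To illustrate the general idea (this is not the actual BugBug pipeline), here is a rough sketch with scikit-learn and made-up training data: a text classifier trained on past valid/invalid reports, used to flag new reports that look like previous invalid ones.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# In reality the training data would be years of triaged webcompat.com
# issues; these few strings are only placeholders.
reports = [
    "layout broken in firefox but fine in chrome",
    "video does not play on firefox android",
    "site returns a 500 error for everyone",
    "page broken in every browser I tried",
]
labels = ["valid", "valid", "invalid", "invalid"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(reports, labels)

# a new report looking like past invalid ones could be closed automatically
print(classifier.predict(["menu unclickable in firefox, works in chrome"]))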
Compatipede, Another Project For Auto Webcompat
Compatipede is a project which predates autowebcompat (started in October 2013!) with the intent to identify more parameters and extend the scope of tests.
- Equal redirects
- CSS style compatibility
- Source code compatibility
- Other custom tests
This was quite interesting as it was trying to explore the unseen issues and avoid the pitfalls of screenshots.
It also had a modular architecture providing a system of custom plugins to run probes on the payloads sent by the website.
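To make the "equal redirects" idea concrete, here is a hypothetical probe in that spirit. This is not Compatipede's actual plugin API; it simply uses the Python requests library, and the user agent strings are placeholders.

import requests

FIREFOX_ANDROID_UA = "Mozilla/5.0 (Android 14; Mobile; rv:125.0) Gecko/125.0 Firefox/125.0"
CHROME_ANDROID_UA = (
    "Mozilla/5.0 (Linux; Android 14) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Mobile Safari/537.36"
)

def redirect_chain(url, user_agent):
    # list of URLs traversed before reaching the final page
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    return [r.url for r in response.history] + [response.url]

def equal_redirects(url):
    # True when both user agents go through the same chain of URLs
    return redirect_chain(url, FIREFOX_ANDROID_UA) == redirect_chain(url, CHROME_ANDROID_UA)

if __name__ == "__main__":
    print(equal_redirects("https://example.com/"))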
SiteCompTester
In the same spirit as Compatipede, SiteCompTester was an extension which made it possible to target some types of issues and would surface bugs associated with a specific list of known issues. This makes it easier to diagnose a website.
Template Extraction Mining
The variability of content may be avoided by using a mechanism such as templatemaker. This is a clever little tool which takes a series of texts, extracts their common features and produces a template.
So let's say for a news website, we could imagine running templatemaker with one browser for a couple of days and extracting its templates, and doing the same in parallel with another browser. Then we would compare the templates instead of comparing two unique renderings of the website. That would probably make it possible to have a better understanding of the variability of certain features. This could be applied to markup, to JavaScript, to HTTP headers.
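Here is a rough sketch of that idea, using only difflib rather than templatemaker itself; the snapshots below are placeholders for page sources that would be collected over a couple of days in each browser.

import difflib
from functools import reduce

def common_parts(a, b):
    # concatenation of the blocks shared by two snapshots
    matcher = difflib.SequenceMatcher(None, a, b)
    return "".join(a[block.a:block.a + block.size] for block in matcher.get_matching_blocks())

def extract_template(snapshots):
    # fold the pairwise common parts over all snapshots from one browser
    return reduce(common_parts, snapshots)

# placeholder page sources; the rotating story text drops out of the template
snapshots_firefox = ["<h1>News</h1><p>story A</p>", "<h1>News</h1><p>story B</p>"]
snapshots_chrome = ["<h1>News</h1><p>story C</p>", "<h1>News</h1><p>story D</p>"]

template_firefox = extract_template(snapshots_firefox)
template_chrome = extract_template(snapshots_chrome)
# compare the stable parts between the two browsers instead of raw pages
print(difflib.SequenceMatcher(None, template_firefox, template_chrome).ratio())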
Webcompat Auto-Detection Caveats
The issue with auto-detection of webcompat issues is that we don't know what is broken before someone experiences it in real life. The level of interaction required to trigger an issue makes this really delicate.
And it's why the people working on triaging and diagnosis in the Mozilla webcompat team are top-notch.
- Oana and Raul triage the issues, often starting from the poor descriptions given by most users.
- Ksenia, Dennis and Thomas relentlessly diagnose minified, obfuscated code to decipher what is breaking on the current site.
Auto-Discovery Of Webcompat
Auto-discovery may work in very specific use cases when we know what we are trying to identify as an issue. Let's say we have already identified a pattern in one bug and we want to understand to which extent this bug is affecting other websites. Then using a framework going through the sites and searching for this pattern might reveal potential webcompat issues.
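As a minimal sketch, assuming the requests library and a made-up pattern (here a crude regex for scripts branching on "Chrome" in the user agent string), such a search could look like this; it only scans the HTML payload, not the external scripts.

import re

import requests

# placeholder pattern and URL list for whatever bug was actually diagnosed
BREAKING_PATTERN = re.compile(r"userAgent[^\n]{0,40}(?:indexOf|includes)\(['\"]Chrome")
URLS = ["https://example.com/", "https://example.org/"]

for url in URLS:
    html = requests.get(url, timeout=10).text
    if BREAKING_PATTERN.search(html):
        print(f"{url}: pattern found, worth a manual check")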
Targeted surveys are the key to understanding the priority of some issues.
Otsukare!