
Saving Webcompat images as a microservice

Update: You may want to fast forward to the latest part… of this blog post. (Head explodes).

Thinking out loud on separating our images into a separate service. The initial goal was to push the images to the cloud, but I think we could probably have a first step. We could keep the images on our server, but instead of the current save, we could send them to another service, let's say upload.webcompat.com, with an HTTP PUT. And this service would save them locally.

That way, it would allow us to do two things:

  1. Virtualize the core app on Heroku if needed
  2. Replace the microservice with another cloud hosting solution when we are ready.

All of this is mainly thinking for now.

Anatomy of our environment

config/environment.py defines:

UPLOADS_DEFAULT_DEST = os.environ.get('PROD_UPLOADS_DEFAULT_DEST')
UPLOADS_DEFAULT_URL = os.environ.get('PROD_UPLOADS_DEFAULT_URL')

The maximum limit for image uploads is defined in __init__.py:

# set limit of 5.5MB for file uploads
# in practice, this is ~4MB (5.5 / 1.37)
# after the data URI is saved to disk
app.config['MAX_CONTENT_LENGTH'] = 5.5 * 1024 * 1024
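
For intuition on that 1.37 factor: base64 encodes every 3 bytes of image data as 4 ASCII characters, so a data URI is at least a third larger than the decoded file. A quick sanity check:

import base64

# 3 raw bytes become 4 base64 characters, so the encoded form is ~1.33x
# the decoded size; the remaining margin covers the data URI prefix.
raw = b'\xff' * 3_000_000
encoded = base64.b64encode(raw)
print(len(encoded) / len(raw))  # 1.333...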

Currently in views.py, there is a route for serving uploaded images on localhost. This part would probably not change much. It is just for reading the image URLs.

if app.config['LOCALHOST']:
    @app.route('/uploads/<path:filename>')
    def download_file(filename):
        """Route just for local environments to send uploaded images.

        In production, nginx handles this without needing to touch the
        Python app.
        """
        return send_from_directory(
            app.config['UPLOADS_DEFAULT_DEST'], filename)

Then the API for uploads is defined in api/uploads.py.

This is where the production route is defined.

@uploads.route('/', methods=['POST'])
def upload():
    '''Endpoint to upload an image.

    If the image asset passes validation, it's saved as:
        UPLOADS_DEFAULT_DEST + /year/month/random-uuid.ext

    Returns a JSON string that contains the filename and url.
    '''
    
    # cut some stuff.
    try:
        upload = Upload(imagedata)
        upload.save()
        data = {
            'filename': upload.get_filename(upload.image_path),
            'url': upload.get_url(upload.image_path),
            'thumb_url': upload.get_url(upload.thumb_path)
        }
        return (json.dumps(data), 201, {'content-type': JSON_MIME})
    except (TypeError, IOError):
        abort(415)
    except RequestEntityTooLarge:
        abort(413)

upload.save() is basically what we should replace with an HTTP PUT to a microservice.
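
As a rough sketch of what that could look like, here is a client-side version using the requests library. The upload.webcompat.com endpoint, its /uploads/ path, and the save_remote() helper are all assumptions for illustration, not an existing service:

import requests

UPLOAD_SERVICE = 'https://upload.webcompat.com/uploads/'


def save_remote(image_path, image_bytes, mimetype):
    """PUT the image bytes to the (hypothetical) upload microservice.

    Returns the parsed JSON body (filename, url, thumb_url) on success.
    """
    response = requests.put(
        UPLOAD_SERVICE + image_path,
        data=image_bytes,
        headers={'Content-Type': mimetype},
        timeout=10)
    response.raise_for_status()
    return response.json()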

What is Amazon S3 doing?

In these musings, I wonder if we could mimic the way Amazon S3 operates at a very high level. No need to replicate everything. We just need to save some bytes into a folder structure.

boto3 has documentation for uploading files.

import logging

import boto3
from botocore.exceptions import ClientError


def upload_file(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket

    :param file_name: File to upload
    :param bucket: Bucket to upload to
    :param object_name: S3 object name. If not specified then file_name is used
    :return: True if file was uploaded, else False
    """

    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Upload the file
    s3_client = boto3.client('s3')
    try:
        response = s3_client.upload_file(file_name, bucket, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    return True
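
Calling it is then a one-liner; the bucket and object names below are made up for the example:

# Hypothetical bucket and key, mirroring the year/month/uuid.ext layout.
upload_file('/tmp/image.png', 'webcompat-uploads', '2019/11/some-uuid.png')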

We could keep the image validation on the webcompat.com side, where the naming and checking are already done. Then we could save the file to a service the same way AWS does.

So our privileged service could accept images and save them locally, in the same folder structure, as a separate Flask app. And later on, we could adjust it to use S3.
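
To make that concrete, here is a minimal sketch of such a service as a Flask app with a PUT route. The route, the UPLOADS_ROOT path, and the lack of validation are all assumptions for illustration:

import os

from flask import Flask, jsonify, request

app = Flask(__name__)
UPLOADS_ROOT = '/data/uploads'  # hypothetical destination folder


@app.route('/uploads/<path:filename>', methods=['PUT'])
def save_image(filename):
    """Save the request body under the year/month/uuid.ext layout.

    Real code would validate the filename and the image bytes first.
    """
    destination = os.path.join(UPLOADS_ROOT, filename)
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    with open(destination, 'wb') as image_file:
        image_file.write(request.get_data())
    return jsonify({'filename': filename}), 201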

Surprise. Surprise.

I just found out that each time you put an image in an issue or a comment, GitHub makes a private copy of this image. Not sure if it's borderline with regard to property.

If you enter:

![I'm root](http://www.la-grange.net/2019/01/01/2535-misere)

Then it creates this markup.

<p><a target="_blank"
       rel="noopener noreferrer"
       href="https://camo.githubusercontent.com/a285646de4a7c3b3cdd3e82d599e46607df8d3cc/687474703a2f2f7777772e6c612d6772616e67652e6e65742f323031392f30312f30312f323533352d6d6973657265"><img
             src="https://camo.githubusercontent.com/a285646de4a7c3b3cdd3e82d599e46607df8d3cc/687474703a2f2f7777772e6c612d6772616e67652e6e65742f323031392f30312f30312f323533352d6d6973657265"
             alt="I'm root"
             data-canonical-src="http://www.la-grange.net/2019/01/01/2535-misere"
             style="max-width:100%;"></a></p>

And we can notice that the img src is pointing to… GitHub?
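
In fact, the long hex string at the end of the camo URL is just the original image URL, hex-encoded. A quick check in Python:

# Decode the trailing hex segment of the camo URL above.
camo_hex = ('687474703a2f2f7777772e6c612d6772616e67652e6e6574'
            '2f323031392f30312f30312f323533352d6d6973657265')
print(bytes.fromhex(camo_hex).decode('ascii'))
# http://www.la-grange.net/2019/01/01/2535-misere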

I checked in my server logs to be sure. And I found…

140.82.115.251 - - [20/Nov/2019:06:44:54 +0000] "GET /2019/01/01/2535-misere HTTP/1.1" 200 62673 "-" "github-camo (876de43e)"

That will seriously challenge the OKR for this quarter.

Update: 2019-11-21 So I tried to decipher what was really happening. It seems GitHub acts as a proxy using camo, but it also has a caching system that keeps a real copy of the images instead of just proxying them. And this can become a problem in the context of webcompat.com.

As GitHub explained about their own CSP configuration: "Early on, we had added s3.amazonaws.com to our connect-src since we had uses that were making requests to https://s3.amazonaws.com/github-cloud. However, this effectively opened up our connect-src to any Amazon S3 bucket. We refactored our URL generation and switched all call sites and our connect-src to use https://github-cloud.s3.amazonaws.com to reference our bucket."

GitHub is hosting the images on Amazon S3.

Otsukare!