Update: You may want to fast forward to the latest part… of this blog post. (Head explodes).
Thinking out loud on separating our images into a separate service. The initial goal was to push the images to the cloud, but I think we could probably have a first step: we could keep the images on our server, but instead of the current save, we could send them to another service, let's say upload.webcompat.com, with an HTTP PUT. And this service would save them locally.
That way it would allow us two things:
- Virtualize the core app on Heroku if needed
- Replace the microservice with another cloud hosting solution when we are ready
All of this is mainly thinking for now.
Anatomy of our environment
config/environment.py
defines:
UPLOADS_DEFAULT_DEST = os.environ.get('PROD_UPLOADS_DEFAULT_DEST')
UPLOADS_DEFAULT_URL = os.environ.get('PROD_UPLOADS_DEFAULT_URL')
The maximum limit for images is defined in __init__.py:
# set limit of 5.5MB for file uploads
# in practice, this is ~4MB (5.5 / 1.37)
# after the data URI is saved to disk
app.config['MAX_CONTENT_LENGTH'] = 5.5 * 1024 * 1024
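As a sanity check on that 1.37 factor: base64 encodes every 3 bytes of the image as 4 ASCII characters, so the data URI body is roughly a third bigger than the decoded file, and the 1.37 rule of thumb presumably rounds that up a bit for padding and the data: prefix:

import base64

raw = b'\x00' * (4 * 1024 * 1024)      # a 4MB decoded payload
encoded = base64.b64encode(raw)         # what travels as the data URI body
print(len(encoded) / len(raw))          # ~1.33
print(5.5 / 1.37)                       # ~4.0, the effective limit in MB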
Currently in views.py, there is a route for localhost uploads. This part would probably not change much: it is just for reading the image URLs.
if app.config['LOCALHOST']:
    @app.route('/uploads/<path:filename>')
    def download_file(filename):
        """Route just for local environments to send uploaded images.

        In production, nginx handles this without needing to touch the
        Python app.
        """
        return send_from_directory(
            app.config['UPLOADS_DEFAULT_DEST'], filename)
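For what it's worth, a quick local check of that route could look like this (the port and filename are made up, adjust them to your setup):

import requests

# Made-up port and filename; adjust to the local dev server.
resp = requests.get('http://localhost:5000/uploads/2019/11/some-uuid.png')
print(resp.status_code)  # 200 if the file exists under UPLOADS_DEFAULT_DEST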
Then the API for uploads is defined in api/uploads.py. This is where the production route is defined.
@uploads.route('/', methods=['POST'])
def upload():
    '''Endpoint to upload an image.

    If the image asset passes validation, it's saved as:
        UPLOADS_DEFAULT_DEST + /year/month/random-uuid.ext

    Returns a JSON string that contains the filename and url.
    '''
    …
    # cut some stuff.
    try:
        upload = Upload(imagedata)
        upload.save()
        data = {
            'filename': upload.get_filename(upload.image_path),
            'url': upload.get_url(upload.image_path),
            'thumb_url': upload.get_url(upload.thumb_path)
        }
        return (json.dumps(data), 201, {'content-type': JSON_MIME})
    except (TypeError, IOError):
        abort(415)
    except RequestEntityTooLarge:
        abort(413)
upload.save() is basically what we should replace with an HTTP PUT to a microservice.
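Roughly something like this sketch. Here upload.webcompat.com does not exist yet, and put_image() is a hypothetical helper, not actual webcompat.com code:

import requests

# Hypothetical service URL; the domain does not exist yet.
UPLOAD_SERVICE = 'https://upload.webcompat.com'


def put_image(image_path, imagedata, mimetype='image/png'):
    """PUT the image bytes to the upload microservice.

    image_path keeps the /year/month/random-uuid.ext structure,
    so the service saves files exactly where nginx expects them.
    """
    response = requests.put(
        '{}/{}'.format(UPLOAD_SERVICE, image_path),
        data=imagedata,
        headers={'Content-Type': mimetype})
    # The service would answer 201 Created once the bytes are on disk.
    return response.status_code == 201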
What is Amazon S3 doing?
In these musings, I wonder if we could mimic the way Amazon S3 operates, at a very high level. No need to replicate everything. We just need to save some bytes into a folder structure.
boto3 has documentation for uploading files:
import logging

import boto3
from botocore.exceptions import ClientError


def upload_file(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket

    :param file_name: File to upload
    :param bucket: Bucket to upload to
    :param object_name: S3 object name. If not specified then file_name is used
    :return: True if file was uploaded, else False
    """
    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Upload the file
    s3_client = boto3.client('s3')
    try:
        response = s3_client.upload_file(file_name, bucket, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    return True
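Calling it with our naming scheme would look like this (the local path and bucket name are invented for illustration):

# Invented local path and bucket name, for illustration only.
upload_file('/tmp/random-uuid.png', 'webcompat-uploads',
            object_name='2019/11/random-uuid.png')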
We could keep the image validation on the webcompat.com side; then, once the naming and checking are done, we can save the image through a service the same way AWS does. So our privileged service could accept images and save them locally, in the same folder structure, as a separate Flask app. And later on, we could adjust it to use S3.
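A minimal sketch of what that separate Flask app could look like. The route and names are invented, and a real deployment would need authentication on top:

import os

from flask import Flask, request

app = Flask(__name__)
# Same destination folder as the main app, read from the environment.
UPLOADS_DEFAULT_DEST = os.environ.get('PROD_UPLOADS_DEFAULT_DEST')


@app.route('/<path:image_path>', methods=['PUT'])
def save_image(image_path):
    """Save the PUT body under UPLOADS_DEFAULT_DEST/year/month/uuid.ext.

    Validation and naming already happened on the webcompat.com side.
    A real service must also sanitize image_path (e.g. reject '..').
    """
    full_path = os.path.join(UPLOADS_DEFAULT_DEST, image_path)
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    with open(full_path, 'wb') as dest:
        dest.write(request.get_data())
    return ('', 201)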
Surprise. Surprise.
I just found out that each time you put an image in an issue or a comment, GitHub makes a private copy of this image. Not sure if it's borderline with regard to ownership.
If you enter:
![I'm root](http://www.la-grange.net/2019/01/01/2535-misere)
Then it creates this markup.
<p><a target="_blank"
rel="noopener noreferrer"
href="https://camo.githubusercontent.com/a285646de4a7c3b3cdd3e82d599e46607df8d3cc/687474703a2f2f7777772e6c612d6772616e67652e6e65742f323031392f30312f30312f323533352d6d6973657265"><img
src="https://camo.githubusercontent.com/a285646de4a7c3b3cdd3e82d599e46607df8d3cc/687474703a2f2f7777772e6c612d6772616e67652e6e65742f323031392f30312f30312f323533352d6d6973657265"
alt="I'm root"
data-canonical-src="http://www.la-grange.net/2019/01/01/2535-misere"
style="max-width:100%;"></a></p>
And we can notice that the img src is pointing to… GitHub?
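The last path segment of that camo URL is simply the original image URL, hex-encoded:

# Decoding the hex tail of the camo URL gives back the original URL.
hex_tail = ('687474703a2f2f7777772e6c612d6772616e67652e6e6574'
            '2f323031392f30312f30312f323533352d6d6973657265')
print(bytes.fromhex(hex_tail).decode('ascii'))
# http://www.la-grange.net/2019/01/01/2535-misere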
I checked in my server logs to be sure. And I found…
140.82.115.251 - - [20/Nov/2019:06:44:54 +0000] "GET /2019/01/01/2535-misere HTTP/1.1" 200 62673 "-" "github-camo (876de43e)"
That will seriously challenge the OKR for this quarter.
Update: 2019-11-21 So I tried to decipher what was really happening. It seems GitHub acts as a proxy using camo, but it also has a caching system that keeps a real copy of the images, instead of just proxying them. And this can become a problem in the context of webcompat.com.
This passage from GitHub's own write-up of their CSP work hints at where those copies live:

"Early on, we had added s3.amazonaws.com to our connect-src since we had uses that were making requests to https://s3.amazonaws.com/github-cloud. However, this effectively opened up our connect-src to any Amazon S3 bucket. We refactored our URL generation and switched all call sites and our connect-src to use https://github-cloud.s3.amazonaws.com to reference our bucket."
GitHub is hosting the images on Amazon S3.
Otsukare!