
Floor Bored's Dev Blog

Started by Anonymous #F8CC
Anonymous #F8CC

As the title of this forum thread states... I've got a dev blog now, and it's this forum thread. Read it if you want to, don't read it if you don't, post on /mlp/ or PM me here if you have any questions, think what I wrote is stupid, or have improvement suggestions with regards to anything I've posted here.


For your trouble of reading this thread, have a Floorb.


[Image: Floor Bored (oc:floor bored) by artist:senaelik, 2383x2000]

Anonymous #F8CC

Updating Twibooru to support pastes

Adding support for pastes (and eventually, audio) has been an open issue on Twibooru's issue tracker since July 2020, which is when the issue tracker was started, and close to the initial release of the site.


Booru-on-Rails, like most imagebooru software that I know of, wasn't exactly built to support anything other than images and videos (which are really just a special case of images that happen to move.)

Goals

  • Add support for pastes, and make it easy to add other media types (eg: audio) later on.
  • Don't break the existing search flow, and try to seamlessly integrate pastes into it.
  • Don't break the existing upload flow, and try to seamlessly integrate pastes into it.

Process

There are many ways this could have been accomplished, some better than others. After careful consideration, I took roughly the following steps (some parts of some of the steps were interspersed with others due to accidental partial completion of eg: refactorings):

Preparing for different media types

1) Refactor the codebase to rename Images to Posts, both in the user-facing UI and the backend code / database.
2) Oops, now you've got a lot of conflicts because turns out there's something in the Forums called Posts too. Who'd have thought? Let's rename those to ForumPosts.
3) Now that the existing Images are all named Posts, extract the parts of Posts that are specific to Image posts to a separate database model called Images, and add appropriate associations in the code so that every Image belongs to a Post.
4) Move the actual Image-specific data in the database from Posts to Images and drop the now-redundant columns on the Posts table.
5) Update the code to pull the new data from the associated Images model instead of trying to get it from the Posts model, where it no longer exists.

Actually adding Pastes

1) Create a new Paste model that can also belong to Posts. Add a database field to Posts that specifies what type of media is associated with it (image, paste, audio).
2) Add appropriate checks to ensure that a Post can only have one type of media, and that it actually has an associated record for the same type of media that it declares, and no other.
3) Modify the code that handles uploads to determine if it's a paste or an image, and create the appropriate type of associated model / kick off only the appropriate processing jobs (we don't want to generate thumbnails for pastes, and we don't want to count the number of words in an image.)
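
The consistency check from step 2 can be sketched in plain Ruby. This is only an illustration, not the site's actual validation code: the Post struct, the media_type field name, and the media_errors helper are all made up for the example, and the real thing would live in the Rails models.

```ruby
# Hypothetical stand-in for the real Post model; field names are assumptions.
Post = Struct.new(:media_type, :image, :paste, keyword_init: true)

MEDIA_TYPES = %w[image paste].freeze

# Returns a list of error messages; an empty list means the post is consistent.
def media_errors(post)
  unless MEDIA_TYPES.include?(post.media_type)
    return ["unknown media type #{post.media_type.inspect}"]
  end

  errors = []
  MEDIA_TYPES.each do |type|
    record = post.public_send(type)
    if type == post.media_type
      # The declared media type must have an associated record.
      errors << "declares #{type} but has no #{type} record" if record.nil?
    elsif !record.nil?
      # No other media type may have a record attached.
      errors << "declares #{post.media_type} but also has a #{type} record"
    end
  end
  errors
end

paste_post = Post.new(media_type: 'paste', paste: { word_count: 120 })
puts media_errors(paste_post).empty?
```

A post that declares one media type but carries a record for another (or none at all) comes back with a non-empty error list.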


I hope this process sounds reasonable; all of it was fairly straightforward. However, there was one more issue left to solve, and that was...

Search

Twibooru utilizes ElasticSearch for most post loading and searching. The authoritative source of truth for the data is the main PostgreSQL database - ElasticSearch indices are updated from this data source, and can be (and frequently are) destroyed and re-created without any effect on the integrity of the data in the actual database. Search queries are parsed by a custom parser, mainly written by the previous maintainer of Booru-on-Rails (byte[]), and then fed to ElasticSearch to be answered. Returned records are then loaded from the Postgres database.


The ElasticSearch index mapping, which defines what fields are indexed by ElasticSearch and how they are indexed, is defined in the app's Ruby code. Previously, almost everything about a post was indexed in a flat manner; a massively simplified version might have looked like this:

mappings dynamic: false do
  indexes :id,    type: 'integer'
  # ... a million other things ...
  indexes :width, type: 'integer'
  # ... more fields omitted ...
end

There was nothing stopping me from maintaining this flat mapping, and I briefly considered it. However, I knew there was a better way to keep this data separated appropriately while also maintaining it in the same index, so only one ES index has to be hit to find any post on the site: nested mappings.


Again, over-simplified, it looks a little bit like this:

mappings dynamic: false do
  indexes :id, type: 'integer'

  indexes :image, type: :nested do
    indexes :width, type: 'integer'
    # ...
  end
end

Fields other than the ones we're talking about are omitted once again, for brevity.


Querying this looks a little bit different than what we might be used to. While a query for a root field might look a little bit like this:

{
  "bool": {
    "must": [
      { "term": { "width": 1024 } }
    ]
  }
}

a query for a nested field now looks like this:

{
  "nested": {
    "path": "image",
    "query": {
      "bool": {
        "must": [
          { "term": { "image.width": 1024 } }
        ]
      }
    }
  }
}

(Yeah, I know that you don't exactly need a bool query in these cases, but due to the way some of the queries are built, it's easier to just always use one.)


Luckily, the search parser already has a way built in to handle this... field_transforms!


field_transforms is a mapping of field names to lambda functions that return an appropriate query to query that field. When you do a search for width:1024, the code will ask field_transforms for the field transform for the width field, and it'll return a lambda function you can call as func(1024) (where 1024 is the search query) that returns an appropriate ElasticSearch query to query that field.
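
To make that a bit more concrete, here is a minimal sketch of what such a mapping could look like. It is not the real parser code: nested_image_term, FIELD_TRANSFORMS and query_for are hypothetical names, and only width and height are covered.

```ruby
require 'json'

# Hypothetical helper: returns a lambda that wraps a term query on a
# field of the nested image object.
def nested_image_term(field)
  lambda do |value|
    {
      nested: {
        path: 'image',
        query: { bool: { must: [{ term: { "image.#{field}" => value } }] } }
      }
    }
  end
end

# Field name => lambda that builds the query for that field.
FIELD_TRANSFORMS = %w[width height].map { |f| [f, nested_image_term(f)] }.to_h

# Fields without a transform fall back to a plain root-level term query.
def query_for(field, value)
  transform = FIELD_TRANSFORMS[field]
  return transform.call(value) if transform

  { bool: { must: [{ term: { field => value } }] } }
end

puts JSON.pretty_generate(query_for('width', 1024))
```

The caller never needs to know whether a field is nested; it asks for a query and gets back whichever shape is appropriate.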


The code automatically generates field transforms for the former root-level image fields (things like width, height, aspect ratio, etc) that were moved to a nested mapping. This allows the search to continue working exactly how it did before, from the perspective of an end user. There is also a little bit more magic done behind the scenes for fields that can apply to more than one nested object, eg: mime_type. A query is generated that searches both the nested image object and the nested paste object, as an OR query (SHOULD in ElasticSearch speak.)
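
For a field like mime_type, the generated query might be built along these lines. Again a sketch under assumptions: multi_nested_term is a made-up helper, and the explicit minimum_should_match is there only to make the "at least one side must match" intent obvious.

```ruby
# Hypothetical helper: builds one nested term query per path and ORs them
# together with a bool/should query.
def multi_nested_term(field, paths)
  lambda do |value|
    clauses = paths.map do |path|
      { nested: { path: path, query: { term: { "#{path}.#{field}" => value } } } }
    end
    { bool: { should: clauses, minimum_should_match: 1 } }
  end
end

mime_type = multi_nested_term('mime_type', %w[image paste])
query = mime_type.call('image/png')
puts query[:bool][:should].length
```

Each should clause targets one nested object, so a post matches if either its image or its paste has the requested MIME type.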


So, this worked pretty well in my testing, and when I deployed it to production, everything seemed to be working and I didn't receive any complaints.


Recently, however, one issue was reported to me that I had completely failed to consider...

Sorting

So, you can sort on different fields (eg: image width). Some of the sort fields were moved to nested fields. The sorting code does not take into account the nested fields or field_transforms in any way whatsoever. This went unnoticed until somebody tried to sort by width and got back a mysterious error. Let's fix it.


The sorting code is really simple right now. When you add something like ?sf=width&sd=desc to the HTTP query string, the code in the backend just generates something that looks a little like this:

{
  "sort": [
    {"width": "desc"}
  ]
}

Instead, what we really want to be generating, if the sort field is a nested one, might look a little like this:

{
  "sort": [
    {
      "image.width": {
        "order": "desc",
        "nested": {
          "path": "image"
        }
      }
    }
  ]
}

I added another method to the PostIndex, called nested_sorts, which returns something similar to field_transforms except for sorts. It's a little bit ugly, and I'm not going to share it here for that reason, but it does essentially the same thing; it returns a hash of field names to lambda functions, and you can call each of the lambda functions with the sorting direction to get back a query like the one shown above.
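
Since the real method isn't shown, here is a rough guess at the general shape, purely for illustration. The field list, the sort_clause helper and the defaulting of unknown directions to desc are all assumptions, not the actual PostIndex code.

```ruby
# Assumption: the sortable fields that were moved under the nested image object.
NESTED_IMAGE_SORT_FIELDS = %w[width height].freeze

# Field name => lambda that takes a direction and returns a nested sort clause.
def nested_sorts
  NESTED_IMAGE_SORT_FIELDS.map do |field|
    builder = lambda do |direction|
      { "image.#{field}" => { order: direction, nested: { path: 'image' } } }
    end
    [field, builder]
  end.to_h
end

# Builds the sort part of the request from the sf/sd query string values.
def sort_clause(sort_field, sort_direction)
  direction = sort_direction == 'asc' ? 'asc' : 'desc'
  builder = nested_sorts[sort_field]
  clause = builder ? builder.call(direction) : { sort_field => direction }
  { sort: [clause] }
end

p sort_clause('width', 'desc')
```

Root-level fields keep the old simple form, so only the fields that actually moved get the nested treatment.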

Wrapping up

Welp, that's about it! Search hopefully works on Twibooru again by the time you're reading this, including sorting by nested fields. Thank you for reading my random blabbering, and I hope you at least mildly enjoyed it.


Edited to add: Somebody asked me why I used nested and not just sub-objects. The answer is that searching breaks in subtle ways if you have an array of sub-objects (ElasticSearch flattens them, so a query can match values that come from different objects in the array), and future plans involve that, so I would much prefer to get it out of the way now rather than change everything later.
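
The breakage with arrays of sub-objects can be simulated without a search cluster. The sketch below only imitates the behaviour in plain Ruby: flatten_objects, flat_match? and nested_match? are invented for the example, and the two-file array is made-up data.

```ruby
# Imitates how a plain object field treats an array of sub-objects:
# each sub-field becomes one multi-valued field on the parent document.
def flatten_objects(objects)
  objects.each_with_object(Hash.new { |h, k| h[k] = [] }) do |obj, flat|
    obj.each { |key, value| flat[key] << value }
  end
end

# Flattened matching: each condition only has to match somewhere in the array.
def flat_match?(objects, conditions)
  flat = flatten_objects(objects)
  conditions.all? { |key, value| flat[key].include?(value) }
end

# Nested matching: a single sub-object has to satisfy every condition.
def nested_match?(objects, conditions)
  objects.any? { |obj| conditions.all? { |key, value| obj[key] == value } }
end

files = [{ width: 1024, height: 100 }, { width: 50, height: 768 }]
wanted = { width: 1024, height: 768 }

puts flat_match?(files, wanted)   # true: a false positive, no single file is 1024x768
puts nested_match?(files, wanted) # false
```

With a single sub-object per post the two approaches agree, which is why the problem only shows up once arrays are involved.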

Anonymous #F8CC

The Symlink Situation on Twibooru

Wow, that's a bit of an ominous title. Let's get out of the way what, exactly, I mean by that.


Talking with a friend of mine, it was recently brought to my attention that Twibooru makes a lot of symlinks to images on its local filesystem. Well, I knew this already, but I didn't really consider the implications of this quite fully.

Why all the symlinks?

Let's look at some code:

# VERSIONS_TO_GENERATE looks a little like this:
VERSIONS_TO_GENERATE = { thumb: [250, 250], ... }

# And somewhere else...
VERSIONS_TO_GENERATE.each_pair do |type, size|
  w, h = size
  dest_file = "#{dir}/#{type}.#{version_file_ext}"

  # Generate the version if it is smaller than the original in either
  # dimension; videos and GIFs always get generated versions.
  if w < model.image_width || h < model.image_height || processor.is_video? || model.image_mime_type == 'image/gif'
    processor.generate_version(size, dest_file)
  else
    # just link it
    platform_link(processor.rasterized, dest_file)
  end
end

Pretty simple, hopefully. A little messy, but it gets the job done. If a given image size/"version" we're about to generate is going to be LARGER than the original image file, skip generating it and just symlink the original image file to it, to save both CPU time and disk space, since it makes absolutely no sense to enlarge an image to create a thumbnail.


platform_link is simply a function that was added by the original Booru-on-Rails developers, which doesn't attempt to make a symlink on Windows because Windows doesn't support typical unix-style symlinks.

The better way

Why do we need to store all of these symlinks? They take up disk space (seriously! It's small, but it does add up when you have many thousands or even millions of them), slow down access (you have to dereference the symlink), and hurt cache hit rates (you're serving the exact same file twice with a different name.)


Well... We don't! What if, instead of creating these links at image thumbnail generation time, we instead defer determining which versions should exist to the point at which we actually generate URLs to those versions, and if a version is larger than the original image, we simply return a URL to the best fitting size?


And that's exactly what I did. The URL-generation code now contains a bit of code that looks like this:

urls = {
  full: "#{base_path}/full.#{file_ext}"
}

smallest_candidate = :full

VERSIONS_TO_GENERATE.each do |version, dimensions|
  # Are we requesting a version that is larger than the original image? If so, just return the next best fit (which may be the original image.)
  if dimensions[0] >= model.image_width && dimensions[1] >= model.image_height
    urls[version] = urls[smallest_candidate]
  else
    urls[version] = "#{base_path}/#{version}.#{file_ext}"
    smallest_candidate = version
  end
end

How does this look in practice?

The site's API shows the effect.


Here's all the versions, from the code, that should be served:

tall:        [1024, 4096],
large:       [1280, 1024],
medium:      [800, 600],
small:       [320, 240],
thumb:       [250, 250],
thumb_small: [150, 150],
thumb_tiny:  [50, 50]

Here's a nice, big image, with dimensions 2383x2000. https://twibooru.org/1740018.json

"representations": {
  "full":"https://cdn.twibooru.org/img/2020/7/20/1740018/full.png",
  "tall":"https://cdn.twibooru.org/img/2020/7/20/1740018/tall.png",
  "large":"https://cdn.twibooru.org/img/2020/7/20/1740018/large.png",
  "medium":"https://cdn.twibooru.org/img/2020/7/20/1740018/medium.png",
  "small":"https://cdn.twibooru.org/img/2020/7/20/1740018/small.png",
  "thumb":"https://cdn.twibooru.org/img/2020/7/20/1740018/thumb.png",
  "thumb_small":"https://cdn.twibooru.org/img/2020/7/20/1740018/thumb_small.png",
  "thumb_tiny":"https://cdn.twibooru.org/img/2020/7/20/1740018/thumb_tiny.png"
}

This is exactly how the API response would have looked for ANY image before my change, and exactly how it will continue to look for any image that is larger than every version size in at least one dimension.


Things get different if we pick an image that's a little smaller, like this one whose size is 350x350. https://twibooru.org/1223260.json

"representations": {
  "full":"https://cdn.twibooru.org/img/2020/7/18/1223260/full.png",
  "tall":"https://cdn.twibooru.org/img/2020/7/18/1223260/full.png",
  "large":"https://cdn.twibooru.org/img/2020/7/18/1223260/full.png",
  "medium":"https://cdn.twibooru.org/img/2020/7/18/1223260/full.png",
  "small":"https://cdn.twibooru.org/img/2020/7/18/1223260/small.png",
  "thumb":"https://cdn.twibooru.org/img/2020/7/18/1223260/thumb.png",
  "thumb_small":"https://cdn.twibooru.org/img/2020/7/18/1223260/thumb_small.png",
  "thumb_tiny":"https://cdn.twibooru.org/img/2020/7/18/1223260/thumb_tiny.png"
}

Would you look at that? If we consult our handy table of versions up there, you'll find that for every version which is bigger than the original image (medium and larger), the full version's URL is just returned instead. For every version which is actually smaller (small and smaller), we return a URL to a generated thumbnail.
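
If you want to check the table lookup yourself, the selection logic can be run standalone. This sketch reuses the version sizes listed above but returns version names instead of full URLs; version_files is a made-up name and the model/base_path plumbing is left out.

```ruby
# Version sizes as listed above, largest first.
VERSIONS = {
  tall: [1024, 4096], large: [1280, 1024], medium: [800, 600],
  small: [320, 240], thumb: [250, 250], thumb_small: [150, 150],
  thumb_tiny: [50, 50]
}.freeze

# Returns version name => name of the file that would actually be served.
def version_files(width, height)
  files = { full: :full }
  smallest_candidate = :full

  VERSIONS.each do |version, (max_w, max_h)|
    if max_w >= width && max_h >= height
      # The version would not shrink the image, so reuse the best fit so far.
      files[version] = files[smallest_candidate]
    else
      files[version] = version
      smallest_candidate = version
    end
  end
  files
end

p version_files(350, 350)
```

For 350x350 this maps tall, large and medium to full and leaves the smaller versions pointing at their own files, matching the response above.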


Note that the actual files that are being served have not changed in any way. Only the URLs. Previously, for this image, you would get served a URL such as https://cdn.twibooru.org/img/2020/7/18/1223260/medium.png for the medium version, but on the server side it would just see it as a link to the full version and serve you that file.


And, one more for completeness, a really tiny one, of size 8x8. https://twibooru.org/2144747.json


As you might expect, every URL served points at the full image:

"representations": {
  "full":"https://cdn.twibooru.org/img/2020/7/23/2144747/full.png",
  "tall":"https://cdn.twibooru.org/img/2020/7/23/2144747/full.png",
  "large":"https://cdn.twibooru.org/img/2020/7/23/2144747/full.png",
  "medium":"https://cdn.twibooru.org/img/2020/7/23/2144747/full.png",
  "small":"https://cdn.twibooru.org/img/2020/7/23/2144747/full.png",
  "thumb":"https://cdn.twibooru.org/img/2020/7/23/2144747/full.png",
  "thumb_small":"https://cdn.twibooru.org/img/2020/7/23/2144747/full.png",
  "thumb_tiny":"https://cdn.twibooru.org/img/2020/7/23/2144747/full.png"
}

So, yeah. That's why that's like that now. Any questions, as usual, poke me here or on the thread and I'll see what I can do!