Updating Twibooru to support pastes
Adding support for pastes (and eventually, audio) has been an open issue on Twibooru's issue tracker since July 2020, which is when the issue tracker was started, and close to the initial release of the site.
Booru-on-Rails, like most imagebooru software that I know of, wasn't exactly built to support anything other than images and videos (which are really just a special case of images that happen to move.)
Goals
- Add support for pastes, and make it easy to add other media types (e.g. audio) later on.
- Don't break the existing search flow, and try to seamlessly integrate pastes into it.
- Don't break the existing upload flow, and try to seamlessly integrate pastes into it.
Process
There are many ways this could have been accomplished, some better than others. After careful consideration, I took roughly the following steps (some parts of some of the steps were interspersed with others due to accidental partial completion of eg: refactorings):
Preparing for different media types
1) Refactor the codebase to rename Images to Posts, both in the user-facing UI and the backend code / database.
2) Oops, now you've got a lot of conflicts because turns out there's something in the Forums called Posts too. Who'd have thought? Let's rename those to ForumPosts.
3) Now that the existing Images are all named Posts, extract the parts of Posts that are specific to Image posts to a separate database model called Images, and add appropriate associations in the code so that every Image belongs to a Post.
4) Move the actual Image-specific data in the database from Posts to Images and drop the now-redundant columns on the Posts table.
5) Update the code to pull the new data from the associated Images model instead of trying to get it from the Posts model, where it no longer exists.
Actually adding Pastes
1) Create a new Paste model that can also belong to Posts. Add a database field to Posts that specifies what type of media is associated with it (image, paste, audio).
2) Add appropriate checks to ensure that a Post can only have one type of media, and that it actually has an associated record for the same type of media that it declares, and no other.
3) Modify the code that handles uploads to determine if it's a paste or an image, and create the appropriate type of associated model / kick off only the appropriate processing jobs (we don't want to generate thumbnails for pastes, and we don't want to count the number of words in an image.)
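To make steps 2 and 3 a bit more concrete, here is a minimal plain-Ruby sketch. It is not the actual Twibooru code: the Post struct stands in for the real Rails model, and the names media_errors, plan_upload, count_words and generate_thumbnails are hypothetical. Deciding the media type from the upload's content type is also an assumption made for the example.

```ruby
# Hypothetical list of supported media types (audio would be added here later).
MEDIA_TYPES = %w[image paste].freeze

# Plain-Ruby stand-in for the Post model; the real app uses Rails models.
Post = Struct.new(:media_type, :image, :paste, keyword_init: true) do
  # Returns a list of problems; an empty list means the post is consistent.
  def media_errors
    return ["unknown media type: #{media_type.inspect}"] unless MEDIA_TYPES.include?(media_type)

    errors = []
    MEDIA_TYPES.each do |type|
      record = public_send(type)
      if type == media_type
        # The declared type must have its associated record.
        errors << "missing #{type} record" if record.nil?
      elsif !record.nil?
        # No other media type may be attached to the same post.
        errors << "unexpected #{type} record on a #{media_type} post"
      end
    end
    errors
  end
end

# Picks the media type for an upload and lists only the jobs that apply to it.
# Assumption: text uploads are pastes, everything else is treated as an image.
def plan_upload(upload)
  if upload[:content_type].to_s.start_with?('text/')
    { media_type: 'paste', jobs: [:count_words] }
  else
    { media_type: 'image', jobs: [:generate_thumbnails] }
  end
end
```

A post that declares itself a paste but only has an image attached would fail both checks, and a text upload would never be scheduled for thumbnailing.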
I hope this process sounds reasonable; all of it was relatively straightforward. However, there was one more issue left to solve, and that was...
Search
Twibooru utilizes ElasticSearch for most post loading and searching. The authoritative source of truth for the data is the main PostgreSQL database - ElasticSearch indices are updated from this data source, and can be (and frequently are) destroyed and re-created without any effect on the integrity of the data in the actual database. Search queries are parsed by a custom parser, mainly written by the previous maintainer of Booru-on-Rails (byte[]), and then fed to ElasticSearch to be answered. Returned records are then loaded from the Postgres database.
The ElasticSearch index mapping, which defines what fields are indexed by ElasticSearch and how they are indexed, is defined in the app's Ruby code. Previously, almost everything about a post was indexed in a flat manner; a massively simplified version might have looked like this:
    mappings dynamic: false do
      indexes :id, type: 'integer'
      # ... a million other things ...
      indexes :width, type: 'integer'
      # ... more fields omitted ...
    end
There was nothing stopping me from maintaining this flat mapping, and I briefly considered it. However, I knew there was a better way to keep this data separated appropriately while also maintaining it in the same index, so only one ES index has to be hit to find any post on the site: nested mappings.
Again, over-simplified, it looks a little bit like this:
    mappings dynamic: false do
      indexes :id, type: 'integer'
      indexes :image, type: :nested do
        indexes :width, type: 'integer'
        # ...
      end
    end
Fields other than the ones we're talking about are omitted once again, for brevity.
Querying this looks a little bit different than what we might be used to. While a query for a root field might look a little bit like this:
    {
      "bool": {
        "must": [
          { "term": { "width": 1024 } }
        ]
      }
    }
a query for a nested field now looks like this:
    {
      "nested": {
        "path": "image",
        "query": {
          "bool": {
            "must": [
              { "term": { "image.width": 1024 } }
            ]
          }
        }
      }
    }
(Yeah, I know that you don't exactly need a bool query in these cases, but due to the way some of the queries are built, it's easier to just always use one.)
Luckily, the search parser already has a way built in to handle this... field_transforms!
field_transforms is a mapping of field names to lambda functions, each of which returns an appropriate query for that field. When you do a search for width:1024, the code will ask field_transforms for the field transform for the width field, and it'll return a lambda function you can call as func.call(1024) (where 1024 is the search query) that returns an appropriate ElasticSearch query to query that field.
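As a rough illustration of that idea, a field transform for a nested image field could be sketched like this. This is not the real Booru-on-Rails code: NESTED_IMAGE_FIELDS and the nested_term helper are names made up for the example, and the real field_transforms covers far more fields than the three assumed here.

```ruby
require 'json'

# Hypothetical list of fields that moved under the nested image object.
NESTED_IMAGE_FIELDS = %w[width height aspect_ratio].freeze

# Builds a lambda that wraps a term query on "path.field" in a nested query.
def nested_term(path, field)
  lambda do |value|
    {
      nested: {
        path: path,
        query: { bool: { must: [{ term: { "#{path}.#{field}" => value } }] } }
      }
    }
  end
end

# Maps each user-facing field name to a lambda that builds its query.
def field_transforms
  NESTED_IMAGE_FIELDS.to_h { |field| [field, nested_term('image', field)] }
end

# A search for width:1024 looks up the transform and calls it with the value.
func = field_transforms['width']
puts JSON.pretty_generate(func.call(1024))
```

The user still types width:1024; only the generated query knows that the field now lives under image.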
The code automatically generates field transforms for the previous root image fields (things like width, height, aspect ratio, etc) that were moved to a nested mapping. This allows the search to continue working exactly how it did before, from the perspective of an end user. There is also a little bit more magic done behind the scenes for fields that can apply to more than one nested query, eg: mime_type. A query is generated that searches both the nested image object and the nested paste object, as an OR query (SHOULD in ElasticSearch speak.)
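For a field like mime_type, the generated query could look roughly like the output of the sketch below. The helper names nested_term_query and multi_nested_term are hypothetical, and the explicit minimum_should_match is my own addition to make the OR behaviour obvious; the actual code may build this differently.

```ruby
# Builds one nested term query for a field under the given path.
def nested_term_query(path, field, value)
  { nested: { path: path, query: { term: { "#{path}.#{field}" => value } } } }
end

# Builds a lambda that ORs the same field across several nested objects.
def multi_nested_term(paths, field)
  lambda do |value|
    {
      bool: {
        # One should clause per nested object; any single match is enough.
        should: paths.map { |path| nested_term_query(path, field, value) },
        minimum_should_match: 1
      }
    }
  end
end

mime_type = multi_nested_term(%w[image paste], 'mime_type')
query = mime_type.call('text/plain')
```

Each should clause is its own nested query, because a single nested query can only look inside one path at a time.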
So, this worked pretty well in my testing, and when I deployed it to production, everything seemed to be working and I didn't receive any complaints.
Recently, however, someone reported an issue that I had completely failed to consider...
Sorting
So, you can sort on different fields (eg: image width). Some of the sort fields were moved to nested fields. The sorting code does not take into account the nested fields or field_transforms in any way whatsoever. This went unnoticed until somebody tried to sort by width and got back a mysterious error. Let's fix it.
The sorting code is really simple right now. When you add something like ?sf=width&sd=desc to the HTTP query string, the code in the backend just generates something that looks a little like this:
    {
      "sort": [
        { "width": "desc" }
      ]
    }
Instead, what we really want to be generating, if the sort field is a nested one, might look a little like this:
    {
      "sort": [
        {
          "image.width": {
            "order": "desc",
            "nested": {
              "path": "image"
            }
          }
        }
      ]
    }
I added another method to the PostIndex, called nested_sorts, which returns something similar to field_transforms except for sorts. It's a little bit ugly, and I'm not going to share it here for that reason, but it does essentially the same thing; it returns a hash of field names to lambda functions, and you can call each of the lambda functions with the sorting direction to get back a query like the one shown above.
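The real method stays unshared, but the general shape can be sketched under some assumptions. Everything below is hypothetical apart from the nested_sorts name itself: the NESTED_SORT_FIELDS table, the sort_clause helper, and the choice of width and height as the nested sort fields are invented for illustration.

```ruby
# Hypothetical table of sortable fields and the nested path they live under.
NESTED_SORT_FIELDS = { 'width' => 'image', 'height' => 'image' }.freeze

# Maps each nested sort field to a lambda that takes the sort direction.
def nested_sorts
  NESTED_SORT_FIELDS.to_h do |field, path|
    [field, ->(direction) { { "#{path}.#{field}" => { order: direction, nested: { path: path } } } }]
  end
end

# Uses the nested sort when one exists, otherwise falls back to the flat form.
def sort_clause(field, direction)
  builder = nested_sorts[field]
  builder ? builder.call(direction) : { field => direction }
end

# ?sf=width&sd=desc would end up as this sort body:
sort_body = { sort: [sort_clause('width', 'desc')] }
```

Root-level fields keep producing the old flat sort, so only the fields that actually moved get the nested treatment.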
Wrapping up
Welp, that's about it! Search hopefully works on Twibooru again by the time you're reading this, including sorting by nested fields. Thank you for reading my random blabbering, and I hope you at least mildly enjoyed it.
Edited to add: Somebody asked me why I use nested and not just sub-objects. The answer is that searching breaks in subtle ways if you have an array of sub-objects, and future plans involve that, so I would much prefer to get it out of the way now rather than change everything later.
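The subtle breakage comes from how plain sub-objects are indexed: an array of objects is flattened into one list of values per field, so the link between values from the same object is lost. The sketch below only simulates that behaviour in plain Ruby (it does not talk to ElasticSearch), and the two-image document is a made-up example.

```ruby
# Mimics how a plain object field indexes an array of sub-objects: each
# sub-field becomes one flat list, losing which object a value came from.
def flatten_objects(prefix, objects)
  objects.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |object, flat|
    object.each { |key, value| flat["#{prefix}.#{key}"] << value }
  end
end

# True when every wanted field/value pair appears somewhere in the flat document.
def flat_match?(flat, wanted)
  wanted.all? { |key, value| flat.fetch(key, []).include?(value) }
end

# True only when a single object satisfies every wanted pair, which is
# the guarantee a nested mapping gives you.
def nested_match?(objects, wanted)
  objects.any? { |object| wanted.all? { |key, value| object[key] == value } }
end

# Hypothetical post with two images: one 1024x100 and one 10x768.
images = [{ 'width' => 1024, 'height' => 100 }, { 'width' => 10, 'height' => 768 }]
flat = flatten_objects('image', images)

puts flat_match?(flat, 'image.width' => 1024, 'image.height' => 768) # true
puts nested_match?(images, 'width' => 1024, 'height' => 768)         # false
```

With sub-objects, a search for a 1024x768 image would match this post even though neither image has that size; with nested, it would not.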