{
    "version" : "https://jsonfeed.org/version/1",
    "content" : "news",
    "type" : "single",
    "title" : "A Picture Is Worth a Thousand Tokens: Part II |Digital.gov",
    "description": "A Picture Is Worth a Thousand Tokens: Part II",
    "home_page_url" : "/preview/gsa/digitalgov.gov/cm-topics-button-component/","feed_url" : "/preview/gsa/digitalgov.gov/cm-topics-button-component/2014/11/04/a-picture-is-worth-a-thousand-tokens-part-ii/index.json","item" : [
    {"title" :"A Picture Is Worth a Thousand Tokens: Part II","summary" : "In the first part of A Picture Is Worth a Thousand Tokens, I explained why we built a social media-driven image search engine, and specifically how we used Elasticsearch to build its first iteration. In this week’s post, I’ll take a deep dive into how we worked to improve relevancy, recall, and the searcher’s experience","date" : "2014-11-04T10:00:48-04:00","date_modified" : "2024-04-02T09:45:13-04:00","authors" : {"loren-siebert" : "Loren Siebert"},"topics" : {
        
            "content-strategy" : "Content Strategy",
            "open-government" : "Open Government",
            "social-media" : "Social Media"
            },"branch" : "cm-topics-button-component",
      "filename" :"2014-11-04-a-picture-is-worth-a-thousand-tokens-part-ii.md",
      
      "filepath" :"news/2014/11/2014-11-04-a-picture-is-worth-a-thousand-tokens-part-ii.md",
      "filepathURL" :"https://github.com/GSA/digitalgov.gov/blob/cm-topics-button-component/content/news/2014/11/2014-11-04-a-picture-is-worth-a-thousand-tokens-part-ii.md",
      "editpathURL" :"https://github.com/GSA/digitalgov.gov/edit/cm-topics-button-component/content/news/2014/11/2014-11-04-a-picture-is-worth-a-thousand-tokens-part-ii.md","slug" : "a-picture-is-worth-a-thousand-tokens-part-ii","url" : "/preview/gsa/digitalgov.gov/cm-topics-button-component/2014/11/04/a-picture-is-worth-a-thousand-tokens-part-ii/","content" :"\u003cp\u003eIn the first part of \u003ca href=\"/preview/gsa/digitalgov.gov/cm-topics-button-component/2014/10/28/a-picture-is-worth-a-thousand-tokens/\" title=\"A Picture Is Worth a Thousand Tokens\"\u003e\u003cem\u003eA Picture Is Worth a Thousand Tokens\u003c/em\u003e\u003c/a\u003e, I explained why we built a social media-driven image search engine, and specifically how we used Elasticsearch to build its first iteration. In this week’s post, I’ll take a deep dive into how we worked to improve relevancy, recall, and the searcher’s experience as a whole.\u003c/p\u003e\n\u003ch2 id=\"redefine-recency\"\u003eRedefine Recency\u003c/h2\u003e\n\u003cp\u003eTo solve the scoring problem on older photos for archival photostreams, we decided that after some amount of time, say six weeks, we no longer wanted to keep decaying the relevancy on photos. To put that into effect, we modified the functions in the function score like this:\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://gist.github.com/loren/df85de9536216ae32b19\"\u003e\u003cdiv class=\"image\"\u003e\n  \u003cimg\n    src=\"https://s3.amazonaws.com/digitalgov/_legacy-img/2014/10/600-x-186-tokens-Part-2-Redefine-Recency-code.jpg\"\n    alt=\"600-x-186-tokens-Part-2-Redefine-Recency-code\"/\u003e\u003c/div\u003e\n\n\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eNow we only apply the Gaussian decay for photos taken in the last six weeks or so. Anything older than that gets a constant decay or negative boost equal to what it would be if the photo were about six weeks old. So rather than having the decay factor continue on down to zero, we stop it at around 0.12. 
For all those Civil War photos in the Library of Congress’ photostream, the date ends up being factored out of the relevancy equation and they are judged solely on their similarity score and their popularity.\u003c/p\u003e\n\u003ch2 id=\"recognize-proximity\"\u003eRecognize Proximity\u003c/h2\u003e\n\u003cp\u003eTo rank “County event in Jefferson Memorial” higher than “Memorial event in Jefferson County” on a search for \u003cem\u003ejefferson memorial\u003c/em\u003e, the simplest way to handle it was to use a \u003ca href=\"http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-match-query.html#_phrase\"\u003ematch_phrase query\u003c/a\u003e to make the proximity of the terms a nice-to-have signal that could be factored into the overall score. The updated boolean clause matches on the phrase like this:\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://gist.github.com/loren/7741c52bd8e74d7ef626\"\u003e\u003cdiv class=\"image\"\u003e\n  \u003cimg\n    src=\"https://s3.amazonaws.com/digitalgov/_legacy-img/2014/10/600-x-186-tokens-Part-2-Recognize-Proximity-code.jpg\"\n    alt=\"600-x-186-tokens-Part-2-Recognize-Proximity-code\"/\u003e\u003c/div\u003e\n\n\u003c/a\u003e\u003c/p\u003e\n\u003ch2 id=\"account-for-misspellings\"\u003eAccount for Misspellings\u003c/h2\u003e\n\u003cp\u003eWe already knew from prior projects that we’d get a lot of misspelled search terms, but we put off implementing spelling suggestions and overrides until we’d rolled out our minimum viable product in our first iteration.\u003c/p\u003e\n\u003cp\u003eMisspelled search terms can be handled in different ways depending on your corpus and your tolerance for false positives. 
This shows one way of thinking about it:\u003c/p\u003e\n\u003cp\u003eA visitor searches for \u003cem\u003ejeferson memorial\u003c/em\u003e (sic).\u003c/p\u003e\n\u003cp\u003ePerform search with misspelled term.\u003c/p\u003e\n\u003cp\u003eAre there any results at all for the misspelled \u003cem\u003ejeferson memorial\u003c/em\u003e?\u003c/p\u003e\n\u003cblockquote\u003e\n\u003cp\u003eShow them.\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cblockquote\u003e\n\u003cp\u003eCan we suggest a similar query that yields \u003cstrong\u003emore\u003c/strong\u003e results from our indexes (such as \u003cem\u003ejefferson memorial\u003c/em\u003e)?\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cblockquote\u003e\n\u003cp\u003eSurface suggestion above results: “Did you mean \u003cem\u003ejefferson memorial\u003c/em\u003e?”\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003eCan we find a similar query that would yield \u003cstrong\u003eany\u003c/strong\u003e results?\u003c/p\u003e\n\u003cblockquote\u003e\n\u003cp\u003ePerform search with that new overridden corrected term.\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cblockquote\u003e\n\u003cp\u003eSurface override above results: “We’re showing results for \u003cem\u003ejefferson memorial\u003c/em\u003e.”\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003eThe problem with suggesting a “better” search term than what the visitor typed is that it’s easy to get false positives that vary from hilarious to embarrassing:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eYou searched on \u003cem\u003epresident obama\u003c/em\u003e. Did you mean \u003cem\u003eobama precedent\u003c/em\u003e?\u003c/li\u003e\n\u003cli\u003eYou searched on \u003cem\u003ecorrespondents dinner\u003c/em\u003e. Did you mean \u003cem\u003ecorrespondence dinner\u003c/em\u003e?\u003c/li\u003e\n\u003cli\u003eYou searched on \u003cem\u003ecivil rights\u003c/em\u003e. 
Did you mean \u003cem\u003ecivil right\u003c/em\u003e?\u003c/li\u003e\n\u003cli\u003eYou searched on \u003cem\u003ebetter america\u003c/em\u003e. Did you mean \u003cem\u003ebitter america\u003c/em\u003e?\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eOK, that last one didn’t really happen, but it could have, so we put that particular problem on the back shelf and instead focused on handling cases where the visitor’s search as typed didn’t return any results from our indexes but a slight variation on the query did. To do this, we introduced a new field to the indexes called “bigram” based on a \u003ca href=\"http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html#analysis-shingle-tokenfilter\"\u003eshingle token filter\u003c/a\u003e we called “bigram_filter.”\u003c/p\u003e\n\u003cp\u003eThe Elasticsearch settings got modified like this:\u003c/p\u003e\n\u003cpre\u003e{\n  \"filter\": {\n    \"bigram_filter\": {\n      \"type\": \"shingle\"\n    },\n    ….\n  }\n}\u003c/pre\u003e\n\u003cp\u003eThe properties in the Flickr and Instagram index mappings got modified as well.\u003c/p\u003e\n\u003cp\u003eFlickr:\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://gist.github.com/loren/f08c3e2c97e7773e432e\"\u003e\u003cdiv class=\"image\"\u003e\n  \u003cimg\n    src=\"https://s3.amazonaws.com/digitalgov/_legacy-img/2014/10/600-x-186-tokens-Part-2-flickr-code.jpg\"\n    alt=\"600-x-186-tokens-Part-2-flickr-code\"/\u003e\u003c/div\u003e\n\n\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eInstagram:\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://gist.github.com/loren/89a80170b14714f074c2\"\u003e\u003cdiv class=\"image\"\u003e\n  \u003cimg\n    src=\"https://s3.amazonaws.com/digitalgov/_legacy-img/2014/10/600-x-186-tokens-Part-2-instagram-code.jpg\"\n    alt=\"600-x-186-tokens-Part-2-instagram-code\"/\u003e\u003c/div\u003e\n\n\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eThis populates the bigram field for each index with whatever 
natural language fields it might have. For Instagram, it’s just the caption field, but Flickr has title and description so these are essentially appended together as they are copied into the bigram field. In both cases, they are analyzed with the shingle filter which creates bigrams out of the text. The clause of the query that generates the suggestion looks like this:\u003c/p\u003e\n\u003cpre\u003e{\n  \"suggest\": {\n    \"text\": \"jeferson memorial\",\n    \"suggestion\": {\n      \"phrase\": {\n        \"analyzer\": \"bigram_analyzer\",\n        \"field\": \"bigram\",\n        \"size\": 1,\n        \"direct_generator\": [\n          {\n            \"field\": \"bigram\",\n            \"prefix_len\": 1\n          }\n        ],\n        \"highlight\": {\n          \"pre_tag\": \"\u003cstrong\u003e\",\n          \"post_tag\": \"\u003c/strong\u003e\"\n        }\n      }\n    }\n  }\n}\u003c/pre\u003e\n\u003cp\u003e\n  We only care about the top suggestion, and we\u0026#8217;re willing to take the small performance penalty of using just the first letter of the search term as the starting point for the suggestion rather than the default two-character prefix.\n\u003c/p\u003e\n\u003cp\u003e\n  Here\u0026#8217;s an example of how bigrams really help generate relevant multi-word suggestions.\n\u003c/p\u003e\n\u003cp\u003e\n  An \u003ca href=\"http://search.usa.gov/search/images?affiliate=usagov\u0026query=correspondence\"\u003eimage search on USA.gov for \u003cem\u003ecorrespondence\u003c/em\u003e\u003c/a\u003e generates lots of results. 
Misspell it and \u003ca href=\"http://search.usa.gov/search/images?utf8=%E2%9C%93\u0026affiliate=usagov\u0026query=correspondense\"\u003esearch on \u003cem\u003ecorrespondense\u003c/em\u003e\u003c/a\u003e and it works as you might expect, showing results for \u003cem\u003ecorrespondence\u003c/em\u003e.\n\u003c/p\u003e\n\u003cp\u003e\n  But now when you \u003ca href=\"http://search.usa.gov/search/images?utf8=%E2%9C%93\u0026affiliate=usagov\u0026query=correspondense+dinner\"\u003esearch on \u003cem\u003ecorrespondense dinner\u003c/em\u003e\u003c/a\u003e, you get results for \u003cem\u003ecorrespondents dinner\u003c/em\u003e. It correctly recommends \u003cem\u003ecorrespondents dinner\u003c/em\u003e even though \u003cem\u003ecorrespondence\u003c/em\u003e has a higher term frequency than \u003cem\u003ecorrespondents\u003c/em\u003e does.\n\u003c/p\u003e\n\u003cp\u003e\n  Bigrams (word pairs) let us generate phrase suggestions rather than term suggestions by giving the suggester some collocation information. This increases the likelihood of a good suggestion for a multi-word search query when there are multiple possibilities for each individual word in the query.\n\u003c/p\u003e\n\u003ch2\u003e\n  Group Photos into Albums\n\u003c/h2\u003e\n\u003cp\u003e\n  Most of the near-duplicate photo problems came from Flickr profiles. Flickr has the notion of an album, so we thought we could take advantage of this and save ourselves a lot of work building a classifier. Even if retrieving a photo\u0026#8217;s albums (they can belong to many) from the Flickr API had been straightforward, it would still not have helped as some albums contain thousands of very different photos. 
Some of the Library of Congress albums on Flickr have over 10,000 photos, all with very different titles and descriptions.\n\u003c/p\u003e\n\u003cp\u003e\n  As we were already using Elasticsearch to do everything else, we wondered if it could also help us group photos into albums and then return just the most relevant photo from each album in the search results. The answer turned out to be \u0026#8220;yes\u0026#8221; on both fronts by using the \u003ca href=\"http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html\"\u003emore_like_this query\u003c/a\u003e as a starting point for classification and the \u003ca href=\"http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html#_options\"\u003etop_hits aggregation\u003c/a\u003e to pluck the best photos from each album.\n\u003c/p\u003e\n\u003cp\u003e\n  First we added an unanalyzed \u0026#8220;album\u0026#8221; field to the mappings on each index:\n\u003c/p\u003e\n\u003cpre\u003e{\n  \"album\": {\n    \"type\": \"string\",\n    \"index\": \"not_analyzed\"\n  }\n}\u003c/pre\u003e\n\u003cp\u003e\n  Then we established some criteria to describe when two photos should be considered part of the same album:\n\u003c/p\u003e\n\u003cul\u003e\n  \u003cli\u003e\n    Same index (Flickr/Instagram)\n  \u003c/li\u003e\n  \u003cli\u003e\n    Same profile/username\n  \u003c/li\u003e\n  \u003cli\u003e\n    Taken on the same day\n  \u003c/li\u003e\n  \u003cli\u003e\n    Very similar tags and natural language fields (i.e., title, description, and caption)\n  \u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\n  For a given Flickr photo with ID #12345, this query finds other Flickr photos from the same Flickr user profile \u0026#8220;flickr_user_1@n02\u0026#8221; also taken on April 23rd, 2012 that could potentially be grouped into the same album:\n\u003c/p\u003e\n\u003cp\u003e\n  \u003ca 
href=\"https://gist.github.com/loren/cbc7e95ed9d015e70e4a\"\u003e\u003cdiv class=\"image\"\u003e\n  \u003cimg\n    src=\"https://s3.amazonaws.com/digitalgov/_legacy-img/2014/10/600-x-186-tokens-Part-2-More-Like-This-code.jpg\"\n    alt=\"600-x-186-tokens-Part-2-More-Like-This-code\"/\u003e\u003c/div\u003e\n\n\u003c/a\u003e\n\u003c/p\u003e\n\u003cp\u003e\n  The filter part of this query is straightforward, as it\u0026#8217;s just enforcing two of the criteria we established for classifying photos. The more_like_this (MLT) part is actually broken down into multiple pieces, each with its own parameters, and wrapped up in a boolean clause. For all of the MLT queries, we set the minimum term frequency to 1 as a given term may only show up once in any particular field. The max_query_terms parameter is raised up really high to 500 terms, as sometimes a field can have that many terms in it and we want to take them all into account. From there, we just used some trial and error to see what percent_terms_to_match threshold to use for each field.\n\u003c/p\u003e\n\u003cp\u003e\n  The aggregation on the raw document scores came about after looking at the distribution of relevancy scores from the MLT query. Often, some group of, say, 100 photos would be pretty similar to a given photo, but the distribution of scores would be clumped around a few scores. Perhaps 60 photos would have an identical score of 4.5 and another 20 would have the same score of 4.4, and next group down would have a few clumped much lower at 0.6 and then the remainder would have different but all very low scores. The photos that ended up with the same scores to each other tended to have identical metadata. 
Usually the first two buckets from the aggregations would have very similar scores, so we assigned all of those photos to the same Elasticsearch album.\n\u003c/p\u003e\n\u003cp\u003e\n  Now that we had some notion of an album, we needed to pick the most relevant photo from each album and then sort all of those top picks by their relevancy scores to generate the actual search results. And don\u0026#8217;t forget, we could be searching across hundreds of thousands of albums spanning hundreds of Flickr and Instagram profiles, and we still need to take each photo\u0026#8217;s dynamic recency and popularity into account and then blend the results from both Flickr and Instagram indexes. And ideally, all this should happen within a few dozen milliseconds. It seems like an awfully tall order but the top_hits query made it pretty simple. The filtered query part of our request remained the same. We just added a nested aggregation to bucket by album and then pick the top hit from each album:\n\u003c/p\u003e\n\u003cpre\u003e{\n  \"aggs\": {\n    \"album_agg\": {\n      \"terms\": {\n        \"field\": \"album\",\n        \"order\": {\n          \"top_score\": \"desc\"\n        }\n      },\n      \"aggs\": {\n        \"top_image_hits\": {\n          \"top_hits\": {\n            \"size\": 1\n          }\n        },\n        \"top_score\": {\n          \"max\": {\n            \"script\": \"_doc.score\"\n          }\n        }\n      }\n    }\n  }\n}\n\u003c/pre\u003e\n\u003cp\u003e\n  We changed the type of query to the more \u003ca href=\"http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html#count\"\u003eefficient search_count\u003c/a\u003e, as we no longer needed \u0026#8220;hits\u0026#8221;. 
We are only looking at the aggregation buckets now.\n\u003c/p\u003e\n\u003cblockquote\u003e\n  \u003cp\u003e\n    GET http://localhost:9200/development-asis-flickr_photos,development-asis-instagram_photos/_search?search_type=count\u0026size=0\n  \u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003e\n  Like any fuzzy matching solution, this album classification strategy is practically guaranteed to both under-classify photos that should be in the same album as well as over-classify photos that should be kept separate. But we were pretty confident that the search experience had improved, and were impressed with how easy Elasticsearch made it to pull a solution together.\n\u003c/p\u003e\n\u003cp\u003e\n  One downside is that the aggregation query is more CPU and memory intensive than the more typical \u0026#8220;hits\u0026#8221; query we had before, but we still get results in well under 100ms and we haven\u0026#8217;t done anything to optimize it yet. The other problem we created with these aggregated results centered around pagination. If you request 10 results from the API, the 10 photos you get may each come from a different album, and each album may have thousands of photos. So the 10th photo might actually have been the 10,000th \u0026#8220;hit\u0026#8221;. 
And while it\u0026#8217;s easy for Elasticsearch to tell you how many total hits were found, currently there\u0026#8217;s no cheap way of knowing how many potential buckets you\u0026#8217;ll have in an aggregation unless you go and compute them all, and that can lead to both \u003ca href=\"http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/aggregations-and-analysis.html#_high_cardinality_memory_implications\"\u003ememory problems\u003c/a\u003e and \u003ca href=\"http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/pagination.html\"\u003ewasted CPU\u003c/a\u003e.\n\u003c/p\u003e\n\u003ch2\u003e\n  Managing Growth\n\u003c/h2\u003e\n\u003cp\u003e\n  Although Elasticsearch defaults to five \u003ca href=\"http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/glossary.html#glossary-shard\"\u003eshards\u003c/a\u003e per index, we put each image index in just one shard. As we are relying so heavily on relevance across potentially small populations of photos, we wanted the results to be as accurate as possible (see \u003ca href=\"http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/relevance-is-broken.html#relevance-is-broken\"\u003eElasticsearch’s Relevance Is Broken!\u003c/a\u003e).\n\u003c/p\u003e\n\u003cp\u003e\n  With just a million photos in our initial index this is not a problem, but a billion photos will require the sort of horizontal scaling that Elasticsearch is known for. Changing the number of shards will require a full reindex. We also update our synonyms from time to time, and that requires reindexing, too. To accommodate this without any downtime, we use index aliases. We spin up a new index in the background, populate it with \u003ca href=\"https://github.com/elasticsearch/stream2es\"\u003estream2es\u003c/a\u003e, and just adjust the alias on the running system in real-time. 
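\u003c/p\u003e\n\u003cp\u003e\n  The swap itself is a single atomic call to the _aliases API (the versioned index names here are illustrative):\n\u003c/p\u003e\n\u003cpre\u003ePOST /_aliases\n{\n  \"actions\": [\n    { \"remove\": { \"index\": \"flickr_photos_v1\", \"alias\": \"flickr_photos\" } },\n    { \"add\": { \"index\": \"flickr_photos_v2\", \"alias\": \"flickr_photos\" } }\n  ]\n}\u003c/pre\u003e\n\u003cp\u003e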
As the number of shards grows, we can experiment with routing the indexing and the queries to hit the same shards.\n\u003c/p\u003e\n\u003ch2\u003e\n  Why I Wrote This\n\u003c/h2\u003e\n\u003cp\u003e\n  Many Elasticsearch articles involve closed proprietary systems that cannot be fully shared with the rest of the world. With \u003ca href=\"https://github.com/GSA/asis\"\u003eASIS\u003c/a\u003e, we\u0026#8217;ve taken a different approach and published the entire codebase along with this explanation of how we went about building it and the decisions (good and bad) we made along the way. This stemmed from our commitment to transparency and \u003ca href=\"http://www.whitehouse.gov/open\"\u003eopen government\u003c/a\u003e, and we\u0026#8217;d also like others to be able to \u003ca href=\"https://github.com/GSA/asis/fork\"\u003efork the ASIS codebase\u003c/a\u003e and either help improve it or perhaps just use it to build their own image search engine.\n\u003c/p\u003e"}
  ]
}
