{
    "version" : "https://jsonfeed.org/version/1",
    "content" : "news",
    "type" : "single",
    "title" : "A Picture Is Worth a Thousand Tokens | Digital.gov",
    "description": "A Picture Is Worth a Thousand Tokens",
    "home_page_url" : "/preview/gsa/digitalgov.gov/bc-archive-content-3/",
    "feed_url" : "/preview/gsa/digitalgov.gov/bc-archive-content-3/2014/10/28/a-picture-is-worth-a-thousand-tokens/index.json",
    "item" : [
    {"title" :"A Picture Is Worth a Thousand Tokens","summary" : "Increasingly, we’ve noticed that our agency customers are publishing their highest quality images on social media and within database-driven multimedia galleries on their websites. These sources are curated, contain metadata, and have both thumbnails and full-size images. That’s a big improvement in quality over the images embedded within HTML pages on agencies’ websites. After some","date" : "2014-10-28T11:15:34-04:00","date_modified" : "2025-01-27T19:42:55-05:00","authors" : {"loren-siebert" : "Loren Siebert"},"topics" : {
        
            "application-programming-interface" : "Application programming interface",
            "multimedia" : "Multimedia",
            "search" : "Search",
            "social-media" : "Social media"
            },"branch" : "bc-archive-content-3",
      "filename" :"2014-10-28-a-picture-is-worth-a-thousand-tokens.md",
      
      "filepath" :"news/2014/10/2014-10-28-a-picture-is-worth-a-thousand-tokens.md",
      "filepathURL" :"https://github.com/GSA/digitalgov.gov/blob/bc-archive-content-3/content/news/2014/10/2014-10-28-a-picture-is-worth-a-thousand-tokens.md",
      "editpathURL" :"https://github.com/GSA/digitalgov.gov/edit/bc-archive-content-3/content/news/2014/10/2014-10-28-a-picture-is-worth-a-thousand-tokens.md","slug" : "a-picture-is-worth-a-thousand-tokens","url" : "/preview/gsa/digitalgov.gov/bc-archive-content-3/2014/10/28/a-picture-is-worth-a-thousand-tokens/","content" :"\u003cp\u003eIncreasingly, we’ve noticed that our agency customers are publishing their highest quality images on social media and within database-driven multimedia galleries on their websites. These sources are curated, contain metadata, and have both thumbnails and full-size images. That’s a big improvement in quality over the images embedded within HTML pages on agencies’ websites.\u003c/p\u003e\n\u003cp\u003eAfter some investigating, we decided we could leverage their Flickr and Instagram photos to build an image search engine that better met their needs. We gave it a plucky name and put it in production.\u003c/p\u003e\n\u003cp\u003eSee the sample results page below that shows image results displayed on \u003ca href=\"http://search.doi.gov/search/images?utf8=%E2%9C%93\u0026amp;affiliate=doi.gov\u0026amp;query=moon\"\u003eDOI.gov for a search on \u003cem\u003emoon\u003c/em\u003e\u003c/a\u003e.\u003c/p\u003e\n\u003cdiv class=\"image\"\u003e\n  \u003cimg\n    src=\"https://s3.amazonaws.com/digitalgov/_legacy-img/2014/10/600-x-658-DOIgov-search-moon.jpg\"\n    alt=\"DOI.gov DigitalGov Search on the word moon.\"/\u003e\u003c/div\u003e\n\n\n\u003cp\u003eWe also \u003ca href=\"https://github.com/GSA/oasis\"\u003eopen-sourced the entire codebase\u003c/a\u003e behind this project.\u003c/p\u003e\n\u003cp\u003eThis post is the first of \u003ca href=\"/preview/gsa/digitalgov.gov/bc-archive-content-3/2014/11/04/a-picture-is-worth-a-thousand-tokens-part-ii/\" title=\"A Picture Is Worth a Thousand Tokens: Part II\"\u003etwo in a series\u003c/a\u003e where I take a technical deep dive into the details of how the image search engine works, and specifically 
how we used Elasticsearch to build it.\u003c/p\u003e\n\u003ch2 id=\"our-goal\"\u003eOur Goal\u003c/h2\u003e\n\u003cp\u003eOur initial goal was to provide a search interface across photos from the \u003ca href=\"https://www.flickr.com/services/api/\"\u003eFlickr API\u003c/a\u003e and \u003ca href=\"http://instagram.com/developer/\"\u003eInstagram API\u003c/a\u003e that blended these photos together into a single set of results. We wanted a relevancy framework that took into account the photos’ popularity, recency, and of course text metadata like titles, descriptions, captions, and tags.\u003c/p\u003e\n\u003cp\u003eWe wanted this system to be decoupled from our main codebase so it could evolve independently, and accessed 100% via API so that any client could access it. And finally, we wanted to make the entire codebase open so that others could see what we are doing and even help make improvements.\u003c/p\u003e\n\u003ch2 id=\"technology-stack\"\u003eTechnology Stack\u003c/h2\u003e\n\u003cp\u003e\u003ca href=\"http://www.elasticsearch.org/\"\u003eElasticsearch\u003c/a\u003e is the foundation for both our \u003ca href=\"http://search.digitalgov.gov/developer/jobs.html\"\u003eJobs API\u003c/a\u003e and our entire in-house analytics system, so it was an easy choice to use as the information retrieval backbone of our image search engine. To manage requests and serve up a versioned API, we’re using a pared-down Ruby API framework called \u003ca href=\"http://intridea.github.io/grape/\"\u003eGrape\u003c/a\u003e. To parallelize fetching and indexing Flickr and Instagram photos, we’re using \u003ca href=\"http://sidekiq.org/\"\u003eSidekiq\u003c/a\u003e.\u003c/p\u003e\n\u003ch2 id=\"the-data\"\u003eThe Data\u003c/h2\u003e\n\u003cp\u003eFlickr and Instagram both publish a lot of metadata about each photo in their APIs. 
Some of it overlaps, some of it is particular to each platform, and some of it is not particularly useful so we ignore it.\u003c/p\u003e\n\u003cdiv class=\"image\"\u003e\n  \u003cimg\n    src=\"https://s3.amazonaws.com/digitalgov/_legacy-img/2014/10/482-x-176-Flickr-Instagram-metadata-table.jpg\"\n    alt=\"Flickr and Instagram both publish a lot of metadata about each photo in their APIs\"/\u003e\u003c/div\u003e\n\n\n\u003cp\u003eBoth platforms have the notion of when a photo was taken, and of course they have the image itself along with some thumbnail. A set of tags can potentially be assigned to each photo, too.\u003c/p\u003e\n\u003cp\u003eFlickr uses an owner to associate images with profiles, while Instagram uses a username. Additionally, a Flickr photo can belong to one or more Flickr group profiles, so we augment the API data with that group information. Together, these fields allow us to filter our results to just the profiles the agency wants to show to searchers on its website.\u003c/p\u003e\n\u003cp\u003eFlickr captures the number of views for each photo, while Instagram captures the comments and the “likes.” This information drives a simple popularity field we use to help with relevancy.\u003c/p\u003e\n\u003cp\u003eAnd finally, title, description, and caption are natural language full-text fields that are matched up against the searcher’s query.\u003c/p\u003e\n\u003ch2 id=\"first-iteration-developing-our-mvp\"\u003eFirst Iteration: Developing our MVP\u003c/h2\u003e\n\u003cp\u003eTo develop our \u003ca href=\"http://theleanstartup.com/principles\"\u003eminimum viable product\u003c/a\u003e (MVP), we created an index for Flickr photos and a separate index for Instagram photos. We could have created a single index called “photos” and separated Flickr and Instagram photos by types, but we kept them separate so they could have their own relevancy scores and they could be indexed and updated independently of each other. 
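\u003c/p\u003e\n\u003cp\u003eConcretely, a stripped-down mapping in this spirit (field names from the discussion here; the details are illustrative, not the production mapping) looks like this in the Elasticsearch 1.x syntax of the time:\u003c/p\u003e\n\u003cpre\u003e\u003ccode\u003ePUT /flickr_photos\n{\n  \"mappings\": {\n    \"flickr_photo\": {\n      \"properties\": {\n        \"title\":       { \"type\": \"string\", \"analyzer\": \"en_analyzer\" },\n        \"description\": { \"type\": \"string\", \"analyzer\": \"en_analyzer\" },\n        \"tags\":        { \"type\": \"string\", \"analyzer\": \"tag_analyzer\" },\n        \"owner\":       { \"type\": \"string\", \"index\": \"not_analyzed\" },\n        \"popularity\":  { \"type\": \"integer\" },\n        \"taken_at\":    { \"type\": \"date\" }\n      }\n    }\n  }\n}\u003c/code\u003e\u003c/pre\u003e\n\u003cp\u003e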
These are the initial mappings we used.\u003c/p\u003e\n\u003cp\u003eFlickr (click image to see full code block):\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://gist.github.com/loren/b04165195afa6895affb\"\u003e\u003cdiv class=\"image\"\u003e\n  \u003cimg\n    src=\"https://s3.amazonaws.com/digitalgov/_legacy-img/2014/10/600-x-186-tokens-initial-flickr-code.jpg\"\n    alt=\"600-x-186-tokens-initial-flickr-code\"/\u003e\u003c/div\u003e\n\n\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eInstagram (click image to see full code block):\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://gist.github.com/loren/57780909332d570a5922\"\u003e\u003cdiv class=\"image\"\u003e\n  \u003cimg\n    src=\"https://s3.amazonaws.com/digitalgov/_legacy-img/2014/10/600-x-186-tokens-initial-instagram-code.jpg\"\n    alt=\"600-x-186-tokens-initial-instagram-code\"/\u003e\u003c/div\u003e\n\n\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eWe used these settings across the indexes (click image to see full code block):\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://gist.github.com/loren/8410d6fc947ee6091eb1\"\u003e\u003cdiv class=\"image\"\u003e\n  \u003cimg\n    src=\"https://s3.amazonaws.com/digitalgov/_legacy-img/2014/10/600-x-186-tokens-initial-settings-across-indexes-code.jpg\"\n    alt=\"600-x-186-tokens-initial-settings-across-indexes-code\"/\u003e\u003c/div\u003e\n\n\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eLooking at the mappings and the settings, you can see that we had to make a lot of decisions upfront about how fields would be treated when we indexed the documents (photo metadata). In theory, Elasticsearch is schema-less and we could have just taken whatever fields we got from the Instagram and Flickr APIs and sent them over the fence to Elasticsearch as JSON documents to be dynamically mapped. 
We had learned a few lessons from prior Elasticsearch and \u003ca href=\"http://lucene.apache.org/solr/\"\u003eSolr\u003c/a\u003e projects, however, so we had ideas on how the analysis chain should behave for the various fields.\u003c/p\u003e\n\u003cp\u003eFor the full-text fields (title, description, caption), we use a \u003ca href=\"http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html\"\u003ecustom analyzer\u003c/a\u003e we call “en_analyzer”. This uses the custom “ignore_chars” character filter to get rid of some different types of apostrophes, and then it hands the tokens off to the chain of filters. ASCII folding lets \u003cem\u003eresume\u003c/em\u003e match \u003cem\u003eresumé\u003c/em\u003e. Lowercasing everything makes the search case insensitive. The stop filter yanks out words that contribute little to relevancy.\u003c/p\u003e\n\u003cp\u003eThe minimal English stemmer does a pretty good job of threading the needle between over-stemming and under-stemming. We experimented with several of the English stemmers using the \u003ca href=\"https://github.com/polyfractal/elasticsearch-inquisitor\"\u003eInquisitor plugin\u003c/a\u003e, and decided we could get the closest to our desired behavior by starting with the minimal English stemmer and using our own curated synonym list to fill in the gaps. When we had used the more aggressive \u003ca href=\"http://snowball.tartarus.org/\"\u003eSnowball stemmer\u003c/a\u003e in the past, we found ourselves constantly overriding it with an updated protected words list.\u003c/p\u003e\n\u003cp\u003eTo keep the JSON a little more manageable for this post, the settings above are only showing a subset of the synonyms and stopwords that we actually use. 
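\u003c/p\u003e\n\u003cp\u003eAs a sketch of the idea (the filter contents here are placeholders, not our curated lists), the analyzer chain described above can be defined like this:\u003c/p\u003e\n\u003cpre\u003e\u003ccode\u003e\"analysis\": {\n  \"char_filter\": {\n    \"ignore_chars\": { \"type\": \"mapping\", \"mappings\": [\"’=\u003e\", \"‘=\u003e\"] }\n  },\n  \"filter\": {\n    \"en_stem\":    { \"type\": \"stemmer\", \"name\": \"minimal_english\" },\n    \"en_synonym\": { \"type\": \"synonym\", \"synonyms\": [\"youth, teen\"] },\n    \"en_stop\":    { \"type\": \"stop\", \"stopwords\": [\"a\", \"an\", \"the\"] }\n  },\n  \"analyzer\": {\n    \"en_analyzer\": {\n      \"type\": \"custom\",\n      \"tokenizer\": \"standard\",\n      \"char_filter\": [\"ignore_chars\"],\n      \"filter\": [\"asciifolding\", \"lowercase\", \"en_stop\", \"en_synonym\", \"en_stem\"]\n    }\n  }\n}\u003c/code\u003e\u003c/pre\u003e\n\u003cp\u003e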
Have a look at the latest \u003ca href=\"https://github.com/GSA/asis\"\u003ecode\u003c/a\u003e to see what we are currently using.\u003c/p\u003e\n\u003cp\u003eFor the tags, we use a custom “tag_analyzer” to do the same lowercasing and ASCII folding as with the full-text fields, but we also strip out the whitespace. After looking through some sample Flickr and Instagram data, we noticed a lot of tags like \u003cem\u003ebarackobama\u003c/em\u003e or \u003cem\u003e4thofjuly\u003c/em\u003e and we wanted to match on queries like \u003cem\u003eBarack Obama\u003c/em\u003e and \u003cem\u003e4th of July\u003c/em\u003e.\u003c/p\u003e\n\u003cp\u003eTo represent popularity, we initially didn’t know how to compare Instagram comments, Instagram likes, and Flickr views, so we started with something simple knowing that we could tune it once we knew more about our relevancy model. For Flickr, we just set the popularity as the number of views. For Instagram, we used the sum of the comments and the likes. We’re considering weighting them all differently, as it takes more effort to write a comment than to “like” something, and simply viewing a photo takes the least effort of all.\u003c/p\u003e\n\u003cp\u003eWith all that in place, we looked up the Flickr profiles and Instagram usernames for a handful of \u003ca href=\"http://search.digitalgov.gov/customers.html\"\u003eour agency customers\u003c/a\u003e like the \u003ca href=\"http://www.doi.gov/index.cfm\"\u003eDepartment of the Interior\u003c/a\u003e, \u003ca href=\"http://www.army.mil/\"\u003eU.S. Army\u003c/a\u003e, and \u003ca href=\"http://www.usa.gov/\"\u003eUSA.gov\u003c/a\u003e and started fetching and indexing their photos. 
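\u003c/p\u003e\n\u003cp\u003eThe fan-out is plain Sidekiq; as a rough sketch (class and method names here are hypothetical, not the actual codebase):\u003c/p\u003e\n\u003cpre\u003e\u003ccode\u003e# One job per profile; Sidekiq runs these in parallel across its worker threads.\nclass FlickrPhotosImporter\n  include Sidekiq::Worker\n\n  def perform(owner)\n    photos = FlickrClient.photos_for(owner) # hypothetical Flickr API wrapper\n    FlickrPhoto.bulk_index(photos)          # hypothetical bulk index into Elasticsearch\n  end\nend\n\n# Enqueue one job per Flickr profile:\nowners.each { |owner| FlickrPhotosImporter.perform_async(owner) }\u003c/code\u003e\u003c/pre\u003e\n\u003cp\u003e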
This all happens pretty quickly with enough Sidekiq threads chugging away, even if you are just trying this out on your laptop.\u003c/p\u003e\n\u003ch2 id=\"initial-search-query\"\u003eInitial Search Query\u003c/h2\u003e\n\u003cp\u003eWe had a few heuristics in mind as to how we wanted relevancy and precision to work for this data:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eMore recent photos are more relevant than older photos\u003c/li\u003e\n\u003cli\u003ePopularity can be a proxy for relevancy, so rank popular photos higher\u003c/li\u003e\n\u003cli\u003eAll of the search terms have to be present in at least one of the full-text or tags fields\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe initial query looked like this (click image to see full code block):\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://gist.github.com/loren/3e81ce2637f9889109b5\"\u003e\u003cdiv class=\"image\"\u003e\n  \u003cimg\n    src=\"https://s3.amazonaws.com/digitalgov/_legacy-img/2014/10/600-x-186-tokens-initial-search-query-code.jpg\"\n    alt=\"600-x-186-tokens-initial-search-query-code\"/\u003e\u003c/div\u003e\n\n\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eAt a high level, this is a filtered query that uses a custom function score to impact the final score, and the query runs across both the Instagram and Flickr indexes. The filter part of the filtered query limits the search space to just the profiles we care about. The query part of the filtered query says that the search term should match at least the tags or one of the full-text fields. The function score takes the raw score from the filtered query and multiplies it by factors based on the popularity field and the taken_at field. Rather than use the raw popularity value to impact the score, we run it through the log2p() function. 
This takes the base-10 logarithm of the popularity so that a photo with a popularity of 1,000,000 is only boosted 2x more than a photo with a popularity of 1,000 instead of being boosted by 1,000X more. The log2p() function adds 2 to the raw popularity value before taking the logarithm, nicely accounting for the cases where a photo’s popularity is 0 or 1. To account for recency, we applied a Gaussian \u003ca href=\"http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#_decay_functions\"\u003edecay function\u003c/a\u003e to the taken_at date field, essentially making photos from one month ago half as relevant as photos posted today while not penalizing photos much from just a few days ago.\u003c/p\u003e\n\u003cp\u003eBy playing around with this query a little bit using \u003ca href=\"http://www.elasticsearch.org/guide/en/marvel/current/#_sense\"\u003eSense\u003c/a\u003e, we could gain some confidence that the results were reasonable and blended across both the Flickr and Instagram indexes. We made the search interface available via an HTTP API that returned JSON, and then hooked up our first customer: ourselves! We make API calls to \u003ca href=\"https://github.com/GSA/asis\"\u003eASIS\u003c/a\u003e and then transform the results into a nicely-tiled responsive search results page with image thumbnails. 
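\u003c/p\u003e\n\u003cp\u003eStripped to its skeleton (the profile values, scale, and decay here are illustrative, not the production settings), the query combines the pieces described above roughly like this:\u003c/p\u003e\n\u003cpre\u003e\u003ccode\u003e{\n  \"query\": {\n    \"function_score\": {\n      \"query\": {\n        \"filtered\": {\n          \"query\": {\n            \"multi_match\": {\n              \"query\": \"4th of july\",\n              \"operator\": \"and\",\n              \"fields\": [\"title\", \"description\", \"caption\", \"tags\"]\n            }\n          },\n          \"filter\": {\n            \"terms\": { \"owner\": [\"flickr_user_profile_1@n02\", \"flickr_user_profile_2@n03\"] }\n          }\n        }\n      },\n      \"functions\": [\n        { \"field_value_factor\": { \"field\": \"popularity\", \"modifier\": \"log2p\" } },\n        { \"gauss\": { \"taken_at\": { \"offset\": \"3d\", \"scale\": \"4w\", \"decay\": 0.5 } } }\n      ]\n    }\n  }\n}\u003c/code\u003e\u003c/pre\u003e\n\u003cp\u003e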
The API call that generated the query above would look something like this:\u003c/p\u003e\n\u003cblockquote\u003e\n\u003cp\u003ehttp://oasis_host/api/v1/image.json?flickr_groups=flickr_group_profile_1@n07\u0026amp;flickr_users=flickr_user_profile_1@n02,flickr_user_profile_2@n03\u0026amp;instagram_profiles=instagram_username_1,instagram_username_2,instagram_username_3\u0026amp;query=4th+of+july\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003ch2 id=\"first-impressions-of-our-mvp\"\u003eFirst Impressions of our MVP\u003c/h2\u003e\n\u003ch3 id=\"speed-improved\"\u003eSpeed Improved\u003c/h3\u003e\n\u003cp\u003eThe first thing we noticed when we put this into production was the improvement in performance. Our queries are now hitting our own Elasticsearch cluster and are running reliably in 20-30ms, while the former system that reached out to an external commercial index had taken 300-900ms. Speed is an important factor in searchers’ satisfaction with results, and the lower variance in response times has made our user experience more uniform and predictable.\u003c/p\u003e\n\u003ch3 id=\"garbage-in-garbage-out-gigo-exists\"\u003eGarbage In, Garbage Out (GIGO) Exists\u003c/h3\u003e\n\u003cp\u003eAs we worked around some hiccups with the Flickr and Instagram APIs, we started to see some issues with the agency-generated content that affected both recall and relevancy. 
Some profile owners attach dozens of very broad tags like \u003cem\u003egovernment\u003c/em\u003e, \u003cem\u003eusa\u003c/em\u003e, and \u003cem\u003epresident\u003c/em\u003e to hundreds or thousands of photos.\u003c/p\u003e\n\u003cp\u003eRecent photos that happened to be popular on Flickr or Instagram would get boosted to the top of the list despite having a relatively low similarity (based on Lucene’s \u003ca href=\"http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/practical-scoring-function.html\"\u003ePractical Scoring Function\u003c/a\u003e) score from matching on a very common term in our corpus. A similar problem would occur when profile owners would append the same hundred words of boilerplate text (e.g., source attribution, copyright) to their photo descriptions.\u003c/p\u003e\n\u003cp\u003eA much bigger problem surfaced around photo albums. Sometimes photographers would take dozens of photographs of the same event, assign very similar metadata to all of them, and upload them to their social media profile. 
This screenshot of a search results page with 19 nearly identical pictures of Michelle Obama in a yellow sundress sums up the problem nicely:\u003c/p\u003e\n\u003cdiv class=\"image\"\u003e\n  \u003cimg\n    src=\"https://s3.amazonaws.com/digitalgov/_legacy-img/2014/10/600-x-614-USAgov-search-Barack-Obama.jpg\"\n    alt=\"600-x-614-USAgov-search-Barack-Obama\"/\u003e\u003c/div\u003e\n\n\n\u003cp\u003eAll of these similar photos are relevant, but we’d rather just show a few of them for an \u003ca href=\"http://search.usa.gov/search/images?affiliate=usagov\u0026amp;query=barack+obama\"\u003eimage search on \u003cem\u003ebarack obama\u003c/em\u003e on USA.gov\u003c/a\u003e and perhaps let the visitor click through to see the rest of them.\u003c/p\u003e\n\u003ch3 id=\"relevance-needs-to-be-tweaked\"\u003eRelevance Needs to Be Tweaked\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eProximity\u003c/strong\u003e: When visitors searched on multi-word terms like \u003cem\u003ejefferson memorial\u003c/em\u003e, we weren’t treating “Memorial event in Jefferson County” any differently than “County event in Jefferson Memorial.” We needed to take into account where the tokens appeared in the document in relation to each other.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDate\u003c/strong\u003e: We initially focused on surfacing the most relevant pictures on the first page of the search results, but as we dug into page two and beyond, we saw some profiles had photos that were all scored 0.0. The culprit was the Gaussian decay function we were applying to decrease relevancy on older photos. The first batch of agency photos all happened to cover current affairs, like White House events and State Department conferences. But some of our other agencies use social media mainly for archival photos. 
The \u003ca href=\"https://www.flickr.com/photos/library_of_congress/\"\u003eLibrary of Congress Flickr photostream\u003c/a\u003e contains some photos that were taken 150 \u003cem\u003eyears\u003c/em\u003e ago, and the Gaussian decay function decayed their relevancy right down to zero.\u003c/p\u003e\n\u003cp\u003eThis clearly isn’t what we wanted so we focused on improving our relevance algorithm in our second iteration, which I’ll tell you more about in \u003ca href=\"/preview/gsa/digitalgov.gov/bc-archive-content-3/2014/11/04/a-picture-is-worth-a-thousand-tokens-part-ii/\" title=\"A Picture Is Worth a Thousand Tokens: Part II\"\u003enext week’s blog post\u003c/a\u003e.\u003c/p\u003e\n\u003ch2 id=\"about-us\"\u003eAbout Us\u003c/h2\u003e\n\u003cp\u003e\u003ca href=\"http://search.digitalgov.gov/\"\u003eDigitalGov Search\u003c/a\u003e provides fast, relevant search results to 1,500 government websites. We use a combination of commercial and our own indexes built on top of open government data to give millions of visitors a good search experience each day.\u003c/p\u003e\n"}
  ]
}
