{
    "version" : "https://jsonfeed.org/version/1",
    "content" : "news",
    "type" : "single",
    "title" : "Quality, Speed, and Lower Costs: Yes, You Can Have It All |Digital.gov",
    "description": "Quality, Speed, and Lower Costs: Yes, You Can Have It All",
    "home_page_url" : "/preview/gsa/digitalgov.gov/bc-archive-content-3/","feed_url" : "/preview/gsa/digitalgov.gov/bc-archive-content-3/2016/09/02/quality-speed-and-lower-costs-yes-you-can-have-it-all/index.json","item" : [
    {"title" :"Quality, Speed, and Lower Costs: Yes, You Can Have It All","summary" : "This is post 2 in the 5-part series The Right Tools for the Job: Re-Hosting DigitalGov Search to a Dynamic Infrastructure Environment. The last major infrastructure upgrade that DigitalGov Search had was in 2010. Not only has technology evolved significantly since then, but so have business models for right-sizing costs. Moving to Amazon Web Services (AWS)","date" : "2016-09-02T10:00:42-04:00","date_modified" : "2025-01-27T19:42:55-05:00","authors" : {"nick-marden" : "Nick Marden","dmccleskey" : "Dawn Pointer McCleskey"},"topics" : {
        
            "cloud-and-infrastructure" : "Cloud and infrastructure",
            "content-strategy" : "Content strategy",
            "product-and-project-management" : "Product and project management",
            "search" : "Search",
            "software-engineering" : "Software engineering"
            },"branch" : "bc-archive-content-3",
      "filename" :"2016-09-02-quality-speed-and-lower-costs-yes-you-can-have-it-all.md",
      
      "filepath" :"news/2016/09/2016-09-02-quality-speed-and-lower-costs-yes-you-can-have-it-all.md",
      "filepathURL" :"https://github.com/GSA/digitalgov.gov/blob/bc-archive-content-3/content/news/2016/09/2016-09-02-quality-speed-and-lower-costs-yes-you-can-have-it-all.md",
      "editpathURL" :"https://github.com/GSA/digitalgov.gov/edit/bc-archive-content-3/content/news/2016/09/2016-09-02-quality-speed-and-lower-costs-yes-you-can-have-it-all.md","slug" : "quality-speed-and-lower-costs-yes-you-can-have-it-all","url" : "/preview/gsa/digitalgov.gov/bc-archive-content-3/2016/09/02/quality-speed-and-lower-costs-yes-you-can-have-it-all/","content" :"\u003cp\u003e\u003cem\u003eThis is post 2 in the 5-part series \u003ca href=\"/preview/gsa/digitalgov.gov/bc-archive-content-3/2016/08/18/the-right-tools-for-the-job-re-hosting-digitalgov-search-to-a-dynamic-infrastructure-environment/\"\u003eThe Right Tools for the Job: Re-Hosting DigitalGov Search to a Dynamic Infrastructure Environment\u003c/a\u003e.\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eThe last major infrastructure upgrade that \u003ca href=\"https://search.gov\"\u003eDigitalGov Search\u003c/a\u003e had was in 2010. Not only has technology evolved significantly since then, but so have business models for right-sizing costs. Moving to Amazon Web Services (AWS) infrastructure allowed us to improve reliability by creating self-healing servers and distributing the service across four physically isolated datacenters, and reduce datacenter costs by 40% per month — no longer do we have to pay for peak throughput capacity overnight, on weekends, or during other predictably low-traffic periods.\u003c/p\u003e\n\u003cp\u003eWe were also able to reduce our CDN costs to almost zero by insourcing the management of our content delivery network (CDN)/web application firewall (WAF). By itself this reduced our total costs by almost 50 percent, as our CDN/WAF service had cost almost the same amount as our hosting provider.\u003c/p\u003e\n\u003cp\u003eIn the prior DigitalGov Search datacenters — one in Chicago and one in Virginia — we had pools of high-powered, physical Dell “pizza box” servers running a variety of services in a composition that had been tuned to observed traffic patterns:\u003c/p\u003e\n\u003cdiv class=\"image\"\u003e\n  \u003cimg\n    src=\"https://s3.amazonaws.com/digitalgov/_legacy-img/2016/08/543-x-561-old%5c_datacenter%5c_network_diagram.jpg\"\n    alt=\"A diagram of the old data center network.\"/\u003e\u003c/div\u003e\n\n\n\u003cp\u003eServices had been distributed opportunistically across the servers over time. We made it a primary goal of our new architecture to separate each of our services by \u003cem\u003erole\u003c/em\u003e, and to build flexible pools for each role that could be scaled up or down as demand increased or decreased for each service. 
This sounds great on the drawing board, but building robust, role-specific deployment recipes for multiple applications and services would be time-intensive and expensive.\u003c/p\u003e\n\u003ch2 id=\"aws-opsworks-and-chef-to-the-rescue\"\u003eAWS OpsWorks (and Chef) to the Rescue\u003c/h2\u003e\n\u003cp\u003eThe DigitalGov Search infrastructure is fortunate because it is composed of applications with well-understood deployment practices:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eusasearch: The core Rails 3.x application that serves search engine results pages (SERPs) and provides customer administration tools\u003c/li\u003e\n\u003cli\u003esearch_consumer: A NodeJS application that uses the usasearch API endpoints to render a new generation of DigitalGov Search SERPs\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/gsa/i14y\"\u003ei14y\u003c/a\u003e: A Rails 4.x application that allows government agencies to index their own documents via API for use in their DigitalGov Search results\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/gsa/asis\"\u003easis\u003c/a\u003e: The Advanced Social Image Search Rails 4.x application that indexes social images from Flickr, Instagram, and RSS feeds for inclusion in \u003ctt\u003esearch.usa.gov\u003c/tt\u003e SERP results\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/gsa/jobs_api\"\u003ejobs_api\u003c/a\u003e: The DigitalGov Search Jobs API Rails 3.x application that allows users to search government job listings\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"http://govt-urls.usa.gov/tematres\"\u003egovt-urls.usa.gov\u003c/a\u003e: A \u003ca href=\"https://sourceforge.net/projects/tematres/\"\u003eTematres\u003c/a\u003e PHP application for managing a dataset of non-.gov and non-.mil URLs belonging to federal, state, and local agencies\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"http://elastic.co\"\u003eElasticsearch\u003c/a\u003e: A Java-based search engine that supports clustering and failover\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe first five applications could be deployed easily using \u003ca href=\"https://aws.amazon.com/opsworks/\"\u003eAWS OpsWorks\u003c/a\u003e’ well-known deployment recipes. We pointed OpsWorks to the GitHub repos for each app, and it took care of the rest with robust \u003ca href=\"http://capistranorb.com/\"\u003eCapistrano\u003c/a\u003e-style deployments of the Rails and NodeJS apps.\u003c/p\u003e\n\u003cp\u003eThat left just Tematres and Elasticsearch, so we reached into our bag of tricks and wrote \u003ca href=\"http://chef.io\"\u003eChef\u003c/a\u003e recipes that would fit into the OpsWorks deployment cycle for these two applications.\u003c/p\u003e\n\u003ch2 id=\"deployment-in-aws\"\u003eDeployment in AWS\u003c/h2\u003e\n\u003cp\u003eWe then enabled \u003ca href=\"http://docs.aws.amazon.com/opsworks/latest/userguide/workinginstances-autohealing.html\"\u003eAuto Healing\u003c/a\u003e on our application layers to ensure servers would be replaced automatically if they failed.
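\u003c/p\u003e\n\u003cp\u003eFor a flavor of the custom Chef recipes mentioned above, here is a minimal, hypothetical sketch of the Elasticsearch one (the cookbook and attribute names are illustrative, not our production code):\u003c/p\u003e\n\u003cpre\u003e\u003ccode\u003e# Sketch of a custom recipe for the Elasticsearch layer, run during\n# the OpsWorks setup lifecycle event.\ninclude_recipe 'java' # assumes a community Java cookbook is available\n\npackage 'elasticsearch'\n\ntemplate '/etc/elasticsearch/elasticsearch.yml' do\n  source 'elasticsearch.yml.erb'\n  variables(cluster_name: node['elasticsearch']['cluster_name'])\n  notifies :restart, 'service[elasticsearch]'\nend\n\nservice 'elasticsearch' do\n  action [:enable, :start]\nend\n\u003c/code\u003e\u003c/pre\u003e\n\u003cp\u003e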
With robust, well-tested recipes to build servers in place, we knew this replacement would be seamless if it occurred.\u003c/p\u003e\n\u003cp\u003eTo replace our CDN and web application firewall (WAF) provider, we implemented our own Apache proxy server layer using a modified version of the \u003ca href=\"https://www.owasp.org/index.php/Main_Page\"\u003eOWASP\u003c/a\u003e WAF rules for the \u003ca href=\"https://www.modsecurity.org/\"\u003emodsecurity\u003c/a\u003e Apache module, echoing our previous provider’s approach. This took iterative tuning that we’ll discuss later.\u003c/p\u003e\n\u003cp\u003eWe also migrated our database services (MySQL and Redis) to the hosted AWS equivalents (RDS MySQL and ElastiCache Redis), in configurations designed to automatically withstand the loss of a datacenter, or AWS Availability Zone (AZ). This was an inexpensive way to take the hassle of database availability, backups, and upgrades out of our hands.\u003c/p\u003e\n\u003cp\u003eWith all of these pieces in place, we were able to build out the following architecture in AWS:\u003c/p\u003e\n\u003cdiv class=\"image\"\u003e\n  \u003cimg\n    src=\"https://s3.amazonaws.com/digitalgov/_legacy-img/2016/08/486-x-713-aws%5c_network%5c_diagram.jpg\"\n    alt=\"A diagram of the new AWS network.\"/\u003e\u003c/div\u003e\n\n\n\u003cp\u003eThe key thing to note about this architecture is that it has four new characteristics that our old environment lacked:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eServers will be replaced automatically if they fail\u003c/li\u003e\n\u003cli\u003eService pool capacity can be scaled independently, in response to short-term or long-term traffic patterns\u003c/li\u003e\n\u003cli\u003eThe size and cost of servers can be adjusted to match actual resource consumption, including OpsWorks \u003ca href=\"http://docs.aws.amazon.com/opsworks/latest/userguide/workinginstances-autoscaling-loadbased.html\"\u003eAutomatic Load-Based Scaling\u003c/a\u003e for peaks in traffic\u003c/li\u003e\n\u003cli\u003eBy design, every service is spread across multiple Availability Zones. This ensures that an \u003cem\u003eentire AWS datacenter outage\u003c/em\u003e will not bring down any service pool in particular, or the overall \u003ctt\u003esearch.usa.gov\u003c/tt\u003e service in general.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eOur cost reduction of 40% for hosting was achieved by focusing our expenses on the CPU capacity of the application server pool and the \u003ca href=\"http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-io-characteristics.html\"\u003eprovisioned IOPS\u003c/a\u003e needed for Elasticsearch. We also intend to look at annual pre-purchase options through \u003ca href=\"https://aws.amazon.com/ec2/purchasing-options/reserved-instances/\"\u003eReserved Instance pricing\u003c/a\u003e as our usage and billing patterns settle.\u003c/p\u003e\n\u003ch2 id=\"proxy-servers-web-application-firewalling-and-cdn-insourcing\"\u003eProxy Servers, Web Application Firewalling, and CDN Insourcing\u003c/h2\u003e\n\u003cp\u003eOne of the original drivers of this project was the cost of our CDN/WAF provider. As mentioned above, we created our own proxy servers that run a modified version of the OWASP WAF software.
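\u003c/p\u003e\n\u003cp\u003eStripped to its essentials, each proxy is an Apache reverse proxy with ModSecurity enabled; the configuration excerpt below is a hypothetical sketch (the backend hostname and rule ID are illustrative, not our production settings):\u003c/p\u003e\n\u003cpre\u003e\u003ccode\u003e# Load ModSecurity and reverse-proxy all traffic to the application pool\nLoadModule security2_module modules/mod_security2.so\n\nProxyPass        / http://app-pool.internal.example/\nProxyPassReverse / http://app-pool.internal.example/\n\n# Enable the (modified) OWASP core rule set\nSecRuleEngine On\nInclude modsecurity.d/owasp-crs/*.conf\n\n# Tuning: remove a rule that false-positived on legitimate search traffic\nSecRuleRemoveById 981173\n\u003c/code\u003e\u003c/pre\u003e\n\u003cp\u003e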
How we did this could merit its own blog post, but the basic recipe was this:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eBuild a proxy server that sits in front of our production application\u003c/li\u003e\n\u003cli\u003eAdd the default OWASP rules\u003c/li\u003e\n\u003cli\u003eUsing proxy configuration changes, route a small percentage of our live traffic to the proxy servers\u003c/li\u003e\n\u003cli\u003eNote the false positives and use them to tune the OWASP rule set to match our application traffic\u003c/li\u003e\n\u003cli\u003eLather, rinse, repeat until we were comfortable routing 100% of our traffic through the new proxy servers\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eRouting traffic through the new proxies was a simple, incremental shift for our customers using the \u003ctt\u003esearch.usa.gov\u003c/tt\u003e domain on their results pages. Customers using domain masks, however, account for almost 60% of our traffic. Routing their traffic to our new proxy servers required agency-by-agency outreach to request updates of their external DNS records. Once the last CNAME record was updated, we were handling all traffic in the new system.\u003c/p\u003e\n\u003cp\u003eThe CDN component was even more straightforward. With our proxy servers in place, we verified that we were setting correct \u003ca href=\"http://httpd.apache.org/docs/current/mod/mod_expires.html\"\u003eexpiration headers\u003c/a\u003e on our assets and then enabled \u003ca href=\"https://httpd.apache.org/docs/2.4/mod/mod_cache_disk.html\"\u003emod_disk_cache\u003c/a\u003e on our proxy servers. Once we verified that assets were being served from our proxy servers without calls to our origin servers, we enabled a \u003ca href=\"http://api.rubyonrails.org/classes/ActionView/Helpers/AssetUrlHelper.html\"\u003eRails asset host\u003c/a\u003e configuration on our production application to send all asset requests to a \u003ca href=\"http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/distribution-working-with.html\"\u003eCloudFront Distribution\u003c/a\u003e whose origin server was our proxy server pool. This took all asset traffic off our expensive CDN provider without directing it to our origin servers.\u003c/p\u003e\n\u003cp\u003eIn \u003ca href=\"#series\"\u003elater posts\u003c/a\u003e we’ll discuss in more detail the complexities of supporting SSL certificates for our government customers’ hostnames and how we managed to comply with the DNSSEC requirement for government agencies. Those posts go into the technical details and are worth reading if you’re interested in how we solved the security challenges of providing SaaS search for hundreds of government agencies.\u003c/p\u003e\n\u003ch2 id=\"aws-and-the-ato-process\"\u003eAWS and the ATO Process\u003c/h2\u003e\n\u003cp\u003eThe great thing about pools of servers built by hardened recipes is that individual server failures are a non-event: a new server is spun up automatically to replace one that failed.
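\u003c/p\u003e\n\u003cp\u003eBecause the pool changes over time, the authoritative host inventory is whatever the OpsWorks API reports at the moment you ask. A hypothetical sketch of pulling a point-in-time inventory with the AWS SDK for Ruby (the stack ID is a placeholder):\u003c/p\u003e\n\u003cpre\u003e\u003ccode\u003erequire 'aws-sdk'\n\n# Hypothetical sketch: enumerate the current host inventory for a stack.\nopsworks = Aws::OpsWorks::Client.new(region: 'us-east-1')\n\nopsworks.describe_instances(stack_id: 'STACK-ID').instances.each do |i|\n  puts format('%-20s %-12s %s', i.hostname, i.status, i.private_ip)\nend\n\u003c/code\u003e\u003c/pre\u003e\n\u003cp\u003e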
However, the Authority to Operate (ATO) process in particular, and the government security auditing process in general, become a bit tricky for an architecture like ours because they require:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eVulnerability testing of individual servers\u003c/li\u003e\n\u003cli\u003eCompliance testing of individual servers\u003c/li\u003e\n\u003cli\u003ePenetration testing of individual servers and exposed pool endpoints\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eSince individual servers are identified by IP address, these tests get more complicated when individual servers can be rebuilt without warning due to server failure, load spikes, or data center outages, because the new server will often come online with a different DHCP’d IP address than the one it replaced. Similarly, if a new server is spun up in response to increased load, it will have an IP address that is unknown to the security testing infrastructure and will trigger false alarms.\u003c/p\u003e\n\u003cp\u003eWe worked with GSA security personnel to modify the Assessment \u0026amp; Accreditation process in a few simple but significant ways:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eTesting and scanning: Compliance and vulnerability scanning were performed against a frozen point-in-time inventory of the system prior to launch. After our servers were demonstrated to have acceptable compliance and vulnerability protocols, we were allowed to ‘unfreeze’ our inventory to allow for dynamic server replacement and up/down-scaling. On an ongoing basis, we are responsible for notifying the security team when new servers are created, or existing servers are destroyed or re-IP’d.\u003c/li\u003e\n\u003cli\u003eHost inventory: In a dynamic infrastructure, an inventory maintained in a file could be outdated at any point after the file is saved. We provided our Information System Security Officer with read-only access to our AWS console, so that the current host inventory can be viewed at any time.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"conclusions\"\u003eConclusions\u003c/h2\u003e\n\u003cp\u003eBy applying some widely understood modern operations practices — role-based deployment, server redundancy, and pooling — to our application, we were able to achieve substantial cost savings while making the \u003ctt\u003esearch.usa.gov\u003c/tt\u003e service more resilient to failure.
While some government security practices are still evolving to incorporate dynamic server environments, the success of our migration bodes well for the future of cost-effective and reliable cloud computing in government applications.\u003c/p\u003e\n\u003ch3 id=\"series\"\u003e\n  \u003cem\u003eRead more of this 5-part series:\u003c/em\u003e\n\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ca href=\"/preview/gsa/digitalgov.gov/bc-archive-content-3/2016/08/18/the-right-tools-for-the-job-re-hosting-digitalgov-search-to-a-dynamic-infrastructure-environment/\"\u003eThe Right Tools for the Job: Re-Hosting DigitalGov Search to a Dynamic Infrastructure Environment\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"/preview/gsa/digitalgov.gov/bc-archive-content-3/2016/09/06/a-domain-by-any-other-name-cnames-wildcard-records-and-another-level-of-indirection/\"\u003eA Domain by Any Other Name: CNAMES, Wildcard Records and Another Level of Indirection\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"/preview/gsa/digitalgov.gov/bc-archive-content-3/2016/09/07/lets-encrypt-those-cnames-shall-we/\"\u003eLet’s Encrypt Those CNAMES, Shall We?\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"/preview/gsa/digitalgov.gov/bc-archive-content-3/2016/09/12/dnssec-vs-elastic-load-balancers-the-zone-apex-problem/\"\u003eDNSSEC vs. Elastic Load Balancers: the Zone Apex Problem\u003c/a\u003e\u003c/li\u003e\n\u003c/ul\u003e\n"}
  ]
}
