{
    "version" : "https://jsonfeed.org/version/1",
    "content" : "guides",
    "type" : "single",
    "title" : "Technical details |Digital.gov",
    "description": "Technical details",
    "home_page_url" : "/preview/gsa/digitalgov.gov/cm-topics-button-component/","feed_url" : "/preview/gsa/digitalgov.gov/cm-topics-button-component/guides/site-scanning/technical-details/index.json","item" : [
    {"title" :"Technical details","summary" : "Learn about the automated processes behind the site scanning program.","date" : "2023-07-19T09:00:00-05:00","date_modified" : "2024-04-02T09:45:13-04:00","primary_image" : { "uid" : "guide-site-scanning", "alt" :
  "A person works in front of a computer with many internet symbols on it", "width" :
  "1200", "height" :
  "630", "credit" :
  "agny_illustration/iStock via Getty Images", "caption" :
  "", "format" :
  "png" },"branch" : "cm-topics-button-component",
      "filename" :"technical-details.md",
      
      "filepath" :"guides/site-scanning/technical-details.md",
      "filepathURL" :"https://github.com/GSA/digitalgov.gov/blob/cm-topics-button-component/content/guides/site-scanning/technical-details.md",
      "editpathURL" :"https://github.com/GSA/digitalgov.gov/edit/cm-topics-button-component/content/guides/site-scanning/technical-details.md","url" : "/preview/gsa/digitalgov.gov/cm-topics-button-component/guides/site-scanning/technical-details/","aliases" : {"0" : "/guide/site-scanning/technical-details/"},"content" :"\u003cp\u003eThe Site Scanning program maintains a number of automated processes that, together, consitute the entire project and seek to deliver useful data. The basic flow of these events are as follows:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cp\u003eEach week, a comprehensive list of public federal .gov websites is assembled as the \u003cstrong\u003eFederal Website Index\u003c/strong\u003e.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ca href=\"https://raw.githubusercontent.com/GSA/federal-website-index/main/data/site-scanning-target-url-list.csv\"\u003eDirect download of the current Federal Website Index\u003c/a\u003e.\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/federal-website-index/blob/main/process/index-creation.md\"\u003eProcess description\u003c/a\u003e, including details about the sources used, how the list is combined, and which criteria are used to remove entries.\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/federal-website-index/tree/main/data/snapshots#readme\"\u003eSnapshots from each step in the assembly process\u003c/a\u003e, including which URLs are removed at each step and which remain.\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning-documentation/blob/main/data/Target_URL_List_Data_Dictionary.csv\"\u003eData dictionary\u003c/a\u003e for the Federal Website Index.\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/federal-website-index/blob/main/data/site-scanning-target-url-list-analysis.csv\"\u003eSummary report for the assembly process\u003c/a\u003e.\u003c/li\u003e\n\u003cli\u003e\u003ca 
href=\"https://github.com/GSA/site-scanning-analysis/blob/main/reports/target-url-list.csv\"\u003eSummary report for the completed Federal Website Index\u003c/a\u003e.\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/federal-website-index\"\u003eTask repository\u003c/a\u003e.\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eEvery day, the Federal Website Index is then scanned. This is done by loading each Target URL in a virtual browser and noting the results. This information is the \u003cstrong\u003eSite Scanning data\u003c/strong\u003e.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning-documentation/blob/main/pages/scan_steps.md\"\u003eScanning process description\u003c/a\u003e, including what criteria are used to create each field of data.\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning-documentation/blob/main/data/Site_Scanning_Data_Dictionary.csv\"\u003eData dictionary\u003c/a\u003e for the Site Scanning data.\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eThe resulting information is stored in a database that is queryable via API, but each week, a series of static snapshot of the data are generated and made available for download.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ca href=\"https://open.gsa.gov/api/site-scanning-api/\"\u003eAPI Documentation\u003c/a\u003e.\u003c/li\u003e\n\u003cli\u003eThe \u003ca href=\"https://api.gsa.gov/technology/site-scanning/data/weekly-snapshot-all.csv\"\u003e\u0026lsquo;All\u0026rsquo; snapshot\u003c/a\u003e (CSV) includes every URL in the Federal Website Index.\u003c/li\u003e\n\u003cli\u003eThe \u003ca href=\"https://api.gsa.gov/technology/site-scanning/data/weekly-snapshot.csv\"\u003e\u0026lsquo;Primary\u0026rsquo; snapshot\u003c/a\u003e (CSV) is a subset of the initial snapshot and includes only live, human-readable URLs. 
This is likely the best starting point for most users.\u003c/li\u003e\n\u003cli\u003eThe \u003ca href=\"https://raw.githubusercontent.com/GSA/site-scanning-analysis/main/unique_website_list/results/weekly-snapshot-unique-final-urls.csv\"\u003e\u0026lsquo;Unique Final URL\u0026rsquo; snapshot\u003c/a\u003e (CSV) then further trims the Primary snapshot by removing duplicative Final URLs (\u003ca href=\"https://github.com/GSA/site-scanning-analysis/tree/main/unique_website_list/results#readme\"\u003edetails\u003c/a\u003e).\u003c/li\u003e\n\u003cli\u003eThe \u003ca href=\"https://raw.githubusercontent.com/GSA/site-scanning-analysis/main/unique_website_list/results/weekly-snapshot-unique-final-websites.csv\"\u003e\u0026lsquo;Unique Final Website\u0026rsquo; snapshot\u003c/a\u003e (CSV) then finally trims the Unique Final URL snapshot by removing duplicative Final URL - Base Websites (\u003ca href=\"https://github.com/GSA/site-scanning-analysis/tree/main/unique_website_list/results#readme\"\u003edetails\u003c/a\u003e). 
This is arguably the best count of federal public .gov websites.\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eAfter these snapshots are generated, a series of reports is run to analyze them or pull out specific information.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning-analysis/blob/main/reports/snapshot-all.csv\"\u003eSummary report of the \u0026lsquo;All\u0026rsquo; snapshot\u003c/a\u003e.\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning-analysis/blob/main/reports/snapshot-primary.csv\"\u003eSummary report of the \u0026lsquo;Primary\u0026rsquo; snapshot\u003c/a\u003e.\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning-analysis/blob/main/reports/unique-url.csv\"\u003eSummary report for the \u0026lsquo;Unique Final URL\u0026rsquo; snapshot\u003c/a\u003e.\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning-analysis/blob/main/reports/unique-website.csv\"\u003eSummary report for the \u0026lsquo;Unique Final Website\u0026rsquo; snapshot\u003c/a\u003e.\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eOther useful information\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning-documentation/blob/main/pages/schedule.md\"\u003eSchedule\u003c/a\u003e for the above automated processes.\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning-documentation/blob/main/pages/index_narrowing_steps.md\"\u003eDescription of how the list of websites is filtered down at each step\u003c/a\u003e.\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning-documentation/blob/main/data/Representative_Sample_Dataset.csv\"\u003eSample dataset that represents different edge cases\u003c/a\u003e.\u003c/li\u003e\n\u003cli\u003e\u003ca 
href=\"https://github.com/GSA/site-scanning-documentation/blob/main/pages/candidate-scans.md\"\u003eList of proposed but not yet built scans\u003c/a\u003e.\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning-snapshots/tree/main/snapshots\"\u003eArchive of historical snapshots\u003c/a\u003e.\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/federal-website-index/blob/main/criteria/federal-web-presence.md\"\u003eDescription of the federal web presence\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning/issues\"\u003eProgram issue tracker\u003c/a\u003e.\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://digital.gov/site-scanning/\"\u003eProgram website\u003c/a\u003e.\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eProject Repositories\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning\"\u003ePrimary\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning-documentation\"\u003eDocumentation\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning-analysis\"\u003eAnalysis\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/federal-website-index\"\u003eFederal Website Index\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning-engine\"\u003eSite Scanning Engine\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/GSA/site-scanning-snapshots\"\u003eSnapshots\u003c/a\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003c/ul\u003e\n"}
  ]
}
