Understanding the Site Scanning program
Technical details
Learn about the automated processes behind the Site Scanning program.
The Site Scanning program maintains a number of automated processes that, together, constitute the entire project and seek to deliver useful data. The basic flow of these events is as follows:
- Each week, a comprehensive list of public federal .gov websites is assembled as the Federal Website Index. (A simplified sketch of this assembly step follows the list.)
  - Direct download of the current Federal Website Index.
  - Process description, including details about the sources used, how the list is combined, and which criteria are used to remove entries.
  - Snapshots from each step in the assembly process, including which URLs are removed at each step and which remain.
  - Data dictionary for the Federal Website Index.
  - Summary report for the assembly process.
  - Summary report for the completed Federal Website Index.
  - Task repository.
- Every day, the Federal Website Index is then scanned. This is done by loading each Target URL in a virtual browser and noting the results; this information is the Site Scanning data. (A browser-based sketch of this step also follows the list.)
  - Scanning process description, including what criteria are used to create each field of data.
  - Data dictionary for the Site Scanning data.
- The resulting information is stored in a database that is queryable via API, and each week a series of static snapshots of the data is generated and made available for download. (See the API and snapshot sketches after the list.)
  - API Documentation.
  - The ‘All’ snapshot (CSV) includes every URL in the Federal Website Index.
  - The ‘Primary’ snapshot (CSV) is a subset of the ‘All’ snapshot and includes only live, human-readable URLs. This is likely the best starting point for most users.
  - The ‘Unique Final URL’ snapshot (CSV) further trims the Primary snapshot by removing duplicative Final URLs (details).
  - The ‘Unique Final Website’ snapshot (CSV) finally trims the Unique Final URL snapshot by removing duplicative Final URL base websites (details). This is arguably the best count of public federal .gov websites.
- After these snapshots are generated, a series of reports is run to analyze them and pull information out of them.
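To make the weekly assembly step concrete, here is a minimal sketch of the shape of that process: union several source lists, then filter by removal criteria. The file names, column name, and filter rules below are hypothetical; the authoritative sources and criteria are in the process description linked above.

```python
import csv

# Hypothetical source files; the real index draws on several
# authoritative lists (see the process description linked above).
SOURCE_FILES = ["dotgov_registry.csv", "agency_submitted.csv"]


def load_urls(path: str) -> set[str]:
    """Read a hypothetical 'target_url' column from one CSV source list."""
    with open(path, newline="") as f:
        return {row["target_url"].strip().lower() for row in csv.DictReader(f)}


def assemble_index() -> list[str]:
    # Step 1: union all source lists into one candidate set.
    candidates: set[str] = set()
    for path in SOURCE_FILES:
        candidates |= load_urls(path)

    # Step 2: apply removal criteria. These two rules are invented
    # for illustration; the program's actual criteria differ.
    kept = [
        url for url in candidates
        if url.endswith(".gov")            # keep .gov hosts only
        and not url.startswith("test.")    # drop obvious test hosts
    ]
    return sorted(kept)


if __name__ == "__main__":
    for url in assemble_index():
        print(url)
```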
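The daily scan amounts to loading each Target URL in a headless browser and recording what comes back. The following sketch uses Playwright as a stand-in for whatever tooling the program actually uses, and the recorded fields are illustrative rather than the program's real schema (see the data dictionary linked above).

```python
from playwright.sync_api import sync_playwright


def scan(target_url: str) -> dict:
    """Load one Target URL in a headless browser and note basic results.

    The fields below are illustrative, not the program's actual schema.
    """
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        try:
            response = page.goto(target_url, timeout=30_000)
            result = {
                "target_url": target_url,
                "final_url": page.url,  # URL after any redirects
                "status_code": response.status if response else None,
                "page_title": page.title(),
            }
        except Exception as exc:
            # Unreachable sites are still noted, just with the error.
            result = {"target_url": target_url, "error": str(exc)}
        finally:
            browser.close()
    return result


if __name__ == "__main__":
    print(scan("https://www.gsa.gov"))
```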
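Querying the scan database over the API might look roughly like the following. The base URL, endpoint path, parameter name, and key handling here are assumptions for illustration; the API documentation linked above is authoritative.

```python
import requests

# Assumed base URL and endpoint; verify against the API documentation.
API_BASE = "https://api.gsa.gov/technology/site-scanning/v1"
API_KEY = "DEMO_KEY"  # placeholder; request a real key from api.data.gov

resp = requests.get(
    f"{API_BASE}/websites",
    params={"target_url": "gsa.gov"},  # assumed filter parameter
    headers={"x-api-key": API_KEY},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```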
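Finally, the snapshot hierarchy can be read as successive de-duplication passes over the same data. The sketch below trims a local copy of the Primary snapshot the way the ‘Unique Final URL’ and ‘Unique Final Website’ snapshots are described above; the file and column names are assumptions, so check the data dictionary for the real ones.

```python
import pandas as pd

# Assumed file and column names; consult the data dictionary above.
primary = pd.read_csv("primary_snapshot.csv")

# 'Unique Final URL': keep one row per final (post-redirect) URL.
unique_final_url = primary.drop_duplicates(subset="final_url")

# 'Unique Final Website': keep one row per final base website,
# ignoring paths, which yields the website-level count.
unique_final_website = unique_final_url.drop_duplicates(
    subset="final_url_base_website"
)

print(len(primary), len(unique_final_url), len(unique_final_website))
```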
Other useful information
- Schedule for the above automated processes.
- Description of how the list of websites is filtered down at each step.
- Sample dataset that represents different edge cases.
- List of proposed but not yet built scans.
- Archive of historical snapshots.
- Description of the federal web presence.
- Program issue tracker.
- Program website.
Project Repositories