Presidential Robots and 404s

The field of presidential candidates has started to heat up, and their websites are the first stop for a lot of prospective voters. For my purposes, though, I was less interested in their political platforms and more curious about the technology behind the websites. Others have already compared the SSL security of the candidates' sites, so I wanted to check out what sort of information the presidential hopefuls' robots.txt files and 404 responses return. To generate the 404 response I chose a random URL, /test (turns out I'm really bad at being random).
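
For anyone who wants to reproduce the checks, here is a minimal sketch of the two requests I made against each site. The requests library is simply my tool of choice, and the candidate URL shown is just one of the sites listed below.

import requests

def check_site(base_url):
    # Fetch robots.txt to see which paths the site asks crawlers to skip
    robots = requests.get(base_url + "/robots.txt", timeout=10)
    print("robots.txt (HTTP {}):".format(robots.status_code))
    print(robots.text if robots.ok else "No robots.txt file available.")

    # Request a path that (hopefully) doesn't exist to trigger the 404 page
    missing = requests.get(base_url + "/test", timeout=10)
    print("/test returned HTTP {}".format(missing.status_code))

check_site("https://www.hillaryclinton.com")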

Without further ado, let me show the results of the requests:

Democrats

Hillary Clinton

https://www.hillaryclinton.com

robots.txt
User-agent: *
Disallow: /api/

Looks like the website has an API that isn't publicly documented.

404

Bernie Sanders

https://berniesanders.com/

robots.txt
User-agent: *
Disallow: /wp-admin/

The website uses WordPress as its framework.

404

Martin O'Malley

robots.txt
User-agent: *
Disallow: /wp-admin/

The website uses WordPress as its framework.

404

Jim Webb

robots.txt
User-agent: *
Disallow: /wp-admin/
Sitemap: http://www.webb2016.com/sitemap.xml

404

Lincoln Chafee

robots.txt
User-agent: *
Disallow: /wp-admin/

404

Republicans

Jeb Bush

robots.txt

No robots.txt file available.

404

Rand Paul

robots.txt
User-agent: *
Disallow:

The empty Disallow directive means crawlers are allowed to index everything.

404

Ted Cruz

https://www.tedcruz.org

robots.txt
User-agent: *
Disallow: /wp-admin/

The website uses WordPress as its framework.

Rick Santorum

http://www.ricksantorum.com/

robots.txt
User-Agent: *
Disallow: /admin/
Disallow: /utils/
Disallow: /forms/
Disallow: /users/
Sitemap: http://www.ricksantorum.com/sitemap_index.xml

Based on these paths, the website is a hosted CMS at nationbuilder.com.

404

Ben Carson

https://www.bencarson.com/

robots.txt

No robots.txt file available.

404

Most of them didn't turn out to be very interesting to look at, with the exception of the final candidate I'd like to show:

Carly Fiorina

https://www.carlyfiorina.com

robots.txt

User-agent: *
Disallow: /standing-desks2
Disallow: /standing-desks2.html
Disallow: /privacy-policy.html
Disallow: /privacy-policy
Disallow: /terms-of-use.html
Disallow: /terms-of-use
Disallow: /adjustable-height-desk.html
Disallow: /adjustable-height-desk
Disallow: /blank
Disallow: /test

404

It turned out that my random URL of /test wasn't random enough, and I accidentally stumbled upon a location on Carly Fiorina's website that requires authentication.

I took away 4 lessons from this exercise:

  1. WordPress remains incredibly popular
  2. robots.txt can tell you where the administrative area is (see the sketch after this list)
  3. 404 pages get hit often enough that it's worth investing time in making them nicer
  4. I'm bad at generating random URLs
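
As a quick illustration of lesson 2, here is a small sketch that pulls the Disallow entries out of a robots.txt file and flags anything that looks like an administrative area. The keyword list is just my own guess at what to look for.

import requests

ADMIN_HINTS = ("admin", "login", "user")  # my own keywords, nothing official

def disallowed_paths(base_url):
    # Return the Disallow entries from a site's robots.txt, if it has one
    resp = requests.get(base_url + "/robots.txt", timeout=10)
    if not resp.ok:
        return []
    paths = []
    for line in resp.text.splitlines():
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                paths.append(path)
    return paths

for path in disallowed_paths("https://berniesanders.com"):
    hint = " <-- possible admin area" if any(h in path.lower() for h in ADMIN_HINTS) else ""
    print(path + hint)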

PS: Did you know that Shodan also grabs the robots.txt data for each IP? You can access all the information via the Shodan API.
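
Here's a hedged sketch of what that lookup might look like with the official shodan Python library. The api.host() call is part of the library; the exact place the crawled robots.txt data lives in the banner (I'm assuming an http -> robots field here) and the IP used are my assumptions, so adjust accordingly.

import shodan

api = shodan.Shodan("YOUR_API_KEY")   # placeholder API key
host = api.host("198.51.100.1")       # example IP, not a real campaign server

for banner in host.get("data", []):
    # Assumed field path: the HTTP banner's cached robots.txt contents
    robots = banner.get("http", {}).get("robots")
    if robots:
        print("Port {}:".format(banner.get("port")))
        print(robots)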