In this blog post I will go over a little research project I did on HTTP authentication credentials hiding in plain sight.
A few months ago, I was thinking about bug bounty programs and which security issues could have a high impact. Obviously, RCEs are bad, but how do you get an RCE if there isn't a direct flaw/exploit present? Well, maybe if you manage to obtain credentials that you could use to log in somewhere... Again: how are credentials exposed? Mostly through configuration files... or are there any other ways?
HTTP Authentication crossed my mind. Its Basic scheme is defined in RFC 7617, and most people should have heard about Basic Authentication (or the Authorization: Basic <base64(username:password)> header). The web server sends a WWW-Authenticate header to prompt the client for a username and password.
As this prompt can be annoying, most clients (read: browsers) usually allow passing the credentials within the URL itself using the following scheme: http://foo:firstname.lastname@example.org would try to authenticate me as foo, using the part between the colon and the @ as the password.
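To make the relation between the URL form and the header concrete, here is a minimal sketch using only Python's standard library. The function name basic_auth_header and the host example.test are made up for illustration; the credentials are the well-known example pair from RFC 7617. It assumes the client percent-decodes the userinfo before encoding it, which is what browsers generally do.

```python
import base64
from urllib.parse import urlsplit, unquote


def basic_auth_header(url):
    """Return the Authorization header value implied by a URL's userinfo, or None."""
    parts = urlsplit(url)
    if parts.username is None or parts.password is None:
        return None
    # Userinfo may be percent-encoded inside a URL; decode it before building the header.
    user = unquote(parts.username)
    password = unquote(parts.password)
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


# Hypothetical host; the username/password pair is the example from RFC 7617.
print(basic_auth_header("http://Aladdin:open%20sesame@example.test/"))
print(basic_auth_header("http://example.test/"))
```

Note that base64 is an encoding, not encryption: anyone who sees the header or the URL can recover the password.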
As we know, URLs can appear in many different places on a website. There are a handful of HTML tags that have href attributes. Thus, I wanted to see how many and what kind of credentials can be found on the internet, simply by inspecting the HTML sources of HTTP responses.
The methodology was straightforward:
- Obtain HTTP responses
- Extract URLs from all HTTP responses
- Check all URLs for credentials
Obtaining HTTP responses
I didn't have any datasets ready for use, so my options were either to scan the internet and save the HTTP responses myself, or to stand on the shoulders of giants and use datasets created by fellow researchers. Unfortunately, scans.io does not offer its HTTP response datasets anymore. My application to Censys.io as an academic researcher has gone unanswered for two weeks. Luckily, Rapid7 still offers its datasets for download at:
As I'm not registered, I couldn't use the most recent datasets from July. Instead, I used the freely available datasets from June. I downloaded the HTTP and HTTPS datasets for all ports. In the end, I had about 250 GB of compressed data to process!
These datasets are gzip-compressed files in which each line is a JSON object that contains the GET request and response body as a base64-encoded string, as well as other related information such as the headers, IP address, etc.:
However, the only thing I focused on were the responses' HTML source code.
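Reading such a file can be sketched as follows. This is an illustration rather than the exact code I ran: the function name iter_html_bodies is made up, and the JSON key holding the base64 string (data here) is an assumption that should be checked against the actual dataset.

```python
import base64
import gzip
import json


def iter_html_bodies(path, body_field="data"):
    """Yield the decoded response text for each JSON line in a gzip file.

    body_field is an assumed key name; adjust it to the dataset's schema.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
                raw = base64.b64decode(record[body_field])
            except (ValueError, KeyError, TypeError):
                continue  # skip malformed or incomplete records
            # Responses come in arbitrary encodings; replace undecodable bytes.
            yield raw.decode("utf-8", errors="replace")
```

Streaming line by line keeps memory usage flat, which matters when the compressed input alone is in the hundreds of gigabytes.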
My plan was to iterate over all a (anchor) elements and extract the value of the href attribute. Originally, I wanted to use BeautifulSoup for that, but it turns out that it does not support XPath queries. As I needed an XPath query for another thing, I decided to switch to the lxml library, whose HTML parser tolerates real-world markup and supports XPath.
XPath allows checking for substrings using the contains function. Following the URL pattern from above, a naive way to identify such URLs could be:
- It contains a : to split username/password
- It contains an @ to split the credentials from the host part
Unluckily, there are a few protocol handlers that would also match this pattern. One example is mailto:, which instructs the browser to open an email program, as in mailto:email@example.com. As email addresses contain an @ sign and mailto: contains a colon, this would cause a false positive. Thus, I had to exclude all URLs that contain mailto:.
Here's the pseudo code:
    for line in uncompress(dataset):
        html = get_html_from_json(line)
        xml = lxml.html.fromstring(html)
        a_tags = xml.xpath('//a[contains(@href, ":") and contains(@href, "@") and not(contains(@href, "mailto:"))]')
        for tag in a_tags:
            print(tag.get('href'))
In a second run, I extended the extraction to the following elements and attributes:
In a per-dataset parallelized setup this worked pretty well and produced a list of URLs within a few hours.
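For readers who want to try the idea without installing lxml, here is a rough standard-library sketch. It swaps lxml and XPath for html.parser, and instead of a fixed list of elements and attributes it simply checks every attribute of every tag, so it is broader than what I actually ran. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class CredentialUrlCollector(HTMLParser):
    """Collect attribute values that look like URLs with embedded credentials."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:  # attribute without a value, e.g. <input disabled>
                continue
            # Same naive filter as the XPath query: ":" and "@", but no mailto:.
            if ":" in value and "@" in value and "mailto:" not in value:
                self.candidates.append((tag, name, value))


def extract_candidates(html):
    """Return (tag, attribute, value) tuples for all candidate URLs in html."""
    collector = CredentialUrlCollector()
    collector.feed(html)
    collector.close()
    return collector.candidates
```

The output is only a list of candidates: the filter is deliberately loose, so a separate check for real credentials is still needed afterwards.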
The last step was to go over the list of extracted URLs and check if they contain any credentials. Again, the easiest approach was to use Python's urlparse, but I needed a regular expression for another application. Therefore, I decided to follow both approaches and compare their results.
Here's the pseudo code:
    for url in URLs:
        try:
            parsed = urlparse(url)
            if parsed.password and parsed.username:
                print(url)
        except ValueError:
            pass

    for url in URLs:
        if creds_re.match(url):
            print(url)
creds_re took quite a bit of experimenting to create: ^((http|ftp|rtsp|rtmp)s?:)?\/\/[^\/?]+:[^\/?]+@[^\.\/]*. Nevertheless, the work paid off, as it matches exactly the same URLs as the urlparse approach.
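The comparison of both approaches can be reproduced on a small scale with the sketch below. The sample URLs and the helper names are made up for illustration, and the period after the pattern is read as sentence punctuation rather than as part of the expression, which is an assumption.

```python
import re
from urllib.parse import urlparse

# Pattern from the post; the trailing period is assumed to be punctuation.
CREDS_RE = re.compile(r"^((http|ftp|rtsp|rtmp)s?:)?\/\/[^\/?]+:[^\/?]+@[^\.\/]*")


def has_creds_urlparse(url):
    """True if urlparse finds both a username and a password in the URL."""
    try:
        parsed = urlparse(url)
        return bool(parsed.username and parsed.password)
    except ValueError:
        return False


def has_creds_regex(url):
    """True if the URL matches the credential pattern."""
    return CREDS_RE.match(url) is not None


# Hypothetical sample URLs covering the common cases.
samples = [
    "http://user:pw@example.test/path",
    "//user:pw@example.test/",
    "https://example.test:8443/login",
    "ftp://example.test/file@2x:big.png",
    "mailto:me@example.test",
]
for url in samples:
    print(url, has_creds_urlparse(url), has_creds_regex(url))
```

Agreement on a handful of samples is no proof that the two checks are equivalent in general; it only shows how such a comparison can be set up.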
As the title promised, there are credentials hiding in plain sight waiting to be found. So let's have a look at the results:
The first run of the analysis focused on a tags and their hrefs only. The URL extraction step resulted in:
- 40982 URLs (24190 unique)
- 1166 credentials (636 unique)
Surprisingly, not only http URLs were found, but also rtmp. Here's the distribution of unique URLs:
Obviously, one could assume that extending the scope would increase the results. Indeed, that's the case here as well! Using all other HTML tags and their respective attributes, the results are:
- 188671 URLs (55457 unique)
- 1350 credentials (881 unique)
For obvious reasons, I won't post any publicly accessible credentials here. However, from the insights I got, I can tell that:
- The credentials ranged from "generic/default" credentials to highly complex ones
- The affected systems were publicly accessible as well as on private networks
While writing this blog post, I felt the need to create a tool, so that anyone could check their own website for leaked credentials.
    $> python3 httpcreds.py -u https://uploads.blogbasis.net/test/
    [*] Checking: https://uploads.blogbasis.net/test/
    [+] Found: http://test:firstname.lastname@example.org
You can download the tool from GitHub: https://github.com/gehaxelt/python-httpcreds
Detectify Hackerschool #10 Talk
On August 12th, I gave a lightning talk about this topic at Detectify's Hackerschool #10!