Credentials hiding in plain sight, or how I pwned your HTTP auth

In this blog post I will go over a little research project I did about HTTP authentication credentials hiding in plain sight.

Idea

A few months ago, I was thinking about bug bounty programs and which security issues could have a high impact. Obviously, RCEs are bad, but how do you get an RCE if there isn't a direct flaw/exploit present? Well, maybe if you manage to obtain credentials that you can use to log in somewhere... Again: how are credentials exposed? Mostly through configuration files... or are there other ways?

HTTP Authentication crossed my mind. The Basic scheme is defined in RFC 7617, and most people should have heard of Basic Authentication (or the Authorization: Basic <base64(username:password)> header). The web server sends a WWW-Authenticate header to prompt the client for a username and password.
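
Under the hood, the scheme is trivial; here's a minimal sketch of how such a header is built (the helper name is my own):

import base64

def basic_auth_header(username, password):
	# Per RFC 7617, the value is the base64 encoding of "username:password"
	token = base64.b64encode(f"{username}:{password}".encode()).decode()
	return f"Authorization: Basic {token}"

print(basic_auth_header("foo", "bar"))  # Authorization: Basic Zm9vOmJhcg==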

As this prompt can be annoying, most clients (read: browsers) usually allow passing the credentials within the URL itself, using the following scheme: [protocol]://[username]:[password]@[host]

So http://foo:bar@example.com would try to authenticate me at example.com as user foo with the password bar.
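
Python's urllib.parse already understands this scheme, which makes for a quick sanity check:

from urllib.parse import urlparse

parsed = urlparse("http://foo:bar@example.com")
print(parsed.username, parsed.password, parsed.hostname)  # foo bar example.com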

As we know, URLs can appear in many different places on a website. There are a handful of HTML tags that have src or href attributes. Thus, I wanted to see how many and what kind of credentials can be found on the internet, simply by inspecting the HTML sources of HTTP responses.

Methodology

The methodology was straightforward:

  • Obtain HTTP responses
  • Extract URLs from all HTTP responses
  • Check all URLs for credentials

Obtaining HTTP responses

I didn't have any datasets ready for use, so my options were either to scan the internet and save the HTTP responses myself, or to stand on the shoulders of giants and use datasets created by fellow researchers. Unfortunately, scans.io does not offer its HTTP response datasets anymore, and my application to Censys.io as an academic researcher has gone unanswered for two weeks now. Luckily, Rapid7 still offers its datasets for download.

As I'm not registered, I couldn't use the most recent datasets from July. Instead, I used the freely available datasets from June. I downloaded the datasets for HTTP and HTTPS for all ports. In the end, I had about 250 GB of compressed data to process!

Extracting URLs

These datasets are gzip-compressed files in which each line is a JSON object. Its data field holds the full HTTP response to a GET request (status line, headers, and body) as a base64-encoded string, alongside metadata such as the IP address, port, and path:

root@hacking:~/scanning# zcat 2020-06-26-1593172001-https_get_4434.json.gz | head -n1
{"data":"SFRUUC8xLjEgNTAzIFNlcnZpY2UgVW5hdmFpbGFibGUNCkNvbnRlbnQtVHlwZTogdGV4dC9odG1sDQpDYWNoZS1Db250cm9sOiBuby1jYWNoZQ0KQ29ubmVjdGlvbjogY2xvc2UNCkNvbnRlbnQtTGVuZ3RoOiA2ODMNClgtSWluZm86IDExLTMwOTI2MTM0LTAgME5OTiBSVCgxNTkzMTcyMTQzMjk1IDEyNykgcSgwIC0xIC0xIC0xKSByKDAgLTEpDQoNCjxodG1sIHN0eWxlPSJoZWlnaHQ6MTAwJSI+PGhlYWQ+PE1FVEEgTkFNRT0iUk9CT1RTIiBDT05URU5UPSJOT0lOREVYLCBOT0ZPTExPVyI+PG1ldGEgbmFtZT0iZm9ybWF0LWRldGVjdGlvbiIgY29udGVudD0idGVsZXBob25lPW5vIj48bWV0YSBuYW1lPSJ2aWV3cG9ydCIgY29udGVudD0iaW5pdGlhbC1zY2FsZT0xLjAiPjxtZXRhIGh0dHAtZXF1aXY9IlgtVUEtQ29tcGF0aWJsZSIgY29udGVudD0iSUU9ZWRnZSxjaHJvbWU9MSI+PC9oZWFkPjxib2R5IHN0eWxlPSJtYXJnaW46MHB4O2hlaWdodDoxMDAlIj48aWZyYW1lIGlkPSJtYWluLWlmcmFtZSIgc3JjPSIvX0luY2Fwc3VsYV9SZXNvdXJjZT9DV1VETlNBST0yNiZ4aW5mbz0xMS0zMDkyNjEzNC0wJTIwME5OTiUyMFJUJTI4MTU5MzE3MjE0MzI5NSUyMDEyNyUyOSUyMHElMjgwJTIwLTElMjAtMSUyMC0xJTI5JTIwciUyODAlMjAtMSUyOSZpbmNpZGVudF9pZD0wLTE0NjQ2MTY3OTExMDkxNDE4NyZlZGV0PTIyJmNpbmZvPWZmZmZmZmZmJnJwaW5mbz0wIiBmcmFtZWJvcmRlcj0wIHdpZHRoPSIxMDAlIiBoZWlnaHQ9IjEwMCUiIG1hcmdpbmhlaWdodD0iMHB4IiBtYXJnaW53aWR0aD0iMHB4Ij5SZXF1ZXN0IHVuc3VjY2Vzc2Z1bC4gSW5jYXBzdWxhIGluY2lkZW50IElEOiAwLTE0NjQ2MTY3OTExMDkxNDE4NzwvaWZyYW1lPjwvYm9keT48L2h0bWw+","host":"107.154.195.130","ip":"107.154.195.130","path":"/","port":4434,"subject":{"C":"US","ST":"Delaware","CN":"incapsula.com","O":"Incapsula Inc","L":"Dover"},"vhost":"107.154.195.130"}

However, I only focused on the HTML source code of the responses.

My plan was to iterate over all a elements and extract the value of the href attribute. Originally, I wanted to use BeautifulSoup for that, but it turns out that it does not support XPath queries. As I needed an XPath query for another purpose anyway, I decided to switch to the lxml library, which parses HTML into an element tree that supports XPath.

XPath allows checking for substrings using the contains() function. Following the URL pattern from above, a naive way to identify such URLs is to require that:

  • It contains a : to split username/password
  • It contains an @ to split the credentials from the host part

Unfortunately, there are a few protocol handlers that also match this pattern, for example mailto:, which instructs the browser to open an email program, as in mailto:foobar@example.com. Since email addresses contain an @ sign and mailto: a colon, this would cause false positives. Thus, I had to exclude all URLs that contain mailto:.

Here's the pseudo code:

for line in uncompress(dataset):
	html = get_html_from_json(line)
	tree = lxml.html.fromstring(html)
	a_tags = tree.xpath('//a[contains(@href, ":") and contains(@href, "@") and not(contains(@href, "mailto:"))]')
	for tag in a_tags:
		print(tag.get('href'))

In a second run, I extended the extraction to the following elements and attributes (see the XPath sketch after this list):

  • a:href
  • link:href
  • iframe:src
  • img:src
  • embed:src
  • audio:src
  • video:src
  • script:src
  • source:src
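
Here's a sketch of how one combined XPath query over all of these pairs could be built (the pair list and helper function are my own illustration):

PAIRS = [
	("a", "href"), ("link", "href"), ("iframe", "src"),
	("img", "src"), ("embed", "src"), ("audio", "src"),
	("video", "src"), ("script", "src"), ("source", "src"),
]

def credential_url_xpath():
	# Same three conditions as before, applied to each element/attribute pair
	cond = ('contains(@{a}, ":") and contains(@{a}, "@") '
	        'and not(contains(@{a}, "mailto:"))')
	return " | ".join(f"//{tag}[{cond.format(a=attr)}]" for tag, attr in PAIRS)

lxml accepts such union expressions, so tree.xpath(credential_url_xpath()) returns the matching elements of all nine kinds in a single pass.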

In a setup parallelized per dataset file, this worked pretty well and produced a list of URLs within a few hours.
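
That setup could look roughly like this, with one worker process per dataset file (a sketch that reuses get_html_from_json() and credential_url_xpath() from the sketches above; error handling is deliberately coarse):

import glob
import gzip
import multiprocessing

import lxml.html

def process_dataset(path):
	urls = []
	with gzip.open(path, "rt") as fh:
		for line in fh:
			try:
				tree = lxml.html.fromstring(get_html_from_json(line))
			except Exception:
				continue  # skip malformed records or unparsable HTML
			urls += [el.get("href") or el.get("src")
			         for el in tree.xpath(credential_url_xpath())]
	return urls

if __name__ == "__main__":
	with multiprocessing.Pool() as pool:
		for urls in pool.imap_unordered(process_dataset, sorted(glob.glob("*.json.gz"))):
			print("\n".join(urls))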

Finding credentials

The last step was to go over the list of extracted URLs and check whether they contain any credentials. Again, the easiest approach was to use Python's urlparse, but I needed a regular expression for another application. Therefore, I decided to follow both approaches and compare their results.

Here's the pseudo code:

for url in URLs:
	try:
		parsed = urlparse(url)
		if parsed.username and parsed.password:
			print(url)
	except ValueError:
		pass

and

for url in URLs:
	if creds_re.match(url):
		print(url)

The creds_re regex took quite a bit of experimenting to get right: ^((http|ftp|rtsp|rtmp)s?:)?\/\/[^\/?]+:[^\/?]+@[^\.\/]*. Nevertheless, the work paid off, as it matches exactly the same URLs as the urlparse approach.
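
A quick self-check along those lines, with made-up test URLs, might look like this (note that the slashes don't need escaping in a Python raw string):

import re
from urllib.parse import urlparse

creds_re = re.compile(r"^((http|ftp|rtsp|rtmp)s?:)?//[^/?]+:[^/?]+@[^./]*")

def has_creds(url):
	try:
		parsed = urlparse(url)
		return bool(parsed.username and parsed.password)
	except ValueError:
		return False

for url in ["http://foo:bar@example.com",
            "ftp://user:secret@files.example.com/pub",
            "https://example.com/login",
            "mailto:foobar@example.com"]:
	# Both approaches should agree on every URL
	assert bool(creds_re.match(url)) == has_creds(url)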

Results

As the title promised, there are credentials hiding in plain sight, waiting to be found. So let's have a look at the results:

a tags

The first run of the analysis focused on a tags and their href attributes only. The URL extraction step resulted in:

  • 40982 URLs (24190 unique)
  • 1166 credentials (636 unique)

Surprisingly, not only http URLs were found, but also ftp, rtsp, and rtmp ones. Here's the per-scheme distribution of the 636 unique credential URLs:

  • http://: 58
  • https://: 21
  • ftp://: 553
  • rtsp://: 4
  • rtmp://: 0

Extended tags

One could assume that extending the scope would increase the results, and indeed, that's the case! Using all of the HTML tags and attributes listed above, the results are:

  • 188671 URLs (55457 unique)
  • 1350 credentials (881 unique)

and

  • http://: 93
  • https://: 21
  • ftp://: 672
  • rtsp://: 14
  • rtmp://: 1

Credentials

For obvious reasons, I won't post any of the publicly accessible credentials here. However, based on the insights I gained, I can say that:

  • The credentials ranged from generic/default ones to highly complex ones
  • The affected systems were both publicly accessible and located on private networks

httpcreds tool

While writing this blog post, I felt the need to create a tool so that anyone can check their own website for leaked credentials.

$> python3 httpcreds.py -u https://uploads.blogbasis.net/test/
[*] Checking: https://uploads.blogbasis.net/test/
[+] Found: http://test:test@example.com

You can download the tool from Github: https://github.com/gehaxelt/python-httpcreds

Detectify Hackerschool #10 Talk

On August 12th, I gave a lightning talk about this topic at Detectify's Hackerschool #10!

You can find the slides for my talk Security risks hiding in plain sight here.