Credentials hiding in plain sight, or how I pwned your HTTP auth
In this blog post I will go over the little research project I did about HTTP authentication credentials hiding in plain sight.
Idea
A few months ago, I was thinking about bug bounty programs and which security issues could have high impact. Obviously, RCEs are bad, but how do you get an RCE if there isn't a direct flaw/exploit present? Well, maybe if you manage to obtain credentials that you could use to log in somewhere... Again: how are credentials exposed? Mostly through configuration files... or are there other ways?
HTTP Authentication crossed my mind. The Basic scheme is defined in RFC 7617, and most people should have heard of Basic Authentication (or the `Authorization: Basic <base64(username:password)>` header). The web server sends a `WWW-Authenticate` header to prompt the client for a username and password.
As this prompt can be annoying, most clients (read: browsers) usually allow passing the credentials within the URL itself using the following scheme: `[protocol]://[username]:[password]@[host]`. So `http://foo:bar@example.com` would try to authenticate me at `example.com` as `foo` using the password `bar`.
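As a quick illustration (my own example, not from the pattern above), Python's standard `urllib.parse` splits such a URL into exactly these components:

```python
from urllib.parse import urlparse

parsed = urlparse("http://foo:bar@example.com")
print(parsed.username, parsed.password, parsed.hostname)
# foo bar example.com
```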
As we know, URLs can appear in many different places on a website: there is a handful of HTML tags with `src` or `href` attributes. Thus, I wanted to see how many and what kind of credentials can be found on the internet, simply by inspecting the HTML sources of HTTP responses.
Methodology
The methodology was straightforward:
- Obtain HTTP responses
- Extract URLs from all HTTP responses
- Check all URLs for credentials
Obtaining HTTP responses
I didn't have any datasets ready for use, so my options were either to scan the internet and save the HTTP responses myself, or to stand on the shoulders of giants and use datasets created by fellow researchers. Unfortunately, scans.io no longer offers its HTTP response datasets. My application to Censys.io as an academic researcher has gone unanswered for two weeks. Luckily, Rapid7 still offers their datasets for download.
As I'm not registered, I couldn't use the most recent datasets from July. Instead, I used the freely available datasets from June. I downloaded the HTTP and HTTPS datasets for all ports. In the end, I had about 250 GB of compressed data to process!
Extracting URLs
These datasets are gzip-compressed files with one JSON object per line. Each object contains the raw HTTP response to a GET request (headers and body) as a base64-encoded string, along with related information such as the IP address, port, and virtual host:
```
root@hacking:~/scanning# zcat 2020-06-26-1593172001-https_get_4434.json.gz | head -n1
{"data":"SFRUUC8xLjEgNTAzIFNlcnZpY2UgVW5hdmFpbGFibGUNCkNvbnRlbnQtVHlwZTogdGV4dC9odG1sDQpDYWNoZS1Db250cm9sOiBuby1jYWNoZQ0KQ29ubmVjdGlvbjogY2xvc2UNCkNvbnRlbnQtTGVuZ3RoOiA2ODMNClgtSWluZm86IDExLTMwOTI2MTM0LTAgME5OTiBSVCgxNTkzMTcyMTQzMjk1IDEyNykgcSgwIC0xIC0xIC0xKSByKDAgLTEpDQoNCjxodG1sIHN0eWxlPSJoZWlnaHQ6MTAwJSI+PGhlYWQ+PE1FVEEgTkFNRT0iUk9CT1RTIiBDT05URU5UPSJOT0lOREVYLCBOT0ZPTExPVyI+PG1ldGEgbmFtZT0iZm9ybWF0LWRldGVjdGlvbiIgY29udGVudD0idGVsZXBob25lPW5vIj48bWV0YSBuYW1lPSJ2aWV3cG9ydCIgY29udGVudD0iaW5pdGlhbC1zY2FsZT0xLjAiPjxtZXRhIGh0dHAtZXF1aXY9IlgtVUEtQ29tcGF0aWJsZSIgY29udGVudD0iSUU9ZWRnZSxjaHJvbWU9MSI+PC9oZWFkPjxib2R5IHN0eWxlPSJtYXJnaW46MHB4O2hlaWdodDoxMDAlIj48aWZyYW1lIGlkPSJtYWluLWlmcmFtZSIgc3JjPSIvX0luY2Fwc3VsYV9SZXNvdXJjZT9DV1VETlNBST0yNiZ4aW5mbz0xMS0zMDkyNjEzNC0wJTIwME5OTiUyMFJUJTI4MTU5MzE3MjE0MzI5NSUyMDEyNyUyOSUyMHElMjgwJTIwLTElMjAtMSUyMC0xJTI5JTIwciUyODAlMjAtMSUyOSZpbmNpZGVudF9pZD0wLTE0NjQ2MTY3OTExMDkxNDE4NyZlZGV0PTIyJmNpbmZvPWZmZmZmZmZmJnJwaW5mbz0wIiBmcmFtZWJvcmRlcj0wIHdpZHRoPSIxMDAlIiBoZWlnaHQ9IjEwMCUiIG1hcmdpbmhlaWdodD0iMHB4IiBtYXJnaW53aWR0aD0iMHB4Ij5SZXF1ZXN0IHVuc3VjY2Vzc2Z1bC4gSW5jYXBzdWxhIGluY2lkZW50IElEOiAwLTE0NjQ2MTY3OTExMDkxNDE4NzwvaWZyYW1lPjwvYm9keT48L2h0bWw+","host":"107.154.195.130","ip":"107.154.195.130","path":"/","port":4434,"subject":{"C":"US","ST":"Delaware","CN":"incapsula.com","O":"Incapsula Inc","L":"Dover"},"vhost":"107.154.195.130"}
```
However, the only thing I focused on was the responses' HTML source code.
My plan was to iterate over all `a` elements and extract the value of the `href` attribute. Originally, I wanted to use `BeautifulSoup` for that, but it turns out that it does not support XPath queries. As I needed an XPath query for another thing anyway, I decided to switch to the `lxml` library, which parses HTML just as well.
XPath allows checking for substrings using the `contains()` function. Following the URL pattern from above, a naive way to identify such URLs could be:

- It contains a `:` to split username and password
- It contains an `@` to split the credentials from the host part
Unfortunately, there are a few protocol handlers that would also match this pattern, for example `mailto:`, which instructs the browser to open an email program, e.g. `mailto:foobar@example.com`. As email addresses contain an `@` sign and `mailto:` a colon, this would cause false positives. Thus, I had to exclude all URLs containing `mailto:`.
Here's the pseudo code:
```python
import lxml.html

# hrefs that contain ":" and "@" but are not mailto: links
XPATH = ("//a[contains(@href, ':') and contains(@href, '@')"
         " and not(contains(@href, 'mailto:'))]")

for line in uncompress(dataset):
    html = get_html_from_json(line)
    tree = lxml.html.fromstring(html)
    for tag in tree.xpath(XPATH):
        print(tag.get("href"))
```
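The two helpers in the pseudo code are placeholders; a minimal implementation could look like this (`uncompress` and `get_html_from_json` are my names, not part of any library):

```python
import base64
import gzip
import json

def uncompress(path):
    """Yield the dataset's JSON lines one at a time."""
    with gzip.open(path, "rt", errors="replace") as fh:
        yield from fh

def get_html_from_json(line):
    """Decode the base64 "data" field and strip the HTTP headers."""
    record = json.loads(line)
    response = base64.b64decode(record["data"])
    # the HTML body starts after the first blank line of the response
    return response.split(b"\r\n\r\n", 1)[-1]
```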
In a second run, I extended the extraction to the following elements and attributes (one possible combined query is sketched after the list):

- `a:href`
- `link:href`
- `iframe:src`
- `img:src`
- `embed:src`
- `audio:src`
- `video:src`
- `script:src`
- `source:src`
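Here is one way to build such a combined query; this is my reconstruction, not necessarily the exact query used in the analysis:

```python
# element/attribute pairs from the list above
PAIRS = [("a", "href"), ("link", "href"), ("iframe", "src"), ("img", "src"),
         ("embed", "src"), ("audio", "src"), ("video", "src"),
         ("script", "src"), ("source", "src")]

# same ":" / "@" / mailto: filter as before, applied per attribute
CONDITION = ("contains(@{attr}, ':') and contains(@{attr}, '@')"
             " and not(contains(@{attr}, 'mailto:'))")

# XPath supports unions via "|", so all pairs fit into a single query
XPATH = " | ".join(
    "//{tag}[{cond}]".format(tag=tag, cond=CONDITION.format(attr=attr))
    for tag, attr in PAIRS
)
```

The extraction loop then needs to read the right attribute per match, e.g. `tag.get("href") or tag.get("src")`.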
In a per-dataset parallelized setup this worked pretty well and produced a list of URLs within a few hours.
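A minimal sketch of such a per-dataset parallelization, assuming a `process_dataset(path)` wrapper (hypothetical, my name) around the extraction loop above:

```python
from multiprocessing import Pool
from pathlib import Path

def process_dataset(path):
    # hypothetical wrapper: run the extraction loop from above over one
    # dataset file and write the URLs it finds to a per-dataset output file
    ...

if __name__ == "__main__":
    datasets = sorted(str(p) for p in Path("datasets/").glob("*.json.gz"))
    with Pool() as pool:  # defaults to one worker per CPU core
        pool.map(process_dataset, datasets)
```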
Finding credentials
The last step was to go over the list of extracted URLs and check if they contain any credentials. Again, the easiest approach was to use Python's `urlparse`, but I needed a regular expression for another application. Therefore, I decided to follow both approaches and compare their results.
Here's the pseudo code:
```python
from urllib.parse import urlparse

for url in urls:
    try:
        parsed = urlparse(url)
        if parsed.username and parsed.password:
            print(url)
    except ValueError:
        pass
```
and
```python
for url in urls:
    if creds_re.match(url):
        print(url)
```
The `creds_re` pattern took quite a bit of experimenting to create: `^((http|ftp|rtsp|rtmp)s?:)?\/\/[^\/?]+:[^\/?]+@[^\.\/]*`. Nevertheless, the work paid off, as it matches exactly the same URLs as the `urlparse` approach.
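For reference, a quick sanity check of the compiled pattern (the test URLs are my own):

```python
import re

creds_re = re.compile(r"^((http|ftp|rtsp|rtmp)s?:)?\/\/[^\/?]+:[^\/?]+@[^\.\/]*")

assert creds_re.match("http://foo:bar@example.com")        # credentials found
assert creds_re.match("ftp://foo:bar@example.com")         # other schemes too
assert not creds_re.match("mailto:foobar@example.com")     # no false positive
assert not creds_re.match("https://example.com/index.html")
```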
Results
As the title promised, there are credentials hiding in plain sight waiting to be found. So let's have a look at the results:
a tags
The first run of the analysis focused on `a` tags and their `href` attributes only. The URL extraction step resulted in:
- 40982 URLs (24190 unique)
- 1166 credentials (636 unique)
Surprisingly, not only `http` URLs were found, but also `ftp`, `rtsp`, and `rtmp` ones. Here's the distribution of unique URLs:
- `http://`: 58
- `https://`: 21
- `ftp://`: 553
- `rtsp://`: 4
- `rtmp://`: 0
Extended tags
Obviously, one could assume that extending the scope would increase the results. Indeed, that's the case! Using all the other HTML tags and their respective attributes, the results are:
- 188671 URLs (55457 unique)
- 1350 credentials (881 unique)
and
- `http://`: 93
- `https://`: 21
- `ftp://`: 672
- `rtsp://`: 14
- `rtmp://`: 1
Credentials
For obvious reasons, I won't post any publicly accessible credentials here. However, from the insights I got, I can tell that:
- The credentials ranged from "generic/default" credentials to highly complex ones
- The affected systems were publicly accessible as well as on private networks
httpcreds tool
While writing this blog post, I felt the need to create a tool, so that anyone could check their own website for leaked credentials.
```
$> python3 httpcreds.py -u https://uploads.blogbasis.net/test/
[*] Checking: https://uploads.blogbasis.net/test/
[+] Found: http://test:test@example.com
```
You can download the tool from Github: https://github.com/gehaxelt/python-httpcreds
Detectify Hackerschool #10 Talk
On August 12th, I gave a lightning talk about this topic at Detectify's Hackerschool #10!
You can find the slides for my talk "Security risks hiding in plain sight" here.