<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[pentester.io]]></title><description><![CDATA[Chronicles from hacking]]></description><link>https://pentester.io/</link><generator>Ghost 0.11</generator><lastBuildDate>Wed, 11 Mar 2026 07:10:46 GMT</lastBuildDate><atom:link href="https://pentester.io/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Commonspeak: Content discovery wordlists built with BigQuery]]></title><description><![CDATA[Content discovery wordlists (subdomains, files and directories) built with BigQuery for use in penetration testing, offensive security and bug bounties]]></description><link>https://pentester.io/commonspeak-bigquery-wordlists/</link><guid isPermaLink="false">0c8ff42e-cdc5-4448-913b-95ba649cf88c</guid><category><![CDATA[research]]></category><dc:creator><![CDATA[Shubham Shah]]></dc:creator><pubDate>Mon, 04 Dec 2017 11:14:00 GMT</pubDate><content:encoded><![CDATA[<p>Recently, I was dealing with a <a href="https://censys.io/data/80-http-get-full_ipv4">scans.io dataset</a> which was around 400GB and needed a solution to quickly identify information that was valuable to me. To achieve this, I used <a href="https://cloud.google.com/bigquery/">Google BigQuery</a>.</p>

<p>Google's BigQuery is quite amazing. It lets you process and analyse large datasets with simple SQL queries. If you're a security researcher, you may have already seen how powerful BigQuery can be without even knowing it (e.g. <a href="https://censys.io">Censys.io</a>).</p>

<p>The first thing that caught my eye about Google's BigQuery was the public datasets. The five datasets I found most interesting were the following:</p>

<ul>
<li><mark>updated weekly</mark>: <a href="https://cloud.google.com/bigquery/public-data/github">Contents from 2.9M public, open source licensed repositories on GitHub</a></li>
<li><mark>updated quarterly</mark>: <a href="https://cloud.google.com/bigquery/public-data/stackoverflow">All of the publicly available data for StackOverflow</a></li>
<li><mark>updated daily</mark>: <a href="https://cloud.google.com/bigquery/public-data/hacker-news">All stories and comments from Hacker News from its launch in 2006</a></li>
<li><mark>updated daily</mark>: <a href="https://medium.com/cali-dog-security/retrieving-storing-and-querying-250m-certificates-like-a-boss-31b1ce2dfcf8">Every certificate from certificate transparency logs (access needs to be requested)</a></li>
<li><mark>updated bi-weekly</mark>: <a href="https://cloud.google.com/bigquery/public-data/hacker-news">HTTPArchive's dataset obtained by crawling Alexa's Top 1M list</a></li>
</ul>

<p>These datasets can be used to create wordlists that reflect current technologies. </p>

<p>Of particular interest to me was that we can generate wordlists that are not years out of date (e.g. <a href="https://github.com/danielmiessler/SecLists/blob/master/Discovery/Web_Content/raft-large-files.txt">raft-large-files.txt</a>).</p>

<p>This post details how we processed and created better web discovery dictionaries using publicly available datasets such as GitHub, StackOverflow, HackerNews and HTTPArchive.</p>

<hr>

<p>Table of contents:</p>

<ol>
<li><a href="https://pentester.io/commonspeak-bigquery-wordlists/#commonspeak">Introducing the Commonspeak tool</a>  </li>
<li><a href="https://pentester.io/commonspeak-bigquery-wordlists/#filesandfolders">Directory and file names</a>  </li>
<li><a href="https://pentester.io/commonspeak-bigquery-wordlists/#parameters">Parameter names</a>  </li>
<li><a href="https://pentester.io/commonspeak-bigquery-wordlists/#subdomains">Subdomains</a></li>
</ol>

<hr>

<div id="commonspeak"></div>  

<h4 id="introducing">Introducing:</h4>

<p><a href="https://github.com/pentest-io/commonspeak"><img src="https://i.imgur.com/ANSqOBE.png" alt="Drawing" style="width: 400px;"></a></p>

<p>Commonspeak is a tool that was developed to automate the process of extracting wordlists from BigQuery. All of the techniques described in this blog post have been turned into bash scripts. These scripts are particularly useful for generating wordlists on a recurring basis.</p>

<p>You can get Commonspeak here:</p>

<p><a href="https://github.com/pentest-io/commonspeak">https://github.com/pentest-io/commonspeak</a></p>

<p>If you'd like to download the datasets from this blog post, they are available here:</p>

<p><a href="https://github.com/pentest-io/commonspeak/archive/master.zip">https://github.com/pentest-io/commonspeak/archive/master.zip</a></p>

<p>Installation instructions can be found <a href="https://github.com/pentest-io/commonspeak/blob/master/README.md">here</a>.</p>

<p>Here's how it looks when generating subdomains from the HackerNews dataset:</p>

<p><img src="https://i.imgur.com/7cK5RqF.gif" alt=""></p>

<p>Commonspeak is still under active development and pull requests are very welcome :) </p>

<p>We're always looking to improve our BigQuery queries and create unique datasets.</p>

<div id="filesandfolders"></div>  

<h4 id="extractingdirectoryandfilenames">Extracting directory and file names</h4>

<p>When testing a web application, extending the attack surface is extremely valuable. Having a large list of realistic directories and files to use with a tool such as <a href="https://github.com/maurosoria/dirsearch">dirsearch</a> or <a href="https://github.com/OJ/gobuster">gobuster</a> helps immensely with this process.</p>

<p>We found that the most useful data (directories, filenames and subdomains) came from parsing URLs recorded from HTTPArchive, StackOverflow and HackerNews.</p>

<p>We can construct a SQL script to analyse every stored HTTP request in HTTPArchive's public dataset. The table we're interested in is <code>httparchive.runs.latest_requests</code> as it contains every request made in the latest run.</p>

<p>From this table, we will extract full URLs using the <code>url</code> column and we will filter this data by extension by using the <code>ext</code> column.</p>

<p>This data will then be processed by a user defined function (UDF) that utilises the <a href="https://medialize.github.io/URI.js/docs.html">URI.js</a> library in order to extract relevant information.</p>

<p>Our first query is responsible for extracting every URL with a <code>.php</code> extension to a new BigQuery table. This is done for optimisation, as we will be running other queries on this dataset:</p>

<pre><code class="language-sql">SELECT  
  url,
  COUNT(url) AS cnt
FROM  
  [httparchive:runs.latest_requests]
WHERE  
  ext = "php"
GROUP BY  
  url
ORDER BY  
  cnt DESC;
</code></pre>

<p><img src="https://i.imgur.com/tkts8MR.png" alt=""></p>

<p>Once this query is completed, we can then run the following query (which contains the UDF) in order to extract every filename that contains a <code>.php</code> extension in the latest HTTPArchive dataset:</p>

<pre><code class="language-sql">CREATE TEMPORARY FUNCTION getPath(x STRING)  
RETURNS STRING  
  LANGUAGE js AS """
  function getPath(s) {
    try {
      return URI(s).filename(true);
    } catch (ex) {
      return s;
    }
  }
  return getPath(x);
"""
OPTIONS (  
  library="gs://commonspeak-udf/URI.min.js"
);

SELECT  
  getPath(url) AS url
FROM  
  `crunchbox-160315.commonspeak.httparchive_php`
GROUP BY url  
</code></pre>

<p>By calling <code>URI(s).filename</code>, we are able to extract the filename for every URL in our table.</p>

<p><img src="https://i.imgur.com/PIpJIBg.png" alt=""></p>

<p>If we only want directory names, we can use <code>URI(s).directory(true)</code>.</p>

<pre><code class="language-sql">CREATE TEMPORARY FUNCTION getPath(x STRING)  
RETURNS STRING  
  LANGUAGE js AS """
  function getPath(s) {
    try {
      return URI(s).directory(true);
    } catch (ex) {
      return s;
    }
  }
  return getPath(x);
"""
OPTIONS (  
  library="gs://commonspeak-udf/URI.min.js"
);

SELECT  
  getPath(url) AS url
FROM  
  `crunchbox-160315.commonspeak.httparchive_php`
GROUP BY url  
</code></pre>

<p><img src="https://i.imgur.com/tL63GXA.png" alt=""></p>

<p>We can also extract path names (e.g. <code>/foo/bar/test.php</code>) by using <code>URI(s).pathname(true)</code>:</p>

<pre><code class="language-sql">CREATE TEMPORARY FUNCTION getPath(x STRING)  
RETURNS STRING  
  LANGUAGE js AS """
  function getPath(s) {
    try {
      return URI(s).pathname(true);
    } catch (ex) {
      return s;
    }
  }
  return getPath(x);
"""
OPTIONS (  
  library="gs://commonspeak-udf/URI.min.js"
);

SELECT  
  getPath(url) AS url
FROM  
  `crunchbox-160315.commonspeak.httparchive_php`
GROUP BY url  
</code></pre>

<p><img src="https://i.imgur.com/j7WjN1A.png" alt=""></p>
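<p>The three UDFs above all delegate to URI.js. Before spending BigQuery quota, the same filename / directory / path extraction can be sanity-checked locally. The following is a minimal sketch using Python's standard <code>urllib.parse</code> as a stand-in for URI.js, so edge-case behaviour (e.g. percent-decoding) may differ slightly:</p>

```python
from urllib.parse import urlparse
import posixpath

def get_filename(url):
    # Rough equivalent of URI(url).filename(true): the last path segment
    return posixpath.basename(urlparse(url).path)

def get_directory(url):
    # Rough equivalent of URI(url).directory(true): the path minus the filename
    return posixpath.dirname(urlparse(url).path)

def get_pathname(url):
    # Rough equivalent of URI(url).pathname(true): the full path component
    return urlparse(url).path

url = "https://example.com/foo/bar/test.php?id=1"
print(get_filename(url))   # test.php
print(get_directory(url))  # /foo/bar
print(get_pathname(url))   # /foo/bar/test.php
```

<p>Running the extraction logic locally on a handful of sample URLs is a cheap way to confirm a UDF behaves as expected before pointing it at a multi-gigabyte table.</p>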

<div id="parameters"></div>  

<h4 id="extractingparameternamesforspecifictechnologies">Extracting parameter names for specific technologies</h4>

<p>Often when performing content discovery, it's common to come across scripts (PHP, JSP, ASPX, etc.) whose input parameters you don't know. Using these datasets, an attacker can brute-force likely parameter names instead of guessing blindly.</p>

<p>Using the GitHub dataset and PHP as an example, we can extract user input parameter names that come from <code>$_REQUEST</code>, <code>$_GET</code> or <code>$_POST</code>.</p>

<pre><code class="language-sql">SELECT  
  line AS query_parameter,
  COUNT(*) AS count
FROM  
  FLATTEN( (
    SELECT
      SPLIT(SPLIT(REGEXP_EXTRACT(content, r'.*\$_(?:REQUEST|GET|POST)[\[](?:\'|\")([^)$]*)(?:\"|\')[\]]'), '\n'), ';') AS line,
    FROM (
      SELECT
        id,
        content
      FROM
        [bigquery-public-data:github_repos.contents]
      WHERE
        REGEXP_MATCH(content, r'.*\$_(?:REQUEST|GET|POST)[\[](?:\'|\")([^)$]*)(?:\"|\')[\]]')) AS C
    JOIN (
      SELECT
        id
      FROM
        [bigquery-public-data:github_repos.files]
      WHERE
        path LIKE '%.php'
      GROUP BY
        id) AS F
    ON
      C.id = F.id), line)
GROUP BY  
  query_parameter
HAVING  
  query_parameter IS NOT NULL
ORDER BY  
  count DESC
LIMIT 10000  
</code></pre>

<p>Running this query will process 2.07 TB of data. The cost of running this can be reduced by extracting the PHP files to a separate table and then using that table for all future queries.</p>

<div id="subdomains"></div>  

<h4 id="extractingsubdomains">Extracting subdomains</h4>

<p>Using better quality datasets allows an attacker to identify obscure assets and extend their attack surface. </p>

<p>Using the HackerNews dataset we are able to extract ~63,000 unique subdomains. </p>

<p>We optimised our queries by extracting all of these URLs to a new table before running additional queries. </p>

<pre><code class="language-sql">SELECT  
  url,
  COUNT(url) AS cnt
FROM  
  [bigquery-public-data:hacker_news.full]
GROUP BY  
  url
ORDER BY  
  cnt DESC;
</code></pre>

<p>We can then run the following query on our table in order to extract all subdomains:</p>

<pre><code class="language-sql">CREATE TEMPORARY FUNCTION getSubdomain(x STRING)  
RETURNS STRING  
  LANGUAGE js AS """
  function getSubdomain(s) {
    try {
      return URI(s).subdomain();
    } catch (ex) {
      return s;
    }
  }
  return getSubdomain(x);
"""
OPTIONS (  
  library="gs://commonspeak-udf/URI.min.js"
);

SELECT  
  getSubdomain(url) AS subdomain
FROM  
  `crunchbox-160315.hackernews_2017_10_22.urls`
GROUP BY subdomain  
</code></pre>

<p>The final output looks like this:</p>

<p><img src="https://i.imgur.com/HXUfnwW.png" alt=""></p>
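<p>For a quick local approximation of what <code>URI(url).subdomain()</code> returns, the sketch below uses a hypothetical helper that naively treats the last two host labels as the registered domain. Note that URI.js applies its own suffix heuristics, and this naive version mishandles multi-part TLDs such as <code>.co.uk</code>; production code should consult a public-suffix list:</p>

```python
from urllib.parse import urlparse

def get_subdomain(url):
    # Naive stand-in for URI(url).subdomain(): everything to the left of
    # the last two host labels. Breaks on multi-part TLDs like .co.uk.
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[:-2]) if len(labels) > 2 else ""

print(get_subdomain("https://news.ycombinator.com/item?id=1"))  # news
print(get_subdomain("https://blog.api.example.com/"))           # blog.api
print(get_subdomain("https://example.com/"))                    # (empty)
```
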

<p>While these public datasets contain useful subdomains, it's possible to extract larger, higher-quality lists by parsing certificate transparency logs. Ryan Sears of CaliDog Security has done the heavy lifting of importing these logs into BigQuery, as described <a href="https://medium.com/cali-dog-security/retrieving-storing-and-querying-250m-certificates-like-a-boss-31b1ce2dfcf8">here</a>. Ryan maintains this dataset and updates it on a daily basis.</p>

<p>If you'd like access to Ryan's BigQuery Certificate Transparency dataset, I recommend contacting him (his details can be found at the end of <a href="https://medium.com/cali-dog-security/retrieving-storing-and-querying-250m-certificates-like-a-boss-31b1ce2dfcf8">his blog post</a>). Included in commonspeak is a script that parses and extracts useful subdomains from his dataset on BigQuery.</p>

<p>We've created the following query to extract subdomains from Ryan's dataset:</p>

<pre><code class="language-sql">CREATE TEMPORARY FUNCTION getSubdomain(x STRING)  
RETURNS STRING  
  LANGUAGE js AS """
  // Helper function for error handling
  function getSubdomain(s) {
    try {
      var url = 'http://' + s;
      var subdomain = URI(url).subdomain();
      if (subdomain == '*' || subdomain == '') {
        // do nothing
      } else {
        // clean subdomain further
        if (subdomain.startsWith('*.')) {
          subdomain = subdomain.replace('*.', '');
        }
        return subdomain;
      }
    } catch (ex) {
      return s;
    }
  }
  return getSubdomain(x);
"""
OPTIONS (  
  library="gs://commonspeak-udf/URI.min.js"
);

SELECT  
  getSubdomain( dns_names ) AS dns_names, APPROX_COUNT_DISTINCT(dns_names) AS cnt
FROM  
  `crunchbox-160315.ctl_2017_11_22.all_dns_names`
GROUP BY  
  dns_names
ORDER BY  
  cnt DESC
LIMIT  
  1000000;
</code></pre>

<p><img src="https://i.imgur.com/2EWpjhh.png" style="width:300px;margin:auto;"></p>

<p>This image represents the top one million subdomains extracted by parsing <strong>3,482,694,371</strong> records in the dataset.</p>
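<p>The wildcard handling in that UDF can be checked locally as well. The Python sketch below mirrors only the cleaning step (dropping bare <code>*</code> entries and stripping a leading <code>*.</code>); it deliberately skips the URI.js subdomain extraction, so it is not a full equivalent of the UDF:</p>

```python
def clean_dns_name(name):
    # Mirrors the wildcard cleanup in the UDF: drop bare/empty wildcard
    # entries and strip a leading '*.' from wildcard certificate names.
    name = name.strip()
    if name in ("*", ""):
        return None  # corresponds to the UDF returning nothing (NULL)
    if name.startswith("*."):
        name = name[2:]
    return name

entries = ["*.mail.example.com", "*", "www.example.com"]
cleaned = [c for c in (clean_dns_name(e) for e in entries) if c]
print(cleaned)  # ['mail.example.com', 'www.example.com']
```
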

<hr>

<p>The queries shown above have mostly been integrated into commonspeak. The generation of these wordlists can be automated by using cron jobs to run the commonspeak wordlist generation scripts on a regular basis.</p>
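<p>As one possible setup (the script path and schedule below are placeholders, not part of Commonspeak itself), a crontab entry for weekly regeneration might look like:</p>

```shell
# Regenerate Commonspeak wordlists every Monday at 03:00.
# /opt/commonspeak/generate.sh is a hypothetical wrapper around the
# repository's generation scripts; adjust the path for your install.
0 3 * * 1 /opt/commonspeak/generate.sh >> /var/log/commonspeak.log 2>&1
```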

<p>I believe I was the first to write about using BigQuery for content discovery / offensive purposes. Feel free to follow me on <a href="https://twitter.com/infosec_au">Twitter</a> or check out <a href="https://assetnote.io">Assetnote</a> and our <a href="https://blog.assetnote.io">corporate blog</a>, which contains numerous articles similar to this.</p>]]></content:encoded></item></channel></rss>