<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[HDFS - Shodan Blog]]></title><description><![CDATA[The latest news and developments for Shodan.]]></description><link>https://blog.shodan.io/</link><generator>Ghost 0.7</generator><lastBuildDate>Sun, 12 Apr 2026 02:16:36 GMT</lastBuildDate><atom:link href="https://blog.shodan.io/tag/hdfs/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[The HDFS Juggernaut]]></title><description><![CDATA[<p>There's been much focus on MongoDB, Elastic and Redis in terms of data exposure on the Internet due to their general popularity in the developer community. However, in terms of data volume it turns out that HDFS is the real juggernaut. To give you a better idea here's a quick</p>]]></description><link>https://blog.shodan.io/the-hdfs-juggernaut/</link><guid isPermaLink="false">c469ddda-3cd3-48db-b4dc-a2d771993b61</guid><category><![CDATA[NoSQL]]></category><category><![CDATA[research]]></category><category><![CDATA[Python]]></category><category><![CDATA[HDFS]]></category><category><![CDATA[CLI]]></category><dc:creator><![CDATA[John Matherly]]></dc:creator><pubDate>Wed, 31 May 2017 17:32:11 GMT</pubDate><media:content url="http://blog.shodan.io/content/images/2017/05/hdfs-map-1600.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.shodan.io/content/images/2017/05/hdfs-map-1600.png" alt="The HDFS Juggernaut"><p>There's been much focus on MongoDB, Elastic and Redis in terms of data exposure on the Internet due to their general popularity in the developer community. However, in terms of data volume it turns out that HDFS is the real juggernaut. To give you a better idea here's a quick comparison between MongoDB and HDFS:</p>

<table>  
<thead>  
<tr>  
<th></th>  
<th>MongoDB</th>  
<th>HDFS</th>  
</tr>  
</thead>  
<tbody>  
<tr>  
<td>Number of Servers</td>  
<td>47,820</td>  
<td>4,487</td>  
</tr>  
<tr>  
<td>Data Exposed</td>  
<td>25 TB</td>  
<td>5,120 TB</td>  
</tr>  
</tbody>  
</table>

<p>Even though there are far more MongoDB databases connected to the Internet without authentication, in terms of data exposure they are dwarfed by HDFS clusters (25 TB vs. 5 PB). Where are all these instances located?</p>

<script type="text/javascript" src="https://asciinema.org/a/6dzqir2jbssqftvcxwgh63dwp.js" id="asciicast-6dzqir2jbssqftvcxwgh63dwp" async></script>

<p>Most of the HDFS NameNodes are located in the US (1,900) and China (1,426), and nearly all of them are hosted in the cloud, with Amazon leading the charge (1,059) followed by Alibaba (507).</p>
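<p>Those breakdowns come straight out of the banner metadata. Here's a rough sketch of how you could tally countries yourself from the files downloaded in the Technical Details section below — note that <strong>country_breakdown</strong> is a helper I'm defining for this post, not part of the Shodan library, and I'm assuming each banner carries its geolocation in the usual <strong>location.country_name</strong> field:</p>

```python
from collections import Counter


def country_breakdown(banners):
    """Tally Shodan banners by the country in their location metadata."""
    counts = Counter()
    for banner in banners:
        # Each Shodan banner includes a 'location' dict with geolocation fields;
        # fall back to 'Unknown' if the banner is missing that information
        counts[banner.get('location', {}).get('country_name', 'Unknown')] += 1
    return counts


# Hand-made banners in the same shape that real Shodan results use
sample = [
    {'location': {'country_name': 'United States'}},
    {'location': {'country_name': 'China'}},
    {'location': {'country_name': 'United States'}},
]
print(country_breakdown(sample).most_common())
```

<p>To run it against real data, feed it the banners from <code>shodan.helpers.iterate_files(['hdfs-data.json.gz'])</code> instead of the hand-made sample.</p>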

<p><img src="https://blog.shodan.io/content/images/2017/05/hdfs-map-600.png" alt="The HDFS Juggernaut"></p>

<p>The ransomware attacks on databases that were <a href="http://www.csoonline.com/article/3154190/security/exposed-mongodb-installs-being-erased-held-for-ransom.html">widely</a> <a href="https://www.fidelissecurity.com/threatgeek/2017/01/revenge-devops-gangster-open-hadoop-installs-wiped-worldwide">publicized</a> earlier in the year are still happening, and they're impacting both MongoDB and HDFS deployments. For HDFS, Shodan has discovered roughly <a href="https://www.shodan.io/search?query=NODATA4U_SECUREYOURSHIT">207 clusters</a> that carry a ransom message warning of the public exposure. And a quick glance at Shodan's search results reveals that most of the public MongoDB instances <a href="https://www.shodan.io/search?query=product%3Amongodb">seem to be compromised</a>. I've <a href="https://blog.shodan.io/its-the-data-stupid/">previously written</a> about the reason behind these exposures, but note that both products now have extensive documentation on <a href="https://docs.mongodb.com/manual/security/">secure deployment</a>.</p>
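<p>If you'd rather count the ransomed clusters in your own downloaded data than through the search query above, checking the raw banner for the marker string works. The <strong>looks_ransomed</strong> helper below is my own, and it assumes the marker shows up somewhere in the banner's raw <strong>data</strong> field:</p>

```python
RANSOM_MARKER = 'NODATA4U_SECUREYOURSHIT'


def looks_ransomed(banner):
    """Return True if a banner's raw data contains the ransom note marker."""
    # Shodan stores the raw service response in the banner's 'data' field;
    # wiped clusters show the marker string somewhere in that text
    return RANSOM_MARKER in banner.get('data', '')


# Hand-made banners illustrating the shape of real results
banners = [
    {'data': 'NameNode ... /NODATA4U_SECUREYOURSHIT ...'},
    {'data': 'NameNode ... /user/hive/warehouse ...'},
]
print(sum(looks_ransomed(b) for b in banners))  # 1
```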

<h6 id="technicaldetails">Technical Details</h6>

<p>If you'd like to replicate the above findings or perform your own investigation into data exposure, here is how I measured it.</p>

<ol>
<li><p>Download data using the <a href="https://cli.shodan.io">Shodan command-line interface</a>:</p>

<pre><code>shodan download --limit -1 hdfs-servers product:namenode
</code></pre></li>
<li><p>Write a Python script to measure the amount of exposed data (<strong>hdfs-exposure.py</strong>):</p>

<pre><code>from shodan.helpers import iterate_files, humanize_bytes
from sys import argv, exit

if len(argv) &lt;= 1:
    print('Usage: {} &lt;file1.json.gz&gt; ...'.format(argv[0]))
    exit(1)

datasize = 0
clusters = {}

# Loop over all the banners in the provided files
for banner in iterate_files(argv[1:]):
    try:
        # Grab the HDFS information that Shodan gathers
        info = banner['opts']['hdfs-namenode']
        cid = info['ClusterId']
        # Skip clusters we've already counted
        if cid in clusters:
            continue
        datasize += info['Used']
        clusters[cid] = True
    except KeyError:
        # Banner doesn't contain the HDFS NameNode information
        pass

print(humanize_bytes(datasize))
</code></pre></li>
<li><p>Run the Python script to get the amount of data exposed:</p>

<pre><code>$ python hdfs-exposure.py hdfs-data.json.gz
5.0 PB
</code></pre></li>
</ol>]]></content:encoded></item></channel></rss>