Azure - Windows Azure Storage Blob (WASB) - HDFS

About

Windows Azure Storage Blob (WASB) is a file system implemented as an extension on top of the HDFS APIs; in many ways, it behaves like HDFS.

WASB is built into HDInsight (Microsoft's Hadoop on Azure service) and is the default file system.

Azure storage stores files as a flat key/value store without formal support for folders. The hadoop-azure file system layer simulates folders on top of Azure storage. By default, folder rename in the hadoop-azure file system layer is not atomic. That means that a failure during a folder rename could, for example, leave some folders in the original directory and some in the new one. See the parameter fs.azure.atomic.rename.dir if you want to make the operations atomic.
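The parameter mentioned above is set in core-site.xml. A minimal sketch, assuming hypothetical directory names (the value is a comma-separated list of directories, not a recommended setting):

```xml
<property>
  <name>fs.azure.atomic.rename.dir</name>
  <!-- Hypothetical example: directories under which folder renames
       should be made atomic by the hadoop-azure layer -->
  <value>/hbase,/data</value>
</property>
```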

In Azure, you store blobs in containers within Azure storage accounts.


Limitations

Only the commands that are specific to the native HDFS implementation (which is referred to as DFS), such as fsck and dfsadmin, behave differently against Azure storage.

Structure

Azure Storage Structure

Configuration

Chunk size

Replication factor

Management

File location

Azure has no notion of a directory. However, Hadoop derives a tree structure from the file names: a slash ("/") in a blob name is interpreted as a directory separator.
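The simulation described above can be sketched in a few lines. This is not the hadoop-azure implementation, only an illustration of how a flat key/value listing can be presented as a directory tree; the blob names are taken from the example below.

```python
def list_directory(blobs, prefix):
    """Return the immediate children of `prefix` in a flat blob listing.

    A trailing "/" on a returned name marks a simulated folder.
    """
    prefix = prefix.rstrip("/") + "/" if prefix else ""
    children = set()
    for key in blobs:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            # Everything before the first "/" is an immediate child;
            # a remaining "/" means the child is a simulated folder.
            head, sep, _ = rest.partition("/")
            children.add(head + ("/" if sep else ""))
    return sorted(children)

blobs = [
    "SomeDirectory/ASubDirectory/AFile.txt",
    "SomeDirectory/Other.txt",
    "TopLevel.txt",
]
print(list_directory(blobs, ""))               # ['SomeDirectory/', 'TopLevel.txt']
print(list_directory(blobs, "SomeDirectory"))  # ['ASubDirectory/', 'Other.txt']
```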

Blob address:

# Fully qualified name (local HDFS)
hdfs://<namenodehost>/<path>

# HDInsight syntax (global)
wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
# Example
wasb://[email protected]/SomeDirectory/ASubDirectory/AFile.txt

Scheme

The schemes wasb and wasbs (the latter over SSL) identify a URL on a file system backed by Azure Blob Storage.

Driver: org.apache.hadoop.fs.azure.Wasb

Use blob storage

hdfs://<namenodehost>/<path>
wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
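To make WASB the default file system (as HDInsight does), fs.defaultFS in core-site.xml can point at a container. The container and account names below are placeholders:

```xml
<property>
  <name>fs.defaultFS</name>
  <!-- Placeholder container and account names -->
  <value>wasb://[email protected]</value>
</property>
```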

Make a directory

hadoop fs -mkdir wasb://[email protected]/testDir

Upload

# With the Hadoop client
hadoop fs -put testFile wasb://[email protected]/testDir/testFile
# With the classic Azure CLI
azure storage blob upload <sourcefilename> <containername> <blobname> --account-name <storageaccountname> --account-key <storageaccountkey>

See also: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-upload-data

Download

azure storage blob download <containername> <blobname> <destinationfilename> --account-name <storageaccountname> --account-key <storageaccountkey>

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage#download-files

Cat the content

hadoop fs -cat wasbs://[email protected]/testDir/testFile
test file content

Delete

# With PowerShell
Remove-AzureStorageBlob -Container $containerName -Context $storageContext -blob $blob
# With the classic Azure CLI
azure storage blob delete <containername> <blobname> --account-name <storageaccountname> --account-key <storageaccountkey>

List

# With PowerShell
Get-AzureStorageBlob -Container $containerName -Context $storageContext -prefix "example/data/"
# With the classic Azure CLI
azure storage blob list <containername> <blobname|prefix> --account-name <storageaccountname> --account-key <storageaccountkey>

Hadoop Configuration

The WASB configuration (i.e. the file system configuration) is defined in the core-site.xml file.

Example:

<property>
  <name>fs.AbstractFileSystem.wasb.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasb</value>
</property>

<property>
  <name>fs.AbstractFileSystem.wasbs.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasbs</value>
</property>

<property>
  <name>fs.azure.account.key.hiinformaticasawe.blob.core.windows.net</name>
  <value>MIIB/QYJKoZIhvcNAQcDoIIB7jCCAeo....</value>
</property>

<property>
  <name>fs.azure.account.keyprovider.hiinformaticasawe.blob.core.windows.net</name>
  <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
</property>

<property>
  <name>fs.azure.io.copyblob.retry.max.retries</name>
  <value>60</value>
</property>

<property>
  <name>fs.azure.io.read.tolerate.concurrent.append</name>
  <value>true</value>
</property>

<property>
  <name>fs.azure.page.blob.dir</name>
  <value>/mapreducestaging,/atshistory,/tezstaging,/ams/hbase/WALs,/ams/hbase/oldWALs,/ams/hbase/MasterProcWALs</value>
</property>

<property>
  <name>fs.azure.shellkeyprovider.script</name>
  <value>/usr/lib/hdinsight-common/scripts/decrypt.sh</value>
</property>

Code

WASB is also available in the Apache Hadoop source code. Therefore, when you install a Hadoop distribution such as Hortonworks HDP or Cloudera EDH/CDH on Azure VMs, you can use WASB with some configuration changes to the cluster.

JARs needed in a client installation: the hadoop-azure module and the Azure Storage SDK for Java (azure-storage).

Example: https://docs.microsoft.com/en-us/java/api/overview/azure/storage

Support

Invalid URI for NameNode address (check fs.defaultFS): wasb is not of scheme hdfs

When using a Hadoop command-line client such as hdfs, you may get the following error:

hdfs groups hdfs
Exception in thread "main" java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.defaultFS): wasb://[email protected] is not of scheme 'hdfs'.
        at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:530)
        at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
        at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:147)
        at org.apache.hadoop.hdfs.tools.GetGroups.getUgmProtocol(GetGroups.java:87)
        at org.apache.hadoop.tools.GetGroupsBase.run(GetGroupsBase.java:71)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
        at org.apache.hadoop.hdfs.tools.GetGroups.main(GetGroups.java:96)

Pass a URI with the hdfs scheme (via -D) to resolve this problem:

hdfs groups  -D "fs.default.name=hdfs://namenode/"  hdfs
hdfs : hadoop
