Windows Azure Storage Blob (WASB) is a file system implemented as an extension on top of the HDFS APIs, and in many ways it behaves like HDFS. Instead of keeping data on local disks managed by a NameNode and DataNodes, WASB stores it in Azure Blob Storage.
WASB is built into HDInsight (Microsoft's Hadoop-on-Azure service), where it is the default file system.
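Since WASB is the default file system on an HDInsight cluster, fs.defaultFS in core-site.xml points at the cluster's default container. A hedged sketch (the account and container names below are illustrative placeholders, not real Azure objects):

```xml
<property>
  <name>fs.defaultFS</name>
  <value>wasb://<containername>@<accountname>.blob.core.windows.net</value>
</property>
```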
Azure storage stores files as a flat key/value store without formal support for folders. The hadoop-azure file system layer simulates folders on top of Azure storage. By default, folder rename in the hadoop-azure file system layer is not atomic. That means that a failure during a folder rename could, for example, leave some folders in the original directory and some in the new one. See the parameter fs.azure.atomic.rename.dir if you want to make the operations atomic.
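The atomic-rename behavior mentioned above is opted into per directory. A hedged core-site.xml sketch, assuming fs.azure.atomic.rename.dir takes a comma-separated list of directories (the directory names here are illustrative):

```xml
<property>
  <name>fs.azure.atomic.rename.dir</name>
  <value>/hbase,/data</value>
</property>
```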
In Azure, blobs are stored in containers within Azure storage accounts.
Only the commands that are specific to the native HDFS implementation (referred to as DFS), such as fsck and dfsadmin, behave differently with Azure storage.
Azure has no notion of a directory. However, parsing the blob names yields a tree structure, because Hadoop treats a slash ("/") in a name as a directory separator.
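The folder simulation described above can be sketched in a few lines: a flat key/value store whose keys contain "/" separators, with "folders" inferred from key prefixes. This is only an illustration of the idea, not the actual hadoop-azure implementation; the blob names reuse the example path from below.

```python
# Flat key/value store: keys are plain strings, no real folders exist.
blobs = {
    "SomeDirectory/ASubDirectory/AFile.txt": b"test file content",
    "SomeDirectory/AnotherFile.txt": b"...",
}

def list_dir(prefix):
    """List the immediate children of a simulated folder (key prefix)."""
    prefix = prefix.rstrip("/") + "/" if prefix else ""
    children = set()
    for key in blobs:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            # A remaining "/" means the child is itself a simulated folder.
            head, sep, _ = rest.partition("/")
            children.add(head + ("/" if sep else ""))
    return sorted(children)

print(list_dir("SomeDirectory"))  # ['ASubDirectory/', 'AnotherFile.txt']
```

A rename of a simulated folder under this model is a rewrite of every key sharing the prefix, which is why folder rename is not atomic by default.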
Blob address:
# Fully qualified name, local (HDFS)
hdfs://<namenodehost>/<path>
# Fully qualified name, global (HDInsight / WASB syntax)
wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
# Example
wasb://[email protected]/SomeDirectory/ASubDirectory/AFile.txt
The wasb and wasbs schemes identify URLs on a file system backed by Azure Blob Storage (wasbs uses SSL).
Driver: org.apache.hadoop.fs.azure.Wasb
hadoop fs -mkdir wasb://[email protected]/testDir
hadoop fs -put testFile wasb://[email protected]/testDir/testFile
azure storage blob upload <sourcefilename> <containername> <blobname> --account-name <storageaccountname> --account-key <storageaccountkey>
See also: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-upload-data
azure storage blob download <containername> <blobname> <destinationfilename> --account-name <storageaccountname> --account-key <storageaccountkey>
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage#download-files
hadoop fs -cat wasbs://[email protected]/testDir/testFile
test file content
Remove-AzureStorageBlob -Container $containerName -Context $storageContext -blob $blob
azure storage blob delete <containername> <blobname> --account-name <storageaccountname> --account-key <storageaccountkey>
Get-AzureStorageBlob -Container $containerName -Context $storageContext -prefix "example/data/"
azure storage blob list <containername> <blobname|prefix> --account-name <storageaccountname> --account-key <storageaccountkey>
The WASB configuration (i.e. the file system configuration) is in the core-site.xml file.
Example:
<property>
  <name>fs.AbstractFileSystem.wasb.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasb</value>
</property>
<property>
  <name>fs.AbstractFileSystem.wasbs.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasbs</value>
</property>
<property>
  <name>fs.azure.account.key.hiinformaticasawe.blob.core.windows.net</name>
  <value>MIIB/QYJKoZIhvcNAQcDoIIB7jCCAeo....</value>
</property>
<property>
  <name>fs.azure.account.keyprovider.hiinformaticasawe.blob.core.windows.net</name>
  <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
</property>
<property>
  <name>fs.azure.io.copyblob.retry.max.retries</name>
  <value>60</value>
</property>
<property>
  <name>fs.azure.io.read.tolerate.concurrent.append</name>
  <value>true</value>
</property>
<property>
  <name>fs.azure.page.blob.dir</name>
  <value>/mapreducestaging,/atshistory,/tezstaging,/ams/hbase/WALs,/ams/hbase/oldWALs,/ams/hbase/MasterProcWALs</value>
</property>
<property>
  <name>fs.azure.shellkeyprovider.script</name>
  <value>/usr/lib/hdinsight-common/scripts/decrypt.sh</value>
</property>
WASB is also available in the Apache Hadoop source code. Therefore, when you install a Hadoop distribution such as Hortonworks HDP or Cloudera EDH/CDH on Azure VMs, you can use WASB after some configuration changes to the cluster.
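On such a self-installed cluster, the minimal configuration is typically the storage account key in core-site.xml (a hedged sketch; the placeholders below must be replaced with your own account name and access key, and the hadoop-azure jar must be on the classpath):

```xml
<property>
  <name>fs.azure.account.key.<accountname>.blob.core.windows.net</name>
  <value><storageaccountkey></value>
</property>
```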
Jars needed in a client installation: the hadoop-azure jar and the Azure Storage SDK jar (azure-storage) must be on the client classpath.
Example: https://docs.microsoft.com/en-us/java/api/overview/azure/storage
When using a Hadoop command-line client such as hdfs, you may get the following error:
hdfs groups hdfs
Exception in thread "main" java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.defaultFS): wasb://[email protected] is not of scheme 'hdfs'.
at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:530)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:147)
at org.apache.hadoop.hdfs.tools.GetGroups.getUgmProtocol(GetGroups.java:87)
at org.apache.hadoop.tools.GetGroupsBase.run(GetGroupsBase.java:71)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.hdfs.tools.GetGroups.main(GetGroups.java:96)
To resolve this problem, pass a URI with the hdfs scheme (fs.default.name is the deprecated alias of fs.defaultFS):
hdfs groups -D "fs.default.name=hdfs://namenode/" hdfs
hdfs : hadoop