HDFS - DistCp (distributed inter/intra-cluster copy)


About

DistCp (distributed copy) is a tool for copying large amounts of data between clusters (inter-cluster) or within a single cluster (intra-cluster).

Concept

DistCp is a MapReduce application and therefore runs in parallel. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
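The partitioning idea can be pictured with a toy sketch. This is an illustration only, not DistCp's actual algorithm: the function name partition_files and the round-robin assignment are assumptions made for the example, and the real tool balances the work between map tasks in its own way.

```shell
#!/usr/bin/env bash
# Toy illustration: assign each source path to one of N "map tasks".
# partition_files is a hypothetical helper, not part of Hadoop.
partition_files() {
  local num_maps=$1
  shift
  local i=0
  local path
  for path in "$@"; do
    # Round-robin: path number i goes to map task (i mod num_maps)
    echo "map-$(( i % num_maps )) $path"
    i=$(( i + 1 ))
  done
}

# Three source files spread over two map tasks
partition_files 2 /foo/a /foo/b /foo/c
```

Each line of the output pairs a map task with one file it would copy, so every task works on its own slice of the source list.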

Management

Inter-cluster copy

Hadoop

hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo

where nn1 and nn2 are the HDFS NameNodes of the source and target clusters.
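A few common variations of this command are sketched below. The wrapper run_distcp is a hypothetical helper added for the example: it prints the command and only executes it when RUN=1, so the sketch can be tried on a machine without a Hadoop client. The paths /foo/a and /foo/b and the map count of 20 are illustrative values.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run wrapper: prints the distcp command,
# and runs it only when the environment variable RUN is set to 1.
run_distcp() {
  local cmd=(hadoop distcp "$@")
  echo "${cmd[*]}"
  if [ "${RUN:-0}" = "1" ]; then
    "${cmd[@]}"
  fi
}

# Several source paths copied into one target directory
run_distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo

# -update skips files that already match at the target;
# -m caps the number of simultaneous map tasks (here 20)
run_distcp -update -m 20 hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
```

An intra-cluster copy uses the same form, with both URIs pointing at the same NameNode.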

Example

Copy from S3 to HDFS (the credentials in the URI are placeholders)

hadoop distcp s3n://AWS_SECRET_ID:AWS_SECRET_KEY@blaze-data/enron-email hdfs:///tmp/enron
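Newer Hadoop releases use the s3a connector in place of s3n, and the credentials can be passed as configuration properties instead of being embedded in the URI. The sketch below assumes that the keys are available in the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; the wrapper s3a_distcp and the fallback values MY_ACCESS_KEY and MY_SECRET_KEY are hypothetical. It prints the command and only runs it when RUN=1.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run wrapper around an S3-to-HDFS copy through s3a.
# Caution: the printed command contains the keys; do not log it.
s3a_distcp() {
  local src=$1
  local dst=$2
  local cmd=(hadoop distcp
    "-Dfs.s3a.access.key=${AWS_ACCESS_KEY_ID:-MY_ACCESS_KEY}"
    "-Dfs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY:-MY_SECRET_KEY}"
    "$src" "$dst")
  echo "${cmd[*]}"
  if [ "${RUN:-0}" = "1" ]; then
    "${cmd[@]}"
  fi
}

# Same bucket and target as the example above, with the s3a scheme
s3a_distcp s3a://blaze-data/enron-email hdfs:///tmp/enron
```

Keeping the keys out of the URI avoids exposing them in paths and job names, although command-line properties are still visible in the process list.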
