Oct 7, 2015

Scheme and Authority in Hadoop URI

A Hadoop URI (uniform resource identifier) consists of a scheme, an authority and a path. In other words, the fully qualified name of a directory or file is determined by these three parts. The URI format is scheme://authority/path. The FS shell command (./bin/hdfs dfs) uses the scheme and authority to determine which file system it has to refer to. Together, the scheme and authority determine the FileSystem implementation.
The scheme value for the Hadoop file system (HDFS) is "hdfs", and for the local file system it is "file". The URI scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class.
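As a rough illustration of the fs.SCHEME.impl naming rule, the sketch below derives the property name from a URI's scheme and looks it up in a small table. The table ASSUMED_CONF and the helper names impl_property and impl_class are hypothetical and exist only for this example. The two class names are the usual Hadoop classes for HDFS and the local file system, but treating them as the configured values is an assumption.

```python
from urllib.parse import urlsplit

# Illustrative table only (an assumption of this sketch, not read from Hadoop):
# property name -> FileSystem implementation class.
ASSUMED_CONF = {
    "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem",
    "fs.file.impl": "org.apache.hadoop.fs.LocalFileSystem",
}

def impl_property(uri):
    """Return the fs.SCHEME.impl property name for a URI that carries a scheme."""
    scheme = urlsplit(uri).scheme
    if not scheme:
        raise ValueError("URI has no scheme: " + uri)
    return "fs." + scheme + ".impl"

def impl_class(uri, conf=ASSUMED_CONF):
    """Look up the implementation class named by the fs.SCHEME.impl property."""
    return conf.get(impl_property(uri))

print(impl_property("hdfs://localhost:54310/user/hduser1/hdfsdata"))
print(impl_class("file:///usr/local/"))
```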
The authority is the host name (or domain name) and port number associated with a file system.
The scheme and authority are optional. If they are not specified, the default scheme and authority specified in the configuration are used.
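To make the scheme://authority/path layout concrete, here is a minimal sketch that splits a URI into its parts using Python's standard urllib.parse module. This is not how Hadoop itself parses URIs, and the helper name split_uri is hypothetical. It only shows which piece of the string is the scheme, which is the authority (host and port), and which is the path.

```python
from urllib.parse import urlsplit

def split_uri(uri):
    """Split a URI into scheme, authority (host and port) and path."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,      # e.g. "hdfs" or "file"
        "authority": parts.netloc,   # e.g. "localhost:54310", empty for file:///
        "host": parts.hostname,      # None when there is no authority
        "port": parts.port,          # None when no port is given
        "path": parts.path,
    }

for uri in ("hdfs://localhost:54310/user/hduser1/hdfsdata", "file:///usr/local/"):
    print(split_uri(uri))
```

For the file:///usr/local/ URI the authority is empty, which is why three slashes appear in a row.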
Question:- Where do we configure the default scheme?
In core-site.xml, with the property name "fs.default.name" (deprecated in Hadoop 2.x in favor of "fs.defaultFS", although the old name is still accepted). For an HDFS setup its value has the form
hdfs://<host>:<port>, and this can be verified in core-site.xml at the following location:- <HADOOP_INSTALLATION_DIR>/etc/hadoop/core-site.xml.
Below is a sample property node for fs.default.name. Its value is hdfs://localhost:54310,
where hdfs is the scheme and localhost:54310 is the NameNode host and port.
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
 </property>
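One way to check the configured value programmatically is to read the property out of the XML. The sketch below is only an illustration: it parses an inline string rather than the real file, it assumes the property nodes sit inside a <configuration> root element as in a typical core-site.xml, and the helper name get_property is hypothetical.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for core-site.xml (assumption: <configuration> is the root element).
CORE_SITE = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>"""

def get_property(xml_text, name):
    """Return the <value> of the <property> whose <name> matches, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

print(get_property(CORE_SITE, "fs.default.name"))
```

To read the actual file instead, ET.parse(path).getroot() could replace ET.fromstring, with the path pointing at the core-site.xml of the installation.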

Similarly, for the local file system, the scheme name changes to "file" and the URI looks like
file:///user/hadoop/filename
The path is a valid name of a file or directory that specifies a unique location in a file system. It looks like: /usr/local/<parent_dir>/<sub_dir>/<child_dir/file>

FS shell command execution for HDFS and the local file system based on scheme values:-

As stated earlier, the scheme and authority determine the FileSystem implementation. Let's execute FS shell commands for HDFS and the local file system, spot the fundamental differences, and understand how the scheme determines which file system implementation is used.
For HDFS:- Here the scheme name is "hdfs" and the NameNode host value is localhost:54310, so the following command treats the specified location as a Hadoop file system path and lists all the files and directories at that location.
hduser1@ubuntu:/usr/local/hadoop2.6.1$ ./bin/hdfs dfs -ls hdfs://localhost:54310/user/hduser1/hdfsdata
Found 2 items
drwxr-xr-x   - hduser1 supergroup          0 2015-10-04 11:51 hdfs://localhost:54310/user/hduser1/hdfsdata/hadoop_data
drwxr-xr-x   - hduser1 supergroup          0 2015-10-07 08:14 hdfs://localhost:54310/user/hduser1/hdfsdata/input
Note:- As mentioned earlier, the scheme and authority are optional. So, if we have configured <HADOOP_Installation>/etc/hadoop/core-site.xml with the property fs.default.name (as shown above), we do not need to qualify the path (/user/hduser1/hdfsdata) with a scheme and authority. Let's execute the following command without scheme and authority (assuming the configuration file is set up with a valid value for the property "fs.default.name").
hduser1@ubuntu:/usr/local/hadoop2.6.1$ ./bin/hdfs dfs -ls /user/hduser1/hdfsdata
Found 2 items
drwxr-xr-x   - hduser1 supergroup          0 2015-10-04 11:51 hdfs://localhost:54310/user/hduser1/hdfsdata/hadoop_data
drwxr-xr-x   - hduser1 supergroup          0 2015-10-07 08:14 hdfs://localhost:54310/user/hduser1/hdfsdata/input
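The reason both commands list the same directory can be sketched as a small path-qualification step: a path with no scheme is prefixed with the default file system URI, while a path that already has a scheme is left alone. This is a simplified illustration and not Hadoop's actual resolution code. The helper name qualify is hypothetical, DEFAULT_FS simply repeats the fs.default.name value from the sample configuration, and only absolute paths are handled.

```python
from urllib.parse import urlsplit

# Same value as fs.default.name in the sample core-site.xml property.
DEFAULT_FS = "hdfs://localhost:54310"

def qualify(path, default_fs=DEFAULT_FS):
    """Return a fully qualified URI for an absolute path or an already qualified URI."""
    if urlsplit(path).scheme:
        return path  # scheme given explicitly, nothing to add
    if not path.startswith("/"):
        # Relative paths are out of scope for this sketch.
        raise ValueError("only absolute paths are handled here: " + path)
    return default_fs.rstrip("/") + path

print(qualify("/user/hduser1/hdfsdata"))
print(qualify("file:///usr/local/"))
```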

For the local file system:- List all files and directories inside the given directory using the FS shell command.
hduser1@ubuntu:/usr/local/hadoop2.6.1$ ./bin/hdfs dfs -ls file:///usr/local/
Found 15 items
drwxr-xr-x   - root    root         4096 2013-04-24 10:01 file:///usr/local/bin
drwxr-xr-x   - root    root         4096 2013-04-24 10:01 file:///usr/local/etc
drwxr-xr-x   - root    root         4096 2013-04-24 10:01 file:///usr/local/games
drwxr-xr-x   - hduser  hadoop       4096 2015-10-02 09:05 file:///usr/local/hadoop
drwxr-xr-x   - root    root         4096 2013-07-22 15:26 file:///usr/local/hadoop-1.2.1
-rw-r--r--   1 root    root     63851630 2015-10-02 06:26 file:///usr/local/hadoop-1.2.1.tar.gz
drwxr-xr-x   - hduser1 hadoop       4096 2015-10-04 14:45 file:///usr/local/hadoop2.6.1
drwxr-xr-x   - hduser1 hadoop       4096 2015-10-04 02:33 file:///usr/local/hadoop_store
drwxr-xr-x   - root    root         4096 2013-04-24 10:01 file:///usr/local/include
drwxr-xr-x   - root    root         4096 2015-10-02 04:29 file:///usr/local/java
drwxr-xr-x   - root    root         4096 2013-04-24 10:04 file:///usr/local/lib
drwxr-xr-x   - root    root         4096 2013-04-24 10:01 file:///usr/local/man
drwxr-xr-x   - root    root         4096 2013-04-24 10:01 file:///usr/local/sbin
drwxr-xr-x   - root    root         4096 2013-04-24 10:05 file:///usr/local/share
drwxr-xr-x   - root    root         4096 2013-04-24 10:01 file:///usr/local/src
Here the scheme name is "file", so the FS shell command refers to the local file system and lists all files and directories at the given location (/usr/local/).

We are now in a position to conclude that the scheme and authority together determine the file system implementation to be referenced and, combined with the path, uniquely identify a file or directory in that file system.

Location: Hyderabad, Telangana, India