I’ve been doing some work with Hadoop lately, and I just ran into an interesting problem with networking. This isn’t a bug, per se, but a conflict in my configuration.

I’m running on a laptop, using a pseudo-distributed cluster: all of Hadoop’s daemons run as separate processes, but they all run on one box. That makes it possible to test jobs with full network communication, but without deploying to a production cluster.
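For reference, this is roughly what the 0.20-era configuration looks like (earlier releases put all of these properties in a single hadoop-site.xml). The values are just the ones from the quick-start docs; your ports may differ.

<!-- conf/core-site.xml: point HDFS clients at the local machine -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: one box, so keep a single replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: the job tracker runs locally too -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>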

I’m also working remotely, connecting to the corporate network by VPN. As is commonly done, our VPN is configured to completely separate the client machine from its local network. (If it didn’t, a VPN-connected machine could bridge the secure corporate network to your home ISP, coffeeshop, airport, etc.)

Here’s the problem: when on the VPN, my machine can’t talk to its own IP address. Right now, ifconfig reports the laptop’s IP address as 192.168.1.105. That’s the address associated with the physical NIC on the machine. Apparently the VPN’s routing rules swallow packets addressed to that NIC instead of looping them back.
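This is easy to see without Hadoop involved at all. Something like the following (the address is mine; substitute your own) fails while the VPN is up and works the instant it drops:

# the address ifconfig reports for the physical NIC
ping -c 3 192.168.1.105

# loopback is unaffected either way
ping -c 3 127.0.0.1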

The odd part is that Hadoop mostly works anyway. I’ve configured the name node, job tracker, task tracker, data nodes, etc. to all use “localhost”. I can use HDFS, I can submit jobs, and all the map tasks run fine. The only problem comes when the map tasks finish: the reduce tasks cannot copy the map output from the task tracker, and the job appears to hang.
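My understanding of the shuffle phase is that each reduce task pulls map output over HTTP from the task tracker that ran the map, using whatever host name that task tracker advertises. If that name resolves to the NIC’s address rather than loopback, the copy runs straight into the VPN problem above. One way to poke at this (50060 is the default port in mapred.task.tracker.http.address) is:

# what is the task tracker's HTTP server bound to?
netstat -an | grep 50060

# fetching from the NIC address should hang while the VPN is up,
# if this theory is right
curl -sI http://192.168.1.105:50060/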

In the task tracker’s log file, I see reports every 20 seconds or so that say

2009-07-31 11:01:33,992 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200907310946_003_r_000000_0 0.0% reduce > copy >

The instant I disconnected from the VPN, the copy proceeded and the reduce job ran.

I’m sure there’s a configuration property somewhere within Hadoop that I can change. When (if) I find it, I’ll update this post.
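If I had to guess, I’d start with the properties that control which host name the daemons advertise. This vintage of Hadoop has a slave.host.name property that forces a data node or task tracker to report a particular name, so something like the following in conf/mapred-site.xml might pin the shuffle to loopback. Untested, so consider it a hypothesis rather than a fix:

<!-- hypothetical: make the task tracker advertise loopback so
     reducers fetch map output from 127.0.0.1, not the NIC -->
<property>
  <name>slave.host.name</name>
  <value>127.0.0.1</value>
</property>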