Find command in Hadoop: How to find files of a specific size in HDFS

Most often I see developers struggling to mimic the Linux find command for Hadoop files, especially when searching by size or by a size range. No wonder: there was no such find command in Hadoop before version 2.7.x, and even the latest release, 3.0.0-beta1, leaves a lot to be desired. I have faced this problem on a multitude of occasions myself and have figured out a couple of ways to do it. This short write-up should help those who want to search the Hadoop file system for files matching a specific size, and particularly for files in a given size range.

 

Option 1: If you are using the Cloudera distribution then you are in luck, as CDH has pre-built functionality for the job. You can use the jar, which supports almost all the features of the Linux find command. Here is how you would find files with size greater than 2KB and less than 4KB:
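A sketch of what that looks like, assuming the HdfsFindTool that ships with Cloudera Search and a default parcel layout (the jar path is an assumption; adjust it for your CDH version):

```shell
# Find files bigger than 2KB and smaller than 4KB; the "c" suffix means bytes,
# as in GNU find. The jar path below is an assumption based on the default
# CDH parcel layout.
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.HdfsFindTool \
    -find / -type f -size +2048c -size -4096c
```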

And in case you want to find log files with names starting with twitter, modified in the last hour and with size between 2KB and 4KB, then you could use:
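Along the same lines, assuming the same HdfsFindTool jar path and that your version supports the GNU-find-style -name and -mmin tests (paths here are hypothetical):

```shell
# Log files starting with "twitter", modified in the last 60 minutes,
# between 2KB and 4KB. Adjust jar path and search directory for your cluster.
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.HdfsFindTool \
    -find /user/logs -type f -name 'twitter*' -mmin -60 \
    -size +2048c -size -4096c
```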

 
 
Option 2: If you are not using CDH, you still have a way out: you can combine the commands available in HDFS and Unix. Here is how you would do it using awk to find all the files under the HDFS root with size between 2KB and 4KB:
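A minimal sketch of such a pipeline, with 2KB and 4KB expressed as 2048 and 4096 bytes:

```shell
# List everything under the HDFS root, keep only files (lines starting with "-"),
# normalize whitespace, and print size and path for files between 2KB and 4KB.
hdfs dfs -ls -R / | awk '/^-/ { gsub(/[ ,]+/," "); if ($5 >= 2048 && $5 <= 4096) print $5, $8 }'
```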

The last command needs a little explanation:
a) hdfs dfs -ls -R / => A familiar command to list all files recursively under the HDFS root. The output could look like:
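For illustration, with hypothetical paths and sizes, the listing could be:

```
drwxr-xr-x   - hdfs supergroup          0 2017-10-01 10:02 /user/hdfs
-rw-r--r--   3 hdfs supergroup       3290 2017-10-01 10:15 /user/hdfs/twitter_feed.log
-rw-r--r--   3 hdfs supergroup       9211 2017-10-01 10:16 /user/hdfs/app.log
```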

b) /^-/ => The listing of all files in HDFS is piped into the awk command, and this first expression checks the file-type flag at the start of each input line, allowing only files (and not directories) to pass through:
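Again with hypothetical entries, only the file lines survive this filter:

```
-rw-r--r--   3 hdfs supergroup       3290 2017-10-01 10:15 /user/hdfs/twitter_feed.log
-rw-r--r--   3 hdfs supergroup       9211 2017-10-01 10:16 /user/hdfs/app.log
```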

c) gsub(/[ ,]+/," ") => Only if b above matched (that is, the current input line is for a file), globally replace each run of consecutive spaces (or commas) with a single space, so that a single space can serve as the delimiter between the parts of the input line. The output could look like:
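For a hypothetical file line, after the substitution each field is separated by exactly one space:

```
-rw-r--r-- 3 hdfs supergroup 3290 2017-10-01 10:15 /user/hdfs/twitter_feed.log
```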

d) if condition => Our desired condition set. The fifth token in the output of an ls command is the size in bytes, so we check that the file's size in bytes is within our specified limits.

e) print $5,$8 => If the condition in d above evaluates to true, we print the file size in bytes (5th token) as well as the path/name of the file (8th token). If you want to use this as part of another script (for example, to create a HAR file) then you can print just the file path/name. The output would be:
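For a hypothetical file of 3290 bytes, the output would look like:

```
3290 /user/hdfs/twitter_feed.log
```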

 

Option 3:
I have also tried achieving the same using the fsimage: first obtain the fsimage (step a below), then convert it into a human-readable form, say XML, using the OIV (offline image viewer) parser (step b below), and finally write a custom XML analyzer to do the job. However, I have observed that this approach performs somewhat worse, as there is a gap between the creation of the fsimage, its verbose translation into human-readable form, and the final analysis. New files could always be created during this interval, and they would not be part of this fsimage (though of course they would be picked up next time). Essentially, whether that is acceptable depends on your real-time requirements.
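The first two steps can be sketched as follows (the local paths are hypothetical, and the fetched fsimage file name carries a transaction ID, matched here with a wildcard):

```shell
# a) Fetch the latest fsimage from the NameNode into a local directory
#    (requires HDFS admin privileges)
hdfs dfsadmin -fetchImage /tmp/fsimage_dump
# b) Convert it to XML with the offline image viewer (OIV); the actual file
#    name includes a transaction ID, e.g. fsimage_0000000000000000042
hdfs oiv -p XML -i /tmp/fsimage_dump/fsimage_* -o /tmp/fsimage.xml
```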

 

Option 4:
Most of the big data fraternity is still on pre-2.7 Hadoop releases, so for many this is not really an option, but those who are on a newer release can experiment with it. Hadoop releases from 2.7.x onwards include a find command, which you can check using:
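For example, from a Hadoop 2.7+ client:

```shell
# Show the usage of the built-in find command
hdfs dfs -help find
# A simple name-based search; note that the built-in find supports -name/-iname
# but no size test, which is why the options above remain useful
hdfs dfs -find / -name "twitter*" -print
```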

 

You can choose and implement any of the options above as per your requirements.

Let me know in case you have any queries and please do share your feedback or comments if this was helpful to you.
