Nov 10, 2017

Apache Nutch is the most complete open-source web crawler you can find for Java.

Highly extensible, highly scalable Web crawler
Nutch is a mature, production-ready Web crawler. Nutch 1.x enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.
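
To give a feel for what that configuration looks like, here is a minimal sketch that reads a Hadoop-style property file of the kind Nutch 1.x uses for site overrides (conf/nutch-site.xml). It uses only the JDK's XML parser, not Nutch or Hadoop classes. The class name NutchSiteSketch and the sample values are assumptions made for illustration; http.agent.name and fetcher.threads.fetch are existing Nutch properties.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of Nutch or Hadoop.
public class NutchSiteSketch {

    // Illustrative override file in the <configuration>/<property> layout.
    // The values are made up for this sketch.
    static final String SAMPLE =
        "<configuration>"
      + "<property><name>http.agent.name</name><value>example-crawler</value></property>"
      + "<property><name>fetcher.threads.fetch</name><value>20</value></property>"
      + "</configuration>";

    // Collects every <property> element into a name -> value map,
    // preserving the order in which the properties appear.
    static Map<String, String> parse(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Map<String, String> props = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            String name = property.getElementsByTagName("name")
                .item(0).getTextContent().trim();
            String value = property.getElementsByTagName("value")
                .item(0).getTextContent().trim();
            props.put(name, value);
        }
        return props;
    }

    public static void main(String[] args) throws Exception {
        // Print each override as "name = value".
        parse(SAMPLE).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

In a real installation Nutch loads these files itself; the sketch only shows that each setting is a plain name/value pair, which is what makes per-property overrides straightforward.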