PHP – Crawl websites from command line interface


Recently, I wrote a new crawler script to warm caches on some Magento websites. Today I'd like to share it with you, because I wrote it so that it works with many websites other than Magento, and on many platforms.

You can see the help content by running the crawler from the command line as shown below. The help is displayed when there is no sitemap.xml file next to the crawler, or when you pass the -help option on the command line.

php -f iz_crawler.php
Usage:  php -f crawler.php -- [options]

  -sitemap <list of files>     List of sitemap xml files, delimit by semicolon ; . Default is 'sitemap.xml'
  -website <website>           Website url for input. Will be ignored if -sitemap option selected or there is sitemap.xml file in the same directory with this crawler
  -depth <number>              Set depth level. Default is 0
  -interval <number>           Set scrap interval, measure in second(s). Defalt is 0
  -exclude <extensions>        Exclude link extensions like png, css, js, etc... delimit by semicolon ; . Default is "jpg;png;jpeg;pdf;7z;zip;rar;mp3;aac;mp4;apk;bat;tar;swf;iso"
  -verbose                     Display crawler output. Default is false
  -help                        This help

  Note: sitemap.xml default location is at root, and it will add initial urls for crawler, use -depth to make most use of sitemap.xml

  Example : php -f crawler.php  -- -website http://www.google.com -depth 1 -interval 0.5 -verbose -exclude "png;pdf;html"
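
As a rough illustration of how options like these could be read from the command line, here is a minimal sketch. It is not the code from iz_crawler.php: the function name parseCrawlerArgs and the shape of the returned array are assumptions made for this example, and only the option names and defaults are taken from the help text above.

```php
<?php
// Hypothetical helper: a sketch of option parsing, NOT the code from iz_crawler.php.
// Pass it the arguments that follow the script name, e.g. array_slice($argv, 1).
function parseCrawlerArgs(array $args): array
{
    // Defaults copied from the help text.
    $options = [
        'sitemap'  => ['sitemap.xml'],
        'website'  => null,
        'depth'    => 0,
        'interval' => 0.0,
        'exclude'  => explode(';', 'jpg;png;jpeg;pdf;7z;zip;rar;mp3;aac;mp4;apk;bat;tar;swf;iso'),
        'verbose'  => false,
        'help'     => false,
    ];

    for ($i = 0, $n = count($args); $i < $n; $i++) {
        $name = ltrim($args[$i], '-');
        $hasValue = ($i + 1 < $n);

        switch ($name) {
            case 'verbose':
            case 'help':
                // Flags take no value.
                $options[$name] = true;
                break;
            case 'sitemap':
            case 'exclude':
                // Semicolon-delimited list; empty entries are dropped.
                if ($hasValue) {
                    $items = explode(';', $args[++$i]);
                    $options[$name] = array_values(array_filter($items, 'strlen'));
                }
                break;
            case 'website':
                if ($hasValue) {
                    $options['website'] = $args[++$i];
                }
                break;
            case 'depth':
                if ($hasValue) {
                    $options['depth'] = max(0, (int) $args[++$i]);
                }
                break;
            case 'interval':
                // Seconds; fractions such as 0.5 are allowed.
                if ($hasValue) {
                    $options['interval'] = max(0.0, (float) $args[++$i]);
                }
                break;
            // Unknown options are silently ignored in this sketch.
        }
    }

    return $options;
}
```

With the example command from the help text, parseCrawlerArgs(array_slice($argv, 1)) would give a depth of 1, an interval of 0.5 seconds, verbose output enabled, and png, pdf and html as the excluded extensions.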

The help content explains most of the options, so I will only show what a run looks like here. I placed iz_crawler.php in the root directory of my website and executed the command “php -f iz_crawler.php -- -verbose”:

php -f iz_crawler.php -- -verbose
http://invisiblezero.net/
http://invisiblezero.net/tour-bac-ha-sapa-than-uyen-2013-2014/
http://invisiblezero.net/control-git-with-php/
http://invisiblezero.net/javascript-continuously-respectively-send-requests-to-server/
http://invisiblezero.net/magento-remove-redundant-attributes/
http://invisiblezero.net/contacts/
http://invisiblezero.net/mac-osx-clean-duplicate-entries-in-open-with-menu/
http://invisiblezero.net/treasure-chest/
http://invisiblezero.net/philippines-2013/
http://invisiblezero.net/services/
http://invisiblezero.net/mu-cang-chai-2013/
http://invisiblezero.net/what-starts-with-f-and-ends-with-k/
http://invisiblezero.net/amazon-ec2-common-problems-theres-no-swap-space/
http://invisiblezero.net/ban-gioc-2013/
http://invisiblezero.net/wordpress-update-coreplugins-without-ftp/
http://invisiblezero.net/varnish-cache-server-guru-meditation-503/
http://invisiblezero.net/mysql-general-error-2006-mysql-server-has-gone-away/
http://invisiblezero.net/ninh-binh/
http://invisiblezero.net/ruby-on-rails-library-not-loaded-libmysqlclient-18-dylib-loaderror/
http://invisiblezero.net/compiling-latest-version-of-php-on-mac-os-x/
http://invisiblezero.net/magento-installing-extensions-via-command-line/
http://invisiblezero.net/unix-find-more-ssh-power-with-ssh-config/
http://invisiblezero.net/mac-os-x-quickly-switch-between-windows-of-an-application/
http://invisiblezero.net/magento-e-display-global-messages-when-you-have-full-page-cache-turned-on/
http://invisiblezero.net/technical-debt-in-software-development/
http://invisiblezero.net/mac-os-x-completely-disable-dashboard/

Because I have a sitemap.xml file, the crawler reads it automatically. You can also specify one or more sitemap files, or scrape a website by its URL with the -website option, and the crawler will extract the links automatically.
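
To make the two input modes more concrete, here is a minimal sketch of reading URLs from a sitemap and following links up to a given depth. It is not the code from iz_crawler.php: the function names sitemapUrls, extractLinks and crawl are hypothetical, and so is the behaviour they encode (only same-host links are followed, a depth of 0 visits only the initial URLs, and the page fetcher is passed in as a callable).

```php
<?php
// Hypothetical helpers sketching the two input modes; NOT the code from iz_crawler.php.

/** Return the <loc> URLs found in a sitemap XML string. */
function sitemapUrls(string $xml): array
{
    $doc = new DOMDocument();
    // loadXML() returns false on malformed input; @ hides the parser warnings.
    if ($xml === '' || !@$doc->loadXML($xml)) {
        return [];
    }
    $urls = [];
    foreach ($doc->getElementsByTagName('loc') as $loc) {
        $urls[] = trim($loc->textContent);
    }
    return $urls;
}

/**
 * Return absolute same-host links from an HTML page, skipping excluded extensions.
 * Simplifications: ports and relative paths such as "page.html" are not handled.
 */
function extractLinks(string $html, string $baseUrl, array $excluded): array
{
    $base = parse_url($baseUrl);
    $doc = new DOMDocument();
    if ($html === '' || !isset($base['scheme'], $base['host']) || !@$doc->loadHTML($html)) {
        return [];
    }
    $origin = $base['scheme'] . '://' . $base['host'];
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // empty link or in-page anchor
        }
        if ($href[0] === '/' && substr($href, 0, 2) !== '//') {
            $href = $origin . $href; // root-relative link
        }
        $parts = parse_url($href);
        if (!isset($parts['scheme'], $parts['host']) || $parts['host'] !== $base['host']) {
            continue; // assumption: stay on the same host
        }
        $ext = strtolower(pathinfo($parts['path'] ?? '', PATHINFO_EXTENSION));
        if ($ext !== '' && in_array($ext, $excluded, true)) {
            continue;
        }
        $links[$href] = true; // array keys de-duplicate
    }
    return array_keys($links);
}

/** Breadth-first crawl; $fetch is any callable that returns the HTML for a URL. */
function crawl(array $seeds, int $depth, callable $fetch, array $excluded, float $interval = 0.0): array
{
    $visited = [];
    $frontier = $seeds;
    for ($level = 0; $level <= $depth && $frontier; $level++) {
        $next = [];
        foreach ($frontier as $url) {
            if (isset($visited[$url])) {
                continue;
            }
            $visited[$url] = true;
            $html = (string) $fetch($url); // requesting the page is what warms the cache
            if ($level < $depth) {
                foreach (extractLinks($html, $url, $excluded) as $link) {
                    $next[] = $link;
                }
            }
            if ($interval > 0) {
                usleep((int) ($interval * 1000000)); // pause between requests
            }
        }
        $frontier = $next;
    }
    return array_keys($visited);
}
```

In a real run the fetcher could be as simple as a closure around file_get_contents($url), and the seeds would come either from sitemapUrls(file_get_contents('sitemap.xml')) or from the single URL given on the command line.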

You can find the download link below 😀
PHP Crawler from IZ
