PHP – Crawl websites from command line interface


Recently, I wrote a new crawler script to warm caches on some Magento websites. Today I'd like to share it with you, because I wrote it so that it works with many websites other than Magento, and on many platforms.

You can see the help content by running the crawler from the command line as shown below. The help is displayed when there is no sitemap.xml file next to the crawler, or when you pass the -help option on the command line.

php -f iz_crawler.php
Usage:  php -f crawler.php -- [options]

  -sitemap <list of files>     List of sitemap xml files, delimit by semicolon ; . Default is 'sitemap.xml'
  -website <website>           Website url for input. Will be ignored if -sitemap option selected or there is sitemap.xml file in the same directory with this crawler
  -depth <number>              Set depth level. Default is 0
  -interval <number>           Set scrap interval, measure in second(s). Defalt is 0
  -exclude <extensions>        Exclude link extensions like png, css, js, etc... delimit by semicolon ; . Default is "jpg;png;jpeg;pdf;7z;zip;rar;mp3;aac;mp4;apk;bat;tar;swf;iso"
  -verbose                     Display crawler output. Default is false
  -help                        This help

  Note: sitemap.xml default location is at root, and it will add initial urls for crawler, use -depth to make most use of sitemap.xml

  Example : php -f crawler.php  -- -website -depth 1 -interval 0.5 -verbose -exclude "png;pdf;html"
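To make the option list above more concrete, here is a minimal sketch of how "-name value" style options like these could be parsed in PHP. This is not the code from iz_crawler.php: the function name parseCrawlerArgs() and the array layout are assumptions made for illustration, and only the option names and defaults come from the help text.

```php
<?php
// Hypothetical sketch: parse "-name value" options like those in the help text.
// parseCrawlerArgs() is an illustrative name, not a function from iz_crawler.php.
function parseCrawlerArgs(array $args): array
{
    // Defaults copied from the help output.
    $options = [
        'sitemap'  => ['sitemap.xml'],
        'website'  => null,
        'depth'    => 0,
        'interval' => 0.0,
        'exclude'  => explode(';', 'jpg;png;jpeg;pdf;7z;zip;rar;mp3;aac;mp4;apk;bat;tar;swf;iso'),
        'verbose'  => false,
        'help'     => false,
    ];
    $flags = ['verbose', 'help']; // options that take no value

    for ($i = 0; $i < count($args); $i++) {
        $arg = $args[$i];
        // Skip the "--" separator and anything that is not an option name.
        if ($arg === '--' || strncmp($arg, '-', 1) !== 0) {
            continue;
        }
        $name = substr($arg, 1);
        if (!array_key_exists($name, $options)) {
            continue; // unknown option, ignored in this sketch
        }
        if (in_array($name, $flags, true)) {
            $options[$name] = true;
            continue;
        }
        // A value is required; if it is missing or looks like another option, keep the default.
        if (!isset($args[$i + 1]) || strncmp($args[$i + 1], '-', 1) === 0) {
            continue;
        }
        $value = $args[++$i];
        switch ($name) {
            case 'depth':
                $options['depth'] = (int) $value;
                break;
            case 'interval':
                $options['interval'] = (float) $value;
                break;
            case 'sitemap':
            case 'exclude':
                // Semicolon-delimited lists; empty items are dropped.
                $options[$name] = array_values(array_filter(explode(';', $value), 'strlen'));
                break;
            default:
                $options[$name] = $value;
        }
    }
    return $options;
}

// Example call with a fixed argument list.
$opts = parseCrawlerArgs(['--', '-depth', '1', '-interval', '0.5', '-verbose', '-exclude', 'png;pdf;html']);
```

In a real command line script you would probably pass array_slice($argv, 1) instead of a fixed array.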

You can figure out a lot from the help content, so I will only show you how it looks here. I placed iz_crawler.php in the root directory of my website and executed this command: “php -f iz_crawler.php -- -verbose”:

php -f iz_crawler.php -- -verbose
Tour Bac Ha – Sapa – Than Uyen 2013 – 2014
Control Git with PHP
Javascript – continuously, respectively send requests to server
Magento – Remove redundant attributes
Mac OSX – clean duplicate entries in ‘open with’ menu
Treasure chest
Philippines 2013
Mu Cang Chai 2013
What Starts with F and ends with K
Amazon EC2 common problems – there’s no swap space
Bản Giốc 2013
WordPress – Update core/plugins without FTP
Varnish cache server – Guru Meditation 503 …
MySQL general error – 2006 MySQL server has gone away
Ninh Bình
Ruby on Rails – Library not loaded: libmysqlclient.18.dylib (LoadError)
Compiling latest version of PHP on Mac OS X
Magento – Installing Extensions via Command Line
Unix – find more SSH power with ssh config
Mac OS X – Quickly switch between windows of an application
Magento EE – display global messages when you have Full Page Cache turned on
Technical debt in software development
Mac OS X – Completely disable Dashboard

Because I have a sitemap.xml file, the crawler reads it automatically. You can also specify the sitemap file(s) yourself, or scrape a website by domain name, and the crawler will extract the links automatically.
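The two ways of collecting URLs described above (reading a sitemap, or extracting links from a page) can be sketched roughly like this. It is only an illustration under my own assumptions, not the actual code of the crawler: the function names sitemapUrls() and extractLinks() are made up, the exclude list mimics the -exclude option, and PHP's DOM extension is assumed to be available.

```php
<?php
// Hypothetical helpers; the names are illustrative and not taken from iz_crawler.php.

// Return every <loc> URL found in a sitemap XML string.
function sitemapUrls(string $xml): array
{
    if (trim($xml) === '') {
        return [];
    }
    $doc = new DOMDocument();
    if (!@$doc->loadXML($xml)) { // "@" hides warnings for malformed XML
        return [];
    }
    $urls = [];
    foreach ($doc->getElementsByTagName('loc') as $loc) {
        $urls[] = trim($loc->textContent);
    }
    return $urls;
}

// Return the unique href values of all <a> tags in an HTML string,
// skipping links whose path ends with one of the excluded extensions.
function extractLinks(string $html, array $excludeExtensions): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // "@" hides warnings from imperfect markup

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '') {
            continue;
        }
        // Look only at the path part, so query strings do not hide the extension.
        $path = (string) parse_url($href, PHP_URL_PATH);
        $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if ($ext !== '' && in_array($ext, $excludeExtensions, true)) {
            continue;
        }
        $links[] = $href;
    }
    return array_values(array_unique($links));
}
```

A crawler built on these helpers would still need to resolve relative links against the page URL and keep track of the current depth; both are left out here to keep the sketch short.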

You can find the download link below 😀
PHP Crawler from IZ