Thursday, 13 February 2014

WGET

--recursive: download the entire Web site (a complete example command follows this list of options).

--domains website.org: don't follow links outside website.org.

--no-parent: don't ascend above the starting directory (tutorials/html/ in this example).

--page-requisites: get all the elements that compose the page (images, CSS and so on).

--html-extension: save files with the .html extension.

--convert-links: convert links so that they work locally, off-line.

--restrict-file-names=windows: modify filenames so that they also work on Windows (in practice things often seem to work fine without it).

--no-clobber: don't overwrite any existing files (useful when an interrupted download is resumed).

--limit-rate=200k: limit the download speed to 200 KB/s.

--random-wait: wait a random amount of time between retrievals - websites don't like having their whole site downloaded mechanically.
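
Putting those long-form options together gives the full mirror command. This is a sketch using the example site and directory from the notes above (website.org and tutorials/html/):

wget \
    --recursive \
    --no-clobber \
    --page-requisites \
    --html-extension \
    --convert-links \
    --restrict-file-names=windows \
    --limit-rate=200k \
    --random-wait \
    --domains website.org \
    --no-parent \
    www.website.org/tutorials/html/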

-r: recursive - downloads the full website (short for --recursive; the combined one-liner appears after this list).

-p: downloads everything needed to display the page, even pictures (same as --page-requisites: the images, CSS and so on).

-E: saves files with the right extension (same as --html-extension); without it, most HTML and other files end up with no extension.

-e robots=off: ignore the site's robots.txt so we aren't treated like a crawler - websites don't like robots/crawlers unless they come from Google or another famous search engine.

-U mozilla: set the User-Agent to "mozilla", so the request looks like a browser viewing the page instead of a crawler like wget.

-o /websitedl/wget1.txt: log everything to /websitedl/wget1.txt (the filename follows -o as a separate argument, not with an equals sign).

-b: runs wget in the background, so progress isn't shown on the terminal (check the log file instead).

-O file: save the download to the single named file instead of the remote filename (short for --output-document).
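
All the short flags combine into a compact version of the same idea. The URL here is a placeholder - substitute the site you actually want:

wget -r -p -E -e robots=off -U mozilla -o /websitedl/wget1.txt -b http://www.example.com/

Since -b sends wget to the background, follow the progress with tail -f /websitedl/wget1.txt.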
