Update 2024-03-27: Greatly expanded the "Samples" page and renamed it to "Glossary".
Update 2024-04-04: Added 5 million mid-2011 posts from the k47 post dump. Browse (most of) them here.
Update 2024-04-07: Added ~400 October 2003 posts from 4chan.net. Browse them here.

Welcome to Oldfriend Archive, the official 4chan archive of the NSA. Hosting ~170M text-only 2003-2014 4chan posts (mostly 2006-2008).

[1187881809] wget and image galleries?

ID:88r1ZBQV No.8191
trying to get wget to cooperate with downloading individual pictures from a pr0n site

i've decided against spidering the content, since i don't want to be obvious (or piss off the webmaster too much). on closer examination, the method of retrieving the images seems pretty simple: there's a php script that takes an ID and a page number (readfile.php?arg_id=1234&arg_page=5, for example). so i decided that once i had the command working, i'd write a script to download all of the images, one after the other.
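the "one after the other" script could be sketched like this. the readfile.php?arg_id=...&arg_page=... pattern is from the post; the host name, the page count, and the output filenames are all made-up placeholders:

```shell
#!/bin/sh
# Sketch: fetch arg_page=1..PAGES for one gallery ID, one request at a time.
# BASE, ID, and PAGES are hypothetical values for illustration.
BASE="http://example.com/readfile.php"
ID=1234
PAGES=20

page=1
while [ "$page" -le "$PAGES" ]; do
    # Quoting the whole URL keeps the shell from treating '&' as
    # "run the command so far in the background".
    wget --load-cookies=cookies.txt -e robots=off \
         --user-agent="Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" \
         -O "page_${page}.jpg" \
         "${BASE}?arg_id=${ID}&arg_page=${page}"
    sleep 2   # pause between requests, so the webmaster has less to notice
    page=$((page + 1))
done
```

the sleep is optional, but it fits the "don't be obvious" goal above.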

i seem to have made steady progress toward a solution. telling wget to ignore robots.txt and to masquerade as a different browser got me past some of the errors, and figuring out that it's a good idea to escape ampersands stopped the shell from spawning another process. but right now i'm stuck at a brick wall, staring down the barrel of a 401 Unauthorized, whereas before i could at least download short little error .html files that gave me an idea of what was wrong.
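the ampersand thing, for anyone following along, is plain shell behavior: an unquoted `&` ends the command and runs it in the background, so wget never even sees arg_page. quoting the URL (or backslash-escaping each `&`) fixes it. the URL here is a made-up stand-in:

```shell
# Unquoted, the shell backgrounds wget at the '&' and arg_page=5 is lost:
#   wget http://example.com/readfile.php?arg_id=1234&arg_page=5
# Quoted, the full query string reaches wget as a single argument:
wget 'http://example.com/readfile.php?arg_id=1234&arg_page=5'
```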

here's the command as it stands (the URL included is fake).

$ wget --load-cookies=cookies.txt -e robots=off --user-agent="Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" <url removed>

for clarity: cookies.txt is simply what i copied from a firefox profile that had its cookies cleared and then visited the site in question. that said, i've got a couple of questions:

1. is it likely that i am missing any other obvious "pretend I'm the real thing" bits that might make wget behave?
2. is this obstacle likely to be something in the php script itself, so immense that trying to bypass it isn't worth the trouble?
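on question 1, the usual "pretend I'm the real thing" bits for wget are the User-Agent, the cookie jar, and the Referer header; a lot of image hosts check Referer specifically, which would explain a 401 even with valid cookies. a sketch, with hypothetical URLs standing in for the real ones:

```shell
# Hypothetical example: send cookies, a browser User-Agent, and a Referer
# pointing at the gallery page the image would normally be loaded from.
wget --load-cookies=cookies.txt \
     --user-agent="Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" \
     --referer="http://example.com/gallery.php?id=1234" \
     'http://example.com/readfile.php?arg_id=1234&arg_page=5'
```

`--referer` is a standard wget option; whether it's the missing piece here is a guess.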

the site in question is e-hentai.org. although it does require a hentaikey for accessing most of it, the doujin stuff does not, so i should be within the rules. i mention this last to keep the NWS content in this post to a minimum, and because i'm more interested in how they're keeping wget out than in the porn itself. "the more you know" and all that.

i honestly have no idea where else i could ask this question and have even a slim chance of getting an informed response, so help me anonymous