Here's something I've wanted to do for a while. There are these "paste" sites, most notably pastebin. These sites are used by hackers to transfer all kinds of stuff. There are plenty of tools already built to dig through pastebin, and the folks at pastebin know that too, which is why they have a PRO API offering. So, let's go somewhere else.
The idea of such a scraper is simple. Each paste has an ID - e.g. https://pastebin.com/ehxPQekC. We can brute-force these IDs and iterate over lots of random pastes. Some will be private, some will be deleted, but there might still be some stuff left we can use. It will take a loooong time though.
Go through anything and everything and look for interesting stuff. You can expect credentials, private keys, emails, personal data, or maybe even credit cards - so that's what we'll be looking for. Check each paste with regexes and take notes.
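The "detection engine" can start out as dumb as a handful of regexes. A minimal sketch of the idea (the patterns are just examples, extend to taste):

import re

# Example patterns only - tune these to whatever you want to catch.
PATTERNS = {
    "email": r"\S+@\S+\.\S+",
    "private_key": r"PRIVATE KEY----",
    "md5": r"(^|[^a-zA-Z0-9/=-])[a-fA-F0-9]{32}([^a-zA-Z0-9/]|$)",
}

def interesting(text):
    # Return the names of all patterns that hit; an empty list means a boring paste.
    return [name for name, pattern in PATTERNS.items() if re.search(pattern, text)]

print(interesting("contact me at someone@example.com"))  # ['email']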
Just as I published this, I stumbled upon this YARA ruleset. I will hopefully implement it later - I'm excited to finally try some YARA. YARA is the de facto gold standard for security detections. Most antiviruses use it, and a lot of the communication around vulnerabilities and threats uses it too.
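For reference, swapping the regexes for YARA would look roughly like this - a sketch assuming the yara-python package and a made-up toy rule, not the ruleset linked above:

import yara

# A toy rule - the real thing would be the linked ruleset compiled from files.
RULES = yara.compile(source="""
rule private_key
{
    strings:
        $pem = "PRIVATE KEY----"
    condition:
        $pem
}
""")

def interesting(text):
    # yara-python can match directly against in-memory data.
    return RULES.match(data=text)

print(interesting("-----BEGIN RSA PRIVATE KEY-----"))  # prints the matching rules, e.g. [private_key]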
The other part of the script is resilience and keeping a low enough profile. If we encounter an error, let's take a small break (like 10 minutes), just to be sure we don't overwhelm the service or get blocked. If there are many errors in a row, let's wrap it up - we are probably blocked. Also, keep a low but steady rate limit; 2 requests per second seems to be good enough. Another trick is to use a User-Agent belonging to a valid browser, usually the newest Chrome available. We should be patient and polite - it's a poor parasite that kills its host or triggers a removal.
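Sketched out on its own, the politeness logic is basically this (a standalone toy helper, not the final script - the real loop below folds it all together):

import time
import requests

# Pretend to be a current browser; a bare python-requests User-Agent sticks out.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"}

def polite_get(url, errors, backoff=600, max_errors=10, rate_limit=0.5):
    # Fetch one URL, updating the running error count; bail out when we look blocked.
    try:
        r = requests.get(url, headers=HEADERS, timeout=30)
        ok = r.status_code in (200, 404)   # 404 just means the paste is gone
    except requests.RequestException:
        r, ok = None, False
    if not ok:
        errors += 1
        if errors >= max_errors:
            raise RuntimeError("too many errors in a row - probably blocked")
        time.sleep(backoff)                # small break, ~10 minutes
    else:
        errors = 0
    time.sleep(rate_limit)                 # keep a steady ~2 requests per second
    return r, errors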
So, the original code had like 20 lines, and I slowly expanded it as I kept running into errors. I was lucky in my choice of paste site (won't disclose which one, left as an exercise for the reader): they've implemented incremental IDs. So I created a simple paste, noted the ID and counted down from there. Written in Python, here goes:
#!/usr/bin/env python3
import os
import re
import time

import requests

if __name__ == "__main__":
    dead = 0            # running error score
    limit = 10          # this many errors and we call it quits
    sleepin = 600       # break after an error, in seconds
    rate_limit = 0.5    # sleep between requests -> ~2 requests per second
    url = "https://[sorry_nope]/"
    outdir = "./found"
    i = 400000  # start
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"}

    # Detection engine, lol
    email_pattern = r"\S+@\S+\.\S+"
    md5_pattern = r"(^|[^a-zA-Z0-9/=-])[a-fA-F0-9]{32}([^a-zA-Z0-9/]|$)"
    private_pattern = r"PRIVATE KEY----"
    patterns = [email_pattern, md5_pattern, private_pattern]

    os.makedirs(outdir, exist_ok=True)  # make sure the output dir exists

    while i > 0:
        if i % 1000 == 0:
            print(f"[.] {i}: Progressin' {time.asctime(time.gmtime(time.time()))}")
        r = None
        try:
            r = requests.get(f"{url}{i}.txt", headers=headers, timeout=30)
            if r.status_code != 200 and r.status_code != 404:
                print(f"[!] {i}: {r.status_code} - {r.text}")
                time.sleep(sleepin)  # take a smol break
                dead += 1
            elif dead > 0:
                dead -= 1  # we're working again!
        except Exception as e:
            print(f"[!] {i}: {str(e)}")
            dead += 1
            time.sleep(sleepin)
        if dead >= limit:
            print(f"[!!] {i}: Seems that the service is dead. Whoopsie")
            break
        # run the detection engine! (only on pastes we actually fetched)
        if r is not None and r.status_code == 200 and any(re.search(pattern, r.text) for pattern in patterns):
            with open(f"{outdir}/{i}.txt", "w") as f:
                print(f"[+] {i}: {outdir}/{i}.txt")
                f.write(r.text)
        time.sleep(rate_limit)
        i -= 1
The script has been running for a couple of days now (without an interruption!) and I currently have like 80 MB of data to go through. That will take some time to process, so stay tuned and I'll see you soon-ish.
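Going through that pile will probably start with something dumb like this - a hypothetical first pass that just tallies which of the example patterns fired in each saved paste:

import os
import re

# Same example patterns as above; the goal is a rough overview of ./found
# before reading anything by hand.
PATTERNS = {
    "email": r"\S+@\S+\.\S+",
    "private_key": r"PRIVATE KEY----",
    "md5": r"(^|[^a-zA-Z0-9/=-])[a-fA-F0-9]{32}([^a-zA-Z0-9/]|$)",
}

counts = {name: 0 for name in PATTERNS}
for filename in os.listdir("./found"):
    with open(os.path.join("./found", filename), errors="replace") as f:
        text = f.read()
    for name, pattern in PATTERNS.items():
        if re.search(pattern, text):
            counts[name] += 1

print(counts)  # rough per-pattern tally of the saved pastes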