Web capture & analysis for threat intelligence

Hey everyone out there,

Has anybody tried crawling or scraping web pages to find sensitive information, or to spot potentially malicious URLs, e.g. pages that have been compromised and are now used for phishing?

I have been thinking about this a lot for a few days. I checked some websites that maintain huge databases of infected or malicious links, possibly phishing links, submitted by users or found using automated scanners. The question I'm most interested in is: how do these automated scanners work? Are they web crawlers underneath, and if so, what's the starting point of those crawlers? By starting point, I mean: do we first provide some seed URLs, from which the crawler scrapes all the hyperlinks on each page, then recursively visits those next URLs and repeats the same process?
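That seed-URL idea can be made concrete with a minimal breadth-first crawler sketch (Python standard library only; all names here are my own, and real scanners add much more: politeness rules, deduplication by content, JavaScript rendering, etc.):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, resolved against the page URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links ("/login", "page.html") to absolute URLs.
                    self.links.append(urljoin(self.base_url, value))

def crawl(seeds, max_pages=50):
    """Breadth-first crawl starting from a list of seed URLs."""
    queue = deque(seeds)
    seen = set(seeds)
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue  # unreachable or non-HTML page; skip it
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

The frontier (queue) plus the `seen` set is the core of every crawler; the differences between real scanners are mostly in how they prioritize the frontier and what analysis they run on each fetched page.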

Some useful online platforms for finding malicious URLs:

PhishTank
OpenPhish
urlscan.io
Hunchly
MalwareBazaar
(Please add more similar platforms in the comments if you know)

One thing I'd like to mention about Hunchly: I'm interested in their daily dark web report subscription, though not only for the dark web; surface web sites would be just as useful.
I was curious how crawlers, botnets, and similar systems work, so I wanted to build a basic project to learn more about the domain of phishing and threat intelligence. The idea: scrape websites across the internet, check whether each one is legitimate or infected, and if infected, add it to a spreadsheet or CSV file.
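One simple way to bootstrap the "legit or infected?" check is to compare crawled URLs against a published blocklist feed and log the verdicts to CSV. A sketch, assuming OpenPhish's free plain-text feed (one URL per line) lives at the URL below; all function names are hypothetical:

```python
import csv
import io
import urllib.request

# Assumption: OpenPhish's free feed, one known phishing URL per line.
FEED_URL = "https://openphish.com/feed.txt"

def download_feed(url=FEED_URL):
    """Fetch the feed over the network (not exercised in the demo)."""
    return urllib.request.urlopen(url, timeout=30).read().decode()

def load_feed(text):
    """Parse a one-URL-per-line feed into a set for fast membership tests."""
    return {line.strip() for line in text.splitlines() if line.strip()}

def flag_urls(candidates, blocklist):
    """Return (url, verdict) rows for every candidate URL."""
    return [(u, "infected" if u in blocklist else "unknown") for u in candidates]

def report_csv(rows):
    """Render the flagged rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["url", "verdict"])
    writer.writerows(rows)
    return buf.getvalue()
```

Exact string matching against a feed only catches already-reported URLs; real scanners layer heuristics (lookalike domains, page content, certificate age) on top, but a feed lookup is a reasonable first milestone for a learning project.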

I do know that many websites don't allow bots to scrape their data, and that it's against the TOS of most sites. To prevent it, many websites have set up defense mechanisms: detecting crawlers with ML models trained on large datasets, or plain rate limiting (a very basic defense).
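On the crawler side, the usual answer to those defenses is to be a polite bot: honor robots.txt and throttle requests per host. Both can be sketched with the standard library (class and function names are my own; real crawlers also honor Crawl-delay and per-site policies):

```python
import time
import urllib.robotparser

def allowed(robots_txt_lines, url, agent="my-crawler"):
    """Check a URL against already-fetched robots.txt rules."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt_lines)  # takes the file as a list of lines
    return rp.can_fetch(agent, url)

class RateLimiter:
    """Enforce a minimum delay between requests to the same host."""
    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last = {}  # host -> monotonic timestamp of last request

    def wait(self, host):
        now = time.monotonic()
        elapsed = now - self.last.get(host, float("-inf"))
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last[host] = time.monotonic()
```

Politeness won't defeat ML-based bot detection, but it keeps a learning project from hammering anyone's servers and getting IP-banned on day one.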

Please let me know if you have any ideas about this. Also, what would be the best programming language for a web crawler, keeping in mind speed and each language's concurrency/multiprocessing model? So far I've heard about Go and JS; not sure about Python, though I've seen many crawlers written in it as well.
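For what it's worth, crawling is network-bound, so even Python's threads (despite the GIL) parallelize it well. A sketch with `concurrent.futures` (the injectable `fetch` parameter is my own design, so the concurrency can be tried without network access):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def default_fetch(url):
    """Blocking fetch; I/O-bound work like this suits threads in Python."""
    with urlopen(url, timeout=10) as resp:
        return resp.read()

def fetch_all(urls, fetch=default_fetch, workers=8):
    """Fetch many URLs concurrently; returns {url: body_or_exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for fut, url in futures.items():
            try:
                results[url] = fut.result()
            except Exception as exc:
                results[url] = exc  # keep failures visible instead of dropping them
    return results
```

Go's goroutines and JS's async I/O give the same overlap of network waits with less ceremony at higher scale, but for a learning project the language matters less than getting the frontier, politeness, and analysis pieces right.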

submitted by /u/RoninPark

May 16, 2023