This is a web crawler and search tool that can be used on any site on the Internet. It should be able to select a website by specific criteria and then look for a certain item within its content. This is not a keyword search; it is more of a link checker for the links inside the content.
This should work on most big websites. For example, http://www.wikihow.com/Find-a-Low-Airfare is one of the thousands of articles they have on their site.
I would like to be able to crawl each category page by page and find the anchor link domains that are not http://www.wikihow.com or advertising links. In this case we would want to populate a text file with:
http://www.airfarewatchdog.com
http://southwest.com
http://www.airbank-travel.com
http://www.priceline.com
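To make the filtering step concrete, here is a rough sketch of how the outside domains could be pulled from one page's HTML. It assumes Python with only the standard library. The names `AnchorCollector`, `base_host`, `external_domains` and the short `AD_HOSTS` list are made up for illustration, and a real ad filter would need a much longer list. It looks at every anchor in the HTML it is given, so limiting it to the article body would still need a per-site rule.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical ad hosts, for illustration only; a real list would be longer.
AD_HOSTS = {"doubleclick.net", "googlesyndication.com"}


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def base_host(host):
    """Lowercase a host name and drop a leading 'www.'."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def external_domains(html, site_host, ad_hosts=AD_HOSTS):
    """Return outside domains linked from html, in first-seen order."""
    parser = AnchorCollector()
    parser.feed(html)
    site = base_host(site_host)
    found = []
    for href in parser.hrefs:
        parts = urlparse(href)
        # Skip relative links, mailto:, javascript: and similar.
        if parts.scheme not in ("http", "https") or not parts.hostname:
            continue
        host = base_host(parts.hostname)
        # Skip the crawled site itself and its subdomains.
        if host == site or host.endswith("." + site):
            continue
        # Skip anything on the (assumed) ad host list.
        if any(host == ad or host.endswith("." + ad) for ad in ad_hosts):
            continue
        if host not in found:
            found.append(host)
    return found
```

Calling `external_domains(page_html, "www.wikihow.com")` on the article above would be expected to return a list like the one shown, one domain per entry, ready to be written to the text file.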
Then I would like to see if these domains are available. The newer the article, the less likely its linked domains are to be available, so I would rather it search each category oldest article first.
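As a first pass at the availability check, something like the sketch below could flag candidates. It assumes Python's standard `socket` module and uses a DNS lookup as a stand-in: a name that fails to resolve is only a hint that the domain may be unregistered, not proof, so the flagged names would still need a WHOIS or registrar check. The function name `looks_unregistered` and the `resolve` parameter are made up for illustration.

```python
import socket


def looks_unregistered(domain, resolve=socket.getaddrinfo):
    """Return True if the domain does not resolve in DNS.

    This is a heuristic only: registered domains can have no DNS
    records, so a True result just marks the name for a closer look.
    """
    try:
        resolve(domain, None)
    except socket.gaierror:
        return True
    return False
```

For example, `[d for d in domains if looks_unregistered(d)]` would narrow the text file down to the names worth checking with a registrar.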
Let me know if this makes sense. There are several sites like this, so I would want it to work on all of them.
My goal is to go through an extremely large website with tons of pages of content without having to do it manually. The sites are not mine; I just want to access them. Almost every page on this site has links within the content. I'm not talking about all the links on the page, just the 1-3 links within the body of the article. I then want it to write the links to a text file or, better yet, check whether each domain is available for purchase.
One thing that needs to be handled is categories: the crawl should be broken down by category, so it doesn't search the whole site at once. I should be able to do this on multiple sites. Is this something you can do?
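The category breakdown could look roughly like this sketch, which handles one category at a time, goes through its articles oldest first, and writes one text file per category. It assumes Python, that each article comes with a publish date as a `YYYY-MM-DD` string (the site may not expose this), and that category names are safe to use as file names. `run_by_category` and `domains_for_article` are made-up names; the second stands in for whatever fetches a page and extracts its outside domains.

```python
from pathlib import Path


def run_by_category(categories, domains_for_article, out_dir):
    """Write one text file of outside domains per category.

    categories maps a category name to a list of (date, article_url)
    pairs; domains_for_article(url) returns that article's domains.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for name, articles in categories.items():
        seen = []
        # ISO date strings sort chronologically, so oldest comes first.
        for _, url in sorted(articles):
            for domain in domains_for_article(url):
                if domain not in seen:
                    seen.append(domain)
        text = "\n".join(seen) + ("\n" if seen else "")
        (out_dir / f"{name}.txt").write_text(text, encoding="utf-8")
        results[name] = seen
    return results
```

Because each category is independent, the same function could be pointed at a different site by swapping in another category list and another `domains_for_article`.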
I want to be able to scan existing articles on sites like wikihow.com and pull out the links inside the articles that are not wikiHow's own links or ads. The sites I listed above are examples of what would get pulled out.
Then I want to check and see if airfarewatchdog.com is available and so on.