Extracting hyperlinks from a webpage (Excel help needed)
jdat
Do any of you know an easy way to generate lists of all hyperlinks contained inside a webpage?

I have pages with hundreds of links and I need to copy them all, so I need a quick and easy way to do it...


I'm sure I might be able to generate lists through a web editor or something?

please pretty please? :)



(look at the bottom for the Excel help I need)
Akridrot
quote:
Originally posted by jdat
Do any of you know an easy way to generate lists of all hyperlinks contained inside a webpage?

I have pages with hundreds of links and I need to copy them all, so I need a quick and easy way to do it...


I'm sure I might be able to generate lists through a web editor or something?

please pretty please? :)


An idea:

Copy source, and do a regex to remove all the tags?

Simple as hell.

THIS IS ASSUMING that it's a huge dump of links on one page.

edit: You could do this in many text editors, just do a "Replace all" with "" (leave it blank) for "<a href" and then for "</a>".

If you want to completely remove all tags with regex, I think you'd do something like: <[^<>]*>
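
A minimal PHP sketch of that same replace-all idea (page.html is just a placeholder for a locally saved copy of the page):

<?php
// Read a saved copy of the page ("page.html" is a placeholder path).
$html = file_get_contents('page.html');

// Same as the editor replace-all: delete every tag, keep the text between them.
echo preg_replace('/<[^<>]*>/', '', $html);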
jdat
hmm, regex editing with what?

I know what regex is but I've never done anything involving massive regex editing and crap


and no, this isn't a list on a webpage, it's webpages with various pictures, text, etc. ... like real pages, you know :p

this pisses me off

I can open the webpages in something like Rapid PHP and it displays all the hyperlinks in a code explorer window, but there's no way to copy the whole list .... bah, I'm gonna go bootleg and screen-capture and OCR the stuff :wtf:
Akridrot
Major regex editing? DUDE, it's really not that hard. Maybe if you showed me the page?

And if they are ALL hyperlinks, it's safe to assume that they'd all be in the same kind of tags: <a href>

So then you'd do a regex on the source to copy all the text between <a href> and </a>, regardless of what it is, and output it to a file.
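
In PHP that would be something like the rough sketch below (page.html and links.txt are placeholder names, and the pattern is untested against any real page):

<?php
$html = file_get_contents('page.html'); // placeholder: a saved copy of the page

// Capture everything between <a ...> and </a>, tags excluded.
preg_match_all('/<a\s[^>]*>(.*?)<\/a>/is', $html, $matches);

// Dump one entry per line to a file.
file_put_contents('links.txt', implode("\n", $matches[1]) . "\n");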


edit: Btw, I'd like to know what OCR stuff you plan on using.
jdat
well, what the hell do I use to do this: <[^<>]*> ?


Sounds lovely
Akridrot
quote:
Originally posted by jdat
well, what the hell do I use to do this: <[^<>]*> ?


Sounds lovely


http://notepad-plus.sourceforge.net/uk/site.htm

because you don't have PHP.

I could do this for you in PHP if you coughed up a link.
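
For reference, a rough sketch of the extraction itself in PHP (the file name and the exact pattern are assumptions, not something tested against these particular pages):

<?php
$html = file_get_contents('page.html'); // placeholder path

// Pull the URL out of every <a ... href="..."> tag.
preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $matches);

// One URL per line.
echo implode("\n", $matches[1]), "\n";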
jdat
ok how the heck do I run this?

<[^<>]*>
jdat
ok I found something that works great!

Little app called Selected links, from http://mikos.boom.ru/
It works only with Internet Explorer.
You select the hyperlinks you want to extract and it copies them to the clipboard.

Does just what I need :D


Now I have another problem .... I just noticed I need to compare the pages I'm getting all the URLs from with a page of already used URLs...
Need an automated way to do this :(

Long story short, the current page of URLs looks like this:
A
D
E
G
H

The new catches look like:
A
B
C
D
F
H
I

I need a way to take B, C, and I from the new list and put them in a separate list, as these are the links I need to check individually...
bah, this is doing my head in

And the letters don't reflect anywhere near the actual number of links I'm working with .... I was at this for 20 minutes and I already got 2500 links :wtf:
of which 800 or so are new, and these are the ones I need to put somewhere else...


hmmm, gonna try finding some spreadsheet formulas, as I've already been using those to remove duplicates
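
For what it's worth, this new-versus-used comparison is also a few lines of PHP; a sketch, assuming one URL per line in two files (used.txt and new.txt are placeholder names):

<?php
$used = file('used.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$new  = file('new.txt',  FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

// Keep the entries of $new that never appear in $used (the B, C, I above).
$fresh = array_diff($new, $used);

file_put_contents('fresh.txt', implode("\n", $fresh) . "\n");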
jdat
quote:
Originally posted by josh4
you're doing this in Excel? are the pages of URLs different Excel files?



I'm generating the lists from a website with some external app that extracts all the URLs.
Then I paste everything into Excel just to clean up ... it's really just straightforward text, with no link name, only the hyperlink.

I suck at Excel formulas (forgot them all) and I'm using CSVed to remove dups ... yeah, n00b :(
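
The duplicate removal CSVed is doing would be the same kind of sketch (urls.txt is a placeholder name, and this assumes the URLs match exactly, character for character):

<?php
$urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

// array_unique() drops repeated URLs, keeping the first occurrence of each.
file_put_contents('urls.txt', implode("\n", array_unique($urls)) . "\n");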