Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
Home
DiscussionsAccessExcelInfoPathOutlookPowerPointPublisherWord
DirectoryUser Groups
Related Topics
Outlook ExpressInternet ExplorerWindowsMS Server ProductsMore Topics ...

MS Office Forum / Excel / New Users / May 2008

Tip: Looking for answers? Try searching our database.

Extract domain names out of URLs

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
MikeB - 20 Apr 2008 18:17 GMT
Hi, I have a list of URLs and I would like to find a way to extract
the domain name from the URL.

The URLs can be prefixed with www in some cases and I'd also like to
strip that off.

So, for the following URLs
http://www.wiseclerk.com/group-news/tag/libor
http://www.wiseclerk.com/group-news/2008/04/
http://www.wiseclerk.com/group-news/
http://www.tiscali.co.uk/search/results.php?section=&from=&query=setana
sports
http://www.techdirt.com/articles/20080408/223932792.shtml
http://www.techdirt.com/articles/20080408/223932792.shtml
http://www.techdirt.com/
http://www.techdirt.com/
http://techdirt.com/index.php
http://techdirt.com/blog.php?tag=loans

I'd like to extract

wiseclerk.com
tiscali.co.uk
techdirt.com

I'm struggling to find the correct algorithm to locate the starting
point (after the http:// or after the http://www.) and the ending
point (the first /) for my Mid function.

Any help appreciated.

I'm thinking that I might have to do a find/search function for either
of the two starting strings and then have the "max" value (for the
longer of the two strings) but that's a lot of extra columns in my
spreadsheet.

Is there perhaps a better way to do this?
Rick Rothstein (MVP - VB) - 20 Apr 2008 18:52 GMT
I think this formula does what you want and handles all the various
possibilities...

=IF(LEFT(LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUTE(A1,"http://","")&"/")-1),4)="www.",MID(LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUTE(A1,"http://","")&"/")-1),5,256),LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUTE(A1,"http://","")&"/")-1))

Rick

> Hi, I have a list of URLs and I would like to find a way to extract
> the domain name from the URL.
[quoted text clipped - 33 lines]
>
> Is there perhaps a better way to do this?
Howard Kaikow - 20 Apr 2008 20:48 GMT
> I think this formula does what you want and handles all the various
> possibilities...

=IF(LEFT(LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUTE(A1,"http://","
")&"/")-1),4)="www.",MID(LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUT
E(A1,"http://","")&"/")-1),5,256),LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",
SUBSTITUTE(A1,"http://","")&"/")-1))

That does not work for URLs that start with, say, ftp.

The type of the URL is relevant, e.g., what about ftp.xyz.com and
www.xyz.com, both  should result in same doughmain.

In this case, a worksheet function may be better..
Rick Rothstein (MVP - VB) - 21 Apr 2008 16:49 GMT
Well, I could modify my formula to account for the FTP protocol (and, by
changing the approach, maybe so others even), but there seems to be so many
protocols possible... even if you wrote a function to handle them, which
protocols would you select to handle?

Rick

>> I think this formula does what you want and handles all the various
>> possibilities...
[quoted text clipped - 10 lines]
>
> In this case, a worksheet function may be better..
Howard Kaikow - 22 Apr 2008 21:50 GMT
> Well, I could modify my formula to account for the FTP protocol (and, by
> changing the approach, maybe so others even), but there seems to be so many
> protocols possible... even if you wrote a function to handle them, which
> protocols would you select to handle?

There is no simple answer.
It's up to the programmer.

A function may be easier because there are likely Windows APIs for parsing
URLs.
Rick Rothstein (MVP - VB) - 22 Apr 2008 22:43 GMT
>> Well, I could modify my formula to account for the FTP protocol (and, by
>> changing the approach, maybe so others even), but there seems to be so
[quoted text clipped - 7 lines]
> A function may be easier because there are likely Windows APIs for parsing
> URLs.

Fair enough. Here is a possible Function solution to this question which
makes use of the UrlGetPart API to get to the "company name". For those
following this thread, add a Module to the project (Insert/Module from the
VBA editor) and copy/paste the code after my signature into its code window.
Inside the GetCompanyName function, I have seeded the Protocols constant
with "www" and "ftp"... you can add any other protocols you think necessary
in this constant, just make sure each protocol is separated by a dot. You
would use this function just like any other worksheet function. So, for
example, if A1 contains the URL, then put this formula in the cell you want
to extract the company name to...

=GetCompanyName(A1)

It is that simple (assuming I didn't screw up my code that is<g>).

Rick

Private Declare Function UrlGetPart Lib "shlwapi" _
               Alias "UrlGetPartA" _
              (ByVal pszIn As String, _
               ByVal pszOut As String, _
               pcchOut As Long, _
               ByVal dwPart As Long, _
               ByVal Flags As Long) As Long

Private Const MAX_PATH As Long = 260
Private Const URL_PART_HOSTNAME As Long = 2

Public Function GetCompanyName(URL As String) As String
  Dim dwPart As Long
  Dim Flags As Long
  Dim Part As String
  Dim Size As Long
  Dim Host As String
  '  Protocols is a **dot** delimited string
  Const Protocols As String = "www.ftp"
  If Len(URL) > 0 Then
     If InStr(URL, "//") = 0 Then URL = "http://" & URL
     Part = Space$(MAX_PATH)
     Size = Len(Part)
     If UrlGetPart(URL, Part, Size, URL_PART_HOSTNAME, 0&) = 0 Then
        GetCompanyName = Left$(Part, Size)
     End If
     Host = Left(GetCompanyName, InStr(GetCompanyName, "."))
     If InStr("." & Protocols & ".", "." & Host) Then
       GetCompanyName = Mid(GetCompanyName, Len(Host) + 1)
     End If
  End If
End Function
Howard Kaikow - 23 Apr 2008 07:25 GMT
> Fair enough. Here is a possible Function solution to this question which
> makes use of the UrlGetPart API to get to the "company name". For those
[quoted text clipped - 6 lines]
> example, if A1 contains the URL, then put this formula in the cell you want
> to extract the company name to...

It is almost always better to use the Unicode version of APIs.
Rick Rothstein (MVP - VB) - 23 Apr 2008 09:24 GMT
>> Fair enough. Here is a possible Function solution to this question which
>> makes use of the UrlGetPart API to get to the "company name". For those
[quoted text clipped - 12 lines]
>
> It is almost always better to use the Unicode version of APIs.

You are probably right; but I have zero experience with international
programming issues (all my programming efforts for the past 27 years have
been US based), whether involving fonts or not. The end result is I have no
idea if there is anything special  that needs to be accounted for or not
when dealing with them. My gut feeling is simply using UrlGetPartW in place
of UrlGetPart is not the whole answer.

Rick
Howard Kaikow - 23 Apr 2008 10:17 GMT
I really tied up now, but the following  will give you an idea on how  to
Convert to Unicode.
Apparently, the code is still not parsing as desired.

Option Explicit

Private Declare Function UrlGetPart Lib "shlwapi" _
               Alias "UrlGetPartW" _
              (ByVal pszIn As Long, _
               ByVal pszOut As Long, _
               pcchOut As Long, _
               ByVal dwPart As Long, _
               ByVal Flags As Long) As Long

Private Const MAX_PATH As Long = 260
Private Const URL_PART_HOSTNAME As Long = 2

Public Function GetCompanyName(URL As String) As String
  Dim dwPart As Long
  Dim Flags As Long
  Dim Part As String
  Dim Size As Long
  Dim Host As String
  '  Protocols is a **dot** delimited string
  Const Protocols As String = "www.ftp"
  If Len(URL) > 0 Then
     If InStr(URL, "//") = 0 Then URL = "http://" & URL
     Part = Space$(MAX_PATH)
     Size = Len(Part)
     If UrlGetPart(StrPtr(URL), StrPtr(Part), Size, URL_PART_HOSTNAME, 0&)
= 0 Then
        GetCompanyName = Left$(Part, Size)
     End If
     Host = Left(GetCompanyName, InStr(GetCompanyName, "."))
     If InStr("." & Protocols & ".", "." & Host) Then
       GetCompanyName = Mid(GetCompanyName, Len(Host) + 1)
     End If
  End If
End Function

Private Sub btnByeBye_Click()
   Unload Me
End Sub

Private Sub btnRunMe_Click()
   With lstResult
       .AddItem GetCompanyName("http://www.BagelsAndLox.com/")
       .AddItem GetCompanyName("http://BagelsAndLox.com/")
       .AddItem GetCompanyName("www.BagelsAndLox.com")
       .AddItem GetCompanyName("BagelsAndLox.com")
       .AddItem GetCompanyName("http://www.Bob.BagelsAndLox.com/")
       .AddItem GetCompanyName("http://Carol.BagelsAndLox.com/")
       .AddItem GetCompanyName("www.Ted.BagelsAndLox.com")
       .AddItem GetCompanyName("Alice.BagelsAndLox.com")
   End With
End Sub
Howard Kaikow - 23 Apr 2008 10:26 GMT
A betta test is:

Private Sub btnRunMe_Click()
   With lstResult
       .AddItem GetCompanyName("http://www.BagelsAndLox.com/")
       .AddItem GetCompanyName("http://BagelsAndLox.com/")
       .AddItem GetCompanyName("www.BagelsAndLox.com")
       .AddItem GetCompanyName("BagelsAndLox.com")
       .AddItem GetCompanyName("http://www.Bob.BagelsAndLox.com/")
       .AddItem GetCompanyName("http://Carol.BagelsAndLox.com/")
       .AddItem GetCompanyName("www.Ted.BagelsAndLox.com")
       .AddItem GetCompanyName("Alice.BagelsAndLox.com")

       .AddItem GetCompanyName("http://www.BagelsAndLox.com/Alpha.htm")
       .AddItem GetCompanyName("http://BagelsAndLox.com/Beta.html")
       .AddItem GetCompanyName("http://www.Bob.BagelsAndLox.com/Gamma.doc")
       .AddItem GetCompanyName("http://Carol.BagelsAndLox.com/Delta.jpg")
   End With
End Sub
Nick Hodge - 20 Apr 2008 19:07 GMT
Mike

Try something like

=IF(ISERR(FIND("www",A1,1)),MID(A1,8,FIND("/",A1,8)-8),MID(A1,12,FIND("/",A1,12)-12))

this works on the theory that there is a trailing / after the domain
somewhere, it will need amending if there is not, e.g
http://excelusergroup.org

Signature

HTH
Nick Hodge
Microsoft MVP - Excel
Southampton, England
nick_hodgeTAKETHISOUT@zen.co.ukANDTHIS
web: www.excelusergroup.org
web: www.nickhodge.co.uk

> Hi, I have a list of URLs and I would like to find a way to extract
> the domain name from the URL.
[quoted text clipped - 33 lines]
>
> Is there perhaps a better way to do this?
MikeB - 20 Apr 2008 19:13 GMT
> Mike
>
[quoted text clipped - 51 lines]
>
> > Is there perhaps a better way to do this?

Hey Nick! Some years ago I had the pleasure of living in Romsey near
Southampton. I just saw your signature and that brought back fond
memories.
Rick Rothstein (MVP - VB) - 20 Apr 2008 19:17 GMT
That doesn't seem to work for URLs like these...

www.techdirt.com/articles/20080408/223932792.shtml

http://www.com/

Rick

> Mike
>
[quoted text clipped - 43 lines]
>>
>> Is there perhaps a better way to do this?
Ron Rosenfeld - 23 Apr 2008 03:08 GMT
>Hi, I have a list of URLs and I would like to find a way to extract
>the domain name from the URL.
[quoted text clipped - 33 lines]
>
>Is there perhaps a better way to do this?

Since you indicated that these are URL's, I will assume they have a normal
structure, so the domain name will be the part between the first // and the
next /.  Then we just remove the www. if it exists.

=SUBSTITUTE(MID(A1,FIND("/",A1)+2,FIND(CHAR(1),
SUBSTITUTE(A1,"/",CHAR(1),3))-FIND("/",A1)-2),"www.","")

--ron
Ron Rosenfeld - 23 Apr 2008 03:15 GMT
>Hi, I have a list of URLs and I would like to find a way to extract
>the domain name from the URL.
[quoted text clipped - 33 lines]
>
>Is there perhaps a better way to do this?

Missed something -- the case where there is no final "/".

So use this instead:

=SUBSTITUTE(MID(A1,FIND("/",A1)+2,IF((LEN(A1)-LEN(SUBSTITUTE(A1,"/","")))=2,
255,FIND(CHAR(1),SUBSTITUTE(A1,"/",CHAR(1),3))-FIND("/",A1)-2)),"www.","")

--ron
Ron Rosenfeld - 23 Apr 2008 14:38 GMT
>Hi, I have a list of URLs and I would like to find a way to extract
>the domain name from the URL.
[quoted text clipped - 33 lines]
>
>Is there perhaps a better way to do this?

Here is a UDF that will probably do a better job, given all the variables that
have been proposed in this thread.

Note that the line that starts with  re.Pattern = and the subsequent line(s)
within quotation marks should be all on one line.

======================================
Option Explicit
Function ExtrURL(str As String) As String
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Global = True
re.Pattern =
"\b((https?|ftp)://)?([\-A-Z0-9.]+)(/[\-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[\-A-Z0-9+&@#/%=~_|!:,.;]*)?"
If re.test(str) = True Then
   Set mc = re.Execute(str)
   ExtrURL = mc(mc.Count - 1).submatches(2)
   ExtrURL = Replace(ExtrURL, "www.", "")
   If InStr(1, ExtrURL, ".") = 0 Then ExtrURL = ""
End If
End Function
==================================

--ron
Rick Rothstein (MVP - VB) - 23 Apr 2008 15:23 GMT
> re.Pattern =
> "\b((https?|ftp)://)?([\-A-Z0-9.]+)(/[\-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[\-A-Z0-9+&@#/%=~_|!:,.;]*)?"

Now that is what I miss about Regular Expressions from my days many years
ago working with them in the UNIX world... their clarity and readability.<g>

Rick
Ron Rosenfeld - 23 Apr 2008 19:33 GMT
On Wed, 23 Apr 2008 10:23:45 -0400, "Rick Rothstein \(MVP - VB\)"
<rick.newsNO.SPAM@NO.SPAMverizon.net> wrote:

>> re.Pattern =
>> "\b((https?|ftp)://)?([\-A-Z0-9.]+)(/[\-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[\-A-Z0-9+&@#/%=~_|!:,.;]*)?"
[quoted text clipped - 3 lines]
>
>Rick

<ggg>

And even when you write out the explanation:

===============================
URL capturing

\b((https?|ftp)://)?([-A-Z0-9.]+)(/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?

Options: case insensitive

Assert position at a word boundary «\b»
Match the regular expression below and capture its match into backreference
number 1 «((https?|ftp)://)?»
  Between zero and one times, as many times as possible, giving back as needed
(greedy) «?»
  Match the regular expression below and capture its match into backreference
number 2 «(https?|ftp)»
     Match either the regular expression below (attempting the next
alternative only if this one fails) «https?»
        Match the characters “http” literally «http»
        Match the character “s” literally «s?»
           Between zero and one times, as many times as possible, giving back
as needed (greedy) «?»
     Or match regular expression number 2 below (the entire group fails if
this one fails to match) «ftp»
        Match the characters “ftp” literally «ftp»
  Match the characters “://” literally «://»
Match the regular expression below and capture its match into backreference
number 3 «([-A-Z0-9.]+)»
  Match a single character present in the list below «[-A-Z0-9.]+»
     Between one and unlimited times, as many times as possible, giving back
as needed (greedy) «+»
     The character “-” «-»
     A character in the range between “A” and “Z” «A-Z»
     A character in the range between “0” and “9” «0-9»
     The character “.” «.»
Match the regular expression below and capture its match into backreference
number 4 «(/[-A-Z0-9+&@#/%=~_|!:,.;]*)?»
  Between zero and one times, as many times as possible, giving back as needed
(greedy) «?»
  Match the character “/” literally «/»
  Match a single character present in the list below
«[-A-Z0-9+&@#/%=~_|!:,.;]*»
     Between zero and unlimited times, as many times as possible, giving back
as needed (greedy) «*»
     The character “-” «-»
     A character in the range between “A” and “Z” «A-Z»
     A character in the range between “0” and “9” «0-9»
     One of the characters “+&@#/%=~_|!:,.;” «+&@#/%=~_|!:,.;»
Match the regular expression below and capture its match into backreference
number 5 «(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?»
  Between zero and one times, as many times as possible, giving back as needed
(greedy) «?»
  Match the character “?” literally «\?»
  Match a single character present in the list below
«[-A-Z0-9+&@#/%=~_|!:,.;]*»
     Between zero and unlimited times, as many times as possible, giving back
as needed (greedy) «*»
     The character “-” «-»
     A character in the range between “A” and “Z” «A-Z»
     A character in the range between “0” and “9” «0-9»
     One of the characters “+&@#/%=~_|!:,.;” «+&@#/%=~_|!:,.;»

Created with RegexBuddy
======================================
--ron
Harlan Grove - 27 Apr 2008 04:27 GMT
Ron Rosenfeld <ronrosenf...@nospam.org> wrote...
...
[reformatted]
>re.Pattern = "\b((https?|ftp)://)?([\-A-Z0-9.]+)" & _
> "(/[\-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[\-A-Z0-9+&@#/%=~_|!:,.;]*)?"
...

Why so verbose?

re.Pattern = "[^:]*:(//)?[^/:]*?([^./:]+\.[^./:]+(\.[a-z]{2})?)[:/].*"
ExtrURL = re.Replace(str, "$2")
Ron Rosenfeld - 27 Apr 2008 12:46 GMT
>Ron Rosenfeld <ronrosenf...@nospam.org> wrote...
>...
[quoted text clipped - 7 lines]
>re.Pattern = "[^:]*:(//)?[^/:]*?([^./:]+\.[^./:]+(\.[a-z]{2})?)[:/].*"
>ExtrURL = re.Replace(str, "$2")

It's a pattern (from a library) that captures the different URL parts into
different backreferences, so does more than what the OP requested.

But I did test against all the content mentioned in the thread.

Running a quick test, using
============================
Function Extr(str As String) As String
Dim re As Object
   Set re = CreateObject("vbscript.regexp")
       re.IgnoreCase = True
       re.Global = True
       re.Pattern = "[^:]*:(//)?[^/:]*?([^./:]+\.[^./:]+(\.[a-z]{2})?)[:/].*"
   If re.Test(str) = True Then
       Extr = re.Replace(str, "$2")
   End If
End Function
================================

Your pattern doesn't seem to match:

http://excelusergroup.org
www.techdirt.com/articles/20080408/223932792.shtml

www.BagelsAndLox.com
BagelsAndLox.com
www.Ted.BagelsAndLox.com
Alice.BagelsAndLox.com

and won't extract the URL from
    .AddItem GetCompanyName("www.BagelsAndLox.com")
       .AddItem GetCompanyName("BagelsAndLox.com")
    .AddItem GetCompanyName("www.Ted.BagelsAndLox.com")
       .AddItem GetCompanyName("Alice.BagelsAndLox.com")

Granted, these kinds of examples were not all in the OP's specifications.
--ron
Harlan Grove - 28 Apr 2008 22:35 GMT
Ron Rosenfeld <ronrosenf...@nospam.org> wrote...
...
>But I did test against all the content mentioned in the thread.
...
>Your pattern doesn't seem to match:
>
>http://excelusergroup.orgwww.techdirt.com/articles/20080408/223932792.shtml

The domain name in the url above should be

techdirt.com

and that's what my approach returns.

>www.BagelsAndLox.com
>BagelsAndLox.comwww.Ted.BagelsAndLox.com
>Alice.BagelsAndLox.com
...

I don't consider these urls. They're missing a protocol specifier
(http, https, ftp, or mailto, news, gofer, etc.) All depends on how we
define urls, but there could be substrings in arbitrary text that
match \b[^. ]+\.[^. ]\b that aren't urls, e.g., section numbers like
2.34.5. How would one distinguish these from urls without making the
protocol specifiers mandatory?
Ron Rosenfeld - 29 Apr 2008 03:27 GMT
>I don't consider these urls. They're missing a protocol specifier
>(http, https, ftp, or mailto, news, gofer, etc.) All depends on how we
>define urls, but there could be substrings in arbitrary text that
>match \b[^. ]+\.[^. ]\b that aren't urls, e.g., section numbers like
>2.34.5. How would one distinguish these from urls without making the
>protocol specifiers mandatory?

A valid objection.

Of course, we could just go back to the OP's original request:

>I'm struggling to find the correct algorithm to locate the starting
>point (after the http:// or after the http://www.) and the ending
>point (the first /) for my Mid function.

which can be easily handled with a worksheet function, and given his
description of "having a list of URL's" are probably not embedded in text.

--ron
Harlan Grove - 29 Apr 2008 21:00 GMT
Ron Rosenfeld <ronrosenf...@nospam.org> wrote...
...
>Of course, we could just go back to the OP's original request:
>
[quoted text clipped - 5 lines]
>description of "having a list of URL's" are probably not embedded in
>text.

OP's often don't provide comprehensive examples, as you know. If the
urls always have protocol specifiers, and there's always 2 slashes
just after the protocol specifier and colon, then the domain name will
appear between :// and the subsequent /, but such urls *can* also
contain port number specifiers. For example,

http://www.foo.com:80/bar/

which your approach chokes on but mine parses as foo.com. Then there
are mailto: and news: protocol specifiers that aren't followed by two
slashes, but they're perhaps a digression.

The domain name will be the last 2 or 3 period-separated tokens
between the first colon, possibly followed by 2 slashes, and the first
subsequent colon or slash. The only characters you need to check for
as delimiters are colons and slashes. The domain name will contain 1
or 2 periods separating any other characters.
Ron Rosenfeld - 01 May 2008 13:53 GMT
>Ron Rosenfeld <ronrosenf...@nospam.org> wrote...
>...
[quoted text clipped - 25 lines]
>as delimiters are colons and slashes. The domain name will contain 1
>or 2 periods separating any other characters.

Actually, my VBA regex approaches parses out port specifiers OK.  But I think
there is confusion, for me and others, about what constitutes a "domain name".
(I'm not particularly knowledgeable here).  

But I see definitions for URL; domain name; registered domain name; hostname;
as well as various types of Top Level Domains (generic, country specific);
second level domains; and various levels of subdomains.

And the specifications are changing.  Including allowing the use non-ascii
characters both in country level TLD's as well as in legitimate domain names.

In any event, the OP said he had a list of URL's; wanted to extract the domain
name; and remove the www. if present.

So I have simplified my original regex and VBA routine to do that.  I start
matching at the first ":", with an optional "//"; capture the (www.) into a
group which I will ignore, and return the subsequent string that includes
letters, digits, underscore, hyphens and dots.

re.Pattern = ":(//)?(www\.)?([-\w.]+)"

This returns the domains and all the subdomains, with the exception of the
"www."

There are some differences in what we return in some of the URL's you listed.
I'm not sure what the OP would want.  For some of them, he might want the
leftmost subdomain, and for others not.

URL
http://www.firstmonday.dk/issues/issue3_3/raymond/
http://www.insurance.ca.gov/docs/index.html
http://www.tdi.state.tx.us/wc/indexwc.html
http://en-US.www.mozilla.com/en-US/firefox/help/
http://xxx.lanl.gov/
http://www.stats.ox.ac.uk/pub/MASS4/
http://gd.tuwien.ac.at/opsys/linux/RPM/

Ron            Harlan
firstmonday.dk        www.firstmonday.dk
insurance.ca.gov    ca.gov
tdi.state.tx.us        state.tx.us
en-US.www.mozilla.com    mozilla.com
xxx.lanl.gov        lanl.gov
stats.ox.ac.uk        ox.ac.uk
gd.tuwien.ac.at        tuwien.ac.at

I can "correct" the entry with mozilla.com by making a small change in my
regex:

":(//)?([-\w.]*www\.)?([-\w.]+)"

and that works on the samples you provided.  But I don't know if it would work
in all cases.

In addition, as you know, javascript does not match unicode characters, so that
causes another set of problems :-(

Enought for now -- I've got some errands to do.  Below is the VBA code I used:

Ron:
====================================
Function ExtrURL(str As String) As String
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Global = False
're.Pattern = ":(//)?(www\.)?([-\w.]+)"
re.Pattern = ":(//)?([-\w.]*www\.)?([-\w.]+)"
If re.test(str) = True Then
   Set mc = re.Execute(str)
   ExtrURL = mc(mc.Count - 1).submatches(2)
End If
End Function

'Harlan--------------------------------------------------------

Function ExtrURLH(str As String) As String
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Global = True
re.Pattern = "[^:]*:(//)?[^/:]*?([^./:]+\.[^./:]+(\.[a-z]{2})?)[:/].*"
ExtrURLH = re.Replace(str, "$2")
End Function
=======================================

Best,
--ron
Howard Kaikow - 30 Apr 2008 05:19 GMT
The inclusion of the whatever:// is irrelevant to the issue of extracting
the domain.
Proper code will work either way.
And do not forget about country codes at the end of the string/URL.
Harlan Grove - 01 May 2008 03:02 GMT
"Howard Kaikow" <kai...@standards.com> wrote...
>The inclusion of the whatever:// is irrelevant to the issue of extracting
>the domain.
>Proper code will work either way.
>And do not forget about country codes at the end of the string/URL.

Really? What code would handle all the following?

http://linuxtoday.com/
http://www.firstmonday.dk/issues/issue3_3/raymond/
http://www.ace.net.nz/tech/TechFileFormat.html#s
http://www.ifi.unizh.ch/richter/people/pilz/links/index.html
http://www.insurance.ca.gov/docs/index.html
http://www.tdi.state.tx.us/wc/indexwc.html
http://xcell05.free.fr/pages/prog/api-c.htm
http://www.science.uva.nl/research/air/wiki/ShellStartupFiles
http://en-US.www.mozilla.com/en-US/firefox/help/
http://xxx.lanl.gov/
http://www.stats.ox.ac.uk/pub/MASS4/
http://caml.inria.fr/
http://www.er.uqam.ca/nobel/r10735/linux.html
http://gd.tuwien.ac.at/opsys/linux/RPM/
http://perso.wanadoo.es/antlarr/kalamaris.html

where the domain names should be

linuxtoday.com
firstmonday.dk
ace.net.nz
unizh.ch
ca.gov
state.tx.us
free.fr
uva.nl
mozilla.com
lanl.gov
ox.ac.uk
inria.fr
uqam.ca
tuwien.ac.at
wanadoo.es

It seems country top-level domains (.uk, .ca, .es, .dk, .fr, etc)
don't have to have US-like top-level domains
(.com, .net, .org, .gov, .edu, etc), but they can have optional
alternatives (.ac for .edu, .co for .com). But the presence of .??.us
where the ?? are 2-char abbreviations for US states or territories
really screws up simple rules.
Dave Mills - 01 May 2008 17:25 GMT
First you will need to answer how a human can tell what the domain part is for
these examples. The only way I could think of would be to query "Who Is" and
look at the registrant data.

The problem is com, nz, uk etc. are all domains
so are net.nz and ace.net.nz. ace.net.nz is a sub domain of net.nz but then net
is a sub domain of nz. Since the domain itself can have an IP and be used to
point to a web server there is no way you can extract what you have defined as
the domain part from the string programmatically. The solution needs additional
data about what you consider is the boundary point in each string.

>"Howard Kaikow" <kai...@standards.com> wrote...
>>The inclusion of the whatever:// is irrelevant to the issue of extracting
[quoted text clipped - 44 lines]
>where the ?? are 2-char abbreviations for US states or territories
>really screws up simple rules.
Signature

Dave Mills
There are 10 type of people, those that understand binary and those that don't.

Harlan Grove - 01 May 2008 19:59 GMT
Dave Mills <Ne...@nospam--djmills-dot-co.uk> wrote...
>First you will need to answer how a human can tell what the domain
>part is for these examples. . . .
...
>The problem is com, nz, uk etc. are all domains
>so are net.nz and ace.net.nz. ace.net.nz is a sub domain of net.nz
[quoted text clipped - 3 lines]
>programmatically. The solution needs additional data about what you
>consider is the boundary point in each string.
...

There are some rules. Maybe not complete, but they'll cover most
situations. Domains should be parsed right to left by token, and
tokens are period-delimited strings.

If the rightmost token is 2 chars,
 it's a country top-level domain, and presumably more tokens wanted.
 If the next token going left is also 2 chars or a common generic
 top-level domain name (net, org, etc.), then it's presumably also
 a higher level domain. Otherwise, the 2nd token from the right
 would complete the domain name.
 If the rightmost token is us, the next is 2 chars and the next is
 k12, we'd need the 4th token from the right too; otherwise, the 3rd
 token from the right would complete the domain name. Any further
 tokens going left would be hostnames within domain.
Else (the rightmost token is 3 or more chars) the 2nd token from the
 right would complete the domain name.

These rules would fail if www.foobar.museum.ru were a valid url, in
which case the domain name should be foobar.museum.ru. Perhaps what's
needed is a complete list of accepted top-level domain names, then the
domain name would stop at the first token going right to left that
isn't an accepted top-level domain name.

The joker in the set of urls I posted before was stats.ox.ac.uk.
Dave Mills - 02 May 2008 07:02 GMT
>Dave Mills <Ne...@nospam--djmills-dot-co.uk> wrote...
>>First you will need to answer how a human can tell what the domain
[quoted text clipped - 25 lines]
>Else (the rightmost token is 3 or more chars) the 2nd token from the
>  right would complete the domain name.
Most UK schools have domain names like
school.localauthority.sch.uk
This breaks your assumption.
The problem is that you have assumed that there is some sort of convention about
the depth of a DNS name whereas once I own a domain I can create as many
sub-domain as I like nested to any depth I like. Hence the need for some
addition info to determine the boundary.

>These rules would fail if www.foobar.museum.ru were a valid url, in
>which case the domain name should be foobar.museum.ru. Perhaps what's
[quoted text clipped - 3 lines]
>
>The joker in the set of urls I posted before was stats.ox.ac.uk.
Signature

Dave Mills
There are 10 type of people, those that understand binary and those that don't.

Howard Kaikow - 26 Apr 2008 13:11 GMT
Did this issue  ever get solved?
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.