MS Office Forum / Excel / New Users / May 2008
Extract domain names out of URLs
|
|
Thread rating:  |
MikeB - 20 Apr 2008 18:17 GMT Hi, I have a list of URLs and I would like to find a way to extract the domain name from the URL.
The URLs can be prefixed with www in some cases and I'd also like to strip that off.
So, for the following URLs http://www.wiseclerk.com/group-news/tag/libor http://www.wiseclerk.com/group-news/2008/04/ http://www.wiseclerk.com/group-news/ http://www.tiscali.co.uk/search/results.php?section=&from=&query=setana sports http://www.techdirt.com/articles/20080408/223932792.shtml http://www.techdirt.com/articles/20080408/223932792.shtml http://www.techdirt.com/ http://www.techdirt.com/ http://techdirt.com/index.php http://techdirt.com/blog.php?tag=loans
I'd like to extract
wiseclerk.com tiscali.co.uk techdirt.com
I'm struggling to find the correct algorithm to locate the starting point (after the http:// or after the http://www.) and the ending point (the first /) for my Mid function.
Any help appreciated.
I'm thinking that I might have to do a find/search function for either of the two starting strings and then have the "max" value (for the longer of the two strings) but that's a lot of extra columns in my spreadsheet.
Is there perhaps a better way to do this?
Rick Rothstein (MVP - VB) - 20 Apr 2008 18:52 GMT I think this formula does what you want and handles all the various possibilities...
=IF(LEFT(LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUTE(A1,"http://","")&"/")-1),4)="www.",MID(LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUTE(A1,"http://","")&"/")-1),5,256),LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUTE(A1,"http://","")&"/")-1))
Rick
> Hi, I have a list of URLs and I would like to find a way to extract > the domain name from the URL. [quoted text clipped - 33 lines] > > Is there perhaps a better way to do this? Howard Kaikow - 20 Apr 2008 20:48 GMT > I think this formula does what you want and handles all the various > possibilities... =IF(LEFT(LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUTE(A1,"http://"," ")&"/")-1),4)="www.",MID(LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUT E(A1,"http://","")&"/")-1),5,256),LEFT(SUBSTITUTE(A1,"http://",""),FIND("/", SUBSTITUTE(A1,"http://","")&"/")-1))
That does not work for URLs that start with, say, ftp.
The type of the URL is relevant, e.g., what about ftp.xyz.com and www.xyz.com, both should result in same doughmain.
In this case, a worksheet function may be better..
Rick Rothstein (MVP - VB) - 21 Apr 2008 16:49 GMT Well, I could modify my formula to account for the FTP protocol (and, by changing the approach, maybe so others even), but there seems to be so many protocols possible... even if you wrote a function to handle them, which protocols would you select to handle?
Rick
>> I think this formula does what you want and handles all the various >> possibilities... [quoted text clipped - 10 lines] > > In this case, a worksheet function may be better.. Howard Kaikow - 22 Apr 2008 21:50 GMT > Well, I could modify my formula to account for the FTP protocol (and, by > changing the approach, maybe so others even), but there seems to be so many > protocols possible... even if you wrote a function to handle them, which > protocols would you select to handle? There is no simple answer. It's up to the programmer.
A function may be easier because there are likely Windows APIs for parsing URLs.
Rick Rothstein (MVP - VB) - 22 Apr 2008 22:43 GMT >> Well, I could modify my formula to account for the FTP protocol (and, by >> changing the approach, maybe so others even), but there seems to be so [quoted text clipped - 7 lines] > A function may be easier because there are likely Windows APIs for parsing > URLs. Fair enough. Here is a possible Function solution to this question which makes use of the UrlGetPart API to get to the "company name". For those following this thread, add a Module to the project (Insert/Module from the VBA editor) and copy/paste the code after my signature into its code window. Inside the GetCompanyName function, I have seeded the Protocols constant with "www" and "ftp"... you can add any other protocols you think necessary in this constant, just make sure each protocol is separated by a dot. You would use this function just like any other worksheet function. So, for example, if A1 contains the URL, then put this formula in the cell you want to extract the company name to...
=GetCompanyName(A1)
It is that simple (assuming I didn't screw up my code that is<g>).
Rick
Private Declare Function UrlGetPart Lib "shlwapi" _ Alias "UrlGetPartA" _ (ByVal pszIn As String, _ ByVal pszOut As String, _ pcchOut As Long, _ ByVal dwPart As Long, _ ByVal Flags As Long) As Long
Private Const MAX_PATH As Long = 260 Private Const URL_PART_HOSTNAME As Long = 2
Public Function GetCompanyName(URL As String) As String Dim dwPart As Long Dim Flags As Long Dim Part As String Dim Size As Long Dim Host As String ' Protocols is a **dot** delimited string Const Protocols As String = "www.ftp" If Len(URL) > 0 Then If InStr(URL, "//") = 0 Then URL = "http://" & URL Part = Space$(MAX_PATH) Size = Len(Part) If UrlGetPart(URL, Part, Size, URL_PART_HOSTNAME, 0&) = 0 Then GetCompanyName = Left$(Part, Size) End If Host = Left(GetCompanyName, InStr(GetCompanyName, ".")) If InStr("." & Protocols & ".", "." & Host) Then GetCompanyName = Mid(GetCompanyName, Len(Host) + 1) End If End If End Function
Howard Kaikow - 23 Apr 2008 07:25 GMT > Fair enough. Here is a possible Function solution to this question which > makes use of the UrlGetPart API to get to the "company name". For those [quoted text clipped - 6 lines] > example, if A1 contains the URL, then put this formula in the cell you want > to extract the company name to... It is almost always better to use the Unicode version of APIs.
Rick Rothstein (MVP - VB) - 23 Apr 2008 09:24 GMT >> Fair enough. Here is a possible Function solution to this question which >> makes use of the UrlGetPart API to get to the "company name". For those [quoted text clipped - 12 lines] > > It is almost always better to use the Unicode version of APIs. You are probably right; but I have zero experience with international programming issues (all my programming efforts for the past 27 years have been US based), whether involving fonts or not. The end result is I have no idea if there is anything special that needs to be accounted for or not when dealing with them. My gut feeling is simply using UrlGetPartW in place of UrlGetPart is not the whole answer.
Rick
Howard Kaikow - 23 Apr 2008 10:17 GMT I really tied up now, but the following will give you an idea on how to Convert to Unicode. Apparently, the code is still not parsing as desired.
Option Explicit
Private Declare Function UrlGetPart Lib "shlwapi" _ Alias "UrlGetPartW" _ (ByVal pszIn As Long, _ ByVal pszOut As Long, _ pcchOut As Long, _ ByVal dwPart As Long, _ ByVal Flags As Long) As Long
Private Const MAX_PATH As Long = 260 Private Const URL_PART_HOSTNAME As Long = 2
Public Function GetCompanyName(URL As String) As String Dim dwPart As Long Dim Flags As Long Dim Part As String Dim Size As Long Dim Host As String ' Protocols is a **dot** delimited string Const Protocols As String = "www.ftp" If Len(URL) > 0 Then If InStr(URL, "//") = 0 Then URL = "http://" & URL Part = Space$(MAX_PATH) Size = Len(Part) If UrlGetPart(StrPtr(URL), StrPtr(Part), Size, URL_PART_HOSTNAME, 0&) = 0 Then GetCompanyName = Left$(Part, Size) End If Host = Left(GetCompanyName, InStr(GetCompanyName, ".")) If InStr("." & Protocols & ".", "." & Host) Then GetCompanyName = Mid(GetCompanyName, Len(Host) + 1) End If End If End Function
Private Sub btnByeBye_Click() Unload Me End Sub
Private Sub btnRunMe_Click() With lstResult .AddItem GetCompanyName("http://www.BagelsAndLox.com/") .AddItem GetCompanyName("http://BagelsAndLox.com/") .AddItem GetCompanyName("www.BagelsAndLox.com") .AddItem GetCompanyName("BagelsAndLox.com") .AddItem GetCompanyName("http://www.Bob.BagelsAndLox.com/") .AddItem GetCompanyName("http://Carol.BagelsAndLox.com/") .AddItem GetCompanyName("www.Ted.BagelsAndLox.com") .AddItem GetCompanyName("Alice.BagelsAndLox.com") End With End Sub
Howard Kaikow - 23 Apr 2008 10:26 GMT A betta test is:
Private Sub btnRunMe_Click() With lstResult .AddItem GetCompanyName("http://www.BagelsAndLox.com/") .AddItem GetCompanyName("http://BagelsAndLox.com/") .AddItem GetCompanyName("www.BagelsAndLox.com") .AddItem GetCompanyName("BagelsAndLox.com") .AddItem GetCompanyName("http://www.Bob.BagelsAndLox.com/") .AddItem GetCompanyName("http://Carol.BagelsAndLox.com/") .AddItem GetCompanyName("www.Ted.BagelsAndLox.com") .AddItem GetCompanyName("Alice.BagelsAndLox.com")
.AddItem GetCompanyName("http://www.BagelsAndLox.com/Alpha.htm") .AddItem GetCompanyName("http://BagelsAndLox.com/Beta.html") .AddItem GetCompanyName("http://www.Bob.BagelsAndLox.com/Gamma.doc") .AddItem GetCompanyName("http://Carol.BagelsAndLox.com/Delta.jpg") End With End Sub
Nick Hodge - 20 Apr 2008 19:07 GMT Mike
Try something like
=IF(ISERR(FIND("www",A1,1)),MID(A1,8,FIND("/",A1,8)-8),MID(A1,12,FIND("/",A1,12)-12))
this works on the theory that there is a trailing / after the domain somewhere, it will need amending if there is not, e.g http://excelusergroup.org
 Signature HTH Nick Hodge Microsoft MVP - Excel Southampton, England nick_hodgeTAKETHISOUT@zen.co.ukANDTHIS web: www.excelusergroup.org web: www.nickhodge.co.uk
> Hi, I have a list of URLs and I would like to find a way to extract > the domain name from the URL. [quoted text clipped - 33 lines] > > Is there perhaps a better way to do this? MikeB - 20 Apr 2008 19:13 GMT > Mike > [quoted text clipped - 51 lines] > > > Is there perhaps a better way to do this? Hey Nick! Some years ago I had the pleasure of living in Romsey near Southampton. I just saw your signature and that brought back fond memories.
Rick Rothstein (MVP - VB) - 20 Apr 2008 19:17 GMT That doesn't seem to work for URLs like these...
www.techdirt.com/articles/20080408/223932792.shtml
http://www.com/
Rick
> Mike > [quoted text clipped - 43 lines] >> >> Is there perhaps a better way to do this? Ron Rosenfeld - 23 Apr 2008 03:08 GMT >Hi, I have a list of URLs and I would like to find a way to extract >the domain name from the URL. [quoted text clipped - 33 lines] > >Is there perhaps a better way to do this? Since you indicated that these are URL's, I will assume they have a normal structure, so the domain name will be the part between the first // and the next /. Then we just remove the www. if it exists.
=SUBSTITUTE(MID(A1,FIND("/",A1)+2,FIND(CHAR(1), SUBSTITUTE(A1,"/",CHAR(1),3))-FIND("/",A1)-2),"www.","")
--ron
Ron Rosenfeld - 23 Apr 2008 03:15 GMT >Hi, I have a list of URLs and I would like to find a way to extract >the domain name from the URL. [quoted text clipped - 33 lines] > >Is there perhaps a better way to do this? Missed something -- the case where there is no final "/".
So use this instead:
=SUBSTITUTE(MID(A1,FIND("/",A1)+2,IF((LEN(A1)-LEN(SUBSTITUTE(A1,"/","")))=2, 255,FIND(CHAR(1),SUBSTITUTE(A1,"/",CHAR(1),3))-FIND("/",A1)-2)),"www.","")
--ron
Ron Rosenfeld - 23 Apr 2008 14:38 GMT >Hi, I have a list of URLs and I would like to find a way to extract >the domain name from the URL. [quoted text clipped - 33 lines] > >Is there perhaps a better way to do this? Here is a UDF that will probably do a better job, given all the variables that have been proposed in this thread.
Note that the line that starts with re.Pattern = and the subsequent line(s) within quotation marks should be all on one line.
====================================== Option Explicit Function ExtrURL(str As String) As String Dim re As Object, mc As Object Set re = CreateObject("vbscript.regexp") re.IgnoreCase = True re.Global = True re.Pattern = "\b((https?|ftp)://)?([\-A-Z0-9.]+)(/[\-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[\-A-Z0-9+&@#/%=~_|!:,.;]*)?" If re.test(str) = True Then Set mc = re.Execute(str) ExtrURL = mc(mc.Count - 1).submatches(2) ExtrURL = Replace(ExtrURL, "www.", "") If InStr(1, ExtrURL, ".") = 0 Then ExtrURL = "" End If End Function ==================================
--ron
Rick Rothstein (MVP - VB) - 23 Apr 2008 15:23 GMT > re.Pattern = > "\b((https?|ftp)://)?([\-A-Z0-9.]+)(/[\-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[\-A-Z0-9+&@#/%=~_|!:,.;]*)?" Now that is what I miss about Regular Expressions from my days many years ago working with them in the UNIX world... their clarity and readability.<g>
Rick
Ron Rosenfeld - 23 Apr 2008 19:33 GMT On Wed, 23 Apr 2008 10:23:45 -0400, "Rick Rothstein \(MVP - VB\)" <rick.newsNO.SPAM@NO.SPAMverizon.net> wrote:
>> re.Pattern = >> "\b((https?|ftp)://)?([\-A-Z0-9.]+)(/[\-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[\-A-Z0-9+&@#/%=~_|!:,.;]*)?" [quoted text clipped - 3 lines] > >Rick <ggg>
And even when you write out the explanation:
=============================== URL capturing
\b((https?|ftp)://)?([-A-Z0-9.]+)(/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?
Options: case insensitive
Assert position at a word boundary «\b» Match the regular expression below and capture its match into backreference number 1 «((https?|ftp)://)?» Between zero and one times, as many times as possible, giving back as needed (greedy) «?» Match the regular expression below and capture its match into backreference number 2 «(https?|ftp)» Match either the regular expression below (attempting the next alternative only if this one fails) «https?» Match the characters http literally «http» Match the character s literally «s?» Between zero and one times, as many times as possible, giving back as needed (greedy) «?» Or match regular expression number 2 below (the entire group fails if this one fails to match) «ftp» Match the characters ftp literally «ftp» Match the characters :// literally «://» Match the regular expression below and capture its match into backreference number 3 «([-A-Z0-9.]+)» Match a single character present in the list below «[-A-Z0-9.]+» Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+» The character - «-» A character in the range between A and Z «A-Z» A character in the range between 0 and 9 «0-9» The character . «.» Match the regular expression below and capture its match into backreference number 4 «(/[-A-Z0-9+&@#/%=~_|!:,.;]*)?» Between zero and one times, as many times as possible, giving back as needed (greedy) «?» Match the character / literally «/» Match a single character present in the list below «[-A-Z0-9+&@#/%=~_|!:,.;]*» Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*» The character - «-» A character in the range between A and Z «A-Z» A character in the range between 0 and 9 «0-9» One of the characters +&@#/%=~_|!:,.; «+&@#/%=~_|!:,.;» Match the regular expression below and capture its match into backreference number 5 «(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?» Between zero and one times, as many times as possible, giving back as needed (greedy) «?» Match the character ? literally «\?» Match a single character present in the list below «[-A-Z0-9+&@#/%=~_|!:,.;]*» Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*» The character - «-» A character in the range between A and Z «A-Z» A character in the range between 0 and 9 «0-9» One of the characters +&@#/%=~_|!:,.; «+&@#/%=~_|!:,.;»
Created with RegexBuddy ====================================== --ron
Harlan Grove - 27 Apr 2008 04:27 GMT Ron Rosenfeld <ronrosenf...@nospam.org> wrote... ... [reformatted]
>re.Pattern = "\b((https?|ftp)://)?([\-A-Z0-9.]+)" & _ > "(/[\-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[\-A-Z0-9+&@#/%=~_|!:,.;]*)?" ...
Why so verbose?
re.Pattern = "[^:]*:(//)?[^/:]*?([^./:]+\.[^./:]+(\.[a-z]{2})?)[:/].*" ExtrURL = re.Replace(str, "$2")
Ron Rosenfeld - 27 Apr 2008 12:46 GMT >Ron Rosenfeld <ronrosenf...@nospam.org> wrote... >... [quoted text clipped - 7 lines] >re.Pattern = "[^:]*:(//)?[^/:]*?([^./:]+\.[^./:]+(\.[a-z]{2})?)[:/].*" >ExtrURL = re.Replace(str, "$2") It's a pattern (from a library) that captures the different URL parts into different backreferences, so does more than what the OP requested.
But I did test against all the content mentioned in the thread.
Running a quick test, using ============================ Function Extr(str As String) As String Dim re As Object Set re = CreateObject("vbscript.regexp") re.IgnoreCase = True re.Global = True re.Pattern = "[^:]*:(//)?[^/:]*?([^./:]+\.[^./:]+(\.[a-z]{2})?)[:/].*" If re.Test(str) = True Then Extr = re.Replace(str, "$2") End If End Function ================================
Your pattern doesn't seem to match:
http://excelusergroup.org www.techdirt.com/articles/20080408/223932792.shtml
www.BagelsAndLox.com BagelsAndLox.com www.Ted.BagelsAndLox.com Alice.BagelsAndLox.com
and won't extract the URL from .AddItem GetCompanyName("www.BagelsAndLox.com") .AddItem GetCompanyName("BagelsAndLox.com") .AddItem GetCompanyName("www.Ted.BagelsAndLox.com") .AddItem GetCompanyName("Alice.BagelsAndLox.com")
Granted, these kinds of examples were not all in the OP's specifications. --ron
Harlan Grove - 28 Apr 2008 22:35 GMT Ron Rosenfeld <ronrosenf...@nospam.org> wrote... ...
>But I did test against all the content mentioned in the thread. ...
>Your pattern doesn't seem to match: > >http://excelusergroup.orgwww.techdirt.com/articles/20080408/223932792.shtml The domain name in the url above should be
techdirt.com
and that's what my approach returns.
>www.BagelsAndLox.com >BagelsAndLox.comwww.Ted.BagelsAndLox.com >Alice.BagelsAndLox.com ...
I don't consider these urls. They're missing a protocol specifier (http, https, ftp, or mailto, news, gofer, etc.) All depends on how we define urls, but there could be substrings in arbitrary text that match \b[^. ]+\.[^. ]\b that aren't urls, e.g., section numbers like 2.34.5. How would one distinguish these from urls without making the protocol specifiers mandatory?
Ron Rosenfeld - 29 Apr 2008 03:27 GMT >I don't consider these urls. They're missing a protocol specifier >(http, https, ftp, or mailto, news, gofer, etc.) All depends on how we >define urls, but there could be substrings in arbitrary text that >match \b[^. ]+\.[^. ]\b that aren't urls, e.g., section numbers like >2.34.5. How would one distinguish these from urls without making the >protocol specifiers mandatory? A valid objection.
Of course, we could just go back to the OP's original request:
>I'm struggling to find the correct algorithm to locate the starting >point (after the http:// or after the http://www.) and the ending >point (the first /) for my Mid function. which can be easily handled with a worksheet function, and given his description of "having a list of URL's" are probably not embedded in text.
--ron
Harlan Grove - 29 Apr 2008 21:00 GMT Ron Rosenfeld <ronrosenf...@nospam.org> wrote... ...
>Of course, we could just go back to the OP's original request: > [quoted text clipped - 5 lines] >description of "having a list of URL's" are probably not embedded in >text. OP's often don't provide comprehensive examples, as you know. If the urls always have protocol specifiers, and there's always 2 slashes just after the protocol specifier and colon, then the domain name will appear between :// and the subsequent /, but such urls *can* also contain port number specifiers. For example,
http://www.foo.com:80/bar/
which your approach chokes on but mine parses as foo.com. Then there are mailto: and news: protocol specifiers that aren't followed by two slashes, but they're perhaps a digression.
The domain name will be the last 2 or 3 period-separated tokens between the first colon, possibly followed by 2 slashes, and the first subsequent colon or slash. The only characters you need to check for as delimiters are colons and slashes. The domain name will contain 1 or 2 periods separating any other characters.
Ron Rosenfeld - 01 May 2008 13:53 GMT >Ron Rosenfeld <ronrosenf...@nospam.org> wrote... >... [quoted text clipped - 25 lines] >as delimiters are colons and slashes. The domain name will contain 1 >or 2 periods separating any other characters. Actually, my VBA regex approaches parses out port specifiers OK. But I think there is confusion, for me and others, about what constitutes a "domain name". (I'm not particularly knowledgeable here).
But I see definitions for URL; domain name; registered domain name; hostname; as well as various types of Top Level Domains (generic, country specific); second level domains; and various levels of subdomains.
And the specifications are changing. Including allowing the use non-ascii characters both in country level TLD's as well as in legitimate domain names.
In any event, the OP said he had a list of URL's; wanted to extract the domain name; and remove the www. if present.
So I have simplified my original regex and VBA routine to do that. I start matching at the first ":", with an optional "//"; capture the (www.) into a group which I will ignore, and return the subsequent string that includes letters, digits, underscore, hyphens and dots.
re.Pattern = ":(//)?(www\.)?([-\w.]+)"
This returns the domains and all the subdomains, with the exception of the "www."
There are some differences in what we return in some of the URL's you listed. I'm not sure what the OP would want. For some of them, he might want the leftmost subdomain, and for others not.
URL http://www.firstmonday.dk/issues/issue3_3/raymond/ http://www.insurance.ca.gov/docs/index.html http://www.tdi.state.tx.us/wc/indexwc.html http://en-US.www.mozilla.com/en-US/firefox/help/ http://xxx.lanl.gov/ http://www.stats.ox.ac.uk/pub/MASS4/ http://gd.tuwien.ac.at/opsys/linux/RPM/
Ron Harlan firstmonday.dk www.firstmonday.dk insurance.ca.gov ca.gov tdi.state.tx.us state.tx.us en-US.www.mozilla.com mozilla.com xxx.lanl.gov lanl.gov stats.ox.ac.uk ox.ac.uk gd.tuwien.ac.at tuwien.ac.at
I can "correct" the entry with mozilla.com by making a small change in my regex:
":(//)?([-\w.]*www\.)?([-\w.]+)"
and that works on the samples you provided. But I don't know if it would work in all cases.
In addition, as you know, javascript does not match unicode characters, so that causes another set of problems :-(
Enought for now -- I've got some errands to do. Below is the VBA code I used:
Ron: ==================================== Function ExtrURL(str As String) As String Dim re As Object, mc As Object Set re = CreateObject("vbscript.regexp") re.IgnoreCase = True re.Global = False 're.Pattern = ":(//)?(www\.)?([-\w.]+)" re.Pattern = ":(//)?([-\w.]*www\.)?([-\w.]+)" If re.test(str) = True Then Set mc = re.Execute(str) ExtrURL = mc(mc.Count - 1).submatches(2) End If End Function
'Harlan--------------------------------------------------------
Function ExtrURLH(str As String) As String Dim re As Object, mc As Object Set re = CreateObject("vbscript.regexp") re.IgnoreCase = True re.Global = True re.Pattern = "[^:]*:(//)?[^/:]*?([^./:]+\.[^./:]+(\.[a-z]{2})?)[:/].*" ExtrURLH = re.Replace(str, "$2") End Function =======================================
Best, --ron
Howard Kaikow - 30 Apr 2008 05:19 GMT The inclusion of the whatever:// is irrelevant to the issue of extracting the domain. Proper code will work either way. And do not forget about country codes at the end of the string/URL.
Harlan Grove - 01 May 2008 03:02 GMT "Howard Kaikow" <kai...@standards.com> wrote...
>The inclusion of the whatever:// is irrelevant to the issue of extracting >the domain. >Proper code will work either way. >And do not forget about country codes at the end of the string/URL. Really? What code would handle all the following?
http://linuxtoday.com/ http://www.firstmonday.dk/issues/issue3_3/raymond/ http://www.ace.net.nz/tech/TechFileFormat.html#s http://www.ifi.unizh.ch/richter/people/pilz/links/index.html http://www.insurance.ca.gov/docs/index.html http://www.tdi.state.tx.us/wc/indexwc.html http://xcell05.free.fr/pages/prog/api-c.htm http://www.science.uva.nl/research/air/wiki/ShellStartupFiles http://en-US.www.mozilla.com/en-US/firefox/help/ http://xxx.lanl.gov/ http://www.stats.ox.ac.uk/pub/MASS4/ http://caml.inria.fr/ http://www.er.uqam.ca/nobel/r10735/linux.html http://gd.tuwien.ac.at/opsys/linux/RPM/ http://perso.wanadoo.es/antlarr/kalamaris.html
where the domain names should be
linuxtoday.com firstmonday.dk ace.net.nz unizh.ch ca.gov state.tx.us free.fr uva.nl mozilla.com lanl.gov ox.ac.uk inria.fr uqam.ca tuwien.ac.at wanadoo.es
It seems country top-level domains (.uk, .ca, .es, .dk, .fr, etc) don't have to have US-like top-level domains (.com, .net, .org, .gov, .edu, etc), but they can have optional alternatives (.ac for .edu, .co for .com). But the presence of .??.us where the ?? are 2-char abbreviations for US states or territories really screws up simple rules.
Dave Mills - 01 May 2008 17:25 GMT First you will need to answer how a human can tell what the domain part is for these examples. The only way I could think of would be to query "Who Is" and look at the registrant data.
The problem is com, nz, uk etc. are all domains so are net.nz and ace.net.nz. ace.net.nz is a sub domain of net.nz but then net is a sub domain of nz. Since the domain itself can have an IP and be used to point to a web server there is no way you can extract what you have defined as the domain part from the string programmatically. The solution needs additional data about what you consider is the boundary point in each string.
>"Howard Kaikow" <kai...@standards.com> wrote... >>The inclusion of the whatever:// is irrelevant to the issue of extracting [quoted text clipped - 44 lines] >where the ?? are 2-char abbreviations for US states or territories >really screws up simple rules.  Signature Dave Mills There are 10 type of people, those that understand binary and those that don't.
Harlan Grove - 01 May 2008 19:59 GMT Dave Mills <Ne...@nospam--djmills-dot-co.uk> wrote...
>First you will need to answer how a human can tell what the domain >part is for these examples. . . . ...
>The problem is com, nz, uk etc. are all domains >so are net.nz and ace.net.nz. ace.net.nz is a sub domain of net.nz [quoted text clipped - 3 lines] >programmatically. The solution needs additional data about what you >consider is the boundary point in each string. ...
There are some rules. Maybe not complete, but they'll cover most situations. Domains should be parsed right to left by token, and tokens are period-delimited strings.
If the rightmost token is 2 chars, it's a country top-level domain, and presumably more tokens wanted. If the next token going left is also 2 chars or a common generic top-level domain name (net, org, etc.), then it's presumably also a higher level domain. Otherwise, the 2nd token from the right would complete the domain name. If the rightmost token is us, the next is 2 chars and the next is k12, we'd need the 4th token from the right too; otherwise, the 3rd token from the right would complete the domain name. Any further tokens going left would be hostnames within domain. Else (the rightmost token is 3 or more chars) the 2nd token from the right would complete the domain name.
These rules would fail if www.foobar.museum.ru were a valid url, in which case the domain name should be foobar.museum.ru. Perhaps what's needed is a complete list of accepted top-level domain names, then the domain name would stop at the first token going right to left that isn't an accepted top-level domain name.
The joker in the set of urls I posted before was stats.ox.ac.uk.
Dave Mills - 02 May 2008 07:02 GMT >Dave Mills <Ne...@nospam--djmills-dot-co.uk> wrote... >>First you will need to answer how a human can tell what the domain [quoted text clipped - 25 lines] >Else (the rightmost token is 3 or more chars) the 2nd token from the > right would complete the domain name. Most UK schools have domain names like school.localauthority.sch.uk This breaks your assumption. The problem is that you have assumed that there is some sort of convention about the depth of a DNS name whereas once I own a domain I can create as many sub-domain as I like nested to any depth I like. Hence the need for some addition info to determine the boundary.
>These rules would fail if www.foobar.museum.ru were a valid url, in >which case the domain name should be foobar.museum.ru. Perhaps what's [quoted text clipped - 3 lines] > >The joker in the set of urls I posted before was stats.ox.ac.uk.  Signature Dave Mills There are 10 type of people, those that understand binary and those that don't.
Howard Kaikow - 26 Apr 2008 13:11 GMT Did this issue ever get solved?
|
|
|