Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
Home
DiscussionsAccessExcelInfoPathOutlookPowerPointPublisherWord
DirectoryUser Groups
Related Topics
Outlook ExpressInternet ExplorerWindowsMS Server ProductsMore Topics ...

MS Office Forum / Excel / New Users / October 2007

Tip: Looking for answers? Try searching our database.

Text Parsing

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Tom K - 05 Oct 2007 15:55 GMT
It seems like this ought to be easy, but I'm stumped.

I received several thousand address records in the following format--all in
one column of an Excel worksheet:

Smith, John & Mary 123 Main St Chicago, IL 60601
Williams, Bill & Cindy 45 54th Drive Madison, WI 51375

I need to parse these into:

Last Name
First Name(s)
Street Address
City
State
Zip

I can get the last name using =LEFT(B2, FIND(",",B2)-1) and the Zip using
=RIGHT(B2,5) but I can't come up with a way to get the other fields. A quick
scan of the records shows all the street addresses seem to start with a
number, and there are no numbers in any of the first names, so it seems like
if I can find the first number reading from the left, I could grab the first
names, but neither FIND nor SEARCH seem to work. I'm using Excel 2007.

Could anyone point me to a possible approach. Seems as if this is something
others must have solved many times before.

Many thanks,
Tom K
LarryKO - 05 Oct 2007 17:11 GMT
I'm sure there is an easier way, but there is a mid() function that
will allow you to selectively pull from the middle of the field.  If
the ampersands are consistent I'm sure this will work to get
everything out.  Good luck!
Tyro - 05 Oct 2007 17:27 GMT
You have a real problem. You can easily get the last name and the zip and
the state, but the rest is problematic.. Are the cities Chicago or St
Chicago (as in St Louis) Madison or Drive Madison? What if there is no Mary
or Cindy? No matter how you do it, you're going to have a lot of manual
work. Even if you split the data into columns (space delimiter) the result
will be inconsistant. Can you get the original file, before it was put into
Excel?

> It seems like this ought to be easy, but I'm stumped.
>
[quoted text clipped - 26 lines]
> Many thanks,
> Tom K
Tom K - 05 Oct 2007 23:45 GMT
The cities (in this example) are Chicago and Madison. And there are cases
where there is no spouse--so no ampersand.

I believe the list came from QuickBooks. I'm not familiar with QuickBooks,
but maybe it has a CSV or fixed field format. Otherwise, I think I am stuck,
unless someone else on these boards is creative.

Thanks for your thoughts.

> You have a real problem. You can easily get the last name and the zip and
> the state, but the rest is problematic.. Are the cities Chicago or St
[quoted text clipped - 34 lines]
>> Many thanks,
>> Tom K
Tyro - 06 Oct 2007 00:54 GMT
You see the problem with breaking down the cell?

Lets suppose the entry is Smith, John & Mary 123 Main St. Louis, MO 60601.
(St. Louis, Missouri)
How is that different from Smith, John & Mary 123 Main St Chicago, IL 60601
(St. Chicago, Illinois)
They both live at 123 Main Street

Do you see my point?

> The cities (in this example) are Chicago and Madison. And there are cases
> where there is no spouse--so no ampersand.
[quoted text clipped - 5 lines]
>
> Thanks for your thoughts.
MartinW - 06 Oct 2007 00:08 GMT
Hi Tom,

This may help you get some of the way.
Do Data>Text to Columns using comma as the delimeter.

Then on your second column do an Edit>Replace for each digit.
Find what:-       space1
Replace with:-    ~1
Then repeat with
Find what:-       space2
Replace with:-    ~2
etc
...........
...........
Down to
Find what:-       space9
Replace with:-    ~9

You can then do a further Text to columns on the ~
or use it as a Find character.

Still a bit of manual work but hopefully that will
help you get part way.

HTH
Martin

> It seems like this ought to be easy, but I'm stumped.
>
[quoted text clipped - 26 lines]
> Many thanks,
> Tom K
MartinW - 06 Oct 2007 01:38 GMT
I just realised that you may end up with two tilde's in some circumstances
which precludes using Text to Columns after the Find and Replace,
but doesn't stop you using FIND as it will work on the first tilde
only, if you use it like this.

This will return everything to the left of the tilde
=LEFT(B1,FIND("~",B1)-1)

Either of  these will return everything to the right of the tilde
=MID(B1,FIND("~",B1)+1,100)
=RIGHT(B1,LEN(B1)-FIND("~",B1))

HTH
Martin

> Hi Tom,
>
[quoted text clipped - 53 lines]
>> Many thanks,
>> Tom K
MartinW - 06 Oct 2007 03:10 GMT
I may not have explained my first post very well.
The intention is to split the data at the first number of the
street address. After the Find and Replace procedure
your data should look like this.

     John & Mary~123 Main St Chicago
     Bill & Cindy~45~54th Drive Madison
     Bob & Carol & Ted & Alice~623 River St Chicago
     B.J~3/89~26th St Toronto

After splitting at the tilde with the aforementioned formulae

     John & Mary                      |      123 Main St Chicago
     Bill & Cindy                       |     45~54th Drive Madison
     Bob & Carol & Ted & Alice  |     623 River St Chicago
     B.J                                   |   3/89~26th St Toronto

Just use Find and Replace to clean up the leftover
tilde's in column C.

HTH
Martin

> Hi Tom,
>
[quoted text clipped - 53 lines]
>> Many thanks,
>> Tom K
Ron Rosenfeld - 06 Oct 2007 03:32 GMT
>I may not have explained my first post very well.
>The intention is to split the data at the first number of the
>street address.

That will present a problem if there are addresses that do not start with
numbers.

But if the all do, it would make it simpler for my method to parse out the
first names, whether there are one, two and whether or not the names had more
than one word.
--ron
MartinW - 06 Oct 2007 03:59 GMT
> That will present a problem if there are addresses that do not start with
> numbers.

Tom's original post said that all the addresses start with a number
and there doesn't appear to be any numbers in the names, although
from the way he phrased it, that may not be entirely correct.

I think no matter how he achieves his result at some point he
will have to go through a painfully slow and careful checking
process.

Regards
Martin
Ron Rosenfeld - 06 Oct 2007 04:11 GMT
>Tom's original post said that all the addresses start with a number

You're correct!  I missed that.  I think I will modify my suggestion.
--ron
Rick Rothstein (MVP - VB) - 06 Oct 2007 20:10 GMT
>>Tom's original post said that all the addresses start with a number
>
> You're correct!  I missed that.  I think I will modify my suggestion.

Well, I wouldn't concede this point too quickly... the OP's original
statement was..

  "...all the street addresses seem to start with a number..."

with the key word to note being "seem"; and, of course, the OP's observation
means nothing against possible future data.

Rick
Ron Rosenfeld - 06 Oct 2007 22:16 GMT
On Sat, 6 Oct 2007 15:10:51 -0400, "Rick Rothstein \(MVP - VB\)"
<rickNOSPAMnews@NOSPAMcomcast.net> wrote:

>>>Tom's original post said that all the addresses start with a number
>>
[quoted text clipped - 9 lines]
>
>Rick

I agree.

But I've presented two methods to try to parse out the first name(s).

One relies on the first name(s) being just single words and, if there are two,
separated by an ampersand.

The second relies on the first name(s) being the words between the initial
comma, and the first digit.

So far as parsing out the city, the algorithm I've presented relies on a list
of all possible cities being available.

One could also, depending on the data, define the city as being between some
street-type abbreviation (e.g. Rd Av Ave St Dr Drive, etc) and the <comma>
which precedes the State abbreviation.

But I need more input from Tom K before going further.

100% success may not be possible, depending on the data.

--ron
Rick Rothstein (MVP - VB) - 06 Oct 2007 22:31 GMT
> On Sat, 6 Oct 2007 15:10:51 -0400, "Rick Rothstein \(MVP - VB\)"
> <rickNOSPAMnews@NOSPAMcomcast.net> wrote:
[quoted text clipped - 37 lines]
>
> 100% success may not be possible, depending on the data.

You are being getting far more involved in this thread than I would ever
consider doing (kudos to you for that, by the way)... I look at the data and
simply see a horribly flawed structure which seems like it will take far
more time to develop a "no more than a partial solution" for than it would
probably take to parse the data lines one-by-one. I think the only thing the
OP can hope for is to be able to go back to the source and get the data in a
more structured form.

Rick
Ron Rosenfeld - 06 Oct 2007 22:44 GMT
On Sat, 6 Oct 2007 17:31:27 -0400, "Rick Rothstein \(MVP - VB\)"
<rickNOSPAMnews@NOSPAMcomcast.net> wrote:

>I think the only thing the
>OP can hope for is to be able to go back to the source and get the data in a
>more structured form.

But then the task would become simple :-))
--ron
Peo Sjoblom - 06 Oct 2007 22:47 GMT
>> On Sat, 6 Oct 2007 15:10:51 -0400, "Rick Rothstein \(MVP - VB\)"
>> <rickNOSPAMnews@NOSPAMcomcast.net> wrote:
[quoted text clipped - 47 lines]
> thing the OP can hope for is to be able to go back to the source and get
> the data in a more structured form.

I agree, yet posters seem to think this is easy, to quote the OP
"It seems like this ought to be easy, but I'm stumped"
I never understood why it would be easy to parse text strings that can
follow who knows how many different rules in a program made to crunch
numbers?
It seems that Excel more than any other programs is used (or rather people
want to use it that way) to do everything from word processing to scheduling
work shifts!

Signature

Regards,

Peo Sjoblom

Tyro - 06 Oct 2007 22:54 GMT
Agreed. I've been massaging computer data for over 40 years. In this case,
it's best to go back to square 1.

>> On Sat, 6 Oct 2007 15:10:51 -0400, "Rick Rothstein \(MVP - VB\)"
>> <rickNOSPAMnews@NOSPAMcomcast.net> wrote:
[quoted text clipped - 49 lines]
>
> Rick
Earl Kiosterud - 06 Oct 2007 00:13 GMT
Tom,

This is a tough one.  Your data is quite ambiguous as to names, addresses, city, etc.  As
Tyro said, you'll probably have some manual work.  One workable, if manual, way is to put
the stuff in a Word table.  Set up columns as you need (city, state, etc).  Then you can
quickly select parts and drag them to the column where they belong.  Manual, but fast.
Several thousand records???  Ick.  Hire someone.  If you use one any of the parsing
techniques to do the parsing automatically, you can still use Word for cleanup, as required.
You can easily put it in Word initially with copy/paste, or by opening the Excel file
directly in Word.  DON'T SAVE IT IN WORD, or it'll clobber your original Excel file.

Rob Bovey had a pretty sophisticated Excel parser for this kind of stuff.  It was many years
ago.  See if it's available at his web site, www.appspro.com.
Signature

Regards from Virginia Beach,

Earl Kiosterud
www.smokeylake.com

   Note: Top-posting has been the norm here.
   Some folks prefer bottom-posting.
   But if you bottom-post to a reply that's
   already top-posted, the thread gets messy.
   When in Rome...
-----------------------------------------------------------------------

> It seems like this ought to be easy, but I'm stumped.
>
[quoted text clipped - 25 lines]
> Many thanks,
> Tom K
Ron Rosenfeld - 06 Oct 2007 01:57 GMT
>It seems like this ought to be easy, but I'm stumped.
>
[quoted text clipped - 25 lines]
>Many thanks,
>Tom K

If your layout is always in the same format, and in particular if there are
only single first names, then this can be done using Regular Expressions.

The main problem is to differentiate the first name(s) from the beginning of
the address.  The rule that I used in the example below is this pattern:

<comma><space>
<word>
then, optionally:
    <space>
    ampersand (&)
    <space>
    <word>

But this will fail if you have a "double" first name.  For example:

Smith, Mary Jane 47 Edgecomb rd. ...

If that is the case, we would need to come up with a different rule.  But try
this, anyway, and see if it works.

If your addresses always start with a number, we could incorporate that as a
rule and get around the issue of double first names.

Also, you need to list your cities by name.  In the Pattern I devised, I listed
Chicago and Madison.  You can add more possibilities by extending the
pipe-delimited list in Pattern.

First, enter the UDF into your workbook:

    <alt-F11> opens the VB Editor
    Ensure your project is highlighted in the Project Explorer window, then
Insert/Module and paste the code below into the window that opens.

The following formulas should work:

Last Name:      =remid(A1,"^[^,]*")
First Name(s):    =remid(A1,Pattern,1)
Street Address:    =remid(A1,Pattern,3)
City:        =remid(A1,Pattern,4)
State:        =remid(A1,Pattern,5)
Zip:        =remid(A1,Pattern,6)

Pattern:
,\s+(\w+(\s+&\s+\w+)?)\s+(.*)\s+(Chicago|Madison),\s+([A-Z]{2})\s+(\d{5}(-\d{4})?)

=============================================
Option Explicit
Function reMid(str As String, spattern As String, Optional Index) As String
Dim re As Object
Dim mc As Object

Set re = CreateObject("vbscript.regexp")
re.Global = True
re.ignorecase = True
re.Pattern = spattern

If re.test(str) = True Then
   Set mc = re.Execute(str)
   If IsMissing(Index) Then
       reMid = mc(0)
   Else
       reMid = mc(0).submatches(Index - 1)
   End If
End If
End Function
=============================================

This should handle any entries that are in the pattern you presented.  Examples
that don't work would be useful in trying to refine this routine.
--ron
Tyro - 06 Oct 2007 02:24 GMT
That won't work. He's talking about tons of addresses, not just two.

>>It seems like this ought to be easy, but I'm stumped.
>>
[quoted text clipped - 109 lines]
> that don't work would be useful in trying to refine this routine.
> --ron
Ron Rosenfeld - 06 Oct 2007 03:27 GMT
>That won't work. He's talking about tons of addresses, not just two.

It is not the number of addresses that is important.  Rather it is the pattern
used to determine those addresses.  I pointed out one potential issue with the
pattern having to do with multiple first names -- e.g. Mary Jane.  I also
pointed out how he can extend the list of city names.

What other pattern related problems do you see?

You pointed out that there would be a potential issue if there are two cities
named, for example:

Louis
St Louis

But I doubt there are too many city pairs like that.  So manual inspection
might be an option.

Certainly, adding "St Louis" to the city list enables proper parsing of an
entry such as:

Sylvia, Burt & Jinny Waburn St St Louis, MO 45678

using my algorithm.

And the pipe-delimited city list can be constructed in a variety of ways,
depending on how many cities there are in the list.  It would sure be simpler
if each city were preceded by a <comma><space> rather than just a <space> as it
is now.  But that may not be an option.

So far as dealing with "tons of addresses", in the formula I wrote, A1 can be
replaced with any cell reference.  And, in Excel, if you fill down a formula,
the cell reference will adjust appropriately so as to be able to handle "tons
of addresses".

I'm not sure how long the string containing the City List can be.  I just
constructed a list of about 250 cities which was almost 3,000 characters long
and it seemed to work OK.  I suspect the limit may be 32,767 characters, but
I'm not certain of that.

Obviously, it cannot be entered as a text string within the formula, due to the
limitation of 1,024 characters in a formula (Excel 2003) but rather as a
concatenated function, or possibly a concatenated Name.

--ron
Ron Rosenfeld - 06 Oct 2007 04:21 GMT
>>It seems like this ought to be easy, but I'm stumped.
>>
[quoted text clipped - 97 lines]
>that don't work would be useful in trying to refine this routine.
>--ron

Martin just pointed out to me that EVERY street address begins with a number.

That being the case, the following is more efficient for picking out the first
name(s), and will handle the issue of two first names:  e.g. Mary Jane & Justin

Last Name:      =remid(A1,"^[^,]*")
First Name(s):    =remid(A1,Pattern,1)
Street Address:    =remid(A1,Pattern,2)
City:        =remid(A1,Pattern,3)
State:        =remid(A1,Pattern,4)
Zip:        =remid(A1,Pattern,5)

and

Pattern:

,\s+(\D+)\s+(.*)\s+(Chicago|Madison),\s+([A-Z]{2})\s+(\d{5}(-\d{4})?)

We still need to generate a pipe-delimited list of cities.  How to do that
depends on the length of the list.
--ron
Ron Rosenfeld - 07 Oct 2007 03:44 GMT
>>>It seems like this ought to be easy, but I'm stumped.
>>>
[quoted text clipped - 119 lines]
>depends on the length of the list.
>--ron

Just for consistency, we could use a single Pattern for all including the Last
Name.

Pattern:

(^[^,]+),\s+(\D+)\s+(.*)\s+(Chicago|Madison),\s+([A-Z]{2})\s+(\d{5}(-\d{4})?)

Then use the formula:

Last Name:    =remid($A1,Pattern,COLUMNS($A:A))

Copy/drag to the right, and the last argument will adjust to the proper Index
for each section of the string.
--ron
Tom K - 07 Oct 2007 18:21 GMT
Guys:

Thank you all for the many excellent suggestions. I'm grateful that so many
people would try to help out. What a great community!

After carefully scanning thorugh the list, it looks like there are two
instances where the street address does not start with a number (darn) --  
those address that start PO Box XXX, or RR# XXX. Most of the first names are
of the format "Spouse 1 & Spouse 2" or just "Spouse 1".

Let me work throught the suggestions from Ron, Martin, and Earl and see if I
have any success. Again many thanks.

Tom K

P.S. What does "OP" mean?

> It seems like this ought to be easy, but I'm stumped.
>
[quoted text clipped - 26 lines]
> Many thanks,
> Tom K
Ragdyer - 07 Oct 2007 18:32 GMT
Depending on context, either:
original post
OR
original poster
Signature

Regards,

RD

---------------------------------------------------------------------------
Please keep all correspondence within the NewsGroup, so all may benefit !
---------------------------------------------------------------------------

> Guys:
>
[quoted text clipped - 43 lines]
> > Many thanks,
> > Tom K
Tyro - 07 Oct 2007 18:34 GMT
OP = Original Poster

> Guys:
>
[quoted text clipped - 12 lines]
>
> P.S. What does "OP" mean?
Rick Rothstein (MVP - VB) - 07 Oct 2007 18:36 GMT
> P.S. What does "OP" mean?

Original Poster

Rick
Ron Rosenfeld - 07 Oct 2007 19:45 GMT
>Guys:
>
[quoted text clipped - 10 lines]
>
>Tom K

If the issue is that some of the streets start with RR or PO, then, using my
last example of formulas, try this for Pattern:

(^[^,]+),\s+(\D+?)\s+(?=PO|RR|\d)(.*)\s+(Chicago|Madison),\s+([A-Z]{2})\s+(\d{5}(-\d{4})?)

Where you see the pipe-delimited list of "street starters"  (e.g. PO|RR|\d) you
can expand that if needed (e.g. PO|P.O.|RR|\d)

Note that in that list,  the "\d" stands for a digit, so don't leave that out.

You'll still need the pipe-delimited list of city names, but once you give some
more information, we can probably construct that easily.

The formulas in my last post:

Last Name:    =remid($A1,Pattern,COLUMNS($A:A))

Copy/drag to the right, and the last argument will adjust to the proper Index
for each section of the string.

And the UDF itself is unchanged.
--ron
justinthered@earthlink.net - 12 Oct 2007 18:38 GMT
> Guys:
>
[quoted text clipped - 43 lines]
> > Many thanks,
> > Tom K

did you try, like I am doing right now on some text... copy your
column to a worksheet, export that worksheet to a CSV file, then on a
new worksheet, import from that CSV and you can then use the wizard to
set comma as delimeter and make it so you get each item into new
column...

then you can work the information from that stand point.

If this is to be a a one time deal.

just in colorado

Rate this thread:






 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.