MS Office Forum / Excel / Worksheet Functions / February 2007
Extracting a numbers from a text string
|
|
Thread rating:  |
IPerlovsky - 26 Feb 2007 19:22 GMT How would I extract a number (or series of numbers) using a single cell formula from an alphanumeric text string that also contains spaces? One key component of the number(s) in question, is that they have a space on either side. See the example of the text string below:
"A 544646-BA CALIF UN1MM+ 5 7/1/2017 FSA 108.579 3.90 3.82 8 Aaa AAA"
In this example, I am trying to extract the number "8". However, there is also the number "5", which appears first, and that I want to avoid extracting. Additionally, the formula should be able to extract any number, not specific to "8" or "5". This formula should always skip the first single number surrounded by spaces ("5" in this example) and extract the second number(s) ("8" in this example).
 Signature iperlovsky
CLR - 26 Feb 2007 19:30 GMT If you always have a space and 6 characters after the number you seek, then...
=LEFT(RIGHT(A1,9),1)*1
Vaya con Dios, Chuck, CABGx3
> How would I extract a number (or series of numbers) using a single cell > formula from an alphanumeric text string that also contains spaces? One key [quoted text clipped - 9 lines] > number surrounded by spaces ("5" in this example) and extract the second > number(s) ("8" in this example). IPerlovsky - 26 Feb 2007 19:42 GMT The text string is always different (ie, it does not always have a space and 6 characters after the number)....
Any other suggestions?
 Signature iperlovsky
> If you always have a space and 6 characters after the number you seek, then... > [quoted text clipped - 16 lines] > > number surrounded by spaces ("5" in this example) and extract the second > > number(s) ("8" in this example). Elkar - 26 Feb 2007 19:41 GMT Based solely on the criteria you provided, this should work:
=MID(MID(A1,SEARCH(" ? ",A1)+1,LEN(A1)),SEARCH(" ? ",MID(A1,SEARCH(" ? ",A1)+1,LEN(A1)))+1,1)
HTH, Elkar
> How would I extract a number (or series of numbers) using a single cell > formula from an alphanumeric text string that also contains spaces? One key [quoted text clipped - 9 lines] > number surrounded by spaces ("5" in this example) and extract the second > number(s) ("8" in this example). IPerlovsky - 26 Feb 2007 19:56 GMT Thanks, that worked great! One caveat though - what if the number in question comes with or without a "-" (negative sign) accompaniment (ie, "-8") and/or the number that I am trying not to extract has a decimal behind it with up to 2 decimpal places (ie, "5.75")?
 Signature iperlovsky
> Based solely on the criteria you provided, this should work: > [quoted text clipped - 17 lines] > > number surrounded by spaces ("5" in this example) and extract the second > > number(s) ("8" in this example). Pete_UK - 26 Feb 2007 20:03 GMT I think you might find it easier to use Data | Text-to-columns and split the data at each space. Then you would get (in different columns):
A 544646-BA CALIF UN1MM+ 5 7/1/2017 FSA 108.579 3.90 3.82 8 Aaa AAA
and then you can decide more easily which of these you want to take as your number.
Hope this helps.
Pete
> Thanks, that worked great! One caveat though - what if the number in question > comes with or without a "-" (negative sign) accompaniment (ie, "-8") and/or [quoted text clipped - 30 lines] > > - Show quoted text - Duke Carey - 26 Feb 2007 20:06 GMT You ask too much. From your first statement of the problem the target number is the second of 2 bracketed by spaces. Now you completely change the criteria. Moreover, there are at least TWO numbers with decimals in the example you provided, and you would want to skip them both, right?
You apparently want an infinitely flexible solution.
> Thanks, that worked great! One caveat though - what if the number in question > comes with or without a "-" (negative sign) accompaniment (ie, "-8") and/or [quoted text clipped - 22 lines] > > > number surrounded by spaces ("5" in this example) and extract the second > > > number(s) ("8" in this example). Elkar - 26 Feb 2007 20:07 GMT Hmm... if the number could contain decimal places, then based on your example, 3.90 would be returned (or perhaps even 108.579 if the number can be greater than 10). I think we would need to find a better set of criteria than "a number surrounded by spaces". Are there any other commonalities amongst your data that we can work with? (ie.. total number of spaces, first number from the right, etc...)
Or, perhaps if you posted some more examples of your data, we might be able to see a useful pattern.
> Thanks, that worked great! One caveat though - what if the number in question > comes with or without a "-" (negative sign) accompaniment (ie, "-8") and/or [quoted text clipped - 22 lines] > > > number surrounded by spaces ("5" in this example) and extract the second > > > number(s) ("8" in this example). IPerlovsky - 26 Feb 2007 20:23 GMT Okay, let me make this a bit clearer: The number I am looking for ("8" in the example) can be more than one digit (up to 3), but it will always be a number without decimals and a space on either side. The number in the example that comes first, which I want to avoid, ("5") can be a number with up to 2 decimal places - or no decimals like in the example. The formula should not pull any single letters that have spaces on either side. I provided 3 other strings as an example:
1/4/2007 CA 544646-BH LOS ANGELES CALIF UN 1000000 5 7/1/2022 FGIC 108.151 3.96 3.86 10 Aaa AAA Y 7/1/2016
...here I am looking for "10"
1/8/2007 CA 544644-NC LOS ANGELES CALIF UN 1000000 5.75 7/1/2016 MBIA 115.907 3.74 3.67 7 Aaa AAA N 7/1/2016
...here for "7"
1/8/2007 CA 544644-NC LOS ANGELES CALIF UN 1000000 5.75 7/1/2016 MBIA 115.907 3.74 3.67 17 Aaa AAA N 7/1/2016
...and here for "17"
your help is greatly appreciated.
 Signature iperlovsky
> Hmm... if the number could contain decimal places, then based on your > example, 3.90 would be returned (or perhaps even 108.579 if the number can be [quoted text clipped - 32 lines] > > > > number surrounded by spaces ("5" in this example) and extract the second > > > > number(s) ("8" in this example). Duke Carey - 26 Feb 2007 20:35 GMT So - is it always followed by the "Aaa AAA" sequence?
> Okay, let me make this a bit clearer: The number I am looking for ("8" in > the example) can be more than one digit (up to 3), but it will always be a [quoted text clipped - 57 lines] > > > > > number surrounded by spaces ("5" in this example) and extract the second > > > > > number(s) ("8" in this example). IPerlovsky - 26 Feb 2007 20:38 GMT It would appear that way from my examples, but the answer would have to be no.
thanks for giving this another crack...
 Signature iperlovsky
> So - is it always followed by the "Aaa AAA" sequence? > [quoted text clipped - 59 lines] > > > > > > number surrounded by spaces ("5" in this example) and extract the second > > > > > > number(s) ("8" in this example). Duke Carey - 26 Feb 2007 20:50 GMT OK, these lines represent bond issues, right? Is it fair to say that these would parse into (rough guess)
Date (purchase?) State of Obligation CUSIP Description Par Coupon Maturity Guarantor Price YTM YTC ? Call Date
And you want the ? value
What does it represent, and is there a better way of grabbing it?
> It would appear that way from my examples, but the answer would have to be no. > [quoted text clipped - 63 lines] > > > > > > > number surrounded by spaces ("5" in this example) and extract the second > > > > > > > number(s) ("8" in this example). IPerlovsky - 26 Feb 2007 20:57 GMT You are correct. I am looking for spread. The difference between what you call YTM and YTC, which are actually the benchmark yield and the bond's yield.
 Signature iperlovsky
> OK, these lines represent bond issues, right? > Is it fair to say that these would parse into (rough guess) [quoted text clipped - 84 lines] > > > > > > > > number surrounded by spaces ("5" in this example) and extract the second > > > > > > > > number(s) ("8" in this example). Duke Carey - 26 Feb 2007 21:31 GMT After the second date value, will you ALWAYS have separate numbers representing price, benchmark yield, and bond yield, and frequently but not always have a guarantor name, too? Any chance the guarantor name will have spaces?
> You are correct. I am looking for spread. The difference between what you > call YTM and YTC, which are actually the benchmark yield and the bond's yield. IPerlovsky - 26 Feb 2007 21:37 GMT Yes, price, benchmark yield, and bond yield will always be before spread and it will occasionally have a guarantor name that will not have a space.
 Signature iperlovsky
> After the second date value, will you ALWAYS have separate numbers > representing price, benchmark yield, and bond yield, and frequently but not [quoted text clipped - 3 lines] > > You are correct. I am looking for spread. The difference between what you > > call YTM and YTC, which are actually the benchmark yield and the bond's yield. Duke Carey - 26 Feb 2007 22:05 GMT Using some helper columns, these formulas worked on all 3 of your examples
B1: =RIGHT(A1,LEN(A1)-5-SEARCH("\",SUBSTITUTE(A1,"/","\",4))) C1: =IF(ISERROR(VALUE(LEFT(B1,1))),RIGHT(B1,LEN(B1)-SEARCH(" ",B1)),B1) D1: =RIGHT(C1,LEN(C1)-SEARCH("\",SUBSTITUTE(C1," ","\",3))) E1: =--LEFT(D1,SEARCH(" ",D1)-1)
If you sepnd some time, you can probably next some of these. Better yet, maybe Harlan Grove will jump in here with an elegant solution. He's a genius at this sort of stuff
> Yes, price, benchmark yield, and bond yield will always be before spread and > it will occasionally have a guarantor name that will not have a space. [quoted text clipped - 6 lines] > > > You are correct. I am looking for spread. The difference between what you > > > call YTM and YTC, which are actually the benchmark yield and the bond's yield. Duke Carey - 26 Feb 2007 21:35 GMT This will get you a string that starts with the guarantor or the price, where there is no guarantor
=RIGHT(A1,LEN(A1)-5-SEARCH("\",SUBSTITUTE(A1,"/","\",4)))
More to follow, based on your response to the earlier question
> You are correct. I am looking for spread. The difference between what you > call YTM and YTC, which are actually the benchmark yield and the bond's yield. [quoted text clipped - 87 lines] > > > > > > > > > number surrounded by spaces ("5" in this example) and extract the second > > > > > > > > > number(s) ("8" in this example). Elkar - 26 Feb 2007 20:43 GMT Ok, give this one a try:
=MID(A1,FIND("~",SUBSTITUTE(A1," ","~",14))+1,FIND("~",SUBSTITUTE(A1," ","~",15))-FIND("~",SUBSTITUTE(A1," ","~",14))-1)
I've made two assumptions here which may or may not be correct. First off, the number you're looking for is always between the 14th and 15th spaces. And, you don't have the ~ symbol in any of your data. If either assumption is incorrect, then I'll try something else.
HTH, Elkar
> Okay, let me make this a bit clearer: The number I am looking for ("8" in > the example) can be more than one digit (up to 3), but it will always be a [quoted text clipped - 57 lines] > > > > > number surrounded by spaces ("5" in this example) and extract the second > > > > > number(s) ("8" in this example). IPerlovsky - 26 Feb 2007 20:53 GMT We are getting closer. This one would not work because the numbers are not always going to be between the 14th and 15th spaces.
Here are some examples of where it would not work:
12/15/2006 CA 544644-L6 LOS ANGELES CALIF UN1MM+ 5 7/1/2018 AMBAC 108.748 3.79 3.71 8 Y Aaa AAA Y 7/1/2015
...here it pulls Y
2/6/2007 CA 544646-BA LOS ANGELES CALIF UN1MM+ 5 7/1/2017 FSA 108.579 3.90 3.82 8 Aaa AAA Y 7/1/2016
...here it pulls "Aaa"
 Signature iperlovsky
> Ok, give this one a try: > [quoted text clipped - 70 lines] > > > > > > number surrounded by spaces ("5" in this example) and extract the second > > > > > > number(s) ("8" in this example). Harlan Grove - 27 Feb 2007 00:13 GMT IPerlovsky <iperlov...@statestreet.com> wrote...
>We are getting closer. This one would not work because the numbers >are not always going to be between the 14th and 15th spaces. > >Here are some examples of where it would not work: ...
So far you've provided several examples, namely,
A 544646-BA CALIF UN1MM+ 5 7/1/2017 FSA 108.579 3.90 3.82 8 Aaa AAA
1/4/2007 CA 544646-BH LOS ANGELES CALIF UN 1000000 5 7/1/2022 FGIC 108.151 3.96 3.86 10 Aaa AAA Y 7/1/2016
1/8/2007 CA 544644-NC LOS ANGELES CALIF UN 1000000 5.75 7/1/2016 MBIA 115.907 3.74 3.67 7 Aaa AAA N 7/1/2016
1/8/2007 CA 544644-NC LOS ANGELES CALIF UN 1000000 5.75 7/1/2016 MBIA 115.907 3.74 3.67 17 Aaa AAA N 7/1/2016
12/15/2006 CA 544644-L6 LOS ANGELES CALIF UN1MM+ 5 7/1/2018 AMBAC 108.748 3.79 3.71 8 Y Aaa AAA Y 7/1/2015
2/6/2007 CA 544646-BA LOS ANGELES CALIF UN1MM+ 5 7/1/2017 FSA 108.579 3.90 3.82 8 Aaa AAA Y 7/1/2016
These don't all follow the same format, so simple parsing rules won't work. However, it APPEARS that what MIGHT work would be to ensure that the last token is nonnumeric, then find the last numeric token, where tokens are any substrings containing no spaces. If that spec holds, then define the name seq referring to
=ROW(INDEX(Sheet1!$1:$65536,1,1):INDEX(Sheet1!$1:$65536,255,1))
and if one of these records were in cell B1, try the array formula
=VLOOKUP(9.99999999999999E+307,--MID(TRIM(B1)&"#", SMALL(IF(MID(" "&TRIM(B1),seq,1)=" ",seq),seq), SMALL(IF(MID(TRIM(B1)&"# ",seq,1)=" ",seq),seq) -SMALL(IF(MID(" "&TRIM(B1),seq,1)=" ",seq),seq)),1)
Given the 6 example records above, this formula returns
8 10 7 17 8 8
The trick here is that the TRIM(B1)&"# " terms ensure that the last token converts to an error rather than a number. The MID call returns an array of the separate tokens in B1 followed by a lot of #NUM! errors after the final token. The -- converts the numeric tokens into numbers and the other tokens into error values, and the VLOOKUP call given its 1st arg and missing 4th arg returns the last number found in col 1 of its 2nd arg, the --MID(...) array.
Given all this, you'd be better off following Ron Rosenfeld's advice about using Laurent Longre's MOREFUNC.XLL add-in.
IPerlovsky - 27 Feb 2007 16:55 GMT thanks, advice taken...
 Signature iperlovsky
> IPerlovsky <iperlov...@statestreet.com> wrote... > >We are getting closer. This one would not work because the numbers [quoted text clipped - 56 lines] > Given all this, you'd be better off following Ron Rosenfeld's advice > about using Laurent Longre's MOREFUNC.XLL add-in. Ron Rosenfeld - 26 Feb 2007 21:50 GMT >How would I extract a number (or series of numbers) using a single cell >formula from an alphanumeric text string that also contains spaces? One key [quoted text clipped - 9 lines] >number surrounded by spaces ("5" in this example) and extract the second >number(s) ("8" in this example). Looking at your examples and trying to read a bit between the lines, it appears as if the integer number you are trying to extract can also be described as the last integer surrounded by spaces in the string.
That being the case, here is one solution:
Download and install Longre's free and easily distributable morefunc.xll add-in from http://xcell05.free.fr/
Then use this "regular expression" formula:
=REGEX.MID(A1,"(?<=\s)\d+(?=\s)",-1)
--ron
Ron Rosenfeld - 26 Feb 2007 21:52 GMT >How would I extract a number (or series of numbers) using a single cell >formula from an alphanumeric text string that also contains spaces? One key [quoted text clipped - 9 lines] >number surrounded by spaces ("5" in this example) and extract the second >number(s) ("8" in this example). Well, I just noted that the number to be extracted can be a negative number, so use this instead:
=REGEX.MID(A1,"(?<=\s)-?\d+(?=\s)",-1)
--ron
IPerlovsky - 26 Feb 2007 22:42 GMT Ron,
Thank you for the tip. I will study these functions to obtain a more thorough understanding of the application's potential.
 Signature iperlovsky
> >How would I extract a number (or series of numbers) using a single cell > >formula from an alphanumeric text string that also contains spaces? One key [quoted text clipped - 16 lines] > > --ron Ron Rosenfeld - 27 Feb 2007 02:38 GMT >Ron, > >Thank you for the tip. I will study these functions to obtain a more >thorough understanding of the application's potential. You're welcome. It works on all of the examples you've given.
=REGEX.MID(A1,"(?<=\s)-?\d+(?=\s)",-1)
The Regular Expression is the section within the double quote marks.
Translation:
(?<=\s)
Look for a space (but don't extract it)
followed by
-?
a minus sign which is optional
followed by
\d+
a string of digits
followed by
(?=\s)
a space (but don't extract it).
The -1 at the end says make this the very last sequence in the string that meets this description.
The following formula returns the same values, but, since we TRIM the spaces instead of altering the regular expression so as not to return the spaces, might be easier to understand:
=TRIM(REGEX.MID(A1,"\s-?\d+\s",-1))
The pattern that matches the regex would be:
<space><optional minus sign><series of digits><space> \s -? \d+ \s
--ron
|
|
|