MS Office Forum / Excel / New Users / August 2007
Extracting some text from a string.
|
|
Thread rating:  |
Art MacNeil - 18 Aug 2007 01:18 GMT Hello,
I'm trying to extract some specific text from a bunch of strings of text.
I think this should be really easy but I can't get it done. I've used "Left" "Mid", "Substitute" and "Find" in one combination of another and sometimes it works but it doesn't always work.
Regardless of the path, what I need is, the string of text after the final "\" and before the ".xls"
I've copied a sample of the strings I'm working with below.
C:\Source\qa\T P 2006\QT, IR, QTW, IRW\Tax Components\Calculation Testing\Federal\5 year compare\5yearscomparePF.xls C:\Source\qa\T P 2006\QT, IR, QTW, IRW\Tax Components\Calculation Testing\Federal\CCA\CCA Line 1 to 2475.xls C:\Source\qa\T P 2006\QT, IR, QTW, IRW\Tax Components\Calculation Testing\Federal\CCA\CCA Lines 2475 & down.xls C:\Source\qa\T P 2006\QT, IR, QTW, IRW\Tax Components\Calculation Testing\Federal\RRSP Deduction pf\RRSP Deduction PF.xls E:\T Ps\T P 2007\QT, IR, QTW, IRW\Tax Components\Calculation Testing\Federal\5 year compare pf.xls E:\T Ps\T P 2007\QT, IR, QTW, IRW\Tax Components\Calculation Testing\Federal\CCA Line 1 to 2475.xls E:\T Ps\T P 2007\QT, IR, QTW, IRW\Tax Components\Calculation Testing\Federal\CCA Lines 2475 & down.xls E:\T Ps\T P 2007\QT, IR, QTW, IRW\Tax Components\Calculation Testing\Federal\RRSP Deduction pf.xls C:\Source\qa\T P 2006\QT, IR, QTW, IRW\Tax Components\Calculation Testing\Québec\Annexe A\Annexe A T P.xls C:\Source\qa\T P 2006\QT, IR, QTW, IRW\Tax Components\Calculation Testing\Québec\LSVCC\LSVCC (Québec) T P.xls C:\Source\qa\T P 2006\QT, IR, QTW, IRW\Tax Components\Calculation Testing\Québec\QVEH-B\QVEH-B series T P.xls C:\Source\qa\T P 2006\QT, IR, QTW, IRW\Tax Components\Calculation Testing\Québec\TP1-Comparative_Summary\TP1_Comparative_TAXSUMMARY_PF.xls E:\T Ps\T P 2007\QT, IR, QTW, IRW\Tax Components\Calculation Testing\Québec\Annexe A.xls E:\T Ps\T P 2007\QT, IR, QTW, IRW\Tax Components\Calculation Testing\Québec\LSVCC (Québec).xls E:\T Ps\T P 2007\QT, IR, QTW, IRW\Tax Components\Calculation Testing\Québec\QVEH B.xls E:\T Ps\T P 2007\QT, IR, QTW, IRW\Tax Components\Calculation Testing\Québec\TP1 Comp Summary pf.xls
The Text I need is: 5yearscomparePF CCA Line 1 to 2475 CCA Lines 2475 & down RRSP Deduction PF 5 year compare pf CCA Line 1 to 2475 CCA Lines 2475 & down RRSP Deduction pf Annexe A T P LSVCC (Québec) T P QVEH-B series T P TP1_Comparative_TAXSUMMARY_PF Annexe A LSVCC (Québec) QVEH B TP1 Comp Summary pf
Thank you for your help,
Art.
Rick Rothstein (MVP - VB) - 18 Aug 2007 01:51 GMT If your 'xls' extensions are all always in lower case letters, then this formula will work...
=SUBSTITUTE(SUBSTITUTE(A1,LEFT(A1,FIND("*",SUBSTITUTE(A1,"\","*",LEN(A1)-LEN(SUBSTITUTE(A1,"\",""))))),""),".xls","")
If the extensions can be all lower or all upper case letters, then this formula will work....
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A1,LEFT(A1,FIND("*",SUBSTITUTE(A1,"\","*",LEN(A1)-LEN(SUBSTITUTE(A1,"\",""))))),""),".xls",""),".XLS","")
Note though that neither formula will work if there is a mixture of upper and lower case letters in the extension part of the string.
Rick
> Hello, > [quoted text clipped - 63 lines] > > Art. MartinW - 18 Aug 2007 02:14 GMT Hi Rick,
Perhaps this formula which I believe was one of yours might work better. =MID(A1,FIND("~",SUBSTITUTE(A1,"\","~",LEN(A1)-LEN(SUBSTITUTE(A1,"\",""))))+1,999) then in C1 put =LEFT(B1,LEN(B1)-4) This allows for upper and lower case as well as .doc or any other extension.
Now if you could just show us how to combine the two.....I've been trying but I just can't get the syntax right.
Regards Martin
> If your 'xls' extensions are all always in lower case letters, then this > formula will work... [quoted text clipped - 79 lines] >> >> Art. Rick Rothstein (MVP - VB) - 18 Aug 2007 02:58 GMT Hey, I forgot I posted that one (I tend to re-create these things as the questions come in). Here is the combined formula...
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A1,LEFT(A1,FIND("*",SUBSTITUTE(A1,"\","*",LEN(A1)-LEN(SUBSTITUTE(A1,"\",""))))),""),".xls",""),".XLS","")
It is 38 characters longer than the one I posted to this thread and requires some 5 additional function calls (the extra overhead coming from having to remove the extension).
Rick
> Hi Rick, > [quoted text clipped - 94 lines] >>> >>> Art. Art MacNeil - 18 Aug 2007 02:35 GMT This worked perfectly:
=SUBSTITUTE(SUBSTITUTE(A1,LEFT(A1,FIND("*",SUBSTITUTE(A1,"\","*",LEN(A1)-LEN(SUBSTITUTE(A1,"\",""))))),""),".xls","")
I can't believe how small the formula is. Mine was huge, and it didn't work in all cases.
Thank you very much,
Art.
> If your 'xls' extensions are all always in lower case letters, then this > formula will work... [quoted text clipped - 79 lines] >> >> Art. Ron Coderre - 18 Aug 2007 03:24 GMT Try this:
B1: =SUBSTITUTE(RIGHT(A1,MATCH("\",INDEX(MID(A1,LEN(A1)+1-ROW($1:$255),1),0),0)-1),".xls","")
That assumes the file name does not exceed 255 characters.
Is that something you can work with? *********** Regards, Ron
XL2003, WinXP
> Hello, > [quoted text clipped - 63 lines] > > Art. Art MacNeil - 18 Aug 2007 07:55 GMT This worked as well.
Thank you,
Art.
> Try this: > [quoted text clipped - 81 lines] >> >> Art. Harlan Grove - 18 Aug 2007 08:19 GMT "Art MacNeil" <artmacneil@shaw.ca> wrote... ...
>Regardless of the path, what I need is, the string of text after the final >"\" and before the ".xls" ...
Base filenames.
Another alternative, define the name seq referring to the formula
=ROW(INDEX($1:$65536,1,1):INDEX($1:$65536,1024,1))
Then with one of these full pathnames in cell A2, try the formula
=SUBSTITUTE(REPLACE(A2,1,LOOKUP(2,1/(MID(A2,seq,1)="\"),seq),""),".xls","")
For more general parsing tasks, you'd be better off downloading and installing Laurent Longre's MOREFUNC.XLL add-in and using it's regular expression functions. Using it you could extract base filenames using
=REGEX.SUBSTITUTE(A2,"^.+\\([^\\]+?)(\.[^.\\]*)?$","[1]")
Ron Rosenfeld - 18 Aug 2007 12:08 GMT >"Art MacNeil" <artmacneil@shaw.ca> wrote... >... [quoted text clipped - 17 lines] > >=REGEX.SUBSTITUTE(A2,"^.+\\([^\\]+?)(\.[^.\\]*)?$","[1]") Or you could use (a little shorter):
=REGEX.MID(A1,"[^\\]*(?=\.xls$)")
or, if other than xls files might be present:
=REGEX.MID(A1,"[^\\]*(?=\.\w{3}$)")
--ron
Ron Rosenfeld - 18 Aug 2007 15:29 GMT >>"Art MacNeil" <artmacneil@shaw.ca> wrote... >>... [quoted text clipped - 27 lines] > >--ron The non-XLS file version regex could be augmented to something like:
[^\\]*(?=\.[^\\]*$)
or
[^\\]*(?=\.[^\\]+$)
--ron
Art MacNeil - 18 Aug 2007 19:25 GMT > On Sat, 18 Aug 2007 07:08:43 -0400, Ron Rosenfeld > <ronrosenfeld@nospam.org> [quoted text clipped - 42 lines] > > --ron Thank you.
I hate to say it, but I already had downloaded the MOREFUNC.XLL add-in installed. I just didn't realize it could be used to do this.
Art.
Ron Rosenfeld - 18 Aug 2007 20:00 GMT >Thank you. > >I hate to say it, but I already had downloaded the MOREFUNC.XLL add-in >installed. I just didn't realize it could be used to do this. > >Art. Once you get used to using Regular Expressions, you'll find that complex text manipulations become pretty simple.
By the way, so far as I know, MOREFUNC.XLL has not been updated to work with XL2007, so some of the functions don't work.
Also, one limitation of MOREFUNC is 255 characters. This is a limitation of the XLL, apparently. But the Regular Expression functions can all be duplicated in VBA, with only minor changes in the expressions. --ron
Harlan Grove - 18 Aug 2007 19:47 GMT "Ron Rosenfeld" <ronrosenfeld@nospam.org> wrote... ... ...
>>>=REGEX.SUBSTITUTE(A2,"^.+\\([^\\]+?)(\.[^.\\]*)?$","[1]") ...
>The non-XLS file version regex could be augmented to something like: > [quoted text clipped - 3 lines] > >[^\\]*(?=\.[^\\]+$) ...
I have nastier filenames on my system, e.g., .RHistory which begins with a period and has no other period. My REGEX.SUBSTITUTE formula returns it while your REGEX.MID formula returns "". Either may be appropriate depending on exactly how one wants to handle such filenames. However, there are also filenames that have no periods whatsoever, and with those my formula returns then while yours still returns "".
I had tried REGEX.MID with several variations on your ending assertion to handle such filenames, but I gave up and used (what I consider to be) the much more reliable REGEX.SUBSTITUTE.
Ron Rosenfeld - 19 Aug 2007 01:04 GMT >"Ron Rosenfeld" <ronrosenfeld@nospam.org> wrote... >... [quoted text clipped - 20 lines] >handle such filenames, but I gave up and used (what I consider to be) the >much more reliable REGEX.SUBSTITUTE. It is certainly the case that my REGEX.MID function assumes that the filename will have a suffix starting with a "."
How do you differentiate a file name that has no suffix from a subfolder name? I suppose you could look for a terminal "\", but you'd have to ensure that would be in the list you're examining. --ron
Harlan Grove - 19 Aug 2007 03:47 GMT "Ron Rosenfeld" <ronrosenfeld@nospam.org> wrote... ...
>How do you differentiate a file name that has no suffix from a subfolder >name? >I suppose you could look for a terminal "\", but you'd have to ensure that >would be in the list you're examining. Exactly. If there's a \ to the right, you're NOT looking at a filename. Conversely, folders CAN have periods in their names.
As for ensuring there are \s, it didn't (and still doesn't) seem relevant to this thread, but if filenames MIGHT lack drive/directory paths, then use
=REGEX.SUBSTITUTE(A1,"^(.+\\)?([^\\]+?)(\.[^.\\]+)?$","[2]")
Ron Rosenfeld - 19 Aug 2007 12:37 GMT >"Ron Rosenfeld" <ronrosenfeld@nospam.org> wrote... >... [quoted text clipped - 10 lines] > >=REGEX.SUBSTITUTE(A1,"^(.+\\)?([^\\]+?)(\.[^.\\]+)?$","[2]") OK. That makes sense. And it's been a long time since I've worked with other file systems. As usual, the best Regex to use depends on the nature of the data and what is to be extracted.
It is the case that "[^\\]*(?=\.[^\\]+$)" will match just the file names in all of the examples given by the OP. And if he were only interested in .xls files, then
=REGEX.MID(A2,"[^\\]*(?=\.xls$)")
But yours is more general, and could be easily adapted to other situations. --ron
|
|
|