イジハピ！: [No. 575] Using regular expressions is fun!

Are you familiar with "regular expression"?
I'm writing this article for people who are NOT familiar with them.


Regular expressions are one of the computer features for power users.
They don't look user-friendly, but don't be afraid.
They will help you a lot even if you learn only a little.

The term "regular expression" is already unfriendly to computer newbies.
It is NOT friendly to me, either.
I've heard that it comes from mathematics.
You can simply call it "regex".

Regex is a language for describing a "pattern" of strings.
In this context, the word "string" means a piece of text: a series of letters/characters.
(I know, you have to learn a lot of new words!)
Patterns enhance the search/replace functionality of text editors, word processors, programming languages, etc.

Search/replace is a great feature of text editors.
Imagine that you are writing a software manual that includes the following text.

...You can run this software with Windows xp and later.
We recommend you to use it with Windows xp.
We ensure that Windows xp is OK with this software. Blah blah...

The text is huge and contains the string "Windows xp" many times.

Then your boss asks you to change the string "Windows xp" to "Windows 8.1".

If you are very new to computers, you might check the whole text with your eyes and change the words manually.
But I think you won't do that.
You will use the search/replace function of the text editor instead.
Then all the words are corrected at once.
(The screen capture was taken in TextMate 2, an editor for Mac.)
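
If you prefer code to a text editor, the same fixed-string replace can be sketched in Python. This is only a small illustration; the variable names `manual` and `updated` are made up for this example.

```python
# A fixed-string replace: only the exact text "Windows xp" is changed.
manual = (
    "You can run this software with Windows xp and later.\n"
    "We recommend you to use it with Windows xp.\n"
    "We ensure that Windows xp is OK with this software."
)

# str.replace changes every occurrence of the exact string.
updated = manual.replace("Windows xp", "Windows 8.1")
print(updated)
```

No regex is needed here, because the text to find is always exactly the same.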

But suppose you have to change the following words:

from: "Windows xp", "Windows 95", "Windows 7", etc.
to: "Mac OS X"

Then you would use regex.

Fill the "find" box with the following text.

\bWindows \w+\b

This text is called a "pattern".
Then you have to turn on the check box "[ ] use regular expression".

The pattern matches various strings: Windows xp, Windows 7, Windows 95 and so on.
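
The same replacement can be tried outside a text editor. Here is a minimal sketch using Python's `re` module; the sample sentence and the names `text`, `pattern` and `replaced` are assumptions made for this example.

```python
import re

text = "It runs on Windows xp, Windows 95 and Windows 7."

# \bWindows \w+\b : the word "Windows", one space,
# then one or more word characters, ending at a word boundary.
pattern = re.compile(r"\bWindows \w+\b")

# Replace every match with the fixed string "Mac OS X".
replaced = pattern.sub("Mac OS X", text)
print(replaced)  # It runs on Mac OS X, Mac OS X and Mac OS X.
```

Be careful with a version like "Windows 8.1": \w+ stops at the period, so only "Windows 8" is replaced and ".1" is left behind.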

Let's confirm again:

If you fill the "find" box with the fixed string "Windows xp" and turn the "[ ] use regular expression" box off, only the one string "Windows xp" is replaced.
If you fill the "find" box with the pattern "\bWindows \w+\b" and turn the "[ ] use regular expression" box on, various strings like "Windows ???" are all replaced.

So a pattern is a string that matches various strings.

Regex is the language (a set of words and syntax) for describing patterns.

Let's translate the pattern "\bWindows \w+\b".

\b means a word boundary: the place where a word starts or ends.
It matches the position between a word character and something that is not one, such as the start of the text, the end of the text, a space, a period, a comma, and so on.

So the pattern "\bWindows\b" finds a match in strings like:
"Windows"
"The software for Windows:"
"The Windows engineers"

And it doesn't find a match in strings like:
"WindowsMania"
"BreakingWindows"

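You can check these five strings yourself. The following is a small sketch in Python; `word` and `samples` are names chosen only for this example.

```python
import re

word = re.compile(r"\bWindows\b")

samples = [
    "Windows",
    "The software for Windows:",
    "The Windows engineers",
    "WindowsMania",
    "BreakingWindows",
]

# search() looks for the pattern anywhere inside the string.
for s in samples:
    found = word.search(s) is not None
    print(repr(s), "match" if found else "no match")
```

The first three strings match. The last two do not, because "Windows" is glued to other letters there and no word boundary exists at that side.
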
\w means "word character".
It matches one letter or digit like "a", "b", "C", "D", "0", "1", "2", etc. (and, in most tools, the underscore "_").
I won't describe the exact definitions of the patterns here.
Please refer to the reference manual for your tools.

+ means "repeat one or more times".
"a+" would match "a", "aa", "aaa", "aaaaaaaaaaaaaaa" etc.
So the pattern \w+ would match "xp", "7", "Vista", etc.
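
Here is a quick way to try + and \w+ by yourself, sketched in Python. The names `a_plus` and `word_run` are made up for this example, and `fullmatch` is used so that the whole string must fit the pattern.

```python
import re

a_plus = re.compile(r"a+")
word_run = re.compile(r"\w+")

# fullmatch succeeds only when the whole string fits the pattern.
for s in ["a", "aa", "aaaaaaaaaaaaaaa", "", "b"]:
    print(repr(s), bool(a_plus.fullmatch(s)))

for s in ["xp", "7", "Vista", "", "8.1"]:
    print(repr(s), bool(word_run.fullmatch(s)))
```

The empty string fails in both loops because + needs at least one character, and "8.1" fails because the period is not a word character.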

So the pattern "\bWindows \w+\b" matches "Windows ???" surrounded by word boundaries.
Here ??? stands for any run of word characters.

Isn't it easy?

I will show you some examples below.
Note that \d means a digit (one numeric character), ? means zero or one time, [01] means 0 or 1, and [0-3] means 0, 1, 2 or 3.

\d\d\d\d-\d\d-\d\d
　match: 2014-01-01 2001-12-31 1999-05-01 9999-99-99(unexpected match)
　no match: aaaa-01-01 20100101 2011--05-01 900-9-9(unexpected non-match)

\d+-[01]?\d-[0-3]?\d
　match: 2014-01-01 2001-12-31 1999-05-01 900-9-9 9999-13-13(unexpected match)
　no match: aaaa-01-01 20100101 2011--05-01 9999-99-99
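
These two patterns can be checked against the sample strings with a few lines of Python. This is just a sketch; `first`, `second` and `samples` are names assumed for this example, and each string is tested as a whole with `fullmatch`.

```python
import re

first = re.compile(r"\d\d\d\d-\d\d-\d\d")
second = re.compile(r"\d+-[01]?\d-[0-3]?\d")

samples = [
    "2014-01-01", "2001-12-31", "1999-05-01",
    "900-9-9", "9999-13-13", "9999-99-99",
    "aaaa-01-01", "20100101", "2011--05-01",
]

# Print, for each string, whether the first and the second pattern accept it.
for s in samples:
    print(s, bool(first.fullmatch(s)), bool(second.fullmatch(s)))
```

Running it shows where the two patterns disagree: "900-9-9" is accepted only by the second one, and "9999-99-99" only by the first one.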

The first pattern works, but it is not very good.
The second one is better, but you can still improve it.
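
As one possible improvement (only a sketch, not the one right answer), the month can be limited to 01-12 and the day to 01-31. It uses three pieces of syntax that are not explained in this article: {4} means "exactly four times", | means "or", and parentheses group the alternatives. The name `better` is made up for this example.

```python
import re

# Year: exactly 4 digits. Month: 01-12. Day: 01-31.
better = re.compile(r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")

for s in ["2014-01-01", "2001-12-31", "9999-13-13", "9999-99-99", "900-9-9", "2014-02-31"]:
    print(s, bool(better.fullmatch(s)))
```

It still accepts "2014-02-31", because a pattern only looks at the shape of the text and knows nothing about the calendar.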

But sometimes a bad pattern is good enough for you, because the text might not contain so many errors.

The following pattern is interesting.

　(\w+)\s+\1

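Here (\w+) remembers a word, \s+ means one or more whitespace characters, and \1 means "the same text that the parentheses remembered".
So this matches repeated words like: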
　this this
　the the
　red red

These might be typos, so the regex works as a tiny error checker.
You can make as many patterns as you want to improve the quality of your writing.
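
A tiny checker like this could be sketched in Python as follows. The function name `find_repeats` is hypothetical, and the \b on both sides is my addition to the pattern: without it, "this is" would also be reported, because "this" ends with "is".

```python
import re

# (\w+) captures a word, \s+ is one or more whitespace characters,
# and \1 means "the same text that group 1 captured".
# The surrounding \b are an addition to avoid matching inside longer words.
repeated = re.compile(r"\b(\w+)\s+\1\b")

def find_repeats(text):
    """Return each word that appears twice in a row."""
    return [m.group(1) for m in repeated.finditer(text)]

print(find_repeats("I went to the the shop and saw a red red car. This is fine."))
# ['the', 'red']
```

The check is case-sensitive, so "The the" at the start of a sentence would not be found by this sketch.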

One piece of advice: you should not trust regex too much.
"the the" might be an error, but it might not be: it is also the name of a rock'n'roll band :-D

Soul Mining (2LP 30th Anniversary Deluxe Edition) [12 inch Analog]

The The Sony Music CMG 2014-06-30


query1000