How To: Regular Expression

Hi All,

 

I have just started using DPX. So far I have tried the demo, and it is wonderful. Congrats!

 

Now I am trying to create my own template, and I have started with a simple sample: I want to use OCR on my business card.

 

But I don't understand how regular expressions work.

On the left of my business card I have the FIRST NAME and LAST NAME (Sébastien Laporte).

What is the syntax? Do I need to type ^[a-z-\s]+$ ?

 

When I test my template, the application always recognizes only part of my first name and last name (bastien aporte).

I believe this is because of the capital letters and the accent (bastien Laporte).

So what is the expression that allows capital letters, accented letters, and lowercase letters?

 

(Tomorrow I will run a test with a passport.)

 

regards,

 

Sébastien

Comments


Hi Sébastien,

This is an excellent example of a common use case; thank you!

In your case, it seems you may not need regular expressions to accomplish your task. Regex should be used when an OCR value can be clearly defined (e.g. a date of birth in DD/MM/YY format can be translated into regex as "find two digits, then a slash, then two digits, then a slash, then two digits"). On a business card, there isn't much information that can be identified this way (maybe phone numbers, but since we're reading the entire block as one text value, this would not work). I would suggest that you use the character subset instead. This allows you to choose a group of allowable characters for a field (e.g. only letters for the name, title, and department fields).
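The date description above can be written out as a pattern. The following is a minimal sketch in Python, used here only as a stand-in to test the expression; the name `looks_like_date` is hypothetical, and DPX's own engine may differ in details.

```python
import re

# "two digits, a slash, two digits, a slash, two digits" (DD/MM/YY).
# ^ and $ require the whole value to have this shape.
DATE_DD_MM_YY = re.compile(r"^[0-9]{2}/[0-9]{2}/[0-9]{2}$")


def looks_like_date(value):
    """Return True when the OCR value has the DD/MM/YY shape."""
    return DATE_DD_MM_YY.match(value) is not None


print(looks_like_date("07/03/85"))      # shape matches
print(looks_like_date("7 March 1985"))  # shape does not match
```

Note that the pattern checks only the shape of the value, not whether the day and month are valid calendar values.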

Regarding the incorrect OCR, this may be partially due to the fact that European characters are not properly selected in this release. We have already fixed this on our end, and the fix should be included in the update you receive.
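Independently of that fix, the pattern from the question, ^[a-z-\s]+$, admits only lowercase ASCII letters, hyphens, and whitespace, so "S", "é", and "L" fall outside the character class. The sketch below, in Python as a stand-in, compares it with a class that also lists capitals and the accented letters of the Latin-1 range. The names `ORIGINAL` and `NAME` are hypothetical, and whether DPX accepts the same class syntax is an assumption.

```python
import re

# Pattern from the question: lowercase ASCII letters, hyphen, whitespace.
ORIGINAL = re.compile(r"^[a-z-\s]+$")

# Adds capitals and accented Latin-1 letters (À-Ö, Ø-ö, ø-ÿ).
# The hyphen is placed last so it is read as a literal character.
NAME = re.compile(r"^[A-Za-zÀ-ÖØ-öø-ÿ\s-]+$")

for text in ("Sébastien Laporte", "bastien aporte"):
    print(text, bool(ORIGINAL.match(text)), bool(NAME.match(text)))
```

Engines that follow Perl syntax often also offer a Unicode letter class such as \p{L}, but it is not confirmed here that DPX supports it.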

Additionally, I found that you defined only one region for identification, and it isn't a strong indicator for the form. Since this region is in the center left of the form, a wrong orientation can be accepted as correct if something similar appears in that same space when the form is rotated or flipped upside down (which does happen). I would suggest that you create at least two identification regions, and that they be in opposite corners of the form (e.g. on the business card, I would choose the Motorola logo and the telephone/email text as two regions). Adding a third region can help, but it depends on the form.

As a quick aside, note that your template would only work for Motorola business cards in this format; I noticed that mine has less information on it, and thus its regions may be slightly higher or lower than on your card. The person creating templates needs to be aware of these nuances and check whether one template can accommodate them all, for example by ensuring that all regions are large enough to be read on all slight variations of the form. We are working to tackle this problem in future releases, but currently, this is a limitation of the software.

Regards,

Lawrence


Do we support full standard regexp syntax?

Because I am trying to get this string (date DD/MM/YYYY and time HH:MM):

Example: 25/15/2013 20:35

I tried:

^[0-9]{2}\/[0-9]{2}\/[0-9]{4} [0-9]{2}:[0-9]{2}$       // what looks like a V is a backslash followed by a slash, since the slash is special and has to be escaped. Just ONE space.

^[0-9]{2}\/[0-9]{2}\/[0-9]{4}  *[0-9]{2}:[0-9]{2}$    // same, but with one or more spaces, just in case....

^[0-9]{2}\/[0-9]{2}\/[0-9]{4}  *[0-9]{2}:[0-9]{2} *$    // same, but with one or more spaces PLUS 0 or more trailing spaces

To no avail... so, do we have to escape the slashes?

Also, do we support regexp substitution (example, s/^[0-9](.*$)/\1/, which would simply remove the first digit)?


Please refer to documentation here: http://perldoc.perl.org/perlre.html.
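In the Perl syntax described there, the slash is special only when it is also used as the pattern delimiter (as in /.../); inside the pattern itself, escaping it is harmless but not required. The sketch below runs the three patterns from the question, plus a hypothetical variant with unescaped slashes, against the sample value. Python's `re` module is used only as a stand-in, so it does not prove how DPX's engine behaves.

```python
import re

# The three patterns from the question, plus one with unescaped slashes.
PATTERNS = {
    "one space": r"^[0-9]{2}\/[0-9]{2}\/[0-9]{4} [0-9]{2}:[0-9]{2}$",
    "one or more spaces": r"^[0-9]{2}\/[0-9]{2}\/[0-9]{4}  *[0-9]{2}:[0-9]{2}$",
    "trailing spaces allowed": r"^[0-9]{2}\/[0-9]{2}\/[0-9]{4}  *[0-9]{2}:[0-9]{2} *$",
    "unescaped slashes": r"^[0-9]{2}/[0-9]{2}/[0-9]{4} +[0-9]{2}:[0-9]{2} *$",
}


def matching_patterns(value):
    """Return the names of the patterns that match the given value."""
    return [name for name, pattern in PATTERNS.items() if re.match(pattern, value)]


print(matching_patterns("25/15/2013 20:35"))
print(matching_patterns("25/15/2013 20:35  "))
```

If all of these match in a standard engine but fail in the template, the cause may lie in the OCR text itself (for example a misread character or an extra line break) rather than in the escaping; that is a guess, not a confirmed diagnosis.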

I do not believe we support regex substitution.
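If substitution is indeed unavailable, one option is to apply it as a post-processing step on the recognized value outside the template. The sketch below shows the Python equivalent of s/^[0-9](.*$)/\1/; the function name `drop_leading_digit` is hypothetical, and how the value is exported from DPX is not covered here.

```python
import re


def drop_leading_digit(value):
    """Remove the first character when it is a digit; otherwise return the value unchanged."""
    # Equivalent of the Perl substitution s/^[0-9](.*$)/\1/
    return re.sub(r"^[0-9](.*$)", r"\1", value)


print(drop_leading_digit("12345"))
print(drop_leading_digit("abc"))
```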


Thanks! Good to know it's written in Perl.