Unicode Supported Regular Expressions & Validations

In any application, validations play an important role in keeping invalid data out of the database.
There are several ways to validate data before it is saved: native database constraints, client-side validations, controller-level validations, or model-level validations.
Here are the pros and cons of these alternatives:
Database constraints – Good for cases like uniqueness, but difficult to test and maintain.
Client-side validations – Can be bypassed, so they must be backed by server-side validations.
Controller-level validations – Controllers should be kept skinny whenever possible, so avoid these.
Model-level validations – The best way to ensure that only valid data is saved into the database. They are database agnostic, cannot be bypassed by end users, and are convenient to test and maintain.

Rails provides built-in helpers for common needs and makes validations easy. It also allows you to create your own validation methods.
While working on an internationalized (I18n) Ruby on Rails application, I came across a situation where I had to validate the format of data with a regular expression.
The regular expressions we typically use support only English characters.

For example:

If I want to validate that a person's name contains only letters, I would use the following:

<pre>

class Person < ActiveRecord::Base
  validates :name, format: { with: /\A[a-zA-Z]+\z/ }
end

</pre>

When we are working with only English data, this validation works well.

The problem comes when we want to store data from other languages, because the pattern only allows a-z and A-Z.
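To see the limitation concretely, here is a small plain-Ruby check (no Rails required) that runs the same pattern against a few names. The sample names are my own illustrative values, not data from the application.

```ruby
# The ASCII-only pattern used in the validation above.
ASCII_NAME = /\A[a-zA-Z]+\z/

# Sample names chosen purely for illustration.
samples = ["John", "José", "राम", "Müller"]

samples.each do |name|
  verdict = ASCII_NAME.match?(name) ? "accepted" : "rejected"
  puts "#{name}: #{verdict}"
end
# John: accepted
# José: rejected
# राम: rejected
# Müller: rejected
```

Any name containing a letter outside a-z and A-Z is rejected, even when it is a perfectly valid name in its own language.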

Here is a custom solution that we can apply in a Ruby on Rails application.

I created a class as follows:

<pre>

class AppRegexp

end

</pre>

I collected all the regular expressions used in the application in this class.

This keeps all the regular expressions together in one place, which makes them easier to maintain.

<pre>

class AppRegexp
  class << self
    # \A and \z anchor the whole string. Rails' format validator rejects
    # ^ and $ unless the :multiline option is set to true.
    regexps = {
      :name => Regexp.new(/\A[a-zA-Z]+\z/),
      :email => Regexp.new(/\A((\"[^\"\f\n\r\t\v\b]+\")|([\w\!\#\$\%\&\'\*\+\-\~\/\^\`\|\{\}]+(\.[\w\!\#\$\%\&\'\*\+\-\~\/\^\`\|\{\}]+)*))@((\[(((25[0-5])|(2[0-4][0-9])|([0-1]?[0-9]?[0-9]))\.((25[0-5])|(2[0-4][0-9])|([0-1]?[0-9]?[0-9]))\.((25[0-5])|(2[0-4][0-9])|([0-1]?[0-9]?[0-9]))\.((25[0-5])|(2[0-4][0-9])|([0-1]?[0-9]?[0-9])))\])|(((25[0-5])|(2[0-4][0-9])|([0-1]?[0-9]?[0-9]))\.((25[0-5])|(2[0-4][0-9])|([0-1]?[0-9]?[0-9]))\.((25[0-5])|(2[0-4][0-9])|([0-1]?[0-9]?[0-9]))\.((25[0-5])|(2[0-4][0-9])|([0-1]?[0-9]?[0-9])))|((([A-Za-z0-9\-])+\.)+[A-Za-z\-]+))\z/),
      # Ruby does not allow a quantifier on a lookahead, so it is applied once
      :password => Regexp.new(/\A(\S(?=.*\d)|\d)\S*\z/),
      :country_name => Regexp.new(/\A[a-zA-Z]+([\.\-\s]?[a-zA-Z]+)*\z/),
      :locale => Regexp.new(/\A[a-z]{2}(-[A-Z]{2})?\z/),
      :currency_code => Regexp.new(/\A[A-Z]+\z/),
      :zip => Regexp.new(/\A[a-zA-Z0-9]+([\-\s][a-zA-Z0-9]+)*\z/),
      :phone => Regexp.new(/\A[+#$]?([\s]?[0-9])+[$]{0,1}([\s]?[\-#xX\.)]?[0-9]+)*\z/),
      :mobile => Regexp.new(/\A[+#]?([\s]?[0-9])+([\s]?[\-#xX\.)]?[0-9]+)*\z/)
    }

    # Define a class-level reader (for example AppRegexp.email) for each pattern
    regexps.each_pair do |attribute, regex|
      define_method attribute do
        regex
      end
    end
  end
end

</pre>

These regular expressions from AppRegexp are then used as follows:

<pre>

class Person < ActiveRecord::Base
  validates :name, format: { with: AppRegexp.name }
end

</pre>
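The lookup pattern can also be tried outside Rails. The following is a minimal sketch with a shortened, hypothetical pattern list. The key `person_name` is my own name, chosen because a class method called `name` replaces Ruby's built-in `Class#name`, which normally returns the class name as a string.

```ruby
class AppRegexp
  class << self
    # Hypothetical, shortened pattern list for illustration.
    regexps = {
      person_name: /\A[a-zA-Z]+\z/,
      currency_code: /\A[A-Z]+\z/
    }

    # Inside `class << self`, define_method creates class-level methods,
    # so each key becomes a reader such as AppRegexp.person_name.
    regexps.each_pair do |attribute, regex|
      define_method(attribute) { regex }
    end
  end
end

puts AppRegexp.person_name.match?("John")   # true
puts AppRegexp.currency_code.match?("usd")  # false
puts AppRegexp.name                         # AppRegexp
```

Whether overriding `name` actually causes trouble depends on what else in the application calls it, so treat the renaming as a precaution rather than a requirement.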

When we work with only English characters, we mostly apply an inclusion rule.

That is, the characters must be in A-Z or a-z. When working with Unicode, we need to apply an exclusion rule instead.

In set-theory terms, there are three sets: characters (letters), digits, and special characters.

So when only character data should be allowed, we need to exclude digits and special characters.

When we need characters and digits, we need the union of the character and digit sets.

Put another way, we need to exclude special characters from the universal set.
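The difference between the two rules can be sketched in plain Ruby. The sets below are deliberately tiny and hypothetical, just large enough to show the idea: a negated character class `[^...]` accepts everything that is not listed.

```ruby
# Inclusion rule: list the characters that ARE allowed.
INCLUSION = /\A[a-zA-Z]+\z/

# Exclusion rule: list the characters that are NOT allowed.
# Tiny, hypothetical sets for illustration only.
SPECIALS     = Regexp.escape('!@%&*')
ASCII_DIGITS = "0-9"
EXCLUSION    = /\A[^#{SPECIALS}#{ASCII_DIGITS}]+\z/

puts INCLUSION.match?("राम")    # false
puts EXCLUSION.match?("राम")    # true
puts EXCLUSION.match?("John!")  # false
puts EXCLUSION.match?("राम1")   # false
puts EXCLUSION.match?("राम१")   # true
```

The last line shows why the excluded digit set has to cover more than 0-9: the Devanagari digit one is not in the tiny set above, so it slips through.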

Looking at all the possible cases for regular expressions, we can broadly categorize the data as below:
only_characters_allowed: Excludes all special characters and digits
characters_and_numbers_allowed: Excludes all special characters
only_special_characters_not_allowed: Rejects strings made up only of special characters; all other combinations are allowed
custom: Any other regular expression, as per custom requirements
To construct such regular expressions with support for Unicode data, I defined some constants to be used in the regular expressions.

<pre>

class AppRegexp
  # String of special characters, escaped for use inside a Regexp character class
  SPECIAL_CHARACTERS = [
    '\\~', '\\!', '\\@', '\\#', '\\$', '\\%', '\\^', '\\&', '\\*', '\\(', '\\)',
    '\\_', '\\+', '\\-', '\\=', '\\|', '\\{', '\\}', '\\[', '\\]', '\\:', '\\;',
    '\\"', '\\<', '\\>', '\\.', '\\?', '\\/',
    "\\\\\s" # Backslash & space character are purposely kept together in a string
  ].join

  # String of all the digits of the Unicode character set
  # Following list is taken from http://www.fileformat.info/info/unicode/category/Nd/list.htm
  DIGITS = [
    (0x0030..0x0039).to_a, # DIGIT ZERO to NINE
    (0x0660..0x0669).to_a, # ARABIC-INDIC DIGIT ZERO to NINE
    (0x06F0..0x06F9).to_a, # EXTENDED ARABIC-INDIC DIGIT ZERO to NINE
    (0x07C0..0x07C9).to_a, # NKO DIGIT ZERO to NINE
    (0x0966..0x096F).to_a, # DEVANAGARI DIGIT ZERO to NINE
    (0x09E6..0x09EF).to_a, # BENGALI DIGIT ZERO to NINE
    (0x0A66..0x0A6F).to_a, # GURMUKHI DIGIT ZERO to NINE
    (0x0AE6..0x0AEF).to_a, # GUJARATI DIGIT ZERO to NINE
    (0x0B66..0x0B6F).to_a, # ORIYA DIGIT ZERO to NINE
    (0x0BE6..0x0BEF).to_a, # TAMIL DIGIT ZERO to NINE
    (0x0C66..0x0C6F).to_a, # TELUGU DIGIT ZERO to NINE
    (0x0CE6..0x0CEF).to_a, # KANNADA DIGIT ZERO to NINE
    (0x0D66..0x0D6F).to_a, # MALAYALAM DIGIT ZERO to NINE
    (0x0E50..0x0E59).to_a, # THAI DIGIT ZERO to NINE
    (0x0ED0..0x0ED9).to_a, # LAO DIGIT ZERO to NINE
    (0x0F20..0x0F29).to_a, # TIBETAN DIGIT ZERO to NINE
    (0x1090..0x1099).to_a, # MYANMAR SHAN DIGIT ZERO to NINE
    (0x17E0..0x17E9).to_a, # KHMER DIGIT ZERO to NINE
    (0x1810..0x1819).to_a, # MONGOLIAN DIGIT ZERO to NINE
    (0x1946..0x194F).to_a, # LIMBU DIGIT ZERO to NINE
    (0x19D0..0x19D9).to_a, # NEW TAI LUE DIGIT ZERO to NINE
    (0x1A80..0x1A89).to_a, # TAI THAM HORA DIGIT ZERO to NINE
    (0x1A90..0x1A99).to_a, # TAI THAM THAM DIGIT ZERO to NINE
    (0x1B50..0x1B59).to_a, # BALINESE DIGIT ZERO to NINE
    (0x1BB0..0x1BB9).to_a, # SUNDANESE DIGIT ZERO to NINE
    (0x1C40..0x1C49).to_a, # LEPCHA DIGIT ZERO to NINE
    (0x1C50..0x1C59).to_a, # OL CHIKI DIGIT ZERO to NINE
    (0xA620..0xA629).to_a, # VAI DIGIT ZERO to NINE
    (0xA8D0..0xA8D9).to_a, # SAURASHTRA DIGIT ZERO to NINE
    (0xA900..0xA909).to_a, # KAYAH LI DIGIT ZERO to NINE
    (0xA9D0..0xA9D9).to_a, # JAVANESE DIGIT ZERO to NINE
    (0xAA50..0xAA59).to_a, # CHAM DIGIT ZERO to NINE
    (0xABF0..0xABF9).to_a, # MEETEI MAYEK DIGIT ZERO to NINE
    (0xFF10..0xFF19).to_a, # FULLWIDTH DIGIT ZERO to NINE
    (0x104A0..0x104A9).to_a, # OSMANYA DIGIT ZERO to NINE
    (0x11066..0x1106F).to_a, # BRAHMI DIGIT ZERO to NINE
    (0x110F0..0x110F9).to_a, # SORA SOMPENG DIGIT ZERO to NINE
    (0x11136..0x1113F).to_a, # CHAKMA DIGIT ZERO to NINE
    (0x111D0..0x111D9).to_a, # SHARADA DIGIT ZERO to NINE
    (0x116C0..0x116C9).to_a, # TAKRI DIGIT ZERO to NINE
    (0x1D7CE..0x1D7D7).to_a, # MATHEMATICAL BOLD DIGIT ZERO to NINE
    (0x1D7D8..0x1D7E1).to_a, # MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO to NINE
    (0x1D7E2..0x1D7EB).to_a, # MATHEMATICAL SANS-SERIF DIGIT ZERO to NINE
    (0x1D7EC..0x1D7F5).to_a, # MATHEMATICAL SANS-SERIF BOLD DIGIT ZERO to NINE
    (0x1D7F6..0x1D7FF).to_a # MATHEMATICAL MONOSPACE DIGIT ZERO to NINE
  ].flatten.pack('U*') # Encode every code point as UTF-8, independent of Encoding.default_internal
end

</pre>
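Turning code point ranges into a string has one detail worth checking. `Integer#chr` called without an argument relies on `Encoding.default_internal` for code points above 255 (Rails typically sets this to UTF-8), whereas `Array#pack` with the `U*` directive always produces UTF-8. A minimal sketch using two of the ranges from the list:

```ruby
# Two of the ranges from the DIGITS list, used here as a small sample.
sample_ranges = [
  (0x0030..0x0039), # DIGIT ZERO to NINE
  (0x0966..0x096F)  # DEVANAGARI DIGIT ZERO to NINE
]

# Array#pack with the "U*" directive encodes every code point as UTF-8.
digits = sample_ranges.flat_map(&:to_a).pack("U*")

puts digits                # 0123456789०१२३४५६७८९
puts digits.encoding       # UTF-8
puts digits.include?("५")  # true

# Integer#chr also works when the encoding is passed explicitly.
puts 0x0966.chr(Encoding::UTF_8)  # ०
```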

Next, I defined the following general regular expressions, which can be used to validate Unicode data.

<pre>

class AppRegexp
  class << self
    regexps = {
      # Excludes all special characters & digits
      :only_characters_allowed => Regexp.new(/\A([^#{SPECIAL_CHARACTERS}#{DIGITS}]+\s{0,1})+\z/),
      # Excludes all special characters
      :characters_and_numbers_allowed => Regexp.new(/\A([^#{SPECIAL_CHARACTERS}]+\s{0,1})+\z/),
      # Strings made up only of special characters are not allowed; all other combinations are allowed
      :only_special_characters_not_allowed => Regexp.new(/\A([^#{SPECIAL_CHARACTERS}]+(.)*)+\z/)
    }

    # Define a class-level reader for each pattern
    regexps.each_pair do |attribute, regex|
      define_method attribute do
        regex
      end
    end
  end
end

</pre>

Now our custom Unicode-aware regular expressions are ready and can be used to validate Unicode data.

For example:

<pre>

class Person < ActiveRecord::Base
  validates :name, format: { with: AppRegexp.only_characters_allowed }
end

</pre>

The complete code is as follows:

<pre>

class AppRegexp
  # String of special characters, escaped for use inside a Regexp character class
  SPECIAL_CHARACTERS = [
    '\\~', '\\!', '\\@', '\\#', '\\$', '\\%', '\\^', '\\&', '\\*', '\\(', '\\)',
    '\\_', '\\+', '\\-', '\\=', '\\|', '\\{', '\\}', '\\[', '\\]', '\\:', '\\;',
    '\\"', '\\<', '\\>', '\\.', '\\?', '\\/',
    "\\\\\s" # Backslash & space character are purposely kept together in a string
  ].join

  # String of all the digits of the Unicode character set
  # Following list is taken from http://www.fileformat.info/info/unicode/category/Nd/list.htm
  DIGITS = [
    (0x0030..0x0039).to_a, # DIGIT ZERO to NINE
    (0x0660..0x0669).to_a, # ARABIC-INDIC DIGIT ZERO to NINE
    (0x06F0..0x06F9).to_a, # EXTENDED ARABIC-INDIC DIGIT ZERO to NINE
    (0x07C0..0x07C9).to_a, # NKO DIGIT ZERO to NINE
    (0x0966..0x096F).to_a, # DEVANAGARI DIGIT ZERO to NINE
    (0x09E6..0x09EF).to_a, # BENGALI DIGIT ZERO to NINE
    (0x0A66..0x0A6F).to_a, # GURMUKHI DIGIT ZERO to NINE
    (0x0AE6..0x0AEF).to_a, # GUJARATI DIGIT ZERO to NINE
    (0x0B66..0x0B6F).to_a, # ORIYA DIGIT ZERO to NINE
    (0x0BE6..0x0BEF).to_a, # TAMIL DIGIT ZERO to NINE
    (0x0C66..0x0C6F).to_a, # TELUGU DIGIT ZERO to NINE
    (0x0CE6..0x0CEF).to_a, # KANNADA DIGIT ZERO to NINE
    (0x0D66..0x0D6F).to_a, # MALAYALAM DIGIT ZERO to NINE
    (0x0E50..0x0E59).to_a, # THAI DIGIT ZERO to NINE
    (0x0ED0..0x0ED9).to_a, # LAO DIGIT ZERO to NINE
    (0x0F20..0x0F29).to_a, # TIBETAN DIGIT ZERO to NINE
    (0x1090..0x1099).to_a, # MYANMAR SHAN DIGIT ZERO to NINE
    (0x17E0..0x17E9).to_a, # KHMER DIGIT ZERO to NINE
    (0x1810..0x1819).to_a, # MONGOLIAN DIGIT ZERO to NINE
    (0x1946..0x194F).to_a, # LIMBU DIGIT ZERO to NINE
    (0x19D0..0x19D9).to_a, # NEW TAI LUE DIGIT ZERO to NINE
    (0x1A80..0x1A89).to_a, # TAI THAM HORA DIGIT ZERO to NINE
    (0x1A90..0x1A99).to_a, # TAI THAM THAM DIGIT ZERO to NINE
    (0x1B50..0x1B59).to_a, # BALINESE DIGIT ZERO to NINE
    (0x1BB0..0x1BB9).to_a, # SUNDANESE DIGIT ZERO to NINE
    (0x1C40..0x1C49).to_a, # LEPCHA DIGIT ZERO to NINE
    (0x1C50..0x1C59).to_a, # OL CHIKI DIGIT ZERO to NINE
    (0xA620..0xA629).to_a, # VAI DIGIT ZERO to NINE
    (0xA8D0..0xA8D9).to_a, # SAURASHTRA DIGIT ZERO to NINE
    (0xA900..0xA909).to_a, # KAYAH LI DIGIT ZERO to NINE
    (0xA9D0..0xA9D9).to_a, # JAVANESE DIGIT ZERO to NINE
    (0xAA50..0xAA59).to_a, # CHAM DIGIT ZERO to NINE
    (0xABF0..0xABF9).to_a, # MEETEI MAYEK DIGIT ZERO to NINE
    (0xFF10..0xFF19).to_a, # FULLWIDTH DIGIT ZERO to NINE
    (0x104A0..0x104A9).to_a, # OSMANYA DIGIT ZERO to NINE
    (0x11066..0x1106F).to_a, # BRAHMI DIGIT ZERO to NINE
    (0x110F0..0x110F9).to_a, # SORA SOMPENG DIGIT ZERO to NINE
    (0x11136..0x1113F).to_a, # CHAKMA DIGIT ZERO to NINE
    (0x111D0..0x111D9).to_a, # SHARADA DIGIT ZERO to NINE
    (0x116C0..0x116C9).to_a, # TAKRI DIGIT ZERO to NINE
    (0x1D7CE..0x1D7D7).to_a, # MATHEMATICAL BOLD DIGIT ZERO to NINE
    (0x1D7D8..0x1D7E1).to_a, # MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO to NINE
    (0x1D7E2..0x1D7EB).to_a, # MATHEMATICAL SANS-SERIF DIGIT ZERO to NINE
    (0x1D7EC..0x1D7F5).to_a, # MATHEMATICAL SANS-SERIF BOLD DIGIT ZERO to NINE
    (0x1D7F6..0x1D7FF).to_a # MATHEMATICAL MONOSPACE DIGIT ZERO to NINE
  ].flatten.pack('U*') # Encode every code point as UTF-8, independent of Encoding.default_internal

  class << self
    # \A and \z anchor the whole string. Rails' format validator rejects
    # ^ and $ unless the :multiline option is set to true.
    regexps = {
      # Excludes all special characters & digits
      :only_characters_allowed => Regexp.new(/\A([^#{SPECIAL_CHARACTERS}#{DIGITS}]+\s{0,1})+\z/),
      # Excludes all special characters
      :characters_and_numbers_allowed => Regexp.new(/\A([^#{SPECIAL_CHARACTERS}]+\s{0,1})+\z/),
      # Strings made up only of special characters are not allowed; all other combinations are allowed
      :only_special_characters_not_allowed => Regexp.new(/\A([^#{SPECIAL_CHARACTERS}]+(.)*)+\z/),

      # Other regular expressions with custom requirements
      # Ruby does not allow a quantifier on a lookahead, so it is applied once
      :password => Regexp.new(/\A(\S(?=.*\d)|\d)\S*\z/),
      :country_name => Regexp.new(/\A[a-zA-Z]+([\.\-\s]?[a-zA-Z]+)*\z/),
      :locale => Regexp.new(/\A[a-z]{2}(-[A-Z]{2})?\z/),
      :currency_code => Regexp.new(/\A[A-Z]+\z/),
      :zip => Regexp.new(/\A[a-zA-Z0-9]+([\-\s][a-zA-Z0-9]+)*\z/),
      :phone => Regexp.new(/\A[+#$]?([\s]?[0-9])+[$]{0,1}([\s]?[\-#xX\.)]?[0-9]+)*\z/),
      :mobile => Regexp.new(/\A[+#]?([\s]?[0-9])+([\s]?[\-#xX\.)]?[0-9]+)*\z/)
    }

    # Define a class-level reader (for example AppRegexp.zip) for each pattern
    regexps.each_pair do |attribute, regex|
      define_method attribute do
        regex
      end
    end
  end
end

</pre>

The validations above accept characters from every Unicode script.

If we want to restrict input to the characters of specific languages, we can change the sets of characters provided in the constants.
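As a rough sketch of that idea, the same code-point technique can build an allowed set for a single script. The constant names and the choice of Devanagari are my own assumptions for illustration. The ranges cover the Devanagari block while skipping its danda punctuation (U+0964, U+0965) and digits (U+0966 to U+096F).

```ruby
# Hypothetical constant: characters of the Devanagari block (U+0900..U+097F),
# leaving out the danda punctuation and the Devanagari digits.
DEVANAGARI_CHARACTERS = [
  (0x0900..0x0963).to_a,
  (0x0970..0x097F).to_a
].flatten.pack("U*")

# One or more words of Devanagari characters, separated by single spaces.
DEVANAGARI_ONLY = /\A[#{DEVANAGARI_CHARACTERS}]+( [#{DEVANAGARI_CHARACTERS}]+)*\z/

puts DEVANAGARI_ONLY.match?("राम")        # true
puts DEVANAGARI_ONLY.match?("राम कुमार")  # true
puts DEVANAGARI_ONLY.match?("John")       # false
puts DEVANAGARI_ONLY.match?("राम१")       # false
```

Note that this variant lists the allowed characters rather than the excluded ones, since a single script is a small enough set to enumerate.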

These regular expressions follow the requirements for Unicode characters and can be adapted in the same way for digits.

This solution is written for a Ruby on Rails application, but the same technique can be used to build a variant in other languages.
