How to match any Unicode letter with Regex in Elixir



Written by Percy Grunwald

— Last Updated February 22, 2019

If you need to match letters in a multilingual application, /[a-zA-Z]/ simply won’t cut it:

iex> String.match?("å", ~r/^[a-zA-Z]$/)
false

Assuming that your regular expression library supports Unicode matching, you can match any single Unicode code point in the “Letter” category with \p{L}. This matches letters of all languages, cases and types covered by Unicode: a, å, B, ß, β, ӝ, etc.

This regular expression is very useful if you need to enforce presence of letters in any language or absence of special characters in a string that could potentially be in any of the world’s written languages.

Unicode Regex in Elixir

Luckily for us Alchemists, Elixir’s Regex module supports Unicode matching and we simply need to supply the u option to ~r (sigil_r) to release its awesome power.

Basic Latin characters from English work, as you might expect:

iex> String.match?("a", ~r/^\p{L}$/u)
true
iex> String.match?("A", ~r/^\p{L}$/u)
true

Latin character variants with umlauts and acute accents are no problem either:

iex> String.match?("ö", ~r/^\p{L}$/u)
true
iex> String.match?("Á", ~r/^\p{L}$/u)
true

Let’s make sure it’s not just returning a match for any character by trying some symbols that aren’t letters:

iex> String.match?("$", ~r/^\p{L}$/u)
false
iex> String.match?("@", ~r/^\p{L}$/u)
false

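Be careful to remember the u option. Without it, the pattern is matched against raw bytes rather than Unicode code points, so a multi-byte character such as å will not return the expected result: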

iex> String.match?("å", ~r/^\p{L}$/)
false
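
To tie this back to the use case mentioned earlier (enforcing the presence of letters, or the absence of special characters), here is a minimal sketch of a validation helper. The module name LetterCheck and its function names are hypothetical, made up for illustration; only String.match?/2 and the u option come from Elixir itself. The sketch uses \A and \z instead of ^ and $ on the assumption that a trailing newline should be rejected too.

```elixir
defmodule LetterCheck do
  # Hypothetical helper module, for illustration only.

  # True when the string consists solely of one or more Unicode letters.
  # \A and \z anchor to the absolute start and end of the string.
  def letters_only?(string), do: String.match?(string, ~r/\A\p{L}+\z/u)

  # True when the string contains at least one Unicode letter anywhere.
  def contains_letter?(string), do: String.match?(string, ~r/\p{L}/u)
end

LetterCheck.letters_only?("Straße")    # true
LetterCheck.letters_only?("Москва")    # true
LetterCheck.letters_only?("Straße!")   # false, "!" is not a letter
LetterCheck.contains_letter?("1234")   # false
LetterCheck.contains_letter?("12β4")   # true
```

One caveat: an accent stored as a separate combining mark belongs to the \p{M} category rather than \p{L}, so it may be worth normalizing input with String.normalize(string, :nfc) before checking it.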

More cool stuff you can match with Unicode

Unicode matching is powerful because you can match on any of the Unicode general categories.

For instance, you could match only lowercase letters with \p{Ll}:

iex> String.match?("a", ~r/^\p{Ll}$/u)
true

iex> String.match?("A", ~r/^\p{Ll}$/u)
false

iex> String.match?("å", ~r/^\p{Ll}$/u)
true

iex> String.match?("Å", ~r/^\p{Ll}$/u)
false

Or match only currency symbols with \p{Sc}:

iex> String.match?("$", ~r/^\p{Sc}$/u)
true

iex> String.match?("€", ~r/^\p{Sc}$/u)
true

iex> String.match?("#", ~r/^\p{Sc}$/u)
false
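
The same approach should carry over to the other general categories. As a quick sketch (these particular examples are additions for illustration, not taken from the examples above), \p{Lu} covers uppercase letters, \p{Nd} decimal digits in any script, and \p{P} punctuation:

```elixir
# Uppercase letters only (Cyrillic "Ж" vs. its lowercase form "ж")
String.match?("Ж", ~r/^\p{Lu}$/u)   # true
String.match?("ж", ~r/^\p{Lu}$/u)   # false

# Decimal digits from any script (here: DEVANAGARI DIGIT THREE)
String.match?("३", ~r/^\p{Nd}$/u)   # true
String.match?("x", ~r/^\p{Nd}$/u)   # false

# Punctuation
String.match?("!", ~r/^\p{P}$/u)    # true
String.match?("$", ~r/^\p{P}$/u)    # false, "$" is a currency symbol (Sc)
```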

You could use this to determine whether an input string is a valid currency string without needing to manually list all possible currency symbols:

```elixir
iex> currency_string_regex = ~r/^\p{Sc}\d+\.\d{2}$/u
~r/^\p{Sc}\d+\.\d{2}$/u

iex> ["$1.00", "£1.00", "¥1.00", "€1.00", "&1.00"]
...> |> Enum.filter(&String.match?(&1, currency_string_regex))
["$1.00", "£1.00", "¥1.00", "€1.00"]
```
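
If you need the symbol and the amount separately, named captures are one option. The sketch below is just one way this might be done; the variable name currency_parts_regex and the capture names symbol and amount are made up for illustration.

```elixir
# Anchored pattern with named groups; the dot is escaped so that it
# matches a literal "." rather than any character.
currency_parts_regex = ~r/^(?<symbol>\p{Sc})(?<amount>\d+\.\d{2})$/u

Regex.named_captures(currency_parts_regex, "€12.50")
# %{"amount" => "12.50", "symbol" => "€"}

Regex.named_captures(currency_parts_regex, "&12.50")
# nil, because "&" is not a currency symbol
```

Regex.named_captures/2 returns nil when the string does not match, so the result can be pattern matched on directly.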

Got any examples of where Unicode matching is used in your applications? Please hit me up in the comments! 😃
