Tuesday, May 12, 2015

Removal of combining diacritical marks from text

The Unicode standard describes how combining diacritical marks are applied to modify other characters.

In text entered by a user, some characters may be represented as a base character followed by one or more combining diacritical marks.
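
To make this concrete, here is a minimal sketch (not specific to Hebrew) that decomposes a precomposed character into a base letter plus a combining mark. It assumes Ruby 2.2 or later, where String#unicode_normalize is available; the variable names are just for illustration.

```ruby
# A precomposed "e with acute" (U+00E9) decomposes under NFD into
# "e" (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).
composed = "\u00E9"
decomposed = composed.unicode_normalize(:nfd)

# Code points of the decomposed form: base letter, then combining mark.
decomposed_code_points = decomposed.codepoints

# Removing every nonspacing mark (Unicode category Mn) leaves the base letter.
base_only = decomposed.gsub(/\p{Mn}/, '')

p decomposed_code_points
puts base_only
```

Hebrew niqqud is normally already stored as separate combining code points after each letter, so no normalization step is needed in the Hebrew example that follows.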


Sometimes it is useful to obtain the same text with the combining diacritical marks filtered out.

This is how it could be done for Hebrew niqqud, for instance.

Removal of Hebrew niqqud marks from text

To see a chart of the Hebrew diacritical marks, open this document: Hebrew Unicode block (PDF).

I'll use Ruby for the demonstration.

First, let's define the relevant Unicode ranges:

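# The Hebrew block as used in this post
HEBREW_WHOLE_UNICODE_RANGE = 0x0591..0x05F4
# Cantillation marks and niqqud (also covers some punctuation, e.g. maqaf U+05BE)
HEBREW_DIACRITICS_UNICODE_RANGE = 0x0591..0x05C7
# The letters alef through tav
HEBREW_LETTERS_UNICODE_RANGE = 0x05D0..0x05EA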
Here is how we remove the niqqud marks from the following Hebrew text:
בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ
Code in Ruby:
text.
  unpack('U*').
  delete_if {|c| HEBREW_DIACRITICS_UNICODE_RANGE.include?(c)}.
  pack('U*')
Here "text.unpack('U*')" means "unpack the string's Unicode characters into an array of code points":
text.unpack('U*')
 => [1489, 1468, 1456, 1512, 1461, 1488, 1513, 1473, 1460, 1497, 1514, 32, 1489, 1468, 1464, 1512, 1464, 1488, 32, 1488, 1457, 1500, 1465, 1492, 1460, 1497, 1501, 32, 1488, 1461, 1514, 32, 1492, 1463, 1513, 1468, 1473, 1464, 1502, 1463, 1497, 1460, 1501, 32, 1493, 1456, 1488, 1461, 1514, 32, 1492, 1464, 1488, 1464, 1512, 1462, 1509]
After applying the filter, we call pack('U*') on the array of code points to turn it back into a string.

Finally, we get the following text:
בראשית ברא אלהים את השמים ואת הארץ
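
As a possible alternative to the fixed code point range, one could filter by Unicode general category instead. The sketch below is an assumption of mine rather than the method used above: the helper name strip_nonspacing_marks is hypothetical, and the \p{Mn} property matches nonspacing marks in any script, not only Hebrew. The sample string is written with escapes so that each code point is visible.

```ruby
# Hypothetical helper: removes every nonspacing mark (category Mn).
# This is not limited to Hebrew; accents in other scripts are removed too.
def strip_nonspacing_marks(text)
  text.gsub(/\p{Mn}/, '')
end

# kaf, dagesh, qamats, lamed, maqaf, he, patah
pointed = "\u05DB\u05BC\u05B8\u05DC\u05BE\u05D4\u05B7"

# Category-based filter: the maqaf (U+05BE, punctuation) is kept.
by_category = strip_nonspacing_marks(pointed)

# Range-based filter from the post: the maqaf falls inside 0x0591..0x05C7
# and is removed together with the marks.
by_range = pointed.unpack('U*').reject { |c| (0x0591..0x05C7).include?(c) }.pack('U*')

p by_category.codepoints.map { |c| format('U+%04X', c) }
p by_range.codepoints.map { |c| format('U+%04X', c) }
```

Which variant is preferable depends on whether punctuation such as maqaf should survive the filtering.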
