Edit filter log

Details for log entry 2994131

11:20, 17 May 2020: 2a00:23c4:7a95:6600:f87a:1f1b:e:9c21 triggered filter 70, «Удаление категорий» ("Category removal"), on the page Юникод. Action taken: Warn

Changes made in edit

Action parameters

Variable	Value
Edit count of the user (`user_editcount`)	null
Name of the user account (`user_name`)	'2A00:23C4:7A95:6600:F87A:1F1B:E:9C21'
Age of the user account (`user_age`)	0
Groups (including implicit) the user is in (`user_groups`)	[ 0 => '*' ]
Rights that the user has (`user_rights`)	[ 0 => 'createaccount', 1 => 'read', 2 => 'edit', 3 => 'createpage', 4 => 'createtalk', 5 => 'writeapi', 6 => 'viewmywatchlist', 7 => 'editmywatchlist', 8 => 'viewmyprivateinfo', 9 => 'editmyprivateinfo', 10 => 'editmyoptions', 11 => 'abusefilter-log-detail', 12 => 'urlshortener-create-url', 13 => 'centralauth-merge', 14 => 'abusefilter-view', 15 => 'abusefilter-log', 16 => 'vipsscaler-test', 17 => 'flow-hide' ]
Whether the user is editing from the mobile app (`user_app`)	false
Whether the user is editing through the mobile interface (`user_mobile`)	false
Page ID (`page_id`)	13025
Page namespace (`page_namespace`)	0
Page title without namespace (`page_title`)	'Юникод'
Full page title (`page_prefixedtitle`)	'Юникод'
Last ten users to contribute to the page (`page_recent_contributors`)	[ 0 => '2.62.199.107', 1 => 'Monedula', 2 => 'Vort', 3 => '82.215.96.6', 4 => 'Francuaza', 5 => 'InternetArchiveBot', 6 => 'Person or Persons Unknown', 7 => 'BabelStone', 8 => '0lderv1k', 9 => '31.8.203.240' ]
Page age in seconds (`page_age`)	493396985
Action (`action`)	'edit'
Edit summary/reason (`summary`)	''
Old content model (`old_content_model`)	'wikitext'
New content model (`new_content_model`)	'wikitext'
Old page wikitext, before the edit (`old_wikitext`)	'[[Файл:New Unicode logo.svg\|x200px\|thumb\|right\|Unicode Consortium logo]] '''Юнико́д'''<ref name=autogenerated1>{{cite web\|url=http://www.unicode.org/standard/UnicodeTranscriptions.html\|title=Unicode Transcriptions\|publisher=\|date=\|accessdate=2010-05-10\|lang=en\|archiveurl=https://web.archive.org/web/20060408204540/http://www.unicode.org/standard/UnicodeTranscriptions.html\|archivedate=2006-04-08\|deadlink=yes}}</ref> (the most common Russian form) or '''Унико́д'''<ref>[http://www.paratype.ru/help/term/terms.asp?code=361 Уникод в словаре Paratype]</ref> ({{lang-en\|Unicode}}) is a [[Набор символов\|character encoding]] standard that covers the characters of almost all of the world's written [[язык\|languages]]<ref name="unicode-techintro">{{cite web\|url=http://www.unicode.org/standard/principles.html\|title=The Unicode® Standard: A Technical Introduction\|accessdate=2010-07-04\|archiveurl=https://web.archive.org/web/20100310120125/http://www.unicode.org/standard/principles.html\|archivedate=2010-03-10\|deadlink=yes}}</ref>. The standard is currently the predominant one on the [[Интернет\|Internet]]. It was proposed in [[1991 год\|1991]] by the non-profit organization Unicode Consortium ({{lang-en\|Unicode Consortium, Unicode Inc.}})<ref>{{cite web\|url=http://www.unicode.org/history/publicationdates.html\|title=History of Unicode Release and Publication Dates\|accessdate=2010-07-04\|archiveurl=https://web.archive.org/web/20100110085403/http://www.unicode.org/history/publicationdates.html\|archivedate=2010-01-10\|deadlink=yes}}</ref><ref>{{cite web\|url=http://www.unicode.org/consortium/consort.html\|title=The Unicode Consortium\|accessdate=2010-07-04\|archiveurl=https://web.archive.org/web/20100627085503/http://www.unicode.org/consortium/consort.html\|archivedate=2010-06-27\|deadlink=yes}}</ref>.
Using this standard makes it possible to encode a very large number of characters from different writing systems: a document encoded in Unicode can combine Chinese [[иероглиф\|characters]], mathematical symbols, letters of the [[греческий алфавит\|Greek alphabet]], the [[латинский алфавит\|Latin alphabet]] and [[кириллица\|Cyrillic]], and musical notation symbols, with no need to switch [[кодовая страница\|code pages]]<ref name="unicode-foreword">{{cite web\|url=http://www.unicode.org/versions/Unicode5.2.0/Foreword.pdf\|title=Foreword\|accessdate=2010-07-04\|archiveurl=https://web.archive.org/web/20100627141434/http://www.unicode.org/versions/Unicode5.2.0/Foreword.pdf\|archivedate=2010-06-27\|deadlink=yes}}</ref>.

The standard consists of two main parts: the universal character set ({{lang-en\|Universal character set, UCS}}) and a family of encodings ({{lang-en\|Unicode transformation format, UTF}}). The universal character set lists the characters permitted by the Unicode standard and assigns each one a code in the form of a non-negative integer, usually written in hexadecimal with the prefix <code>U+</code>, for example <code>U+040F</code>. The family of encodings defines how character codes are converted for transmission in a stream or storage in a file.

Codes in the Unicode standard are divided into several areas. The area with codes from U+0000 to U+007F contains the characters of the [[ASCII]] set, and the codes of these characters coincide with their ASCII codes. It is followed by areas for the characters of other writing systems, punctuation marks and technical symbols. Some codes are reserved for future use<ref name='unicode-02'>{{cite web\|url=http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf\|title=General Structure\|accessdate=2010-07-05\|archiveurl=https://web.archive.org/web/20100627093139/http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf\|archivedate=2010-06-27\|deadlink=yes}}</ref>.
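The U+ notation and the ASCII-compatible range U+0000 to U+007F can be illustrated with a minimal Python sketch. The helper names `u_plus` and `is_ascii_range` are hypothetical and exist only for this illustration; `ord`, `chr` and `str.encode` are standard Python built-ins.

```python
def u_plus(ch: str) -> str:
    """Format the code point of a single character in the U+XXXX notation."""
    return f"U+{ord(ch):04X}"


def is_ascii_range(ch: str) -> bool:
    """Return True if the code point lies in U+0000..U+007F (the ASCII area)."""
    return ord(ch) <= 0x7F


# The code point used as the example in the text.
print(u_plus(chr(0x040F)))

# For characters in the ASCII area, the Unicode code equals the ASCII byte value.
letter = "A"
print(u_plus(letter), is_ascii_range(letter), letter.encode("ascii")[0] == ord(letter))
```

The sketch only formats and classifies code points; it says nothing about how a code is serialized into bytes, which is the job of the UTF encodings.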
Cyrillic characters are assigned the code ranges U+0400 to U+052F, U+2DE0 to U+2DFF and U+A640 to U+A69F (see [[Кириллица в Юникоде\|Cyrillic in Unicode]])<ref>{{cite web\|url=http://www.unicode.org/versions/Unicode5.2.0/ch07.pdf\|title=European Alphabetic Scripts\|accessdate=2010-07-04\|archiveurl=https://web.archive.org/web/20100627140856/http://www.unicode.org/versions/Unicode5.2.0/ch07.pdf\|archivedate=2010-06-27\|deadlink=yes}}</ref>.

== Background and development of Unicode ==

{{цитата\|Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.\|автор=Unicode Consortium<ref>[http://www.unicode.org/standard/translations/russian.html Что такое Unicode?]</ref>}}

By the late 1980s, 8-bit encodings had become the norm. A great many of them already existed, and new ones kept appearing. This was due both to the growing range of supported languages and to the desire to create encodings that were partially compatible with one another (a typical example is the [[альтернативная кодировка\|alternative encoding for Russian]], which arose because Western software written for the [[CP437]] encoding was in use). As a result, several problems emerged:
# the problem of incorrect decoding;
# the problem of a limited character set;
# the problem of converting one encoding into another;
# the problem of font duplication.

'''The problem of incorrect decoding''' caused documents to display characters from foreign languages that were never meant to be there, or unintended [[псевдографика\|box-drawing]] characters, which Russian-speaking users nicknamed «кракозябры» (krakozyabry, garbled text). The problem was largely caused by the lack of a standardized way to specify the encoding of a file or stream.
The problem could be solved either by consistently adopting a standard for declaring the encoding, or by adopting a single encoding common to all languages.<ref name='unicode-foreword' />

'''The problem of a limited character set'''<ref name='unicode-foreword' />. The problem could be solved either by switching fonts within a document or by adopting a "wide" encoding. Font switching had long been practiced in [[текстовый процессор\|word processors]], often with [[нестандартные шрифты\|fonts using a non-standard encoding]], the so-called "dingbat fonts". As a result, when a document was moved to another system, all non-standard characters turned into garbled text («кракозябры»).

'''The problem of converting one encoding into another'''. The problem could be solved either by compiling conversion tables for every pair of encodings, or by converting through an intermediate third encoding that includes all the characters of all encodings<ref>{{cite web\|url=http://www.unicode.org/history/unicode88.pdf\|title=Unicode 88\|accessdate=2010-07-08\|archiveurl=https://web.archive.org/web/20170906035012/http://unicode.org/history/unicode88.pdf\|archivedate=2017-09-06\|deadlink=yes}}</ref>.

'''The problem of font duplication'''. A separate font was created for each encoding, even when the character sets of the encodings overlapped partially or completely. The problem could be solved by creating "large" fonts from which the characters needed for a given encoding would later be selected. This, however, required a single registry of characters to determine what corresponds to what.

The need for a single "wide" encoding was recognized. Variable-length encodings, widely used in East Asia, were considered too complicated to work with, so fixed-width characters were chosen. Using 32-bit characters seemed too wasteful, so 16-bit characters were chosen.
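The incorrect-decoding problem and the "intermediate encoding" approach to conversion can be reproduced with a short Python sketch. It assumes two legacy Cyrillic encodings available in Python's standard codec registry (`cp1251` and `koi8_r`); the function name `transcode` is hypothetical, and this is only an illustration of the idea, not a description of any particular historical tool.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes between two legacy encodings, using Unicode text as the pivot."""
    return data.decode(src).encode(dst)


original = "Юникод"
stored = original.encode("cp1251")  # bytes as written by a Windows-1251 program

# Incorrect decoding: the same bytes read as KOI8-R yield unrelated letters.
garbled = stored.decode("koi8_r")
print(garbled != original)

# Conversion through Unicode: no dedicated cp1251-to-KOI8-R table is required,
# only one mapping per encoding to and from the common character set.
converted = transcode(stored, "cp1251", "koi8_r")
print(converted.decode("koi8_r") == original)
```

With N encodings, pairwise tables grow roughly as N squared, while a common pivot needs only one mapping per encoding, which is the design argument made in the text.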
The first version of Unicode was an encoding with a fixed character size of 16 bits, so the total number of codes was 2<sup>16</sup> ({{formatnum:65536}}). Since then, characters have been denoted by four hexadecimal digits (for example, <code>U+04F0</code>). The original plan was to encode not every existing character but only those needed in everyday use. Rarely used characters were to be placed in the "private use area" ({{lang\|en\|private use area}}), which initially occupied the codes <code>U+D800…U+F8FF</code>. So that Unicode could also serve as an intermediate step when converting between different encodings, it included all the characters present in all the best-known encodings.

Later, however, it was decided to encode all characters and therefore to expand the code space considerably. At the same time, character codes came to be treated not as 16-bit values but as abstract numbers that can be represented in a computer in many different ways (see [[#Способы представления\|representation methods]]).

Because a number of computer systems (for example, [[Windows NT]]<ref name="windows-nt">{{cite web\|url=http://support.microsoft.com/kb/99884\|title=Unicode and Microsoft Windows NT\|work=Microsoft Support\|lang=en\|archiveurl=https://web.archive.org/web/20090926092654/http://support.microsoft.com/kb/99884\|archivedate=2009-09-26\|accessdate=2009-11-12\|deadlink=yes}}</ref>) already used fixed 16-bit characters as their default encoding, it was decided to encode all the most important characters within the first {{formatnum:65536}} positions only (the so-called {{lang-en\|basic multilingual plane, BMP}}). The remaining space is used for "supplementary characters" ({{lang-en\|supplementary characters}}): writing systems of extinct languages, very rarely used [[китай\|Chinese]] characters, and mathematical and musical symbols.
For compatibility with older 16-bit systems, the [[UTF-16]] scheme was devised. In it, the first {{formatnum:65536}} positions, except for those in the range U+D800…U+DFFF, are represented directly as 16-bit numbers, while the rest are represented as "surrogate pairs" (the first element of the pair comes from the range U+D800…U+DBFF, the second from the range U+DC00…U+DFFF). The surrogate pairs took over part of the code space (2048 positions) that had been set aside "for private use". Since UTF-16 can represent only 2<sup>20</sup>+2<sup>16</sup>−2048 ({{formatnum:1112064}}) characters, this number was chosen as the final size of the Unicode code space (code range: 0x000000-0x10FFFF).

Although the Unicode code space was extended beyond 2<sup>16</sup> as early as version 2.0, the first characters in the "upper" area were assigned only in version 3.1.

The role of this encoding on the web keeps growing. At the beginning of 2010, about 50% of websites used Unicode<ref>{{cite web\|url=http://w3pro.ru/news/unicode-ispolzuetsya-pochti-na-50-veb-saitov\|title=Unicode используется почти на 50% веб-сайтов\|lang=ru\|archiveurl=https://web.archive.org/web/20100611042601/http://w3pro.ru/news/unicode-ispolzuetsya-pochti-na-50-veb-saitov\|archivedate=2010-06-11\|accessdate=2010-02-09\|deadlink=yes}}</ref>.

== Unicode versions ==

Work on refining the standard continues. New versions are released as the character tables are changed and extended. New documents of [[Международная организация по стандартизации\|ISO]]/IEC 10646 are released in parallel. The first standard was published in 1991; the most recent one as of this revision was published in 2020. Versions 1.0 to 5.0 of the standard were published as books and have an [[ISBN]]<ref>[http://www.unicode.org/history/publicationdates.html History of Unicode Release and Publication Dates]</ref><ref>[http://www.unicode.org/versions/enumeratedversions.html Enumerated Versions]</ref>.
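The UTF-16 surrogate-pair scheme described earlier in this article, together with the resulting code-space size, can be sketched in Python as follows. The function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical; the arithmetic follows the ranges quoted in the text (U+D800…U+DBFF for the first element, U+DC00…U+DFFF for the second), and the result is cross-checked against Python's built-in `utf-16-be` codec.

```python
from typing import Tuple


def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into two 16-bit units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000              # a 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits, lands in U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits, lands in U+DC00..U+DFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# 2^20 pair combinations plus 2^16 directly encoded positions,
# minus the 2048 positions taken by the surrogates themselves.
CODE_SPACE_SIZE = 2**20 + 2**16 - 2048

cp = 0x1D11E  # a musical symbol from the supplementary area
high, low = to_surrogate_pair(cp)
builtin = chr(cp).encode("utf-16-be")
print(builtin == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
print(from_surrogate_pair(high, low) == cp, CODE_SPACE_SIZE)
```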
The version number of the standard consists of three numbers (for example, 3.1.1). The third number changes when small changes are made to the standard that add no new characters (the exception is version 1.0.1, which added the {{iw\|Унифицированные идеограммы ККЯ\|unified ideographs of Chinese, Japanese and Korean writing\|en\|CJK Unified Ideographs}})<ref>[http://www.unicode.org/versions/index.html About Versions]</ref>.

The Unicode character database ([http://www.unicode.org/ucd/ Unicode Character Database]) is available for all versions on the official website in both plain-text and XML formats. The files are distributed under a BSD-like [http://www.unicode.org/copyright.html license].

{{Временная линия Юникода}}

{\| class="wikitable"
\|-
\|+ Unicode versions
\|-
! Version number
! Publication date
! [[Международный стандартный книжный номер\|ISBN]] of the book
! ISO/IEC 10646 edition
! Number of [[Письменность\|scripts]]
! Number of characters<ref group="A">'''Including''' graphic ({{lang-en\|graphic}}), control ({{lang-en\|control}}) and format ({{lang-en\|format}}) characters; '''excluding''' [[Области для частного использования\|private-use characters]] ({{Lang-en\|private-use}}), noncharacters ({{Lang-en\|noncharacters}}) and surrogates ({{lang-en\|surrogate code points}}).</ref>
! Changes
\|-
\| 1.0.0<ref>{{cite web\|title=Unicode® 1.0\|url=http://www.unicode.org/versions/Unicode1.0.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| October 1991
\| ISBN 0-201-56788-1 (Vol.1)
\|
\| {{formatnum:24}}
\| {{formatnum:7161}}
\| Initially Unicode contained the characters of the following scripts: [[арабское письмо\|Arabic]], [[армянское письмо\|Armenian]], [[бенгальское письмо\|Bengali]], [[Чжуинь\|Bopomofo]], [[кириллица\|Cyrillic]], [[деванагари\|Devanagari]], [[грузинское письмо\|Georgian]], [[Греческий алфавит\|Greek and Coptic]], [[Гуджарати (письмо)\|Gujarati]], [[гурмукхи\|Gurmukhi]], [[хангыль\|Hangul]], [[Еврейский алфавит\|Hebrew]], [[хирагана\|Hiragana]], [[Каннада (письмо)\|Kannada]], [[катакана\|Katakana]], [[лаосское письмо\|Lao]], [[Латинский алфавит\|Latin]], [[Малаялам (письмо)\|Malayalam]], [[Ория (письмо)\|Oriya]], [[тамильское письмо\|Tamil]], [[Телугу (письмо)\|Telugu]], [[тайское письмо\|Thai]] and [[тибетское письмо\|Tibetan]]<ref>{{cite web \| title = Unicode Data 1.0.0 \| url = http://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 1.0.1
\| June 1992
\| ISBN 0-201-60845-6 (Vol.2)
\|
\| {{formatnum:25}}
\| {{formatnum:28359}}
\| Added {{formatnum:20902}} {{iw\|Унифицированные идеограммы ККЯ\|unified ideographs of Chinese, Japanese and Korean writing\|en\|CJK Unified Ideographs}}<ref>{{cite web \| title = Unicode Data 1.0.1 \| url = http://www.unicode.org/Public/reconstructed/1.0.1/UnicodeData.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 1.1<ref>{{cite web\|title=Unicode® 1.1\|url=http://www.unicode.org/versions/Unicode1.1.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| June 1993
\|
\| ISO/IEC 10646-1:1993
\| {{formatnum:24}}
\| {{formatnum:34233}}
\| Added {{formatnum:4306}} [[Хангыль\|Hangul]] syllables, supplementing the {{formatnum:2350}} characters already in the encoding.
[[Тибетское письмо\|Tibetan]] characters were removed<ref>{{cite web \| title = Unicode Data 1995 \| url = http://www.unicode.org/Public/1.1-Update/UnicodeData-1.1.5.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 2.0<ref>{{cite web\|title=Unicode 2.0.0\|url=http://www.unicode.org/versions/Unicode2.0.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| July 1996
\| ISBN 0-201-48345-9
\| ISO/IEC 10646-1:1993 and Amendments 5, 6, 7
\| {{formatnum:25}}
\| {{formatnum:38950}}
\| The previously added [[Хангыль\|Hangul]] syllables were removed, and {{formatnum:11172}} new Hangul syllables were added with new codes. The previously removed [[Тибетское письмо\|Tibetan]] characters were restored; they received new codes and were placed in different tables. The surrogate ({{lang-en\|surrogate}}) mechanism was introduced. Space was allocated for planes ({{lang-en\|planes}}) [[Области для частного использования\|15 and 16]]<ref>{{cite web \| title = Unicode Data 2.0.14 \| url = http://www.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 2.1<ref>{{cite web\|title=Unicode 2.1.0\|url=http://www.unicode.org/versions/Unicode2.1.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| May 1998
\|
\| ISO/IEC 10646-1:1993, Amendments 5, 6, 7, two characters from Amendment 18
\| 25
\| {{formatnum:38952}}
\| Added the [[символ евро\|euro sign]]<ref>{{cite web \| title = Unicode Data 2.1.2 \| url = http://www.unicode.org/Public/2.1-Update/UnicodeData-2.1.2.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 3.0<ref>{{cite web\|title=Unicode 3.0.0\|url=http://www.unicode.org/versions/Unicode3.0.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| September 1999
\| ISBN 0-201-61633-5
\| ISO/IEC 10646-1:2000
\| {{formatnum:38}}
\| {{formatnum:49259}}
\| Added [[Чероки (письмо)\|Cherokee]], [[эфиопское письмо\|Ethiopic]], [[кхмерское письмо\|Khmer]], [[монгольские письменности\|Mongolian]], [[бирманское письмо\|Burmese]], [[огамическое письмо\|Ogham]], [[руны\|Runic]], [[сингальское письмо\|Sinhala]], [[сирийское письмо\|Syriac]], [[Тана (письмо)\|Thaana]], [[канадское слоговое письмо\|Canadian syllabics]] and [[письмо и\|Yi]], as well as [[Шрифт Брайля\|Braille]] patterns<ref>{{cite web \| title = Unicode Data 3.0.0 \| url = http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 3.1<ref>{{cite web\|title=Unicode 3.1.0\|url=http://www.unicode.org/versions/Unicode3.1.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| March 2001
\|
\| ISO/IEC 10646-1:2000 ISO/IEC 10646-2:2001
\| {{formatnum:41}}
\| {{formatnum:94205}}
\| Added [[Дезеретский алфавит\|Deseret]], [[готское письмо\|Gothic]] and {{iw\|древнеиталийское письмо\|Old Italic\|en\|Old Italic alphabet}}, as well as symbols of [[Современная музыкальная нотация\|Western]] and [[Византийская музыка\|Byzantine]] music and {{formatnum:42711}} {{iw\|Унифицированные идеограммы ККЯ\|unified ideographs of Chinese, Japanese and Korean writing\|en\|CJK Unified Ideographs}}.
Space was allocated for planes [[Плоскость (Юникод)#Дополнительная многоязычная плоскость\|1]], [[Плоскость (Юникод)#Дополнительная идеографическая плоскость\|2]] and [[Плоскость (Юникод)#Специализированная дополнительная плоскость\|14]]<ref>{{cite web \| title = Unicode Data 3.1.0 \| url = http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 3.2<ref>{{cite web\|title=Unicode 3.2.0\|url=http://www.unicode.org/versions/Unicode3.2.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| March 2002
\|
\| ISO/IEC 10646-1:2000 and Amendment 1 ISO/IEC 10646-2:2001
\| {{formatnum:45}}
\| {{formatnum:95221}}
\| Added [[Бухид (письмо)\|Buhid]], {{iw\|Письмо хануноо\|Hanunoo\|en\|Hanunó'o script}}, [[байбайин\|Baybayin]] and [[Тагбанва (письмо)\|Tagbanwa]]<ref>{{cite web \| title = Unicode Data 3.2.0 \| url = http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 4.0<ref>{{cite web\|title=Unicode 4.0.0\|url=http://www.unicode.org/versions/Unicode4.0.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| April 2003
\| ISBN 0-321-18578-1
\| ISO/IEC 10646:2003
\| {{formatnum:52}}
\| {{formatnum:96447}}
\| Added the [[кипрское письмо\|Cypriot syllabary]], [[Лимбу (письмо)\|Limbu]], [[линейное письмо Б\|Linear B]], the [[сомалийское письмо\|Somali script]], the [[Алфавит Шоу\|Shavian alphabet]], [[Тай-ныа#Письменность\|Tai Le]] and [[угаритское письмо\|Ugaritic]], as well as [[Гексаграмма (Ицзин)\|hexagram]] symbols<ref>{{cite web \| title = Unicode Data 4.0.0 \| url = http://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 4.1<ref>{{cite web\|title=Unicode 4.1.0\|url=http://www.unicode.org/versions/Unicode4.1.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| March 2005
\|
\| ISO/IEC 10646:2003 and Amendment 1
\| {{formatnum:59}}
\| {{formatnum:97720}}
\|
Added [[Лонтара\|Lontara]], [[глаголица\|Glagolitic]], [[Кхароштхи\|Kharoshthi]], [[новое письмо лы\|New Tai Lue]], [[древнеперсидская клинопись\|Old Persian cuneiform]], [[силхетское нагари\|Syloti Nagri]] and the [[древнеливийское письмо\|Old Libyan script]]. [[Коптское письмо\|Coptic]] characters were separated from [[Греческий алфавит\|Greek]] characters. Also added were [[Аттическая система счисления\|ancient Greek number symbols]], ancient Greek musical symbols and the [[символ гривны\|hryvnia sign]] (the currency of [[Украина\|Ukraine]])<ref>{{cite web \| title = Unicode Data 4.1.0 \| url = http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 5.0<ref>{{cite web\|title=Unicode 5.0.0\|url=http://www.unicode.org/versions/Unicode5.0.0/\|date=2006-07-14\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| July 2006
\| ISBN 0-321-48091-0
\| ISO/IEC 10646:2003, Amendments 1, 2, four characters from Amendment 3
\| {{formatnum:64}}
\| {{formatnum:99089}}
\| Added [[балийское письмо\|Balinese]], [[клинопись\|Cuneiform]], [[Нко (письмо)\|N'Ko]], the [[монгольское квадратное письмо\|Mongolian square script]] and [[финикийское письмо\|Phoenician]]<ref>{{cite web \| title = Unicode Data 5.0.0 \| url = http://www.unicode.org/Public/5.0.0/ucd/UnicodeData.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 5.1<ref>{{cite web\|title=Unicode 5.1.0\|url=http://www.unicode.org/versions/Unicode5.1.0/\|date=2008-04-04\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| April 2008
\|
\| ISO/IEC 10646:2003 and Amendments 1, 2, 3, 4
\| {{formatnum:75}}
\| {{formatnum:100713}}
\| Added [[карийское письмо\|Carian]], [[чамская письменность\|Cham]], [[Кая-ли\|Kayah Li]], [[Лепча (письмо)\|Lepcha]], [[Ликийский алфавит\|Lycian]], [[Лидийский алфавит\|Lydian]], [[Ол-чики\|Ol Chiki]], [[реджангское письмо\|Rejang]], [[Саураштра (письмо)\|Saurashtra]], [[сунданское письмо\|Sundanese]], [[Древнетюркское письмо\|Old Turkic]] and [[Ваи (письмо)\|Vai]].
Added [[Фестский диск\|Phaistos Disc symbols]], tile symbols for [[маджонг\|mahjong]] and [[домино\|dominoes]], the [[заглавная буква эсцет\|capital letter Eszett]] (ẞ), and Latin letters used in medieval [[Рукопись\|manuscripts]] for {{iw\|аббревация\|abbreviation\|en\|Scribal abbreviation}}. The [[Бирманское письмо\|Burmese]] character set was extended with new characters<ref>{{cite web \| title = Unicode Data 5.1.0 \| url = http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 5.2<ref>{{cite web\|title=Unicode® 5.2.0\|url=http://www.unicode.org/versions/Unicode5.2.0/\|date=2009-10-01\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| October 2009
\|
\| ISO/IEC 10646:2003 and Amendments 1, 2, 3, 4, 5, 6
\| {{formatnum:90}}
\| {{formatnum:107361}}
\| Added [[Авестийский алфавит\|Avestan]], [[Бамум (письменность)\|Bamum]], [[египетское иероглифическое письмо\|Egyptian hieroglyphs]] (following {{iw\|список Гардинера\|Gardiner's sign list\|en\|Gardiner's sign list}}, which contains {{formatnum:1071}} characters), [[имперское арамейское письмо\|Imperial Aramaic]], {{iw\|пахлевийское эпиграфическое письмо\|Inscriptional Pahlavi\|en\|Inscriptional Pahlavi}}, {{iw\|парфянское эпиграфическое письмо\|Inscriptional Parthian\|en\|Inscriptional Parthian}}, [[яванское письмо\|Javanese]], [[Кайтхи\|Kaithi]], [[Алфавит Фрейзера\|Lisu]], the [[Манипури (письмо)\|Manipuri script]], [[южноаравийское письмо\|Old South Arabian]], [[древнетюркское письмо\|Old Turkic]], [[самаритянское письмо\|Samaritan]], the [[Ланна (письмо)\|Lanna script]] and {{iw\|письмо тай-вьет\|Tai Viet\|en\|Tai Viet script}}.
Added {{formatnum:4149}} new {{iw\|унифицированные идеограммы китайского, японского, корейского письма\|unified ideographs of Chinese, Japanese and Korean writing\|en\|CJK Unified Ideographs}} (CJK-C), [[Ведийский язык\|Vedic]] characters and the [[символ тенге\|tenge sign]] (the currency of [[Казахстан\|Kazakhstan]]); the set of jamo for [[Хангыль\|Old Hangul]] was also extended<ref>{{cite web \| title = Unicode Data 5.2.0 \| url = http://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 6.0<ref>{{cite web\|title=Unicode® 6.0.0\|url=http://www.unicode.org/versions/Unicode6.0.0/\|date=2010-10-11\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| October 2010
\|
\| ISO/IEC 10646:2010 and the [[символ индийской рупии\|Indian rupee sign]]
\| {{formatnum:93}}
\| {{formatnum:109449}}
\| Added [[батакское письмо\|Batak]], [[Брахми\|Brahmi]] and [[мандейское письмо\|Mandaic]]. Added symbols for [[Игральные карты\|playing cards]], [[Дорожный знак\|road signs]], [[Географическая карта\|maps]], [[Алхимические символы\|alchemy]], [[эмотикон\|emoticons]] and [[эмодзи\|emoji]], as well as {{formatnum:222}} {{iw\|унифицированные идеограммы китайского, японского и корейского письма\|unified ideographs of Chinese, Japanese and Korean writing\|en\|CJK Unified Ideographs}} (CJK-D)<ref>{{cite web \| title = Unicode Data 6.0.0 \| url = http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 6.1<ref>{{cite web\|title=Unicode® 6.1.0\|url=http://www.unicode.org/versions/Unicode6.1.0/\|date=2012-01-31\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| January 2012
\|
\| ISO/IEC 10646:2012
\| {{formatnum:100}}
\| {{formatnum:110181}}
\| Added [[Чакма (письмо)\|Chakma]], [[Мероитское письмо\|Meroitic cursive and Meroitic hieroglyphs]], [[Письмо Полларда\|Miao]], [[Шарада (письмо)\|Sharada]], {{iw\|письмо соранг-сомпенг\|Sora Sompeng\|en\|Sora Sompeng}} and [[Такри\|Takri]]<ref>{{cite web \| title = Unicode Data 6.1.0 \| url = http://www.unicode.org/Public/6.1.0/ucd/UnicodeData.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 6.2<ref>{{cite web\|title=Unicode® 6.2.0\|url=http://www.unicode.org/versions/Unicode6.2.0/\|date=2012-09-26\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-07\|language=en}}</ref>
\| September 2012
\|
\| ISO/IEC 10646:2012 and the [[символ турецкой лиры\|Turkish lira sign]]
\| {{formatnum:100}}
\| {{formatnum:110182}}
\| Added the [[символ турецкой лиры\|Turkish lira sign]] (the currency of [[Турция\|Turkey]])<ref>{{cite web \| title = Unicode Data 6.2.0 \| url = http://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 6.3<ref>{{cite web\|title=Unicode® 6.3.0\|url=http://www.unicode.org/versions/Unicode6.3.0/\|date=2012-09-30\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-07\|language=en}}</ref>
\| September 2013
\|
\| ISO/IEC 10646:2012 and six characters
\| {{formatnum:100}}
\| {{formatnum:110187}}
\| Added five characters for formatting bidirectional text<ref>{{cite web \| title = Unicode Data 6.3.0 \| url = http://www.unicode.org/Public/6.3.0/ucd/UnicodeData.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 7.0<ref>{{cite web\|title=Unicode® 7.0.0\|url=http://www.unicode.org/versions/Unicode7.0.0/\|date=2014-06-16\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| 16 June 2014
\|
\| ISO/IEC 10646:2012, Amendments 1, 2 and the [[символ российского рубля\|ruble sign]]
\| {{formatnum:123}}
\| {{formatnum:113021}}
\| Added [[Басса (письмо)\|Bassa]], [[агванское письмо\|Caucasian Albanian]], [[Система Дюплойе\|Duployan shorthand]], [[эльбасанское письмо\|Elbasan]], [[Грантха\|Grantha]], {{iw\|письмо ходжики\|Khojki\|en\|Khojki}}, {{iw\|письменность худавади\|Khudawadi\|en\|Khudabadi alphabet}}, [[линейное письмо А\|Linear A]], {{iw\|письмо махаджани\|Mahajani\|en\|Mahajani}}, [[манихейское письмо\|Manichaean]], [[Кикакуи\|Kikakui]], [[Моди (письмо)\|Modi]], {{iw\|письмо мро\|Mro\|en\|Mro script}}, [[набатейское письмо\|Nabataean]],
[[Северноаравийские языки\|Old North Arabian]], [[древнепермское письмо\|Old Permic]], [[Пахау\|Pahawh]], [[Пальмирский алфавит\|Palmyrene]], {{iw\|письмо по чин хо\|Pau Cin Hau\|en\|Pau Cin Hau}}, {{iw\|письмо псалтирь пехлеви\|Psalter Pahlavi\|en\|Psalter Pahlavi}}, [[сиддхаматрика\|Siddham]], [[Тирхута\|Tirhuta]], [[варанг-кшити\|Warang Citi]] and {{iw\|дингбат\|ornamental dingbats\|en\|Dingbat}}, as well as the [[символ российского рубля\|Russian ruble sign]] and the [[символ азербайджанского маната\|Azerbaijani manat sign]]<ref>{{cite web \| title = Unicode Data 7.0.0 \| url = http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 8.0<ref>{{cite web\|title=Unicode® 8.0.0\|url=http://www.unicode.org/versions/Unicode8.0.0/\|date=2015-06-17\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| 17 June 2015
\|
\| ISO/IEC 10646:2014, Amendment 1, the [[символ лари\|lari sign]], 9 CJK unified ideographs, 41 [[эмодзи\|emoji]]
\| 129
\| {{formatnum:120737}}
\| Added [[Ахом (письмо)\|Ahom]], [[анатолийские иероглифы\|Anatolian hieroglyphs]], [[Хатран\|Hatran]], [[Мултани\|Multani]], [[венгерские руны\|Old Hungarian runes]], [[SignWriting]], {{formatnum:5776}} [[Унифицированные идеограммы ККЯ — расширение E\|CJK Unified Ideographs Extension E]], lowercase letters for [[чероки\|Cherokee]], Latin letters for German dialectology, 41 [[эмодзи\|emoji]], and five [[Шкала Фитцпатрика\|skin tone]] modifier symbols for emoticons.
Added the [[символ лари\|lari sign]] (the currency of [[Грузия\|Georgia]])<ref>{{cite web \| title = Unicode Data 8.0.0 \| url = http://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt \| lang = en \| accessdate = 2017-12-04 }}</ref>
\|-
\| 9.0<ref>{{cite web\|title=Unicode® 9.0.0\|url=http://www.unicode.org/versions/Unicode9.0.0/\|date=2016-06-21\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| 21 June 2016
\|
\| ISO/IEC 10646:2014, Amendments 1, 2, Adlam, Newa, Japanese TV symbols, 74 [[эмодзи\|emoji]] and symbols
\| 135
\| {{formatnum:128237}}
\| Added [[Адлам\|Adlam]], [[Бхайкшуки\|Bhaiksuki]], [[Марчен\|Marchen]], [[Нева (письмо)\|Newa]], [[Осейдж (письмо)\|Osage]] and [[тангутское письмо\|Tangut]], as well as 72 [[эмодзи\|emoji]] and Japanese symbols for television<ref>{{cite web \| title = Unicode Data 9.0.0 \| url = http://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt \| lang = en \| accessdate = 2017-12-06 }}</ref>
\|-
\| 10.0<ref>{{cite web\|title=Unicode® 10.0.0\|url=http://www.unicode.org/versions/Unicode10.0.0/\|date=2017-06-27\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>
\| 20 June 2017
\|
\| ISO/IEC 10646:2017, 56 [[эмодзи\|emoji]], 285 [[Хэнтайгана\|hentaigana]] characters, 3 Zanabazar Square characters
\| 139
\| {{formatnum:136755}}
\| Added [[Монгольские письменности#Горизонтальное квадратное письмо\|Zanabazar Square]], [[Соёмбо (письмо)\|Soyombo]], [[гонди Масарама\|Masaram Gondi]], [[Нюй-шу\|Nüshu]], [[Хэнтайгана\|hentaigana]], {{formatnum:7494}} [[Унифицированные идеограммы ККЯ — расширение F\|CJK Unified Ideographs Extension F]], as well as 56 [[эмодзи\|emoji]] and the [[биткойн\|bitcoin]] sign<ref>{{cite web \| title = Unicode Data 10.0.0 \| url = http://www.unicode.org/Public/10.0.0/ucd/UnicodeData.txt \| lang = en \| accessdate = 2017-12-07 }}</ref>
\|-
\| 11.0
\| June 2018
\|
\| ISO/IEC 10646:2017
\| 146
\| {{formatnum:137439}}
\| Added Dogra, [[грузинское письмо\|Georgian]] Mtavruli, Gunjala Gondi, [[ханифи\|Hanifi]], Indic Siyaq numbers, the [[Макасарский язык\|Makasar]] script, Medefaidrin, (Old) [[согдийское письмо\|Sogdian]], [[цифры майя\|Mayan numerals]], 5 CJK ideographs, [[сянци\|xiangqi]] symbols and half stars for ratings, as well as 145 [[эмодзи\|emoji]], four hairstyle modifier symbols for emoticons and the [[копилефт\|copyleft]] symbol<ref>{{Cite web\|url=http://www.unicode.org/Public/11.0.0/ucd/UnicodeData.txt\|title=Unicode Data 11.0.0\|author=\|website=\|date=\|publisher=\|accessdate=2019-04-12\|lang=en}}</ref><ref>[http://blog.unicode.org/2018/06/announcing-unicode-standard-version-110.html The Unicode Blog: Announcing The Unicode® Standard, Version 11.0]</ref><ref>[http://www.unicode.org/versions/Unicode11.0.0/ Unicode 11.0.0]</ref>
\|-
\| 12.0
\| March 2019
\|
\| ISO/IEC 10646:2017, Amendments 1, 2, plus 62 additional characters
\| 150
\| {{formatnum:137993}}
\| Added Elymaic, {{Не переведено 3\|надинагари\|Nandinagari\|3=en\|4=Nandinagari}}, Hmong, Wancho, additions for the [[Письмо Полларда\|Pollard script]], small [[кана\|kana]] for old Japanese texts, historical fractions and symbols of the [[Тамильское письмо\|Tamil script]], [[Лаосское письмо\|Lao]] letters for [[пали\|Pali]], Latin letters for Ugaritic transliteration, format control characters for Egyptian hieroglyphs, as well as 61 [[эмодзи\|emoji]]<ref>[http://blog.unicode.org/2019/03/announcing-unicode-standard-version-120.html The Unicode Blog: Announcing The Unicode® Standard, Version 12.0]</ref><ref>[http://www.unicode.org/versions/Unicode12.0.0/ Unicode 12.0.0]</ref>
\|-
\| 12.1
\| May 2019
\|
\|
\| 150
\| {{formatnum:137994}}
\| Added the square symbol for the [[рэйва\|Reiwa]] era<ref>[http://blog.unicode.org/2019/05/unicode-12-1-en.html The Unicode Blog: Unicode Version 12.1 released in support of the Reiwa Era]</ref><ref>[http://www.unicode.org/versions/Unicode12.1.0/ Unicode 12.1.0]</ref>
\|-
\| 13.0
\| March 2020
\|
\|
\| 154
\| {{formatnum:143859}}
\| Added [[хорезмийское письмо\|Chorasmian]], [[дивес акуру\|Dives Akuru]], the [[малое киданьское письмо\|Khitan small script]], [[езидское письмо\|Yezidi]],
{{formatnum:4969}} идеограмм ККЯ (включая {{formatnum:4939}} [[Унифицированные идеограммы ККЯ — расширение G]]), а также 55 [[эмодзи]], символы [[Creative Commons]] и символы для унаследованной вычислительной техники. Выделено место для [[Плоскость (Юникод)#Третичная идеографическая плоскость\|плоскости 3]]<ref>[http://blog.unicode.org/2020/03/announcing-unicode-standard-version-130.html The Unicode Blog: Announcing The Unicode Standard, Version 13.0]</ref><ref>[http://www.unicode.org/versions/Unicode13.0.0/ Unicode 13.0.0]</ref> \|- \|colspan="7" \| '''Примечания''' <references group="A" /> \|} == Code space == Although the UTF-8 and UTF-32 encoding forms could represent up to 2<sup>31</sup> ({{formatnum:2147483648}}) code positions, it was decided to use only {{formatnum:1112064}} of them for compatibility with UTF-16. Even that is far more than is currently needed: version 13.0 assigns only {{formatnum:143859}} code positions. The code space is divided into 17 ''[[Плоскость (Юникод)\|planes]]'' of 2<sup>16</sup> ({{formatnum:65536}}) code points each. Plane{{nbsp}}0 is called the ''basic'' plane and contains the characters of the most widely used scripts. The remaining planes are ''supplementary''. Plane{{nbsp}}1 is used mainly for historical scripts, plane{{nbsp}}2 for rarely used [[CJK\|Chinese characters (CJK)]], and plane{{nbsp}}3 is reserved for archaic Chinese characters<ref>[http://unicode.org/roadmaps/tip/ Roadmap to the TIP (Tertiary Ideographic Plane)]</ref>. Plane 14 is set aside for special-purpose characters. Planes 15 and 16 are allocated for private use<ref name='unicode-02' />. 
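The figure of 1,112,064 can be checked with a little arithmetic. The sketch below assumes the standard layout: 17 planes of 65,536 code points, minus the 2,048 surrogate code points U+D800…U+DFFF that UTF-16 reserves for itself. The names <code>PLANE_COUNT</code>, <code>scalar_value_count</code> and <code>plane_of</code> are illustrative only and do not come from the article.

```python
# Assumption: the code space runs from U+0000 to U+10FFFF (17 planes),
# and the surrogate range U+D800..U+DFFF cannot encode characters.
PLANE_COUNT = 17
PLANE_SIZE = 2 ** 16
SURROGATES = range(0xD800, 0xE000)


def scalar_value_count() -> int:
    """Number of code points usable for characters in every encoding form."""
    return PLANE_COUNT * PLANE_SIZE - len(SURROGATES)


def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) that contains the code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"outside the Unicode code space: {code_point:#x}")
    # Each plane holds 2**16 code points, so the plane is the high part.
    return code_point >> 16


if __name__ == "__main__":
    print(scalar_value_count())
    print(plane_of(0x044F), plane_of(0x1F600), plane_of(0x10FFFF))
```

Because every plane holds exactly 2<sup>16</sup> code points, the plane number is simply the code point shifted right by 16 bits.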
Unicode characters are denoted with the notation "U+''xxxx''" (for codes 0…FFFF), "U+''xxxxx''" (for codes 10000…FFFFF), or "U+''xxxxxx''" (for codes 100000…10FFFF), where each ''x'' is a [[шестнадцатеричная система счисления\|hexadecimal]] digit. For example, the character "я" (U+044F) has the code 044F{{sub\|16}}{{nbsp}}= 1103{{sub\|[[десятичная система счисления\|10]]}}. {\| class="wikitable sortable collapsible collapsed" \|- ! colspan="3" \| Unicode planes \|- ! Plane !! Name !! Code point range \|- \| 0 \|\| Basic Multilingual Plane (BMP) \|\| U+0000…U+FFFF \|- \| 1 \|\| Supplementary Multilingual Plane (SMP) \|\| U+10000…U+1FFFF \|- \| 2 \|\| Supplementary Ideographic Plane (SIP) \|\| U+20000…U+2FFFF \|- \| 3 \|\| Tertiary Ideographic Plane (TIP) \|\| U+30000…U+3FFFF \|- \| 4–13 \|\| unassigned \|\| U+40000…U+DFFFF \|- \| 14 \|\| Supplementary Special-purpose Plane (SSP) \|\| U+E0000…U+EFFFF \|- \| 15–16 \|\| Supplementary Private Use Areas (SPUA-A/B) \|\| U+F0000…U+10FFFF \|- \|} == Encoding system == The universal character encoding system (Unicode) is a set of graphic characters together with a method of encoding them for [[компьютер\|computer]] processing of text data. Graphic characters are characters that have a visible representation. They are contrasted with control characters and format characters. Graphic characters include the following groups: * letters that occur in at least one of the supported [[алфавит\|alphabets]]; * digits; * punctuation marks; * special signs ([[математика\|mathematical]], technical, [[идеограмма\|ideograms]], and so on); * separators. 
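The groups above can be explored with the general category that the Unicode character database assigns to every character. The following is a minimal sketch using Python's standard <code>unicodedata</code> module. The <code>GROUPS</code> table mapping the first letter of a category to one of the groups in the list is a simplifying assumption made for illustration, not a definition taken from the standard, and the helper names <code>u_plus</code> and <code>describe</code> are hypothetical.

```python
import unicodedata

# Assumed, simplified mapping from the first letter of the general
# category to the groups named in the text.
GROUPS = {
    "L": "letter",
    "N": "digit or number",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
}


def u_plus(ch: str) -> str:
    """Format a character in U+xxxx notation (at least four hex digits)."""
    return f"U+{ord(ch):04X}"


def describe(ch: str) -> tuple[str, str, str]:
    """Return (U+ notation, general category, group) for one character."""
    category = unicodedata.category(ch)
    return u_plus(ch), category, GROUPS.get(category[0], "other")


if __name__ == "__main__":
    for ch in "я5,+ ":
        print(describe(ch))
```

The same helper also reproduces the arithmetic from the notation example: <code>ord("я")</code> is 1103, which <code>u_plus</code> renders as U+044F.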
Unicode is a system for the linear representation of text. Characters with additional marks above or below them can be represented either as a sequence of codes built according to defined rules (the composite form, composite character) or as a single character (the monolithic form, precomposed character). Since 2014, all letters of the major scripts are considered to have been added to Unicode, and if a character is available in composite form there is no need to duplicate it in precomposed form. === Consortium policy === The Consortium does not create anything new; it records the established state of affairs<ref name="emoji">[http://www.unicode.org/faq/emoji_dingbats.html FAQ: Emoji{{nbsp}}& Dingbats]</ref>. For example, [[эмодзи\|emoji]] pictures were added because Japanese mobile operators used them widely. To ensure this, adding a character goes through a complex process<ref name="emoji" />. The [[символ российского рубля\|Russian ruble sign]], for instance, passed it in three months once it gained official status, although it had been in de facto use for many years before that and had been refused inclusion in Unicode. [[Товарный знак\|Trademarks]] are encoded only as an exception. Thus Unicode contains neither the [[Windows]] flag nor the [[Apple]] apple. Once a character has appeared in the encoding, it will never be moved or removed. If the order of characters needs to change, this is done not by moving code positions but through a language-specific collation order. There are other, subtler stability guarantees as well; for example, the normalization tables will not change<ref>[http://www.unicode.org/policies/stability_policy.html Unicode Character Encoding Stability Policy]</ref>. === Unification and duplication of characters === The same character can have several forms; in Unicode these forms share a single code position: * if this is how it developed historically. For example, [[арабское письмо\|Arabic letters]] have four forms: isolated, initial, medial, and final<ref>Specific forms of Arabic letters were later given separate positions. 
Even so, it is still recommended to write Arabic with the "general" variants of the letters.</ref>; * or if one form is customary in one language and another form in another. [[Болгарская кириллица (типографика)\|Bulgarian Cyrillic]] differs from Russian, and Chinese characters differ from Japanese ones. On the other hand, if different forms historically had separate characters in fonts, they remain separate in Unicode too. For example, the lowercase Greek [[Сигма (буква)\|sigma]] has two forms with different codes in Unicode; the [[Расширенная латиница\|extended Latin]] letter{{nbsp}}[[Å (латиница)\|Å]] ({{nobr\|A with ring}}) and the [[ангстрем\|angstrom]] sign{{nbsp}}Å, as well as the Greek letter{{nbsp}}[[Мю\|μ]] and the "[[микро-\|micro-]]" prefix sign{{nbsp}}µ, likewise have different code positions. Naturally, similar-looking characters in unrelated scripts are also placed at different code positions. For example, the letter{{nbsp}}А in the [[Латиница\|Latin]], [[Кириллица\|Cyrillic]], [[Греческий алфавит\|Greek]], and [[Письмо чероки\|Cherokee]] scripts is four different characters. Very rarely, one and the same character is placed at two different code positions to simplify text processing. The [[Штрих (математика)\|mathematical prime]] and the identical prime used to mark [[Мягкий звук\|palatalization]] are different characters; the second is considered a letter. == Combining characters == [[Файл:Diacritic-j.png\|right\|thumb\|Representation of the character "Й" (U+0419) as the base character "И" (U+0418) and the combining character " ̆" (U+0306).]] Characters in Unicode are divided into base characters and combining characters. Combining characters usually follow the base character and change how it is displayed. They include, for example, [[диакритические знаки\|diacritical marks]] and stress marks. For example, the Russian letter "Й" can be written in Unicode as the base character "И" (U+0418) followed by the combining character " ̆" (U+0306), which is displayed above the base character. 
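The "Й" example can be inspected directly. The sketch below uses Python's standard <code>unicodedata</code> module to list the code points of both spellings; the helper name <code>code_points</code> is illustrative, not something the article defines.

```python
import unicodedata


def code_points(text: str) -> list[tuple[str, str, str, int]]:
    """List (U+ notation, name, general category, combining class) per code point."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.name(ch),
            unicodedata.category(ch),
            unicodedata.combining(ch),  # 0 means "not a combining mark"
        )
        for ch in text
    ]


# Base letter followed by a combining breve, and the single precomposed letter.
decomposed = "\u0418\u0306"
precomposed = "\u0419"

if __name__ == "__main__":
    for row in code_points(decomposed) + code_points(precomposed):
        print(row)
    # The two spellings look the same but are different code point sequences.
    print(decomposed == precomposed, len(decomposed), len(precomposed))
```

The two strings render identically yet compare as unequal, because one contains two code points and the other contains one.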
In the Unicode character tables, combining characters are marked with special categories: * Nonspacing Mark: such marks are usually displayed above or below the base character and do not occupy a horizontal position (advance) of their own in the displayed line; * Enclosing Mark: these characters also do not occupy a horizontal position of their own, but are displayed on several sides of the base character at once; * Spacing Combining Mark: a combining mark that, like a base character, occupies its own horizontal position in the displayed line. A special kind of combining character is the variation selector. Variation selectors act only on those base characters for which variants are defined. For example, in Unicode 5.0 variants are defined for a number of mathematical symbols, for characters of the traditional [[монгольский алфавит\|Mongolian alphabet]], and for characters of the [[Монгольское квадратное письмо\|Phags-pa script]]. == Normalization algorithms == Because Unicode has combining characters, the same written signs can be represented by different codes. For example, the letter "Й" from the example above can be written either as a single character or as a base character followed by a combining character. As a result, comparing strings byte by byte no longer reliably detects equal text. Normalization algorithms (normalization forms) solve this problem by converting characters to a defined standard form. The conversion replaces characters with equivalent ones using tables and rules. "Decomposition" is the replacement of one character by several constituent characters, and "composition" is the reverse: the replacement of several constituent characters by one. The Unicode standard defines four text normalization algorithms: NFD, NFC, NFKD, and NFKC. 
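As a quick illustration of the four forms, the sketch below applies each of them with <code>unicodedata.normalize</code> from the Python standard library. The helper names <code>normalize_all</code> and <code>canonically_equal</code> are hypothetical; comparing NFD forms is one common way to test canonical equivalence, not the only one.

```python
import unicodedata

FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def normalize_all(text: str) -> dict[str, str]:
    """Apply every normalization form to the text."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    for sample in ("\u0419", "\uFB01", "\u1E9B\u0323"):
        forms = normalize_all(sample)
        # Print code points rather than glyphs so the differences are visible.
        print({form: [f"U+{ord(c):04X}" for c in s] for form, s in forms.items()})
    print(canonically_equal("\u0419", "\u0418\u0306"))
```

Note that the ligature "ﬁ" (U+FB01) is changed only by the compatibility forms NFKD and NFKC, so it is not canonically equal to the two letters "fi".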
=== NFD === NFD, {{lang-en\|'''n'''ormalization '''f'''orm '''D'''}} («D» от {{lang-en\|'''d'''ecomposition}}), форма нормализации D — каноническая декомпозиция — алгоритм, согласно которому выполняется рекурсивное разложение составных символов ({{lang-en\|precomposed characters}}) на последовательность из одного или нескольких простых символов в соответствии с таблицами декомпозиции. Рекурсивное потому, что в процессе разложения составной символ может быть разложен на несколько других, некоторые из которых тоже являются составными, и к которым применяется дальнейшее разложение. Примеры: {\| style="text-align:center" \| {\| class="wikitable" style="text-align:center" \| Ω \|- \| <small>U+2126</small> \|} \| colspan="2" \| → \| {\| class="wikitable" \| Ω \|- \| <small>U+03A9</small> \|} \|} {\| style="text-align:center" \| {\| class="wikitable" style="text-align:center" \| <big>Å</big> \|- \| <small>U+00C5</small> \|} \| colspan="2" \| → \| {\| class="wikitable" \| <big>A</big> \|- \| <small>U+0041</small> \|} \| {\| class="wikitable" \| <big> ̊</big> \|- \| <small>U+030A</small> \|} \|} {\| style="text-align:center" \| {\| class="wikitable" style="text-align:center" \| <big>ṩ</big> \|- \| <small>U+1E69</small> \|} \| colspan="2" \| → \| {\| class="wikitable" \| <big>s</big> \|- \| <small>U+0073</small> \|} \| {\| class="wikitable" \| <big> ̣</big> \|- \| <small>U+0323</small> \|} \| {\| class="wikitable" \| <big> ̇</big> \|- \| <small>U+0307</small> \|} \|} {\| style="text-align:center" \| {\| class="wikitable" style="text-align:center" \| colspan="2" \| <big>ḍ̇</big> \|- \| <small>U+1E0B</small> \|\| <small>U+0323</small> \|} \| colspan="2" \| → \| {\| class="wikitable" \| <big>d</big> \|- \| <small>U+0064</small> \|} \| {\| class="wikitable" \| <big> ̣</big> \|- \| <small>U+0323</small> \|} \| {\| class="wikitable" \| <big> ̇</big> \|- \| <small>U+0307</small> \|} \|} {\| style="text-align:center" \| {\| class="wikitable" style="text-align:center" \| colspan="3" 
\| <big>q̣̇</big> \|- \| <small>U+0071</small> \|\| <small>U+0307</small> \|\| <small>U+0323</small> \|} \| colspan="2" \| → \| {\| class="wikitable" \| <big>q</big> \|- \| <small>U+0071</small> \|} \| {\| class="wikitable" \| <big> ̣</big> \|- \| <small>U+0323</small> \|} \| {\| class="wikitable" \| <big> ̇</big> \|- \| <small>U+0307</small> \|} \|} === NFC === NFC, {{lang-en\|'''n'''ormalization '''f'''orm '''C'''}} («C» от {{lang-en\|'''c'''omposition}}), форма нормализации C — алгоритм, согласно которому последовательно выполняются каноническая декомпозиция и каноническая композиция. Сначала каноническая декомпозиция (алгоритм NFD) приводит текст к форме D. Затем каноническая композиция — операция, обратная NFD, обрабатывает текст от начала к концу с учётом следующих правил: * символ <code>S</code> считается ''начальным'', если имеет нулевой класс комбинируемости ({{lang-en\|combining class of zero}}) согласно таблице символов Юникода; * в любой последовательности символов, начинающейся с символа <code>S</code>, символ <code>C</code> блокируется от <code>S</code>, только если между <code>S</code> и <code>C</code> есть какой-либо символ <code>B</code>, который либо является начальным, либо имеет одинаковый или больший класс комбинируемости, чем <code>C</code>. 
Это правило распространяется только на строки, прошедшие каноническую декомпозицию; * символ считается ''первичным'' композитом, если имеет каноническую декомпозицию в таблице символов Юникода (или каноническую декомпозицию для [[Хангыль\|хангыля]] и он не входит в [http://www.unicode.org/unicode/reports/tr15/#Primary_Exclusion_List_Table список исключений]); * символ <code>X</code> может быть первично совмещён с символом <code>Y</code>, если и только если существует первичный композит <code>Z</code>, канонически эквивалентный последовательности <<code>X</code>, <code>Y</code>>; * если очередной символ <code>C</code> не блокируется последним встреченным начальным базовым символом <code>L</code> и он может быть успешно первично совмещён с ним, то <code>L</code> заменяется на композит <code>L-C</code>, а <code>C</code> удаляется. Пример: {\| style="text-align:center" \| {\| class="wikitable" style="text-align:center" \| <big>o</big> \|- \| <small>U+006F</small> \|} \| {\| class="wikitable" style="text-align:center" \| <big> ̂</big> \|- \| <small>U+0302</small> \|} \| colspan="2" \| → \| {\| class="wikitable" \| <big>ô</big> \|- \| <small>U+00F4</small> \|} \|} === NFKD === NFKD, {{lang-en\|'''n'''ormalization '''f'''orm '''KD'''}}, форма нормализации KD — совместимая декомпозиция — алгоритм, согласно которому последовательно выполняются каноническая декомпозиция и замены символов текста по таблицам совместимой декомпозиции. Таблицы совместимой декомпозиции предусматривают замену на почти эквивалентные символы<ref>[http://habrahabr.ru/post/45489/ Нормализация Unicode]</ref>: * похожих на буквы (ℍ и ℌ); * обведённых кружками (①); * с изменёнными размерами (ｶ и カ); * повёрнутых (︷ и {); * степеней (⁹ и ₉); * дробей (¼); * других (™). 
Примеры: {\| style="text-align:center;" \| {\| class="wikitable" style="text-align:center;" \| <big>ℍ</big> \|- \| <small>U+210d</small> \|} \| colspan="2" \| → \| {\| class="wikitable" style="text-align:center;" \| <big>H</big> \|- \| <small>U+0048</small> \|} \|} {\| style="text-align:center;" \| {\| class="wikitable" style="text-align:center;" \| <big>①</big> \|- \| <small>U+2460</small> \|} \| colspan="2" \| → \| {\| class="wikitable" style="text-align:center;" \| <big>1</big> \|- \| <small>U+0031</small> \|} \|} {\| \| {\| class="wikitable" style="text-align:center;" \| <big>ｶ</big> \|- \| <small>U+FF76</small> \|} \| colspan="2" \| → \| {\| class="wikitable" style="text-align:center;" \| <big>カ</big> \|- \| <small>U+30AB</small> \|} \|} {\| \| {\| class="wikitable" style="text-align:center;" \| <big>︷</big> \|- \| <small>U+FE37</small> \|} \| colspan="2" \| → \| {\| class="wikitable" style="text-align:center;" \| <big>{</big> \|- \| <small>U+007B</small> \|} \|} {\| \| {\| class="wikitable" style="text-align:center;" \| <big>⁹</big> \|- \| <small>U+2079</small> \|} \| colspan="2" \| → \| {\| class="wikitable" style="text-align:center;" \| <big>9</big> \|- \| <small>U+0039</small> \|} \|} {\| \| {\| class="wikitable" style="text-align:center;" \| <big>¼</big> \|- \| <small>U+00BC</small> \|} \| colspan="2" \| → \| {\| class="wikitable" style="text-align:center;" \| <big>1</big> \|\| <big> ⁄ </big> \|\| <big>4</big> \|- \| <small>U+0031</small> \|\| <small>U+2044</small> \|\| <small>U+0034</small> \|} \|} {\| \| {\| class="wikitable" style="text-align:center;" \| <big>™</big> \|- \| <small>U+2122</small> \|} \| colspan="2" \| → \| {\| class="wikitable" style="text-align:center;" \| <big>T</big> \|\| <big>M</big> \|- \| <small>U+0054</small> \|\| <small>U+004D</small> \|} \|} === NFKC === NFKC, {{lang-en\|'''n'''ormalization '''f'''orm '''KC'''}}, форма нормализации KC — алгоритм, согласно которому последовательно выполняются совместимая декомпозиция (алгоритм 
NFKD) и каноническая композиция (алгоритм NFC). === Примеры === {\| class="standard" !Исходный текст\|\|NFD\|\|NFC\|\|NFKD\|\|NFKC \|- \| <!-- fi --> {\| class="wikitable" style="text-align:center;" \| <big>ﬁ</big> \|- \| <small>U+FB01</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>ﬁ</big> \|- \| <small>U+FB01</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>ﬁ</big> \|- \| <small>U+FB01</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>f</big> \|\| <big>i</big> \|- \| <small>U+0066</small> \|\| <small>U+0069</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>f</big> \|\| <big>i</big> \|- \| <small>U+0066</small> \|\| <small>U+0069</small> \|} \|- \| <!-- 2^5 --> {\| class="wikitable" style="text-align:center;" \| <big>2</big> \|\| <big>⁵</big> \|- \| <small>U+0032</small> \|\| <small>U+2075</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>2</big> \|\| <big>⁵</big> \|- \| <small>U+0032</small> \|\| <small>U+2075</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>2</big> \|\| <big>⁵</big> \|- \| <small>U+0032</small> \|\| <small>U+2075</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>2</big> \|\| <big>5</big> \|- \| <small>U+0032</small> \|\| <small>U+0035</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>2</big> \|\| <big>5</big> \|- \| <small>U+0032</small> \|\| <small>U+0035</small> \|} \|- \| <!-- "s" (looks like "f") with two dots --> {\| class="wikitable" style="text-align:center;" \| colspan="2" \| <big>ẛ̣</big> \|- \| <small>U+1E9B</small> \|\| <small>U+0323</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>ſ</big> \|\| <big>̣</big> \|\| <big>̇</big> \|- \| <small>U+017F</small> \|\| <small>U+0323</small> \|\| <small>U+0307</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>ẛ</big> \|\| <big>̣</big> \|- \| <small>U+1E9B</small> \|\| 
<small>U+0323</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>s</big> \|\| <big>̣</big> \|\| <big>̇</big> \|- \| <small>U+0073</small> \|\| <small>U+0323</small> \|\| <small>U+0307</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>ṩ</big> \|- \| <small>U+1E69</small> \|} \|- \| <!-- "й" --> {\| class="wikitable" style="text-align:center;" \| <big>й</big> \|- \| <small>U+0439</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>и</big> \|\| <big> ̆</big> \|- \| <small>U+0438</small> \|\| <small>U+0306</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>й</big> \|- \| <small>U+0439</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>и</big> \|\| <big> ̆</big> \|- \| <small>U+0438</small> \|\| <small>U+0306</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>й</big> \|- \| <small>U+0439</small> \|} \|- \| <!-- "ё" --> {\| class="wikitable" style="text-align:center;" \| <big>ё</big> \|- \| <small>U+0451</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>е</big> \|\| <big>̈</big> \|- \| <small>U+0435</small> \|\| <small>U+0308</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>ё</big> \|- \| <small>U+0451</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>е</big> \|\| <big>̈</big> \|- \| <small>U+0435</small> \|\| <small>U+0308</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>ё</big> \|- \| <small>U+0451</small> \|} \|- \| <!-- "А" --> {\| class="wikitable" style="text-align:center;" \| <big>А</big> \|- \| <small>U+0410</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>А</big> \|- \| <small>U+0410</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>А</big> \|- \| <small>U+0410</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>А</big> \|- \| <small>U+0410</small> \|} \| {\| class="wikitable" 
style="text-align:center;" \| <big>А</big> \|- \| <small>U+0410</small> \|} \|- \| <!-- "が" --> {\| class="wikitable" style="text-align:center;" \| <big>が</big> \|- \| <small>U+304C</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>か</big> \|\| <big>゙</big> \|- \| <small>U+304B</small> \|\| <small>U+3099</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>が</big> \|- \| <small>U+304C</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>か</big> \|\| <big>゙</big> \|- \| <small>U+304B</small> \|\| <small>U+3099</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>が</big> \|- \| <small>U+304C</small> \|} \|- \| <!-- "VIII" --> {\| class="wikitable" style="text-align:center;" \| <big>Ⅷ</big> \|- \| <small>U+2167</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>Ⅷ</big> \|- \| <small>U+2167</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>Ⅷ</big> \|- \| <small>U+2167</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>V</big> \|\| <big>I</big> \|\| <big>I</big> \|\| <big>I</big> \|- \| <small>U+0056</small> \|\| <small>U+0049</small> \|\| <small>U+0049</small> \|\| <small>U+0049</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>V</big> \|\| <big>I</big> \|\| <big>I</big> \|\| <big>I</big> \|- \| <small>U+0056</small> \|\| <small>U+0049</small> \|\| <small>U+0049</small> \|\| <small>U+0049</small> \|} \|- \| <!-- "ç" --> {\| class="wikitable" style="text-align:center;" \| <big>ç</big> \|- \| <small>U+00E7</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>c</big> \|\| <big>̧</big> \|- \| <small>U+0063</small> \|\| <small>U+0327</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>ç</big> \|- \| <small>U+00E7</small> \|} \| {\| class="wikitable" style="text-align:center;" \| <big>c</big> \|\| <big>̧</big> \|- \| <small>U+0063</small> \|\| <small>U+0327</small> \|} \| 
{\| class="wikitable" style="text-align:center;" \| <big>ç</big> \|- \| <small>U+00E7</small> \|} \|} == Bidirectional text == The Unicode standard supports scripts written left to right (LTR) as well as scripts written right to left (RTL), such as [[арабское письмо\|Arabic]] and [[еврейский алфавит\|Hebrew]]. In both cases characters are stored in their "natural" order; displaying them in the correct writing direction is the application's job. Unicode also supports mixed text that combines fragments with different writing directions. This capability is called ''bidirectionality'' (bidirectional text, BiDi). Some simplified text processors (for example, in mobile phones) may support Unicode without supporting bidirectionality. All Unicode characters are divided into several categories: those written left to right, those written right to left, and those written in either direction. Characters of the last category (mostly [[знаки пунктуации\|punctuation marks]]) take on the direction of the surrounding text when displayed. 
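The direction of a character can be looked up in the character database. The sketch below uses <code>unicodedata.bidirectional</code> from the Python standard library and collapses its result into the three coarse groups described above. That three-way grouping, and the helper name <code>direction_kind</code>, are simplifying assumptions for illustration; the database itself distinguishes many more bidirectional classes.

```python
import unicodedata


def direction_kind(ch: str) -> str:
    """Collapse the bidirectional class of a character into three groups."""
    bidi = unicodedata.bidirectional(ch)
    if bidi == "L":
        return "left-to-right"
    if bidi in ("R", "AL"):  # Hebrew-style and Arabic-style strong RTL
        return "right-to-left"
    # Digits, punctuation, spaces and so on: direction depends on context.
    return "weak or neutral"


if __name__ == "__main__":
    for ch in ("a", "я", "\u05D0", "\u0627", "!", "1"):
        print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), direction_kind(ch))
```

A real renderer would not stop at this lookup; resolving the display order of mixed text takes the full bidirectional algorithm.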
== Characters represented == [[Файл:Roadmap to Unicode BMP multilingual.svg\|lang=ru\|right\|500px\|thumb\|Diagram of the Unicode [[Плоскость (Юникод)#Основная многоязычная плоскость\|Basic Multilingual Plane]]]] {{Main\|Плоскость (Юникод)}} Unicode includes practically all modern [[письменность\|scripts]], including: {{columns-list\|2\| * [[арабское письмо\|Arabic]], * [[армянское письмо\|Armenian]], * [[бенгальское письмо\|Bengali]], * [[Бирманское письмо\|Burmese]], * [[Глаголица\|Glagolitic]], * [[Греческий алфавит\|Greek]], * [[грузинское письмо\|Georgian]], * [[деванагари\|Devanagari]], * [[еврейский алфавит\|Hebrew]], * [[Кириллица\|Cyrillic]], * [[китайское письмо\|Chinese]] (Chinese characters are widely used in [[японский язык\|Japanese]] and occasionally in [[корейский язык\|Korean]]), * [[коптское письмо\|Coptic]], * [[Кхмерское письмо\|Khmer]], * [[Латинский алфавит\|Latin]], * [[Тамильское письмо\|Tamil]], * [[Хангыль\|Korean (Hangul)]], * [[письмо чероки\|Cherokee]], * [[Эфиопское письмо\|Ethiopic]], * [[японское письмо\|Japanese]] (which includes [[кандзи\|Chinese characters]] in addition to the [[кана\|syllabaries]]) }} and others. For academic purposes many historical scripts have been added, including [[руны\|Germanic runes]], [[Древнетюркское письмо\|Old Turkic runes]], [[древнегреческий язык\|ancient Greek writing]], [[египетские иероглифы\|Egyptian hieroglyphs]], [[клинопись\|cuneiform]], and the [[этрусский алфавит\|Etruscan alphabet]]. Unicode offers a wide range of [[таблица математических символов\|mathematical]] and [[музыка\|musical]] symbols, as well as [[пиктограмма\|pictograms]]. [[государственный флаг\|National flags]] are not included in Unicode directly. They are encoded as pairs taken from 26 letter symbols intended to represent the two-letter country codes of the [[ISO 3166-1 alpha-2]] standard. 
These letters are encoded in the range from {{unichar\|1F1E6\|regional indicator symbol letter a\|html=}} to {{unichar\|1F1FF\|regional indicator symbol letter z\|html=}}. As a matter of principle, Unicode does not include [[логотип\|logos]] of companies and products, although they do occur in fonts (for example, the [[Apple]] logo in the [[MacRoman]] encoding (0xF0) or the [[Microsoft Windows\|Windows]] logo in the Wingdings font (0xFF)). In Unicode fonts, logos must be placed only in the private use area. == ISO/IEC 10646 == The Unicode Consortium works closely with the working group ISO/IEC/JTC1/SC2/WG2, which develops international standard 10646 ([[ISO]]/[[IEC]] 10646). The Unicode standard and ISO/IEC 10646 are kept synchronized, although each uses its own terminology and documentation system. Cooperation between the Unicode Consortium and the International Organization for Standardization (ISO) began in [[1991 год\|1991]]. In [[1993 год\|1993]], ISO released the standard DIS 10646.1. To synchronize with it, the Consortium approved version 1.1 of the Unicode standard, which incorporated additional characters from DIS 10646.1. As a result, the encoded characters in Unicode 1.1 and DIS 10646.1 matched exactly. The two organizations continued to cooperate afterwards. In 2000, Unicode 3.0 was synchronized with ISO/IEC 10646-1:2000. The upcoming third version of ISO/IEC 10646 is to be synchronized with Unicode 4.0. These specifications may even be published as a single standard. Like the UTF-16 and UTF-32 formats in the Unicode standard, ISO/IEC 10646 has two main character encoding forms: UCS-2 (2 bytes per character, similar to UTF-16) and UCS-4 (4 bytes per character, similar to UTF-32). UCS stands for ''universal coded character set''. 
UCS-2 can be regarded as a subset of UTF-16 (UTF-16 without surrogate pairs), and UCS-4 is a synonym for UTF-32. Differences between the Unicode standard and ISO/IEC 10646: * minor differences in terminology; * ISO/IEC 10646 does not include the sections needed for a full implementation of Unicode support: ** no data on the binary encoding of characters; ** no description of the collation and rendering algorithms; ** no list of character properties (for example, the properties needed to implement support for bidirectional text). == Forms of representation == Unicode has several forms of representation (Unicode transformation format, UTF): [[UTF-8]], [[UTF-16]] (UTF-16BE, UTF-16LE), and [[UTF-32]] (UTF-32BE, UTF-32LE). A form called [[UTF-7]] was also developed for transmission over seven-bit channels, but because of its incompatibility with [[ASCII]] it did not become widespread and is not part of the standard. On 1 April 2005, two [[День смеха\|joke]] forms of representation were proposed: UTF-9 and UTF-18 ([http://tools.ietf.org/html/rfc4042 RFC{{nbsp}}4042]). [[Microsoft]] [[Windows NT]] and the systems based on it, [[Windows 2000]] and [[Windows XP]], mainly [[Юникод в операционных системах семейства Microsoft Windows\|use]] the UTF-16LE form. [[UNIX]]-like [[Операционная система\|operating systems]] such as [[GNU/Linux]], [[BSD]], and [[Mac OS X]] use UTF-8 for files and UTF-32 or UTF-8 for processing characters in [[оперативная память\|main memory]]. [[Punycode]] is another form of encoding sequences of Unicode characters; it produces so-called ACE sequences, which consist only of the alphanumeric characters permitted in domain names. 
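To make the forms of representation concrete, the sketch below encodes the same characters in several of them using Python's built-in codecs and prints the bytes in hexadecimal. The helper names <code>encoded_hex</code> and <code>to_ace</code> are illustrative. The ACE conversion relies on Python's built-in <code>idna</code> codec, which is assumed here to be adequate for a simple label; production code may need a more recent IDNA implementation.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def encoded_hex(text: str) -> dict[str, str]:
    """Return the bytes of the text in each encoding as spaced hex digits."""
    return {enc: text.encode(enc).hex(" ") for enc in ENCODINGS}


def to_ace(label: str) -> str:
    """Convert one domain label to its ASCII-compatible (Punycode) form."""
    return label.encode("idna").decode("ascii")


if __name__ == "__main__":
    # A BMP character and a character from plane 1 (needs a surrogate pair
    # in UTF-16 and four bytes in UTF-8).
    for sample in ("я", "\U0001F600"):
        print(f"U+{ord(sample):04X}", encoded_hex(sample))
    print(to_ace("пример"))
```

The explicit <code>-le</code> and <code>-be</code> codec names are used so that no byte order mark is prepended to the output.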
=== UTF-8 === {{Основная статья\|UTF-8}} UTF-8 is a representation of Unicode that is compact for ASCII-dominated text and backward compatible with 7-bit [[ASCII]]: text consisting only of characters with codes below 128 becomes ordinary [[ASCII]] text when written in UTF-8 and can be displayed by any program that works with ASCII; conversely, text encoded in 7-bit ASCII can be displayed by a program designed for UTF-8. All other Unicode characters are represented by sequences of 2 to 4 bytes in which the first byte always matches the mask <code>11xxxxxx</code> and the remaining bytes match <code>10xxxxxx</code>. UTF-8 does not use surrogate pairs. The UTF-8 format was invented on [[2 сентября\|2 September]] [[1992 год\|1992]] by [[Томпсон, Кен\|Ken Thompson]] and [[Пайк, Роб\|Rob Pike]] and implemented in the [[Plan 9]] operating system<ref>http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt{{ref-en}}{{Недоступная ссылка\|date=Октябрь 2019 \|bot=InternetArchiveBot }}</ref>. The UTF-8 standard is now formally specified in RFC 3629 and in ISO/IEC 10646 Annex D. === UTF-16 and UTF-32 === {{Основная статья\|UTF-16\|UTF-32}} UTF-16 is an encoding that can represent Unicode characters in the ranges U+0000...U+D7FF and U+E000...U+10FFFF (1,112,064 in total). Each character is written as either one 16-bit word or two (a surrogate pair). UTF-16 is described in Annex Q of the international standard ISO/IEC 10646 and is also the subject of IETF RFC 2781, "UTF-16, an encoding of ISO 10646". UTF-32 is a representation of Unicode in which every character occupies exactly 4 bytes. The main advantage of UTF-32 over variable-length encodings is that Unicode characters in it are directly indexable, so a character can be found by its position number in a file very quickly, and retrieving the character at any ''n''-th position always takes the same amount of time. This also makes replacing characters in UTF-32 strings very simple. 
Напротив, кодировки с переменной длиной требуют последовательного доступа к символу ''n''-й позиции, что может быть очень затратной по времени операцией. Главный недостаток UTF-32 — это неэффективное использование пространства, так как для хранения любого символа используется четыре байта. Символы, лежащие за пределами нулевой (базовой) плоскости кодового пространства, редко используются в большинстве текстов. Поэтому удвоение, в сравнении с UTF-16, занимаемого строками в UTF-32 пространства, зачастую не оправдано. ==== Порядок байтов ==== {{Основная статья\|Порядок байтов}} В потоке данных UTF-16 младший байт может записываться либо перед старшим ({{lang-en\|UTF-16 little-endian, UTF-16LE}}), либо после старшего ({{lang-en\|UTF-16 big-endian, UTF-16BE}}). Аналогично существует два варианта четырёхбайтной кодировки — UTF-32LE и UTF-32BE. === Маркер последовательности байтов === {{Основная статья\|Маркер последовательности байтов}} Для указания на использование Юникода, в начале текстового файла или потока может передаваться [[Маркер последовательности байтов]] ({{lang-en\|byte order mark (BOM)}}) — символ U+FEFF (неразрывный пробел нулевой ширины). По его виду можно легко различить как формат представления Юникода, так и последовательность байтов. Маркер последовательности байтов может принимать следующий вид: ; UTF-8 : EF BB BF ; UTF-16BE : FE FF ; UTF-16LE : FF FE ; UTF-32BE : 00 00 FE FF ; UTF-32LE : FF FE 00 00 === Юникод и традиционные кодировки === Внедрение Юникода привело к изменению подхода к традиционным 8-битным кодировкам. Если раньше такая кодировка всегда задавалась непосредственно, то теперь она может задаваться таблицей соответствия между данной кодировкой и Юникодом. Фактически почти все 8-битные кодировки теперь можно рассматривать как форму представления некоторого подмножества Юникода. 
И это намного упростило создание программ, которые должны работать с множеством разных кодировок: теперь, чтобы добавить поддержку ещё одной кодировки, надо всего лишь добавить ещё одну таблицу перекодировки символов в Юникод. Кроме того, многие форматы данных позволяют вставлять любые символы Юникода, даже если документ записан в старой 8-битной кодировке. Например, в HTML можно использовать [[Мнемоники в HTML\|коды с амперсандом]]. === Реализации === Большинство современных операционных систем в той или иной степени обеспечивает поддержку Юникода. В операционных системах семейства [[Windows NT]] для внутреннего представления имён файлов и других системных строк используется двухбайтовая кодировка UTF-16LE. Системные вызовы, принимающие строковые параметры, существуют в однобайтном и двухбайтном вариантах. Подробнее см. в статье [[Юникод в операционных системах семейства Microsoft Windows]]. [[UNIX]]-подобные операционные системы, в том числе [[GNU/Linux]], [[BSD]], [[OS X]], используют для представления Юникода кодировку UTF-8. Большинство программ может работать с UTF-8 как с традиционными однобайтными кодировками, не обращая внимания на то, что символ представляется как несколько последовательных байт. Для работы с отдельными символами строки обычно перекодируются в UCS-4, так что каждому символу соответствует [[машинное слово]]. Одной из первых успешных коммерческих реализаций Юникода стала среда программирования [[Java]]. В ней принципиально отказались от 8-битного представления символов в пользу 16-битного. Это решение увеличило расход памяти, но позволило вернуть в программирование важную абстракцию: произвольный одиночный символ (тип <code>char</code>). В частности, программист мог работать со строкой, как с простым массивом. К сожалению, успех не был окончательным, Юникод перерос ограничение в 16 бит и к версии J2SE 5.0 произвольный символ снова стал занимать переменное число единиц памяти — один <code>char</code> или два (см. 
[[UTF-16\|суррогатная пара]]). Сейчас большинство языков программирования поддерживает строки Юникода, хотя их представление может различаться в зависимости от реализации. == Методы ввода == Поскольку ни одна [[раскладка клавиатуры]] не может позволить вводить все символы Юникода одновременно, от [[операционная система\|операционных систем]] и [[прикладное программное обеспечение\|прикладных программ]] требуется поддержка альтернативных методов ввода произвольных символов Юникода. === [[Microsoft Windows]] === {{main\|Юникод в операционных системах семейства Microsoft Windows}} Хотя, начиная с [[Windows 2000]], служебная программа «Таблица символов» (charmap.exe) поддерживает символы Юникода и позволяет копировать их в [[буфер обмена]], эта поддержка ограничена только базовой плоскостью (коды символов U+0000…U+FFFF). Символы с кодами от U+10000 «Таблица символов» не отображает. Похожая таблица есть, например, в [[Microsoft Word]]. Иногда можно набрать [[Шестнадцатеричная система счисления\|шестнадцатеричный]] код, нажать {{key\|[[Alt (клавиша)\|Alt]]\|X}}, и код будет заменён на соответствующий символ, например, в [[WordPad]], Microsoft Word. В редакторах {{key\|Alt\|X}} выполняет и обратное преобразование. Во многих программах MS Windows, чтобы получить символ Unicode, нужно при нажатой клавише Alt набрать десятичное значение кода символа на цифровой клавиатуре. Например, полезными при наборе кириллических текстов будут комбинации Alt+0171 (<!-- защита от Викификатора --><nowiki>«</nowiki>), Alt+0187 (<nowiki>»</nowiki>) и Alt+0769 ([[знак ударения]]). Интересны также комбинации Alt+0133 (…) и Alt+0151 (—). === [[Macintosh]] === В [[Mac OS]] 8.5 и более поздних версиях поддерживается метод ввода, называемый «Unicode Hex Input». При зажатой клавише Option требуется набрать четырёхзначный шестнадцатеричный код требуемого символа. 
Этот метод позволяет вводить символы с кодами, большими U+FFFF, используя пары суррогатов; такие пары операционной системой будут автоматически заменены на одиночные символы. Этот метод ввода перед использованием нужно активизировать в соответствующем разделе системных настроек и затем выбрать как текущий метод ввода в меню клавиатуры. Начиная с [[Mac OS X]] 10.2, существует также приложение «Character Palette», позволяющее выбирать символы из таблицы, в которой можно выделять символы определённого блока или символы, поддерживаемые конкретным шрифтом. === [[GNU/Linux]] === В [[GNOME]] также есть утилита «[[Таблица символов GNOME\|Таблица символов]]» (ранее gucharmap), позволяющая отображать символы определённого блока или системы письма и предоставляющая возможность поиска по названию или описанию символа. Когда код нужного символа известен, его можно ввести в соответствии со стандартом [[Международная организация по стандартизации\|ISO]]{{nbsp}}14755: при зажатых клавишах {{key\|Ctrl\|Shift}} ввести шестнадцатеричный код (начиная с некоторой версии GTK+, ввод кода нужно предварить нажатием клавиши ''«U»''). Вводимый шестнадцатеричный код может иметь до {{num\|32\|бит}} в длину, позволяя вводить любые символы Юникода без использования суррогатных пар. Все приложения [[X Window System\|X{{nbsp}}Window]], включая GNOME и [[KDE]], поддерживают ввод при помощи клавиши {{Key\|[[Compose]]}}. Для клавиатур, на которых нет отдельной клавиши [[Compose]], для этой цели можно назначить любую клавишу — например, {{Key\|[[Caps Lock]]}}. Консоль GNU/Linux также допускает ввод символа Юникода по его коду — для этого десятичный код символа нужно ввести цифрами расширенного блока клавиатуры при зажатой клавише {{Key\|[[Alt (клавиша)\|Alt]]}}. Можно вводить символы и по их шестнадцатеричному коду: для этого нужно зажать клавишу {{key\|AltGr}}, и для ввода цифр A—F использовать клавиши расширенного блока клавиатуры от {{Key\|NumLock}} до {{Key\|Enter}} (по часовой стрелке). 
Поддерживается также и ввод в соответствии с ISO{{nbsp}}14755. Для того чтобы перечисленные способы могли работать, нужно включить в консоли режим Юникода вызовом <code>unicode_start</code>(1) и выбрать подходящий шрифт вызовом <code>setfont</code>(8). [[Mozilla Firefox]] для Linux поддерживает ввод символов по ISO{{nbsp}}14755. == Проблемы Юникода == В Юникоде английское «a» и польское «a» — один и тот же символ. Точно так же одним и тем же символом (но отличающимся от «a» латинского) считаются русское «а» и сербское «а». Такой принцип кодирования не универсален; по-видимому, решения «на все случаи жизни» вообще не может существовать. * Тексты на [[китайский язык\|китайском]], [[корейский язык\|корейском]] и [[японский язык\|японском]] языках имеют традиционное написание сверху вниз, начиная с правого верхнего угла. Переключение горизонтального и вертикального написания для этих языков не предусмотрено в Юникоде — это должно осуществляться средствами [[язык разметки\|языков разметки]] или внутренними механизмами [[текстовый процессор\|текстовых процессоров]]. * Наличие или отсутствие в Юникоде разных начертаний одного и того же символа в зависимости от языка. Нужно следить, чтобы текст всегда был правильно помечен как относящийся к тому или другому языку. : Так, [[китайское письмо\|китайские иероглифы]] могут иметь разные начертания в китайском, японском ([[кандзи]]) и корейском ([[ханча]]), но при этом в Юникоде обозначаются одним и тем же символом (так называемая CJK-унификация), хотя упрощённые и полные иероглифы всё же имеют разные коды. : Аналогично, [[русский язык\|русский]] и [[сербский язык\|сербский]] <!-- защита от Викификатора --><nowiki>языки</nowiki> используют разное начертание курсивных букв ''п'' и ''т'' (в сербском они выглядят как <span style="text-decoration: overline; font-style: italic">и</span> и <span style="text-decoration: overline; font-style: italic">ш</span>, см. [[сербский курсив]]). 
* Перевод из строчных букв в заглавные тоже зависит от языка. Например: в [[турецкий язык\|турецком]] существуют буквы [[i без точки\|İi и Iı]] — таким образом, турецкие правила изменения регистра конфликтуют с [[английский язык\|английскими]], которые предписывают «i» переводить в «I». Подобные проблемы есть и в других языках — например, в канадском диалекте французского языка регистр переводится немного не так, как во Франции<ref>[http://www.transl-gunsmoker.ru/2008/11/unicode.html Регистр в Unicode — это непросто]</ref>. * Даже с [[арабские цифры\|арабскими цифрами]] есть определённые типографские тонкости: цифры бывают «прописными» и «[[минускульные цифры\|строчными]]», пропорциональными и [[моноширинный шрифт\|моноширинными]]<ref>В большинстве шрифтов для ПК реализованы «прописные» (маюскульные) моноширинные цифры.</ref> — для Юникода разницы между ними нет. Подобные нюансы остаются за программным обеспечением. Некоторые недостатки связаны не с самим Юникодом, а с возможностями обработчиков текста. * Файлы нелатинского текста в Юникоде всегда занимают больше места, так как один символ кодируется не одним байтом, как в различных национальных кодировках, а последовательностью байтов (исключение составляет UTF-8 для языков, алфавит которых укладывается в ASCII, а также наличие в тексте символов двух и более языков, алфавит которых ''не'' укладывается в ASCII<ref>В некоторых случаях документ (не простой текст) в Юникоде может занимать существенно меньше места, чем документ в однобайтовой кодировке. Например, если некая веб-страница содержит примерно поровну русского и греческого текста, то в однобайтовой кодировке придётся либо русские, либо греческие буквы записывать, используя возможности формата документов, в виде кодов с амперсандом, которые занимают 6—7 байт на символ (при использовании десятичных кодов), то есть в среднем на букву придётся 3,5—4 байта, в то время как UTF-8 занимает только 2 байта на греческую или русскую букву.</ref>). 
Файл шрифта, необходимый для отображения всех символов таблицы Юникод, занимает сравнительно много места в памяти и требует бо́льших вычислительных ресурсов, чем шрифт только одного национального языка пользователя<ref>Один из файлов шрифтов Arial Unicode имеет размер 24 мегабайта; существует Times New Roman размером 120 мегабайт, он содержит количество символов, близкое к 65536.</ref>. С увеличением мощности компьютерных систем и удешевлением памяти и дискового пространства эта проблема становится всё менее существенной; тем не менее, она остаётся актуальной для портативных устройств, например, для мобильных телефонов. * Хотя поддержка Юникода реализована в наиболее распространённых операционных системах, до сих пор не всё прикладное программное обеспечение поддерживает корректную работу с ним. В частности, не всегда обрабатываются метки порядка байтов ([[Byte order mark\|BOM]]) и плохо поддерживаются диакритические символы. Проблема является временной и есть следствие сравнительной новизны стандартов Юникода (в сравнении с однобайтовыми национальными кодировками). * Производительность всех программ обработки строк (в том числе и сортировок в БД) снижается при использовании Юникода вместо однобайтовых кодировок. Некоторые редкие системы письма всё ещё не представлены должным образом в Юникоде. Изображение «длинных» надстрочных символов, простирающихся над несколькими буквами, как, например, в [[церковнославянский язык\|церковнославянском языке]], пока не реализовано. == «Юникод» или «Уникод»? == «Unicode» — одновременно и имя собственное (или часть имени, например, Unicode Consortium), и имя нарицательное, происходящее из английского языка. На первый взгляд предпочтительнее использовать написание «Уникод». В [[Русский язык\|русском языке]] уже есть [[Морфема\|морфемы]] «уни-» (слова с латинским элементом «uni-» традиционно переводились и писались через «уни-»: универсальный, униполярный, унификация, униформа) и «код». 
Напротив, торговые марки, заимствованные из [[Английский язык\|английского языка]], обычно передаются посредством практической транскрипции, в которой деэтимологизированное сочетание букв «uni-» записывается в виде «юни-» («[[Юнилевер]]», «[[UNIX\|Юникс]]» и т. п.), то есть точно так же, как в случае с побуквенными сокращениями, вроде [[UNICEF]] «United Nations International Children’s Emergency Fund» — [[ЮНИСЕФ]]. Написание «Юникод» уже твёрдо вошло в русскоязычные тексты. В [[Википедия\|Википедии]] используется более распространённый вариант. В [[MS Windows]] используется вариант «Юникод». На сайте Консорциума есть специальная страница, где рассматриваются проблемы передачи слова «Unicode» в различных языках и системах письма. Для русской кириллицы указан вариант «Юникод»<ref name=autogenerated1 />. Формы, принятые иностранными организациями для русской передачи слова «Unicode», являются рекомендательными. == См. также == * [[Символы, представленные в Юникоде]] * [[ASCII]] * [[ISO 8859-1]] * [[UTF-8]] * [[UTF-16]] * [[UTF-32]] * [[Кириллица в Юникоде]] * [[Дроби в Юникоде]] * [[XeTeX]] * [[Свободные универсальные шрифты]] * [[Windows Glyph List 4]] * [[Широкий символ]] * Библиотека [[GLib]] содержит широкий набор функций для работы c символами и строками в кодировке Unicode * [[Проект:Внесение символов алфавитов народов России в Юникод]] == Примечания == {{примечания\|2}} == Ссылки == * [http://www.unicode.org/ Официальный сайт Консорциума Юникода]{{ref-en}} * {{dmoz\|Computers/Software/Globalization/Character_Encoding/Unicode/\|Unicode}}{{ref-en}} * Статья «[http://www.unicode.org/standard/translations/russian.html Что такое Unicode?]»{{ref-ru}} на официальном сайте Консорциума * [http://www.unicode.org/versions/latest/ Последняя версия стандарта Юникод]{{ref-en}} * Последнюю версию стандарта ISO/IEC 10646 ищите в [http://standards.iso.org/ittf/PubliclyAvailableStandards/ списке доступных стандартов]{{ref-en}}. 
Документы, соответствующие стандарту Unicode 7.0: [http://standards.iso.org/ittf/PubliclyAvailableStandards/c056921_ISO_IEC_10646_2012.zip ISO/IEC 10646] (файл ZIP){{ref-en}}, [http://standards.iso.org/ittf/PubliclyAvailableStandards/c061712_ISO_IEC_10646_2012_Amd_1_2013.zip Amendments 1] (файл ZIP){{ref-en}}, Amendments 2 (по состоянию 2014-08-06 ещё недоступен) * [http://unicode-table.com/ Таблица символов Юникода с названиями и описаниями]{{ref-ru}}{{ref-en}}{{ref-de}} * [http://www.unicode.org/versions/Unicode5.0.0/appC.pdf Связь Юникода версии 5.0.0 и ISO/IEC 10646] (файл PDF){{ref-en}} * [http://www.cl.cam.ac.uk/~mgk25/unicode.html FAQ по UTF-8 и Unicode]{{ref-en}} * [[Кириллица в Юникоде]]: http://www.unicode.org/charts/PDF/U0400.pdf, http://www.unicode.org/charts/PDF/U0500.pdf, http://www.unicode.org/charts/PDF/U2DE0.pdf, http://www.unicode.org/charts/PDF/UA640.pdf{{ref-en}}{{Недоступная ссылка\|date=Январь 2020 \|bot=InternetArchiveBot }} * [http://www.i18nguy.com/surrogates.html Включение поддержки дополнительных символов Юникода в Windows]{{ref-en}} * [http://www.fileformat.info/info/unicode/char/search.htm Поиск по символам Юникода]{{ref-en}} {{Стандарты ISO}} {{Шрифтовой дизайн}} [[Категория:Юникод\| ]] [[Категория:Стандарты Интернета]] [[Категория:Стандарты ISO]]'
Вики-текст новой страницы после правки (`new_wikitext`)	'{{Use dmy dates\|date=May 2019\|cs1-dates=y}} {{distinguish\|Unicode (telegraphy)}} {{For\|what the term "Unicode" means in Microsoft documentation\|UTF-16}} {{Short description\|Character encoding standard}} {{Infobox character encoding \| name = Unicode \| mime = \| alias = [[Universal Coded Character Set]] (UCS) \| image = New Unicode logo.svg \| caption = Logo of the [[Unicode Consortium]] \| standard = Unicode Standard \| lang = International \| status = \| encodings = [[UTF-8]], [[UTF-16]], [[GB 18030\|GB18030]]<br/>'''Less common''': [[UTF-32]], [[BOCU]], [[Standard Compression Scheme for Unicode\|SCSU]], [[UTF-7]] \| encodes = \| extends = \| prev = [[ISO 8859]], various others \| next = }} {{Contains special characters\| special = uncommon [[Unicode]] characters}} '''Unicode''' is an [[information technology]] [[technical standard\|standard]] for the consistent [[character encoding\|encoding]], representation, and handling of [[character (computing)\|text]] expressed in most of the world's [[writing system]]s. The standard is maintained by the [[Unicode Consortium]], and {{as of\|March 2020\|lc=y}}, there is a repertoire of {{unicodenover}} (these [[character (computing)\|characters]] consist of 143,696 graphic characters and 163 format characters) covering 154 modern and historic [[script (Unicode)\|scripts]], as well as multiple symbol sets and [[emoji]]. The character repertoire of the Unicode Standard is synchronized with [[ISO/IEC 10646]], and both are code-for-code identical. 
''The Unicode Standard'' consists of a set of code charts for visual reference, an encoding method and set of standard [[character encoding]]s, a set of reference [[data file]]s, and a number of related items, such as character properties, rules for [[Unicode normalization\|normalization]], decomposition, [[Unicode collation algorithm\|collation]], rendering, and [[bidirectional text]] display order (for the correct display of text containing both right-to-left scripts, such as [[Arabic script\|Arabic]] and [[Hebrew alphabet\|Hebrew]], and left-to-right scripts).<ref>{{Cite web \| title = The Unicode Standard: A Technical Introduction \| url = https://www.unicode.org/standard/principles.html \| accessdate = 2010-03-16}}</ref> Unicode's success at unifying character sets has led to its widespread and predominant use in the [[internationalization and localization]] of computer [[software]]. The standard has been implemented in many recent technologies, including modern [[operating system]]s, [[XML]], [[Java (programming language)\|Java]] (and other programming languages), and the [[.NET Framework]]. [[Comparison of Unicode encodings\|Unicode can be implemented]] by different [[character encoding]]s. The Unicode standard defines [[UTF-8]], [[UTF-16]], and [[UTF-32]], and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16, and [[Universal Coded Character Set\|UCS]]-2 (without full support for Unicode), a precursor of UTF-16; [[GB 18030\|GB18030]] is standardized in China and implements Unicode fully, while not an official Unicode standard. 
UTF-8, the dominant encoding on the [[World Wide Web]] (used in over 94% of websites {{As of\|2019\|November\|df=\|lc=y}}),<ref>{{Cite web\|url=https://w3techs.com/technologies/cross/character_encoding/ranking\|title=Usage Survey of Character Encodings broken down by Ranking\|website=w3techs.com\|language=en\|access-date=2019-11-11}}</ref> uses one [[byte]]{{efn\|The Unicode Consortium uses the ambiguous term byte; the [[International Organization for Standardization]] (ISO), the [[International Electrotechnical Commission]] (IEC) and the [[Internet Engineering Task Force]] (IETF) use the more specific term [[Octet (computing)\|octet]] in current documents related to Unicode.\|group=note}} for the first 128 [[code point]]s, and up to 4 bytes for other characters.<ref>{{cite web\|url=https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#I1.36559\|work=The Unicode Standard\|title=Conformance \| date=March 2020\|accessdate=2020-03-15}}</ref> The first 128 Unicode code points represent the [[ASCII]] characters, which means that any ASCII text is also valid UTF-8 text. UCS-2 uses two bytes (16 bits) for each character but can only encode the first 65,536 code points, the so-called [[Basic Multilingual Plane]] (BMP). With 1,112,064 possible Unicode code points corresponding to characters (see [[#Architecture and terminology\|below]]) on 17 planes, and with over 143,000 code points defined as of version 13.0, UCS-2 can represent fewer than half of all encoded Unicode characters. Therefore, UCS-2 is outdated, though still widely used in software. UTF-16 extends UCS-2 by using the same [[16-bit]] encoding as UCS-2 for the Basic Multilingual Plane and a 4-byte encoding for the other planes. As long as it contains no code points in the reserved range U+D800–U+DFFF, a UCS-2 text is valid UTF-16 text. UTF-32 (also referred to as UCS-4) uses four bytes for each character. 
Like UCS-2, the number of bytes per character is fixed, facilitating character indexing; but unlike UCS-2, UTF-32 is able to encode all Unicode code points. However, because each character uses four bytes, UTF-32 takes significantly more space than other encodings, and is not widely used. ==Origin and development== Unicode has the explicit aim of transcending the limitations of traditional character encodings, such as those defined by the [[ISO/IEC 8859]] standard, which find wide usage in various countries of the world but remain largely incompatible with each other. Many traditional character encodings share a common problem in that they allow bilingual computer processing (usually using [[Latin character]]s and the local script), but not multilingual computer processing (computer processing of arbitrary scripts mixed with each other). Unicode, in intent, encodes the underlying characters—[[grapheme]]s and grapheme-like units—rather than the variant [[glyph]]s (renderings) for such characters. In the case of [[Chinese characters]], this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see [[Han unification]]). In text processing, Unicode takes the role of providing a unique ''code point''—a [[number]], not a glyph—for each character. In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, [[font]], or style) to other software, such as a [[web browser]] or [[word processor]]. This simple aim becomes complicated, however, because of concessions made by Unicode's designers in the hope of encouraging a more rapid adoption of Unicode. The first 256 code points were made identical to the content of [[ISO/IEC 8859-1]] so as to make it trivial to convert existing western text. 
Many essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings and therefore allow conversion from those encodings to Unicode (and back) without losing any information. For example, the "[[Halfwidth and Fullwidth Forms (Unicode block)\|fullwidth forms]]" section of code points encompasses a full duplicate of the Latin alphabet because Chinese, Japanese, and Korean ([[CJK]]) fonts contain two versions of these letters, "fullwidth" matching the width of the CJK characters, and normal width. For other examples, see [[duplicate characters in Unicode]]. ==={{anchor\|Unicode 88}}History=== Based on experiences with the [[Xerox Character Code Standard]] (XCCS) since 1980,<ref name="unicode-88"/> the origins of Unicode date to 1987, when [[Joe Becker (Unicode)\|Joe Becker]] from [[Xerox]], with [[Lee Collins (software engineer)\|Lee Collins]] and [[Mark Davis (Unicode)\|Mark Davis]] from [[Apple Inc.\|Apple]], started investigating the practicalities of creating a universal character set.<ref>{{cite web \|title=Summary Narrative \|url=https://www.unicode.org/history/summary.html \|access-date=2010-03-15}}</ref> With additional input from Peter Fenwick and Dave Opstad,<ref name="unicode-88"/> Joe Becker published in August 1988 a draft proposal for an "international/multilingual text character encoding system, tentatively called Unicode". He explained that "[t]he name 'Unicode' is intended to suggest a unique, unified, universal encoding".<ref name="unicode-88">{{Cite web \|url=https://unicode.org/history/unicode88.pdf \|title=Unicode 88 \|author-last=Becker \|author-first=Joseph D. \|author-link=Joseph D. 
Becker \|date=1998-09-10 \|orig-year=1988-08-29 \|edition=10th anniversary reprint \|website=unicode.org \|publisher=[[Unicode Consortium]] \|access-date=2016-10-25 \|url-status=live \|archive-url=https://web.archive.org/web/20161125224409/https://unicode.org/history/unicode88.pdf \|archive-date=2016-11-25 \|quote=In 1978, the initial proposal for a set of "Universal Signs" was made by [[Bob Belleville]] at [[Xerox PARC]]. Many persons contributed ideas to the development of a new encoding design. Beginning in 1980, these efforts evolved into the [[Xerox Character Code Standard]] (XCCS) by the present author, a multilingual encoding which has been maintained by Xerox as an internal corporate standard since 1982, through the efforts of Ed Smura, Ron Pellar, and others.<br/>Unicode arose as the result of eight years of working experience with XCCS. Its fundamental differences from XCCS were proposed by Peter Fenwick and Dave Opstad (pure 16-bit codes), and by [[Lee Collins (Unicode)\|Lee Collins]] (ideographic character unification). Unicode retains the many features of XCCS whose utility have been proved over the years in an international line of communication multilingual system products.}}</ref> In this document, entitled ''Unicode 88'', Becker outlined a [[16-bit]] character model:<ref name="unicode-88"/> <blockquote> Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the characters of all the world's living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose. </blockquote> His original 16-bit design was based on the assumption that only those scripts and characters in modern use would need to be encoded:<ref name="unicode-88"/> <blockquote> Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. 
Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2<sup>14</sup> = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally useful Unicodes. </blockquote> In early 1989, the Unicode working group expanded to include Ken Whistler and Mike Kernaghan of Metaphor, Karen Smith-Yoshimura and Joan Aliprand of [[Research Libraries Group\|RLG]], and Glenn Wright of [[Sun Microsystems]], and in 1990, Michel Suignard and Asmus Freytag from [[Microsoft]] and Rick McGowan of [[NeXT]] joined the group. By the end of 1990, most of the work on mapping existing character encoding standards had been completed, and a final review draft of Unicode was ready. The [[Unicode Consortium]] was incorporated in California on 3 January 1991,<ref>[https://unicode.org/history/publicationdates.html History of Unicode Release and Publication Dates] on ''unicode.org.'' Retrieved February 28, 2017.</ref> and in October 1991, the first volume of the Unicode standard was published. The second volume, covering Han ideographs, was published in June 1992. In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. This increased the Unicode codespace to over a million code points, which allowed for the encoding of many historic scripts (e.g., [[Egyptian hieroglyphs]]) and thousands of rarely used or obsolete characters that had not been anticipated as needing encoding. 
Among the characters not originally intended for Unicode are rarely used Kanji or Chinese characters, many of which are part of personal and place names, making them rarely used, but much more essential than envisioned in the original architecture of Unicode.<ref name=unicoderevisited>{{cite web\|last=Searle\|first=Stephen J\|title=Unicode Revisited\|url=http://tronweb.super-nova.co.jp/unicoderevisited.html\|accessdate=2013-01-18}}</ref> The Microsoft TrueType specification version 1.0 from 1992 used the name ''Apple Unicode'' instead of ''Unicode'' for the Platform ID in the naming table. ===Unicode Consortium=== {{Main\|Unicode Consortium}} The Unicode Consortium is a nonprofit organization that coordinates Unicode's development. Full members include most of the main computer software and hardware companies with any interest in text-processing standards, including [[Adobe Inc.\|Adobe]], [[Apple Inc.\|Apple]], [[Google]], [[International Business Machines\|IBM]], [[Microsoft]], and [[Oracle Corporation]].<ref name="members">{{cite web \| title = The Unicode Consortium Members \| url = https://unicode.org/consortium/members.html \| accessdate = 2019-01-04}}</ref> Over the years several countries or government agencies have been members of the Unicode Consortium. Presently only the [[Ministry of Endowments and Religious Affairs (Oman)]] is a full member with voting rights.<ref name="members" /> The Consortium has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard [[#Unicode Transformation Format and Universal Character Set\|Unicode Transformation Format]] (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with [[multilingualism\|multilingual]] environments. 
===Scripts covered=== {{Main\|Script (Unicode)}} [[File:Unicode sample.png\|thumb\|right\|200px\|Many modern applications can render a substantial subset of the many [[scripts in Unicode]], as demonstrated by this screenshot from the [[OpenOffice.org]] application.]]<!-- screenshot fair use rationale: this screenshot is used specifically to illustrate the Unicode-related capabilities of modern desktop applications and the breadth of supported Unicode scripts --> Unicode covers almost all scripts ([[writing system]]s) in current use today.<ref>{{cite web \| title = Character Code Charts \| url = https://www.unicode.org/charts/ \| accessdate = 2010-03-17}} </ref>{{failed verification\|date=October 2013}}<ref>{{Cite web\|url=https://home.unicode.org/basic-info/faq/\|title=Unicode FAQ\|last=\|first=\|date=\|website=\|url-status=live\|archive-url=\|archive-date=\|access-date=2020-04-02}}</ref> A total of 154 [[Script (Unicode)\|scripts]] are included in the latest version of Unicode (covering [[alphabet]]s, [[abugida]]s and [[Syllabary\|syllabaries]]), although there are still scripts that are not yet encoded, particularly those mainly used in historical, liturgical, and academic contexts. Further additions of characters to the already encoded scripts, as well as symbols, in particular for mathematics and [[musical notation\|music]] (in the form of notes and rhythmic symbols), also occur. The Unicode Roadmap Committee ([[Michael Everson]], Rick McGowan, Ken Whistler, V.S. Umamaheswaran<ref>{{Cite web \| title=Roadmap to the BMP \| url=https://www.unicode.org/roadmaps/bmp/ \| publisher=[[Unicode Consortium]] \| accessdate=30 July 2018 }}</ref>) maintain the list of scripts that are candidates or potential candidates for encoding and their tentative code block assignments on the [https://www.unicode.org/roadmaps/ Unicode Roadmap] page of the [[Unicode Consortium]] Web site. 
For some scripts on the Roadmap, such as [[Jurchen script\|Jurchen]] and [[Khitan small script]], encoding proposals have been made and are working their way through the approval process. For other scripts, such as [[Maya script\|Mayan]] (besides numbers) and [[Rongorongo]], no proposal has yet been made, and they await agreement on character repertoire and other details from the user communities involved. Some modern invented scripts which have not yet been included in Unicode (e.g., [[Tengwar]]) or which do not qualify for inclusion in Unicode due to lack of real-world use (e.g., [[Klingon scripts\|Klingon]]) are listed in the [[ConScript Unicode Registry]], along with unofficial but widely used [[Private Use Areas]] code assignments. There is also a [[Medieval Unicode Font Initiative]] focused on special Latin medieval characters. Some of these proposals have already been included in Unicode. The [https://linguistics.berkeley.edu/sei/ Script Encoding Initiative], a project run by Deborah Anderson at the [[University of California, Berkeley]], was founded in 2002 with the goal of funding proposals for scripts not yet encoded in the standard. The project has become a major source of proposed additions to the standard in recent years.<ref>{{cite web\|url=https://www.unicode.org/pending/about-sei.html \|title=About The Script Encoding Initiative \|publisher=The Unicode Consortium \|date= \|accessdate=2012-06-04}}</ref> ==={{anchor\|1.0.0\|1.0.1\|1.1\|2.0\|2.1\|3.0\|3.1\|3.2\|4.0\|4.1\|5.0\|5.1\|5.2\|6.0\|6.1\|6.2\|6.3\|7.0\|8.0\|9.0\|10.0\|11.0\|12.0\|12.1\|13.0\|14.0}}Versions=== Unicode is developed in conjunction with the [[International Organization for Standardization]] and shares the character repertoire with [[ISO/IEC 10646]]: the Universal Character Set. 
Unicode and ISO/IEC 10646 function equivalently as character encodings, but ''The Unicode Standard'' contains much more information for implementers, covering—in depth—topics such as bitwise encoding, [[Unicode collation algorithm\|collation]] and rendering. The Unicode Standard enumerates a multitude of character properties, including those needed for supporting [[Bi-directional text\|bidirectional text]]. The two standards do use slightly different terminology. The Unicode Consortium first published ''The Unicode Standard'' in 1991 (version 1.0), and has published new versions on a regular basis since then. The latest version of the Unicode Standard, version 13.0, was released in March 2020, and is available in electronic format from the consortium's website. The last version of the standard that was published completely in book form (including the code charts) was version 5.0 in 2006, but since version 5.2 (2009) the core specification of the standard has been published as a print-on-demand paperback.<ref name=version6.1PoD>{{cite web\|title=Unicode 6.1 Paperback Available\|url=https://www.unicode.org/mail-arch/unicode-ml/y2012-m05/0240.html\|work=announcements_at_unicode.org\|accessdate=2012-05-30}}</ref> The entire text of each version of the standard, including the core specification, standard annexes and code charts, is freely available in [[PDF]] format on the Unicode website. In April 2020, Unicode announced that the release of the forthcoming version 14.0 had been postponed by six months from its initial release of March 2021 due to the [[COVID-19 pandemic]].<ref>{{cite web\|title=Unicode 14.0 Delayed for 6 Months\|url=https://home.unicode.org/unicode-14-0-delayed-for-6-months/\|accessdate=2020-05-05}}</ref> Thus far, the following major and minor versions of the Unicode standard have been published. 
Update versions, which do not include any changes to character repertoire, are signified by the third number (e.g., "version 4.0.1") and are omitted in the table below.<ref>{{cite web \| title = Enumerated Versions of The Unicode Standard \| url = https://www.unicode.org/versions/enumeratedversions.html \| accessdate = 2016-06-21}}</ref> {\| class="wikitable" \|- \|+ Unicode versions \|- !rowspan=2\| Version !rowspan=2\| Date !rowspan=2\| Book !rowspan=2\| Corresponding [[Universal Character Set\|ISO/IEC 10646]] edition !rowspan=2\| [[Script (Unicode)\|Scripts]] !colspan=2\| Characters \|- ! Total{{refn\|The number of characters listed for each version of Unicode is the total number of graphic and format characters (i.e., excluding [[Private Use Area\|private-use characters]], [[Unicode control characters\|control characters]], [[noncharacter\|noncharacters]] and [[surrogate code points]]).\|group=tablenote}} ! Notable additions \|- \| 1.0.0 \| October 1991 \| {{ISBN\|0-201-56788-1}} (Vol. 1) \| \| 24 \| 7,129 \| Initial repertoire covers these scripts: [[Arabic script\|Arabic]], [[Armenian alphabet\|Armenian]], [[Bengali alphabet\|Bengali]], [[Zhuyin\|Bopomofo]], [[Cyrillic script\|Cyrillic]], [[Devanagari]], [[Georgian alphabet\|Georgian]], [[Greek alphabet\|Greek and Coptic]], [[Gujarati alphabet\|Gujarati]], [[Gurmukhi script\|Gurmukhi]], [[Hangul]], [[Hebrew alphabet\|Hebrew]], [[Hiragana]], [[Kannada alphabet\|Kannada]], [[Katakana]], [[Lao script\|Lao]], [[Latin script\|Latin]], [[Malayalam script\|Malayalam]], [[Oriya script\|Oriya]], [[Tamil script\|Tamil]], [[Telugu script\|Telugu]], [[Thai alphabet\|Thai]], and [[Tibetan script\|Tibetan]].<ref>{{cite web\| title = Unicode Data 1.0.0\|url= https://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt\| accessdate = 2010-03-16}}</ref> \|- \| 1.0.1 \| June 1992 \| {{ISBN\|0-201-60845-6}} (Vol. 
2) \| \| 25 \| 28,327<br />(21,204 added; 6 removed) \| The initial set of 20,902 [[CJK Unified Ideographs]] is defined.<ref> {{cite web \| title = Unicode Data 1.0.1 \| url = https://www.unicode.org/Public/reconstructed/1.0.1/UnicodeData.txt \| accessdate = 2010-03-16}}</ref> \|- \| 1.1 \| June 1993 \| \| ISO/IEC 10646-1:1993 \| 24 \| 34,168<br />(5,963 added; 89 removed; 33 reclassified as control characters) \| 4,306 more [[Hangul]] syllables added to original set of 2,350 characters. [[Tibetan script\|Tibetan]] removed.<ref>{{cite web \| title = Unicode Data 1995 \| url = https://www.unicode.org/Public/1.1-Update/UnicodeData-1.1.5.txt \| accessdate = 2010-03-16 }} </ref> \|- \| 2.0 \| July 1996 \| {{ISBN\|0-201-48345-9}} \| ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7 \| 25 \| 38,885<br />(11,373 added; 6,656 removed) \| Original set of [[Hangul]] syllables removed, and a new set of 11,172 Hangul syllables added at a new location. [[Tibetan script\|Tibetan]] added back in a new location and with a different character repertoire. 
Surrogate character mechanism defined, and Plane 15 and Plane 16 [[Private use (Unicode)\|Private Use Areas]] allocated.<ref>{{cite web \| title = Unicode Data-2.0.14 \| url = https://www.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt \| accessdate = 2010-03-16}} </ref> \|- \| 2.1 \| May 1998 \| \| ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7, as well as two characters from Amendment 18 \| 25 \| 38,887<br />(2 added) \| [[Euro sign]] and [[Specials (Unicode block)\|Object Replacement Character]] added.<ref>{{cite web \| title = Unicode Data-2.1.2 \| url = https://www.unicode.org/Public/2.1-Update/UnicodeData-2.1.2.txt \| accessdate = 2010-03-16}} </ref> \|- \| 3.0 \| September 1999 \| {{ISBN\|0-201-61633-5}} \| ISO/IEC 10646-1:2000 \| 38 \| 49,194<br />(10,307 added) \| [[Cherokee syllabary\|Cherokee]], [[Ge'ez alphabet\|Ethiopic]], [[Khmer script\|Khmer]], [[Mongolian script\|Mongolian]], [[Burmese script\|Burmese]], [[Ogham]], [[Runic alphabet\|Runic]], [[Sinhala script\|Sinhala]], [[Syriac alphabet\|Syriac]], [[Tāna\|Thaana]], [[Canadian Aboriginal syllabics\|Unified Canadian Aboriginal Syllabics]], and [[Yi script\|Yi Syllables]] added, as well as a set of [[Braille]] patterns.<ref>{{cite web \| title = Unicode Data-3.0.0 \| url = https://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt \| accessdate = 2010-03-16}} </ref> \|- \| 3.1 \| March 2001 \| \| ISO/IEC 10646-1:2000 ISO/IEC 10646-2:2001 \| 41 \| 94,140<br />(44,946 added) \| [[Deseret alphabet\|Deseret]], [[Gothic alphabet\|Gothic]] and [[Old Italic alphabet\|Old Italic]] added, as well as sets of symbols for [[Modern musical symbols\|Western music]] and [[Byzantine music]], and 42,711 additional [[CJK Unified Ideographs]].<ref>{{cite web \| title =Unicode Data-3.1.0 \| url =https://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt \| accessdate = 2010-03-16 }} </ref> \|- \| 3.2 \| March 2002 \| \| ISO/IEC 10646-1:2000 plus Amendment 1 ISO/IEC 10646-2:2001 \| 45 \| 95,156<br 
/>(1,016 added) \| [[Philippines\|Philippine]] scripts [[Buhid script\|Buhid]], [[Hanunó'o script\|Hanunó'o]], [[Baybayin\|Tagalog]], and [[Tagbanwa script\|Tagbanwa]] added.<ref>{{cite web \| title = Unicode Data-3.2.0 \| url = https://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt \| accessdate = 2010-03-16}} </ref> \|- \| 4.0 \| April 2003 \| {{ISBN\|0-321-18578-1}} \| ISO/IEC 10646:2003 \| 52 \| 96,382<br />(1,226 added) \| [[Cypriot syllabary]], [[Limbu script\|Limbu]], [[Linear B]], [[Osmanya script\|Osmanya]], [[Shavian alphabet\|Shavian]], [[Tai Nüa language#Writing system\|Tai Le]], and [[Ugaritic alphabet\|Ugaritic]] added, as well as [[Hexagram (I Ching)\|Hexagram symbols]].<ref>{{cite web \| title = Unicode Data-4.0.0 \| url = https://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt \| accessdate = 2010-03-16}} </ref> \|- \| 4.1 \| March 2005 \| \| ISO/IEC 10646:2003 plus Amendment 1 \| 59 \| 97,655<br />(1,273 added) \| [[Lontara alphabet\|Buginese]], [[Glagolitic alphabet\|Glagolitic]], [[Kharoṣṭhī\|Kharoshthi]], [[New Tai Lue alphabet\|New Tai Lue]], [[Old Persian cuneiform script\|Old Persian]], [[Sylheti Nagari\|Syloti Nagri]], and [[Tifinagh]] added, and [[Coptic alphabet\|Coptic]] was disunified from [[Greek alphabet\|Greek]]. 
Ancient [[Unicode numerals#Ancient Greek numerals\|Greek numbers]] and [[Musical notation#Ancient Greece\|musical symbols]] were also added.<ref>{{cite web\|url=https://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt\|title=Unicode Data-4.1.0\|accessdate=2010-03-16}} </ref> \|- \| 5.0 \| July 2006 \| {{ISBN\|0-321-48091-0}} \| ISO/IEC 10646:2003 plus Amendments 1 and 2, as well as four characters from Amendment 3 \| 64 \| 99,024<br />(1,369 added) \| [[Balinese alphabet\|Balinese]], [[Cuneiform]], [[N'Ko alphabet\|N'Ko]], [[Phags-pa script\|Phags-pa]], and [[Phoenician alphabet\|Phoenician]] added.<ref>{{cite web \| title = Unicode Data 5.0.0 \| url = https://www.unicode.org/Public/5.0.0/ucd/UnicodeData.txt \| accessdate = 2010-03-17}} </ref> \|- \| 5.1 \| April 2008 \| \| ISO/IEC 10646:2003 plus Amendments 1, 2, 3 and 4 \| 75 \| 100,648<br />(1,624 added) \| [[Carian script\|Carian]], [[Cham alphabet\|Cham]], [[Kayah Li script\|Kayah Li]], [[Lepcha script\|Lepcha]], [[Lycian script\|Lycian]], [[Lydian script\|Lydian]], [[Ol Chiki script\|Ol Chiki]], [[Rejang script\|Rejang]], [[Saurashtra script\|Saurashtra]], [[Sundanese script\|Sundanese]], and [[Vai syllabary\|Vai]] added, as well as sets of symbols for the [[Phaistos Disc]], [[Mahjong\|Mahjong tiles]], and [[Dominoes\|Domino tiles]]. 
There were also important additions for [[Burmese script\|Burmese]], additions of letters and [[Scribal abbreviation]]s used in medieval [[manuscript]]s, and the addition of [[Capital ẞ]].<ref>{{cite web \| title = Unicode Data 5.1.0 \| url = https://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt \| accessdate = 2010-03-17 }} </ref> \|- \| 5.2 \| October 2009 \| {{ISBN\|978-1-936213-00-9}} \| ISO/IEC 10646:2003 plus Amendments 1, 2, 3, 4, 5 and 6 \| 90 \| 107,296<br />(6,648 added) \| [[Avestan alphabet\|Avestan]], [[Bamum script\|Bamum]], [[Egyptian hieroglyphs]] (the [[Gardiner's sign list\|Gardiner Set]], comprising 1,071 characters), [[Imperial Aramaic]], [[Inscriptional Pahlavi]], [[Inscriptional Parthian]], [[Javanese script\|Javanese]], [[Kaithi]], [[Fraser alphabet\|Lisu]], [[Meitei Mayek script\|Meetei Mayek]], [[South Arabian alphabet\|Old South Arabian]], [[Old Turkic script\|Old Turkic]], [[Samaritan script\|Samaritan]], [[Tai Tham script\|Tai Tham]] and [[Tai Viet script\|Tai Viet]] added. 4,149 additional [[CJK Unified Ideographs]] (CJK-C), as well as extended Jamo for [[Hangul\|Old Hangul]], and characters for [[Vedic Sanskrit]].<ref>{{cite web \| title = Unicode Data 5.2.0 \| url = https://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt \| accessdate = 2010-03-17}} </ref> \|- \| 6.0 \| October 2010 \| {{ISBN\|978-1-936213-01-6}} \| ISO/IEC 10646:2010 plus the [[Indian rupee sign]] \| 93 \| 109,384<br />(2,088 added) \| [[Batak alphabet\|Batak]], [[Brāhmī script\|Brahmi]], [[Mandaic alphabet\|Mandaic]], [[playing card]] symbols, [[Traffic sign\|transport]] and [[map]] symbols, [[alchemical symbol]]s, [[emoticons]] and [[emoji]]. 
222 additional [[CJK Unified Ideographs]] (CJK-D) added.<ref>{{cite web \| title = Unicode Data 6.0.0 \| url = https://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt \| accessdate = 2010-10-11}} </ref> \|- \| 6.1 \| January 2012 \| {{ISBN\|978-1-936213-02-3}} \| ISO/IEC 10646:2012 \| 100 \| 110,116<br />(732 added) \| [[Chakma alphabet\|Chakma]], [[Meroitic alphabet\|Meroitic cursive]], [[Meroitic alphabet\|Meroitic hieroglyphs]], [[Pollard script\|Miao]], [[Śāradā script\|Sharada]], [[Sora Sompeng]], and [[Takri alphabet\|Takri]].<ref>{{cite web \| title = Unicode Data 6.1.0 \| url = https://www.unicode.org/Public/6.1.0/ucd/UnicodeData.txt \| accessdate = 2012-01-31}} </ref> \|- \| 6.2 \| September 2012 \| {{ISBN\|978-1-936213-07-8}} \| ISO/IEC 10646:2012 plus the [[Turkish lira sign]] \| 100 \| 110,117<br />(1 added) \| [[Turkish lira sign]].<ref>{{cite web \| title = Unicode Data 6.2.0 \| url = https://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt \| accessdate = 2012-09-26}} </ref> \|- \| 6.3 \| September 2013 \| {{ISBN\|978-1-936213-08-5}} \| ISO/IEC 10646:2012 plus six characters \| 100 \| 110,122<br />(5 added) \| 5 bidirectional formatting characters.<ref>{{cite web \| title = Unicode Data 6.3.0 \| url = https://www.unicode.org/Public/6.3.0/ucd/UnicodeData.txt \| accessdate = 2013-09-30}} </ref> \|- \| 7.0 \| June 2014 \| {{ISBN\|978-1-936213-09-2}} \| ISO/IEC 10646:2012 plus Amendments 1 and 2, as well as the [[Ruble sign]] \| 123 \| 112,956<br />(2,834 added) \| [[Bassa alphabet\|Bassa Vah]], [[Caucasian Albanian alphabet\|Caucasian Albanian]], [[Duployan shorthand\|Duployan]], [[Elbasan alphabet\|Elbasan]], [[Grantha alphabet\|Grantha]], [[Khojki]], [[Khudabadi alphabet\|Khudawadi]], [[Linear A]], [[Mahajani]], [[Manichaean alphabet\|Manichaean]], [[Mende script\|Mende Kikakui]], [[Modi alphabet\|Modi]], [[Mro script\|Mro]], [[Nabataean alphabet\|Nabataean]], [[Old North Arabian]], [[Old Permic alphabet\|Old Permic]], [[Pahawh Hmong]], [[Palmyrene 
script\|Palmyrene]], [[Pau Cin Hau script\|Pau Cin Hau]], [[Psalter Pahlavi]], [[Siddhaṃ alphabet\|Siddham]], [[Tirhuta]], [[Warang Citi]], and [[Dingbat]]s.<ref>{{cite web \| title = Unicode Data 7.0.0 \| url = https://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt \| accessdate = 2014-06-15}} </ref> \|- \| 8.0 \| June 2015 \| {{ISBN\|978-1-936213-10-8}} \| ISO/IEC 10646:2014 plus Amendment 1, as well as the [[Georgian lari\|Lari sign]], nine CJK unified ideographs, and 41 emoji characters<ref>{{Cite web \| title=Unicode 8.0.0 \| url=https://www.unicode.org/versions/Unicode8.0.0/ \| publisher=Unicode Consortium \| accessdate=2015-06-17 }}</ref> \| 129 \| 120,672<br />(7,716 added) \| [[Ahom alphabet\|Ahom]], [[Anatolian hieroglyphs]], [[Hatran alphabet\|Hatran]], [[Multani alphabet\|Multani]], [[Old Hungarian alphabet\|Old Hungarian]], [[SignWriting]], 5,771 [[CJK Unified Ideographs\|CJK unified ideographs]], a set of lowercase letters for [[Cherokee syllabary\|Cherokee]], and five emoji [[Fitzpatrick scale\|skin tone]] modifiers<ref>{{cite web \| title = Unicode Data 8.0.0 \| url = https://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt \| accessdate = 2015-06-17}} </ref> \|- \| 9.0 \| June 2016 \| {{ISBN\|978-1-936213-13-9}} \| ISO/IEC 10646:2014 plus Amendments 1 and 2, as well as Adlam, Newa, Japanese TV symbols, and 74 emoji and symbols<ref>{{Cite web \| title=Unicode 9.0.0 \| url=https://www.unicode.org/versions/Unicode9.0.0/ \| publisher=Unicode Consortium \| accessdate=2016-06-21 }}</ref> \| 135 \| 128,172<br />(7,500 added) \| [[Adlam script\|Adlam]], [[Bhaiksuki alphabet\|Bhaiksuki]], [[Zhang-Zhung language#Scripts\|Marchen]], [[Prachalit Nepal alphabet\|Newa]], [[Osage alphabet\|Osage]], [[Tangut script\|Tangut]], and 72 [[emoji]]<ref>{{cite web \| title = Unicode Data 9.0.0 \| url = https://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt \| accessdate = 2016-06-21}} </ref><ref name=laobo>{{cite 
web\|first=Martim\|last=Lobao\|url=https://www.androidpolice.com/2016/06/07/two-emoji-werent-approved-unicode-9-google-added-android-anyway/ \|title=These Are The Two Emoji That Weren't Approved For Unicode 9 But Which Google Added To Android Anyway\|website=Android Police\|date= 7 June 2016\|accessdate=4 September 2016}}</ref> \|- \| 10.0 \| June 2017 \| {{ISBN\|978-1-936213-16-0}} \| ISO/IEC 10646:2017 plus 56 [[emoji]] characters, 285 [[hentaigana]] characters, and 3 Zanabazar Square characters<ref>{{Cite web \| title=Unicode 10.0.0 \| url=https://www.unicode.org/versions/Unicode10.0.0/ \| publisher=Unicode Consortium \| accessdate=2017-06-20 }}</ref> \| 139 \| 136,690<br />(8,518 added) \| [[Zanabazar Square alphabet\|Zanabazar Square]], [[Soyombo alphabet\|Soyombo]], [[Masaram Gondi script\|Masaram Gondi]], [[Nüshu script\|Nüshu]], [[hentaigana]] (non-standard [[hiragana]]), 7,494 [[CJK Unified Ideographs\|CJK unified ideographs]], and 56 [[emoji]] \|- \| 11.0 \| June 2018 \| {{ISBN\|978-1-936213-19-1}} \| ISO/IEC 10646:2017 plus Amendment 1, as well as 46 Mtavruli Georgian capital letters, 5 CJK unified ideographs, and 66 emoji characters.<ref>{{Cite web \| title=The Unicode Standard, Version 11.0.0 Appendix C \| url=https://www.unicode.org/versions/Unicode11.0.0/appC.pdf \| publisher=Unicode Consortium \| accessdate=2018-06-11 }}</ref> \| 146 \| 137,374<br />(684 added) \| [[Dogri script\|Dogra]], [[Georgian scripts#Mkhedruli\|Georgian Mtavruli]] capital letters, [[Gunjala Gondi Lipi\|Gunjala Gondi]], [[Hanifi Rohingya script\|Hanifi Rohingya]], [[Indic Siyaq Numbers (Unicode block)\|Indic Siyaq numbers]], [[Makassarese language\|Makasar]], [[Medefaidrin]], [[Sogdian alphabet\|Old Sogdian and Sogdian]], [[Mayan numerals]], 5 urgently needed [[CJK Unified Ideographs\|CJK unified ideographs]], symbols for [[xiangqi]] (Chinese chess) and [[Star (classification)\|star ratings]], and 145 [[emoji]]<ref>{{Cite 
web\|url=http://blog.unicode.org/2018/06/announcing-unicode-standard-version-110.html\|title=Announcing The Unicode Standard, Version 11.0\|website=blog.unicode.org\|access-date=2018-06-06}}</ref> \|- \| 12.0 \| March 2019 \| {{ISBN\|978-1-936213-22-1}} \| ISO/IEC 10646:2017 plus Amendments 1 and 2, as well as 62 additional characters.<ref>{{Cite web \| title=The Unicode Standard, Version 12.0.0 Appendix C \| url=https://www.unicode.org/versions/Unicode12.0.0/appC.pdf \| publisher=Unicode Consortium \| accessdate=2019-03-05 }}</ref> \| 150 \| 137,928<br />(554 added) \| [[Elymaic]], [[Nandinagari]], [[Nyiakeng Puachue Hmong]], [[Wancho script\|Wancho]], [[Pollard script\|Miao script]] additions for several Miao and Yi dialects in China, [[hiragana]] and [[katakana]] small letters for writing archaic Japanese, [[Tamil script\|Tamil]] historic fractions and symbols, [[Lao alphabet\|Lao]] letters for [[Pali]], Latin letters for Egyptological and Ugaritic transliteration, hieroglyph format controls, and 61 [[emoji]]<ref>{{Cite web\|url=http://blog.unicode.org/2019/03/announcing-unicode-standard-version-120.html\|title=Announcing The Unicode Standard, Version 12.0\|website=blog.unicode.org\|access-date=2019-03-05}}</ref> \|- \| 12.1 \| May 2019 \| {{ISBN\|978-1-936213-25-2}} \| \| 150 \| 137,929<br />(1 added) \| Adds a single character at U+32FF for the square ligature form of the name of the [[Reiwa\|Reiwa era]].<ref>{{Cite web\|url=http://blog.unicode.org/2019/05/unicode-12-1-en.html\|title=Unicode Version 12.1 released in support of the Reiwa Era\|website=blog.unicode.org\|access-date=2019-05-07}}</ref> \|- \| [http://www.unicode.org/versions/Unicode13.0.0/ 13.0] \| March 2020 \| {{ISBN\|978-1-936213-26-9}} \| ISO/IEC 10646:2020<ref>{{Cite web \| title=The Unicode Standard, Version 13.0– Core Specification Appendix C \| url=https://www.unicode.org/versions/Unicode13.0.0/appC.pdf \| publisher=Unicode Consortium \| accessdate=2020-03-11 }}</ref> \| 154 \| 143,859<br 
/>(5,930 added) \| [[Khwarezmian_language#Writing_system\|Chorasmian]], [[Dhives akuru\|Dives Akuru]], [[Khitan small script]], [[Kurdish alphabets#Yezidi\|Yezidi]], 4,969 CJK unified ideographs added (including 4,939 in [[CJK Unified Ideographs Extension G\|Ext. G]]), Arabic script additions used to write [[Hausa language\|Hausa]], [[Wolof language\|Wolof]], and other languages in Africa and other additions used to write [[Hindko]] and [[Punjabi language\|Punjabi]] in Pakistan, Bopomofo additions used for Cantonese, Creative Commons license symbols, graphic characters for compatibility with teletext and home computer systems from the 1970s and 1980s, and 55 emoji<ref>{{Cite web\|url=http://blog.unicode.org/2020/03/announcing-unicode-standard-version-130.html\|title=Announcing The Unicode Standard, Version 13.0\|website=blog.unicode.org\|access-date=2020-03-11}}</ref> \|} {{Reflist\|group=tablenote}} ==<span id="Upluslink"></span><span id="codespace"></span> Architecture and terminology== {{See also\|Universal Character Set characters}}<!-- Template:U+ links to this paragraph --> The Unicode Standard defines a ''codespace''<ref name="Glossary">{{cite web\|title = Glossary of Unicode Terms\|url=https://unicode.org/glossary/\|accessdate=2010-03-16}}</ref> of numerical values ranging from 0 through 10FFFF<sub>[[hexadecimal\|16]]</sub>,<ref>{{cite book\|title=The Unicode Standard, Version 13.0 \|url=https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212\|year=2019\|page=19\|chapter=3.4 Characters and Encoding}}</ref> called ''[[code point\|code points]]''<ref name=":0">{{Cite book\|url=http://www.unicode.org/versions/Unicode12.0.0/ch02.pdf#G25564\|title=The Unicode Standard Version 12.0 – Core Specification\|last=\|first=\|publisher=\|year=2019\|isbn=\|location=\|page=29\|chapter=2.4 Code Points and Characters}}</ref> and denoted as U+0000 through U+10FFFF ("U+" plus the code point value in [[hexadecimal]], prepended with [[leading zero\|leading zeros]] as 
necessary to result in a minimum of four digits, ''e.g.'', U+00F7 for the division sign, ÷, versus U+13254 for the [[Egyptian hieroglyph]] designating a [[List of hieroglyphs#O\|reed shelter]] or a [[c:Category:Winding wall (h hieroglyph)\|winding wall]] {{nowrap\|( [[File:Hiero O4.png\|text-bottom\|15px]] )}}<ref>{{Cite web\|url=https://www.unicode.org/versions/Unicode13.0.0/appA.pdf\|date=March 2020\|title=Appendix A: Notational Conventions\|publisher=Unicode Consortium\|work=The Unicode Standard}} In conformity with the bullet point relating to Unicode in [[MOS:ALLCAPS]], the formal Unicode names are not used in this paragraph.</ref>), respectively. Out of these 2<sup>16</sup> + 2<sup>20</sup> defined code points, the code points from U+D800 through U+DFFF, which are used to encode surrogate pairs in [[UTF-16]], are reserved by the Standard and may not be used to encode valid characters, resulting in a net total of 2<sup>16</sup> − 2<sup>11</sup> + 2<sup>20</sup> = 1,112,064 possible code points corresponding to valid Unicode characters. Not all of these code points necessarily correspond to visible characters; several, for example, are assigned to control codes such as [[carriage return]]. ===Code point planes and blocks=== {{Main\|Plane (Unicode)}} The Unicode codespace is divided into seventeen ''planes'', numbered 0 to 16: {{Planes (Unicode)}} All code points in the Basic Multilingual Plane (BMP, Plane 0) are accessed as a single code unit in [[UTF-16]] encoding and can be encoded in one, two or three bytes in [[UTF-8]]. Code points in Planes 1 through 16 (''supplementary planes'') are accessed as surrogate pairs in UTF-16 and encoded in four bytes in UTF-8. Within each plane, characters are allocated within named ''[[Block (Unicode)\|blocks]]'' of related characters. Although blocks may be of arbitrary size, their size is always a multiple of 16 code points and often a multiple of 128 code points. Characters required for a given script may be spread out over several different blocks. 
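As a rough illustration of the notation and arithmetic above, the following Python sketch formats code points in U+ notation, derives the plane number, and measures encoded lengths. The helper names (<code>u_plus</code>, <code>plane</code>, <code>utf8_len</code>, <code>utf16_units</code>) are invented for this example and are not part of any standard API; only the built-in <code>chr</code> and <code>str.encode</code> are relied on.

```python
# Illustrative sketch only; helper names are hypothetical.

def u_plus(cp: int) -> str:
    """Format a code point as 'U+' followed by at least four hex digits."""
    return f"U+{cp:04X}"

def plane(cp: int) -> int:
    """Each plane holds 2**16 code points, so the plane is the value above bit 15."""
    return cp >> 16

def utf8_len(cp: int) -> int:
    """Number of bytes needed for this code point in UTF-8."""
    return len(chr(cp).encode("utf-8"))

def utf16_units(cp: int) -> int:
    """Number of 16-bit code units needed for this code point in UTF-16."""
    return len(chr(cp).encode("utf-16-le")) // 2

# Size of the codespace U+0000..U+10FFFF: 2**16 + 2**20 code points.
CODESPACE_SIZE = 0x10FFFF + 1
# Surrogate range U+D800..U+DFFF: 2**11 code points.
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1
# Code points that may correspond to valid characters.
VALID_CODE_POINTS = CODESPACE_SIZE - SURROGATE_COUNT

if __name__ == "__main__":
    for cp in (0x41, 0xF7, 0x20AC, 0x13254):
        print(u_plus(cp), "plane", plane(cp),
              "UTF-8 bytes:", utf8_len(cp),
              "UTF-16 units:", utf16_units(cp))
    print(VALID_CODE_POINTS)
```

Under these definitions the division sign U+00F7 sits in plane 0 and needs two UTF-8 bytes and one UTF-16 code unit, while the hieroglyph U+13254 sits in plane 1 and needs four UTF-8 bytes and a surrogate pair in UTF-16.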
===General Category property=== Each code point has a single [[Character property (Unicode)#General Category\|General Category]] property. The major categories are Letter, Mark, Number, Punctuation, Symbol, Separator and Other. Within these categories, there are subdivisions. In most cases other properties must be used to sufficiently specify the characteristics of a code point. The possible General Categories are: {{General Category (Unicode)}} Code points in the range U+D800–U+DBFF (1,024 code points) are known as high-'''surrogate''' code points, and code points in the range U+DC00–U+DFFF (1,024 code points) are known as low-surrogate code points. A high-surrogate code point followed by a low-surrogate code point forms a surrogate pair in [[UTF-16]] to represent a code point greater than U+FFFF. These code points otherwise cannot be used (this rule is often ignored in practice, especially when not using UTF-16). A small set of code points is guaranteed never to be used for encoding characters, although applications may make use of these code points internally if they wish. There are sixty-six of these '''noncharacters''': U+FDD0–U+FDEF and any code point ending in the value FFFE or FFFF (i.e., U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF). The set of noncharacters is stable, and no new noncharacters will ever be defined.<ref name="stability-policy">{{cite web \| title = Unicode Character Encoding Stability Policy \| url = https://unicode.org/policies/stability_policy.html \| accessdate = 2010-03-16}} </ref> As with surrogates, the rule that these cannot be used is often ignored, although the operation of the [[byte order mark]] assumes that U+FFFE will never be the first code point in a text. Excluding surrogates and noncharacters leaves 1,111,998 code points available for use. 
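The surrogate and noncharacter rules above can be sketched in a few lines of Python. This is an illustration under the ranges stated in the text, not a reference implementation; the function names are hypothetical. A supplementary code point is reduced by 10000<sub>16</sub> to a 20-bit value, whose upper ten bits select the high surrogate and whose lower ten bits select the low surrogate.

```python
# Illustrative sketch only; function names are hypothetical.

def to_surrogate_pair(cp: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) surrogate code points."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogate pairs")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits -> U+DC00..U+DFFF
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

# Count noncharacters across the whole codespace U+0000..U+10FFFF.
NONCHARACTER_COUNT = sum(1 for cp in range(0x110000) if is_noncharacter(cp))

# Codespace minus the 2,048 surrogates and the noncharacters.
USABLE_CODE_POINTS = 0x110000 - 2048 - NONCHARACTER_COUNT

if __name__ == "__main__":
    high, low = to_surrogate_pair(0x13254)
    print(f"U+13254 -> {high:04X} {low:04X}")
    print(NONCHARACTER_COUNT, USABLE_CODE_POINTS)
```

The noncharacter count comes out as 32 (the U+FDD0–U+FDEF range) plus two per plane for seventeen planes, which matches the sixty-six stated above.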
'''Private-use''' code points are considered to be assigned characters, but they have no interpretation specified by the Unicode standard<ref>{{cite web \| title = Properties \| url = https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G43463 \| accessdate = 2020-03-15 }} </ref> so any interchange of such characters requires an agreement between sender and receiver on their interpretation. There are three private-use areas in the Unicode codespace: * Private Use Area: U+E000–U+F8FF (6,400 characters) * Supplementary Private Use Area-A: U+F0000–U+FFFFD (65,534 characters) * Supplementary Private Use Area-B: U+100000–U+10FFFD (65,534 characters). Graphic characters are characters defined by Unicode to have particular semantics, and either have a visible [[glyph]] shape or represent a visible space. As of Unicode 13.0 there are 143,696 graphic characters. '''Format''' characters are characters that do not have a visible appearance, but may have an effect on the appearance or behavior of neighboring characters. For example, {{unichar\|200C\|Zero-width non-joiner\|nlink=}} and {{unichar\|200D\|Zero-width joiner\|nlink=}} may be used to change the default shaping behavior of adjacent characters (e.g., to inhibit ligatures or request ligature formation). There are 163 format characters in Unicode 13.0. Sixty-five code points (U+0000–U+001F and U+007F–U+009F) are reserved as '''control''' codes, and correspond to the [[C0 and C1 control codes]] defined in [[ISO/IEC 6429]]. U+0009 (Tab), U+000A (Line Feed), and U+000D (Carriage Return) are widely used in Unicode-encoded texts. In practice the C1 code points are often improperly-translated ([[Mojibake]]) legacy [[Windows-1252]] characters used by some English and Western European texts with Windows technologies. Graphic characters, format characters, control code characters, and private use characters are known collectively as ''assigned characters''. 
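The figures in the preceding paragraphs can be cross-checked with Python's standard <code>unicodedata</code> module, as in the sketch below. The mapping from General Category values to the broad kinds named in the text is a simplification chosen for this example (the function name <code>classify</code> is hypothetical), and counts of graphic, format and unassigned code points depend on the Unicode version bundled with the interpreter, so only the version-independent sizes are computed here.

```python
import unicodedata

# Private-use ranges as listed in the text: (first, last), inclusive.
PRIVATE_USE_AREAS = {
    "Private Use Area": (0xE000, 0xF8FF),
    "Supplementary Private Use Area-A": (0xF0000, 0xFFFFD),
    "Supplementary Private Use Area-B": (0x100000, 0x10FFFD),
}

def area_size(first: int, last: int) -> int:
    """Number of code points in an inclusive range."""
    return last - first + 1

def classify(cp: int) -> str:
    """Map a code point to a broad kind using its General Category.

    Assumption for this sketch: line and paragraph separators (Zl, Zp)
    are grouped with format characters, and everything that is not a
    control, format, private-use, surrogate or unassigned code point
    is treated as graphic.
    """
    cat = unicodedata.category(chr(cp))
    if cat == "Cc":
        return "control"
    if cat in ("Cf", "Zl", "Zp"):
        return "format"
    if cat == "Co":
        return "private-use"
    if cat == "Cs":
        return "surrogate"
    if cat == "Cn":
        return "unassigned"
    return "graphic"

# Control codes: U+0000..U+001F and U+007F..U+009F.
CONTROL_COUNT = sum(
    1 for cp in range(0x110000) if unicodedata.category(chr(cp)) == "Cc"
)

if __name__ == "__main__":
    for name, (first, last) in PRIVATE_USE_AREAS.items():
        print(name, area_size(first, last))
    print("control codes:", CONTROL_COUNT)
    for cp in (0x0009, 0x0041, 0x200C, 0xE000):
        print(f"U+{cp:04X}", classify(cp))
```

With these definitions the three private-use areas contain 6,400, 65,534 and 65,534 code points, and the zero-width non-joiner U+200C is reported as a format character.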
'''Reserved''' code points are those code points which are available for use, but are not yet assigned. As of Unicode 13.0 there are 830,606 reserved code points. ===Abstract characters=== The set of graphic and format characters defined by Unicode does not correspond directly to the repertoire of ''abstract characters'' that is representable under Unicode. Unicode encodes characters by associating an abstract character with a particular code point.<ref>{{cite web \| title = Unicode Character Encoding Model \| url = https://unicode.org/reports/tr17/ \| accessdate = 2010-03-16}} </ref> However, not all abstract characters are encoded as a single Unicode character, and some abstract characters may be represented in Unicode by a sequence of two or more characters. For example, a Latin small letter "i" with an [[ogonek]], a [[dot above]], and an [[acute accent]], which is required in [[Lithuanian language\|Lithuanian]], is represented by the character sequence U+012F, U+0307, U+0301. Unicode maintains a list of uniquely named character sequences for abstract characters that are not directly encoded in Unicode.<ref>{{cite web \| title = Unicode Named Sequences \| url = https://unicode.org/Public/UNIDATA/NamedSequences.txt \| accessdate = 2010-03-16}} </ref> All graphic, format, and private use characters have a unique and immutable name by which they may be identified. This immutability has been guaranteed since Unicode version 2.0 by the Name Stability policy.<ref name="stability-policy" /> In cases where the name is seriously defective and misleading, or has a serious typographical error, a formal alias may be defined, and applications are encouraged to use the formal alias in place of the official character name. 
For example, {{unichar\|A015\|YI SYLLABLE WU}} has the formal alias {{sc2\|YI SYLLABLE ITERATION MARK}}, and {{unichar\|FE18\|PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRA'''KC'''ET\|note=[[sic]]}} has the formal alias {{sc2\|PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRA'''CK'''ET}}.<ref>{{cite web \| title = Unicode Name Aliases \| url = https://unicode.org/Public/UNIDATA/NameAliases.txt \| accessdate = 2010-03-16}}</ref> ===Ready-made versus composite characters=== Unicode includes a mechanism for modifying characters that greatly extends the supported glyph repertoire. This covers the use of [[combining diacritical mark]]s that may be added after the base character by the user. Multiple combining diacritics may be simultaneously applied to the same character. Unicode also contains [[precomposed character\|precomposed]] versions of most letter/diacritic combinations in normal use. These make conversion to and from legacy encodings simpler, and allow applications to use Unicode as an internal text format without having to implement combining characters. For example, ''é'' can be represented in Unicode as [[#Upluslink\|U+]]0065 ({{sc2\|LATIN SMALL LETTER E}}) followed by U+0301 ({{sc2\|COMBINING ACUTE ACCENT}}), but it can also be represented as the precomposed character U+00E9 ({{sc2\|LATIN SMALL LETTER E WITH ACUTE}}). Thus, in many cases, users have multiple ways of encoding the same character. To deal with this, Unicode provides the mechanism of [[canonical equivalence]]. An example of this arises with [[Hangul]], the Korean alphabet. Unicode provides a mechanism for composing Hangul syllables with their individual subcomponents, known as [[Hangul Jamo]]. However, it also provides 11,172 combinations of precomposed syllables made from the most common jamo. The [[CJK]] characters currently have codes only for their precomposed form. 
Still, most of those characters comprise simpler elements (called [[Radical_(Chinese_characters)\|radicals]]), so in principle Unicode could have decomposed them as it did with Hangul. This would have greatly reduced the number of required code points, while allowing the display of virtually every conceivable character (which might do away with some of the problems caused by [[Han unification]]). A similar idea is used by some [[input method]]s, such as [[Cangjie method\|Cangjie]] and [[Wubi method\|Wubi]]. However, attempts to do this for character encoding have stumbled over the fact that Chinese characters do not decompose as simply or as regularly as Hangul does. A set of [[Radical (Chinese character)\|radicals]] was provided in Unicode 3.0 (CJK radicals between U+2E80 and U+2EFF, KangXi radicals in U+2F00 to U+2FDF, and ideographic description characters from U+2FF0 to U+2FFB), but the Unicode standard (ch. 12.2 of Unicode 5.2) warns against using [[Ideographic Description Sequences\|ideographic description sequences]] as an alternate representation for previously encoded characters: {{quote\|This process is different from a formal ''encoding'' of an ideograph. There is no canonical description of unencoded ideographs; there is no semantic assigned to described ideographs; there is no equivalence defined for described ideographs. Conceptually, ideographic descriptions are more akin to the English phrase "an 'e' with an acute accent on it" than to the character sequence <U+0065, U+0301>.}} ===Ligatures=== Many scripts, including [[Arabic script\|Arabic]] and [[Devanagari\|Devanāgarī]], have special orthographic rules that require certain combinations of letterforms to be combined into special [[ligature (typography)\|ligature forms]]. 
The rules governing ligature formation can be quite complex, requiring special script-shaping technologies such as ACE (Arabic Calligraphic Engine by DecoType in the 1980s and used to generate all the Arabic examples in the printed editions of the Unicode Standard), which became the [[proof of concept]] for [[OpenType]] (by Adobe and Microsoft), [[Graphite (SIL)\|Graphite]] (by [[SIL International]]), or [[Apple Advanced Typography\|AAT]] (by Apple). Instructions are also embedded in fonts to tell the [[operating system]] how to properly output different character sequences. A simple solution to the placement of combining marks or diacritics is assigning the marks a width of zero and placing the glyph itself to the left or right of the left sidebearing (depending on the direction of the script they are intended to be used with). A mark handled this way will appear over whatever character precedes it, but will not adjust its position relative to the width or height of the base glyph; it may be visually awkward and it may overlap some glyphs. Real stacking is impossible, but can be approximated in limited cases (for example, Thai top-combining vowels and tone marks can just be at different heights to start with). Generally this approach is only effective in monospaced fonts, but may be used as a fallback rendering method when more complex methods fail. ===Standardized subsets=== Several subsets of Unicode are standardized: Microsoft Windows since [[Windows NT 4.0]] supports [[WGL-4]] with 656 characters, which is considered to support all contemporary European languages using the Latin, Greek, or Cyrillic script. 
Other standardized subsets of Unicode include the Multilingual European Subsets:<ref>[https://www.evertype.com/standards/iso10646/pdf/cwa13873.pdf CWA 13873:2000 – Multilingual European Subsets in ISO/IEC 10646-1] [[European Committee for Standardization\|CEN]] Workshop Agreement 13873</ref> MES-1 (Latin scripts only, 335 characters), MES-2 (Latin, Greek and Cyrillic 1062 characters)<ref>[https://www.cl.cam.ac.uk/~mgk25/ucs/mes-2-rationale.html Multilingual European Character Set 2 (MES-2) Rationale], [[Markus Kuhn (computer scientist)\|Markus Kuhn]], 1998</ref> and MES-3A & MES-3B (two larger subsets, not shown here). Note that MES-2 includes every character in MES-1 and WGL-4. {\| class="wikitable" \|+ {{nobold\|'''WGL-4''', ''MES-1'' and MES-2}} \|- ! Row !! Cells !! Range(s) \|- !rowspan="2"\| 00 \| '''''20–7E''''' \| [[Basic Latin (Unicode block)\|Basic Latin]] (00–7F) \|- \| '''''A0–FF''''' \| [[Latin-1 Supplement (Unicode block)\|Latin-1 Supplement]] (80–FF) \|- !rowspan="2"\| 01 \| '''''00–13,'' 14–15, ''16–2B,'' 2C–2D, ''2E–4D,'' 4E–4F, ''50–7E,'' 7F''' \| [[Latin Extended-A]] (00–7F) \|- \| 8F, '''92,''' B7, DE-EF, '''FA–FF''' \| [[Latin Extended-B]] (80–FF <span title="U+024F">...</span>) \|- !rowspan="3"\| 02 \| 18–1B, 1E–1F \| Latin Extended-B (<span title="U+00180">...</span> 00–4F) \|- \| 59, 7C, 92 \| [[IPA Extensions]] (50–AF) \|- \| BB–BD, '''C6, ''C7,'' C9,''' D6, '''''D8–DB,'' DC, ''DD,''''' DF, EE \| [[Spacing Modifier Letters]] (B0–FF) \|- ! 03 \| 74–75, 7A, 7E, '''84–8A, 8C, 8E–A1, A3–CE,''' D7, DA–E1 \| [[Greek and Coptic\|Greek]] (70–FF) \|- ! 04 \| '''00–5F, 90–91,''' 92–C4, C7–C8, CB–CC, D0–EB, EE–F5, F8–F9 \| [[Cyrillic (Unicode block)\|Cyrillic]] (00–FF) \|- ! 1E \| 02–03, 0A–0B, 1E–1F, 40–41, 56–57, 60–61, 6A–6B, '''80–85,''' 9B, '''F2–F3''' \| [[Latin Extended Additional]] (00–FF) \|- ! 
1F \| 00–15, 18–1D, 20–45, 48–4D, 50–57, 59, 5B, 5D, 5F–7D, 80–B4, B6–C4, C6–D3, D6–DB, DD–EF, F2–F4, F6–FE \| [[Greek Extended]] (00–FF) \|- !rowspan="3"\| 20 \| '''13–14, ''15,'' 17, ''18–19,'' 1A–1B, ''1C–1D,'' 1E, 20–22, 26, 30, 32–33, 39–3A, 3C, 3E, 44,''' 4A \| [[General Punctuation]] (00–6F) \|- \| '''7F''', 82 \| [[Superscripts and Subscripts]] (70–9F) \|- \| '''A3–A4, A7, ''AC,''''' AF \| [[Currency Symbols (Unicode block)\|Currency Symbols]] (A0–CF) \|- !rowspan="3"\| 21 \| '''05, 13, 16, ''22, 26,'' 2E''' \| [[Letterlike Symbols]] (00–4F) \|- \| '''''5B–5E''''' \| [[Number Forms]] (50–8F) \|- \| '''''90–93,'' 94–95, A8''' \| [[Arrows (Unicode block)\|Arrows]] (90–FF) \|- ! 22 \| 00, '''02,''' 03, '''06,''' 08–09, '''0F, 11–12, 15, 19–1A, 1E–1F,''' 27–28, '''29,''' 2A, '''2B, 48,''' 59, '''60–61, 64–65,''' 82–83, 95, 97 \| [[Mathematical Operators]] (00–FF) \|- ! 23 \| '''02, 0A, 20–21,''' 29–2A \| [[Miscellaneous Technical]] (00–FF) \|- !rowspan="3"\| 25 \| '''00, 02, 0C, 10, 14, 18, 1C, 24, 2C, 34, 3C, 50–6C''' \| [[Box Drawing]] (00–7F) \|- \| '''80, 84, 88, 8C, 90–93''' \| [[Block Elements]] (80–9F) \|- \| '''A0–A1, AA–AC, B2, BA, BC, C4, CA–CB, CF, D8–D9, E6''' \| [[Geometric Shapes]] (A0–FF) \|- ! 26 \| '''3A–3C, 40, 42, 60, 63, 65–66, ''6A,'' 6B''' \| [[Miscellaneous Symbols]] (00–FF) \|- ! F0 \| (01–02)<!--in WGL-4, but not in MES-2--> \| [[Private Use Area (Unicode block)\|Private Use Area]] (00–FF ...) \|- ! FB \| '''01–02''' \| [[Alphabetic Presentation Forms]] (00–4F) \|- ! FF \| FD \| [[Specials (Unicode block)\|Specials]] \|} Rendering software which cannot process a Unicode character appropriately often displays it as an open rectangle, or the Unicode "[[replacement character]]" (U+FFFD, �), to indicate the position of the unrecognized character. Some systems have made attempts to provide more information about such characters. 
Apple's [[Last Resort font]] displays a substitute glyph indicating the Unicode range of the character, and [[SIL International]]'s [[Unicode fallback font\|Unicode Fallback]] font displays a box showing the hexadecimal scalar value of the character.

==={{anchor\|UTF\|UCS}}Mapping and encodings===
Several mechanisms have been specified for storing a series of code points as a series of bytes. <!-- [[Unicode Transformation Format]] redirects here --> Unicode defines two mapping methods: the ''Unicode Transformation Format'' (UTF) encodings, and the ''[[Universal Coded Character Set]]'' (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode ''code points'' to sequences of values in some fixed-size range, termed ''code units''. All UTF encodings map code points to a unique sequence of bytes.<ref>{{cite web\|title=UTF-8, UTF-16, UTF-32 & BOM\|url=https://unicode.org/faq/utf_bom.html\|website=Unicode.org FAQ\|accessdate=12 December 2016}}</ref> The numbers in the names of the encodings indicate the number of bits per code unit (for UTF encodings) or the number of bytes per code unit (for UCS encodings and UTF-1). UTF-8 and UTF-16 are the most commonly used encodings. UCS-2 is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent.
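The relationship between code points and code units can be illustrated with a minimal Python sketch. The helper name <code>code_units</code> and the sample string are illustrative assumptions, not part of any standard; the sketch relies only on Python's built-in codecs and uses the BOM-less little-endian variants so that no byte order mark is counted.

```python
def code_units(text: str, encoding: str, unit_bytes: int) -> int:
    """Count the code units needed for `text` in a BOM-less encoding.

    `unit_bytes` is the size of one code unit in bytes
    (1 for UTF-8, 2 for UTF-16, 4 for UTF-32).
    """
    return len(text.encode(encoding)) // unit_bytes


# Four code points: U+0061, U+00E9, U+20AC and U+1F600 (outside the BMP).
sample = "a\u00e9\u20ac\U0001F600"

utf8_units = code_units(sample, "utf-8", 1)       # 1 + 2 + 3 + 4 bytes
utf16_units = code_units(sample, "utf-16-le", 2)  # 1 + 1 + 1 + 2 code units
utf32_units = code_units(sample, "utf-32-le", 4)  # one code unit per code point

print(utf8_units, utf16_units, utf32_units)  # prints: 10 5 4
```

Only UTF-32 yields a code unit count equal to the number of code points; in UTF-8 and UTF-16 the count depends on which characters the text contains.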
UTF encodings include: * [[UTF-1]], a retired predecessor of UTF-8, maximizes compatibility with [[ISO/IEC 2022\|ISO 2022]], no longer part of ''The Unicode Standard'' * [[UTF-7]], a 7-bit encoding sometimes used in e-mail, often considered obsolete (not part of ''The Unicode Standard'', but only documented as an informational [[Request for Comments\|RFC]], i.e., not on the Internet Standards Track) * [[UTF-8]], uses one to four bytes for each code point, maximizes compatibility with [[ASCII]] * [[UTF-EBCDIC]], similar to UTF-8 but designed for compatibility with [[EBCDIC]] (not part of ''The Unicode Standard'') * [[UTF-16]], uses one or two 16-bit code units per code point, cannot encode surrogates * [[UTF-32]], uses one 32-bit code unit per code point UTF-8 uses one to four bytes per code point and, being compact for Latin scripts and ASCII-compatible, provides the ''de facto'' standard encoding for interchange of Unicode text. It is used by [[FreeBSD]] and most recent [[Linux distributions]] as a direct replacement for legacy encodings in general text handling. The UCS-2 and UTF-16 encodings specify the Unicode [[Byte Order Mark]] (BOM) for use at the beginnings of text files, which may be used for byte ordering detection (or [[endianness\|byte endianness]] detection). The BOM, code point U+FEFF has the important property of unambiguity on byte reorder, regardless of the Unicode encoding used; U+FFFE (the result of byte-swapping U+FEFF) does not equate to a legal character, and U+FEFF in other places, other than the beginning of text, conveys the zero-width non-break space (a character with no appearance and no effect other than preventing the formation of [[ligature (typography)\|ligatures]]). The same character converted to UTF-8 becomes the byte sequence <code>EF BB BF</code>. 
The Unicode Standard allows that the BOM "can serve as signature for UTF-8 encoded text where the character set is unmarked".<ref>{{Cite book \| title=The Unicode Standard, Version 6.2 \| publisher=The Unicode Consortium \| year=2013 \| isbn=978-1-936213-08-5 \| page=561 }}</ref> Some software developers have adopted it for other encodings, including UTF-8, in an attempt to distinguish UTF-8 from local 8-bit [[code page]]s. However, {{IETF RFC\|3629}}, the UTF-8 standard, recommends forbidding byte order marks in protocols that use UTF-8, while discussing the cases where this may not be possible. In addition, the large restriction on possible patterns in UTF-8 (for instance, there cannot be any lone bytes with the high bit set) means that it should be possible to distinguish UTF-8 from other character encodings without relying on the BOM.

In UTF-32 and UCS-4, one [[32-bit]] code unit serves as a fairly direct representation of any character's code point (although the endianness, which varies across different platforms, affects how the code unit manifests as a byte sequence). In the other encodings, each code point may be represented by a variable number of code units. UTF-32 is widely used as an internal representation of text in programs (as opposed to stored or transmitted text), since on every Unix operating system that uses the [[GNU Compiler Collection\|gcc]] compilers to generate software it serves as the standard "[[wide character]]" encoding. Some programming languages, such as [[Seed7]], use UTF-32 as the internal representation for strings and characters. Recent versions of the [[Python (programming language)\|Python]] programming language (beginning with 2.2) may also be configured to use UTF-32 as the representation for Unicode strings, which spreads the encoding to software written in [[high-level programming language\|high-level]] languages.
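The byte order mark and the pattern restrictions of UTF-8 discussed above can be sketched in Python as follows. The function names <code>sniff_bom</code> and <code>looks_like_utf8</code> are hypothetical helpers introduced for illustration; the BOM constants come from the standard <code>codecs</code> module. This is a rough heuristic under those assumptions, not a complete encoding detector.

```python
import codecs


def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    # UTF-32 must be tested before UTF-16, because the UTF-32-LE BOM
    # (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
    candidates = [
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: the bytes decode under the strict UTF-8 rules."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+FEFF encoded as UTF-8 is the three-byte signature EF BB BF.
signature = "\ufeff".encode("utf-8")
```

A byte string that passes <code>looks_like_utf8</code> is not guaranteed to be UTF-8 (pure ASCII passes as well), but text in a legacy 8-bit code page containing accented letters will usually fail the strict decode.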
[[Punycode]], another encoding form, enables the encoding of Unicode strings into the limited character set supported by the [[ASCII]]-based [[Domain Name System]] (DNS). The encoding is used as part of [[IDNA]], which is a system enabling the use of [[Internationalized Domain Names]] in all scripts that are supported by Unicode. Earlier and now historical proposals include [[UTF-5]] and [[UTF-6]]. [[GB 18030\|GB18030]] is another encoding form for Unicode, from the [[Standardization Administration of China]]. It is the official [[character set]] of the [[People's Republic of China]] (PRC). [[Binary Ordered Compression for Unicode\|BOCU-1]] and [[Standard Compression Scheme for Unicode\|SCSU]] are Unicode compression schemes. The [[April Fools' Day RFC]] of 2005 specified two [[parody]] UTF encodings, [[UTF-9]] and [[UTF-18]]. ==Adoption== ===Operating systems=== Unicode has become the dominant scheme for internal processing and storage of text. Although a great deal of text is still stored in legacy encodings, Unicode is used almost exclusively for building new information processing systems. Early adopters tended to use [[UCS-2]] (the fixed-width two-byte precursor to UTF-16) and later moved to [[UTF-16]] (the variable-width current standard), as this was the least disruptive way to add support for non-BMP characters. The best known such system is [[Windows NT]] (and its descendants, [[Windows 2000]], [[Windows XP]], [[Windows Vista]], [[Windows 7]], [[Windows 8]] and [[Windows 10]]), which uses UTF-16 as the sole internal character encoding. The [[Java virtual machine\|Java]] and [[.NET Framework\|.NET]] bytecode environments, [[macOS]], and [[KDE]] also use it for internal representation. Partial support for Unicode can be installed on [[Windows 9x]] through the [[Microsoft Layer for Unicode]]. 
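The move from UCS-2 to UTF-16 relies on surrogate pairs to represent non-BMP characters. A minimal Python sketch of the arithmetic follows; the function name <code>to_surrogates</code> is an illustrative assumption, and the result is compared with Python's built-in <code>utf-16-be</code> codec.

```python
def to_surrogates(code_point: int):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low (trail) surrogate
    return high, low


high, low = to_surrogates(0x1F600)
encoded = "\U0001F600".encode("utf-16-be")

print(hex(high), hex(low))  # prints: 0xd83d 0xde00
print(encoded.hex())        # prints: d83dde00
```

A BMP character such as U+0041 still occupies a single 16-bit code unit, which is why text limited to the BMP is byte-for-byte identical in UCS-2 and UTF-16.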
[[UTF-8]] (originally developed for [[Plan 9 from Bell Labs\|Plan 9]])<ref>{{cite web \| url = https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt \| title = UTF-8 history \| first = Rob \| last = Pike \| authorlink = Rob Pike \| date = 2003-04-30 }}</ref> has become the main storage encoding on most [[Unix-like]] operating systems (though others are also used by some libraries) because it is a relatively easy replacement for traditional [[extended ASCII]] character sets. UTF-8 is also the most common Unicode encoding used in [[HTML]] documents on the [[World Wide Web]]. Multilingual text-rendering engines which use Unicode include [[Uniscribe]] and [[DirectWrite]] for Microsoft Windows, [[ATSUI]] and [[Core Text]] for macOS, and [[Pango]] for [[GTK+]] and the [[GNOME]] desktop. ===Input methods=== {{Main\|Unicode input}} Because keyboard layouts cannot have simple key combinations for all characters, several operating systems provide alternative input methods that allow access to the entire repertoire. [[ISO/IEC 14755]],<ref>{{cite web\|url=https://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14755.pdf \|title=ISO/IEC JTC1/SC 18/WG 9 N \|date= \|accessdate=2012-06-04}}</ref> which standardises methods for entering Unicode characters from their code points, specifies several methods. There is the ''Basic method'', where a ''beginning sequence'' is followed by the hexadecimal representation of the code point and the ''ending sequence''. There is also a ''screen-selection entry method'' specified, where the characters are listed in a table in a screen, such as with a character map program. Online tools for finding the code point for a known character include Unicode Lookup<ref>{{cite web\|url=https://unicodelookup.com/\|title=Unicode Lookup\|last=Hedley\|first=Jonathan\|date=2009}}</ref> by Jonathan Hedley and Shapecatcher<ref>{{cite web\|url=http://shapecatcher.com/\|title=Unicode Character Recognition\|last=Milde\|first=Benjamin\|date=2011}}</ref> by Benjamin Milde. 
In Unicode Lookup, one enters a search key (e.g. "fractions"), and a list of corresponding characters with their code points is returned. In Shapecatcher, based on [[Shape context]], one draws the character in a box and a list of characters approximating the drawing, with their code points, is returned. ===Email=== {{Main\|Unicode and email}} [[MIME]] defines two different mechanisms for encoding non-ASCII characters in [[email]], depending on whether the characters are in email headers (such as the "Subject:"), or in the text body of the message; in both cases, the original character set is identified as well as a transfer encoding. For email transmission of Unicode, the [[UTF-8]] character set and the [[Base64]] or the [[Quoted-printable]] transfer encoding are recommended, depending on whether much of the message consists of [[ASCII]] characters. The details of the two different mechanisms are specified in the MIME standards and generally are hidden from users of email software. The adoption of Unicode in email has been very slow. Some East Asian text is still encoded in encodings such as [[ISO-2022]], and some devices, such as mobile phones, still cannot correctly handle Unicode data. Support has been improving, however. Many major free mail providers such as [[Yahoo]], [[Google]] ([[Gmail]]), and [[Microsoft]] ([[Outlook.com]]) support it. ===Web=== {{Main\|Unicode and HTML}} All [[W3C]] recommendations have used Unicode as their ''document character set'' since HTML 4.0. [[Web browser]]s have supported Unicode, especially UTF-8, for many years. There used to be display problems resulting primarily from [[typeface\|font]] related issues; e.g. 
v 6 and older of Microsoft [[Internet Explorer]] did not render many code points unless explicitly told to use a font that contains them.<ref>{{cite web\|first=Alan \|last=Wood \|url=http://www.alanwood.net/unicode/explorer.html#ie5 \|title=Setting up Windows Internet Explorer 5, 5.5 and 6 for Multilingual and Unicode Support \|publisher=Alan Wood \|date= \|accessdate=2012-06-04}}</ref> Although syntax rules may affect the order in which characters are allowed to appear, [[XML]] (including [[XHTML]]) documents, by definition,<ref>{{cite web\|title=Extensible Markup Language (XML) 1.1 (Second Edition)\|url=https://www.w3.org/TR/xml11\|accessdate=2013-11-01}}</ref> comprise characters from most of the Unicode code points, with the exception of: * most of the [[C0 and C1 control codes\|C0 control codes]] * the permanently unassigned code points D800–DFFF * FFFE or FFFF HTML characters manifest either directly as [[byte]]s according to document's encoding, if the encoding supports them, or users may write them as numeric character references based on the character's Unicode code point. For example, the references <code>&#916;</code>, <code>&#1049;</code>, <code>&#1511;</code>, <code>&#1605;</code>, <code>&#3671;</code>, <code>&#12354;</code>, <code>&#21494;</code>, <code>&#33865;</code>, and <code>&#47568;</code> (or the same numeric values expressed in hexadecimal, with <code>&#x</code> as the prefix) should display on all browsers as Δ, Й, ק ,م, ๗, あ, 叶, 葉, and 말. When specifying [[Uniform Resource Identifier\|URIs]], for example as [[URL]]s in [[HTTP]] requests, non-ASCII characters must be [[percent encoding\|percent-encoded]]. ===Fonts=== {{Main\|Unicode font}} Unicode is not in principle concerned with fonts ''per se'', seeing them as implementation choices.<ref>{{cite journal \|url = http://cajun.cs.nott.ac.uk/wiley/journals/epobetan/pdf/volume6/issue3/bigelow.pdf \| title = The design of a Unicode font \| journal = Electronic Publishing \| volume = VOL. 
6(3), 289–305 \| date = September 1993 \| page = 292 \|last1 = Bigelow \| first1=Charles \| last2 = Holmes \| first2 = Kris}}</ref> Any given character may have many [[allograph]]s, from the more common bold, italic and base letterforms to complex decorative styles. A font is "Unicode compliant" if the glyphs in the font can be accessed using code points defined in the Unicode standard.<ref>{{cite web \| url= https://www.unicode.org/faq/font_keyboard.html \| title = Fonts and keyboards \| publisher = Unicode Consortium \| date = 28 June 2017 \| accessdate= 13 October 2019}}</ref> The standard does not specify a minimum number of characters that must be included in the font; some fonts have quite a small repertoire. Free and retail [[font]]s based on Unicode are widely available, since [[TrueType]] and [[OpenType]] support Unicode. These font formats map Unicode code points to glyphs, but TrueType font is restricted to 65,535 glyphs. [[List of typefaces\|Thousands of fonts]] exist on the market, but fewer than a dozen fonts—sometimes described as "pan-Unicode" fonts—attempt to support the majority of Unicode's character repertoire. Instead, Unicode-based [[List of Unicode fonts\|fonts]] typically focus on supporting only basic ASCII and particular scripts or sets of characters or symbols. Several reasons justify this approach: applications and documents rarely need to render characters from more than one or two writing systems; fonts tend to demand resources in computing environments; and operating systems and applications show increasing intelligence in regard to obtaining glyph information from separate font files as needed, i.e., [[font substitution]]. Furthermore, designing a consistent set of rendering instructions for tens of thousands of glyphs constitutes a monumental task; such a venture passes the point of [[diminishing returns]] for most typefaces. 
===Newlines===
Unicode partially addresses the [[newline]] problem that occurs when trying to read a text file on different platforms. Unicode defines a large number of [[Newline#Unicode\|characters]] that conforming applications should recognize as line terminators.

In terms of the newline, Unicode introduced {{unichar\|2028\|LINE SEPARATOR}} and {{unichar\|2029\|PARAGRAPH SEPARATOR}}. This was an attempt to provide a Unicode solution to encoding paragraphs and lines semantically, potentially replacing all of the various platform solutions. In doing so, Unicode provides a way around the historical platform-dependent solutions. Nonetheless, few, if any, Unicode-based systems have adopted these line and paragraph separators as the sole canonical line-ending characters. A common approach to this issue is instead newline normalization, as used by the Cocoa text system in Mac OS X and by the W3C XML and HTML recommendations. In this approach, every possible newline character is converted internally to a common newline (which one does not really matter, since it is an internal operation just for rendering). In other words, the text system can correctly treat the character as a newline, regardless of the input's actual encoding.

==Issues==
===Philosophical and completeness criticisms===
[[Han unification]] (the identification of forms in the [[East Asian language]]s which one can treat as stylistic variations of the same historical character) has become one of the most controversial aspects of Unicode, despite the presence of a majority of experts from all three regions in the [[Ideographic Research Group]] (IRG), which advises the Consortium and ISO on additions to the repertoire and on Han unification.<ref>[http://tronweb.super-nova.co.jp/characcodehist.html A Brief History of Character Codes], Steven J.
Searle, originally written [https://web.archive.org/web/20001216022100/http://tronweb.super-nova.co.jp/characcodehist.html 1999], last updated 2004</ref> Unicode has been criticized for failing to separately encode older and alternative forms of [[kanji]] which, critics argue, complicates the processing of ancient Japanese and uncommon Japanese names. This is often due to the fact that Unicode encodes characters rather than glyphs (the visual representations of the basic character that often vary from one language to another). Unification of glyphs leads to the perception that the languages themselves, not just the basic character representation, are being merged.<ref name="dw2001">[https://web.archive.org/web/20130625062705/http://www.ibm.com/developerworks/library/u-secret.html The secret life of Unicode: A peek at Unicode's soft underbelly], Suzanne Topping, 1 May 2001 ''(Internet Archive)''</ref>{{clarify\|date=April 2010\|reason="and, contains" and meaning of statement}} There have been several attempts to create alternative encodings that preserve the stylistic differences between Chinese, Japanese, and Korean characters in opposition to Unicode's policy of Han unification. An example of one is [[TRON (encoding)\|TRON]] (although it is not widely adopted in Japan, there are some users who need to handle historical Japanese text and favor it). Although the repertoire of fewer than 21,000 Han characters in the earliest version of Unicode was largely limited to characters in common modern usage, Unicode now includes more than 92,000 Han characters, and work is continuing to add thousands more historic and dialectal characters used in China, Japan, Korea, Taiwan, and Vietnam. Modern font technology provides a means to address the practical issue of needing to depict a unified Han character in terms of a collection of alternative glyph representations, in the form of [[variation Selectors\|Unicode variation sequences]]. 
For example, the Advanced Typographic tables of [[OpenType]] permit one of a number of alternative glyph representations to be selected when performing the character to glyph mapping process. In this case, information can be provided within plain text to designate which alternate character form to select.

[[File:Cyrillic cursive.svg\|thumb\|right\|Various [[Cyrillic]] characters shown with and without italics]]
If the appropriate glyphs for two characters in the same script differ only in the italic, Unicode has generally unified them, as can be seen in the comparison between Russian (labeled standard) and Serbian characters at right, meaning that the differences are displayed through smart font technology or by manually changing fonts.

===Mapping to legacy character sets===
Unicode was designed to provide code-point-by-code-point [[round-trip format conversion]] to and from any preexisting character encodings, so that text files in older character sets can be converted to Unicode and back again to yield the same file, without employing context-dependent interpretation. That has meant that inconsistent legacy architectures, such as [[combining character\|combining diacritics]] and [[precomposed character]]s, both exist in Unicode, giving more than one method of representing some text. This is most pronounced in the three different encoding forms for Korean [[Hangul]]. Since version 3.0, any precomposed characters that can be represented by a combining sequence of already existing characters can no longer be added to the standard, in order to preserve interoperability between software using different versions of Unicode.

[[Injective]] mappings must be provided between characters in existing legacy character sets and characters in Unicode to facilitate conversion to Unicode and allow interoperability with legacy software.
Lack of consistency in various mappings between earlier Japanese encodings such as [[Shift-JIS]] or [[EUC-JP]] and Unicode led to [[round-trip format conversion]] mismatches, particularly the mapping of the character JIS X 0208 '～' (1-33, WAVE DASH), heavily used in legacy database data, to either {{unichar\|FF5E\|FULLWIDTH TILDE}} (in [[Microsoft Windows]]) or {{unichar\|301C\|WAVE DASH}} (other vendors).<ref> [http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2166.doc AFII contribution about WAVE DASH], {{Cite web\|url=http://www.ingrid.org/java/i18n/unicode.html\|archiveurl=https://web.archive.org/web/20110422181018/http://www.ingrid.org/java/i18n/unicode.html\|title=An Unicode vendor-specific character table for japanese\|date=2011-04-22\|archive-date=2011-04-22\|website=web.archive.org<!--\|access-date=2019-05-20-->}}</ref> Some Japanese computer programmers objected to Unicode because it requires them to separate the use of {{unichar\|005C\|REVERSE SOLIDUS\|note=backslash}} and {{unichar\|00A5\|YEN SIGN}}, which was mapped to 0x5C in JIS X 0201, and a lot of legacy code exists with this usage.<ref>[https://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html#s-646problem ''ISO 646-* Problem''], Section 4.4.3.5 of ''Introduction to I18n'', Tomohiro KUBOTA, 2001</ref> (This encoding also replaces tilde '~' 0x7E with macron '¯', now 0xAF.) The separation of these characters exists in [[ISO 8859-1]], from long before Unicode. ===Indic scripts=== [[Indic script]]s such as [[Tamil script\|Tamil]] and [[Devanagari]] are each allocated only 128 code points, matching the [[ISCII]] standard. The correct rendering of Unicode Indic text requires transforming the stored logical order characters into visual order and the forming of ligatures (aka conjuncts) out of components. 
Some local scholars argued in favor of assignments of Unicode code points to these ligatures, going against the practice for other writing systems, though Unicode contains some Arabic and other ligatures for backward compatibility purposes only.<ref>{{cite web \| title = Arabic Presentation Forms-A \| url = https://www.unicode.org/charts/PDF/UFB50.pdf \| accessdate = 2010-03-20}} </ref><ref>{{cite web \| title = Arabic Presentation Forms-B \| url = https://www.unicode.org/charts/PDF/UFE70.pdf \| accessdate = 2010-03-20}}</ref><ref>{{cite web \| title = Alphabetic Presentation Forms \| url = https://www.unicode.org/charts/PDF/UFB00.pdf \| accessdate = 2010-03-20}}</ref> Encoding of any new ligatures in Unicode will not happen, in part because the set of ligatures is font-dependent, and Unicode is an encoding independent of font variations. The same kind of issue arose for the [[Tibetan script]] in 2003 when the [[Standardization Administration of China]] proposed encoding 956 precomposed Tibetan syllables,<ref>{{Cite web \| author=China \| title=Proposal on Tibetan BrdaRten Characters Encoding for ISO/IEC 10646 in BMP \| url=https://www.unicode.org/L2/L2002/02455-n2558-tibetan.pdf \| date=2 December 2002 }}</ref> but these were rejected for encoding by the relevant ISO committee ([[ISO/IEC JTC 1/SC 2]]).<ref>{{Cite web \| author= V. S. Umamaheswaran \| title=Resolutions of WG 2 meeting 44 \| url=https://www.unicode.org/L2/L2003/03390r-n2654.pdf \| at=Resolution M44.20 \| date=7 November 2003 }}</ref> [[Thai alphabet]] support has been criticized for its ordering of Thai characters. The vowels เ, แ, โ, ใ, ไ that are written to the left of the preceding consonant are in visual order instead of phonetic order, unlike the Unicode representations of other Indic scripts. This complication is due to Unicode inheriting the [[TIS-620\|Thai Industrial Standard 620]], which worked in the same way, and was the way in which Thai had always been written on keyboards. 
This ordering problem complicates the Unicode collation process slightly, requiring table lookups to reorder Thai characters for collation.<ref name="dw2001" /> Even if Unicode had adopted encoding according to spoken order, it would still be problematic to collate words in dictionary order. E.g., the word {{wiktth\|แสดง}} {{IPA-th\|sa dɛːŋ\|}} "perform" starts with a consonant cluster "สด" (with an inherent vowel for the consonant "ส"), the vowel แ-, in spoken order would come after the ด, but in a dictionary, the word is collated as it is written, with the vowel following the ส. ===Combining characters=== {{Main\|Combining character}} {{See also\|Unicode normalization#Normalization}} Characters with diacritical marks can generally be represented either as a single precomposed character or as a decomposed sequence of a base letter plus one or more non-spacing marks. For example, ḗ (precomposed e with macron and acute above) and ḗ (e followed by the combining macron above and combining acute above) should be rendered identically, both appearing as an [[e]] with a [[Macron (diacritic)\|macron]] and [[acute accent]], but in practice, their appearance may vary depending upon what rendering engine and fonts are being used to display the characters. Similarly, [[dot (diacritic)\|underdots]], as needed in the [[romanization]] of [[Indo-Aryan languages\|Indic]], will often be placed incorrectly.{{Citation needed\|date=July 2011}}. Unicode characters that map to precomposed glyphs can be used in many cases, thus avoiding the problem, but where no precomposed character has been encoded the problem can often be solved by using a specialist Unicode font such as [[Charis SIL]] that uses [[Graphite (SIL)\|Graphite]], [[OpenType]], or [[Apple Advanced Typography\|AAT]] technologies for advanced rendering features. 
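The equivalence between a precomposed character and its decomposed sequence can be checked with Unicode normalization. The following minimal Python sketch uses the standard <code>unicodedata</code> module; the variable names are illustrative, and the characters are written as escapes so that the two forms cannot be confused when the code is copied.

```python
import unicodedata

precomposed = "\u1e17"        # LATIN SMALL LETTER E WITH MACRON AND ACUTE
decomposed = "e\u0304\u0301"  # e + COMBINING MACRON + COMBINING ACUTE ACCENT

# The two strings are different sequences of code points...
same_code_points = precomposed == decomposed

# ...but they are canonically equivalent: normalizing either one to the
# other's form (NFC = composed, NFD = decomposed) makes them compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(same_code_points, nfc_equal, nfd_equal)  # prints: False True True
```

Normalization makes the two forms compare equal in software; whether they are displayed identically still depends on the rendering engine and fonts, as described above.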
===Anomalies=== {{main\|Unicode alias names and abbreviations}} The Unicode standard has imposed rules intended to guarantee stability.<ref>[https://www.unicode.org/policies/stability_policy.html Unicode stability policy]</ref> Depending on the strictness of a rule, a change can be prohibited or allowed. For example, a "name" given to a code point cannot and will not change, whereas a "script" property is more flexible under Unicode's own rules. In version 2.0, Unicode changed many code point "names" from version 1. At the same time, Unicode stated that from then on the name assigned to a code point would never change. This implies that mistakes, once published, cannot be corrected, even if they are trivial (as happened in one instance with the spelling {{sc2\|{{typo\|BRAKCET}}}} for {{sc2\|BRACKET}} in a character name). In 2006 a list of anomalies in character names was first published, and, as of April 2017, there were 94 characters with identified issues,<ref name="tn17">{{cite web \|url=https://unicode.org/notes/tn27/ \|title=Unicode Technical Note #27: Known Anomalies in Unicode Character Names \|date=10 April 2017 \|website=unicode.org}}</ref> for example: * {{unichar\|2118\|script capital p\|nlink=Weierstrass p}}: This is a small letter. The capital is {{unichar\|1D4AB\|MATHEMATICAL SCRIPT CAPITAL P}}<ref>[https://www.unicode.org/charts/PDF/U2100.pdf Unicode chart: "actually this has the form of a lowercase calligraphic p, despite its name"]</ref> * {{unichar\|034F\|COMBINING GRAPHEME JOINER\|nlink=Combining grapheme joiner}}: Does not join graphemes.<ref name="tn17" /> * {{unichar\|A015\|YI SYLLABLE WU\|nlink=Yi language}}: This is not a Yi syllable, but a Yi iteration mark. 
* {{unichar\|FE18\|PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR {{typo\|BRAKCET}}}}: ''bracket'' is spelled incorrectly.<ref>[https://www.unicode.org/charts/PDF/UFE10.pdf "Misspelling of BRACKET in character name is a known defect"]</ref> Spelling errors are resolved by using [[Unicode alias names and abbreviations]]. ==See also== * [[Comparison of Unicode encodings]] * [[Cultural, political, and religious symbols in Unicode]] * [[International Components for Unicode]] (ICU), now as ICU-<abbr title="technical committee">TC</abbr> a part of Unicode * [[List of binary codes]] * [[List of Unicode characters]] * [[List of XML and HTML character entity references]] * [[Open-source Unicode typefaces]] * [[Standards related to Unicode]] * [[Unicode symbols]] * [[Universal Coded Character Set]] * [[Lotus Multi-Byte Character Set]] (LMBCS), a parallel development with similar intentions ==Further reading== {{refbegin}} * ''The Unicode Standard, Version 3.0'', The Unicode Consortium, Addison-Wesley Longman, Inc., April 2000. {{ISBN\|0-201-61633-5}} * ''The Unicode Standard, Version 4.0'', The Unicode Consortium, Addison-Wesley Professional, 27 August 2003. {{ISBN\|0-321-18578-1}} * ''The Unicode Standard, Version 5.0, Fifth Edition'', The [[Unicode Consortium]], Addison-Wesley Professional, 27 October 2006. {{ISBN\|0-321-48091-0}} * Julie D. Allen. ''The Unicode Standard, Version 6.0'', The [[Unicode Consortium]], Mountain View, 2011, {{ISBN\|9781936213016}}, ([https://www.unicode.org/versions/Unicode6.0.0/]). * ''The Complete Manual of Typography'', James Felici, Adobe Press; 1st edition, 2002. {{ISBN\|0-321-12730-7}} * ''Unicode: A Primer'', Tony Graham, M&T books, 2000. {{ISBN\|0-7645-4625-2}}. * ''Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard'', Richard Gillam, Addison-Wesley Professional; 1st edition, 2002. {{ISBN\|0-201-70052-2}} * ''Unicode Explained'', Jukka K. Korpela, O'Reilly; 1st edition, 2006. 
{{ISBN\|0-596-10121-X}} {{refend}} {{cite book \|author1=Yannis Haralambous \|author2=Martin Dürst \|editor1-last=Haralambous \|editor1-first=Yannis \|title=Proceedings of Graphemics in the 21st Century, Brest 2018 \|date=2019 \|publisher=Fluxus Editions \|location=Brest \|isbn=978-2-9570549-1-6 \|pages=167-183 \|url=http://www.fluxus-editions.fr/gla1-hara1.php \|ref=https://doi.org/10.36824/2018-graf-hara1 \|chapter=Unicode from a Linguistic Point of View}} ==Notes== {{notelist\|group=note}} ==References== {{reflist\|30em}} ==External links== {{Sister project links\|n=no\|v=no\|q=no\|s=no\|voy=no\|m=Unicode\|mw=no\|species=no}} {{official website\|name=Official website}} {{middot}} {{official website\|url=https://unicode.org/main.html\|name=Official technical site}} * {{DMOZ\|Computers/Software/Globalization/Character_Encoding/Unicode/}} * [http://www.alanwood.net/unicode/ Alan Wood's Unicode Resources]{{snd}} Contains lists of word processors with Unicode capability; fonts and characters are grouped by type; characters are presented in lists, not grids. * [https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=UnicodeBMPFallbackFont Unicode BMP Fallback Font] Displays the Unicode value of any character in a document, including in the Private Use Area, rather than the glyph itself. {{Unicode navigation\|state=uncollapsed}} {{Character encoding}} {{Authority control}} [[Category:Unicode\| ]] [[Category:Character encoding]] [[Category:Digital typography]]'
Унифицированная разница изменений правки (`edit_diff`)	'@@ -1,1208 +1,774 @@ -[[Файл:New Unicode logo.svg\|x200px\|thumb\|right\|Логотип Unicode Consortium]] +{{Use dmy dates\|date=May 2019\|cs1-dates=y}} +{{distinguish\|Unicode (telegraphy)}} +{{For\|what the term "Unicode" means in Microsoft documentation\|UTF-16}} +{{Short description\|Character encoding standard}} +{{Infobox character encoding +\| name = Unicode +\| mime = +\| alias = [[Universal Coded Character Set]] (UCS) +\| image = New Unicode logo.svg +\| caption = Logo of the [[Unicode Consortium]] +\| standard = Unicode Standard +\| lang = International +\| status = +\| encodings = [[UTF-8]], [[UTF-16]], [[GB 18030\|GB18030]]<br/>'''Less common''': [[UTF-32]], [[BOCU]], [[Standard Compression Scheme for Unicode\|SCSU]], [[UTF-7]] +\| encodes = +\| extends = +\| prev = [[ISO 8859]], various others +\| next = +}} +{{Contains special characters\| special = uncommon [[Unicode]] characters}} +'''Unicode''' is an [[information technology]] [[technical standard\|standard]] for the consistent [[character encoding\|encoding]], representation, and handling of [[character (computing)\|text]] expressed in most of the world's [[writing system]]s. The standard is maintained by the [[Unicode Consortium]], and {{as of\|March 2020\|lc=y}}, there is a repertoire of {{unicodenover}} (these [[character (computing)\|characters]] consist of 143,696 graphic characters and 163 format characters) covering 154 modern and historic [[script (Unicode)\|scripts]], as well as multiple symbol sets and [[emoji]]. The character repertoire of the Unicode Standard is synchronized with [[ISO/IEC 10646]], and both are code-for-code identical. 
+ +''The Unicode Standard'' consists of a set of code charts for visual reference, an encoding method and set of standard [[character encoding]]s, a set of reference [[data file]]s, and a number of related items, such as character properties, rules for [[Unicode normalization\|normalization]], decomposition, [[Unicode collation algorithm\|collation]], rendering, and [[bidirectional text]] display order (for the correct display of text containing both right-to-left scripts, such as [[Arabic script\|Arabic]] and [[Hebrew alphabet\|Hebrew]], and left-to-right scripts).<ref>{{Cite web \| title = The Unicode Standard: A Technical Introduction \| url = https://www.unicode.org/standard/principles.html \| accessdate = 2010-03-16}}</ref> + +Unicode's success at unifying character sets has led to its widespread and predominant use in the [[internationalization and localization]] of computer [[software]]. The standard has been implemented in many recent technologies, including modern [[operating system]]s, [[XML]], [[Java (programming language)\|Java]] (and other programming languages), and the [[.NET Framework]]. + +[[Comparison of Unicode encodings\|Unicode can be implemented]] by different [[character encoding]]s. The Unicode standard defines [[UTF-8]], [[UTF-16]], and [[UTF-32]], and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16, and [[Universal Coded Character Set\|UCS]]-2 (without full support for Unicode), a precursor of UTF-16; [[GB 18030\|GB18030]] is standardized in China and implements Unicode fully, while not an official Unicode standard. 
+ +UTF-8, the dominant encoding on the [[World Wide Web]] (used in over 94% of websites {{As of\|2019\|November\|df=\|lc=y}}),<ref>{{Cite web\|url=https://w3techs.com/technologies/cross/character_encoding/ranking\|title=Usage Survey of Character Encodings broken down by Ranking\|website=w3techs.com\|language=en\|access-date=2019-11-11}}</ref> uses one [[byte]]{{efn\|The Unicode Consortium uses the ambiguous term byte; the [[International Organization for Standardization]] (ISO), the [[International Electrotechnical Commission]] (IEC) and the [[Internet Engineering Task Force]] (IETF) use the more specific term [[Octet (computing)\|octet]] in current documents related to Unicode.\|group=note}} for the first 128 [[code point]]s, and up to 4 bytes for other characters.<ref>{{cite web\|url=https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#I1.36559\|work=The Unicode Standard\|title=Conformance \| date=March 2020\|accessdate=2020-03-15}}</ref> The first 128 Unicode code points represent the [[ASCII]] characters, which means that any ASCII text is also a UTF-8 text. + +UCS-2 uses two bytes (16 bits) for each character but can only encode the first 65,536 code points, the so-called [[Basic Multilingual Plane]] (BMP). With 1,112,064 possible Unicode code points corresponding to characters (see [[#Architecture and terminology\|below]]) on 17 planes, and with over 143,000 code points defined as of version 13.0, UCS-2 is able to represent fewer than half of all encoded Unicode characters. Therefore, UCS-2 is outdated, though still widely used in software. UTF-16 extends UCS-2 by using the same [[16-bit]] encoding as UCS-2 for the Basic Multilingual Plane, and a 4-byte encoding for the other planes. As long as it contains no code points in the reserved range U+D800–U+DFFF, a UCS-2 text is valid UTF-16 text. + +UTF-32 (also referred to as UCS-4) uses four bytes for each character. 
Like UCS-2, the number of bytes per character is fixed, facilitating character indexing; but unlike UCS-2, UTF-32 is able to encode all Unicode code points. However, because each character uses four bytes, UTF-32 takes significantly more space than other encodings, and is not widely used. + +==Origin and development== +Unicode has the explicit aim of transcending the limitations of traditional character encodings, such as those defined by the [[ISO/IEC 8859]] standard, which find wide usage in various countries of the world but remain largely incompatible with each other. Many traditional character encodings share a common problem in that they allow bilingual computer processing (usually using [[Latin character]]s and the local script), but not multilingual computer processing (computer processing of arbitrary scripts mixed with each other). + +Unicode, in intent, encodes the underlying characters—[[grapheme]]s and grapheme-like units—rather than the variant [[glyph]]s (renderings) for such characters. In the case of [[Chinese characters]], this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see [[Han unification]]). + +In text processing, Unicode takes the role of providing a unique ''code point''—a [[number]], not a glyph—for each character. In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, [[font]], or style) to other software, such as a [[web browser]] or [[word processor]]. This simple aim becomes complicated, however, because of concessions made by Unicode's designers in the hope of encouraging a more rapid adoption of Unicode. 
-'''Юнико́д'''<ref name=autogenerated1>{{cite web\|url=http://www.unicode.org/standard/UnicodeTranscriptions.html\|title=Unicode Transcriptions\|publisher=\|date=\|accessdate=10 мая 2010\|lang=en\|archiveurl=https://web.archive.org/web/20060408204540/http://www.unicode.org/standard/UnicodeTranscriptions.html\|archivedate=2006-04-08\|deadlink=yes}}</ref> (чаще всего) или '''Унико́д'''<ref>[http://www.paratype.ru/help/term/terms.asp?code=361 Уникод в словаре Paratype]</ref> ({{lang-en\|Unicode}}) — стандарт [[Набор символов\|кодирования символов]], включающий в себя знаки почти всех письменных [[язык]]ов мира<ref name="unicode-techintro">{{cite web\|url=http://www.unicode.org/standard/principles.html\|title=The Unicode® Standard: A Technical Introduction\|accessdate=2010-07-04\|archiveurl=https://web.archive.org/web/20100310120125/http://www.unicode.org/standard/principles.html\|archivedate=2010-03-10\|deadlink=yes}}</ref>. В настоящее время стандарт является преобладающим в [[Интернет\|Интернете]]. +The first 256 code points were made identical to the content of [[ISO/IEC 8859-1]] so as to make it trivial to convert existing western text. Many essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings and therefore, allow conversion from those encodings to Unicode (and back) without losing any information. For example, the "[[Halfwidth and Fullwidth Forms (Unicode block)\|fullwidth forms]]" section of code points encompasses a full duplicate of the Latin alphabet because Chinese, Japanese, and Korean ([[CJK]]) fonts contain two versions of these letters, "fullwidth" matching the width of the CJK characters, and normal width. For other examples, see [[duplicate characters in Unicode]]. 
-Стандарт предложен в [[1991 год]]у некоммерческой организацией «Консорциум Юникода» ({{lang-en\|Unicode Consortium, Unicode Inc.}})<ref>{{cite web\|url=http://www.unicode.org/history/publicationdates.html\|title=History of Unicode Release and Publication Dates\|accessdate=2010-07-04\|archiveurl=https://web.archive.org/web/20100110085403/http://www.unicode.org/history/publicationdates.html\|archivedate=2010-01-10\|deadlink=yes}}</ref><ref>{{cite web\|url=http://www.unicode.org/consortium/consort.html\|title=The Unicode Consortium\|accessdate=2010-07-04\|archiveurl=https://web.archive.org/web/20100627085503/http://www.unicode.org/consortium/consort.html\|archivedate=2010-06-27\|deadlink=yes}}</ref>. Применение этого стандарта позволяет закодировать очень большое число символов из разных систем письменности: в документах, закодированных по стандарту Юникод, могут соседствовать китайские [[иероглиф]]ы, математические символы, буквы [[греческий алфавит\|греческого алфавита]], [[латинский алфавит\|латиницы]] и [[кириллица\|кириллицы]], символы музыкальной нотной нотации, при этом становится ненужным переключение [[кодовая страница\|кодовых страниц]]<ref name="unicode-foreword">{{cite web\|url=http://www.unicode.org/versions/Unicode5.2.0/Foreword.pdf\|title=Foreword\|accessdate=2010-07-04\|archiveurl=https://web.archive.org/web/20100627141434/http://www.unicode.org/versions/Unicode5.2.0/Foreword.pdf\|archivedate=2010-06-27\|deadlink=yes}}</ref>. 
+==={{anchor\|Unicode 88}}History=== +Based on experiences with the [[Xerox Character Code Standard]] (XCCS) since 1980,<ref name="unicode-88"/> the origins of Unicode date to 1987, when [[Joe Becker (Unicode)\|Joe Becker]] from [[Xerox]], together with [[Lee Collins (software engineer)\|Lee Collins]] and [[Mark Davis (Unicode)\|Mark Davis]] from [[Apple Inc.\|Apple]], started investigating the practicalities of creating a universal character set.<ref>{{cite web \|title=Summary Narrative \|url=https://www.unicode.org/history/summary.html \|access-date=2010-03-15}}</ref> With additional input from Peter Fenwick and Dave Opstad,<ref name="unicode-88"/> Joe Becker published, in August 1988, a draft proposal for an "international/multilingual text character encoding system", tentatively called Unicode. He explained that "[t]he name 'Unicode' is intended to suggest a unique, unified, universal encoding".<ref name="unicode-88">{{Cite web \|url=https://unicode.org/history/unicode88.pdf \|title=Unicode 88 \|author-last=Becker \|author-first=Joseph D. \|author-link=Joseph D. Becker \|date=1998-09-10 \|orig-year=1988-08-29 \|edition=10th anniversary reprint \|website=unicode.org \|publisher=[[Unicode Consortium]] \|access-date=2016-10-25 \|url-status=live \|archive-url=https://web.archive.org/web/20161125224409/https://unicode.org/history/unicode88.pdf \|archive-date=2016-11-25 \|quote=In 1978, the initial proposal for a set of "Universal Signs" was made by [[Bob Belleville]] at [[Xerox PARC]]. Many persons contributed ideas to the development of a new encoding design. Beginning in 1980, these efforts evolved into the [[Xerox Character Code Standard]] (XCCS) by the present author, a multilingual encoding which has been maintained by Xerox as an internal corporate standard since 1982, through the efforts of Ed Smura, Ron Pellar, and others.<br/>Unicode arose as the result of eight years of working experience with XCCS. 
Its fundamental differences from XCCS were proposed by Peter Fenwick and Dave Opstad (pure 16-bit codes), and by [[Lee Collins (Unicode)\|Lee Collins]] (ideographic character unification). Unicode retains the many features of XCCS whose utility have been proved over the years in an international line of communication multilingual system products.}}</ref> -Стандарт состоит из двух основных частей: универсального набора символов ({{lang-en\|Universal character set, UCS}}) и семейства кодировок ({{lang-en\|Unicode transformation format, UTF}}). Универсальный набор символов перечисляет допустимые по стандарту Юникод символы и присваивает каждому символу код в виде неотрицательного целого числа, записываемого обычно в шестнадцатеричной форме с префиксом <code>U+</code>, например, <code>U+040F</code>. Семейство кодировок определяет способы преобразования кодов символов для передачи в потоке или в файле. +In this document, entitled ''Unicode 88'', Becker outlined a [[16-bit]] character model:<ref name="unicode-88"/> -Коды в стандарте Юникод разделены на несколько областей. Область с кодами от U+0000 до U+007F содержит символы набора [[ASCII]], и коды этих символов совпадают с их кодами в ASCII. Далее расположены области символов других систем письменности, знаки пунктуации и технические символы. Часть кодов зарезервирована для использования в будущем<ref name='unicode-02'>{{cite web\|url=http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf\|title=General Structure\|accessdate=2010-07-05\|archiveurl=https://web.archive.org/web/20100627093139/http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf\|archivedate=2010-06-27\|deadlink=yes}}</ref>. Под символы кириллицы выделены области знаков с кодами от U+0400 до U+052F, от U+2DE0 до U+2DFF, от U+A640 до U+A69F (см. 
[[Кириллица в Юникоде]])<ref>{{cite web\|url=http://www.unicode.org/versions/Unicode5.2.0/ch07.pdf\|title=European Alphabetic Scripts\|accessdate=2010-07-04\|archiveurl=https://web.archive.org/web/20100627140856/http://www.unicode.org/versions/Unicode5.2.0/ch07.pdf\|archivedate=2010-06-27\|deadlink=yes}}</ref>. +<blockquote> +Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the characters of all the world's living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose. +</blockquote> -== Предпосылки создания и развитие Юникода == -{{цитата\|Unicode — это уникальный код для любого символа, независимо от платформы, независимо от программы, независимо от языка.\|автор=Консорциум Юникода<ref>[http://www.unicode.org/standard/translations/russian.html Что такое Unicode?]</ref>}} -К концу 1980-х годов стандартом стали 8-битные кодировки, их существовало уже большое множество, и постоянно появлялись новые. Это объяснялось как расширением круга поддерживаемых языков, так и стремлением создавать кодировки, частично совместимые между собой (характерный пример — появление [[альтернативная кодировка\|альтернативной кодировки для русского языка]], обусловленное эксплуатацией западных программ, созданных для кодировки [[CP437]]). В результате появилось несколько проблем: -# проблема неправильной раскодировки; -# проблема ограниченности набора символов; -# проблема преобразования одной кодировки в другую; -# проблема дублирования шрифтов. 
+His original 16-bit design was based on the assumption that only those scripts and characters in modern use would need to be encoded:<ref name="unicode-88"/> -'''Проблема неправильной раскодировки''' вызывала появление в документе символов иностранных языков, не предполагавшихся в документе, или появление не предполагавшихся [[псевдографика\|псевдографических]] символов, прозванных русскоязычными пользователями «кракозябрами». Проблема во многом была вызвана отсутствием стандартизированной формы указания кодировки для файла или потока. Проблему можно было решить либо последовательным внедрением стандарта указания кодировки, либо внедрением общей для всех языков кодировки.<ref name='unicode-foreword' /> +<blockquote> +Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2<sup>14</sup> = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally useful Unicodes. +</blockquote> -'''Проблема ограниченности набора символов'''<ref name='unicode-foreword' />. Проблему можно было решить либо переключением шрифтов внутри документа, либо внедрением «широкой» кодировки. Переключение шрифтов издавна практиковалось в [[текстовый процессор\|текстовых процессорах]], причём часто использовались [[нестандартные шрифты\|шрифты с нестандартной кодировкой]], т. н. «dingbat fonts». В итоге при попытке переноса документа в другую систему все нестандартные символы превращались в «кракозябры». 
+In early 1989, the Unicode working group expanded to include Ken Whistler and Mike Kernaghan of Metaphor, Karen Smith-Yoshimura and Joan Aliprand of [[Research Libraries Group\|RLG]], and Glenn Wright of [[Sun Microsystems]], and in 1990, Michel Suignard and Asmus Freytag from [[Microsoft]] and Rick McGowan of [[NeXT]] joined the group. By the end of 1990, most of the work on mapping existing character encoding standards had been completed, and a final review draft of Unicode was ready. -'''Проблема преобразования одной кодировки в другую'''. Проблему можно было решить либо составлением таблиц перекодировки для каждой пары кодировок, либо использованием промежуточного преобразования в третью кодировку, включающую все символы всех кодировок<ref>{{cite web\|url=http://www.unicode.org/history/unicode88.pdf\|title=Unicode 88\|accessdate=2010-07-08\|archiveurl=https://web.archive.org/web/20170906035012/http://unicode.org/history/unicode88.pdf\|archivedate=2017-09-06\|deadlink=yes}}</ref>. +The [[Unicode Consortium]] was incorporated in California on 3 January 1991,<ref>[https://unicode.org/history/publicationdates.html History of Unicode Release and Publication Dates] on ''unicode.org.'' Retrieved February 28, 2017.</ref> and in October 1991, the first volume of the Unicode standard was published. The second volume, covering Han ideographs, was published in June 1992. -'''Проблема дублирования шрифтов'''. Для каждой кодировки создавался свой шрифт, даже если наборы символов в кодировках совпадали частично или полностью. Проблему можно было решить путём создания «больших» шрифтов, из которых впоследствии выбирались бы нужные для данной кодировки символы. Однако это требовало создания единого реестра символов, чтобы определять, чему что соответствует. +In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. 
This increased the Unicode codespace to over a million code points, which allowed for the encoding of many historic scripts (e.g., [[Egyptian hieroglyphs]]) and thousands of rarely used or obsolete characters that had not been anticipated as needing encoding. Among the characters not originally intended for Unicode are rarely used Kanji or Chinese characters, many of which are part of personal and place names, making them rarely used, but much more essential than envisioned in the original architecture of Unicode.<ref name=unicoderevisited>{{cite web\|last=Searle\|first=Stephen J\|title=Unicode Revisited\|url=http://tronweb.super-nova.co.jp/unicoderevisited.html\|accessdate=2013-01-18}}</ref> -Была признана необходимость создания единой «широкой» кодировки. Кодировки с переменной длиной символа, широко использующиеся в Восточной Азии, были признаны слишком сложными в использовании, поэтому было решено использовать символы фиксированной ширины. Использование 32-битных символов казалось слишком расточительным, поэтому было решено использовать 16-битные. +The Microsoft TrueType specification version 1.0 from 1992 used the name ''Apple Unicode'' instead of ''Unicode'' for the Platform ID in the naming table. -Первая версия Юникода представляла собой кодировку с фиксированным размером символа в 16 бит, то есть общее число кодов было 2<sup>16</sup> ({{formatnum:65536}}). С тех пор символы стали обозначать четырьмя шестнадцатеричными цифрами (например, <code>U+04F0</code>). При этом в Юникоде планировалось кодировать не все существующие символы, а только те, которые необходимы в повседневном обиходе. Редко используемые символы должны были размещаться в «области пользовательских символов» ({{lang\|en\|private use area}}), которая первоначально занимала коды <code>U+D800…U+F8FF</code>. 
Чтобы использовать Юникод также и в качестве промежуточного звена при преобразовании разных кодировок друг в друга, в него включили все символы, представленные во всех наиболее известных кодировках. +===Unicode Consortium=== +{{Main\|Unicode Consortium}} -В дальнейшем, однако, было принято решение кодировать все символы и в связи с этим значительно расширить кодовую область. Одновременно с этим, коды символов стали рассматриваться не как 16-битные значения, а как абстрактные числа, которые в компьютере могут представляться множеством разных способов (см. [[#Способы представления\|способы представления]]). +The Unicode Consortium is a nonprofit organization that coordinates Unicode's development. Full members include most of the main computer software and hardware companies with any interest in text-processing standards, including [[Adobe Inc.\|Adobe]], [[Apple Inc.\|Apple]], [[Google]], [[International Business Machines\|IBM]], [[Microsoft]], and [[Oracle Corporation]].<ref name="members">{{cite web +\| title = The Unicode Consortium Members +\| url = https://unicode.org/consortium/members.html +\| accessdate = 2019-01-04}}</ref> -Поскольку в ряде компьютерных систем (например, [[Windows NT]]<ref name="windows-nt">{{cite web\|url=http://support.microsoft.com/kb/99884\|title=Unicode and Microsoft Windows NT\|work=Microsoft Support\|lang=en\|archiveurl=https://web.archive.org/web/20090926092654/http://support.microsoft.com/kb/99884\|archivedate=2009-09-26\|accessdate=2009-11-12\|deadlink=yes}}</ref>) фиксированные 16-битные символы уже использовались в качестве кодировки по умолчанию, было решено все наиболее важные знаки кодировать только в пределах первых {{formatnum:65536}} позиций (так называемая {{lang-en\|basic multilingual plane, BMP}}). 
Остальное пространство используется для «дополнительных символов» ({{lang-en\|supplementary characters}}): систем письма вымерших языков или очень редко используемых [[китай]]ских иероглифов, математических и музыкальных символов. +Over the years several countries or government agencies have been members of the Unicode Consortium. Presently only the [[Ministry of Endowments and Religious Affairs (Oman)]] is a full member with voting rights.<ref name="members" /> -Для совместимости со старыми 16-битными системами была изобретена система [[UTF-16]], где первые {{formatnum:65536}} позиций, за исключением позиций из интервала U+D800…U+DFFF, отображаются непосредственно как 16-битные числа, а остальные представляются в виде «суррогатных пар» (первый элемент пары из области U+D800…U+DBFF, второй элемент пары из области U+DC00…U+DFFF). Для суррогатных пар была использована часть кодового пространства (2048 позиций), отведённого «для частного использования». +The Consortium has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard [[#Unicode Transformation Format and Universal Character Set\|Unicode Transformation Format]] (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with [[multilingualism\|multilingual]] environments. -Поскольку в UTF-16 можно отобразить только 2<sup>20</sup>+2<sup>16</sup>−2048 ({{formatnum:1112064}}) символов, то это число и было выбрано в качестве окончательной величины кодового пространства Юникода (диапазон кодов: 0x000000-0x10FFFF). 
+===Scripts covered=== +{{Main\|Script (Unicode)}} +[[File:Unicode sample.png\|thumb\|right\|200px\|Many modern applications can render a substantial subset of the many [[scripts in Unicode]], as demonstrated by this screenshot from the [[OpenOffice.org]] application.]]<!-- screenshot fair use rationale: this screenshot is used specifically to illustrate the Unicode-related capabilities of modern desktop applications and the breadth of supported Unicode scripts --> -Хотя кодовая область Юникода была расширена за пределы 2<sup>16</sup> уже в версии 2.0, первые символы в «верхней» области были размещены только в версии 3.1. +Unicode covers almost all scripts ([[writing system]]s) in current use today.<ref>{{cite web +\| title = Character Code Charts +\| url = https://www.unicode.org/charts/ +\| accessdate = 2010-03-17}} +</ref>{{failed verification\|date=October 2013}}<ref>{{Cite web\|url=https://home.unicode.org/basic-info/faq/\|title=Unicode FAQ\|last=\|first=\|date=\|website=\|url-status=live\|archive-url=\|archive-date=\|access-date=2020-04-02}}</ref> -Роль этой кодировки в веб-секторе постоянно растёт. На начало 2010 доля веб-сайтов, использующих Юникод, составила около 50 %<ref>{{cite web\|url=http://w3pro.ru/news/unicode-ispolzuetsya-pochti-na-50-veb-saitov\|title=Unicode используется почти на 50% веб-сайтов\|lang=ru\|archiveurl=https://web.archive.org/web/20100611042601/http://w3pro.ru/news/unicode-ispolzuetsya-pochti-na-50-veb-saitov\|archivedate=2010-06-11\|accessdate=2010-02-09\|deadlink=yes}}</ref>. +A total of 154 [[Script (Unicode)\|scripts]] are included in the latest version of Unicode (covering [[alphabet]]s, [[abugida]]s and [[Syllabary\|syllabaries]]), although there are still scripts that are not yet encoded, particularly those mainly used in historical, liturgical, and academic contexts. 
Further additions of characters to the already encoded scripts, as well as symbols, in particular for mathematics and [[musical notation\|music]] (in the form of notes and rhythmic symbols), also occur. -== Versions of Unicode == -Work on refining the standard continues. New versions are released as the character tables are changed and extended. New editions of [[Международная организация по стандартизации\|ISO]]/IEC 10646 are released in parallel. +The Unicode Roadmap Committee ([[Michael Everson]], Rick McGowan, Ken Whistler, V.S. Umamaheswaran<ref>{{Cite web \| title=Roadmap to the BMP \| url=https://www.unicode.org/roadmaps/bmp/ \| publisher=[[Unicode Consortium]] \| accessdate=30 July 2018 }}</ref>) maintains the list of scripts that are candidates or potential candidates for encoding and their tentative code block assignments on the [https://www.unicode.org/roadmaps/ Unicode Roadmap] page of the [[Unicode Consortium]] Web site. For some scripts on the Roadmap, such as [[Jurchen script\|Jurchen]] and [[Khitan small script]], encoding proposals have been made and they are working their way through the approval process. For other scripts, such as [[Maya script\|Mayan]] (besides numbers) and [[Rongorongo]], no proposal has yet been made, and they await agreement on character repertoire and other details from the user communities involved. -The first version of the standard was released in 1991, and the most recent to date in 2020. Versions 1.0 through 5.0 of the standard were published as books and have [[ISBN]]s<ref>[http://www.unicode.org/history/publicationdates.html History of Unicode Release and Publication Dates]</ref><ref>[http://www.unicode.org/versions/enumeratedversions.html Enumerated Versions]</ref>. 
+Some modern invented scripts which have not yet been included in Unicode (e.g., [[Tengwar]]) or which do not qualify for inclusion in Unicode due to lack of real-world use (e.g., [[Klingon scripts\|Klingon]]) are listed in the [[ConScript Unicode Registry]], along with unofficial but widely used [[Private Use Areas]] code assignments. -The version number of the standard consists of three numbers (for example, 3.1.1). The third number changes when minor changes that add no new characters are made to the standard (the exception is version 1.0.1, which added the {{iw\|Унифицированные идеограммы ККЯ\|unified ideographs of the Chinese, Japanese, and Korean scripts\|en\|CJK Unified Ideographs}})<ref>[http://www.unicode.org/versions/index.html About Versions]</ref>. +There is also a [[Medieval Unicode Font Initiative]] focused on special Latin medieval characters. Some of these proposals have already been included in Unicode. -The Unicode character database ([http://www.unicode.org/ucd/ Unicode Character Database]) is available for all versions on the official site in both plain-text and XML formats. The files are distributed under a BSD-like [http://www.unicode.org/copyright.html license]. +The [https://linguistics.berkeley.edu/sei/ Script Encoding Initiative], a project run by Deborah Anderson at the [[University of California, Berkeley]], was founded in 2002 with the goal of funding proposals for scripts not yet encoded in the standard. 
The project has become a major source of proposed additions to the standard in recent years.<ref>{{cite web\|url=https://www.unicode.org/pending/about-sei.html \|title=About The Script Encoding Initiative \|publisher=The Unicode Consortium \|date= \|accessdate=2012-06-04}}</ref> -{{Временная линия Юникода}} +==={{anchor\|1.0.0\|1.0.1\|1.1\|2.0\|2.1\|3.0\|3.1\|3.2\|4.0\|4.1\|5.0\|5.1\|5.2\|6.0\|6.1\|6.2\|6.3\|7.0\|8.0\|9.0\|10.0\|11.0\|12.0\|12.1\|13.0\|14.0}}Versions=== +Unicode is developed in conjunction with the [[International Organization for Standardization]] and shares the character repertoire with [[ISO/IEC 10646]]: the Universal Character Set. Unicode and ISO/IEC 10646 function equivalently as character encodings, but ''The Unicode Standard'' contains much more information for implementers, covering—in depth—topics such as bitwise encoding, [[Unicode collation algorithm\|collation]] and rendering. The Unicode Standard enumerates a multitude of character properties, including those needed for supporting [[Bi-directional text\|bidirectional text]]. The two standards do use slightly different terminology. + +The Unicode Consortium first published ''The Unicode Standard'' in 1991 (version 1.0), and has published new versions on a regular basis since then. The latest version of the Unicode Standard, version 13.0, was released in March 2020, and is available in electronic format from the consortium's website. 
The last version of the standard that was published completely in book form (including the code charts) was version 5.0 in 2006, but since version 5.2 (2009) the core specification of the standard has been published as a print-on-demand paperback.<ref name=version6.1PoD>{{cite web\|title=Unicode 6.1 Paperback Available\|url=https://www.unicode.org/mail-arch/unicode-ml/y2012-m05/0240.html\|work=announcements_at_unicode.org\|accessdate=2012-05-30}}</ref> The entire text of each version of the standard, including the core specification, standard annexes and code charts, is freely available in [[PDF]] format on the Unicode website. + +In April 2020, Unicode announced that the release of the forthcoming version 14.0 had been postponed by six months from its originally planned release date of March 2021 due to the [[COVID-19 pandemic]].<ref>{{cite web\|title=Unicode 14.0 Delayed for 6 Months\|url=https://home.unicode.org/unicode-14-0-delayed-for-6-months/\|accessdate=2020-05-05}}</ref> + +Thus far, the following major and minor versions of the Unicode standard have been published. Update versions, which do not include any changes to character repertoire, are signified by the third number (e.g., "version 4.0.1") and are omitted in the table below.<ref>{{cite web \| title = Enumerated Versions of The Unicode Standard \| url = https://www.unicode.org/versions/enumeratedversions.html \| accessdate = 2016-06-21}}</ref> {\| class="wikitable" \|- -\|+ Unicode versions +\|+ Unicode versions \|- -! Version number -! Publication date -! [[Международный стандартный книжный номер\|ISBN]] of the book -! ISO/IEC 10646 edition -! Number of [[Письменность\|scripts]] -! 
Количество символов<ref group="A">'''Включая''' символы графические ({{lang-en\|graphic}}), управляющие ({{lang-en\|control}}) и символы форматирования ({{lang-en\|format}}); '''не включая''' [[Области для частного использования\|символы для частного использования]] ({{Lang-en\|private-use}}), несимвольные знаки ({{Lang-en\|noncharacters}}) и суррогаты ({{lang-en\|surrogate code points}}).</ref> -! Изменения +!rowspan=2\| Version +!rowspan=2\| Date +!rowspan=2\| Book +!rowspan=2\| Corresponding [[Universal Character Set\|ISO/IEC 10646]] edition +!rowspan=2\| [[Script (Unicode)\|Scripts]] +!colspan=2\| Characters \|- -\| 1.0.0<ref>{{cite web\|title=Unicode® 1.0\|url=http://www.unicode.org/versions/Unicode1.0.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| Октябрь 1991 -\| ISBN 0-201-56788-1 (Vol.1) +! Total{{refn\|The number of characters listed for each version of Unicode is the total number of graphic and format characters (i.e., excluding [[Private Use Area\|private-use characters]], [[Unicode control characters\|control characters]], [[noncharacter\|noncharacters]] and [[surrogate code points]]).\|group=tablenote}} +! Notable additions +\|- +\| 1.0.0 +\| October 1991 +\| {{ISBN\|0-201-56788-1}} (Vol. 
1) \| -\| {{formatnum:24}} -\| {{formatnum:7161}} -\| Изначально Юникод содержал символы следующих письменностей: [[арабское письмо]], [[армянское письмо]], [[бенгальское письмо]], [[Чжуинь\|чжуиньское письмо]], [[кириллица]], [[деванагари]], [[грузинское письмо]], [[Греческий алфавит\|греческое и коптское письмо]], [[Гуджарати (письмо)\|гуджарати]], [[гурмукхи]], [[хангыль]], [[Еврейский алфавит\|еврейское письмо]], [[хирагана]], [[Каннада (письмо)\|каннада]], [[катакана]], [[лаосское письмо]], [[Латинский алфавит\|латиница]], [[Малаялам (письмо)\|малаялам]], [[Ория (письмо)\|ория]], [[тамильское письмо]], [[Телугу (письмо)\|телугу]], [[тайское письмо]] и [[тибетское письмо]]<ref>{{cite web -\| title = Unicode Data 1.0.0 -\| url = http://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| 24 +\| 7,129 +\| Initial repertoire covers these scripts: [[Arabic script\|Arabic]], [[Armenian alphabet\|Armenian]], [[Bengali alphabet\|Bengali]], [[Zhuyin\|Bopomofo]], [[Cyrillic script\|Cyrillic]], [[Devanagari]], [[Georgian alphabet\|Georgian]], [[Greek alphabet\|Greek and Coptic]], [[Gujarati alphabet\|Gujarati]], [[Gurmukhi script\|Gurmukhi]], [[Hangul]], [[Hebrew alphabet\|Hebrew]], [[Hiragana]], [[Kannada alphabet\|Kannada]], [[Katakana]], [[Lao script\|Lao]], [[Latin script\|Latin]], [[Malayalam script\|Malayalam]], [[Oriya script\|Oriya]], [[Tamil script\|Tamil]], [[Telugu script\|Telugu]], [[Thai alphabet\|Thai]], and [[Tibetan script\|Tibetan]].<ref>{{cite web\| title = Unicode Data 1.0.0\|url= https://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt\| accessdate = 2010-03-16}}</ref> \|- \| 1.0.1 -\| Июнь 1992 -\| ISBN 0-201-60845-6 (Vol.2) +\| June 1992 +\| {{ISBN\|0-201-60845-6}} (Vol. 
2) \| -\| {{formatnum:25}} -\| {{formatnum:28359}} -\| Добавлены {{formatnum:20902}} {{iw\|Унифицированные идеограммы ККЯ\|унифицированные идеограммы китайского, японского и корейского письма\|en\|CJK Unified Ideographs}}<ref>{{cite web +\| 25 +\| 28,327<br />(21,204 added; 6 removed) +\| The initial set of 20,902 [[CJK Unified Ideographs]] is defined.<ref> +{{cite web \| title = Unicode Data 1.0.1 -\| url = http://www.unicode.org/Public/reconstructed/1.0.1/UnicodeData.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| url = https://www.unicode.org/Public/reconstructed/1.0.1/UnicodeData.txt +\| accessdate = 2010-03-16}}</ref> \|- -\| 1.1<ref>{{cite web\|title=Unicode® 1.1\|url=http://www.unicode.org/versions/Unicode1.1.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| Июнь 1993 +\| 1.1 +\| June 1993 \| \| ISO/IEC 10646-1:1993 -\| {{formatnum:24}} -\| {{formatnum:34233}} -\| Добавлено {{formatnum:4306}} слогов [[Хангыль\|хангыля]], дополнивших уже имеющиеся в кодировке {{formatnum:2350}} символов. Удалены символы [[Тибетское письмо\|тибетского письма]]<ref>{{cite web +\| 24 +\| 34,168<br />(5,963 added; 89 removed; 33 reclassified as control characters) +\| 4,306 more [[Hangul]] syllables added to original set of 2,350 characters. 
[[Tibetan script\|Tibetan]] removed.<ref>{{cite web \| title = Unicode Data 1995 -\| url = http://www.unicode.org/Public/1.1-Update/UnicodeData-1.1.5.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| url = https://www.unicode.org/Public/1.1-Update/UnicodeData-1.1.5.txt +\| accessdate = 2010-03-16 }} +</ref> \|- -\| 2.0<ref>{{cite web\|title=Unicode 2.0.0\|url=http://www.unicode.org/versions/Unicode2.0.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| Июль 1996 -\| ISBN 0-201-48345-9 -\| ISO/IEC 10646-1:1993 и Amendments 5, 6, 7 -\| {{formatnum:25}} -\| {{formatnum:38950}} -\| Удалены добавленные ранее слоги [[Хангыль\|хангыля]], и добавлены {{formatnum:11172}} новых слога хангыля с новыми кодами. Возвращены удалённые ранее символы [[Тибетское письмо\|тибетского письма]]; символы получили новые коды и были размещены в разных таблицах. Введён механизм суррогатных ({{lang-en\|surrogate}}) символов. Выделено место для плоскостей ({{lang-en\|planes}}) [[Области для частного использования\|15 и 16]]<ref>{{cite web -\| title = Unicode Data 2.0.14 -\| url = http://www.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| 2.0 +\| July 1996 +\| {{ISBN\|0-201-48345-9}} +\| ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7 +\| 25 +\| 38,885<br />(11,373 added; 6,656 removed) +\| Original set of [[Hangul]] syllables removed, and a new set of 11,172 Hangul syllables added at a new location. [[Tibetan script\|Tibetan]] added back in a new location and with a different character repertoire. 
Surrogate character mechanism defined, and Plane 15 and Plane 16 [[Private use (Unicode)\|Private Use Areas]] allocated.<ref>{{cite web +\| title = Unicode Data-2.0.14 +\| url = https://www.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt +\| accessdate = 2010-03-16}} +</ref> \|- -\| 2.1<ref>{{cite web\|title=Unicode 2.1.0\|url=http://www.unicode.org/versions/Unicode2.1.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| Май 1998 +\| 2.1 +\| May 1998 \| -\| ISO/IEC 10646-1:1993, Amendments 5, 6, 7, два символа из Amendment 18 +\| ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7, as well as two characters from Amendment 18 \| 25 -\| {{formatnum:38952}} -\| Добавлен [[символ евро]]<ref>{{cite web -\| title = Unicode Data 2.1.2 -\| url = http://www.unicode.org/Public/2.1-Update/UnicodeData-2.1.2.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| 38,887<br />(2 added) +\| [[Euro sign]] and [[Specials (Unicode block)\|Object Replacement Character]] added.<ref>{{cite web +\| title = Unicode Data-2.1.2 +\| url = https://www.unicode.org/Public/2.1-Update/UnicodeData-2.1.2.txt +\| accessdate = 2010-03-16}} +</ref> \|- -\| 3.0<ref>{{cite web\|title=Unicode 3.0.0\|url=http://www.unicode.org/versions/Unicode3.0.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| Сентябрь 1999 -\| ISBN 0-201-61633-5 +\| 3.0 +\| September 1999 +\| {{ISBN\|0-201-61633-5}} \| ISO/IEC 10646-1:2000 -\| {{formatnum:38}} -\| {{formatnum:49259}} -\| Добавлены письмо [[Чероки (письмо)\|чероки]], [[эфиопское письмо]], [[кхмерское письмо]], [[монгольские письменности]], [[бирманское письмо]], [[огамическое письмо]], [[руны]], [[сингальское письмо]], [[сирийское письмо]], [[Тана (письмо)\|тана]], [[канадское слоговое письмо]] и [[письмо и]], а также символы [[Шрифт Брайля\|шрифта Брайля]]<ref>{{cite web -\| title = Unicode Data 3.0.0 -\| url = http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt -\| lang = en -\| 
accessdate = 2017-12-04 -}}</ref> +\| 38 +\| 49,194<br />(10,307 added) +\| [[Cherokee syllabary\|Cherokee]], [[Ge'ez alphabet\|Ethiopic]], [[Khmer script\|Khmer]], [[Mongolian script\|Mongolian]], [[Burmese script\|Burmese]], [[Ogham]], [[Runic alphabet\|Runic]], [[Sinhala script\|Sinhala]], [[Syriac alphabet\|Syriac]], [[Tāna\|Thaana]], [[Canadian Aboriginal syllabics\|Unified Canadian Aboriginal Syllabics]], and [[Yi script\|Yi Syllables]] added, as well as a set of [[Braille]] patterns.<ref>{{cite web +\| title = Unicode Data-3.0.0 +\| url = https://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt +\| accessdate = 2010-03-16}} +</ref> \|- -\| 3.1<ref>{{cite web\|title=Unicode 3.1.0\|url=http://www.unicode.org/versions/Unicode3.1.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| Март 2001 +\| 3.1 +\| March 2001 \| \| ISO/IEC 10646-1:2000 ISO/IEC 10646-2:2001 -\| {{formatnum:41}} -\| {{formatnum:94205}} -\| Добавлены [[Дезеретский алфавит\|дезеретское письмо]], [[готское письмо]] и {{iw\|древнеиталийское письмо\|\|en\|Old Italic alphabet}}, а также символы [[Современная музыкальная нотация\|западной]] и [[Византийская музыка\|византийской]] музыки, {{formatnum:42711}} {{iw\|Унифицированные идеограммы ККЯ\|унифицированных идеограмм китайского, японского и корейского письма\|en\|CJK Unified Ideographs}}. 
Выделено место для плоскостей [[Плоскость (Юникод)#Дополнительная многоязычная плоскость\|1]], [[Плоскость (Юникод)#Дополнительная идеографическая плоскость\|2]] и [[Плоскость (Юникод)#Специализированная дополнительная плоскость\|14]]<ref>{{cite web -\| title = Unicode Data 3.1.0 -\| url = http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| 41 +\| 94,140<br />(44,946 added) +\| [[Deseret alphabet\|Deseret]], [[Gothic alphabet\|Gothic]] and [[Old Italic alphabet\|Old Italic]] added, as well as sets of symbols for [[Modern musical symbols\|Western music]] and [[Byzantine music]], and 42,711 additional [[CJK Unified Ideographs]].<ref>{{cite web +\| title =Unicode Data-3.1.0 +\| url =https://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt +\| accessdate = 2010-03-16 }} +</ref> \|- -\| 3.2<ref>{{cite web\|title=Unicode 3.2.0\|url=http://www.unicode.org/versions/Unicode3.2.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| Март 2002 +\| 3.2 +\| March 2002 \| -\| ISO/IEC 10646-1:2000 и Amendment 1 +\| ISO/IEC 10646-1:2000 plus Amendment 1 ISO/IEC 10646-2:2001 -\| {{formatnum:45}} -\| {{formatnum:95221}} -\| Добавлены [[Бухид (письмо)\|письмо бухид]], {{iw\|Письмо хануноо\|хануноо\|en\|Hanunó'o script}}, [[байбайин]] и [[Тагбанва (письмо)\|письмо тагбанва]]<ref>{{cite web -\| title = Unicode Data 3.2.0 -\| url = http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| 45 +\| 95,156<br />(1,016 added) +\| [[Philippines\|Philippine]] scripts [[Buhid script\|Buhid]], [[Hanunó'o script\|Hanunó'o]], [[Baybayin\|Tagalog]], and [[Tagbanwa script\|Tagbanwa]] added.<ref>{{cite web +\| title = Unicode Data-3.2.0 +\| url = https://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt +\| accessdate = 2010-03-16}} +</ref> \|- -\| 4.0<ref>{{cite web\|title=Unicode 
4.0.0\|url=http://www.unicode.org/versions/Unicode4.0.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| Апрель 2003 -\| ISBN 0-321-18578-1 +\| 4.0 +\| April 2003 +\| {{ISBN\|0-321-18578-1}} \| ISO/IEC 10646:2003 -\| {{formatnum:52}} -\| {{formatnum:96447}} -\| Добавлены [[кипрское письмо]], [[Лимбу (письмо)\|письмо лимбу]], [[линейное письмо Б]], [[сомалийское письмо]], [[Алфавит Шоу\|алфавит шоу]], [[Тай-ныа#Письменность\|письмо лы]] и [[угаритское письмо]], а также символы [[Гексаграмма (Ицзин)\|гексаграмм]]<ref>{{cite web -\| title = Unicode Data 4.0.0 -\| url = http://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| 52 +\| 96,382<br />(1,226 added) +\| [[Cypriot syllabary]], [[Limbu script\|Limbu]], [[Linear B]], [[Osmanya script\|Osmanya]], [[Shavian alphabet\|Shavian]], [[Tai Nüa language#Writing system\|Tai Le]], and [[Ugaritic alphabet\|Ugaritic]] added, as well as [[Hexagram (I Ching)\|Hexagram symbols]].<ref>{{cite web +\| title = Unicode Data-4.0.0 +\| url = https://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt +\| accessdate = 2010-03-16}} +</ref> \|- -\| 4.1<ref>{{cite web\|title=Unicode 4.1.0\|url=http://www.unicode.org/versions/Unicode4.1.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| Март 2005 +\| 4.1 +\| March 2005 \| -\| ISO/IEC 10646:2003 и Amendment 1 -\| {{formatnum:59}} -\| {{formatnum:97720}} -\| Добавлены [[Лонтара\|письмо лонтара]], [[глаголица]], [[Кхароштхи\|письмо кхароштхи]], [[новое письмо лы]], [[древнеперсидская клинопись]], [[силхетское нагари]] и [[древнеливийское письмо]]. Символы [[Коптское письмо\|коптского письма]] были отделены от символов [[Греческий алфавит\|греческого письма]]. 
Также добавлены [[Аттическая система счисления\|символы старых греческих цифр]], музыкальные символы Древней Греции и [[символ гривны]] (валюты [[Украина\|Украины]])<ref>{{cite web -\| title = Unicode Data 4.1.0 -\| url = http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| ISO/IEC 10646:2003 plus Amendment 1 +\| 59 +\| 97,655<br />(1,273 added) +\| [[Lontara alphabet\|Buginese]], [[Glagolitic alphabet\|Glagolitic]], [[Kharoṣṭhī\|Kharoshthi]], [[New Tai Lue alphabet\|New Tai Lue]], [[Old Persian cuneiform script\|Old Persian]], [[Sylheti Nagari\|Syloti Nagri]], and [[Tifinagh]] added, and [[Coptic alphabet\|Coptic]] was disunified from [[Greek alphabet\|Greek]]. Ancient [[Unicode numerals#Ancient Greek numerals\|Greek numbers]] and [[Musical notation#Ancient Greece\|musical symbols]] were also added.<ref>{{cite web\|url=https://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt\|title=Unicode Data-4.1.0\|accessdate=2010-03-16}} +</ref> \|- -\| 5.0<ref>{{cite web\|title=Unicode 5.0.0\|url=http://www.unicode.org/versions/Unicode5.0.0/\|date=2006-07-14\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| Июль 2006 -\| ISBN 0-321-48091-0 -\| ISO/IEC 10646:2003, Amendments 1, 2, четыре символа из Amendment 3 -\| {{formatnum:64}} -\| {{formatnum:99089}} -\| Добавлены [[балийское письмо]], [[клинопись]], [[Нко (письмо)\|письмо нко]], [[монгольское квадратное письмо]] и [[финикийское письмо]]<ref>{{cite web +\| 5.0 +\| July 2006 +\| {{ISBN\|0-321-48091-0}} +\| ISO/IEC 10646:2003 plus Amendments 1 and 2, as well as four characters from Amendment 3 +\| 64 +\| 99,024<br />(1,369 added) +\| [[Balinese alphabet\|Balinese]], [[Cuneiform]], [[N'Ko alphabet\|N'Ko]], [[Phags-pa script\|Phags-pa]], and [[Phoenician alphabet\|Phoenician]] added.<ref>{{cite web \| title = Unicode Data 5.0.0 -\| url = http://www.unicode.org/Public/5.0.0/ucd/UnicodeData.txt -\| lang = en -\| accessdate = 2017-12-04 
-}}</ref> +\| url = https://www.unicode.org/Public/5.0.0/ucd/UnicodeData.txt +\| accessdate = 2010-03-17}} +</ref> \|- -\| 5.1<ref>{{cite web\|title=Unicode 5.1.0\|url=http://www.unicode.org/versions/Unicode5.1.0/\|date=2008-04-04\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| Апрель 2008 +\| 5.1 +\| April 2008 \| -\| ISO/IEC 10646:2003 и Amendments 1, 2, 3, 4 -\| {{formatnum:75}} -\| {{formatnum:100713}} -\| Добавлены [[карийское письмо]], [[чамская письменность]], [[Кая-ли\|письмо кая-ли]], [[Лепча (письмо)\|письмо лепча]], [[Ликийский алфавит\|ликийское письмо]], [[Лидийский алфавит\|лидийское письмо]], [[Ол-чики\|письмо ол-чики]], [[реджангское письмо]], [[Саураштра (письмо)\|письмо саураштра]], [[сунданское письмо]],[[Древнетюркское письмо]] и [[Ваи (письмо)\|письмо ваи]]. Добавлены [[Фестский диск\|символы фестского диска]], символы костей для [[маджонг]]а и [[домино]], [[заглавная буква эсцет]] (ẞ), а также буквы латиницы, использовавшиеся в средневековых [[Рукопись\|рукописях]] для {{iw\|аббревация\|аббревиации\|en\|Scribal abbreviation}}. Новыми символами дополнен набор символов [[Бирманское письмо\|бирманского письма]]<ref>{{cite web +\| ISO/IEC 10646:2003 plus Amendments 1, 2, 3 and 4 +\| 75 +\| 100,648<br />(1,624 added) +\| [[Carian script\|Carian]], [[Cham alphabet\|Cham]], [[Kayah Li script\|Kayah Li]], [[Lepcha script\|Lepcha]], [[Lycian script\|Lycian]], [[Lydian script\|Lydian]], [[Ol Chiki script\|Ol Chiki]], [[Rejang script\|Rejang]], [[Saurashtra script\|Saurashtra]], [[Sundanese script\|Sundanese]], and [[Vai syllabary\|Vai]] added, as well as sets of symbols for the [[Phaistos Disc]], [[Mahjong\|Mahjong tiles]], and [[Dominoes\|Domino tiles]]. 
There were also important additions for [[Burmese script\|Burmese]], additions of letters and [[Scribal abbreviation]]s used in medieval [[manuscript]]s, and the addition of [[Capital ẞ]].<ref>{{cite web \| title = Unicode Data 5.1.0 -\| url = http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| url = https://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt +\| accessdate = 2010-03-17 }} +</ref> \|- -\| 5.2<ref>{{cite web\|title=Unicode® 5.2.0\|url=http://www.unicode.org/versions/Unicode5.2.0/\|date=2009-10-01\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| Октябрь 2009 -\| -\| ISO/IEC 10646:2003 и Amendments 1, 2, 3, 4, 5, 6 -\| {{formatnum:90}} -\| {{formatnum:107361}} -\| Добавлены [[Авестийский алфавит\|авестийское письмо]], [[Бамум (письменность)\|письмо бамум]], [[египетское иероглифическое письмо]] (по {{iw\|список Гардинера\|списку Гардинера\|en\|Gardiner's sign list}}, содержащему {{formatnum:1071}} символ), [[имперское арамейское письмо]], {{iw\|пахлевийское эпиграфическое письмо\|\|en\|Inscriptional Pahlavi}}, {{iw\|парфянское эпиграфическое письмо\|\|en\|Inscriptional Parthian}}, [[яванское письмо]], [[Кайтхи\|письмо кайтхи]], [[Алфавит Фрейзера\|письмо лису]], [[Манипури (письмо)\|письмо манипури]], [[южноаравийское письмо]], [[древнетюркское письмо]], [[самаритянское письмо]], [[Ланна (письмо)\|письмо ланна]] и {{iw\|письмо тай-вьет\|\|en\|Tai Viet script}}. 
Добавлены {{formatnum:4149}} новых {{iw\|унифицированные идеограммы китайского, японского, корейского письма\|унифицированных идеограмм китайского, японского и корейского письма\|en\|CJK Unified Ideographs}} (CJK-C), символы [[Ведийский язык\|ведийского письма]], [[символ тенге]] (валюты [[Казахстан]]а), а также расширен набор символов чамо [[Хангыль\|старого хангыля]]<ref>{{cite web +\| 5.2 +\| October 2009 +\| {{ISBN\|978-1-936213-00-9}} +\| ISO/IEC 10646:2003 plus Amendments 1, 2, 3, 4, 5 and 6 +\| 90 +\| 107,296<br />(6,648 added) +\| [[Avestan alphabet\|Avestan]], [[Bamum script\|Bamum]], [[Egyptian hieroglyphs]] (the [[Gardiner's sign list\|Gardiner Set]], comprising 1,071 characters), [[Imperial Aramaic]], [[Inscriptional Pahlavi]], [[Inscriptional Parthian]], [[Javanese script\|Javanese]], [[Kaithi]], [[Fraser alphabet\|Lisu]], [[Meitei Mayek script\|Meetei Mayek]], [[South Arabian alphabet\|Old South Arabian]], [[Old Turkic script\|Old Turkic]], [[Samaritan script\|Samaritan]], [[Tai Tham script\|Tai Tham]] and [[Tai Viet script\|Tai Viet]] added. 4,149 additional [[CJK Unified Ideographs]] (CJK-C), as well as extended Jamo for [[Hangul\|Old Hangul]], and characters for [[Vedic Sanskrit]].<ref>{{cite web \| title = Unicode Data 5.2.0 -\| url = http://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| url = https://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt +\| accessdate = 2010-03-17}} +</ref> \|- -\| 6.0<ref>{{cite web\|title=Unicode® 6.0.0\|url=http://www.unicode.org/versions/Unicode6.0.0/\|date=2010-10-11\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| Октябрь 2010 -\| -\| ISO/IEC 10646:2010 и [[символ индийской рупии]] -\| {{formatnum:93}} -\| {{formatnum:109449}} -\| Добавлены [[батакское письмо]], [[Брахми\|письмо брахми]], [[мандейское письмо]]. 
Добавлены символы [[Игральные карты\|игральных карт]], [[Дорожный знак\|дорожных знаков]], [[Географическая карта\|географических карт]], [[Алхимические символы\|алхимии]], [[эмотикон]]а и [[эмодзи]], а также {{formatnum:222}} {{iw\|унифицированные идеограммы китайского, японского и корейского письма\|\|en\|CJK Unified Ideographs}} (CJK-D)<ref>{{cite web +\| 6.0 +\| October 2010 +\| {{ISBN\|978-1-936213-01-6}} +\| ISO/IEC 10646:2010 plus the [[Indian rupee sign]] +\| 93 +\| 109,384<br />(2,088 added) +\| [[Batak alphabet\|Batak]], [[Brāhmī script\|Brahmi]], [[Mandaic alphabet\|Mandaic]], [[playing card]] symbols, [[Traffic sign\|transport]] and [[map]] symbols, [[alchemical symbol]]s, [[emoticons]] and [[emoji]]. 222 additional [[CJK Unified Ideographs]] (CJK-D) added.<ref>{{cite web \| title = Unicode Data 6.0.0 -\| url = http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| url = https://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt +\| accessdate = 2010-10-11}} +</ref> \|- -\| 6.1<ref>{{cite web\|title=Unicode® 6.1.0\|url=http://www.unicode.org/versions/Unicode6.1.0/\|date=2012-01-31\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| Январь 2012 -\| +\| 6.1 +\| January 2012 +\| {{ISBN\|978-1-936213-02-3}} \| ISO/IEC 10646:2012 -\| {{formatnum:100}} -\| {{formatnum:110181}} -\| Добавлены [[Чакма (письмо)\|письмо чакма]], [[Мероитское письмо\|мероитский курсив и мероитские иероглифы]], [[Письмо Полларда\|письмо мяо]], [[Шарада (письмо)\|письмо шарада]], {{iw\|письмо соранг-сомпенг\|\|en\|Sora Sompeng}} и [[Такри\|письмо такри]]<ref>{{cite web +\| 100 +\| 110,116<br />(732 added) +\| [[Chakma alphabet\|Chakma]], [[Meroitic alphabet\|Meroitic cursive]], [[Meroitic alphabet\|Meroitic hieroglyphs]], [[Pollard script\|Miao]], [[Śāradā script\|Sharada]], [[Sora Sompeng]], and [[Takri alphabet\|Takri]].<ref>{{cite web \| title = Unicode Data 6.1.0 -\| url = 
http://www.unicode.org/Public/6.1.0/ucd/UnicodeData.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| url = https://www.unicode.org/Public/6.1.0/ucd/UnicodeData.txt +\| accessdate = 2012-01-31}} +</ref> \|- -\| 6.2<ref>{{cite web\|title=Unicode® 6.2.0\|url=http://www.unicode.org/versions/Unicode6.2.0/\|date=2012-09-26\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-07\|language=en}}</ref> -\| Сентябрь 2012 -\| -\| ISO/IEC 10646:2012 и [[символ турецкой лиры]] -\| {{formatnum:100}} -\| {{formatnum:110182}} -\| Добавлен [[символ турецкой лиры]] (валюты [[Турция\|Турции]])<ref>{{cite web +\| 6.2 +\| September 2012 +\| {{ISBN\|978-1-936213-07-8}} +\| ISO/IEC 10646:2012 plus the [[Turkish lira sign]] +\| 100 +\| 110,117<br />(1 added) +\| [[Turkish lira sign]].<ref>{{cite web \| title = Unicode Data 6.2.0 -\| url = http://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| url = https://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt +\| accessdate = 2012-09-26}} +</ref> \|- -\| 6.3<ref>{{cite web\|title=Unicode® 6.3.0\|url=http://www.unicode.org/versions/Unicode6.3.0/\|date=2012-09-30\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-07\|language=en}}</ref> -\| Сентябрь 2013 -\| -\| ISO/IEC 10646:2012 и шесть символов -\| {{formatnum:100}} -\| {{formatnum:110187}} -\| Добавлено пять символов для форматирования двунаправленного текста<ref>{{cite web +\| 6.3 +\| September 2013 +\| {{ISBN\|978-1-936213-08-5}} +\| ISO/IEC 10646:2012 plus six characters +\| 100 +\| 110,122<br />(5 added) +\| 5 bidirectional formatting characters.<ref>{{cite web \| title = Unicode Data 6.3.0 -\| url = http://www.unicode.org/Public/6.3.0/ucd/UnicodeData.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| url = https://www.unicode.org/Public/6.3.0/ucd/UnicodeData.txt +\| accessdate = 2013-09-30}} +</ref> \|- -\| 7.0<ref>{{cite web\|title=Unicode® 
7.0.0\|url=http://www.unicode.org/versions/Unicode7.0.0/\|date=2014-06-16\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| 16 июня 2014 -\| -\| ISO/IEC 10646:2012, Amendments 1, 2 и [[символ российского рубля\|символ рубля]] -\| {{formatnum:123}} -\| {{formatnum:113021}} -\| Добавлены [[Басса (письмо)\|письмо басса]], [[агванское письмо]], [[Система Дюплойе\|стенография Дюплойе]], [[эльбасанское письмо]], [[Грантха\|письмо грантха]], {{iw\|письмо ходжики\|\|en\|Khojki}}, {{iw\|письменность худавади\|\|en\|Khudabadi alphabet}}, [[линейное письмо А]], {{iw\|письмо махаджани\|\|en\|Mahajani}}, [[манихейское письмо]], [[Кикакуи\|письмо кикакуи]], [[Моди (письмо)\|письмо моди]], {{iw\|письмо мро\|\|en\|Mro script}}, [[набатейское письмо]], [[Северноаравийские языки\|северноаравийское письмо]], [[древнепермское письмо]], [[Пахау\|письмо пахау]], [[Пальмирский алфавит\|пальмирское письмо]], {{iw\|письмо по чин хо\|\|en\|Pau Cin Hau}}, {{iw\|письмо псалтирь пехлеви\|\|en\|Psalter Pahlavi}}, [[сиддхаматрика]], [[Тирхута\|письмо тирхута]], [[варанг-кшити]] и {{iw\|дингбат\|орнамент дингбат\|en\|Dingbat}}, а также [[символ российского рубля]] и [[символ азербайджанского маната]]<ref>{{cite web +\| 7.0 +\| June 2014 +\| {{ISBN\|978-1-936213-09-2}} +\| ISO/IEC 10646:2012 plus Amendments 1 and 2, as well as the [[Ruble sign]] +\| 123 +\| 112,956<br />(2,834 added) +\| [[Bassa alphabet\|Bassa Vah]], [[Caucasian Albanian alphabet\|Caucasian Albanian]], [[Duployan shorthand\|Duployan]], [[Elbasan alphabet\|Elbasan]], [[Grantha alphabet\|Grantha]], [[Khojki]], [[Khudabadi alphabet\|Khudawadi]], [[Linear A]], [[Mahajani]], [[Manichaean alphabet\|Manichaean]], [[Mende script\|Mende Kikakui]], [[Modi alphabet\|Modi]], [[Mro script\|Mro]], [[Nabataean alphabet\|Nabataean]], [[Old North Arabian]], [[Old Permic alphabet\|Old Permic]], [[Pahawh Hmong]], [[Palmyrene script\|Palmyrene]], [[Pau Cin Hau script\|Pau Cin Hau]], [[Psalter Pahlavi]], [[Siddhaṃ 
alphabet\|Siddham]], [[Tirhuta]], [[Warang Citi]], and [[Dingbat]]s.<ref>{{cite web \| title = Unicode Data 7.0.0 -\| url = http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| url = https://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt +\| accessdate = 2014-06-15}} +</ref> \|- -\| 8.0<ref>{{cite web\|title=Unicode® 8.0.0\|url=http://www.unicode.org/versions/Unicode8.0.0/\|date=2015-06-17\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| 17 июня 2015 -\| -\|ISO/IEC 10646:2014, Amendment 1, [[символ лари]], 9 унифицированных идеограмм ККЯ, 41 [[эмодзи]] -\|129 -\| {{formatnum:120737}} -\|Добавлены [[Ахом (письмо)\|письмо ахом]], [[анатолийские иероглифы]], [[Хатран\|письмо хатран]], [[Мултани\|письмо мултани]], [[венгерские руны]], [[SignWriting]], {{formatnum:5776}} [[Унифицированные идеограммы ККЯ — расширение E]], строчные буквы письма [[чероки]], буквы латиницы для немецкой диалектологии, 41 [[эмодзи]], а также пять символов изменения [[Шкала Фитцпатрика\|цвета кожи]] для эмотиконов. 
Добавлен [[символ лари]] (валюты [[Грузия\|Грузии]])<ref>{{cite web +\| 8.0 +\| June 2015 +\| {{ISBN\|978-1-936213-10-8}} +\| ISO/IEC 10646:2014 plus Amendment 1, as well as the [[Georgian lari\|Lari sign]], nine CJK unified ideographs, and 41 emoji characters<ref>{{Cite web \| title=Unicode 8.0.0 \| url=https://www.unicode.org/versions/Unicode8.0.0/ \| publisher=Unicode Consortium \| accessdate=2015-06-17 }}</ref> +\| 129 +\| 120,672<br />(7,716 added) +\| [[Ahom alphabet\|Ahom]], [[Anatolian hieroglyphs]], [[Hatran alphabet\|Hatran]], [[Multani alphabet\|Multani]], [[Old Hungarian alphabet\|Old Hungarian]], [[SignWriting]], 5,771 [[CJK Unified Ideographs\|CJK unified ideographs]], a set of lowercase letters for [[Cherokee syllabary\|Cherokee]], and five emoji [[Fitzpatrick scale\|skin tone]] modifiers<ref>{{cite web \| title = Unicode Data 8.0.0 -\| url = http://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt -\| lang = en -\| accessdate = 2017-12-04 -}}</ref> +\| url = https://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt +\| accessdate = 2015-06-17}} +</ref> \|- -\| 9.0<ref>{{cite web\|title=Unicode® 9.0.0\|url=http://www.unicode.org/versions/Unicode9.0.0/\|date=2016-06-21\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| 21 июня 2016 -\| -\|ISO/IEC 10646:2014, Amendments 1, 2, адлам, нева, японские символы для ТВ, 74 [[эмодзи]] и символов -\|135 -\| {{formatnum:128237}} -\|Добавлены [[Адлам\|письмо адлам]], [[Бхайкшуки\|письмо бхайкшуки]], [[Марчен\|письмо марчен]], [[Нева (письмо)\|письмо нева]], [[Осейдж (письмо)\|письмо осейдж]], [[тангутское письмо]], а также 72 [[эмодзи]] и японские символы для телевидения<ref>{{cite web +\| 9.0 +\| June 2016 +\| {{ISBN\|978-1-936213-13-9}} +\| ISO/IEC 10646:2014 plus Amendments 1 and 2, as well as Adlam, Newa, Japanese TV symbols, and 74 emoji and symbols<ref>{{Cite web \| title=Unicode 9.0.0 \| url=https://www.unicode.org/versions/Unicode9.0.0/ \| publisher=Unicode Consortium \| 
accessdate=2016-06-21 }}</ref> +\| 135 +\| 128,172<br />(7,500 added) +\| [[Adlam script\|Adlam]], [[Bhaiksuki alphabet\|Bhaiksuki]], [[Zhang-Zhung language#Scripts\|Marchen]], [[Prachalit Nepal alphabet\|Newa]], [[Osage alphabet\|Osage]], [[Tangut script\|Tangut]], and 72 [[emoji]]<ref>{{cite web \| title = Unicode Data 9.0.0 -\| url = http://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt -\| lang = en -\| accessdate = 2017-12-06 -}}</ref> +\| url = https://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt +\| accessdate = 2016-06-21}} +</ref><ref name=laobo>{{cite web\|first=Martim\|last=Lobao\|url=https://www.androidpolice.com/2016/06/07/two-emoji-werent-approved-unicode-9-google-added-android-anyway/ \|title=These Are The Two Emoji That Weren't Approved For Unicode 9 But Which Google Added To Android Anyway\|website=Android Police\|date= 7 June 2016\|accessdate=4 September 2016}}</ref> \|- -\| 10.0<ref>{{cite web\|title=Unicode® 10.0.0\|url=http://www.unicode.org/versions/Unicode10.0.0/\|date=2017-06-27\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref> -\| 20 июня 2017 -\| -\|ISO/IEC 10646:2017, 56 [[эмодзи]], 285 символов [[Хэнтайгана\|хэнтайганы]], 3 символа квадратного письма Дзанабадзара -\|139 -\| {{formatnum:136755}} -\|Добавлены [[Монгольские письменности#Горизонтальное квадратное письмо\|квадратное письмо Дзанабадзара]], [[Соёмбо (письмо)\|письмо соёмбо]], [[гонди Масарама]], [[Нюй-шу\|письмо нюй-шу]], [[Хэнтайгана\|письмо хэнтайгана]], {{formatnum:7494}} [[Унифицированные идеограммы ККЯ — расширение F]], а также 56 [[эмодзи]] и символ [[биткойн]]а<ref>{{cite web -\| title = Unicode Data 10.0.0 -\| url = http://www.unicode.org/Public/10.0.0/ucd/UnicodeData.txt -\| lang = en -\| accessdate = 2017-12-07 -}}</ref> +\| 10.0 +\| June 2017 +\| {{ISBN\|978-1-936213-16-0}} +\| ISO/IEC 10646:2017 plus 56 [[emoji]] characters, 285 [[hentaigana]] characters, and 3 Zanabazar Square characters<ref>{{Cite web \| title=Unicode 10.0.0 \| 
url=https://www.unicode.org/versions/Unicode10.0.0/ \| publisher=Unicode Consortium \| accessdate=2017-06-20 }}</ref> +\| 139 +\| 136,690<br />(8,518 added) +\| [[Zanabazar Square alphabet\|Zanabazar Square]], [[Soyombo alphabet\|Soyombo]], [[Masaram Gondi script\|Masaram Gondi]], [[Nüshu script\|Nüshu]], [[hentaigana]] (non-standard [[hiragana]]), 7,494 [[CJK Unified Ideographs\|CJK unified ideographs]], and 56 [[emoji]] \|- \| 11.0 -\| Июнь 2018 -\| -\| ISO/IEC 10646:2017 +\| June 2018 +\| {{ISBN\|978-1-936213-19-1}} +\| ISO/IEC 10646:2017 plus Amendment 1, as well as 46 Mtavruli Georgian capital letters, 5 CJK unified ideographs, and 66 emoji characters.<ref>{{Cite web \| title=The Unicode Standard, Version 11.0.0 Appendix C \| url=https://www.unicode.org/versions/Unicode11.0.0/appC.pdf \| publisher=Unicode Consortium \| accessdate=2018-06-11 }}</ref> \| 146 -\| {{formatnum:137439}} -\|Добавлены догра, [[грузинское письмо]] мтаврули, гунджалское гонди, [[ханифи]], индийские цифры сийяк, [[Макасарский язык\|макасарское]] письмо, медефайдрин, (древне-)[[согдийское письмо]], [[цифры майя]], 5 идеограмм ККЯ, символы [[сянци]] и половин звёздочек для оценки, а также 145 [[эмодзи]], четыре символа изменения причёски для эмотиконов и символ [[копилефт]]а<ref>{{Cite web\|url=http://www.unicode.org/Public/11.0.0/ucd/UnicodeData.txt\|title=Unicode Data 11.0.0\|author=\|website=\|date=\|publisher=\|accessdate=2019-04-12\|lang=en}}</ref><ref>[http://blog.unicode.org/2018/06/announcing-unicode-standard-version-110.html The Unicode Blog: Announcing The Unicode® Standard, Version 11.0]</ref><ref>[http://www.unicode.org/versions/Unicode11.0.0/ Unicode 11.0.0]</ref> +\| 137,374<br />(684 added) +\| [[Dogri script\|Dogra]], [[Georgian scripts#Mkhedruli\|Georgian Mtavruli]] capital letters, [[Gunjala Gondi Lipi\|Gunjala Gondi]], [[Hanifi Rohingya script\|Hanifi Rohingya]], [[Indic Siyaq Numbers (Unicode block)\|Indic Siyaq numbers]], [[Makassarese language\|Makasar]], 
[[Medefaidrin]], [[Sogdian alphabet\|Old Sogdian and Sogdian]], [[Mayan numerals]], 5 urgently needed [[CJK Unified Ideographs\|CJK unified ideographs]], symbols for [[xiangqi]] (Chinese chess) and [[Star (classification)\|star ratings]], and 145 [[emoji]]<ref>{{Cite web\|url=http://blog.unicode.org/2018/06/announcing-unicode-standard-version-110.html\|title=Announcing The Unicode Standard, Version 11.0\|website=blog.unicode.org\|access-date=2018-06-06}}</ref> \|- -\|12.0 -\|Март 2019 -\| -\|ISO/IEC 10646:2017, Amendments 1, 2, а также 62 дополнительных символов -\|150 -\|{{formatnum:137993}} -\|Добавлены элимайское письмо, {{Не переведено 3\|надинагари\|3=en\|4=Nandinagari}}, хмонг, ванчо, дополнения для [[Письмо Полларда\|письма Полларда]], малая [[кана]] для старых японских текстов, исторические дроби и символы [[Тамильское письмо\|тамильского письма]], буквы [[Лаосское письмо\|лаосского письма]] для [[пали]], буквы латиницы для транслитерации угаритского, управляющие символы форматирования египетских иероглифов, а также 61 [[эмодзи]]<ref>[http://blog.unicode.org/2019/03/announcing-unicode-standard-version-120.html The Unicode Blog: Announcing The Unicode® Standard, Version 12.0]</ref><ref>[http://www.unicode.org/versions/Unicode12.0.0/ Unicode 12.0.0]</ref> +\| 12.0 +\| March 2019 +\| {{ISBN\|978-1-936213-22-1}} +\| ISO/IEC 10646:2017 plus Amendments 1 and 2, as well as 62 additional characters.<ref>{{Cite web \| title=The Unicode Standard, Version 12.0.0 Appendix C \| url=https://www.unicode.org/versions/Unicode12.0.0/appC.pdf \| publisher=Unicode Consortium \| accessdate=2019-03-05 }}</ref> +\| 150 +\| 137,928<br />(554 added) +\| [[Elymaic]], [[Nandinagari]], [[Nyiakeng Puachue Hmong]], [[Wancho script\|Wancho]], [[Pollard script\|Miao script]] additions for several Miao and Yi dialects in China, [[hiragana]] and [[katakana]] small letters for writing archaic Japanese, [[Tamil script\|Tamil]] historic fractions and symbols, [[Lao alphabet\|Lao]] letters for 
[[Pali]], Latin letters for Egyptological and Ugaritic transliteration, hieroglyph format controls, and 61 [[emoji]]<ref>{{Cite web\|url=http://blog.unicode.org/2019/03/announcing-unicode-standard-version-120.html\|title=Announcing The Unicode Standard, Version 12.0\|website=blog.unicode.org\|access-date=2019-03-05}}</ref> \|- -\|12.1 -\|Май 2019 +\| 12.1 +\| May 2019 +\| {{ISBN\|978-1-936213-25-2}} \| -\| -\|150 -\|{{formatnum:137994}} -\|Добавлен квадратный символ эпохи [[рэйва]]<ref>[http://blog.unicode.org/2019/05/unicode-12-1-en.html The Unicode Blog: Unicode Version 12.1 released in support of the Reiwa Era]</ref><ref>[http://www.unicode.org/versions/Unicode12.1.0/ Unicode 12.1.0]</ref> +\| 150 +\| 137,929<br />(1 added) +\| Adds a single character at U+32FF for the square ligature form of the name of the [[Reiwa\|Reiwa era]].<ref>{{Cite web\|url=http://blog.unicode.org/2019/05/unicode-12-1-en.html\|title=Unicode Version 12.1 released in support of the Reiwa Era\|website=blog.unicode.org\|access-date=2019-05-07}}</ref> \|- -\|13.0 -\|Март 2020 -\| -\| -\|154 -\|{{formatnum:143859}} -\|Добавлены [[хорезмийское письмо]], письмо [[дивес акуру]], [[малое киданьское письмо]], [[езидское письмо]], {{formatnum:4969}} идеограмм ККЯ (включая {{formatnum:4939}} [[Унифицированные идеограммы ККЯ — расширение G]]), а также 55 [[эмодзи]], символы [[Creative Commons]] и символы для унаследованной вычислительной техники. 
Выделено место для [[Плоскость (Юникод)#Третичная идеографическая плоскость\|плоскости 3]]<ref>[http://blog.unicode.org/2020/03/announcing-unicode-standard-version-130.html The Unicode Blog: Announcing The Unicode Standard, Version 13.0]</ref><ref>[http://www.unicode.org/versions/Unicode13.0.0/ Unicode 13.0.0]</ref> -\|- -\|colspan="7" \| '''Примечания''' -<references group="A" /> +\| [http://www.unicode.org/versions/Unicode13.0.0/ 13.0] +\| March 2020 +\| {{ISBN\|978-1-936213-26-9}} +\| ISO/IEC 10646:2020<ref>{{Cite web \| title=The Unicode Standard, Version 13.0– Core Specification Appendix C \| url=https://www.unicode.org/versions/Unicode13.0.0/appC.pdf \| publisher=Unicode Consortium \| accessdate=2020-03-11 }}</ref> +\| 154 +\| 143,859<br />(5,930 added) +\| [[Khwarezmian_language#Writing_system\|Chorasmian]], [[Dhives akuru\|Dives Akuru]], [[Khitan small script]], [[Kurdish alphabets#Yezidi\|Yezidi]], 4,969 CJK unified ideographs added (including 4,939 in [[CJK Unified Ideographs Extension G\|Ext. G]]), Arabic script additions used to write [[Hausa language\|Hausa]], [[Wolof language\|Wolof]], and other languages in Africa and other additions used to write [[Hindko]] and [[Punjabi language\|Punjabi]] in Pakistan, Bopomofo additions used for Cantonese, Creative Commons license symbols, graphic characters for compatibility with teletext and home computer systems from the 1970s and 1980s, and 55 emoji<ref>{{Cite web\|url=http://blog.unicode.org/2020/03/announcing-unicode-standard-version-130.html\|title=Announcing The Unicode Standard, Version 13.0\|website=blog.unicode.org\|access-date=2020-03-11}}</ref> \|} -== Кодовое пространство == -Хотя формы записи UTF-8 и UTF-32 позволяют кодировать до 2<sup>31</sup> ({{formatnum:2147483648}}) кодовых позиций, было принято решение использовать лишь {{formatnum:1112064}} для совместимости с UTF-16. 
Впрочем, даже и этого в данный момент более чем достаточно — в версии 13.0 используется всего {{formatnum:143859}} кодовых позиций. +{{Reflist\|group=tablenote}} -Кодовое пространство разбито на 17 ''[[Плоскость (Юникод)\|плоскостей]]'' ({{lang-en\|planes}}) по 2<sup>16</sup> ({{formatnum:65536}}) символов. Нулевая плоскость ({{lang-en2\|plane{{nbsp}}0}}) называется ''базовой'' ({{lang-en2\|basic}}) и содержит символы наиболее употребительных письменностей. Остальные плоскости — дополнительные ({{lang-en2\|supplementary}}). Первая плоскость ({{lang-en2\|plane{{nbsp}}1}}) используется в основном для исторических письменностей, вторая ({{lang-en2\|plane{{nbsp}}2}}) — для редко используемых иероглифов [[CJK\|китайского письма (ККЯ)]], третья ({{lang-en2\|plane{{nbsp}}3}}) зарезервирована для архаичных китайских иероглифов<ref>[http://unicode.org/roadmaps/tip/ Roadmap to the TIP (Tertiary Ideographic Plane)]</ref>. Плоскость 14 отведена для символов, используемых по особому назначению. Плоскости 15 и 16 выделены для частного употребления<ref name='unicode-02' />. +==<span id="Upluslink"></span><span id="codespace"></span> Architecture and terminology== +{{See also\|Universal Character Set characters}}<!-- Template:U+ links to this paragraph --> -Для обозначения символов Unicode используется запись вида «U+''xxxx''» (для кодов 0…FFFF), или «U+''xxxxx''» (для кодов 10000…FFFFF), или «U+''xxxxxx''» (для кодов 100000…10FFFF), где ''xxx'' — [[шестнадцатеричная система счисления\|шестнадцатеричные]] цифры. Например, символ «я» (U+044F) имеет код 044F{{sub\|16}}{{nbsp}}= 1103{{sub\|[[десятичная система счисления\|10]]}}. 
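The "U+" notation described here (hexadecimal value, padded with leading zeros to at least four digits) can be sketched in a few lines of Python. This is only an illustration: the helper names `format_code_point` and `plane_of` are hypothetical and do not come from the article or from any library.

```python
def format_code_point(cp: int) -> str:
    """Return the U+ notation for a code point (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp!r} is outside the Unicode codespace")
    # At least four uppercase hex digits; longer values are not truncated.
    return f"U+{cp:04X}"


def plane_of(cp: int) -> int:
    """Return the plane number (0 to 16): each plane holds 2**16 code points."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp!r} is outside the Unicode codespace")
    return cp >> 16


if __name__ == "__main__":
    for ch in ("я", "÷", "\U00013254"):
        cp = ord(ch)
        print(format_code_point(cp), "decimal", cp, "plane", plane_of(cp))
```

The three sample characters are the ones used as examples in the surrounding text: the Cyrillic letter "я", the division sign, and the Egyptian hieroglyph at U+13254.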
+The Unicode Standard defines a ''codespace''<ref name="Glossary">{{cite web\|title = Glossary of Unicode Terms\|url=https://unicode.org/glossary/\|accessdate=2010-03-16}}</ref> of numerical values ranging from 0 through 10FFFF<sub>[[hexadecimal\|16]]</sub>,<ref>{{cite book\|title=The Unicode Standard, Version 13.0 \|url=https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212\|year=2019\|page=19\|chapter=3.4 Characters and Encoding}}</ref> called ''[[code point\|code points]]''<ref name=":0">{{Cite book\|url=http://www.unicode.org/versions/Unicode12.0.0/ch02.pdf#G25564\|title=The Unicode Standard Version 12.0 – Core Specification\|last=\|first=\|publisher=\|year=2019\|isbn=\|location=\|page=29\|chapter=2.4 Code Points and Characters}}</ref> and denoted as U+0000 through U+10FFFF ("U+" plus the code point value in [[hexadecimal]], prepended with [[leading zero\|leading zeros]] as necessary to result in a minimum of four digits, ''e. g.'', U+00F7 for the division sign, ÷, versus U+13254 for the [[Egyptian hieroglyph]] designating a [[List of hieroglyphs#O\|reed shelter]] or a [[c:Category:Winding wall (h hieroglyph)\|winding wall]] {{nowrap\|( [[File:Hiero O4.png\|text-bottom\|15px]] )}}<ref>{{Cite web\|url=https://www.unicode.org/versions/Unicode13.0.0/appA.pdf\|date=March 2020\|title=Appendix A: Notational Conventions\|publisher=Unicode Consortium\|work=The Unicode Standard}} In conformity with the bullet point relating to Unicode in [[MOS:ALLCAPS]], the formal Unicode names are not used in this paragraph.</ref>), respectively. Out of these 2<sup>16</sup> + 2<sup>20</sup> defined code points, the code points from U+D800 through U+DFFF, which are used to encode surrogate pairs in [[UTF-16]], are reserved by the Standard and may not be used to encode valid characters, resulting in a net total of 2<sup>16</sup> − 2<sup>11</sup> + 2<sup>20</sup> = 1,112,064 possible code points corresponding to valid Unicode characters. 
Not all of these code points necessarily correspond to visible characters; several, for example, are assigned to control codes such as [[carriage return]]. -{\| class="wikitable sortable collapsible collapsed" -\|- -! colspan="3" \| Плоскости Юникода -\|- -! Плоскость !! Название !! Диапазон символов -\|- -\| 0 \|\| Базовая многоязыковая плоскость ({{lang-en2\|Basic multilingual plane, BMP}}) \|\| U+0000…U+FFFF -\|- -\| 1 \|\| Дополнительная многоязыковая плоскость ({{lang-en2\|Supplementary multilingual plane, SMP}}) \|\| U+10000…U+1FFFF -\|- -\| 2 \|\| Дополнительная иероглифическая плоскость ({{lang-en2\|Supplementary ideographic plane, SIP}}) \|\| U+20000…U+2FFFF -\|- -\| 3 \|\| Третичная иероглифическая плоскость ({{lang-en2\|Tertiary ideographic plane, TIP}}) \|\| U+30000…U+3FFFF -\|- -\| 4—13 \|\| не используются \|\| U+40000…U+DFFFF -\|- -\| 14 \|\| Дополнительная плоскость особого назначения ({{lang-en2\|Supplementary special-purpose plane, SSP}}) \|\| U+E0000…U+EFFFF -\|- -\| 15—16 \|\| Дополнительные области для частного использования ({{lang-en2\|Supplementary private use area, SPUA-A/B}}) \|\| U+F0000…U+10FFFF -\|- -\|} +===Code point planes and blocks=== +{{Main\|Plane (Unicode)}} +The Unicode codespace is divided into seventeen ''planes'', numbered 0 to 16: + +{{Planes (Unicode)}} + +All code points in the BMP are accessed as a single code unit in [[UTF-16]] encoding and can be encoded in one, two or three bytes in [[UTF-8]]. Code points in Planes 1 through 16 (''supplementary planes'') are accessed as surrogate pairs in UTF-16 and encoded in four bytes in UTF-8. -== Система кодирования == -Универсальная система кодирования (Юникод) представляет собой набор графических символов и способ их кодирования для [[компьютер]]ной обработки текстовых данных. +Within each plane, characters are allocated within named ''[[Block (Unicode)\|blocks]]'' of related characters. 
Although blocks are an arbitrary size, they are always a multiple of 16 code points and often a multiple of 128 code points. Characters required for a given script may be spread out over several different blocks. -Графические символы — это символы, имеющие видимое изображение. Графическим символам противопоставляются управляющие символы и символы форматирования. +===General Category property=== +Each code point has a single [[Character property (Unicode)#General Category\|General Category]] property. The major categories are denoted: Letter, Mark, Number, Punctuation, Symbol, Separator and Other. Within these categories, there are subdivisions. In most cases other properties must be used to sufficiently specify the characteristics of a code point. The possible General Categories are: -Графические символы включают в себя следующие группы: -* буквы, содержащиеся хотя бы в одном из обслуживаемых [[алфавит]]ов; -* цифры; -* знаки пунктуации; -* специальные знаки ([[математика\|математические]], технические, [[идеограмма\|идеограммы]] и пр.); -* разделители. +{{General Category (Unicode)}} -Юникод — это система для линейного представления текста. Символы, имеющие дополнительные над- или подстрочные элементы, могут быть представлены в виде построенной по определённым правилам последовательности кодов (составной вариант, composite character) или в виде единого символа (монолитный вариант, precomposed character). С 2014 года считается, что все буквы крупных письменностей в Юникод внесены, и если символ доступен в составном варианте, дублировать его в монолитном виде не нужно. +Code points in the range U+D800–U+DBFF (1,024 code points) are known as high-'''surrogate''' code points, and code points in the range U+DC00–U+DFFF (1,024 code points) are known as low-surrogate code points. A high-surrogate code point followed by a low-surrogate code point form a surrogate pair in [[UTF-16]] to represent code points greater than U+FFFF. 
These code points otherwise cannot be used (this rule is ignored often in practice especially when not using UTF-16). -=== Политика консорциума === -Консорциум не создаёт нового, а констатирует сложившийся порядок вещей<ref name="emoji">[http://www.unicode.org/faq/emoji_dingbats.html FAQ — Emoji{{nbsp}}& Dingbats]</ref>. Например, картинки «[[эмодзи]]» были добавлены потому, что японские операторы мобильной связи широко их использовали. Для этого добавление символа проходит через сложный процесс<ref name="emoji" />. И, например, [[символ российского рубля]] прошёл его за три месяца, как только получил официальный статус, причём до этого он много лет де-факто использовался и его отказывались включить в Юникод. +A small set of code points are guaranteed never to be used for encoding characters, although applications may make use of these code points internally if they wish. There are sixty-six of these '''noncharacters''': U+FDD0–U+FDEF and any code point ending in the value FFFE or FFFF (i.e., U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF). The set of noncharacters is stable, and no new noncharacters will ever be defined.<ref name="stability-policy">{{cite web +\| title = Unicode Character Encoding Stability Policy +\| url = https://unicode.org/policies/stability_policy.html +\| accessdate = 2010-03-16}} +</ref> Like surrogates, the rule that these cannot be used is often ignored, although the operation of the [[byte order mark]] assumes that U+FFFE will never be the first code point in a text. -[[Товарный знак\|Товарные знаки]] кодируют только в порядке исключения. Так, в Юникоде нет флага [[Windows]] или яблока [[Apple]]. +Excluding surrogates and noncharacters leaves 1,111,998 code points available for use. -Как только символ появился в кодировке, он никогда не сдвинется и не исчезнет. Если же потребуется изменить порядок символов, это делается не переменой позиций, а национальным порядком сортировки. 
Есть и другие, более тонкие гарантии стабильности — например, не будут меняться таблицы нормализации<ref>[http://www.unicode.org/policies/stability_policy.html Unicode Character Encoding Stability Policy]</ref>. +'''Private-use''' code points are considered to be assigned characters, but they have no interpretation specified by the Unicode standard<ref>{{cite web +\| title = Properties +\| url = https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G43463 +\| accessdate = 2020-03-15 }} +</ref> so any interchange of such characters requires an agreement between sender and receiver on their interpretation. There are three private-use areas in the Unicode codespace: -=== Объединение и дублирование символов === -Один и тот же символ может иметь несколько форм; в Юникод эти формы входят одной кодовой позицией: -* если это сложилось исторически. Например, у [[арабское письмо\|арабских букв]] есть четыре формы: обособленная, в начале, в середине и в конце<ref>Впоследствии конкретным формам арабских букв отвели отдельные позиции. Но всё равно рекомендуется писать по-арабски «общими» вариантами букв.</ref>; -* либо если в одном языке принята одна форма, а в другом — другая. [[Болгарская кириллица (типографика)\|Болгарская кириллица]] отличается от русской, а китайские иероглифы — от японских. +* Private Use Area: U+E000–U+F8FF (6,400 characters) +* Supplementary Private Use Area-A: U+F0000–U+FFFFD (65,534 characters) +* Supplementary Private Use Area-B: U+100000–U+10FFFD (65,534 characters). -С другой стороны, если исторически в шрифтах у разных форм начертания были разные символы, то они остаются разными и в Юникоде. Например, строчная греческая [[Сигма (буква)\|сигма]] имеет две формы, и в Юникоде у них разные коды; буква [[Расширенная латиница\|расширенной латиницы]]{{nbsp}}[[Å (латиница)\|Å]] ({{nobr\|A с кружком}}) и знак [[ангстрем]]а{{nbsp}}Å, греческая буква{{nbsp}}[[Мю\|μ]] и обозначение приставки «[[микро-]]»{{nbsp}}µ — тоже имеют разные кодовые позиции. 
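The code point counts quoted in this section, and the way a surrogate pair is derived from a supplementary code point, can be checked with a short Python sketch. It assumes only the ranges stated in the text; the names `is_noncharacter` and `to_surrogate_pair` are hypothetical helpers introduced for illustration.

```python
CODESPACE = range(0x0000, 0x10FFFF + 1)        # 2**16 + 2**20 code points
SURROGATES = range(0xD800, 0xDFFF + 1)         # 2**11 code points
PRIVATE_USE = [
    range(0xE000, 0xF8FF + 1),                 # Private Use Area
    range(0xF0000, 0xFFFFD + 1),               # Supplementary Private Use Area-A
    range(0x100000, 0x10FFFD + 1),             # Supplementary Private Use Area-B
]


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for code points ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = cp - 0x10000                      # a 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


if __name__ == "__main__":
    valid = len(CODESPACE) - len(SURROGATES)
    nonchars = sum(1 for cp in CODESPACE if is_noncharacter(cp))
    print("code points excluding surrogates:", valid)
    print("noncharacters:", nonchars)
    print("excluding noncharacters as well:", valid - nonchars)
    print("private-use sizes:", [len(r) for r in PRIVATE_USE])
    high, low = to_surrogate_pair(0x13254)
    print(f"U+13254 -> {high:04X} {low:04X}")
```

The surrogate arithmetic can be cross-checked against Python's own UTF-16 codec, since `"\U00013254".encode("utf-16-be")` yields the same two 16-bit code units.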
+Graphic characters are characters defined by Unicode to have particular semantics, and either have a visible [[glyph]] shape or represent a visible space. As of Unicode 13.0 there are 143,696 graphic characters. -Конечно же, похожие символы в неродственных письменностях также ставятся в разные кодовые позиции. Например, буква{{nbsp}}А в [[Латиница\|латинице]], [[Кириллица\|кириллице]], [[Греческий алфавит\|греческом]] и [[Письмо чероки\|чероки]] — разные символы. +'''Format''' characters are characters that do not have a visible appearance, but may have an effect on the appearance or behavior of neighboring characters. For example, {{unichar\|200C\|Zero-width non-joiner\|nlink=}} and {{unichar\|200D\|Zero-width joiner\|nlink=}} may be used to change the default shaping behavior of adjacent characters (e.g., to inhibit ligatures or request ligature formation). There are 163 format characters in Unicode 13.0. -Крайне редко один и тот же символ ставится в две разные кодовые позиции для упрощения обработки текста. [[Штрих (математика)\|Математический штрих]] и такой же штрих для индикации [[Мягкий звук\|мягкости звуков]] — разные символы, второй считается буквой. +Sixty-five code points (U+0000–U+001F and U+007F–U+009F) are reserved as '''control''' codes, and correspond to the [[C0 and C1 control codes]] defined in [[ISO/IEC 6429]]. U+0009 (Tab), U+000A (Line Feed), and U+000D (Carriage Return) are widely used in Unicode-encoded texts. In practice the C1 code points are often improperly-translated ([[Mojibake]]) legacy [[Windows-1252]] characters used by some English and Western European texts with Windows technologies. -== Комбинируемые символы == -[[Файл:Diacritic-j.png\|right\|thumb\|Представление символа «Й» (U+0419) в виде базового символа «И» (U+0418) и комбинируемого символа « ̆» (U+0306).]] -Cимволы в Юникоде подразделяются на базовые ({{lang-en\|base characters}}) и комбинируемые ({{lang-en\|combining characters}}). 
Комбинируемые символы обычно следуют за базовым и изменяют его отображение определённым образом. К комбинируемым символам, например, относятся [[диакритические знаки]], знаки ударения. Например, русскую букву «Й» в Юникоде можно записать в виде базового символа «И» (U+0418) и комбинируемого символа « ̆» (U+0306), отображаемого над базовым. +Graphic characters, format characters, control code characters, and private use characters are known collectively as ''assigned characters''. '''Reserved''' code points are those code points which are available for use, but are not yet assigned. As of Unicode 13.0 there are 830,606 reserved code points. -Комбинируемые символы помечены в таблицах символов Юникода особыми категориями: -* Nonspacing Mark — безинтервальный (непротяжённый) знак; таковые обычно отображаются над или под базовым символом и не занимают отдельной горизонтальной позиции (интервала) в отображаемой строке; -* Enclosing Mark — обрамляющий знак; эти символы также не занимают отдельной горизонтальной позиции (интервала) в отображаемой строке, но отображаются сразу с нескольких сторон базового символа; -* Spacing Combining Mark — интервальный (протяжённый) комбинируемый знак; таковые, как и базовый символ, занимают отдельную горизонтальную позицию (интервал) в отображаемой строке. +===Abstract characters=== +The set of graphic and format characters defined by Unicode does not correspond directly to the repertoire of ''abstract characters'' that is representable under Unicode. Unicode encodes characters by associating an abstract character with a particular code point.<ref>{{cite web +\| title = Unicode Character Encoding Model +\| url = https://unicode.org/reports/tr17/ +\| accessdate = 2010-03-16}} +</ref> However, not all abstract characters are encoded as a single Unicode character, and some abstract characters may be represented in Unicode by a sequence of two or more characters. 
For example, a Latin small letter "i" with an [[ogonek]], a [[dot above]], and an [[acute accent]], which is required in [[Lithuanian language\|Lithuanian]], is represented by the character sequence U+012F, U+0307, U+0301. Unicode maintains a list of uniquely named character sequences for abstract characters that are not directly encoded in Unicode.<ref>{{cite web +\| title = Unicode Named Sequences +\| url = https://unicode.org/Public/UNIDATA/NamedSequences.txt +\| accessdate = 2010-03-16}} +</ref> -Особый тип комбинируемых символов — селекторы варианта начертания ({{lang-en\|variation selectors}}). Они действуют только на те базовые символы, для которых такие варианты определены. К примеру, в версии Юникода 5.0 варианты начертания определены для ряда математических символов, для символов традиционного [[монгольский алфавит\|монгольского алфавита]] и для символов [[Монгольское квадратное письмо\|монгольского квадратного письма]]. +All graphic, format, and private use characters have a unique and immutable name by which they may be identified. This immutability has been guaranteed since Unicode version 2.0 by the Name Stability policy.<ref name="stability-policy" /> In cases where the name is seriously defective and misleading, or has a serious typographical error, a formal alias may be defined, and applications are encouraged to use the formal alias in place of the official character name. 
For example, {{unichar\|A015\|YI SYLLABLE WU}} has the formal alias {{sc2\|YI SYLLABLE ITERATION MARK}}, and {{unichar\|FE18\|PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRA'''KC'''ET\|note=[[sic]]}} has the formal alias {{sc2\|PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRA'''CK'''ET}}.<ref>{{cite web +\| title = Unicode Name Aliases +\| url = https://unicode.org/Public/UNIDATA/NameAliases.txt +\| accessdate = 2010-03-16}}</ref> -== Алгоритмы нормализации == -Из-за наличия в Юникоде комбинируемых символов одни и те же знаки письменности можно представить различными кодами. Так, например, букву "Й" в примере выше можно записать как отдельным символом, так и сочетанием базового и комбинированного. Из-за этого сравнение строк байт за байтом становится невозможным. Алгоритмы нормализации ({{lang-en\|normalization forms}}) решают эту проблему, выполняя приведение символов к определённому стандартному виду. Приведение осуществляется путём замены символов на эквивалентные с использованием таблиц и правил. «Декомпозицией» называется замена (разложение) одного символа на несколько составляющих символов, а «композицией», наоборот, — замена (соединение) нескольких составляющих символов на один символ. +===Ready-made versus composite characters=== +Unicode includes a mechanism for modifying characters that greatly extends the supported glyph repertoire. This covers the use of [[combining diacritical mark]]s that may be added after the base character by the user. Multiple combining diacritics may be simultaneously applied to the same character. Unicode also contains [[precomposed character\|precomposed]] versions of most letter/diacritic combinations in normal use. These make conversion to and from legacy encodings simpler, and allow applications to use Unicode as an internal text format without having to implement combining characters. 
For example, ''é'' can be represented in Unicode as [[#Upluslink\|U+]]0065 ({{sc2\|LATIN SMALL LETTER E}}) followed by U+0301 ({{sc2\|COMBINING ACUTE ACCENT}}), but it can also be represented as the precomposed character U+00E9 ({{sc2\|LATIN SMALL LETTER E WITH ACUTE}}). Thus, in many cases, users have multiple ways of encoding the same character. To deal with this, Unicode provides the mechanism of [[canonical equivalence]]. -В стандарте Юникода определены четыре алгоритма нормализации текста: NFD, NFC, NFKD и NFKC. +An example of this arises with [[Hangul]], the Korean alphabet. Unicode provides a mechanism for composing Hangul syllables with their individual subcomponents, known as [[Hangul Jamo]]. However, it also provides 11,172 combinations of precomposed syllables made from the most common jamo. -=== NFD === -NFD, {{lang-en\|'''n'''ormalization '''f'''orm '''D'''}} («D» от {{lang-en\|'''d'''ecomposition}}), форма нормализации D — каноническая декомпозиция — алгоритм, согласно которому выполняется рекурсивное разложение составных символов ({{lang-en\|precomposed characters}}) на последовательность из одного или нескольких простых символов в соответствии с таблицами декомпозиции. Рекурсивное потому, что в процессе разложения составной символ может быть разложен на несколько других, некоторые из которых тоже являются составными, и к которым применяется дальнейшее разложение. +The [[CJK]] characters currently have codes only for their precomposed form. Still, most of those characters comprise simpler elements (called [[Radical_(Chinese_characters)\|radicals]]), so in principle Unicode could have decomposed them as it did with Hangul. This would have greatly reduced the number of required code points, while allowing the display of virtually every conceivable character (which might do away with some of the problems caused by [[Han unification]]). A similar idea is used by some [[input method]]s, such as [[Cangjie method\|Cangjie]] and [[Wubi method\|Wubi]]. 
However, attempts to do this for character encoding have stumbled over the fact that Chinese characters do not decompose as simply or as regularly as Hangul does. -Примеры: -{\| style="text-align:center" -\| -{\| class="wikitable" style="text-align:center" - \| Ω - \|- - \| <small>U+2126</small> - \|} -\| colspan="2" \| → -\| -{\| class="wikitable" - \| Ω - \|- - \| <small>U+03A9</small> - \|} -\|} -{\| style="text-align:center" -\| -{\| class="wikitable" style="text-align:center" - \| <big>Å</big> - \|- - \| <small>U+00C5</small> - \|} -\| colspan="2" \| → -\| -{\| class="wikitable" - \| <big>A</big> - \|- - \| <small>U+0041</small> - \|} -\| -{\| class="wikitable" - \| <big> ̊</big> - \|- - \| <small>U+030A</small> - \|} -\|} -{\| style="text-align:center" -\| -{\| class="wikitable" style="text-align:center" - \| <big>ṩ</big> - \|- - \| <small>U+1E69</small> - \|} -\| colspan="2" \| → -\| -{\| class="wikitable" - \| <big>s</big> - \|- - \| <small>U+0073</small> - \|} -\| -{\| class="wikitable" - \| <big> ̣</big> - \|- - \| <small>U+0323</small> - \|} -\| -{\| class="wikitable" - \| <big> ̇</big> - \|- - \| <small>U+0307</small> - \|} -\|} -{\| style="text-align:center" -\| -{\| class="wikitable" style="text-align:center" - \| colspan="2" \| <big>ḍ̇</big> - \|- - \| <small>U+1E0B</small> \|\| <small>U+0323</small> - \|} -\| colspan="2" \| → -\| -{\| class="wikitable" - \| <big>d</big> - \|- - \| <small>U+0064</small> - \|} -\| -{\| class="wikitable" - \| <big> ̣</big> - \|- - \| <small>U+0323</small> - \|} -\| -{\| class="wikitable" - \| <big> ̇</big> - \|- - \| <small>U+0307</small> - \|} -\|} -{\| style="text-align:center" -\| -{\| class="wikitable" style="text-align:center" - \| colspan="3" \| <big>q̣̇</big> - \|- - \| <small>U+0071</small> \|\| <small>U+0307</small> \|\| <small>U+0323</small> - \|} -\| colspan="2" \| → -\| -{\| class="wikitable" - \| <big>q</big> - \|- - \| <small>U+0071</small> - \|} -\| -{\| class="wikitable" - \| <big> ̣</big> - \|- - \| 
<small>U+0323</small> - \|} -\| -{\| class="wikitable" - \| <big> ̇</big> - \|- - \| <small>U+0307</small> - \|} -\|} +A set of [[Radical (Chinese character)\|radicals]] was provided in Unicode 3.0 (CJK radicals between U+2E80 and U+2EFF, KangXi radicals in U+2F00 to U+2FDF, and ideographic description characters from U+2FF0 to U+2FFB), but the Unicode standard (ch. 12.2 of Unicode 5.2) warns against using [[Ideographic Description Sequences\|ideographic description sequences]] as an alternate representation for previously encoded characters: -=== NFC === -NFC, {{lang-en\|'''n'''ormalization '''f'''orm '''C'''}} («C» от {{lang-en\|'''c'''omposition}}), форма нормализации C — алгоритм, согласно которому последовательно выполняются каноническая декомпозиция и каноническая композиция. Сначала каноническая декомпозиция (алгоритм NFD) приводит текст к форме D. Затем каноническая композиция — операция, обратная NFD, обрабатывает текст от начала к концу с учётом следующих правил: -* символ <code>S</code> считается ''начальным'', если имеет нулевой класс комбинируемости ({{lang-en\|combining class of zero}}) согласно таблице символов Юникода; -* в любой последовательности символов, начинающейся с символа <code>S</code>, символ <code>C</code> блокируется от <code>S</code>, только если между <code>S</code> и <code>C</code> есть какой-либо символ <code>B</code>, который либо является начальным, либо имеет одинаковый или больший класс комбинируемости, чем <code>C</code>. 
Это правило распространяется только на строки, прошедшие каноническую декомпозицию; -* символ считается ''первичным'' композитом, если имеет каноническую декомпозицию в таблице символов Юникода (или каноническую декомпозицию для [[Хангыль\|хангыля]] и он не входит в [http://www.unicode.org/unicode/reports/tr15/#Primary_Exclusion_List_Table список исключений]); -* символ <code>X</code> может быть первично совмещён с символом <code>Y</code>, если и только если существует первичный композит <code>Z</code>, канонически эквивалентный последовательности <<code>X</code>, <code>Y</code>>; -* если очередной символ <code>C</code> не блокируется последним встреченным начальным базовым символом <code>L</code> и он может быть успешно первично совмещён с ним, то <code>L</code> заменяется на композит <code>L-C</code>, а <code>C</code> удаляется. +{{quote\|This process is different from a formal ''encoding'' of an ideograph. There is no canonical description of unencoded ideographs; there is no semantic assigned to described ideographs; there is no equivalence defined for described ideographs. Conceptually, ideographic descriptions are more akin to the English phrase "an 'e' with an acute accent on it" than to the character sequence <U+0065, U+0301>.}} -Пример: -{\| style="text-align:center" -\| -{\| class="wikitable" style="text-align:center" - \| <big>o</big> - \|- - \| <small>U+006F</small> - \|} -\| -{\| class="wikitable" style="text-align:center" - \| <big> ̂</big> - \|- - \| <small>U+0302</small> - \|} -\| colspan="2" \| → -\| -{\| class="wikitable" - \| <big>ô</big> - \|- - \| <small>U+00F4</small> - \|} -\|} +===Ligatures=== +Many scripts, including [[Arabic script\|Arabic]] and [[Devanagari\|Devanāgarī]], have special orthographic rules that require certain combinations of letterforms to be combined into special [[ligature (typography)\|ligature forms]]. 
The rules governing ligature formation can be quite complex, requiring special script-shaping technologies such as ACE (Arabic Calligraphic Engine by DecoType in the 1980s and used to generate all the Arabic examples in the printed editions of the Unicode Standard), which became the [[proof of concept]] for [[OpenType]] (by Adobe and Microsoft), [[Graphite (SIL)\|Graphite]] (by [[SIL International]]), or [[Apple Advanced Typography\|AAT]] (by Apple). -=== NFKD === -NFKD, {{lang-en\|'''n'''ormalization '''f'''orm '''KD'''}}, форма нормализации KD — совместимая декомпозиция — алгоритм, согласно которому последовательно выполняются каноническая декомпозиция и замены символов текста по таблицам совместимой декомпозиции. Таблицы совместимой декомпозиции предусматривают замену на почти эквивалентные символы<ref>[http://habrahabr.ru/post/45489/ Нормализация Unicode]</ref>: -* похожих на буквы (ℍ и ℌ); -* обведённых кружками (①); -* с изменёнными размерами (ｶ и カ); -* повёрнутых (︷ и {); -* степеней (⁹ и ₉); -* дробей (¼); -* других (™). +Instructions are also embedded in fonts to tell the [[operating system]] how to properly output different character sequences. A simple solution to the placement of combining marks or diacritics is assigning the marks a width of zero and placing the glyph itself to the left or right of the left sidebearing (depending on the direction of the script they are intended to be used with). A mark handled this way will appear over whatever character precedes it, but will not adjust its position relative to the width or height of the base glyph; it may be visually awkward and it may overlap some glyphs. Real stacking is impossible, but can be approximated in limited cases (for example, Thai top-combining vowels and tone marks can just be at different heights to start with). Generally this approach is only effective in monospaced fonts, but may be used as a fallback rendering method when more complex methods fail. 
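The compatibility decomposition (NFKD) described in the removed text above can be tried out with Python's standard `unicodedata` module. This is only a minimal sketch: the helper name `code_points` and the choice of sample characters are illustrative assumptions, and the results depend on the Unicode data bundled with the interpreter.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """Return the code points of `text` in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Sample characters taken from the categories listed for NFKD.
samples = {
    "letter-like (double-struck H)": "\u210d",
    "circled digit one": "\u2460",
    "vulgar fraction one quarter": "\u00bc",
    "trade mark sign": "\u2122",
}

for name, ch in samples.items():
    decomposed = unicodedata.normalize("NFKD", ch)
    print(f"{name}: {code_points(ch)} -> {code_points(decomposed)}")
```

Each sample is replaced by its nearly equivalent plain characters; for instance the fraction becomes the three code points U+0031, U+2044, U+0034.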
-Примеры: -{\| style="text-align:center;" -\| -{\| class="wikitable" style="text-align:center;" - \| <big>ℍ</big> - \|- - \| <small>U+210d</small> - \|} -\| colspan="2" \| → -\| -{\| class="wikitable" style="text-align:center;" - \| <big>H</big> - \|- - \| <small>U+0048</small> - \|} -\|} -{\| style="text-align:center;" -\| -{\| class="wikitable" style="text-align:center;" - \| <big>①</big> - \|- - \| <small>U+2460</small> - \|} -\| colspan="2" \| → -\| -{\| class="wikitable" style="text-align:center;" - \| <big>1</big> - \|- - \| <small>U+0031</small> - \|} -\|} -{\| -\| -{\| class="wikitable" style="text-align:center;" - \| <big>ｶ</big> - \|- - \| <small>U+FF76</small> - \|} -\| colspan="2" \| → -\| -{\| class="wikitable" style="text-align:center;" - \| <big>カ</big> - \|- - \| <small>U+30AB</small> - \|} -\|} -{\| -\| -{\| class="wikitable" style="text-align:center;" - \| <big>︷</big> - \|- - \| <small>U+FE37</small> - \|} -\| colspan="2" \| → -\| -{\| class="wikitable" style="text-align:center;" - \| <big>{</big> - \|- - \| <small>U+007B</small> - \|} -\|} -{\| -\| -{\| class="wikitable" style="text-align:center;" - \| <big>⁹</big> - \|- - \| <small>U+2079</small> - \|} -\| colspan="2" \| → -\| -{\| class="wikitable" style="text-align:center;" - \| <big>9</big> - \|- - \| <small>U+0039</small> - \|} -\|} -{\| -\| -{\| class="wikitable" style="text-align:center;" - \| <big>¼</big> - \|- - \| <small>U+00BC</small> - \|} -\| colspan="2" \| → -\| -{\| class="wikitable" style="text-align:center;" - \| <big>1</big> \|\| <big> ⁄ </big> \|\| <big>4</big> - \|- - \| <small>U+0031</small> \|\| <small>U+2044</small> \|\| <small>U+0034</small> - \|} -\|} -{\| -\| -{\| class="wikitable" style="text-align:center;" - \| <big>™</big> - \|- - \| <small>U+2122</small> - \|} -\| colspan="2" \| → -\| -{\| class="wikitable" style="text-align:center;" - \| <big>T</big> \|\| <big>M</big> - \|- - \| <small>U+0054</small> \|\| <small>U+004D</small> - \|} -\|} +===Standardized subsets=== 
+Several subsets of Unicode are standardized: Microsoft Windows since [[Windows NT 4.0]] supports [[WGL-4]] with 656 characters, which is considered to support all contemporary European languages using the Latin, Greek, or Cyrillic script. Other standardized subsets of Unicode include the Multilingual European Subsets:<ref>[https://www.evertype.com/standards/iso10646/pdf/cwa13873.pdf CWA 13873:2000 – Multilingual European Subsets in ISO/IEC 10646-1] [[European Committee for Standardization\|CEN]] Workshop Agreement 13873</ref> -=== NFKC === -NFKC, {{lang-en\|'''n'''ormalization '''f'''orm '''KC'''}}, normalization form KC, is an algorithm that applies compatibility decomposition (the NFKD algorithm) followed by canonical composition (the NFC algorithm). +MES-1 (Latin scripts only, 335 characters), MES-2 (Latin, Greek, and Cyrillic; 1062 characters)<ref>[https://www.cl.cam.ac.uk/~mgk25/ucs/mes-2-rationale.html Multilingual European Character Set 2 (MES-2) Rationale], [[Markus Kuhn (computer scientist)\|Markus Kuhn]], 1998</ref> and MES-3A & MES-3B (two larger subsets, not shown here). Note that MES-2 includes every character in MES-1 and WGL-4. -=== Examples === -{\| class="standard" - !Source text\|\|NFD\|\|NFC\|\|NFKD\|\|NFKC - \|- - \| <!-- fi --> -{\| class="wikitable" style="text-align:center;" -\| <big>ﬁ</big> +{\| class="wikitable" +\|+ {{nobold\|'''WGL-4''', ''MES-1'' and MES-2}} \|- -\| <small>U+FB01</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>ﬁ</big> +! Row !! Cells !! 
Range(s) \|- -\| <small>U+FB01</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>ﬁ</big> +!rowspan="2"\| 00 +\| '''''20–7E''''' +\| [[Basic Latin (Unicode block)\|Basic Latin]] (00–7F) \|- -\| <small>U+FB01</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>f</big> \|\| <big>i</big> +\| '''''A0–FF''''' +\| [[Latin-1 Supplement (Unicode block)\|Latin-1 Supplement]] (80–FF) \|- -\| <small>U+0066</small> \|\| <small>U+0069</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>f</big> \|\| <big>i</big> +!rowspan="2"\| 01 +\| '''''00–13,'' 14–15, ''16–2B,'' 2C–2D, ''2E–4D,'' 4E–4F, ''50–7E,'' 7F''' +\| [[Latin Extended-A]] (00–7F) \|- -\| <small>U+0066</small> \|\| <small>U+0069</small> -\|} - \|- - \| <!-- 2^5 --> -{\| class="wikitable" style="text-align:center;" -\| <big>2</big> \|\| <big>⁵</big> +\| 8F, '''92,''' B7, DE-EF, '''FA–FF''' +\| [[Latin Extended-B]] (80–FF <span title="U+024F">...</span>) \|- -\| <small>U+0032</small> \|\| <small>U+2075</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>2</big> \|\| <big>⁵</big> +!rowspan="3"\| 02 +\| 18–1B, 1E–1F +\| Latin Extended-B (<span title="U+00180">...</span> 00–4F) \|- -\| <small>U+0032</small> \|\| <small>U+2075</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>2</big> \|\| <big>⁵</big> +\| 59, 7C, 92 +\| [[IPA Extensions]] (50–AF) \|- -\| <small>U+0032</small> \|\| <small>U+2075</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>2</big> \|\| <big>5</big> +\| BB–BD, '''C6, ''C7,'' C9,''' D6, '''''D8–DB,'' DC, ''DD,''''' DF, EE +\| [[Spacing Modifier Letters]] (B0–FF) \|- -\| <small>U+0032</small> \|\| <small>U+0035</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>2</big> \|\| <big>5</big> +! 
03 +\| 74–75, 7A, 7E, '''84–8A, 8C, 8E–A1, A3–CE,''' D7, DA–E1 +\| [[Greek and Coptic\|Greek]] (70–FF) \|- -\| <small>U+0032</small> \|\| <small>U+0035</small> -\|} - \|- - \| <!-- "s" (looks like "f") with two dots --> -{\| class="wikitable" style="text-align:center;" -\| colspan="2" \| <big>ẛ̣</big> +! 04 +\| '''00–5F, 90–91,''' 92–C4, C7–C8, CB–CC, D0–EB, EE–F5, F8–F9 +\| [[Cyrillic (Unicode block)\|Cyrillic]] (00–FF) \|- -\| <small>U+1E9B</small> \|\| <small>U+0323</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>ſ</big> \|\| <big>̣</big> \|\| <big>̇</big> +! 1E +\| 02–03, 0A–0B, 1E–1F, 40–41, 56–57, 60–61, 6A–6B, '''80–85,''' 9B, '''F2–F3''' +\| [[Latin Extended Additional]] (00–FF) \|- -\| <small>U+017F</small> \|\| <small>U+0323</small> \|\| <small>U+0307</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>ẛ</big> \|\| <big>̣</big> +! 1F +\| 00–15, 18–1D, 20–45, 48–4D, 50–57, 59, 5B, 5D, 5F–7D, 80–B4, B6–C4, C6–D3, D6–DB, DD–EF, F2–F4, F6–FE +\| [[Greek Extended]] (00–FF) \|- -\| <small>U+1E9B</small> \|\| <small>U+0323</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>s</big> \|\| <big>̣</big> \|\| <big>̇</big> +!rowspan="3"\| 20 +\| '''13–14, ''15,'' 17, ''18–19,'' 1A–1B, ''1C–1D,'' 1E, 20–22, 26, 30, 32–33, 39–3A, 3C, 3E, 44,''' 4A +\| [[General Punctuation]] (00–6F) \|- -\| <small>U+0073</small> \|\| <small>U+0323</small> \|\| <small>U+0307</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>ṩ</big> +\| '''7F''', 82 +\| [[Superscripts and Subscripts]] (70–9F) \|- -\| <small>U+1E69</small> -\|} - \|- - \| <!-- "й" --> -{\| class="wikitable" style="text-align:center;" -\| <big>й</big> +\| '''A3–A4, A7, ''AC,''''' AF +\| [[Currency Symbols (Unicode block)\|Currency Symbols]] (A0–CF) \|- -\| <small>U+0439</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>и</big> \|\| <big> ̆</big> +!rowspan="3"\| 21 +\| '''05, 13, 16, ''22, 26,'' 
2E''' +\| [[Letterlike Symbols]] (00–4F) \|- -\| <small>U+0438</small> \|\| <small>U+0306</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>й</big> +\| '''''5B–5E''''' +\| [[Number Forms]] (50–8F) \|- -\| <small>U+0439</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>и</big> \|\| <big> ̆</big> +\| '''''90–93,'' 94–95, A8''' +\| [[Arrows (Unicode block)\|Arrows]] (90–FF) \|- -\| <small>U+0438</small> \|\| <small>U+0306</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>й</big> +! 22 +\| 00, '''02,''' 03, '''06,''' 08–09, '''0F, 11–12, 15, 19–1A, 1E–1F,''' 27–28, '''29,''' 2A, '''2B, 48,''' 59, '''60–61, 64–65,''' 82–83, 95, 97 +\| [[Mathematical Operators]] (00–FF) \|- -\| <small>U+0439</small> -\|} - \|- - \| <!-- "ё" --> -{\| class="wikitable" style="text-align:center;" -\| <big>ё</big> +! 23 +\| '''02, 0A, 20–21,''' 29–2A +\| [[Miscellaneous Technical]] (00–FF) \|- -\| <small>U+0451</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>е</big> \|\| <big>̈</big> +!rowspan="3"\| 25 +\| '''00, 02, 0C, 10, 14, 18, 1C, 24, 2C, 34, 3C, 50–6C''' +\| [[Box Drawing]] (00–7F) \|- -\| <small>U+0435</small> \|\| <small>U+0308</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>ё</big> +\| '''80, 84, 88, 8C, 90–93''' +\| [[Block Elements]] (80–9F) \|- -\| <small>U+0451</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>е</big> \|\| <big>̈</big> +\| '''A0–A1, AA–AC, B2, BA, BC, C4, CA–CB, CF, D8–D9, E6''' +\| [[Geometric Shapes]] (A0–FF) \|- -\| <small>U+0435</small> \|\| <small>U+0308</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>ё</big> +! 26 +\| '''3A–3C, 40, 42, 60, 63, 65–66, ''6A,'' 6B''' +\| [[Miscellaneous Symbols]] (00–FF) \|- -\| <small>U+0451</small> -\|} - \|- - \| <!-- "А" --> -{\| class="wikitable" style="text-align:center;" -\| <big>А</big> +! 
F0 +\| (01–02)<!--in WGL-4, but not in MES-2--> +\| [[Private Use Area (Unicode block)\|Private Use Area]] (00–FF ...) \|- -\| <small>U+0410</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>А</big> +! FB +\| '''01–02''' +\| [[Alphabetic Presentation Forms]] (00–4F) \|- -\| <small>U+0410</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>А</big> -\|- -\| <small>U+0410</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>А</big> -\|- -\| <small>U+0410</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>А</big> -\|- -\| <small>U+0410</small> -\|} - \|- - \| <!-- "が" --> -{\| class="wikitable" style="text-align:center;" -\| <big>が</big> -\|- -\| <small>U+304C</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>か</big> \|\| <big>゙</big> -\|- -\| <small>U+304B</small> \|\| <small>U+3099</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>が</big> -\|- -\| <small>U+304C</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>か</big> \|\| <big>゙</big> -\|- -\| <small>U+304B</small> \|\| <small>U+3099</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>が</big> -\|- -\| <small>U+304C</small> -\|} - \|- - \| <!-- "VIII" --> -{\| class="wikitable" style="text-align:center;" -\| <big>Ⅷ</big> -\|- -\| <small>U+2167</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>Ⅷ</big> -\|- -\| <small>U+2167</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>Ⅷ</big> -\|- -\| <small>U+2167</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>V</big> \|\| <big>I</big> \|\| <big>I</big> \|\| <big>I</big> -\|- -\| <small>U+0056</small> \|\| <small>U+0049</small> \|\| <small>U+0049</small> \|\| <small>U+0049</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>V</big> \|\| <big>I</big> \|\| <big>I</big> \|\| 
<big>I</big> -\|- -\| <small>U+0056</small> \|\| <small>U+0049</small> \|\| <small>U+0049</small> \|\| <small>U+0049</small> -\|} - \|- - \| <!-- "ç" --> -{\| class="wikitable" style="text-align:center;" -\| <big>ç</big> -\|- -\| <small>U+00E7</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>c</big> \|\| <big>̧</big> -\|- -\| <small>U+0063</small> \|\| <small>U+0327</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>ç</big> -\|- -\| <small>U+00E7</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>c</big> \|\| <big>̧</big> -\|- -\| <small>U+0063</small> \|\| <small>U+0327</small> -\|} -\| -{\| class="wikitable" style="text-align:center;" -\| <big>ç</big> -\|- -\| <small>U+00E7</small> -\|} +! FF +\| FD +\| [[Specials (Unicode block)\|Specials]] \|} -== Двунаправленное письмо == -Стандарт Юникод поддерживает письменности языков как с направлением написания слева направо ({{lang-en\|left-to-right, LTR}}), так и с написанием справа налево ({{lang-en\|right-to-left, RTL}}) — например, [[арабское письмо\|арабское]] и [[еврейский алфавит\|еврейское]] письмо. В обоих случаях символы хранятся в «естественном» порядке; их отображение с учётом нужного направления письма обеспечивается приложением. +Rendering software which cannot process a Unicode character appropriately often displays it as an open rectangle, or the Unicode "[[replacement character]]" (U+FFFD, �), to indicate the position of the unrecognized character. Some systems have made attempts to provide more information about such characters. Apple's [[Last Resort font]] will display a substitute glyph indicating the Unicode range of the character, and the [[SIL International]]'s [[Unicode fallback font\|Unicode Fallback]] font will display a box showing the hexadecimal scalar value of the character. 
+ +==={{anchor\|UTF\|UCS}}Mapping and encodings=== + +Several mechanisms have been specified for storing a series of code points as a series of bytes. + +<!-- [[Unicode Transformation Format]] redirects here --> +Unicode defines two mapping methods: the ''Unicode Transformation Format'' (UTF) encodings, and the ''[[Universal Coded Character Set]]'' (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode ''code points'' to sequences of values in some fixed-size range, termed ''code units''. All UTF encodings map code points to a unique sequence of bytes.<ref>{{cite web\|title=UTF-8, UTF-16, UTF-32 & BOM\|url=https://unicode.org/faq/utf_bom.html\|website=Unicode.org FAQ\|accessdate=12 December 2016}}</ref> The numbers in the names of the encodings indicate the number of bits per code unit (for UTF encodings) or the number of bytes per code unit (for UCS encodings and UTF-1). UTF-8 and UTF-16 are the most commonly used encodings. UCS-2 is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent. + +UTF encodings include: + +* [[UTF-1]], a retired predecessor of UTF-8, maximizes compatibility with [[ISO/IEC 2022\|ISO 2022]], no longer part of ''The Unicode Standard'' +* [[UTF-7]], a 7-bit encoding sometimes used in e-mail, often considered obsolete (not part of ''The Unicode Standard'', but only documented as an informational [[Request for Comments\|RFC]], i.e., not on the Internet Standards Track) +* [[UTF-8]], uses one to four bytes for each code point, maximizes compatibility with [[ASCII]] +* [[UTF-EBCDIC]], similar to UTF-8 but designed for compatibility with [[EBCDIC]] (not part of ''The Unicode Standard'') +* [[UTF-16]], uses one or two 16-bit code units per code point, cannot encode surrogates +* [[UTF-32]], uses one 32-bit code unit per code point -Кроме того, Юникод поддерживает комбинированные тексты, сочетающие фрагменты с разным направлением письма. 
Данная возможность называется ''двунаправленность'' ({{lang-en\|bidirectional text, BiDi}}). Некоторые упрощённые обработчики текста (например, в сотовых телефонах) могут поддерживать Юникод, но не иметь поддержки двунаправленности. Все символы Юникода поделены на несколько категорий: пишущиеся слева направо, пишущиеся справа налево, и пишущиеся в любом направлении. Символы последней категории (в основном это [[знаки пунктуации]]) при отображении принимают направление окружающего их текста. +UTF-8 uses one to four bytes per code point and, being compact for Latin scripts and ASCII-compatible, provides the ''de facto'' standard encoding for interchange of Unicode text. It is used by [[FreeBSD]] and most recent [[Linux distributions]] as a direct replacement for legacy encodings in general text handling. -== Представленные символы == -[[Файл:Roadmap to Unicode BMP multilingual.svg\|lang=ru\|right\|500px\|thumb\|Схема [[Плоскость (Юникод)#Основная многоязычная плоскость\|основной мультиязычной плоскости]] Юникода]] -{{Main\|Плоскость (Юникод)}} +The UCS-2 and UTF-16 encodings specify the Unicode [[Byte Order Mark]] (BOM) for use at the beginnings of text files, which may be used for byte ordering detection (or [[endianness\|byte endianness]] detection). The BOM, code point U+FEFF has the important property of unambiguity on byte reorder, regardless of the Unicode encoding used; U+FFFE (the result of byte-swapping U+FEFF) does not equate to a legal character, and U+FEFF in other places, other than the beginning of text, conveys the zero-width non-break space (a character with no appearance and no effect other than preventing the formation of [[ligature (typography)\|ligatures]]). 
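The byte-order detection described above can be sketched in Python with the BOM constants from the standard `codecs` module. The function name `detect_utf16_byte_order` is a hypothetical helper, and the sketch assumes the data is already known to be UTF-16 (a UTF-32 little-endian BOM begins with the same two bytes, so it would be misreported here).

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' from a leading UTF-16 BOM.

    Assumes `data` is UTF-16; other encodings are not distinguished.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return "unknown"


big_endian = "\ufeffhi".encode("utf-16-be")
little_endian = "\ufeffhi".encode("utf-16-le")

print(detect_utf16_byte_order(big_endian))     # big
print(detect_utf16_byte_order(little_endian))  # little
print(detect_utf16_byte_order(b"hi"))          # unknown

# Reading the big-endian BOM bytes in the wrong order yields U+FFFE,
# which is not a legal character, so the two orders cannot be confused.
print(hex(int.from_bytes(codecs.BOM_UTF16_BE, "little")))  # 0xfffe
```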
-Юникод включает практически все современные [[письменность\|письменности]], в том числе: -{{columns-list\|2\| -* [[арабское письмо\|арабскую]], -* [[армянское письмо\|армянскую]], -* [[бенгальское письмо\|бенгальскую]], -* [[Бирманское письмо\|бирманскую]], -* [[Глаголица\|глаголицу]], -* [[Греческий алфавит\|греческую]], -* [[грузинское письмо\|грузинскую]], -* [[деванагари]], -* [[еврейский алфавит\|еврейскую]], -* [[Кириллица\|кириллицу]], -* [[китайское письмо\|китайскую]] (китайские иероглифы активно используются в [[японский язык\|японском языке]], а также изредка в [[корейский язык\|корейском]]), -* [[коптское письмо\|коптскую]], -* [[Кхмерское письмо\|кхмерскую]], -* [[Латинский алфавит\|латинскую]], -* [[Тамильское письмо\|тамильскую]], -* [[Хангыль\|корейскую (хангыль)]], -* [[письмо чероки\|чероки]], -* [[Эфиопское письмо\|эфиопскую]], -* [[японское письмо\|японскую]] (которая включает в себя, кроме [[кана\|слоговой азбуки]], ещё и [[кандзи\|китайские иероглифы]]) -}} -и другие. +The same character converted to UTF-8 becomes the byte sequence <code>EF BB BF</code>. The Unicode Standard allows that the BOM "can serve as signature for UTF-8 encoded text where the character set is unmarked".<ref>{{Cite book \| title=The Unicode Standard, Version 6.2 \| publisher=The Unicode Consortium \| year=2013 \| isbn=978-1-936213-08-5 \| page=561 }}</ref> Some software developers have adopted it for other encodings, including UTF-8, in an attempt to distinguish UTF-8 from local 8-bit [[code page]]s. However {{IETF RFC\|3629}}, the UTF-8 standard, recommends that byte order marks be forbidden in protocols using UTF-8, but discusses the cases where this may not be possible. In addition, the large restriction on possible patterns in UTF-8 (for instance there cannot be any lone bytes with the high bit set) means that it should be possible to distinguish UTF-8 from other character encodings without relying on the BOM. 
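Both points in the paragraph above, the <code>EF BB BF</code> signature and the possibility of recognising UTF-8 without it, can be illustrated with a short Python sketch. The helper names `looks_like_utf8` and `strip_utf8_signature` are assumptions made for this example; strict decoding is only a heuristic, since short legacy-encoded texts can happen to be valid UTF-8.

```python
import codecs


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if `data` decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def strip_utf8_signature(data: bytes) -> bytes:
    """Remove a leading UTF-8 signature (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# U+FEFF encoded as UTF-8 gives the three signature bytes.
print("\ufeff".encode("utf-8").hex(" "))  # ef bb bf

# A lone byte with the high bit set (here Latin-1 "e acute") is not valid UTF-8.
print(looks_like_utf8("\u00e9".encode("latin-1")))  # False
print(looks_like_utf8("\u00e9".encode("utf-8")))    # True

print(strip_utf8_signature(b"\xef\xbb\xbfabc"))     # b'abc'
```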
-С академическими целями добавлены многие исторические письменности, в том числе: [[руны\|германские руны]], [[Древнетюркское письмо\|древнетюркские руны]], [[древнегреческий язык\|древнегреческая письменность]], [[египетские иероглифы]], [[клинопись]], [[письменность майя]], [[этрусский алфавит]]. +In UTF-32 and UCS-4, one [[32-bit]] code unit serves as a fairly direct representation of any character's code point (although the endianness, which varies across different platforms, affects how the code unit manifests as a byte sequence). In the other encodings, each code point may be represented by a variable number of code units. UTF-32 is widely used as an internal representation of text in programs (as opposed to stored or transmitted text), since every Unix operating system that uses the [[GNU Compiler Collection\|gcc]] compilers to generate software uses it as the standard "[[wide character]]" encoding. Some programming languages, such as [[Seed7]], use UTF-32 as internal representation for strings and characters. Recent versions of the [[Python (programming language)\|Python]] programming language (beginning with 2.2) may also be configured to use UTF-32 as the representation for Unicode strings, effectively disseminating such encoding in [[high-level programming language\|high-level]] coded software. -В Юникоде представлен широкий набор [[таблица математических символов\|математических]] и [[музыка]]льных символов, а также [[пиктограмма\|пиктограмм]]. +[[Punycode]], another encoding form, enables the encoding of Unicode strings into the limited character set supported by the [[ASCII]]-based [[Domain Name System]] (DNS). The encoding is used as part of [[IDNA]], which is a system enabling the use of [[Internationalized Domain Names]] in all scripts that are supported by Unicode. Earlier and now historical proposals include [[UTF-5]] and [[UTF-6]]. -[[государственный флаг\|Государственные флаги]] не включены в Юникод напрямую. 
Для их кодирования используются пары из 26 буквенных символов, предназначенных для представления двухбуквенных кодов стран по стандарту [[ISO 3166-1 alpha-2]]. Эти буквы закодированы в диапазоне от {{unichar\|1F1E6\|regional indicator symbol letter a\|html=}} до {{unichar\|1F1FF\|regional indicator symbol letter z\|html=}}. +[[GB 18030\|GB18030]] is another encoding form for Unicode, from the [[Standardization Administration of China]]. It is the official [[character set]] of the [[People's Republic of China]] (PRC). [[Binary Ordered Compression for Unicode\|BOCU-1]] and [[Standard Compression Scheme for Unicode\|SCSU]] are Unicode compression schemes. The [[April Fools' Day RFC]] of 2005 specified two [[parody]] UTF encodings, [[UTF-9]] and [[UTF-18]]. -В Юникод принципиально не включаются [[логотип]]ы компаний и продуктов, хотя они и встречаются в шрифтах (например, логотип [[Apple]] в кодировке [[MacRoman]] (0xF0) или логотип [[Microsoft Windows\|Windows]] в шрифте Wingdings (0xFF)). В юникодовских шрифтах логотипы должны размещаться только в области пользовательских символов. +==Adoption== -== ISO/IEC 10646 == -Консорциум Юникода работает в тесной связи с рабочей группой ISO/IEC/JTC1/SC2/WG2, которая занимается разработкой международного стандарта 10646 ([[ISO]]/[[IEC]] 10646). Между стандартом Юникода и ISO/IEC 10646 установлена синхронизация, хотя каждый стандарт использует свою терминологию и систему документации. +===Operating systems=== +Unicode has become the dominant scheme for internal processing and storage of text. Although a great deal of text is still stored in legacy encodings, Unicode is used almost exclusively for building new information processing systems. Early adopters tended to use [[UCS-2]] (the fixed-width two-byte precursor to UTF-16) and later moved to [[UTF-16]] (the variable-width current standard), as this was the least disruptive way to add support for non-BMP characters. 
The best known such system is [[Windows NT]] (and its descendants, [[Windows 2000]], [[Windows XP]], [[Windows Vista]], [[Windows 7]], [[Windows 8]] and [[Windows 10]]), which uses UTF-16 as the sole internal character encoding. The [[Java virtual machine\|Java]] and [[.NET Framework\|.NET]] bytecode environments, [[macOS]], and [[KDE]] also use it for internal representation. Partial support for Unicode can be installed on [[Windows 9x]] through the [[Microsoft Layer for Unicode]]. -Cooperation between the Unicode Consortium and the International Organization for Standardization (ISO) began in [[1991 год\|1991]]. In [[1993 год\|1993]], ISO released the DIS 10646.1 standard. To synchronize with it, the Consortium approved version 1.1 of the Unicode Standard, which added further characters from DIS 10646.1. As a result, the values of the encoded characters in Unicode 1.1 and DIS 10646.1 coincided completely. +[[UTF-8]] (originally developed for [[Plan 9 from Bell Labs\|Plan 9]])<ref>{{cite web + \| url = https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt + \| title = UTF-8 history + \| first = Rob \| last = Pike \| authorlink = Rob Pike + \| date = 2003-04-30 +}}</ref> has become the main storage encoding on most [[Unix-like]] operating systems (though others are also used by some libraries) because it is a relatively easy replacement for traditional [[extended ASCII]] character sets. UTF-8 is also the most common Unicode encoding used in [[HTML]] documents on the [[World Wide Web]]. -Cooperation between the two organizations continued afterwards. In 2000, the Unicode 3.0 standard was synchronized with ISO/IEC 10646-1:2000. The forthcoming third version of ISO/IEC 10646 is to be synchronized with Unicode 4.0. These specifications may even be published as a single standard. 
+Multilingual text-rendering engines which use Unicode include [[Uniscribe]] and [[DirectWrite]] for Microsoft Windows, [[ATSUI]] and [[Core Text]] for macOS, and [[Pango]] for [[GTK+]] and the [[GNOME]] desktop. -Like the UTF-16 and UTF-32 formats in the Unicode Standard, ISO/IEC 10646 also has two main character encoding forms: UCS-2 (2 bytes per character, similar to UTF-16) and UCS-4 (4 bytes per character, similar to UTF-32). UCS stands for ''universal coded character set''. UCS-2 can be regarded as a subset of UTF-16 (UTF-16 without surrogate pairs), and UCS-4 is a synonym for UTF-32. +===Input methods=== +{{Main\|Unicode input}} -Differences between the Unicode Standard and ISO/IEC 10646: -* minor differences in terminology; -* ISO/IEC 10646 does not include the sections needed for a full implementation of Unicode support: -** no data on the binary encoding of characters; -** no description of the collation and rendering algorithms; -** no list of character properties (for example, no list of the properties needed to support bidirectional text). +Because keyboard layouts cannot have simple key combinations for all characters, several operating systems provide alternative input methods that allow access to the entire repertoire. -== Representation forms == -Unicode has several representation forms (Unicode transformation format, UTF): [[UTF-8]], [[UTF-16]] (UTF-16BE, UTF-16LE) and [[UTF-32]] (UTF-32BE, UTF-32LE). A further form, [[UTF-7]], was developed for transmission over seven-bit channels, but because of its incompatibility with [[ASCII]] it did not become widespread and is not included in the standard. On 1 April 2005, two [[День смеха\|joke]] representation forms were proposed: UTF-9 and UTF-18 ([http://tools.ietf.org/html/rfc4042 RFC{{nbsp}}4042]). 
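To make the difference between the UTF-8, UTF-16 and UTF-32 forms listed above concrete, the following Python sketch counts how many code units one code point needs in each form. The function name `code_units` is a hypothetical helper; the explicit big-endian codecs are used only so that no BOM is added to the output.

```python
def code_units(ch: str) -> dict:
    """Count code units needed for one code point in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),           # 8-bit code units
        "utf-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code units
    }


# Latin A, Cyrillic ya, the euro sign, and a character outside the BMP.
for ch in ("A", "\u044f", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```

The last character lies outside the Basic Multilingual Plane, so UTF-16 needs two code units (a surrogate pair) for it, while UTF-32 always needs exactly one.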
+[[ISO/IEC 14755]],<ref>{{cite web\|url=https://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14755.pdf \|title=ISO/IEC JTC1/SC 18/WG 9 N \|date= \|accessdate=2012-06-04}}</ref> which standardises methods for entering Unicode characters from their code points, specifies several methods. In the ''Basic method'', a ''beginning sequence'' is followed by the hexadecimal representation of the code point and the ''ending sequence''. The standard also specifies a ''screen-selection entry method'', where the characters are listed in a table on a screen, such as with a character map program. -[[Microsoft]] [[Windows NT]] and the systems based on it, [[Windows 2000]] and [[Windows XP]], mainly [[Юникод в операционных системах семейства Microsoft Windows\|use]] the UTF-16LE form. The [[UNIX]]-like [[Операционная система\|operating systems]] [[GNU/Linux]], [[BSD]] and [[Mac OS X]] use UTF-8 for files and UTF-32 or UTF-8 for processing characters in [[оперативная память\|main memory]]. +Online tools for finding the code point for a known character include Unicode Lookup<ref>{{cite web\|url=https://unicodelookup.com/\|title=Unicode Lookup\|last=Hedley\|first=Jonathan\|date=2009}}</ref> by Jonathan Hedley and Shapecatcher<ref>{{cite web\|url=http://shapecatcher.com/\|title=Unicode Character Recognition\|last=Milde\|first=Benjamin\|date=2011}}</ref> by Benjamin Milde. In Unicode Lookup, one enters a search key (e.g. "fractions"), and a list of corresponding characters with their code points is returned. In Shapecatcher, based on [[Shape context]], one draws the character in a box and a list of characters approximating the drawing, with their code points, is returned. -[[Punycode]] is another form of encoding sequences of Unicode characters, producing so-called ACE sequences, which consist only of alphanumeric characters, as permitted in domain names. 
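The Punycode transformation mentioned above can be observed with the `punycode` and `idna` codecs that ship with Python. This is a sketch only: the label and the `.example` host name are made-up illustrations, and the standard-library `idna` codec follows the older IDNA 2003 rules, so it should not be taken as a complete modern IDNA implementation.

```python
# A label containing one non-ASCII letter (u with diaeresis).
label = "b\u00fccher"

# Raw Punycode: ASCII letters first, then a delimiter and the encoded rest.
ace_label = label.encode("punycode")
print(ace_label)  # b'bcher-kva'

# The idna codec applies Punycode per label and adds the "xn--" prefix.
host = "b\u00fccher.example".encode("idna")
print(host)  # b'xn--bcher-kva.example'

# The transformation is reversible.
print(ace_label.decode("punycode") == label)  # True
```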
+===Email=== +{{Main\|Unicode and email}} -=== UTF-8 === -{{Основная статья\|UTF-8}} -UTF-8 — представление Юникода, обеспечивающее наибольшую компактность и обратную совместимость с 7-битной системой [[ASCII]]; текст, состоящий только из символов с номерами меньше 128, при записи в UTF-8 превращается в обычный текст [[ASCII]] и может быть отображён любой программой, работающей с ASCII; и наоборот, текст, закодированный 7-битной ASCII может быть отображён программой, предназначенной для работы с UTF-8. Остальные символы Юникода изображаются последовательностями длиной от 2 до 4 байт, в которых первый байт всегда имеет маску <code>11xxxxxx</code>, а остальные — <code>10xxxxxx</code>. В UTF-8 не используются суррогатные пары. +[[MIME]] defines two different mechanisms for encoding non-ASCII characters in [[email]], depending on whether the characters are in email headers (such as the "Subject:"), or in the text body of the message; in both cases, the original character set is identified as well as a transfer encoding. For email transmission of Unicode, the [[UTF-8]] character set and the [[Base64]] or the [[Quoted-printable]] transfer encoding are recommended, depending on whether much of the message consists of [[ASCII]] characters. The details of the two different mechanisms are specified in the MIME standards and generally are hidden from users of email software. -Формат UTF-8 был изобретён [[2 сентября]] [[1992 год]]а [[Томпсон, Кен\|Кеном Томпсоном]] и [[Пайк, Роб\|Робом Пайком]] и реализован в ОС [[Plan 9]]<ref>http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt{{ref-en}}{{Недоступная ссылка\|date=Октябрь 2019 \|bot=InternetArchiveBot }}</ref>. Сейчас стандарт UTF-8 официально закреплён в документах RFC 3629 и ISO/IEC 10646 Annex D. +The adoption of Unicode in email has been very slow. Some East Asian text is still encoded in encodings such as [[ISO-2022]], and some devices, such as mobile phones, still cannot correctly handle Unicode data. 
Support has been improving, however. Many major free mail providers such as [[Yahoo]], [[Google]] ([[Gmail]]), and [[Microsoft]] ([[Outlook.com]]) support it. -=== UTF-16 и UTF-32 === -{{Основная статья\|UTF-16\|UTF-32}} -UTF-16 — кодировка, позволяющая записывать символы Юникода в диапазонах U+0000...U+D7FF и U+E000...U+10FFFF (общим количеством 1 112 064). При этом каждый символ записывается одним или двумя словами (суррогатная пара). Кодировка UTF-16 описана в приложении Q к международному стандарту ISO/IEC 10646, а также ей посвящён документ IETF RFC 2781 под названием «UTF-16, an encoding of ISO 10646». +===Web=== +{{Main\|Unicode and HTML}} -UTF-32 — способ представления Юникода, при котором каждый символ занимает ровно 4 байта. Главное преимущество UTF-32 перед кодировками переменной длины заключается в том, что символы Юникод в ней непосредственно индексируемы, поэтому найти символ по номеру его позиции в файле можно чрезвычайно быстро, и получение любого символа ''n''-й позиции при этом является операцией, занимающей всегда одинаковое время. Это также делает замену символов в строках UTF-32 очень простой. Напротив, кодировки с переменной длиной требуют последовательного доступа к символу ''n''-й позиции, что может быть очень затратной по времени операцией. Главный недостаток UTF-32 — это неэффективное использование пространства, так как для хранения любого символа используется четыре байта. Символы, лежащие за пределами нулевой (базовой) плоскости кодового пространства, редко используются в большинстве текстов. Поэтому удвоение, в сравнении с UTF-16, занимаемого строками в UTF-32 пространства, зачастую не оправдано. +All [[W3C]] recommendations have used Unicode as their ''document character set'' since HTML 4.0. [[Web browser]]s have supported Unicode, especially UTF-8, for many years. There used to be display problems resulting primarily from [[typeface\|font]] related issues; e.g. 
versions 6 and older of Microsoft [[Internet Explorer]] did not render many code points unless explicitly told to use a font that contains them.<ref>{{cite web\|first=Alan \|last=Wood \|url=http://www.alanwood.net/unicode/explorer.html#ie5 \|title=Setting up Windows Internet Explorer 5, 5.5 and 6 for Multilingual and Unicode Support \|publisher=Alan Wood \|date= \|accessdate=2012-06-04}}</ref> -==== Порядок байтов ==== -{{Основная статья\|Порядок байтов}} -В потоке данных UTF-16 младший байт может записываться либо перед старшим ({{lang-en\|UTF-16 little-endian, UTF-16LE}}), либо после старшего ({{lang-en\|UTF-16 big-endian, UTF-16BE}}). Аналогично существует два варианта четырёхбайтной кодировки — UTF-32LE и UTF-32BE. +Although syntax rules may affect the order in which characters are allowed to appear, [[XML]] (including [[XHTML]]) documents, by definition,<ref>{{cite web\|title=Extensible Markup Language (XML) 1.1 (Second Edition)\|url=https://www.w3.org/TR/xml11\|accessdate=2013-11-01}}</ref> comprise characters from most of the Unicode code points, with the exception of: -=== Маркер последовательности байтов === -{{Основная статья\|Маркер последовательности байтов}} -Для указания на использование Юникода, в начале текстового файла или потока может передаваться [[Маркер последовательности байтов]] ({{lang-en\|byte order mark (BOM)}}) — символ U+FEFF (неразрывный пробел нулевой ширины). По его виду можно легко различить как формат представления Юникода, так и последовательность байтов. Маркер последовательности байтов может принимать следующий вид: -; UTF-8 : EF BB BF -; UTF-16BE : FE FF -; UTF-16LE : FF FE -; UTF-32BE : 00 00 FE FF -; UTF-32LE : FF FE 00 00 +* most of the [[C0 and C1 control codes\|C0 control codes]] +* the permanently unassigned code points D800–DFFF +* FFFE or FFFF -=== Юникод и традиционные кодировки === -Внедрение Юникода привело к изменению подхода к традиционным 8-битным кодировкам. 
Если раньше такая кодировка всегда задавалась непосредственно, то теперь она может задаваться таблицей соответствия между данной кодировкой и Юникодом. Фактически почти все 8-битные кодировки теперь можно рассматривать как форму представления некоторого подмножества Юникода. И это намного упростило создание программ, которые должны работать с множеством разных кодировок: теперь, чтобы добавить поддержку ещё одной кодировки, надо всего лишь добавить ещё одну таблицу перекодировки символов в Юникод. +HTML characters manifest either directly as [[byte]]s according to document's encoding, if the encoding supports them, or users may write them as numeric character references based on the character's Unicode code point. For example, the references <code>&#916;</code>, <code>&#1049;</code>, <code>&#1511;</code>, <code>&#1605;</code>, <code>&#3671;</code>, <code>&#12354;</code>, <code>&#21494;</code>, <code>&#33865;</code>, and <code>&#47568;</code> (or the same numeric values expressed in hexadecimal, with <code>&#x</code> as the prefix) should display on all browsers as Δ, Й, ק ,م, ๗, あ, 叶, 葉, and 말. -Кроме того, многие форматы данных позволяют вставлять любые символы Юникода, даже если документ записан в старой 8-битной кодировке. Например, в HTML можно использовать [[Мнемоники в HTML\|коды с амперсандом]]. +When specifying [[Uniform Resource Identifier\|URIs]], for example as [[URL]]s in [[HTTP]] requests, non-ASCII characters must be [[percent encoding\|percent-encoded]]. -=== Реализации === -Большинство современных операционных систем в той или иной степени обеспечивает поддержку Юникода. +===Fonts=== +{{Main\|Unicode font}} -В операционных системах семейства [[Windows NT]] для внутреннего представления имён файлов и других системных строк используется двухбайтовая кодировка UTF-16LE. Системные вызовы, принимающие строковые параметры, существуют в однобайтном и двухбайтном вариантах. Подробнее см. 
в статье [[Юникод в операционных системах семейства Microsoft Windows]]. +Unicode is not in principle concerned with fonts ''per se'', seeing them as implementation choices.<ref>{{cite journal \|url = http://cajun.cs.nott.ac.uk/wiley/journals/epobetan/pdf/volume6/issue3/bigelow.pdf \| title = The design of a Unicode font \| journal = Electronic Publishing \| volume = VOL. 6(3), 289–305 \| date = September 1993 \| page = 292 \|last1 = Bigelow \| first1=Charles \| last2 = Holmes \| first2 = Kris}}</ref> Any given character may have many [[allograph]]s, from the more common bold, italic and base letterforms to complex decorative styles. A font is "Unicode compliant" if the glyphs in the font can be accessed using code points defined in the Unicode standard.<ref>{{cite web \| url= https://www.unicode.org/faq/font_keyboard.html \| title = Fonts and keyboards \| publisher = Unicode Consortium \| date = 28 June 2017 \| accessdate= 13 October 2019}}</ref> The standard does not specify a minimum number of characters that must be included in the font; some fonts have quite a small repertoire. -[[UNIX]]-подобные операционные системы, в том числе [[GNU/Linux]], [[BSD]], [[OS X]], используют для представления Юникода кодировку UTF-8. Большинство программ может работать с UTF-8 как с традиционными однобайтными кодировками, не обращая внимания на то, что символ представляется как несколько последовательных байт. Для работы с отдельными символами строки обычно перекодируются в UCS-4, так что каждому символу соответствует [[машинное слово]]. +Free and retail [[font]]s based on Unicode are widely available, since [[TrueType]] and [[OpenType]] support Unicode. These font formats map Unicode code points to glyphs, but TrueType font is restricted to 65,535 glyphs. -Одной из первых успешных коммерческих реализаций Юникода стала среда программирования [[Java]]. В ней принципиально отказались от 8-битного представления символов в пользу 16-битного. 
Это решение увеличило расход памяти, но позволило вернуть в программирование важную абстракцию: произвольный одиночный символ (тип <code>char</code>). В частности, программист мог работать со строкой, как с простым массивом. К сожалению, успех не был окончательным, Юникод перерос ограничение в 16 бит и к версии J2SE 5.0 произвольный символ снова стал занимать переменное число единиц памяти — один <code>char</code> или два (см. [[UTF-16\|суррогатная пара]]). +[[List of typefaces\|Thousands of fonts]] exist on the market, but fewer than a dozen fonts—sometimes described as "pan-Unicode" fonts—attempt to support the majority of Unicode's character repertoire. Instead, Unicode-based [[List of Unicode fonts\|fonts]] typically focus on supporting only basic ASCII and particular scripts or sets of characters or symbols. Several reasons justify this approach: applications and documents rarely need to render characters from more than one or two writing systems; fonts tend to demand resources in computing environments; and operating systems and applications show increasing intelligence in regard to obtaining glyph information from separate font files as needed, i.e., [[font substitution]]. Furthermore, designing a consistent set of rendering instructions for tens of thousands of glyphs constitutes a monumental task; such a venture passes the point of [[diminishing returns]] for most typefaces. -Сейчас большинство языков программирования поддерживает строки Юникода, хотя их представление может различаться в зависимости от реализации. +===Newlines=== +Unicode partially addresses the [[newline]] problem that occurs when trying to read a text file on different platforms. Unicode defines a large number of [[Newline#Unicode\|characters]] that conforming applications should recognize as line terminators. 
-== Методы ввода == -Поскольку ни одна [[раскладка клавиатуры]] не может позволить вводить все символы Юникода одновременно, от [[операционная система\|операционных систем]] и [[прикладное программное обеспечение\|прикладных программ]] требуется поддержка альтернативных методов ввода произвольных символов Юникода. +Unicode introduced {{unichar\|2028\|LINE SEPARATOR}} and {{unichar\|2029\|PARAGRAPH SEPARATOR}} as an attempt to encode lines and paragraphs semantically, potentially replacing the various platform-specific solutions. In doing so, Unicode provides a way around the historical platform-dependent solutions. Nonetheless, few if any implementations have adopted these line and paragraph separators as the sole canonical line-ending characters. A common approach instead is newline normalization, which is used by the Cocoa text system in Mac OS X and by the W3C XML and HTML recommendations. In this approach, every possible newline character is converted internally to a common newline (which one does not matter, since it is an internal operation just for rendering). In other words, the text system can correctly treat the character as a newline, regardless of the input's actual encoding. -=== [[Microsoft Windows]] === -{{main\|Юникод в операционных системах семейства Microsoft Windows}} -Хотя, начиная с [[Windows 2000]], служебная программа «Таблица символов» (charmap.exe) поддерживает символы Юникода и позволяет копировать их в [[буфер обмена]], эта поддержка ограничена только базовой плоскостью (коды символов U+0000…U+FFFF). Символы с кодами от U+10000 «Таблица символов» не отображает. +==Issues== -Похожая таблица есть, например, в [[Microsoft Word]]. 
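The newline normalization described above can be approximated in a few lines of Python. This is a minimal sketch, not the Cocoa or W3C algorithm: it relies on `str.splitlines()`, which recognizes the Unicode line and paragraph separators along with the platform conventions, and also a few extra control characters. The function name `normalize_newlines` is an assumption made for this example.

```python
# Sketch of newline normalization: every line terminator recognized by
# str.splitlines() is replaced by a single "\n". A trailing terminator at
# the very end of the text is dropped by this simple approach.

def normalize_newlines(text: str) -> str:
    """Convert CR, LF, CR+LF, NEL, U+2028, U+2029 and similar to '\\n'."""
    return "\n".join(text.splitlines())

mixed = "one\r\ntwo\u2028three\u2029four\x85five"
print(normalize_newlines(mixed).split("\n"))
```

Choosing `"\n"` as the internal form is arbitrary here; as the text notes, which common newline is used does not matter as long as it is applied consistently.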
+===Philosophical and completeness criticisms=== +[[Han unification]] (the identification of forms in the [[East Asian language]]s which one can treat as stylistic variations of the same historical character) has become one of the most controversial aspects of Unicode, despite the presence of a majority of experts from all three regions in the [[Ideographic Research Group]] (IRG), which advises the Consortium and ISO on additions to the repertoire and on Han unification.<ref>[http://tronweb.super-nova.co.jp/characcodehist.html A Brief History of Character Codes], Steven J. Searle, originally written [https://web.archive.org/web/20001216022100/http://tronweb.super-nova.co.jp/characcodehist.html 1999], last updated 2004</ref> -Иногда можно набрать [[Шестнадцатеричная система счисления\|шестнадцатеричный]] код, нажать {{key\|[[Alt (клавиша)\|Alt]]\|X}}, и код будет заменён на соответствующий символ, например, в [[WordPad]], Microsoft Word. В редакторах {{key\|Alt\|X}} выполняет и обратное преобразование. +Unicode has been criticized for failing to separately encode older and alternative forms of [[kanji]] which, critics argue, complicates the processing of ancient Japanese and uncommon Japanese names. This is often due to the fact that Unicode encodes characters rather than glyphs (the visual representations of the basic character that often vary from one language to another). 
Unification of glyphs leads to the perception that the languages themselves, not just the basic character representation, are being merged.<ref name="dw2001">[https://web.archive.org/web/20130625062705/http://www.ibm.com/developerworks/library/u-secret.html The secret life of Unicode: A peek at Unicode's soft underbelly], Suzanne Topping, 1 May 2001 ''(Internet Archive)''</ref>{{clarify\|date=April 2010\|reason="and, contains" and meaning of statement}} There have been several attempts to create alternative encodings that preserve the stylistic differences between Chinese, Japanese, and Korean characters in opposition to Unicode's policy of Han unification. An example of one is [[TRON (encoding)\|TRON]] (although it is not widely adopted in Japan, there are some users who need to handle historical Japanese text and favor it). -Во многих программах MS Windows, чтобы получить символ Unicode, нужно при нажатой клавише Alt набрать десятичное значение кода символа на цифровой клавиатуре. Например, полезными при наборе кириллических текстов будут комбинации Alt+0171 (<!-- защита от Викификатора --><nowiki>«</nowiki>), Alt+0187 (<nowiki>»</nowiki>) и Alt+0769 ([[знак ударения]]). Интересны также комбинации Alt+0133 (…) и Alt+0151 (—). +Although the repertoire of fewer than 21,000 Han characters in the earliest version of Unicode was largely limited to characters in common modern usage, Unicode now includes more than 92,000 Han characters, and work is continuing to add thousands more historic and dialectal characters used in China, Japan, Korea, Taiwan, and Vietnam. -=== [[Macintosh]] === -В [[Mac OS]] 8.5 и более поздних версиях поддерживается метод ввода, называемый «Unicode Hex Input». При зажатой клавише Option требуется набрать четырёхзначный шестнадцатеричный код требуемого символа. Этот метод позволяет вводить символы с кодами, большими U+FFFF, используя пары суррогатов; такие пары операционной системой будут автоматически заменены на одиночные символы. 
Этот метод ввода перед использованием нужно активизировать в соответствующем разделе системных настроек и затем выбрать как текущий метод ввода в меню клавиатуры. +Modern font technology provides a means to address the practical issue of needing to depict a unified Han character in terms of a collection of alternative glyph representations, in the form of [[variation Selectors\|Unicode variation sequences]]. For example, the Advanced Typographic tables of [[OpenType]] permit one of a number of alternative glyph representations to be selected when performing the character-to-glyph mapping process. In this case, information can be provided within plain text to designate which alternate character form to select. -Начиная с [[Mac OS X]] 10.2, существует также приложение «Character Palette», позволяющее выбирать символы из таблицы, в которой можно выделять символы определённого блока или символы, поддерживаемые конкретным шрифтом. +[[File:Cyrillic cursive.svg\|thumb\|right\|Various [[Cyrillic]] characters shown with and without italics]] +If the appropriate glyphs for characters in the same script differ only in the italic, Unicode has generally unified them, as can be seen in the comparison between Russian (labeled standard) and Serbian characters at right, meaning that the differences are displayed through smart font technology or by manually changing fonts. -=== [[GNU/Linux]] === -В [[GNOME]] также есть утилита «[[Таблица символов GNOME\|Таблица символов]]» (ранее gucharmap), позволяющая отображать символы определённого блока или системы письма и предоставляющая возможность поиска по названию или описанию символа. Когда код нужного символа известен, его можно ввести в соответствии со стандартом [[Международная организация по стандартизации\|ISO]]{{nbsp}}14755: при зажатых клавишах {{key\|Ctrl\|Shift}} ввести шестнадцатеричный код (начиная с некоторой версии GTK+, ввод кода нужно предварить нажатием клавиши ''«U»''). 
Вводимый шестнадцатеричный код может иметь до {{num\|32\|бит}} в длину, позволяя вводить любые символы Юникода без использования суррогатных пар. +===Mapping to legacy character sets=== +Unicode was designed to provide code-point-by-code-point [[round-trip format conversion]] to and from any preexisting character encodings, so that a text file in an older character set can be converted to Unicode and back to yield the same file, without employing context-dependent interpretation. As a result, inconsistent legacy architectures, such as [[combining character\|combining diacritics]] and [[precomposed character]]s, both exist in Unicode, giving more than one method of representing some text. This is most pronounced in the three different encoding forms for Korean [[Hangul]]. Since version 3.0, precomposed characters that can be represented by a combining sequence of already existing characters are no longer added to the standard, in order to preserve interoperability between software using different versions of Unicode. -Все приложения [[X Window System\|X{{nbsp}}Window]], включая GNOME и [[KDE]], поддерживают ввод при помощи клавиши {{Key\|[[Compose]]}}. Для клавиатур, на которых нет отдельной клавиши [[Compose]], для этой цели можно назначить любую клавишу — например, {{Key\|[[Caps Lock]]}}. +[[Injective]] mappings must be provided between characters in existing legacy character sets and characters in Unicode to facilitate conversion to Unicode and allow interoperability with legacy software. 
Lack of consistency in various mappings between earlier Japanese encodings such as [[Shift-JIS]] or [[EUC-JP]] and Unicode led to [[round-trip format conversion]] mismatches, particularly the mapping of the character JIS X 0208 '～' (1-33, WAVE DASH), heavily used in legacy database data, to either {{unichar\|FF5E\|FULLWIDTH TILDE}} (in [[Microsoft Windows]]) or {{unichar\|301C\|WAVE DASH}} (other vendors).<ref> +[http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2166.doc AFII contribution about WAVE DASH], {{Cite web\|url=http://www.ingrid.org/java/i18n/unicode.html\|archiveurl=https://web.archive.org/web/20110422181018/http://www.ingrid.org/java/i18n/unicode.html\|title=An Unicode vendor-specific character table for japanese\|date=2011-04-22\|archive-date=2011-04-22\|website=web.archive.org<!--\|access-date=2019-05-20-->}}</ref> -Консоль GNU/Linux также допускает ввод символа Юникода по его коду — для этого десятичный код символа нужно ввести цифрами расширенного блока клавиатуры при зажатой клавише {{Key\|[[Alt (клавиша)\|Alt]]}}. Можно вводить символы и по их шестнадцатеричному коду: для этого нужно зажать клавишу {{key\|AltGr}}, и для ввода цифр A—F использовать клавиши расширенного блока клавиатуры от {{Key\|NumLock}} до {{Key\|Enter}} (по часовой стрелке). Поддерживается также и ввод в соответствии с ISO{{nbsp}}14755. Для того чтобы перечисленные способы могли работать, нужно включить в консоли режим Юникода вызовом <code>unicode_start</code>(1) и выбрать подходящий шрифт вызовом <code>setfont</code>(8). 
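The WAVE DASH mismatch described above can be observed with Python's standard codecs, which ship both a JIS-based `shift_jis` table and the Microsoft `cp932` table. This is a small sketch under that assumption; the constant and function names are hypothetical and exist only for this example.

```python
# Sketch: the same Shift-JIS byte pair decodes to different code points
# depending on which vendor mapping table the codec follows.

WAVE_DASH_BYTES = b"\x81\x60"  # JIS X 0208 row 1, cell 33, in Shift-JIS form

def decode_with(codec: str) -> str:
    """Decode the byte pair with the given codec and format the code point."""
    return f"U+{ord(WAVE_DASH_BYTES.decode(codec)):04X}"

print("shift_jis:", decode_with("shift_jis"))
print("cp932:    ", decode_with("cp932"))
```

Text decoded with one table and re-encoded with the other may therefore fail or change, which is the round-trip problem the paragraph refers to.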
+Some Japanese computer programmers objected to Unicode because it requires them to separate the use of {{unichar\|005C\|REVERSE SOLIDUS\|note=backslash}} and {{unichar\|00A5\|YEN SIGN}}, which was mapped to 0x5C in JIS X 0201, and a lot of legacy code exists with this usage.<ref>[https://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html#s-646problem ''ISO 646-* Problem''], Section 4.4.3.5 of ''Introduction to I18n'', Tomohiro KUBOTA, 2001</ref> (This encoding also replaces tilde '~' 0x7E with macron '¯', now 0xAF.) The separation of these characters exists in [[ISO 8859-1]], from long before Unicode. -[[Mozilla Firefox]] для Linux поддерживает ввод символов по ISO{{nbsp}}14755. +===Indic scripts=== +[[Indic script]]s such as [[Tamil script\|Tamil]] and [[Devanagari]] are each allocated only 128 code points, matching the [[ISCII]] standard. The correct rendering of Unicode Indic text requires transforming the stored logical order characters into visual order and the forming of ligatures (aka conjuncts) out of components. Some local scholars argued in favor of assignments of Unicode code points to these ligatures, going against the practice for other writing systems, though Unicode contains some Arabic and other ligatures for backward compatibility purposes only.<ref>{{cite web +\| title = Arabic Presentation Forms-A +\| url = https://www.unicode.org/charts/PDF/UFB50.pdf +\| accessdate = 2010-03-20}} +</ref><ref>{{cite web +\| title = Arabic Presentation Forms-B +\| url = https://www.unicode.org/charts/PDF/UFE70.pdf +\| accessdate = 2010-03-20}}</ref><ref>{{cite web +\| title = Alphabetic Presentation Forms +\| url = https://www.unicode.org/charts/PDF/UFB00.pdf +\| accessdate = 2010-03-20}}</ref> Encoding of any new ligatures in Unicode will not happen, in part because the set of ligatures is font-dependent, and Unicode is an encoding independent of font variations. 
The same kind of issue arose for the [[Tibetan script]] in 2003 when the [[Standardization Administration of China]] proposed encoding 956 precomposed Tibetan syllables,<ref>{{Cite web \| author=China \| title=Proposal on Tibetan BrdaRten Characters Encoding for ISO/IEC 10646 in BMP \| url=https://www.unicode.org/L2/L2002/02455-n2558-tibetan.pdf \| date=2 December 2002 }}</ref> but these were rejected for encoding by the relevant ISO committee ([[ISO/IEC JTC 1/SC 2]]).<ref>{{Cite web \| author= V. S. Umamaheswaran \| title=Resolutions of WG 2 meeting 44 \| url=https://www.unicode.org/L2/L2003/03390r-n2654.pdf \| at=Resolution M44.20 \| date=7 November 2003 }}</ref> -== Проблемы Юникода == -В Юникоде английское «a» и польское «a» — один и тот же символ. Точно так же одним и тем же символом (но отличающимся от «a» латинского) считаются русское «а» и сербское «а». Такой принцип кодирования не универсален; по-видимому, решения «на все случаи жизни» вообще не может существовать. -* Тексты на [[китайский язык\|китайском]], [[корейский язык\|корейском]] и [[японский язык\|японском]] языках имеют традиционное написание сверху вниз, начиная с правого верхнего угла. Переключение горизонтального и вертикального написания для этих языков не предусмотрено в Юникоде — это должно осуществляться средствами [[язык разметки\|языков разметки]] или внутренними механизмами [[текстовый процессор\|текстовых процессоров]]. -* Наличие или отсутствие в Юникоде разных начертаний одного и того же символа в зависимости от языка. Нужно следить, чтобы текст всегда был правильно помечен как относящийся к тому или другому языку. -: Так, [[китайское письмо\|китайские иероглифы]] могут иметь разные начертания в китайском, японском ([[кандзи]]) и корейском ([[ханча]]), но при этом в Юникоде обозначаются одним и тем же символом (так называемая CJK-унификация), хотя упрощённые и полные иероглифы всё же имеют разные коды. 
-: Аналогично, [[русский язык\|русский]] и [[сербский язык\|сербский]] <!-- защита от Викификатора --><nowiki>языки</nowiki> используют разное начертание курсивных букв ''п'' и ''т'' (в сербском они выглядят как <span style="text-decoration: overline; font-style: italic">и</span> и <span style="text-decoration: overline; font-style: italic">ш</span>, см. [[сербский курсив]]). -* Перевод из строчных букв в заглавные тоже зависит от языка. Например: в [[турецкий язык\|турецком]] существуют буквы [[i без точки\|İi и Iı]] — таким образом, турецкие правила изменения регистра конфликтуют с [[английский язык\|английскими]], которые предписывают «i» переводить в «I». Подобные проблемы есть и в других языках — например, в канадском диалекте французского языка регистр переводится немного не так, как во Франции<ref>[http://www.transl-gunsmoker.ru/2008/11/unicode.html Регистр в Unicode — это непросто]</ref>. -* Даже с [[арабские цифры\|арабскими цифрами]] есть определённые типографские тонкости: цифры бывают «прописными» и «[[минускульные цифры\|строчными]]», пропорциональными и [[моноширинный шрифт\|моноширинными]]<ref>В большинстве шрифтов для ПК реализованы «прописные» (маюскульные) моноширинные цифры.</ref> — для Юникода разницы между ними нет. Подобные нюансы остаются за программным обеспечением. +[[Thai alphabet]] support has been criticized for its ordering of Thai characters. The vowels เ, แ, โ, ใ, ไ that are written to the left of the preceding consonant are in visual order instead of phonetic order, unlike the Unicode representations of other Indic scripts. This complication is due to Unicode inheriting the [[TIS-620\|Thai Industrial Standard 620]], which worked in the same way, and was the way in which Thai had always been written on keyboards. 
This ordering problem complicates the Unicode collation process slightly, requiring table lookups to reorder Thai characters for collation.<ref name="dw2001" /> Even if Unicode had adopted encoding according to spoken order, it would still be problematic to collate words in dictionary order. E.g., the word {{wiktth\|แสดง}} {{IPA-th\|sa dɛːŋ\|}} "perform" starts with a consonant cluster "สด" (with an inherent vowel for the consonant "ส"), the vowel แ-, in spoken order would come after the ด, but in a dictionary, the word is collated as it is written, with the vowel following the ส. -Некоторые недостатки связаны не с самим Юникодом, а с возможностями обработчиков текста. -* Файлы нелатинского текста в Юникоде всегда занимают больше места, так как один символ кодируется не одним байтом, как в различных национальных кодировках, а последовательностью байтов (исключение составляет UTF-8 для языков, алфавит которых укладывается в ASCII, а также наличие в тексте символов двух и более языков, алфавит которых ''не'' укладывается в ASCII<ref>В некоторых случаях документ (не простой текст) в Юникоде может занимать существенно меньше места, чем документ в однобайтовой кодировке. Например, если некая веб-страница содержит примерно поровну русского и греческого текста, то в однобайтовой кодировке придётся либо русские, либо греческие буквы записывать, используя возможности формата документов, в виде кодов с амперсандом, которые занимают 6—7 байт на символ (при использовании десятичных кодов), то есть в среднем на букву придётся 3,5—4 байта, в то время как UTF-8 занимает только 2 байта на греческую или русскую букву.</ref>). 
Файл шрифта, необходимый для отображения всех символов таблицы Юникод, занимает сравнительно много места в памяти и требует бо́льших вычислительных ресурсов, чем шрифт только одного национального языка пользователя<ref>Один из файлов шрифтов Arial Unicode имеет размер 24 мегабайта; существует Times New Roman размером 120 мегабайт, он содержит количество символов, близкое к 65536.</ref>. С увеличением мощности компьютерных систем и удешевлением памяти и дискового пространства эта проблема становится всё менее существенной; тем не менее, она остаётся актуальной для портативных устройств, например, для мобильных телефонов. -* Хотя поддержка Юникода реализована в наиболее распространённых операционных системах, до сих пор не всё прикладное программное обеспечение поддерживает корректную работу с ним. В частности, не всегда обрабатываются метки порядка байтов ([[Byte order mark\|BOM]]) и плохо поддерживаются диакритические символы. Проблема является временной и есть следствие сравнительной новизны стандартов Юникода (в сравнении с однобайтовыми национальными кодировками). -* Производительность всех программ обработки строк (в том числе и сортировок в БД) снижается при использовании Юникода вместо однобайтовых кодировок. +===Combining characters=== +{{Main\|Combining character}} +{{See also\|Unicode normalization#Normalization}} -Некоторые редкие системы письма всё ещё не представлены должным образом в Юникоде. Изображение «длинных» надстрочных символов, простирающихся над несколькими буквами, как, например, в [[церковнославянский язык\|церковнославянском языке]], пока не реализовано. +Characters with diacritical marks can generally be represented either as a single precomposed character or as a decomposed sequence of a base letter plus one or more non-spacing marks. 
For example, ḗ (precomposed e with macron and acute above) and ḗ (e followed by the combining macron above and combining acute above) should be rendered identically, both appearing as an [[e]] with a [[Macron (diacritic)\|macron]] and [[acute accent]], but in practice, their appearance may vary depending upon what rendering engine and fonts are being used to display the characters. Similarly, [[dot (diacritic)\|underdots]], as needed in the [[romanization]] of [[Indo-Aryan languages\|Indic]], will often be placed incorrectly.{{Citation needed\|date=July 2011}}. Unicode characters that map to precomposed glyphs can be used in many cases, thus avoiding the problem, but where no precomposed character has been encoded the problem can often be solved by using a specialist Unicode font such as [[Charis SIL]] that uses [[Graphite (SIL)\|Graphite]], [[OpenType]], or [[Apple Advanced Typography\|AAT]] technologies for advanced rendering features. -== «Юникод» или «Уникод»? == -«Unicode» — одновременно и имя собственное (или часть имени, например, Unicode Consortium), и имя нарицательное, происходящее из английского языка. +===Anomalies=== +{{main\|Unicode alias names and abbreviations}} +The Unicode standard has imposed rules intended to guarantee stability.<ref>[https://www.unicode.org/policies/stability_policy.html Unicode stability policy]</ref> Depending on the strictness of a rule, a change can be prohibited or allowed. For example, a "name" given to a code point cannot and will not change. But a "script" property is more flexible, by Unicode's own rules. In version 2.0, Unicode changed many code point "names" from version 1. At the same moment, Unicode stated that from then on, an assigned name to a code point will never change anymore. This implies that when mistakes are published, these mistakes cannot be corrected, even if they are trivial (as happened in one instance with the spelling {{sc2\|{{typo\|BRAKCET}}}} for {{sc2\|BRACKET}} in a character name). 
In 2006 a list of anomalies in character names was first published, and, as of April 2017, there were 94 characters with identified issues,<ref name="tn17">{{cite web \|url=https://unicode.org/notes/tn27/ \|title=Unicode Technical Note #27: Known Anomalies in Unicode Character Names \|date=10 April 2017 \|website=unicode.org}}</ref> for example: -На первый взгляд предпочтительнее использовать написание «Уникод». В [[Русский язык\|русском языке]] уже есть [[Морфема\|морфемы]] «уни-» (слова с латинским элементом «uni-» традиционно переводились и писались через «уни-»: универсальный, униполярный, унификация, униформа) и «код». Напротив, торговые марки, заимствованные из [[Английский язык\|английского языка]], обычно передаются посредством практической транскрипции, в которой деэтимологизированное сочетание букв «uni-» записывается в виде «юни-» («[[Юнилевер]]», «[[UNIX\|Юникс]]» и т. п.), то есть точно так же, как в случае с побуквенными сокращениями, вроде [[UNICEF]] «United Nations International Children’s Emergency Fund» — [[ЮНИСЕФ]]. +* {{unichar\|2118\|script capital p\|nlink=Weierstrass p}}: This is a small letter. The capital is {{unichar\|1D4AB\|MATHEMATICAL SCRIPT CAPITAL P}}<ref>[https://www.unicode.org/charts/PDF/U2100.pdf Unicode chart: "actually this has the form of a lowercase calligraphic p, despite its name"]</ref> +* {{unichar\|034F\|COMBINING GRAPHEME JOINER\|nlink=Combining grapheme joiner}}: Does not join graphemes.<ref name="tn17" /> +* {{unichar\|A015\|YI SYLLABLE WU\|nlink=Yi language}}: This is not a Yi syllable, but a Yi iteration mark. +* {{unichar\|FE18\|PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR {{typo\|BRAKCET}}}}: ''bracket'' is spelled incorrectly.<ref>[https://www.unicode.org/charts/PDF/UFE10.pdf "Misspelling of BRACKET in character name is a known defect"]</ref> -Написание «Юникод» уже твёрдо вошло в русскоязычные тексты. В [[Википедия\|Википедии]] используется более распространённый вариант. 
В [[MS Windows]] используется вариант «Юникод». +Spelling errors are resolved by using [[Unicode alias names and abbreviations]]. -На сайте Консорциума есть специальная страница, где рассматриваются проблемы передачи слова «Unicode» в различных языках и системах письма. Для русской кириллицы указан вариант «Юникод»<ref name=autogenerated1 />. +==See also== +* [[Comparison of Unicode encodings]] +* [[Cultural, political, and religious symbols in Unicode]] +* [[International Components for Unicode]] (ICU), now as ICU-<abbr title="technical committee">TC</abbr> a part of Unicode +* [[List of binary codes]] +* [[List of Unicode characters]] +* [[List of XML and HTML character entity references]] +* [[Open-source Unicode typefaces]] +* [[Standards related to Unicode]] +* [[Unicode symbols]] +* [[Universal Coded Character Set]] +* [[Lotus Multi-Byte Character Set]] (LMBCS), a parallel development with similar intentions -Формы, принятые иностранными организациями для русской передачи слова «Unicode», являются рекомендательными. +==Further reading== +{{refbegin}} +* ''The Unicode Standard, Version 3.0'', The Unicode Consortium, Addison-Wesley Longman, Inc., April 2000. {{ISBN\|0-201-61633-5}} +* ''The Unicode Standard, Version 4.0'', The Unicode Consortium, Addison-Wesley Professional, 27 August 2003. {{ISBN\|0-321-18578-1}} +* ''The Unicode Standard, Version 5.0, Fifth Edition'', The [[Unicode Consortium]], Addison-Wesley Professional, 27 October 2006. {{ISBN\|0-321-48091-0}} +* Julie D. Allen. ''The Unicode Standard, Version 6.0'', The [[Unicode Consortium]], Mountain View, 2011, {{ISBN\|9781936213016}}, ([https://www.unicode.org/versions/Unicode6.0.0/]). +* ''The Complete Manual of Typography'', James Felici, Adobe Press; 1st edition, 2002. {{ISBN\|0-321-12730-7}} +* ''Unicode: A Primer'', Tony Graham, M&T books, 2000. {{ISBN\|0-7645-4625-2}}. 
+* ''Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard'', Richard Gillam, Addison-Wesley Professional; 1st edition, 2002. {{ISBN\|0-201-70052-2}} +* ''Unicode Explained'', Jukka K. Korpela, O'Reilly; 1st edition, 2006. {{ISBN\|0-596-10121-X}} +{{refend}} -== См. также == -* [[Символы, представленные в Юникоде]] -* [[ASCII]] -* [[ISO 8859-1]] -* [[UTF-8]] -* [[UTF-16]] -* [[UTF-32]] -* [[Кириллица в Юникоде]] -* [[Дроби в Юникоде]] -* [[XeTeX]] -* [[Свободные универсальные шрифты]] -* [[Windows Glyph List 4]] -* [[Широкий символ]] -* Библиотека [[GLib]] содержит широкий набор функций для работы c символами и строками в кодировке Unicode +{{cite book \|author1=Yannis Haralambous \|author2=Martin Dürst \|editor1-last=Haralambous \|editor1-first=Yannis \|title=Proceedings of Graphemics in the 21st Century, Brest 2018 \|date=2019 \|publisher=Fluxus Editions \|location=Brest \|isbn=978-2-9570549-1-6 \|pages=167-183 \|url=http://www.fluxus-editions.fr/gla1-hara1.php \|ref=https://doi.org/10.36824/2018-graf-hara1 \|chapter=Unicode from a Linguistic Point of View}} - [[Проект:Внесение символов алфавитов народов России в Юникод]] +==Notes== +{{notelist\|group=note}} -== Примечания == -{{примечания\|2}} +==References== +{{reflist\|30em}} -== Ссылки == -* [http://www.unicode.org/ Официальный сайт Консорциума Юникода]{{ref-en}} -* {{dmoz\|Computers/Software/Globalization/Character_Encoding/Unicode/\|Unicode}}{{ref-en}} -* Статья «[http://www.unicode.org/standard/translations/russian.html Что такое Unicode?]»{{ref-ru}} на официальном сайте Консорциума -* [http://www.unicode.org/versions/latest/ Последняя версия стандарта Юникод]{{ref-en}} -* Последнюю версию стандарта ISO/IEC 10646 ищите в [http://standards.iso.org/ittf/PubliclyAvailableStandards/ списке доступных стандартов]{{ref-en}}. 
Документы, соответствующие стандарту Unicode 7.0: [http://standards.iso.org/ittf/PubliclyAvailableStandards/c056921_ISO_IEC_10646_2012.zip ISO/IEC 10646] (файл ZIP){{ref-en}}, [http://standards.iso.org/ittf/PubliclyAvailableStandards/c061712_ISO_IEC_10646_2012_Amd_1_2013.zip Amendments 1] (файл ZIP){{ref-en}}, Amendments 2 (по состоянию 2014-08-06 ещё недоступен) -* [http://unicode-table.com/ Таблица символов Юникода с названиями и описаниями]{{ref-ru}}{{ref-en}}{{ref-de}} -* [http://www.unicode.org/versions/Unicode5.0.0/appC.pdf Связь Юникода версии 5.0.0 и ISO/IEC 10646] (файл PDF){{ref-en}} -* [http://www.cl.cam.ac.uk/~mgk25/unicode.html FAQ по UTF-8 и Unicode]{{ref-en}} -* [[Кириллица в Юникоде]]: http://www.unicode.org/charts/PDF/U0400.pdf, http://www.unicode.org/charts/PDF/U0500.pdf, http://www.unicode.org/charts/PDF/U2DE0.pdf, http://www.unicode.org/charts/PDF/UA640.pdf{{ref-en}}{{Недоступная ссылка\|date=Январь 2020 \|bot=InternetArchiveBot }} -* [http://www.i18nguy.com/surrogates.html Включение поддержки дополнительных символов Юникода в Windows]{{ref-en}} -* [http://www.fileformat.info/info/unicode/char/search.htm Поиск по символам Юникода]{{ref-en}} +==External links== +{{Sister project links\|n=no\|v=no\|q=no\|s=no\|voy=no\|m=Unicode\|mw=no\|species=no}} +* {{official website\|name=Official website}} {{middot}} {{official website\|url=https://unicode.org/main.html\|name=Official technical site}} +* {{DMOZ\|Computers/Software/Globalization/Character_Encoding/Unicode/}} +* [http://www.alanwood.net/unicode/ Alan Wood's Unicode Resources]{{snd}} Contains lists of word processors with Unicode capability; fonts and characters are grouped by type; characters are presented in lists, not grids. +* [https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=UnicodeBMPFallbackFont Unicode BMP Fallback Font] Displays the Unicode value of any character in a document, including in the Private Use Area, rather than the glyph itself. 
+{{Unicode navigation\|state=uncollapsed}} +{{Character encoding}} -{{Стандарты ISO}} -{{Шрифтовой дизайн}} +{{Authority control}} -[[Категория:Юникод\| ]] -[[Категория:Стандарты Интернета]] -[[Категория:Стандарты ISO]] +[[Category:Unicode\| ]] +[[Category:Character encoding]] +[[Category:Digital typography]] '
New page size (`new_size`)	85511
Old page size (`old_size`)	118106
Size change in edit (`edit_delta`)	-32595
Добавленные в правке строки (`added_lines`)	[ 0 => '{{Use dmy dates\|date=May 2019\|cs1-dates=y}}', 1 => '{{distinguish\|Unicode (telegraphy)}}', 2 => '{{For\|what the term "Unicode" means in Microsoft documentation\|UTF-16}}', 3 => '{{Short description\|Character encoding standard}}', 4 => '{{Infobox character encoding', 5 => '\| name = Unicode', 6 => '\| mime =', 7 => '\| alias = [[Universal Coded Character Set]] (UCS)', 8 => '\| image = New Unicode logo.svg', 9 => '\| caption = Logo of the [[Unicode Consortium]]', 10 => '\| standard = Unicode Standard', 11 => '\| lang = International', 12 => '\| status =', 13 => '\| encodings = [[UTF-8]], [[UTF-16]], [[GB 18030\|GB18030]]<br/>'''Less common''': [[UTF-32]], [[BOCU]], [[Standard Compression Scheme for Unicode\|SCSU]], [[UTF-7]]', 14 => '\| encodes =', 15 => '\| extends =', 16 => '\| prev = [[ISO 8859]], various others', 17 => '\| next =', 18 => '}}', 19 => '{{Contains special characters\| special = uncommon [[Unicode]] characters}}', 20 => ''''Unicode''' is an [[information technology]] [[technical standard\|standard]] for the consistent [[character encoding\|encoding]], representation, and handling of [[character (computing)\|text]] expressed in most of the world's [[writing system]]s. The standard is maintained by the [[Unicode Consortium]], and {{as of\|March 2020\|lc=y}}, there is a repertoire of {{unicodenover}} (these [[character (computing)\|characters]] consist of 143,696 graphic characters and 163 format characters) covering 154 modern and historic [[script (Unicode)\|scripts]], as well as multiple symbol sets and [[emoji]]. 
The character repertoire of the Unicode Standard is synchronized with [[ISO/IEC 10646]], and both are code-for-code identical.', 21 => '', 22 => '''The Unicode Standard'' consists of a set of code charts for visual reference, an encoding method and set of standard [[character encoding]]s, a set of reference [[data file]]s, and a number of related items, such as character properties, rules for [[Unicode normalization\|normalization]], decomposition, [[Unicode collation algorithm\|collation]], rendering, and [[bidirectional text]] display order (for the correct display of text containing both right-to-left scripts, such as [[Arabic script\|Arabic]] and [[Hebrew alphabet\|Hebrew]], and left-to-right scripts).<ref>{{Cite web \| title = The Unicode Standard: A Technical Introduction \| url = https://www.unicode.org/standard/principles.html \| accessdate = 2010-03-16}}</ref>', 23 => '', 24 => 'Unicode's success at unifying character sets has led to its widespread and predominant use in the [[internationalization and localization]] of computer [[software]]. The standard has been implemented in many recent technologies, including modern [[operating system]]s, [[XML]], [[Java (programming language)\|Java]] (and other programming languages), and the [[.NET Framework]].', 25 => '', 26 => '[[Comparison of Unicode encodings\|Unicode can be implemented]] by different [[character encoding]]s. The Unicode standard defines [[UTF-8]], [[UTF-16]], and [[UTF-32]], and several other encodings are in use. 
The most commonly used encodings are UTF-8, UTF-16, and [[Universal Coded Character Set\|UCS]]-2 (without full support for Unicode), a precursor of UTF-16; [[GB 18030\|GB18030]] is standardized in China and implements Unicode fully, while not an official Unicode standard.', 27 => '', 28 => 'UTF-8, the dominant encoding on the [[World Wide Web]] (used in over 94% of websites {{As of\|2019\|November\|df=\|lc=y}}),<ref>{{Cite web\|url=https://w3techs.com/technologies/cross/character_encoding/ranking\|title=Usage Survey of Character Encodings broken down by Ranking\|website=w3techs.com\|language=en\|access-date=2019-11-11}}</ref> uses one [[byte]]{{efn\|The Unicode Consortium uses the ambiguous term byte; The [[International Organization for Standardization]] (ISO), the [[International Electrotechnical Commission]] (IEC) and the [[Internet Engineering Task Force]] (IETF) use the more specific term [[Octet (computing)\|octet]] in current documents related to Unicode.\|group=note}}for the first 128 [[code point]]s, and up to 4 bytes for other characters.<ref>{{cite web\|url=https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#I1.36559\|work=The Unicode Standard\|title=Conformance \| date=March 2020\|accessdate=2020-03-15}}</ref> The first 128 Unicode code points represent the [[ASCII]] characters, which means that any ASCII text is also a UTF-8 text.', 29 => '', 30 => 'UCS-2 uses two bytes (16 bits) for each character but can only encode the first 65,536 code points, the so-called [[Basic Multilingual Plane]] (BMP). With 1,112,064 possible Unicode code points corresponding to characters (see [[#Architecture and terminology\|below]]) on 17 planes, and with over 143,000 code points defined as of version 13.0, UCS-2 is only able to represent less than half of all encoded Unicode characters. Therefore, UCS-2 is outdated, though still widely used in software. 
UTF-16 extends UCS-2, by using the same [[16-bit]] encoding as UCS-2 for the Basic Multilingual Plane, and a 4-byte encoding for the other planes. As long as it contains no code points in the reserved range U+D800–U+DFFF, a UCS-2 text is valid UTF-16 text.', 31 => '', 32 => 'UTF-32 (also referred to as UCS-4) uses four bytes for each character. Like UCS-2, the number of bytes per character is fixed, facilitating character indexing; but unlike UCS-2, UTF-32 is able to encode all Unicode code points. However, because each character uses four bytes, UTF-32 takes significantly more space than other encodings, and is not widely used.', 33 => '', 34 => '==Origin and development==', 35 => 'Unicode has the explicit aim of transcending the limitations of traditional character encodings, such as those defined by the [[ISO/IEC 8859]] standard, which find wide usage in various countries of the world but remain largely incompatible with each other. Many traditional character encodings share a common problem in that they allow bilingual computer processing (usually using [[Latin character]]s and the local script), but not multilingual computer processing (computer processing of arbitrary scripts mixed with each other).', 36 => '', 37 => 'Unicode, in intent, encodes the underlying characters—[[grapheme]]s and grapheme-like units—rather than the variant [[glyph]]s (renderings) for such characters. In the case of [[Chinese characters]], this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see [[Han unification]]).', 38 => '', 39 => 'In text processing, Unicode takes the role of providing a unique ''code point''—a [[number]], not a glyph—for each character. In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, [[font]], or style) to other software, such as a [[web browser]] or [[word processor]]. 
This simple aim becomes complicated, however, because of concessions made by Unicode's designers in the hope of encouraging a more rapid adoption of Unicode.', 40 => 'The first 256 code points were made identical to the content of [[ISO/IEC 8859-1]] so as to make it trivial to convert existing western text. Many essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings and therefore, allow conversion from those encodings to Unicode (and back) without losing any information. For example, the "[[Halfwidth and Fullwidth Forms (Unicode block)\|fullwidth forms]]" section of code points encompasses a full duplicate of the Latin alphabet because Chinese, Japanese, and Korean ([[CJK]]) fonts contain two versions of these letters, "fullwidth" matching the width of the CJK characters, and normal width. For other examples, see [[duplicate characters in Unicode]].', 41 => '==={{anchor\|Unicode 88}}History===', 42 => 'Based on experiences with the [[Xerox Character Code Standard]] (XCCS) since 1980,<ref name="unicode-88"/> the origins of Unicode date to 1987, when [[Joe Becker (Unicode)\|Joe Becker]] from [[Xerox]] with [[Lee Collins (software engineer)\|Lee Collins]] and [[Mark Davis (Unicode)\|Mark Davis]] from [[Apple Inc.\|Apple]], started investigating the practicalities of creating a universal character set.<ref>{{cite web \|title=Summary Narrative \|url=https://www.unicode.org/history/summary.html \|access-date=2010-03-15}}</ref> With additional input from Peter Fenwick and Dave Opstad,<ref name="unicode-88"/> Joe Becker published a draft proposal for an "international/multilingual text character encoding system in August 1988, tentatively called Unicode". 
He explained that "[t]he name 'Unicode' is intended to suggest a unique, unified, universal encoding".<ref name="unicode-88">{{Cite web \|url=https://unicode.org/history/unicode88.pdf \|title=Unicode 88 \|author-last=Becker \|author-first=Joseph D. \|author-link=Joseph D. Becker \|date=1998-09-10 \|orig-year=1988-08-29 \|edition=10th anniversary reprint \|website=unicode.org \|publisher=[[Unicode Consortium]] \|access-date=2016-10-25 \|url-status=live \|archive-url=https://web.archive.org/web/20161125224409/https://unicode.org/history/unicode88.pdf \|archive-date=2016-11-25 \|quote=In 1978, the initial proposal for a set of "Universal Signs" was made by [[Bob Belleville]] at [[Xerox PARC]]. Many persons contributed ideas to the development of a new encoding design. Beginning in 1980, these efforts evolved into the [[Xerox Character Code Standard]] (XCCS) by the present author, a multilingual encoding which has been maintained by Xerox as an internal corporate standard since 1982, through the efforts of Ed Smura, Ron Pellar, and others.<br/>Unicode arose as the result of eight years of working experience with XCCS. Its fundamental differences from XCCS were proposed by Peter Fenwick and Dave Opstad (pure 16-bit codes), and by [[Lee Collins (Unicode)\|Lee Collins]] (ideographic character unification). Unicode retains the many features of XCCS whose utility have been proved over the years in an international line of communication multilingual system products.}}</ref>', 43 => 'In this document, entitled ''Unicode 88'', Becker outlined a [[16-bit]] character model:<ref name="unicode-88"/>', 44 => '<blockquote>', 45 => 'Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the characters of all the world's living languages. 
In a properly engineered design, 16 bits per character are more than sufficient for this purpose.', 46 => '</blockquote>', 47 => 'His original 16-bit design was based on the assumption that only those scripts and characters in modern use would need to be encoded:<ref name="unicode-88"/>', 48 => '<blockquote>', 49 => 'Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2<sup>14</sup> = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally useful Unicodes.', 50 => '</blockquote>', 51 => 'In early 1989, the Unicode working group expanded to include Ken Whistler and Mike Kernaghan of Metaphor, Karen Smith-Yoshimura and Joan Aliprand of [[Research Libraries Group\|RLG]], and Glenn Wright of [[Sun Microsystems]], and in 1990, Michel Suignard and Asmus Freytag from [[Microsoft]] and Rick McGowan of [[NeXT]] joined the group. By the end of 1990, most of the work on mapping existing character encoding standards had been completed, and a final review draft of Unicode was ready.', 52 => 'The [[Unicode Consortium]] was incorporated in California on 3 January 1991,<ref>[https://unicode.org/history/publicationdates.html History of Unicode Release and Publication Dates] on ''unicode.org.'' Retrieved February 28, 2017.</ref> and in October 1991, the first volume of the Unicode standard was published. The second volume, covering Han ideographs, was published in June 1992.', 53 => 'In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. 
This increased the Unicode codespace to over a million code points, which allowed for the encoding of many historic scripts (e.g., [[Egyptian hieroglyphs]]) and thousands of rarely used or obsolete characters that had not been anticipated as needing encoding. Among the characters not originally intended for Unicode are rarely used Kanji or Chinese characters, many of which are part of personal and place names, making them rarely used, but much more essential than envisioned in the original architecture of Unicode.<ref name=unicoderevisited>{{cite web\|last=Searle\|first=Stephen J\|title=Unicode Revisited\|url=http://tronweb.super-nova.co.jp/unicoderevisited.html\|accessdate=2013-01-18}}</ref>', 54 => 'The Microsoft TrueType specification version 1.0 from 1992 used the name ''Apple Unicode'' instead of ''Unicode'' for the Platform ID in the naming table.', 55 => '===Unicode Consortium===', 56 => '{{Main\|Unicode Consortium}}', 57 => 'The Unicode Consortium is a nonprofit organization that coordinates Unicode's development. Full members include most of the main computer software and hardware companies with any interest in text-processing standards, including [[Adobe Inc.\|Adobe]], [[Apple Inc.\|Apple]], [[Google]], [[International Business Machines\|IBM]], [[Microsoft]], and [[Oracle Corporation]].<ref name="members">{{cite web', 58 => '\| title = The Unicode Consortium Members', 59 => '\| url = https://unicode.org/consortium/members.html', 60 => '\| accessdate = 2019-01-04}}</ref>', 61 => 'Over the years several countries or government agencies have been members of the Unicode Consortium. 
Presently only the [[Ministry of Endowments and Religious Affairs (Oman)]] is a full member with voting rights.<ref name="members" />', 62 => 'The Consortium has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard [[#Unicode Transformation Format and Universal Character Set\|Unicode Transformation Format]] (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with [[multilingualism\|multilingual]] environments.', 63 => '===Scripts covered===', 64 => '{{Main\|Script (Unicode)}}', 65 => '[[File:Unicode sample.png\|thumb\|right\|200px\|Many modern applications can render a substantial subset of the many [[scripts in Unicode]], as demonstrated by this screenshot from the [[OpenOffice.org]] application.]]<!-- screenshot fair use rationale: this screenshot is used specifically to illustrate the Unicode-related capabilities of modern desktop applications and the breadth of supported Unicode scripts -->', 66 => 'Unicode covers almost all scripts ([[writing system]]s) in current use today.<ref>{{cite web', 67 => '\| title = Character Code Charts', 68 => '\| url = https://www.unicode.org/charts/', 69 => '\| accessdate = 2010-03-17}}', 70 => '</ref>{{failed verification\|date=October 2013}}<ref>{{Cite web\|url=https://home.unicode.org/basic-info/faq/\|title=Unicode FAQ\|last=\|first=\|date=\|website=\|url-status=live\|archive-url=\|archive-date=\|access-date=2020-04-02}}</ref>', 71 => 'A total of 154 [[Script (Unicode)\|scripts]] are included in the latest version of Unicode (covering [[alphabet]]s, [[abugida]]s and [[Syllabary\|syllabaries]]), although there are still scripts that are not yet encoded, particularly those mainly used in historical, liturgical, and academic contexts. 
Further additions of characters to the already encoded scripts, as well as symbols, in particular for mathematics and [[musical notation\|music]] (in the form of notes and rhythmic symbols), also occur.', 72 => 'The Unicode Roadmap Committee ([[Michael Everson]], Rick McGowan, Ken Whistler, V.S. Umamaheswaran<ref>{{Cite web \| title=Roadmap to the BMP \| url=https://www.unicode.org/roadmaps/bmp/ \| publisher=[[Unicode Consortium]] \| accessdate=30 July 2018 }}</ref>) maintain the list of scripts that are candidates or potential candidates for encoding and their tentative code block assignments on the [https://www.unicode.org/roadmaps/ Unicode Roadmap] page of the [[Unicode Consortium]] Web site. For some scripts on the Roadmap, such as [[Jurchen script\|Jurchen]] and [[Khitan small script]], encoding proposals have been made and they are working their way through the approval process. For others scripts, such as [[Maya script\|Mayan]] (besides numbers) and [[Rongorongo]], no proposal has yet been made, and they await agreement on character repertoire and other details from the user communities involved.', 73 => 'Some modern invented scripts which have not yet been included in Unicode (e.g., [[Tengwar]]) or which do not qualify for inclusion in Unicode due to lack of real-world use (e.g., [[Klingon scripts\|Klingon]]) are listed in the [[ConScript Unicode Registry]], along with unofficial but widely used [[Private Use Areas]] code assignments.', 74 => 'There is also a [[Medieval Unicode Font Initiative]] focused on special Latin medieval characters. Part of these proposals have been already included into Unicode.', 75 => 'The [https://linguistics.berkeley.edu/sei/ Script Encoding Initiative], a project run by Deborah Anderson at the [[University of California, Berkeley]] was founded in 2002 with the goal of funding proposals for scripts not yet encoded in the standard. 
The project has become a major source of proposed additions to the standard in recent years.<ref>{{cite web\|url=https://www.unicode.org/pending/about-sei.html \|title=About The Script Encoding Initiative \|publisher=The Unicode Consortium \|date= \|accessdate=2012-06-04}}</ref>', 76 => '==={{anchor\|1.0.0\|1.0.1\|1.1\|2.0\|2.1\|3.0\|3.1\|3.2\|4.0\|4.1\|5.0\|5.1\|5.2\|6.0\|6.1\|6.2\|6.3\|7.0\|8.0\|9.0\|10.0\|11.0\|12.0\|12.1\|13.0\|14.0}}Versions===', 77 => 'Unicode is developed in conjunction with the [[International Organization for Standardization]] and shares the character repertoire with [[ISO/IEC 10646]]: the Universal Character Set. Unicode and ISO/IEC 10646 function equivalently as character encodings, but ''The Unicode Standard'' contains much more information for implementers, covering—in depth—topics such as bitwise encoding, [[Unicode collation algorithm\|collation]] and rendering. The Unicode Standard enumerates a multitude of character properties, including those needed for supporting [[Bi-directional text\|bidirectional text]]. The two standards do use slightly different terminology.', 78 => '', 79 => 'The Unicode Consortium first published ''The Unicode Standard'' in 1991 (version 1.0), and has published new versions on a regular basis since then. The latest version of the Unicode Standard, version 13.0, was released in March 2020, and is available in electronic format from the consortium's website. 
The last version of the standard that was published completely in book form (including the code charts) was version 5.0 in 2006, but since version 5.2 (2009) the core specification of the standard has been published as a print-on-demand paperback.<ref name=version6.1PoD>{{cite web\|title=Unicode 6.1 Paperback Available\|url=https://www.unicode.org/mail-arch/unicode-ml/y2012-m05/0240.html\|work=announcements_at_unicode.org\|accessdate=2012-05-30}}</ref> The entire text of each version of the standard, including the core specification, standard annexes and code charts, is freely available in [[PDF]] format on the Unicode website.', 80 => '', 81 => 'In April 2020, Unicode announced that the release of the forthcoming version 14.0 had been postponed by six months from its initial release of March 2021 due to the [[COVID-19 pandemic]].<ref>{{cite web\|title=Unicode 14.0 Delayed for 6 Months\|url=https://home.unicode.org/unicode-14-0-delayed-for-6-months/\|accessdate=2020-05-05}}</ref>', 82 => '', 83 => 'Thus far, the following major and minor versions of the Unicode standard have been published. Update versions, which do not include any changes to character repertoire, are signified by the third number (e.g., "version 4.0.1") and are omitted in the table below.<ref>{{cite web \| title = Enumerated Versions of The Unicode Standard \| url = https://www.unicode.org/versions/enumeratedversions.html \| accessdate = 2016-06-21}}</ref>', 84 => '\|+ Unicode versions', 85 => '!rowspan=2\| Version', 86 => '!rowspan=2\| Date', 87 => '!rowspan=2\| Book', 88 => '!rowspan=2\| Corresponding [[Universal Character Set\|ISO/IEC 10646]] edition', 89 => '!rowspan=2\| [[Script (Unicode)\|Scripts]]', 90 => '!colspan=2\| Characters', 91 => '! 
Total{{refn\|The number of characters listed for each version of Unicode is the total number of graphic and format characters (i.e., excluding [[Private Use Area\|private-use characters]], [[Unicode control characters\|control characters]], [[noncharacter\|noncharacters]] and [[surrogate code points]]).\|group=tablenote}}', 92 => '! Notable additions', 93 => '\|-', 94 => '\| 1.0.0', 95 => '\| October 1991', 96 => '\| {{ISBN\|0-201-56788-1}} (Vol. 1)', 97 => '\| 24', 98 => '\| 7,129', 99 => '\| Initial repertoire covers these scripts: [[Arabic script\|Arabic]], [[Armenian alphabet\|Armenian]], [[Bengali alphabet\|Bengali]], [[Zhuyin\|Bopomofo]], [[Cyrillic script\|Cyrillic]], [[Devanagari]], [[Georgian alphabet\|Georgian]], [[Greek alphabet\|Greek and Coptic]], [[Gujarati alphabet\|Gujarati]], [[Gurmukhi script\|Gurmukhi]], [[Hangul]], [[Hebrew alphabet\|Hebrew]], [[Hiragana]], [[Kannada alphabet\|Kannada]], [[Katakana]], [[Lao script\|Lao]], [[Latin script\|Latin]], [[Malayalam script\|Malayalam]], [[Oriya script\|Oriya]], [[Tamil script\|Tamil]], [[Telugu script\|Telugu]], [[Thai alphabet\|Thai]], and [[Tibetan script\|Tibetan]].<ref>{{cite web\| title = Unicode Data 1.0.0\|url= https://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt\| accessdate = 2010-03-16}}</ref>', 100 => '\| June 1992', 101 => '\| {{ISBN\|0-201-60845-6}} (Vol. 2)', 102 => '\| 25', 103 => '\| 28,327<br />(21,204 added; 6 removed)', 104 => '\| The initial set of 20,902 [[CJK Unified Ideographs]] is defined.<ref>', 105 => '{{cite web', 106 => '\| url = https://www.unicode.org/Public/reconstructed/1.0.1/UnicodeData.txt', 107 => '\| accessdate = 2010-03-16}}</ref>', 108 => '\| 1.1', 109 => '\| June 1993', 110 => '\| 24', 111 => '\| 34,168<br />(5,963 added; 89 removed; 33 reclassified as control characters)', 112 => '\| 4,306 more [[Hangul]] syllables added to original set of 2,350 characters. 
[[Tibetan script\|Tibetan]] removed.<ref>{{cite web', 113 => '\| url = https://www.unicode.org/Public/1.1-Update/UnicodeData-1.1.5.txt', 114 => '\| accessdate = 2010-03-16 }}', 115 => '</ref>', 116 => '\| 2.0', 117 => '\| July 1996', 118 => '\| {{ISBN\|0-201-48345-9}}', 119 => '\| ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7', 120 => '\| 25', 121 => '\| 38,885<br />(11,373 added; 6,656 removed)', 122 => '\| Original set of [[Hangul]] syllables removed, and a new set of 11,172 Hangul syllables added at a new location. [[Tibetan script\|Tibetan]] added back in a new location and with a different character repertoire. Surrogate character mechanism defined, and Plane 15 and Plane 16 [[Private use (Unicode)\|Private Use Areas]] allocated.<ref>{{cite web', 123 => '\| title = Unicode Data-2.0.14', 124 => '\| url = https://www.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt', 125 => '\| accessdate = 2010-03-16}}', 126 => '</ref>', 127 => '\| 2.1', 128 => '\| May 1998', 129 => '\| ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7, as well as two characters from Amendment 18', 130 => '\| 38,887<br />(2 added)', 131 => '\| [[Euro sign]] and [[Specials (Unicode block)\|Object Replacement Character]] added.<ref>{{cite web', 132 => '\| title = Unicode Data-2.1.2', 133 => '\| url = https://www.unicode.org/Public/2.1-Update/UnicodeData-2.1.2.txt', 134 => '\| accessdate = 2010-03-16}}', 135 => '</ref>', 136 => '\| 3.0', 137 => '\| September 1999', 138 => '\| {{ISBN\|0-201-61633-5}}', 139 => '\| 38', 140 => '\| 49,194<br />(10,307 added)', 141 => '\| [[Cherokee syllabary\|Cherokee]], [[Ge'ez alphabet\|Ethiopic]], [[Khmer script\|Khmer]], [[Mongolian script\|Mongolian]], [[Burmese script\|Burmese]], [[Ogham]], [[Runic alphabet\|Runic]], [[Sinhala script\|Sinhala]], [[Syriac alphabet\|Syriac]], [[Tāna\|Thaana]], [[Canadian Aboriginal syllabics\|Unified Canadian Aboriginal Syllabics]], and [[Yi script\|Yi Syllables]] added, as well as a set of [[Braille]] patterns.<ref>{{cite 
web', 142 => '\| title = Unicode Data-3.0.0', 143 => '\| url = https://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt', 144 => '\| accessdate = 2010-03-16}}', 145 => '</ref>', 146 => '\| 3.1', 147 => '\| March 2001', 148 => '\| 41', 149 => '\| 94,140<br />(44,946 added)', 150 => '\| [[Deseret alphabet\|Deseret]], [[Gothic alphabet\|Gothic]] and [[Old Italic alphabet\|Old Italic]] added, as well as sets of symbols for [[Modern musical symbols\|Western music]] and [[Byzantine music]], and 42,711 additional [[CJK Unified Ideographs]].<ref>{{cite web', 151 => '\| title =Unicode Data-3.1.0', 152 => '\| url =https://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt', 153 => '\| accessdate = 2010-03-16 }}', 154 => '</ref>', 155 => '\| 3.2', 156 => '\| March 2002', 157 => '\| ISO/IEC 10646-1:2000 plus Amendment 1', 158 => '\| 45', 159 => '\| 95,156<br />(1,016 added)', 160 => '\| [[Philippines\|Philippine]] scripts [[Buhid script\|Buhid]], [[Hanunó'o script\|Hanunó'o]], [[Baybayin\|Tagalog]], and [[Tagbanwa script\|Tagbanwa]] added.<ref>{{cite web', 161 => '\| title = Unicode Data-3.2.0', 162 => '\| url = https://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt', 163 => '\| accessdate = 2010-03-16}}', 164 => '</ref>', 165 => '\| 4.0', 166 => '\| April 2003', 167 => '\| {{ISBN\|0-321-18578-1}}', 168 => '\| 52', 169 => '\| 96,382<br />(1,226 added)', 170 => '\| [[Cypriot syllabary]], [[Limbu script\|Limbu]], [[Linear B]], [[Osmanya script\|Osmanya]], [[Shavian alphabet\|Shavian]], [[Tai Nüa language#Writing system\|Tai Le]], and [[Ugaritic alphabet\|Ugaritic]] added, as well as [[Hexagram (I Ching)\|Hexagram symbols]].<ref>{{cite web', 171 => '\| title = Unicode Data-4.0.0', 172 => '\| url = https://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt', 173 => '\| accessdate = 2010-03-16}}', 174 => '</ref>', 175 => '\| 4.1', 176 => '\| March 2005', 177 => '\| ISO/IEC 10646:2003 plus Amendment 1', 178 => '\| 59', 179 => '\| 97,655<br />(1,273 added)', 
180 => '\| [[Lontara alphabet\|Buginese]], [[Glagolitic alphabet\|Glagolitic]], [[Kharoṣṭhī\|Kharoshthi]], [[New Tai Lue alphabet\|New Tai Lue]], [[Old Persian cuneiform script\|Old Persian]], [[Sylheti Nagari\|Syloti Nagri]], and [[Tifinagh]] added, and [[Coptic alphabet\|Coptic]] was disunified from [[Greek alphabet\|Greek]]. Ancient [[Unicode numerals#Ancient Greek numerals\|Greek numbers]] and [[Musical notation#Ancient Greece\|musical symbols]] were also added.<ref>{{cite web\|url=https://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt\|title=Unicode Data-4.1.0\|accessdate=2010-03-16}}', 181 => '</ref>', 182 => '\| 5.0', 183 => '\| July 2006', 184 => '\| {{ISBN\|0-321-48091-0}}', 185 => '\| ISO/IEC 10646:2003 plus Amendments 1 and 2, as well as four characters from Amendment 3', 186 => '\| 64', 187 => '\| 99,024<br />(1,369 added)', 188 => '\| [[Balinese alphabet\|Balinese]], [[Cuneiform]], [[N'Ko alphabet\|N'Ko]], [[Phags-pa script\|Phags-pa]], and [[Phoenician alphabet\|Phoenician]] added.<ref>{{cite web', 189 => '\| url = https://www.unicode.org/Public/5.0.0/ucd/UnicodeData.txt', 190 => '\| accessdate = 2010-03-17}}', 191 => '</ref>', 192 => '\| 5.1', 193 => '\| April 2008', 194 => '\| ISO/IEC 10646:2003 plus Amendments 1, 2, 3 and 4', 195 => '\| 75', 196 => '\| 100,648<br />(1,624 added)', 197 => '\| [[Carian script\|Carian]], [[Cham alphabet\|Cham]], [[Kayah Li script\|Kayah Li]], [[Lepcha script\|Lepcha]], [[Lycian script\|Lycian]], [[Lydian script\|Lydian]], [[Ol Chiki script\|Ol Chiki]], [[Rejang script\|Rejang]], [[Saurashtra script\|Saurashtra]], [[Sundanese script\|Sundanese]], and [[Vai syllabary\|Vai]] added, as well as sets of symbols for the [[Phaistos Disc]], [[Mahjong\|Mahjong tiles]], and [[Dominoes\|Domino tiles]]. 
There were also important additions for [[Burmese script\|Burmese]], additions of letters and [[Scribal abbreviation]]s used in medieval [[manuscript]]s, and the addition of [[Capital ẞ]].<ref>{{cite web', 198 => '\| url = https://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt', 199 => '\| accessdate = 2010-03-17 }}', 200 => '</ref>', 201 => '\| 5.2', 202 => '\| October 2009', 203 => '\| {{ISBN\|978-1-936213-00-9}}', 204 => '\| ISO/IEC 10646:2003 plus Amendments 1, 2, 3, 4, 5 and 6', 205 => '\| 90', 206 => '\| 107,296<br />(6,648 added)', 207 => '\| [[Avestan alphabet\|Avestan]], [[Bamum script\|Bamum]], [[Egyptian hieroglyphs]] (the [[Gardiner's sign list\|Gardiner Set]], comprising 1,071 characters), [[Imperial Aramaic]], [[Inscriptional Pahlavi]], [[Inscriptional Parthian]], [[Javanese script\|Javanese]], [[Kaithi]], [[Fraser alphabet\|Lisu]], [[Meitei Mayek script\|Meetei Mayek]], [[South Arabian alphabet\|Old South Arabian]], [[Old Turkic script\|Old Turkic]], [[Samaritan script\|Samaritan]], [[Tai Tham script\|Tai Tham]] and [[Tai Viet script\|Tai Viet]] added. 4,149 additional [[CJK Unified Ideographs]] (CJK-C), as well as extended Jamo for [[Hangul\|Old Hangul]], and characters for [[Vedic Sanskrit]].<ref>{{cite web', 208 => '\| url = https://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt', 209 => '\| accessdate = 2010-03-17}}', 210 => '</ref>', 211 => '\| 6.0', 212 => '\| October 2010', 213 => '\| {{ISBN\|978-1-936213-01-6}}', 214 => '\| ISO/IEC 10646:2010 plus the [[Indian rupee sign]]', 215 => '\| 93', 216 => '\| 109,384<br />(2,088 added)', 217 => '\| [[Batak alphabet\|Batak]], [[Brāhmī script\|Brahmi]], [[Mandaic alphabet\|Mandaic]], [[playing card]] symbols, [[Traffic sign\|transport]] and [[map]] symbols, [[alchemical symbol]]s, [[emoticons]] and [[emoji]]. 
222 additional [[CJK Unified Ideographs]] (CJK-D) added.<ref>{{cite web', 218 => '\| url = https://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt', 219 => '\| accessdate = 2010-10-11}}', 220 => '</ref>', 221 => '\| 6.1', 222 => '\| January 2012', 223 => '\| {{ISBN\|978-1-936213-02-3}}', 224 => '\| 100', 225 => '\| 110,116<br />(732 added)', 226 => '\| [[Chakma alphabet\|Chakma]], [[Meroitic alphabet\|Meroitic cursive]], [[Meroitic alphabet\|Meroitic hieroglyphs]], [[Pollard script\|Miao]], [[Śāradā script\|Sharada]], [[Sora Sompeng]], and [[Takri alphabet\|Takri]].<ref>{{cite web', 227 => '\| url = https://www.unicode.org/Public/6.1.0/ucd/UnicodeData.txt', 228 => '\| accessdate = 2012-01-31}}', 229 => '</ref>', 230 => '\| 6.2', 231 => '\| September 2012', 232 => '\| {{ISBN\|978-1-936213-07-8}}', 233 => '\| ISO/IEC 10646:2012 plus the [[Turkish lira sign]]', 234 => '\| 100', 235 => '\| 110,117<br />(1 added)', 236 => '\| [[Turkish lira sign]].<ref>{{cite web', 237 => '\| url = https://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt', 238 => '\| accessdate = 2012-09-26}}', 239 => '</ref>', 240 => '\| 6.3', 241 => '\| September 2013', 242 => '\| {{ISBN\|978-1-936213-08-5}}', 243 => '\| ISO/IEC 10646:2012 plus six characters', 244 => '\| 100', 245 => '\| 110,122<br />(5 added)', 246 => '\| 5 bidirectional formatting characters.<ref>{{cite web', 247 => '\| url = https://www.unicode.org/Public/6.3.0/ucd/UnicodeData.txt', 248 => '\| accessdate = 2013-09-30}}', 249 => '</ref>', 250 => '\| 7.0', 251 => '\| June 2014', 252 => '\| {{ISBN\|978-1-936213-09-2}}', 253 => '\| ISO/IEC 10646:2012 plus Amendments 1 and 2, as well as the [[Ruble sign]]', 254 => '\| 123', 255 => '\| 112,956<br />(2,834 added)', 256 => '\| [[Bassa alphabet\|Bassa Vah]], [[Caucasian Albanian alphabet\|Caucasian Albanian]], [[Duployan shorthand\|Duployan]], [[Elbasan alphabet\|Elbasan]], [[Grantha alphabet\|Grantha]], [[Khojki]], [[Khudabadi alphabet\|Khudawadi]], [[Linear A]], [[Mahajani]], 
[[Manichaean alphabet\|Manichaean]], [[Mende script\|Mende Kikakui]], [[Modi alphabet\|Modi]], [[Mro script\|Mro]], [[Nabataean alphabet\|Nabataean]], [[Old North Arabian]], [[Old Permic alphabet\|Old Permic]], [[Pahawh Hmong]], [[Palmyrene script\|Palmyrene]], [[Pau Cin Hau script\|Pau Cin Hau]], [[Psalter Pahlavi]], [[Siddhaṃ alphabet\|Siddham]], [[Tirhuta]], [[Warang Citi]], and [[Dingbat]]s.<ref>{{cite web', 257 => '\| url = https://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt', 258 => '\| accessdate = 2014-06-15}}', 259 => '</ref>', 260 => '\| 8.0', 261 => '\| June 2015', 262 => '\| {{ISBN\|978-1-936213-10-8}}', 263 => '\| ISO/IEC 10646:2014 plus Amendment 1, as well as the [[Georgian lari\|Lari sign]], nine CJK unified ideographs, and 41 emoji characters<ref>{{Cite web \| title=Unicode 8.0.0 \| url=https://www.unicode.org/versions/Unicode8.0.0/ \| publisher=Unicode Consortium \| accessdate=2015-06-17 }}</ref>', 264 => '\| 129', 265 => '\| 120,672<br />(7,716 added)', 266 => '\| [[Ahom alphabet\|Ahom]], [[Anatolian hieroglyphs]], [[Hatran alphabet\|Hatran]], [[Multani alphabet\|Multani]], [[Old Hungarian alphabet\|Old Hungarian]], [[SignWriting]], 5,771 [[CJK Unified Ideographs\|CJK unified ideographs]], a set of lowercase letters for [[Cherokee syllabary\|Cherokee]], and five emoji [[Fitzpatrick scale\|skin tone]] modifiers<ref>{{cite web', 267 => '\| url = https://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt', 268 => '\| accessdate = 2015-06-17}}', 269 => '</ref>', 270 => '\| 9.0', 271 => '\| June 2016', 272 => '\| {{ISBN\|978-1-936213-13-9}}', 273 => '\| ISO/IEC 10646:2014 plus Amendments 1 and 2, as well as Adlam, Newa, Japanese TV symbols, and 74 emoji and symbols<ref>{{Cite web \| title=Unicode 9.0.0 \| url=https://www.unicode.org/versions/Unicode9.0.0/ \| publisher=Unicode Consortium \| accessdate=2016-06-21 }}</ref>', 274 => '\| 135', 275 => '\| 128,172<br />(7,500 added)', 276 => '\| [[Adlam script\|Adlam]], [[Bhaiksuki alphabet\|Bhaiksuki]], 
[[Zhang-Zhung language#Scripts\|Marchen]], [[Prachalit Nepal alphabet\|Newa]], [[Osage alphabet\|Osage]], [[Tangut script\|Tangut]], and 72 [[emoji]]<ref>{{cite web', 277 => '\| url = https://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt', 278 => '\| accessdate = 2016-06-21}}', 279 => '</ref><ref name=laobo>{{cite web\|first=Martim\|last=Lobao\|url=https://www.androidpolice.com/2016/06/07/two-emoji-werent-approved-unicode-9-google-added-android-anyway/ \|title=These Are The Two Emoji That Weren't Approved For Unicode 9 But Which Google Added To Android Anyway\|website=Android Police\|date= 7 June 2016\|accessdate=4 September 2016}}</ref>', 280 => '\| 10.0', 281 => '\| June 2017', 282 => '\| {{ISBN\|978-1-936213-16-0}}', 283 => '\| ISO/IEC 10646:2017 plus 56 [[emoji]] characters, 285 [[hentaigana]] characters, and 3 Zanabazar Square characters<ref>{{Cite web \| title=Unicode 10.0.0 \| url=https://www.unicode.org/versions/Unicode10.0.0/ \| publisher=Unicode Consortium \| accessdate=2017-06-20 }}</ref>', 284 => '\| 139', 285 => '\| 136,690<br />(8,518 added)', 286 => '\| [[Zanabazar Square alphabet\|Zanabazar Square]], [[Soyombo alphabet\|Soyombo]], [[Masaram Gondi script\|Masaram Gondi]], [[Nüshu script\|Nüshu]], [[hentaigana]] (non-standard [[hiragana]]), 7,494 [[CJK Unified Ideographs\|CJK unified ideographs]], and 56 [[emoji]]', 287 => '\| June 2018', 288 => '\| {{ISBN\|978-1-936213-19-1}}', 289 => '\| ISO/IEC 10646:2017 plus Amendment 1, as well as 46 Mtavruli Georgian capital letters, 5 CJK unified ideographs, and 66 emoji characters.<ref>{{Cite web \| title=The Unicode Standard, Version 11.0.0 Appendix C \| url=https://www.unicode.org/versions/Unicode11.0.0/appC.pdf \| publisher=Unicode Consortium \| accessdate=2018-06-11 }}</ref>', 290 => '\| 137,374<br />(684 added)', 291 => '\| [[Dogri script\|Dogra]], [[Georgian scripts#Mkhedruli\|Georgian Mtavruli]] capital letters, [[Gunjala Gondi Lipi\|Gunjala Gondi]], [[Hanifi Rohingya script\|Hanifi Rohingya]], 
[[Indic Siyaq Numbers (Unicode block)\|Indic Siyaq numbers]], [[Makassarese language\|Makasar]], [[Medefaidrin]], [[Sogdian alphabet\|Old Sogdian and Sogdian]], [[Mayan numerals]], 5 urgently needed [[CJK Unified Ideographs\|CJK unified ideographs]], symbols for [[xiangqi]] (Chinese chess) and [[Star (classification)\|star ratings]], and 145 [[emoji]]<ref>{{Cite web\|url=http://blog.unicode.org/2018/06/announcing-unicode-standard-version-110.html\|title=Announcing The Unicode Standard, Version 11.0\|website=blog.unicode.org\|access-date=2018-06-06}}</ref>', 292 => '\| 12.0', 293 => '\| March 2019', 294 => '\| {{ISBN\|978-1-936213-22-1}}', 295 => '\| ISO/IEC 10646:2017 plus Amendments 1 and 2, as well as 62 additional characters.<ref>{{Cite web \| title=The Unicode Standard, Version 12.0.0 Appendix C \| url=https://www.unicode.org/versions/Unicode12.0.0/appC.pdf \| publisher=Unicode Consortium \| accessdate=2019-03-05 }}</ref>', 296 => '\| 150', 297 => '\| 137,928<br />(554 added)', 298 => '\| [[Elymaic]], [[Nandinagari]], [[Nyiakeng Puachue Hmong]], [[Wancho script\|Wancho]], [[Pollard script\|Miao script]] additions for several Miao and Yi dialects in China, [[hiragana]] and [[katakana]] small letters for writing archaic Japanese, [[Tamil script\|Tamil]] historic fractions and symbols, [[Lao alphabet\|Lao]] letters for [[Pali]], Latin letters for Egyptological and Ugaritic transliteration, hieroglyph format controls, and 61 [[emoji]]<ref>{{Cite web\|url=http://blog.unicode.org/2019/03/announcing-unicode-standard-version-120.html\|title=Announcing The Unicode Standard, Version 12.0\|website=blog.unicode.org\|access-date=2019-03-05}}</ref>', 299 => '\| 12.1', 300 => '\| May 2019', 301 => '\| {{ISBN\|978-1-936213-25-2}}', 302 => '\| 150', 303 => '\| 137,929<br />(1 added)', 304 => '\| Adds a single character at U+32FF for the square ligature form of the name of the [[Reiwa\|Reiwa era]].<ref>{{Cite 
web\|url=http://blog.unicode.org/2019/05/unicode-12-1-en.html\|title=Unicode Version 12.1 released in support of the Reiwa Era\|website=blog.unicode.org\|access-date=2019-05-07}}</ref>', 305 => '\| [http://www.unicode.org/versions/Unicode13.0.0/ 13.0]', 306 => '\| March 2020', 307 => '\| {{ISBN\|978-1-936213-26-9}}', 308 => '\| ISO/IEC 10646:2020<ref>{{Cite web \| title=The Unicode Standard, Version 13.0– Core Specification Appendix C \| url=https://www.unicode.org/versions/Unicode13.0.0/appC.pdf \| publisher=Unicode Consortium \| accessdate=2020-03-11 }}</ref>', 309 => '\| 154', 310 => '\| 143,859<br />(5,930 added)', 311 => '\| [[Khwarezmian_language#Writing_system\|Chorasmian]], [[Dhives akuru\|Dives Akuru]], [[Khitan small script]], [[Kurdish alphabets#Yezidi\|Yezidi]], 4,969 CJK unified ideographs added (including 4,939 in [[CJK Unified Ideographs Extension G\|Ext. G]]), Arabic script additions used to write [[Hausa language\|Hausa]], [[Wolof language\|Wolof]], and other languages in Africa and other additions used to write [[Hindko]] and [[Punjabi language\|Punjabi]] in Pakistan, Bopomofo additions used for Cantonese, Creative Commons license symbols, graphic characters for compatibility with teletext and home computer systems from the 1970s and 1980s, and 55 emoji<ref>{{Cite web\|url=http://blog.unicode.org/2020/03/announcing-unicode-standard-version-130.html\|title=Announcing The Unicode Standard, Version 13.0\|website=blog.unicode.org\|access-date=2020-03-11}}</ref>', 312 => '{{Reflist\|group=tablenote}}', 313 => '==<span id="Upluslink"></span><span id="codespace"></span> Architecture and terminology==', 314 => '{{See also\|Universal Character Set characters}}<!-- Template:U+ links to this paragraph -->', 315 => 'The Unicode Standard defines a ''codespace''<ref name="Glossary">{{cite web\|title = Glossary of Unicode Terms\|url=https://unicode.org/glossary/\|accessdate=2010-03-16}}</ref> of numerical values ranging from 0 through 
10FFFF<sub>[[hexadecimal\|16]]</sub>,<ref>{{cite book\|title=The Unicode Standard, Version 13.0 \|url=https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212\|year=2019\|page=19\|chapter=3.4 Characters and Encoding}}</ref> called ''[[code point\|code points]]''<ref name=":0">{{Cite book\|url=http://www.unicode.org/versions/Unicode12.0.0/ch02.pdf#G25564\|title=The Unicode Standard Version 12.0 – Core Specification\|last=\|first=\|publisher=\|year=2019\|isbn=\|location=\|page=29\|chapter=2.4 Code Points and Characters}}</ref> and denoted as U+0000 through U+10FFFF ("U+" plus the code point value in [[hexadecimal]], prepended with [[leading zero\|leading zeros]] as necessary to result in a minimum of four digits, ''e. g.'', U+00F7 for the division sign, ÷, versus U+13254 for the [[Egyptian hieroglyph]] designating a [[List of hieroglyphs#O\|reed shelter]] or a [[c:Category:Winding wall (h hieroglyph)\|winding wall]] {{nowrap\|( [[File:Hiero O4.png\|text-bottom\|15px]] )}}<ref>{{Cite web\|url=https://www.unicode.org/versions/Unicode13.0.0/appA.pdf\|date=March 2020\|title=Appendix A: Notational Conventions\|publisher=Unicode Consortium\|work=The Unicode Standard}} In conformity with the bullet point relating to Unicode in [[MOS:ALLCAPS]], the formal Unicode names are not used in this paragraph.</ref>), respectively. Out of these 2<sup>16</sup> + 2<sup>20</sup> defined code points, the code points from U+D800 through U+DFFF, which are used to encode surrogate pairs in [[UTF-16]], are reserved by the Standard and may not be used to encode valid characters, resulting in a net total of 2<sup>16</sup> − 2<sup>11</sup> + 2<sup>20</sup> = 1,112,064 possible code points corresponding to valid Unicode characters. 
Not all of these code points necessarily correspond to visible characters; several, for example, are assigned to control codes such as [[carriage return]].', 316 => '===Code point planes and blocks===', 317 => '{{Main\|Plane (Unicode)}}', 318 => 'The Unicode codespace is divided into seventeen ''planes'', numbered 0 to 16:', 319 => '', 320 => '{{Planes (Unicode)}}', 321 => '', 322 => 'All code points in the Basic Multilingual Plane (BMP, Plane 0) are accessed as a single code unit in [[UTF-16]] encoding and can be encoded in one, two or three bytes in [[UTF-8]]. Code points in Planes 1 through 16 (''supplementary planes'') are accessed as surrogate pairs in UTF-16 and encoded in four bytes in UTF-8.', 323 => 'Within each plane, characters are allocated within named ''[[Block (Unicode)\|blocks]]'' of related characters. Although block sizes vary, they are always a multiple of 16 code points and often a multiple of 128 code points. Characters required for a given script may be spread out over several different blocks.', 324 => '===General Category property===', 325 => 'Each code point has a single [[Character property (Unicode)#General Category\|General Category]] property. The major categories are denoted: Letter, Mark, Number, Punctuation, Symbol, Separator and Other. Within these categories, there are subdivisions. In most cases other properties must be used to sufficiently specify the characteristics of a code point. The possible General Categories are:', 326 => '{{General Category (Unicode)}}', 327 => 'Code points in the range U+D800–U+DBFF (1,024 code points) are known as high-'''surrogate''' code points, and code points in the range U+DC00–U+DFFF (1,024 code points) are known as low-surrogate code points. A high-surrogate code point followed by a low-surrogate code point forms a surrogate pair in [[UTF-16]] to represent code points greater than U+FFFF. 
These code points otherwise cannot be used (this rule is often ignored in practice, especially when not using UTF-16).', 328 => 'A small set of code points is guaranteed never to be used for encoding characters, although applications may make use of these code points internally if they wish. There are sixty-six of these '''noncharacters''': U+FDD0–U+FDEF and any code point ending in the value FFFE or FFFF (i.e., U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF). The set of noncharacters is stable, and no new noncharacters will ever be defined.<ref name="stability-policy">{{cite web', 329 => '\| title = Unicode Character Encoding Stability Policy', 330 => '\| url = https://unicode.org/policies/stability_policy.html', 331 => '\| accessdate = 2010-03-16}}', 332 => '</ref> Like surrogates, the rule that these cannot be used is often ignored, although the operation of the [[byte order mark]] assumes that U+FFFE will never be the first code point in a text.', 333 => 'Excluding surrogates and noncharacters leaves 1,111,998 code points available for use.', 334 => ''''Private-use''' code points are considered to be assigned characters, but they have no interpretation specified by the Unicode standard<ref>{{cite web', 335 => '\| title = Properties', 336 => '\| url = https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G43463', 337 => '\| accessdate = 2020-03-15 }}', 338 => '</ref> so any interchange of such characters requires an agreement between sender and receiver on their interpretation. There are three private-use areas in the Unicode codespace:', 339 => '* Private Use Area: U+E000–U+F8FF (6,400 characters)', 340 => '* Supplementary Private Use Area-A: U+F0000–U+FFFFD (65,534 characters)', 341 => '* Supplementary Private Use Area-B: U+100000–U+10FFFD (65,534 characters).', 342 => 'Graphic characters are characters defined by Unicode to have particular semantics, and either have a visible [[glyph]] shape or represent a visible space. 
As of Unicode 13.0 there are 143,696 graphic characters.', 343 => ''''Format''' characters are characters that do not have a visible appearance, but may have an effect on the appearance or behavior of neighboring characters. For example, {{unichar\|200C\|Zero-width non-joiner\|nlink=}} and {{unichar\|200D\|Zero-width joiner\|nlink=}} may be used to change the default shaping behavior of adjacent characters (e.g., to inhibit ligatures or request ligature formation). There are 163 format characters in Unicode 13.0.', 344 => 'Sixty-five code points (U+0000–U+001F and U+007F–U+009F) are reserved as '''control''' codes, and correspond to the [[C0 and C1 control codes]] defined in [[ISO/IEC 6429]]. U+0009 (Tab), U+000A (Line Feed), and U+000D (Carriage Return) are widely used in Unicode-encoded texts. In practice, C1 code points that appear in text are often improperly translated ([[Mojibake]]) legacy [[Windows-1252]] characters, found in some English and Western European texts produced with Windows technologies.', 345 => 'Graphic characters, format characters, control code characters, and private use characters are known collectively as ''assigned characters''. '''Reserved''' code points are those code points which are available for use, but are not yet assigned. As of Unicode 13.0 there are 830,606 reserved code points.', 346 => '===Abstract characters===', 347 => 'The set of graphic and format characters defined by Unicode does not correspond directly to the repertoire of ''abstract characters'' that is representable under Unicode. Unicode encodes characters by associating an abstract character with a particular code point.<ref>{{cite web', 348 => '\| title = Unicode Character Encoding Model', 349 => '\| url = https://unicode.org/reports/tr17/', 350 => '\| accessdate = 2010-03-16}}', 351 => '</ref> However, not all abstract characters are encoded as a single Unicode character, and some abstract characters may be represented in Unicode by a sequence of two or more characters. 
For example, a Latin small letter "i" with an [[ogonek]], a [[dot above]], and an [[acute accent]], which is required in [[Lithuanian language\|Lithuanian]], is represented by the character sequence U+012F, U+0307, U+0301. Unicode maintains a list of uniquely named character sequences for abstract characters that are not directly encoded in Unicode.<ref>{{cite web', 352 => '\| title = Unicode Named Sequences', 353 => '\| url = https://unicode.org/Public/UNIDATA/NamedSequences.txt', 354 => '\| accessdate = 2010-03-16}}', 355 => '</ref>', 356 => 'All graphic, format, and private use characters have a unique and immutable name by which they may be identified. This immutability has been guaranteed since Unicode version 2.0 by the Name Stability policy.<ref name="stability-policy" /> In cases where the name is seriously defective and misleading, or has a serious typographical error, a formal alias may be defined, and applications are encouraged to use the formal alias in place of the official character name. For example, {{unichar\|A015\|YI SYLLABLE WU}} has the formal alias {{sc2\|YI SYLLABLE ITERATION MARK}}, and {{unichar\|FE18\|PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRA'''KC'''ET\|note=[[sic]]}} has the formal alias {{sc2\|PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRA'''CK'''ET}}.<ref>{{cite web', 357 => '\| title = Unicode Name Aliases', 358 => '\| url = https://unicode.org/Public/UNIDATA/NameAliases.txt', 359 => '\| accessdate = 2010-03-16}}</ref>', 360 => '===Ready-made versus composite characters===', 361 => 'Unicode includes a mechanism for modifying characters that greatly extends the supported glyph repertoire. This covers the use of [[combining diacritical mark]]s that may be added after the base character by the user. Multiple combining diacritics may be simultaneously applied to the same character. Unicode also contains [[precomposed character\|precomposed]] versions of most letter/diacritic combinations in normal use. 
These make conversion to and from legacy encodings simpler, and allow applications to use Unicode as an internal text format without having to implement combining characters. For example, ''é'' can be represented in Unicode as [[#Upluslink\|U+]]0065 ({{sc2\|LATIN SMALL LETTER E}}) followed by U+0301 ({{sc2\|COMBINING ACUTE ACCENT}}), but it can also be represented as the precomposed character U+00E9 ({{sc2\|LATIN SMALL LETTER E WITH ACUTE}}). Thus, in many cases, users have multiple ways of encoding the same character. To deal with this, Unicode provides the mechanism of [[canonical equivalence]].', 362 => 'An example of this arises with [[Hangul]], the Korean alphabet. Unicode provides a mechanism for composing Hangul syllables with their individual subcomponents, known as [[Hangul Jamo]]. However, it also provides 11,172 combinations of precomposed syllables made from the most common jamo.', 363 => 'The [[CJK]] characters currently have codes only for their precomposed form. Still, most of those characters comprise simpler elements (called [[Radical_(Chinese_characters)\|radicals]]), so in principle Unicode could have decomposed them as it did with Hangul. This would have greatly reduced the number of required code points, while allowing the display of virtually every conceivable character (which might do away with some of the problems caused by [[Han unification]]). A similar idea is used by some [[input method]]s, such as [[Cangjie method\|Cangjie]] and [[Wubi method\|Wubi]]. However, attempts to do this for character encoding have stumbled over the fact that Chinese characters do not decompose as simply or as regularly as Hangul does.', 364 => 'A set of [[Radical (Chinese character)\|radicals]] was provided in Unicode 3.0 (CJK radicals between U+2E80 and U+2EFF, KangXi radicals in U+2F00 to U+2FDF, and ideographic description characters from U+2FF0 to U+2FFB), but the Unicode standard (ch. 
12.2 of Unicode 5.2) warns against using [[Ideographic Description Sequences\|ideographic description sequences]] as an alternate representation for previously encoded characters:', 365 => '{{quote\|This process is different from a formal ''encoding'' of an ideograph. There is no canonical description of unencoded ideographs; there is no semantic assigned to described ideographs; there is no equivalence defined for described ideographs. Conceptually, ideographic descriptions are more akin to the English phrase "an 'e' with an acute accent on it" than to the character sequence <U+0065, U+0301>.}}', 366 => '===Ligatures===', 367 => 'Many scripts, including [[Arabic script\|Arabic]] and [[Devanagari\|Devanāgarī]], have special orthographic rules that require certain combinations of letterforms to be combined into special [[ligature (typography)\|ligature forms]]. The rules governing ligature formation can be quite complex, requiring special script-shaping technologies such as ACE (Arabic Calligraphic Engine by DecoType in the 1980s and used to generate all the Arabic examples in the printed editions of the Unicode Standard), which became the [[proof of concept]] for [[OpenType]] (by Adobe and Microsoft), [[Graphite (SIL)\|Graphite]] (by [[SIL International]]), or [[Apple Advanced Typography\|AAT]] (by Apple).', 368 => 'Instructions are also embedded in fonts to tell the [[operating system]] how to properly output different character sequences. A simple solution to the placement of combining marks or diacritics is assigning the marks a width of zero and placing the glyph itself to the left or right of the left sidebearing (depending on the direction of the script they are intended to be used with). A mark handled this way will appear over whatever character precedes it, but will not adjust its position relative to the width or height of the base glyph; it may be visually awkward and it may overlap some glyphs. 
Real stacking is impossible, but can be approximated in limited cases (for example, Thai top-combining vowels and tone marks can just be at different heights to start with). Generally this approach is only effective in monospaced fonts, but may be used as a fallback rendering method when more complex methods fail.', 369 => '===Standardized subsets===', 370 => 'Several subsets of Unicode are standardized: Microsoft Windows since [[Windows NT 4.0]] supports [[WGL-4]] with 656 characters, which is considered to support all contemporary European languages using the Latin, Greek, or Cyrillic script. Other standardized subsets of Unicode include the Multilingual European Subsets:<ref>[https://www.evertype.com/standards/iso10646/pdf/cwa13873.pdf CWA 13873:2000 – Multilingual European Subsets in ISO/IEC 10646-1] [[European Committee for Standardization\|CEN]] Workshop Agreement 13873</ref>', 371 => 'MES-1 (Latin scripts only, 335 characters), MES-2 (Latin, Greek and Cyrillic; 1,062 characters)<ref>[https://www.cl.cam.ac.uk/~mgk25/ucs/mes-2-rationale.html Multilingual European Character Set 2 (MES-2) Rationale], [[Markus Kuhn (computer scientist)\|Markus Kuhn]], 1998</ref> and MES-3A & MES-3B (two larger subsets, not shown here). Note that MES-2 includes every character in MES-1 and WGL-4.', 372 => '{\| class="wikitable"', 373 => '\|+ {{nobold\|'''WGL-4''', ''MES-1'' and MES-2}}', 374 => '! Row !! Cells !! 
Range(s)', 375 => '!rowspan="2"\| 00', 376 => '\| '''''20–7E'''''', 377 => '\| [[Basic Latin (Unicode block)\|Basic Latin]] (00–7F)', 378 => '\| '''''A0–FF'''''', 379 => '\| [[Latin-1 Supplement (Unicode block)\|Latin-1 Supplement]] (80–FF)', 380 => '!rowspan="2"\| 01', 381 => '\| '''''00–13,'' 14–15, ''16–2B,'' 2C–2D, ''2E–4D,'' 4E–4F, ''50–7E,'' 7F'''', 382 => '\| [[Latin Extended-A]] (00–7F)', 383 => '\| 8F, '''92,''' B7, DE-EF, '''FA–FF'''', 384 => '\| [[Latin Extended-B]] (80–FF <span title="U+024F">...</span>)', 385 => '!rowspan="3"\| 02', 386 => '\| 18–1B, 1E–1F', 387 => '\| Latin Extended-B (<span title="U+00180">...</span> 00–4F)', 388 => '\| 59, 7C, 92', 389 => '\| [[IPA Extensions]] (50–AF)', 390 => '\| BB–BD, '''C6, ''C7,'' C9,''' D6, '''''D8–DB,'' DC, ''DD,''''' DF, EE', 391 => '\| [[Spacing Modifier Letters]] (B0–FF)', 392 => '! 03', 393 => '\| 74–75, 7A, 7E, '''84–8A, 8C, 8E–A1, A3–CE,''' D7, DA–E1', 394 => '\| [[Greek and Coptic\|Greek]] (70–FF)', 395 => '! 04', 396 => '\| '''00–5F, 90–91,''' 92–C4, C7–C8, CB–CC, D0–EB, EE–F5, F8–F9', 397 => '\| [[Cyrillic (Unicode block)\|Cyrillic]] (00–FF)', 398 => '! 1E', 399 => '\| 02–03, 0A–0B, 1E–1F, 40–41, 56–57, 60–61, 6A–6B, '''80–85,''' 9B, '''F2–F3'''', 400 => '\| [[Latin Extended Additional]] (00–FF)', 401 => '! 
1F', 402 => '\| 00–15, 18–1D, 20–45, 48–4D, 50–57, 59, 5B, 5D, 5F–7D, 80–B4, B6–C4, C6–D3, D6–DB, DD–EF, F2–F4, F6–FE', 403 => '\| [[Greek Extended]] (00–FF)', 404 => '!rowspan="3"\| 20', 405 => '\| '''13–14, ''15,'' 17, ''18–19,'' 1A–1B, ''1C–1D,'' 1E, 20–22, 26, 30, 32–33, 39–3A, 3C, 3E, 44,''' 4A', 406 => '\| [[General Punctuation]] (00–6F)', 407 => '\| '''7F''', 82', 408 => '\| [[Superscripts and Subscripts]] (70–9F)', 409 => '\| '''A3–A4, A7, ''AC,''''' AF', 410 => '\| [[Currency Symbols (Unicode block)\|Currency Symbols]] (A0–CF)', 411 => '!rowspan="3"\| 21', 412 => '\| '''05, 13, 16, ''22, 26,'' 2E'''', 413 => '\| [[Letterlike Symbols]] (00–4F)', 414 => '\| '''''5B–5E'''''', 415 => '\| [[Number Forms]] (50–8F)', 416 => '\| '''''90–93,'' 94–95, A8'''', 417 => '\| [[Arrows (Unicode block)\|Arrows]] (90–FF)', 418 => '! 22', 419 => '\| 00, '''02,''' 03, '''06,''' 08–09, '''0F, 11–12, 15, 19–1A, 1E–1F,''' 27–28, '''29,''' 2A, '''2B, 48,''' 59, '''60–61, 64–65,''' 82–83, 95, 97', 420 => '\| [[Mathematical Operators]] (00–FF)', 421 => '! 23', 422 => '\| '''02, 0A, 20–21,''' 29–2A', 423 => '\| [[Miscellaneous Technical]] (00–FF)', 424 => '!rowspan="3"\| 25', 425 => '\| '''00, 02, 0C, 10, 14, 18, 1C, 24, 2C, 34, 3C, 50–6C'''', 426 => '\| [[Box Drawing]] (00–7F)', 427 => '\| '''80, 84, 88, 8C, 90–93'''', 428 => '\| [[Block Elements]] (80–9F)', 429 => '\| '''A0–A1, AA–AC, B2, BA, BC, C4, CA–CB, CF, D8–D9, E6'''', 430 => '\| [[Geometric Shapes]] (A0–FF)', 431 => '! 26', 432 => '\| '''3A–3C, 40, 42, 60, 63, 65–66, ''6A,'' 6B'''', 433 => '\| [[Miscellaneous Symbols]] (00–FF)', 434 => '! F0', 435 => '\| (01–02)<!--in WGL-4, but not in MES-2-->', 436 => '\| [[Private Use Area (Unicode block)\|Private Use Area]] (00–FF ...)', 437 => '! FB', 438 => '\| '''01–02'''', 439 => '\| [[Alphabetic Presentation Forms]] (00–4F)', 440 => '! 
FF', 441 => '\| FD', 442 => '\| [[Specials (Unicode block)\|Specials]]', 443 => 'Rendering software which cannot process a Unicode character appropriately often displays it as an open rectangle, or the Unicode "[[replacement character]]" (U+FFFD, �), to indicate the position of the unrecognized character. Some systems have made attempts to provide more information about such characters. Apple's [[Last Resort font]] will display a substitute glyph indicating the Unicode range of the character, and the [[SIL International]]'s [[Unicode fallback font\|Unicode Fallback]] font will display a box showing the hexadecimal scalar value of the character.', 444 => '', 445 => '==={{anchor\|UTF\|UCS}}Mapping and encodings===', 446 => '', 447 => 'Several mechanisms have been specified for storing a series of code points as a series of bytes.', 448 => '', 449 => '<!-- [[Unicode Transformation Format]] redirects here -->', 450 => 'Unicode defines two mapping methods: the ''Unicode Transformation Format'' (UTF) encodings, and the ''[[Universal Coded Character Set]]'' (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode ''code points'' to sequences of values in some fixed-size range, termed ''code units''. All UTF encodings map code points to a unique sequence of bytes.<ref>{{cite web\|title=UTF-8, UTF-16, UTF-32 & BOM\|url=https://unicode.org/faq/utf_bom.html\|website=Unicode.org FAQ\|accessdate=12 December 2016}}</ref> The numbers in the names of the encodings indicate the number of bits per code unit (for UTF encodings) or the number of bytes per code unit (for UCS encodings and UTF-1). UTF-8 and UTF-16 are the most commonly used encodings. 
UCS-2 is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent.', 451 => '', 452 => 'UTF encodings include:', 453 => '', 454 => '* [[UTF-1]], a retired predecessor of UTF-8, maximizes compatibility with [[ISO/IEC 2022\|ISO 2022]], no longer part of ''The Unicode Standard''', 455 => '* [[UTF-7]], a 7-bit encoding sometimes used in e-mail, often considered obsolete (not part of ''The Unicode Standard'', but only documented as an informational [[Request for Comments\|RFC]], i.e., not on the Internet Standards Track)', 456 => '* [[UTF-8]], uses one to four bytes for each code point, maximizes compatibility with [[ASCII]]', 457 => '* [[UTF-EBCDIC]], similar to UTF-8 but designed for compatibility with [[EBCDIC]] (not part of ''The Unicode Standard'')', 458 => '* [[UTF-16]], uses one or two 16-bit code units per code point, cannot encode surrogates', 459 => '* [[UTF-32]], uses one 32-bit code unit per code point', 460 => 'UTF-8 uses one to four bytes per code point and, being compact for Latin scripts and ASCII-compatible, provides the ''de facto'' standard encoding for interchange of Unicode text. It is used by [[FreeBSD]] and most recent [[Linux distributions]] as a direct replacement for legacy encodings in general text handling.', 461 => 'The UCS-2 and UTF-16 encodings specify the Unicode [[Byte Order Mark]] (BOM) for use at the beginnings of text files, which may be used for byte ordering detection (or [[endianness\|byte endianness]] detection). 
The BOM, code point U+FEFF has the important property of unambiguity on byte reorder, regardless of the Unicode encoding used; U+FFFE (the result of byte-swapping U+FEFF) does not equate to a legal character, and U+FEFF in other places, other than the beginning of text, conveys the zero-width non-break space (a character with no appearance and no effect other than preventing the formation of [[ligature (typography)\|ligatures]]).', 462 => 'The same character converted to UTF-8 becomes the byte sequence <code>EF BB BF</code>. The Unicode Standard allows that the BOM "can serve as signature for UTF-8 encoded text where the character set is unmarked".<ref>{{Cite book \| title=The Unicode Standard, Version 6.2 \| publisher=The Unicode Consortium \| year=2013 \| isbn=978-1-936213-08-5 \| page=561 }}</ref> Some software developers have adopted it for other encodings, including UTF-8, in an attempt to distinguish UTF-8 from local 8-bit [[code page]]s. However {{IETF RFC\|3629}}, the UTF-8 standard, recommends that byte order marks be forbidden in protocols using UTF-8, but discusses the cases where this may not be possible. In addition, the large restriction on possible patterns in UTF-8 (for instance there cannot be any lone bytes with the high bit set) means that it should be possible to distinguish UTF-8 from other character encodings without relying on the BOM.', 463 => 'In UTF-32 and UCS-4, one [[32-bit]] code unit serves as a fairly direct representation of any character's code point (although the endianness, which varies across different platforms, affects how the code unit manifests as a byte sequence). In the other encodings, each code point may be represented by a variable number of code units. 
UTF-32 is widely used as an internal representation of text in programs (as opposed to stored or transmitted text), since every Unix operating system that uses the [[GNU Compiler Collection\|gcc]] compilers to generate software uses it as the standard "[[wide character]]" encoding. Some programming languages, such as [[Seed7]], use UTF-32 as internal representation for strings and characters. Recent versions of the [[Python (programming language)\|Python]] programming language (beginning with 2.2) may also be configured to use UTF-32 as the representation for Unicode strings, effectively disseminating such encoding in [[high-level programming language\|high-level]] coded software.', 464 => '[[Punycode]], another encoding form, enables the encoding of Unicode strings into the limited character set supported by the [[ASCII]]-based [[Domain Name System]] (DNS). The encoding is used as part of [[IDNA]], which is a system enabling the use of [[Internationalized Domain Names]] in all scripts that are supported by Unicode. Earlier and now historical proposals include [[UTF-5]] and [[UTF-6]].', 465 => '[[GB 18030\|GB18030]] is another encoding form for Unicode, from the [[Standardization Administration of China]]. It is the official [[character set]] of the [[People's Republic of China]] (PRC). [[Binary Ordered Compression for Unicode\|BOCU-1]] and [[Standard Compression Scheme for Unicode\|SCSU]] are Unicode compression schemes. The [[April Fools' Day RFC]] of 2005 specified two [[parody]] UTF encodings, [[UTF-9]] and [[UTF-18]].', 466 => '==Adoption==', 467 => '===Operating systems===', 468 => 'Unicode has become the dominant scheme for internal processing and storage of text. Although a great deal of text is still stored in legacy encodings, Unicode is used almost exclusively for building new information processing systems. 
Early adopters tended to use [[UCS-2]] (the fixed-width two-byte precursor to UTF-16) and later moved to [[UTF-16]] (the variable-width current standard), as this was the least disruptive way to add support for non-BMP characters. The best known such system is [[Windows NT]] (and its descendants, [[Windows 2000]], [[Windows XP]], [[Windows Vista]], [[Windows 7]], [[Windows 8]] and [[Windows 10]]), which uses UTF-16 as the sole internal character encoding. The [[Java virtual machine\|Java]] and [[.NET Framework\|.NET]] bytecode environments, [[macOS]], and [[KDE]] also use it for internal representation. Partial support for Unicode can be installed on [[Windows 9x]] through the [[Microsoft Layer for Unicode]].', 469 => '[[UTF-8]] (originally developed for [[Plan 9 from Bell Labs\|Plan 9]])<ref>{{cite web', 470 => ' \| url = https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt', 471 => ' \| title = UTF-8 history', 472 => ' \| first = Rob \| last = Pike \| authorlink = Rob Pike', 473 => ' \| date = 2003-04-30', 474 => '}}</ref> has become the main storage encoding on most [[Unix-like]] operating systems (though others are also used by some libraries) because it is a relatively easy replacement for traditional [[extended ASCII]] character sets. 
UTF-8 is also the most common Unicode encoding used in [[HTML]] documents on the [[World Wide Web]].', 475 => 'Multilingual text-rendering engines which use Unicode include [[Uniscribe]] and [[DirectWrite]] for Microsoft Windows, [[ATSUI]] and [[Core Text]] for macOS, and [[Pango]] for [[GTK+]] and the [[GNOME]] desktop.', 476 => '===Input methods===', 477 => '{{Main\|Unicode input}}', 478 => 'Because keyboard layouts cannot have simple key combinations for all characters, several operating systems provide alternative input methods that allow access to the entire repertoire.', 479 => '[[ISO/IEC 14755]],<ref>{{cite web\|url=https://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14755.pdf \|title=ISO/IEC JTC1/SC 18/WG 9 N \|date= \|accessdate=2012-06-04}}</ref> which standardises methods for entering Unicode characters from their code points, specifies several methods. There is the ''Basic method'', where a ''beginning sequence'' is followed by the hexadecimal representation of the code point and the ''ending sequence''. There is also a ''screen-selection entry method'' specified, where the characters are listed in a table in a screen, such as with a character map program.', 480 => 'Online tools for finding the code point for a known character include Unicode Lookup<ref>{{cite web\|url=https://unicodelookup.com/\|title=Unicode Lookup\|last=Hedley\|first=Jonathan\|date=2009}}</ref> by Jonathan Hedley and Shapecatcher<ref>{{cite web\|url=http://shapecatcher.com/\|title=Unicode Character Recognition\|last=Milde\|first=Benjamin\|date=2011}}</ref> by Benjamin Milde. In Unicode Lookup, one enters a search key (e.g. "fractions"), and a list of corresponding characters with their code points is returned. 
In Shapecatcher, based on [[Shape context]], one draws the character in a box and a list of characters approximating the drawing, with their code points, is returned.', 481 => '===Email===', 482 => '{{Main\|Unicode and email}}', 483 => '[[MIME]] defines two different mechanisms for encoding non-ASCII characters in [[email]], depending on whether the characters are in email headers (such as the "Subject:"), or in the text body of the message; in both cases, the original character set is identified as well as a transfer encoding. For email transmission of Unicode, the [[UTF-8]] character set and the [[Base64]] or the [[Quoted-printable]] transfer encoding are recommended, depending on whether much of the message consists of [[ASCII]] characters. The details of the two different mechanisms are specified in the MIME standards and generally are hidden from users of email software.', 484 => 'The adoption of Unicode in email has been very slow. Some East Asian text is still encoded in encodings such as [[ISO-2022]], and some devices, such as mobile phones, still cannot correctly handle Unicode data. Support has been improving, however. Many major free mail providers such as [[Yahoo]], [[Google]] ([[Gmail]]), and [[Microsoft]] ([[Outlook.com]]) support it.', 485 => '===Web===', 486 => '{{Main\|Unicode and HTML}}', 487 => 'All [[W3C]] recommendations have used Unicode as their ''document character set'' since HTML 4.0. [[Web browser]]s have supported Unicode, especially UTF-8, for many years. There used to be display problems resulting primarily from [[typeface\|font]] related issues; e.g. 
v 6 and older of Microsoft [[Internet Explorer]] did not render many code points unless explicitly told to use a font that contains them.<ref>{{cite web\|first=Alan \|last=Wood \|url=http://www.alanwood.net/unicode/explorer.html#ie5 \|title=Setting up Windows Internet Explorer 5, 5.5 and 6 for Multilingual and Unicode Support \|publisher=Alan Wood \|date= \|accessdate=2012-06-04}}</ref>', 488 => 'Although syntax rules may affect the order in which characters are allowed to appear, [[XML]] (including [[XHTML]]) documents, by definition,<ref>{{cite web\|title=Extensible Markup Language (XML) 1.1 (Second Edition)\|url=https://www.w3.org/TR/xml11\|accessdate=2013-11-01}}</ref> comprise characters from most of the Unicode code points, with the exception of:', 489 => '* most of the [[C0 and C1 control codes\|C0 control codes]]', 490 => '* the permanently unassigned code points D800–DFFF', 491 => '* FFFE or FFFF', 492 => 'HTML characters manifest either directly as [[byte]]s according to document's encoding, if the encoding supports them, or users may write them as numeric character references based on the character's Unicode code point. 
For example, the references <code>&#916;</code>, <code>&#1049;</code>, <code>&#1511;</code>, <code>&#1605;</code>, <code>&#3671;</code>, <code>&#12354;</code>, <code>&#21494;</code>, <code>&#33865;</code>, and <code>&#47568;</code> (or the same numeric values expressed in hexadecimal, with <code>&#x</code> as the prefix) should display on all browsers as Δ, Й, ק ,م, ๗, あ, 叶, 葉, and 말.', 493 => 'When specifying [[Uniform Resource Identifier\|URIs]], for example as [[URL]]s in [[HTTP]] requests, non-ASCII characters must be [[percent encoding\|percent-encoded]].', 494 => '===Fonts===', 495 => '{{Main\|Unicode font}}', 496 => 'Unicode is not in principle concerned with fonts ''per se'', seeing them as implementation choices.<ref>{{cite journal \|url = http://cajun.cs.nott.ac.uk/wiley/journals/epobetan/pdf/volume6/issue3/bigelow.pdf \| title = The design of a Unicode font \| journal = Electronic Publishing \| volume = VOL. 6(3), 289–305 \| date = September 1993 \| page = 292 \|last1 = Bigelow \| first1=Charles \| last2 = Holmes \| first2 = Kris}}</ref> Any given character may have many [[allograph]]s, from the more common bold, italic and base letterforms to complex decorative styles. A font is "Unicode compliant" if the glyphs in the font can be accessed using code points defined in the Unicode standard.<ref>{{cite web \| url= https://www.unicode.org/faq/font_keyboard.html \| title = Fonts and keyboards \| publisher = Unicode Consortium \| date = 28 June 2017 \| accessdate= 13 October 2019}}</ref> The standard does not specify a minimum number of characters that must be included in the font; some fonts have quite a small repertoire.', 497 => 'Free and retail [[font]]s based on Unicode are widely available, since [[TrueType]] and [[OpenType]] support Unicode. 
These font formats map Unicode code points to glyphs, but TrueType font is restricted to 65,535 glyphs.', 498 => '[[List of typefaces\|Thousands of fonts]] exist on the market, but fewer than a dozen fonts—sometimes described as "pan-Unicode" fonts—attempt to support the majority of Unicode's character repertoire. Instead, Unicode-based [[List of Unicode fonts\|fonts]] typically focus on supporting only basic ASCII and particular scripts or sets of characters or symbols. Several reasons justify this approach: applications and documents rarely need to render characters from more than one or two writing systems; fonts tend to demand resources in computing environments; and operating systems and applications show increasing intelligence in regard to obtaining glyph information from separate font files as needed, i.e., [[font substitution]]. Furthermore, designing a consistent set of rendering instructions for tens of thousands of glyphs constitutes a monumental task; such a venture passes the point of [[diminishing returns]] for most typefaces.', 499 => '===Newlines===', 500 => 'Unicode partially addresses the [[newline]] problem that occurs when trying to read a text file on different platforms. Unicode defines a large number of [[Newline#Unicode\|characters]] that conforming applications should recognize as line terminators.', 501 => 'In terms of the newline, Unicode introduced {{unichar\|2028\|LINE SEPARATOR}} and {{unichar\|2029\|PARAGRAPH SEPARATOR}}. This was an attempt to provide a Unicode solution to encoding paragraphs and lines semantically, potentially replacing all of the various platform solutions. In doing so, Unicode does provide a way around the historical platform dependent solutions. Nonetheless, few if any Unicode solutions have adopted these Unicode line and paragraph separators as the sole canonical line ending characters. However, a common approach to solving this issue is through newline normalization. 
This is achieved with the Cocoa text system in Mac OS X and also with W3C XML and HTML recommendations. In this approach every possible newline character is converted internally to a common newline (which one does not really matter since it is an internal operation just for rendering). In other words, the text system can correctly treat the character as a newline, regardless of the input's actual encoding.', 502 => '==Issues==', 503 => '===Philosophical and completeness criticisms===', 504 => '[[Han unification]] (the identification of forms in the [[East Asian language]]s which one can treat as stylistic variations of the same historical character) has become one of the most controversial aspects of Unicode, despite the presence of a majority of experts from all three regions in the [[Ideographic Research Group]] (IRG), which advises the Consortium and ISO on additions to the repertoire and on Han unification.<ref>[http://tronweb.super-nova.co.jp/characcodehist.html A Brief History of Character Codes], Steven J. Searle, originally written [https://web.archive.org/web/20001216022100/http://tronweb.super-nova.co.jp/characcodehist.html 1999], last updated 2004</ref>', 505 => 'Unicode has been criticized for failing to separately encode older and alternative forms of [[kanji]] which, critics argue, complicates the processing of ancient Japanese and uncommon Japanese names. This is often due to the fact that Unicode encodes characters rather than glyphs (the visual representations of the basic character that often vary from one language to another). 
Unification of glyphs leads to the perception that the languages themselves, not just the basic character representation, are being merged.<ref name="dw2001">[https://web.archive.org/web/20130625062705/http://www.ibm.com/developerworks/library/u-secret.html The secret life of Unicode: A peek at Unicode's soft underbelly], Suzanne Topping, 1 May 2001 ''(Internet Archive)''</ref>{{clarify\|date=April 2010\|reason="and, contains" and meaning of statement}} There have been several attempts to create alternative encodings that preserve the stylistic differences between Chinese, Japanese, and Korean characters in opposition to Unicode's policy of Han unification. An example of one is [[TRON (encoding)\|TRON]] (although it is not widely adopted in Japan, there are some users who need to handle historical Japanese text and favor it).', 506 => 'Although the repertoire of fewer than 21,000 Han characters in the earliest version of Unicode was largely limited to characters in common modern usage, Unicode now includes more than 92,000 Han characters, and work is continuing to add thousands more historic and dialectal characters used in China, Japan, Korea, Taiwan, and Vietnam.', 507 => 'Modern font technology provides a means to address the practical issue of needing to depict a unified Han character in terms of a collection of alternative glyph representations, in the form of [[variation Selectors\|Unicode variation sequences]]. For example, the Advanced Typographic tables of [[OpenType]] permit one of a number of alternative glyph representations to be selected when performing the character to glyph mapping process. 
In this case, information can be provided within plain text to designate which alternate character form to select.', 508 => '[[File:Cyrillic cursive.svg\|thumb\|right\|Various [[Cyrillic]] characters shown with and without italics]]', 509 => 'If the difference in the appropriate glyphs for two characters in the same script differ only in the italic, Unicode has generally unified them, as can be seen in the comparison between Russian (labeled standard) and Serbian characters at right, meaning that the differences are displayed through smart font technology or manually changing fonts.', 510 => '===Mapping to legacy character sets===', 511 => 'Unicode was designed to provide code-point-by-code-point [[round-trip format conversion]] to and from any preexisting character encodings, so that text files in older character sets can be converted to Unicode and then back and get back the same file, without employing context-dependent interpretation. That has meant that inconsistent legacy architectures, such as [[combining character\|combining diacritics]] and [[precomposed character]]s, both exist in Unicode, giving more than one method of representing some text. This is most pronounced in the three different encoding forms for Korean [[Hangul]]. Since version 3.0, any precomposed characters that can be represented by a combining sequence of already existing characters can no longer be added to the standard in order to preserve interoperability between software using different versions of Unicode.', 512 => '[[Injective]] mappings must be provided between characters in existing legacy character sets and characters in Unicode to facilitate conversion to Unicode and allow interoperability with legacy software. 
Lack of consistency in various mappings between earlier Japanese encodings such as [[Shift-JIS]] or [[EUC-JP]] and Unicode led to [[round-trip format conversion]] mismatches, particularly the mapping of the character JIS X 0208 '～' (1-33, WAVE DASH), heavily used in legacy database data, to either {{unichar\|FF5E\|FULLWIDTH TILDE}} (in [[Microsoft Windows]]) or {{unichar\|301C\|WAVE DASH}} (other vendors).<ref>', 513 => '[http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2166.doc AFII contribution about WAVE DASH], {{Cite web\|url=http://www.ingrid.org/java/i18n/unicode.html\|archiveurl=https://web.archive.org/web/20110422181018/http://www.ingrid.org/java/i18n/unicode.html\|title=An Unicode vendor-specific character table for japanese\|date=2011-04-22\|archive-date=2011-04-22\|website=web.archive.org<!--\|access-date=2019-05-20-->}}</ref>', 514 => 'Some Japanese computer programmers objected to Unicode because it requires them to separate the use of {{unichar\|005C\|REVERSE SOLIDUS\|note=backslash}} and {{unichar\|00A5\|YEN SIGN}}, which was mapped to 0x5C in JIS X 0201, and a lot of legacy code exists with this usage.<ref>[https://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html#s-646problem ''ISO 646-* Problem''], Section 4.4.3.5 of ''Introduction to I18n'', Tomohiro KUBOTA, 2001</ref> (This encoding also replaces tilde '~' 0x7E with macron '¯', now 0xAF.) The separation of these characters exists in [[ISO 8859-1]], from long before Unicode.', 515 => '===Indic scripts===', 516 => '[[Indic script]]s such as [[Tamil script\|Tamil]] and [[Devanagari]] are each allocated only 128 code points, matching the [[ISCII]] standard. The correct rendering of Unicode Indic text requires transforming the stored logical order characters into visual order and the forming of ligatures (aka conjuncts) out of components. 
Some local scholars argued in favor of assignments of Unicode code points to these ligatures, going against the practice for other writing systems, though Unicode contains some Arabic and other ligatures for backward compatibility purposes only.<ref>{{cite web', 517 => '\| title = Arabic Presentation Forms-A', 518 => '\| url = https://www.unicode.org/charts/PDF/UFB50.pdf', 519 => '\| accessdate = 2010-03-20}}', 520 => '</ref><ref>{{cite web', 521 => '\| title = Arabic Presentation Forms-B', 522 => '\| url = https://www.unicode.org/charts/PDF/UFE70.pdf', 523 => '\| accessdate = 2010-03-20}}</ref><ref>{{cite web', 524 => '\| title = Alphabetic Presentation Forms', 525 => '\| url = https://www.unicode.org/charts/PDF/UFB00.pdf', 526 => '\| accessdate = 2010-03-20}}</ref> Encoding of any new ligatures in Unicode will not happen, in part because the set of ligatures is font-dependent, and Unicode is an encoding independent of font variations. The same kind of issue arose for the [[Tibetan script]] in 2003 when the [[Standardization Administration of China]] proposed encoding 956 precomposed Tibetan syllables,<ref>{{Cite web \| author=China \| title=Proposal on Tibetan BrdaRten Characters Encoding for ISO/IEC 10646 in BMP \| url=https://www.unicode.org/L2/L2002/02455-n2558-tibetan.pdf \| date=2 December 2002 }}</ref> but these were rejected for encoding by the relevant ISO committee ([[ISO/IEC JTC 1/SC 2]]).<ref>{{Cite web \| author= V. S. Umamaheswaran \| title=Resolutions of WG 2 meeting 44 \| url=https://www.unicode.org/L2/L2003/03390r-n2654.pdf \| at=Resolution M44.20 \| date=7 November 2003 }}</ref>', 527 => '[[Thai alphabet]] support has been criticized for its ordering of Thai characters. The vowels เ, แ, โ, ใ, ไ that are written to the left of the preceding consonant are in visual order instead of phonetic order, unlike the Unicode representations of other Indic scripts. 
This complication is due to Unicode inheriting the [[TIS-620\|Thai Industrial Standard 620]], which worked in the same way, and was the way in which Thai had always been written on keyboards. This ordering problem complicates the Unicode collation process slightly, requiring table lookups to reorder Thai characters for collation.<ref name="dw2001" /> Even if Unicode had adopted encoding according to spoken order, it would still be problematic to collate words in dictionary order. E.g., the word {{wiktth\|แสดง}} {{IPA-th\|sa dɛːŋ\|}} "perform" starts with a consonant cluster "สด" (with an inherent vowel for the consonant "ส"), the vowel แ-, in spoken order would come after the ด, but in a dictionary, the word is collated as it is written, with the vowel following the ส.', 528 => '===Combining characters===', 529 => '{{Main\|Combining character}}', 530 => '{{See also\|Unicode normalization#Normalization}}', 531 => 'Characters with diacritical marks can generally be represented either as a single precomposed character or as a decomposed sequence of a base letter plus one or more non-spacing marks. For example, ḗ (precomposed e with macron and acute above) and ḗ (e followed by the combining macron above and combining acute above) should be rendered identically, both appearing as an [[e]] with a [[Macron (diacritic)\|macron]] and [[acute accent]], but in practice, their appearance may vary depending upon what rendering engine and fonts are being used to display the characters. Similarly, [[dot (diacritic)\|underdots]], as needed in the [[romanization]] of [[Indo-Aryan languages\|Indic]], will often be placed incorrectly.{{Citation needed\|date=July 2011}}. 
Unicode characters that map to precomposed glyphs can be used in many cases, thus avoiding the problem, but where no precomposed character has been encoded the problem can often be solved by using a specialist Unicode font such as [[Charis SIL]] that uses [[Graphite (SIL)\|Graphite]], [[OpenType]], or [[Apple Advanced Typography\|AAT]] technologies for advanced rendering features.', 532 => '===Anomalies===', 533 => '{{main\|Unicode alias names and abbreviations}}', 534 => 'The Unicode standard has imposed rules intended to guarantee stability.<ref>[https://www.unicode.org/policies/stability_policy.html Unicode stability policy]</ref> Depending on the strictness of a rule, a change can be prohibited or allowed. For example, a "name" given to a code point cannot and will not change. But a "script" property is more flexible, by Unicode's own rules. In version 2.0, Unicode changed many code point "names" from version 1. At the same moment, Unicode stated that from then on, an assigned name to a code point will never change anymore. This implies that when mistakes are published, these mistakes cannot be corrected, even if they are trivial (as happened in one instance with the spelling {{sc2\|{{typo\|BRAKCET}}}} for {{sc2\|BRACKET}} in a character name). In 2006 a list of anomalies in character names was first published, and, as of April 2017, there were 94 characters with identified issues,<ref name="tn17">{{cite web \|url=https://unicode.org/notes/tn27/ \|title=Unicode Technical Note #27: Known Anomalies in Unicode Character Names \|date=10 April 2017 \|website=unicode.org}}</ref> for example:', 535 => '* {{unichar\|2118\|script capital p\|nlink=Weierstrass p}}: This is a small letter. 
The capital is {{unichar\|1D4AB\|MATHEMATICAL SCRIPT CAPITAL P}}<ref>[https://www.unicode.org/charts/PDF/U2100.pdf Unicode chart: "actually this has the form of a lowercase calligraphic p, despite its name"]</ref>', 536 => '* {{unichar\|034F\|COMBINING GRAPHEME JOINER\|nlink=Combining grapheme joiner}}: Does not join graphemes.<ref name="tn17" />', 537 => '* {{unichar\|A015\|YI SYLLABLE WU\|nlink=Yi language}}: This is not a Yi syllable, but a Yi iteration mark.', 538 => '* {{unichar\|FE18\|PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR {{typo\|BRAKCET}}}}: ''bracket'' is spelled incorrectly.<ref>[https://www.unicode.org/charts/PDF/UFE10.pdf "Misspelling of BRACKET in character name is a known defect"]</ref>', 539 => 'Spelling errors are resolved by using [[Unicode alias names and abbreviations]].', 540 => '==See also==', 541 => '* [[Comparison of Unicode encodings]]', 542 => '* [[Cultural, political, and religious symbols in Unicode]]', 543 => '* [[International Components for Unicode]] (ICU), now as ICU-<abbr title="technical committee">TC</abbr> a part of Unicode', 544 => '* [[List of binary codes]]', 545 => '* [[List of Unicode characters]]', 546 => '* [[List of XML and HTML character entity references]]', 547 => '* [[Open-source Unicode typefaces]]', 548 => '* [[Standards related to Unicode]]', 549 => '* [[Unicode symbols]]', 550 => '* [[Universal Coded Character Set]]', 551 => '* [[Lotus Multi-Byte Character Set]] (LMBCS), a parallel development with similar intentions', 552 => '==Further reading==', 553 => '{{refbegin}}', 554 => '* ''The Unicode Standard, Version 3.0'', The Unicode Consortium, Addison-Wesley Longman, Inc., April 2000. {{ISBN\|0-201-61633-5}}', 555 => '* ''The Unicode Standard, Version 4.0'', The Unicode Consortium, Addison-Wesley Professional, 27 August 2003. {{ISBN\|0-321-18578-1}}', 556 => '* ''The Unicode Standard, Version 5.0, Fifth Edition'', The [[Unicode Consortium]], Addison-Wesley Professional, 27 October 2006. 
{{ISBN\|0-321-48091-0}}', 557 => '* Julie D. Allen. ''The Unicode Standard, Version 6.0'', The [[Unicode Consortium]], Mountain View, 2011, {{ISBN\|9781936213016}}, ([https://www.unicode.org/versions/Unicode6.0.0/]).', 558 => '* ''The Complete Manual of Typography'', James Felici, Adobe Press; 1st edition, 2002. {{ISBN\|0-321-12730-7}}', 559 => '* ''Unicode: A Primer'', Tony Graham, M&T books, 2000. {{ISBN\|0-7645-4625-2}}.', 560 => '* ''Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard'', Richard Gillam, Addison-Wesley Professional; 1st edition, 2002. {{ISBN\|0-201-70052-2}}', 561 => '* ''Unicode Explained'', Jukka K. Korpela, O'Reilly; 1st edition, 2006. {{ISBN\|0-596-10121-X}}', 562 => '{{refend}}', 563 => '{{cite book \|author1=Yannis Haralambous \|author2=Martin Dürst \|editor1-last=Haralambous \|editor1-first=Yannis \|title=Proceedings of Graphemics in the 21st Century, Brest 2018 \|date=2019 \|publisher=Fluxus Editions \|location=Brest \|isbn=978-2-9570549-1-6 \|pages=167-183 \|url=http://www.fluxus-editions.fr/gla1-hara1.php \|ref=https://doi.org/10.36824/2018-graf-hara1 \|chapter=Unicode from a Linguistic Point of View}}', 564 => '==Notes==', 565 => '{{notelist\|group=note}}', 566 => '==References==', 567 => '{{reflist\|30em}}', 568 => '==External links==', 569 => '{{Sister project links\|n=no\|v=no\|q=no\|s=no\|voy=no\|m=Unicode\|mw=no\|species=no}}', 570 => ' {{official website\|name=Official website}} {{middot}} {{official website\|url=https://unicode.org/main.html\|name=Official technical site}}', 571 => '* {{DMOZ\|Computers/Software/Globalization/Character_Encoding/Unicode/}}', 572 => '* [http://www.alanwood.net/unicode/ Alan Wood's Unicode Resources]{{snd}} Contains lists of word processors with Unicode capability; fonts and characters are grouped by type; characters are presented in lists, not grids.', 573 => '* [https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=UnicodeBMPFallbackFont Unicode BMP Fallback Font] 
Displays the Unicode value of any character in a document, including in the Private Use Area, rather than the glyph itself.', 574 => '{{Unicode navigation\|state=uncollapsed}}', 575 => '{{Character encoding}}', 576 => '{{Authority control}}', 577 => '[[Category:Unicode\| ]]', 578 => '[[Category:Character encoding]]', 579 => '[[Category:Digital typography]]' ]
Lines removed in the edit (`removed_lines`)	[ 0 => '[[Файл:New Unicode logo.svg\|x200px\|thumb\|right\|Unicode Consortium logo]]', 1 => ''''Юнико́д'''<ref name=autogenerated1>{{cite web\|url=http://www.unicode.org/standard/UnicodeTranscriptions.html\|title=Unicode Transcriptions\|publisher=\|date=\|accessdate=10 мая 2010\|lang=en\|archiveurl=https://web.archive.org/web/20060408204540/http://www.unicode.org/standard/UnicodeTranscriptions.html\|archivedate=2006-04-08\|deadlink=yes}}</ref> (the more common form) or '''Унико́д'''<ref>[http://www.paratype.ru/help/term/terms.asp?code=361 Уникод в словаре Paratype]</ref> ({{lang-en\|Unicode}}) is a [[Набор символов\|character encoding]] standard that covers the characters of almost all written [[язык\|languages]] in the world<ref name="unicode-techintro">{{cite web\|url=http://www.unicode.org/standard/principles.html\|title=The Unicode® Standard: A Technical Introduction\|accessdate=2010-07-04\|archiveurl=https://web.archive.org/web/20100310120125/http://www.unicode.org/standard/principles.html\|archivedate=2010-03-10\|deadlink=yes}}</ref>. It is currently the predominant standard on the [[Интернет\|Internet]].', 2 => 'The standard was proposed in [[1991 год\|1991]] by the non-profit Unicode Consortium ({{lang-en\|Unicode Consortium, Unicode Inc.}})<ref>{{cite web\|url=http://www.unicode.org/history/publicationdates.html\|title=History of Unicode Release and Publication Dates\|accessdate=2010-07-04\|archiveurl=https://web.archive.org/web/20100110085403/http://www.unicode.org/history/publicationdates.html\|archivedate=2010-01-10\|deadlink=yes}}</ref><ref>{{cite web\|url=http://www.unicode.org/consortium/consort.html\|title=The Unicode Consortium\|accessdate=2010-07-04\|archiveurl=https://web.archive.org/web/20100627085503/http://www.unicode.org/consortium/consort.html\|archivedate=2010-06-27\|deadlink=yes}}</ref>. 
Using the standard makes it possible to encode a very large number of characters from different writing systems: a document encoded in Unicode can combine Chinese [[иероглиф\|characters]], mathematical symbols, letters of the [[греческий алфавит\|Greek]], [[латинский алфавит\|Latin]] and [[кириллица\|Cyrillic]] alphabets, and musical notation symbols, with no need to switch [[кодовая страница\|code pages]]<ref name="unicode-foreword">{{cite web\|url=http://www.unicode.org/versions/Unicode5.2.0/Foreword.pdf\|title=Foreword\|accessdate=2010-07-04\|archiveurl=https://web.archive.org/web/20100627141434/http://www.unicode.org/versions/Unicode5.2.0/Foreword.pdf\|archivedate=2010-06-27\|deadlink=yes}}</ref>. ', 3 => 'The standard has two main parts: the universal character set ({{lang-en\|Universal character set, UCS}}) and a family of encodings ({{lang-en\|Unicode transformation format, UTF}}). The universal character set enumerates the characters permitted by the Unicode standard and assigns each one a code in the form of a non-negative integer, usually written in hexadecimal with the prefix <code>U+</code>, for example <code>U+040F</code>. The family of encodings defines how character codes are converted for transmission in a stream or storage in a file.', 4 => 'Unicode codes are divided into several ranges. The range from U+0000 to U+007F contains the [[ASCII]] characters, and their codes coincide with their ASCII codes. It is followed by ranges for the characters of other writing systems, punctuation marks and technical symbols. Some codes are reserved for future use<ref name='unicode-02'>{{cite web\|url=http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf\|title=General Structure\|accessdate=2010-07-05\|archiveurl=https://web.archive.org/web/20100627093139/http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf\|archivedate=2010-06-27\|deadlink=yes}}</ref>. 
The ranges U+0400 to U+052F, U+2DE0 to U+2DFF and U+A640 to U+A69F are allocated to Cyrillic characters (see [[Кириллица в Юникоде]])<ref>{{cite web\|url=http://www.unicode.org/versions/Unicode5.2.0/ch07.pdf\|title=European Alphabetic Scripts\|accessdate=2010-07-04\|archiveurl=https://web.archive.org/web/20100627140856/http://www.unicode.org/versions/Unicode5.2.0/ch07.pdf\|archivedate=2010-06-27\|deadlink=yes}}</ref>.', 5 => '== Background and development of Unicode ==', 6 => '{{цитата\|Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.\|автор=Unicode Consortium<ref>[http://www.unicode.org/standard/translations/russian.html Что такое Unicode?]</ref>}}', 7 => 'By the end of the 1980s, 8-bit encodings had become the norm; a great many of them already existed, and new ones kept appearing. This was due both to the growing range of supported languages and to the desire to create encodings that were partially compatible with one another (a typical example is the [[альтернативная кодировка\|alternative encoding for Russian]], which arose because Western programs written for the [[CP437]] encoding were in use). This caused several problems:', 8 => '# incorrect decoding;', 9 => '# a limited character set;', 10 => '# conversion from one encoding to another;', 11 => '# duplication of fonts.', 12 => ''''The problem of incorrect decoding''' caused a document to show characters from foreign languages that were never intended to be there, or unintended [[псевдографика\|box-drawing]] characters, which Russian-speaking users nicknamed «кракозябры» (mojibake). The problem was largely due to the lack of a standardized way to declare the encoding of a file or stream. 
It could be solved either by consistently adopting a standard for declaring the encoding or by adopting a single encoding common to all languages.<ref name='unicode-foreword' />', 13 => ''''The problem of a limited character set'''<ref name='unicode-foreword' />. It could be solved either by switching fonts within a document or by adopting a "wide" encoding. Font switching had long been practised in [[текстовый процессор\|word processors]], often with [[нестандартные шрифты\|fonts in non-standard encodings]], the so-called "dingbat fonts". As a result, when a document was moved to another system, all the non-standard characters turned into garbled text.', 14 => ''''The problem of converting one encoding to another'''. It could be solved either by compiling a conversion table for every pair of encodings or by converting through an intermediate third encoding that included all the characters of all encodings<ref>{{cite web\|url=http://www.unicode.org/history/unicode88.pdf\|title=Unicode 88\|accessdate=2010-07-08\|archiveurl=https://web.archive.org/web/20170906035012/http://unicode.org/history/unicode88.pdf\|archivedate=2017-09-06\|deadlink=yes}}</ref>.', 15 => ''''The problem of font duplication'''. A separate font was created for each encoding, even when the character sets of the encodings coincided partly or completely. This could be solved by creating "large" fonts from which the characters needed for a given encoding would then be selected. That, however, required a single registry of characters to determine what corresponds to what.', 16 => 'The need for a single "wide" encoding was recognized. Variable-length encodings, widely used in East Asia, were considered too complex to work with, so it was decided to use fixed-width characters. 
Using 32-bit characters seemed too wasteful, so 16-bit characters were chosen.', 17 => 'The first version of Unicode was a fixed-width 16-bit encoding, so the total number of codes was 2<sup>16</sup> ({{formatnum:65536}}). Since then characters have been denoted by four hexadecimal digits (for example, <code>U+04F0</code>). Unicode was not originally meant to encode every existing character, only those needed in everyday use. Rarely used characters were to be placed in the private use area ({{lang\|en\|private use area}}), which originally occupied the codes <code>U+D800…U+F8FF</code>. So that Unicode could also serve as an intermediate step when converting between encodings, it included all the characters found in the best-known encodings.', 18 => 'Later, however, it was decided to encode all characters and, accordingly, to expand the code space considerably. At the same time, character codes came to be treated not as 16-bit values but as abstract numbers that a computer can represent in many different ways (see [[#Способы представления\|representation methods]]).', 19 => 'Because a number of computer systems (for example, [[Windows NT]]<ref name="windows-nt">{{cite web\|url=http://support.microsoft.com/kb/99884\|title=Unicode and Microsoft Windows NT\|work=Microsoft Support\|lang=en\|archiveurl=https://web.archive.org/web/20090926092654/http://support.microsoft.com/kb/99884\|archivedate=2009-09-26\|accessdate=2009-11-12\|deadlink=yes}}</ref>) already used fixed 16-bit characters as their default encoding, it was decided to encode all the most important characters within the first {{formatnum:65536}} positions (the so-called {{lang-en\|basic multilingual plane, BMP}}). 
The remaining space is used for "supplementary characters" ({{lang-en\|supplementary characters}}): the writing systems of extinct languages, very rarely used [[китай\|Chinese]] characters, and mathematical and musical symbols.', 20 => 'For compatibility with older 16-bit systems, [[UTF-16]] was devised: the first {{formatnum:65536}} positions, except those in the range U+D800…U+DFFF, are represented directly as 16-bit numbers, and the rest are represented as "surrogate pairs" (the first element of a pair comes from the range U+D800…U+DBFF, the second from U+DC00…U+DFFF). The surrogate pairs took over part of the code space (2048 positions) that had been set aside for private use.', 21 => 'Since UTF-16 can represent only 2<sup>20</sup>+2<sup>16</sup>−2048 ({{formatnum:1112064}}) characters, that number was chosen as the final size of the Unicode code space (code range: 0x000000-0x10FFFF).', 22 => 'Although the Unicode code space was extended beyond 2<sup>16</sup> as early as version 2.0, the first characters in the "upper" region were assigned only in version 3.1.', 23 => 'The role of this encoding on the web keeps growing. At the beginning of 2010, about 50% of websites used Unicode<ref>{{cite web\|url=http://w3pro.ru/news/unicode-ispolzuetsya-pochti-na-50-veb-saitov\|title=Unicode используется почти на 50% веб-сайтов\|lang=ru\|archiveurl=https://web.archive.org/web/20100611042601/http://w3pro.ru/news/unicode-ispolzuetsya-pochti-na-50-veb-saitov\|archivedate=2010-06-11\|accessdate=2010-02-09\|deadlink=yes}}</ref>.', 24 => '== Unicode versions ==', 25 => 'Work on the standard continues. New versions are released as the character tables are changed and extended. New documents of [[Международная организация по стандартизации\|ISO]]/IEC 10646 are issued in parallel.', 26 => 'The first version of the standard was released in 1991, and the latest to date in 2020. 
Versions 1.0 to 5.0 of the standard were published as books and have [[ISBN]]s<ref>[http://www.unicode.org/history/publicationdates.html History of Unicode Release and Publication Dates]</ref><ref>[http://www.unicode.org/versions/enumeratedversions.html Enumerated Versions]</ref>.', 27 => 'A version number consists of three numeric components (for example, 3.1.1). The third component changes when small changes that add no new characters are made to the standard (the exception is version 1.0.1, which added the {{iw\|Унифицированные идеограммы ККЯ\|unified ideographs of Chinese, Japanese and Korean writing\|en\|CJK Unified Ideographs}})<ref>[http://www.unicode.org/versions/index.html About Versions]</ref>.', 28 => 'The database of Unicode characters ([http://www.unicode.org/ucd/ Unicode Character Database]) is available for all versions on the official site in both plain-text and XML formats. The files are distributed under a BSD-like [http://www.unicode.org/copyright.html license].', 29 => '{{Временная линия Юникода}}', 30 => '\|+ Unicode versions', 31 => '! Version number', 32 => '! Publication date', 33 => '! Book [[Международный стандартный книжный номер\|ISBN]]', 34 => '! ISO/IEC 10646 edition', 35 => '! Number of [[Письменность\|scripts]]', 36 => '! Number of characters<ref group="A">'''Including''' graphic ({{lang-en\|graphic}}), control ({{lang-en\|control}}) and format ({{lang-en\|format}}) characters; '''excluding''' [[Области для частного использования\|private-use characters]] ({{Lang-en\|private-use}}), noncharacters ({{Lang-en\|noncharacters}}) and surrogates ({{lang-en\|surrogate code points}}).</ref>', 37 => '! 
Изменения', 38 => '\| 1.0.0<ref>{{cite web\|title=Unicode® 1.0\|url=http://www.unicode.org/versions/Unicode1.0.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 39 => '\| Октябрь 1991', 40 => '\| ISBN 0-201-56788-1 (Vol.1)', 41 => '\| {{formatnum:24}}', 42 => '\| {{formatnum:7161}}', 43 => '\| Изначально Юникод содержал символы следующих письменностей: [[арабское письмо]], [[армянское письмо]], [[бенгальское письмо]], [[Чжуинь\|чжуиньское письмо]], [[кириллица]], [[деванагари]], [[грузинское письмо]], [[Греческий алфавит\|греческое и коптское письмо]], [[Гуджарати (письмо)\|гуджарати]], [[гурмукхи]], [[хангыль]], [[Еврейский алфавит\|еврейское письмо]], [[хирагана]], [[Каннада (письмо)\|каннада]], [[катакана]], [[лаосское письмо]], [[Латинский алфавит\|латиница]], [[Малаялам (письмо)\|малаялам]], [[Ория (письмо)\|ория]], [[тамильское письмо]], [[Телугу (письмо)\|телугу]], [[тайское письмо]] и [[тибетское письмо]]<ref>{{cite web', 44 => '\| title = Unicode Data 1.0.0', 45 => '\| url = http://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt', 46 => '\| lang = en', 47 => '\| accessdate = 2017-12-04', 48 => '}}</ref>', 49 => '\| Июнь 1992', 50 => '\| ISBN 0-201-60845-6 (Vol.2)', 51 => '\| {{formatnum:25}}', 52 => '\| {{formatnum:28359}}', 53 => '\| Добавлены {{formatnum:20902}} {{iw\|Унифицированные идеограммы ККЯ\|унифицированные идеограммы китайского, японского и корейского письма\|en\|CJK Unified Ideographs}}<ref>{{cite web', 54 => '\| url = http://www.unicode.org/Public/reconstructed/1.0.1/UnicodeData.txt', 55 => '\| lang = en', 56 => '\| accessdate = 2017-12-04', 57 => '}}</ref>', 58 => '\| 1.1<ref>{{cite web\|title=Unicode® 1.1\|url=http://www.unicode.org/versions/Unicode1.1.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 59 => '\| Июнь 1993', 60 => '\| {{formatnum:24}}', 61 => '\| {{formatnum:34233}}', 62 => '\| Добавлено {{formatnum:4306}} слогов [[Хангыль\|хангыля]], дополнивших уже 
имеющиеся в кодировке {{formatnum:2350}} символов. Удалены символы [[Тибетское письмо\|тибетского письма]]<ref>{{cite web', 63 => '\| url = http://www.unicode.org/Public/1.1-Update/UnicodeData-1.1.5.txt', 64 => '\| lang = en', 65 => '\| accessdate = 2017-12-04', 66 => '}}</ref>', 67 => '\| 2.0<ref>{{cite web\|title=Unicode 2.0.0\|url=http://www.unicode.org/versions/Unicode2.0.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 68 => '\| Июль 1996', 69 => '\| ISBN 0-201-48345-9', 70 => '\| ISO/IEC 10646-1:1993 и Amendments 5, 6, 7', 71 => '\| {{formatnum:25}}', 72 => '\| {{formatnum:38950}}', 73 => '\| Удалены добавленные ранее слоги [[Хангыль\|хангыля]], и добавлены {{formatnum:11172}} новых слога хангыля с новыми кодами. Возвращены удалённые ранее символы [[Тибетское письмо\|тибетского письма]]; символы получили новые коды и были размещены в разных таблицах. Введён механизм суррогатных ({{lang-en\|surrogate}}) символов. Выделено место для плоскостей ({{lang-en\|planes}}) [[Области для частного использования\|15 и 16]]<ref>{{cite web', 74 => '\| title = Unicode Data 2.0.14', 75 => '\| url = http://www.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt', 76 => '\| lang = en', 77 => '\| accessdate = 2017-12-04', 78 => '}}</ref>', 79 => '\| 2.1<ref>{{cite web\|title=Unicode 2.1.0\|url=http://www.unicode.org/versions/Unicode2.1.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 80 => '\| Май 1998', 81 => '\| ISO/IEC 10646-1:1993, Amendments 5, 6, 7, два символа из Amendment 18', 82 => '\| {{formatnum:38952}}', 83 => '\| Добавлен [[символ евро]]<ref>{{cite web', 84 => '\| title = Unicode Data 2.1.2', 85 => '\| url = http://www.unicode.org/Public/2.1-Update/UnicodeData-2.1.2.txt', 86 => '\| lang = en', 87 => '\| accessdate = 2017-12-04', 88 => '}}</ref>', 89 => '\| 3.0<ref>{{cite web\|title=Unicode 3.0.0\|url=http://www.unicode.org/versions/Unicode3.0.0/\|work={{нп5\|Unicode 
Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 90 => '\| Сентябрь 1999', 91 => '\| ISBN 0-201-61633-5', 92 => '\| {{formatnum:38}}', 93 => '\| {{formatnum:49259}}', 94 => '\| Добавлены письмо [[Чероки (письмо)\|чероки]], [[эфиопское письмо]], [[кхмерское письмо]], [[монгольские письменности]], [[бирманское письмо]], [[огамическое письмо]], [[руны]], [[сингальское письмо]], [[сирийское письмо]], [[Тана (письмо)\|тана]], [[канадское слоговое письмо]] и [[письмо и]], а также символы [[Шрифт Брайля\|шрифта Брайля]]<ref>{{cite web', 95 => '\| title = Unicode Data 3.0.0', 96 => '\| url = http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt', 97 => '\| lang = en', 98 => '\| accessdate = 2017-12-04', 99 => '}}</ref>', 100 => '\| 3.1<ref>{{cite web\|title=Unicode 3.1.0\|url=http://www.unicode.org/versions/Unicode3.1.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 101 => '\| Март 2001', 102 => '\| {{formatnum:41}}', 103 => '\| {{formatnum:94205}}', 104 => '\| Добавлены [[Дезеретский алфавит\|дезеретское письмо]], [[готское письмо]] и {{iw\|древнеиталийское письмо\|\|en\|Old Italic alphabet}}, а также символы [[Современная музыкальная нотация\|западной]] и [[Византийская музыка\|византийской]] музыки, {{formatnum:42711}} {{iw\|Унифицированные идеограммы ККЯ\|унифицированных идеограмм китайского, японского и корейского письма\|en\|CJK Unified Ideographs}}. 
Выделено место для плоскостей [[Плоскость (Юникод)#Дополнительная многоязычная плоскость\|1]], [[Плоскость (Юникод)#Дополнительная идеографическая плоскость\|2]] и [[Плоскость (Юникод)#Специализированная дополнительная плоскость\|14]]<ref>{{cite web', 105 => '\| title = Unicode Data 3.1.0', 106 => '\| url = http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt', 107 => '\| lang = en', 108 => '\| accessdate = 2017-12-04', 109 => '}}</ref>', 110 => '\| 3.2<ref>{{cite web\|title=Unicode 3.2.0\|url=http://www.unicode.org/versions/Unicode3.2.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 111 => '\| Март 2002', 112 => '\| ISO/IEC 10646-1:2000 и Amendment 1', 113 => '\| {{formatnum:45}}', 114 => '\| {{formatnum:95221}}', 115 => '\| Добавлены [[Бухид (письмо)\|письмо бухид]], {{iw\|Письмо хануноо\|хануноо\|en\|Hanunó'o script}}, [[байбайин]] и [[Тагбанва (письмо)\|письмо тагбанва]]<ref>{{cite web', 116 => '\| title = Unicode Data 3.2.0', 117 => '\| url = http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt', 118 => '\| lang = en', 119 => '\| accessdate = 2017-12-04', 120 => '}}</ref>', 121 => '\| 4.0<ref>{{cite web\|title=Unicode 4.0.0\|url=http://www.unicode.org/versions/Unicode4.0.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 122 => '\| Апрель 2003', 123 => '\| ISBN 0-321-18578-1', 124 => '\| {{formatnum:52}}', 125 => '\| {{formatnum:96447}}', 126 => '\| Добавлены [[кипрское письмо]], [[Лимбу (письмо)\|письмо лимбу]], [[линейное письмо Б]], [[сомалийское письмо]], [[Алфавит Шоу\|алфавит шоу]], [[Тай-ныа#Письменность\|письмо лы]] и [[угаритское письмо]], а также символы [[Гексаграмма (Ицзин)\|гексаграмм]]<ref>{{cite web', 127 => '\| title = Unicode Data 4.0.0', 128 => '\| url = http://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt', 129 => '\| lang = en', 130 => '\| accessdate = 2017-12-04', 131 => '}}</ref>', 132 => '\| 4.1<ref>{{cite web\|title=Unicode 
4.1.0\|url=http://www.unicode.org/versions/Unicode4.1.0/\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 133 => '\| Март 2005', 134 => '\| ISO/IEC 10646:2003 и Amendment 1', 135 => '\| {{formatnum:59}}', 136 => '\| {{formatnum:97720}}', 137 => '\| Добавлены [[Лонтара\|письмо лонтара]], [[глаголица]], [[Кхароштхи\|письмо кхароштхи]], [[новое письмо лы]], [[древнеперсидская клинопись]], [[силхетское нагари]] и [[древнеливийское письмо]]. Символы [[Коптское письмо\|коптского письма]] были отделены от символов [[Греческий алфавит\|греческого письма]]. Также добавлены [[Аттическая система счисления\|символы старых греческих цифр]], музыкальные символы Древней Греции и [[символ гривны]] (валюты [[Украина\|Украины]])<ref>{{cite web', 138 => '\| title = Unicode Data 4.1.0', 139 => '\| url = http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt', 140 => '\| lang = en', 141 => '\| accessdate = 2017-12-04', 142 => '}}</ref>', 143 => '\| 5.0<ref>{{cite web\|title=Unicode 5.0.0\|url=http://www.unicode.org/versions/Unicode5.0.0/\|date=2006-07-14\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 144 => '\| Июль 2006', 145 => '\| ISBN 0-321-48091-0', 146 => '\| ISO/IEC 10646:2003, Amendments 1, 2, четыре символа из Amendment 3', 147 => '\| {{formatnum:64}}', 148 => '\| {{formatnum:99089}}', 149 => '\| Добавлены [[балийское письмо]], [[клинопись]], [[Нко (письмо)\|письмо нко]], [[монгольское квадратное письмо]] и [[финикийское письмо]]<ref>{{cite web', 150 => '\| url = http://www.unicode.org/Public/5.0.0/ucd/UnicodeData.txt', 151 => '\| lang = en', 152 => '\| accessdate = 2017-12-04', 153 => '}}</ref>', 154 => '\| 5.1<ref>{{cite web\|title=Unicode 5.1.0\|url=http://www.unicode.org/versions/Unicode5.1.0/\|date=2008-04-04\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 155 => '\| Апрель 2008', 156 => '\| ISO/IEC 10646:2003 и Amendments 1, 2, 3, 4', 157 => '\| {{formatnum:75}}', 158 => '\| 
{{formatnum:100713}}', 159 => '\| Added the [[карийское письмо\|Carian]], [[чамская письменность\|Cham]], [[Кая-ли\|Kayah Li]], [[Лепча (письмо)\|Lepcha]], [[Ликийский алфавит\|Lycian]], [[Лидийский алфавит\|Lydian]], [[Ол-чики\|Ol Chiki]], [[реджангское письмо\|Rejang]], [[Саураштра (письмо)\|Saurashtra]], [[сунданское письмо\|Sundanese]] and [[Ваи (письмо)\|Vai]] scripts. Also added [[Фестский диск\|Phaistos Disc symbols]], [[маджонг\|mahjong]] and [[домино\|domino]] tile symbols, the [[заглавная буква эсцет\|capital sharp s]] (ẞ), and Latin letters used in medieval [[Рукопись\|manuscripts]] for {{iw\|аббревация\|abbreviation\|en\|Scribal abbreviation}}. The [[Бирманское письмо\|Burmese script]] character set was extended with new characters<ref>{{cite web', 160 => '\| url = http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt', 161 => '\| lang = en', 162 => '\| accessdate = 2017-12-04', 163 => '}}</ref>', 164 => '\| 5.2<ref>{{cite web\|title=Unicode® 5.2.0\|url=http://www.unicode.org/versions/Unicode5.2.0/\|date=2009-10-01\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 165 => '\| October 2009', 166 => '\|', 167 => '\| ISO/IEC 10646:2003 and Amendments 1, 2, 3, 4, 5, 6', 168 => '\| {{formatnum:90}}', 169 => '\| {{formatnum:107361}}', 170 => '\| Added [[Авестийский алфавит\|Avestan]], [[Бамум (письменность)\|Bamum]], [[египетское иероглифическое письмо\|Egyptian hieroglyphs]] (following {{iw\|список Гардинера\|Gardiner's sign list\|en\|Gardiner's sign list}}, with {{formatnum:1071}} characters), [[имперское арамейское письмо\|Imperial Aramaic]], {{iw\|пахлевийское эпиграфическое письмо\|Inscriptional Pahlavi\|en\|Inscriptional Pahlavi}}, {{iw\|парфянское эпиграфическое письмо\|Inscriptional Parthian\|en\|Inscriptional Parthian}}, [[яванское письмо\|Javanese]], [[Кайтхи\|Kaithi]], [[Алфавит Фрейзера\|Lisu]], [[Манипури (письмо)\|Meetei Mayek]], [[южноаравийское письмо\|Old South Arabian]], [[древнетюркское письмо\|Old Turkic]], [[самаритянское письмо\|Samaritan]], [[Ланна (письмо)\|Tai Tham]] and {{iw\|письмо 
тай-вьет\|\|en\|Tai Viet script}}. Добавлены {{formatnum:4149}} новых {{iw\|унифицированные идеограммы китайского, японского, корейского письма\|унифицированных идеограмм китайского, японского и корейского письма\|en\|CJK Unified Ideographs}} (CJK-C), символы [[Ведийский язык\|ведийского письма]], [[символ тенге]] (валюты [[Казахстан]]а), а также расширен набор символов чамо [[Хангыль\|старого хангыля]]<ref>{{cite web', 171 => '\| url = http://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt', 172 => '\| lang = en', 173 => '\| accessdate = 2017-12-04', 174 => '}}</ref>', 175 => '\| 6.0<ref>{{cite web\|title=Unicode® 6.0.0\|url=http://www.unicode.org/versions/Unicode6.0.0/\|date=2010-10-11\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 176 => '\| Октябрь 2010', 177 => '\|', 178 => '\| ISO/IEC 10646:2010 и [[символ индийской рупии]]', 179 => '\| {{formatnum:93}}', 180 => '\| {{formatnum:109449}}', 181 => '\| Добавлены [[батакское письмо]], [[Брахми\|письмо брахми]], [[мандейское письмо]]. 
Добавлены символы [[Игральные карты\|игральных карт]], [[Дорожный знак\|дорожных знаков]], [[Географическая карта\|географических карт]], [[Алхимические символы\|алхимии]], [[эмотикон]]а и [[эмодзи]], а также {{formatnum:222}} {{iw\|унифицированные идеограммы китайского, японского и корейского письма\|\|en\|CJK Unified Ideographs}} (CJK-D)<ref>{{cite web', 182 => '\| url = http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt', 183 => '\| lang = en', 184 => '\| accessdate = 2017-12-04', 185 => '}}</ref>', 186 => '\| 6.1<ref>{{cite web\|title=Unicode® 6.1.0\|url=http://www.unicode.org/versions/Unicode6.1.0/\|date=2012-01-31\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 187 => '\| Январь 2012', 188 => '\|', 189 => '\| {{formatnum:100}}', 190 => '\| {{formatnum:110181}}', 191 => '\| Добавлены [[Чакма (письмо)\|письмо чакма]], [[Мероитское письмо\|мероитский курсив и мероитские иероглифы]], [[Письмо Полларда\|письмо мяо]], [[Шарада (письмо)\|письмо шарада]], {{iw\|письмо соранг-сомпенг\|\|en\|Sora Sompeng}} и [[Такри\|письмо такри]]<ref>{{cite web', 192 => '\| url = http://www.unicode.org/Public/6.1.0/ucd/UnicodeData.txt', 193 => '\| lang = en', 194 => '\| accessdate = 2017-12-04', 195 => '}}</ref>', 196 => '\| 6.2<ref>{{cite web\|title=Unicode® 6.2.0\|url=http://www.unicode.org/versions/Unicode6.2.0/\|date=2012-09-26\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-07\|language=en}}</ref>', 197 => '\| Сентябрь 2012', 198 => '\|', 199 => '\| ISO/IEC 10646:2012 и [[символ турецкой лиры]]', 200 => '\| {{formatnum:100}}', 201 => '\| {{formatnum:110182}}', 202 => '\| Добавлен [[символ турецкой лиры]] (валюты [[Турция\|Турции]])<ref>{{cite web', 203 => '\| url = http://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt', 204 => '\| lang = en', 205 => '\| accessdate = 2017-12-04', 206 => '}}</ref>', 207 => '\| 6.3<ref>{{cite web\|title=Unicode® 
6.3.0\|url=http://www.unicode.org/versions/Unicode6.3.0/\|date=2013-09-30\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-07\|language=en}}</ref>', 208 => '\| September 2013', 209 => '\|', 210 => '\| ISO/IEC 10646:2012 and six characters', 211 => '\| {{formatnum:100}}', 212 => '\| {{formatnum:110187}}', 213 => '\| Added five bidirectional text formatting characters<ref>{{cite web', 214 => '\| url = http://www.unicode.org/Public/6.3.0/ucd/UnicodeData.txt', 215 => '\| lang = en', 216 => '\| accessdate = 2017-12-04', 217 => '}}</ref>', 218 => '\| 7.0<ref>{{cite web\|title=Unicode® 7.0.0\|url=http://www.unicode.org/versions/Unicode7.0.0/\|date=2014-06-16\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 219 => '\| 16 June 2014', 220 => '\|', 221 => '\| ISO/IEC 10646:2012, Amendments 1, 2 and the [[символ российского рубля\|ruble sign]]', 222 => '\| {{formatnum:123}}', 223 => '\| {{formatnum:113021}}', 224 => '\| Added [[Басса (письмо)\|Bassa Vah]], [[агванское письмо\|Caucasian Albanian]], [[Система Дюплойе\|Duployan shorthand]], [[эльбасанское письмо\|Elbasan]], [[Грантха\|Grantha]], {{iw\|письмо ходжики\|Khojki\|en\|Khojki}}, {{iw\|письменность худавади\|Khudawadi\|en\|Khudabadi alphabet}}, [[линейное письмо А\|Linear A]], {{iw\|письмо махаджани\|Mahajani\|en\|Mahajani}}, [[манихейское письмо\|Manichaean]], [[Кикакуи\|Mende Kikakui]], [[Моди (письмо)\|Modi]], {{iw\|письмо мро\|Mro\|en\|Mro script}}, [[набатейское письмо\|Nabataean]], [[Северноаравийские языки\|Old North Arabian]], [[древнепермское письмо\|Old Permic]], [[Пахау\|Pahawh Hmong]], [[Пальмирский алфавит\|Palmyrene]], {{iw\|письмо по чин хо\|Pau Cin Hau\|en\|Pau Cin Hau}}, {{iw\|письмо псалтирь пехлеви\|Psalter Pahlavi\|en\|Psalter Pahlavi}}, [[сиддхаматрика\|Siddham]], [[Тирхута\|Tirhuta]], [[варанг-кшити\|Warang Citi]] and {{iw\|дингбат\|ornamental dingbats\|en\|Dingbat}}, as well as the [[символ российского рубля\|Russian ruble sign]] and the [[символ азербайджанского маната\|Azerbaijani manat sign]]<ref>{{cite web', 225 => '\| url = http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt', 226 => '\| lang = en', 227 => 
'\| accessdate = 2017-12-04', 228 => '}}</ref>', 229 => '\| 8.0<ref>{{cite web\|title=Unicode® 8.0.0\|url=http://www.unicode.org/versions/Unicode8.0.0/\|date=2015-06-17\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 230 => '\| 17 июня 2015', 231 => '\|', 232 => '\|ISO/IEC 10646:2014, Amendment 1, [[символ лари]], 9 унифицированных идеограмм ККЯ, 41 [[эмодзи]]', 233 => '\|129', 234 => '\| {{formatnum:120737}}', 235 => '\|Добавлены [[Ахом (письмо)\|письмо ахом]], [[анатолийские иероглифы]], [[Хатран\|письмо хатран]], [[Мултани\|письмо мултани]], [[венгерские руны]], [[SignWriting]], {{formatnum:5776}} [[Унифицированные идеограммы ККЯ — расширение E]], строчные буквы письма [[чероки]], буквы латиницы для немецкой диалектологии, 41 [[эмодзи]], а также пять символов изменения [[Шкала Фитцпатрика\|цвета кожи]] для эмотиконов. Добавлен [[символ лари]] (валюты [[Грузия\|Грузии]])<ref>{{cite web', 236 => '\| url = http://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt', 237 => '\| lang = en', 238 => '\| accessdate = 2017-12-04', 239 => '}}</ref>', 240 => '\| 9.0<ref>{{cite web\|title=Unicode® 9.0.0\|url=http://www.unicode.org/versions/Unicode9.0.0/\|date=2016-06-21\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 241 => '\| 21 июня 2016', 242 => '\|', 243 => '\|ISO/IEC 10646:2014, Amendments 1, 2, адлам, нева, японские символы для ТВ, 74 [[эмодзи]] и символов', 244 => '\|135', 245 => '\| {{formatnum:128237}}', 246 => '\|Добавлены [[Адлам\|письмо адлам]], [[Бхайкшуки\|письмо бхайкшуки]], [[Марчен\|письмо марчен]], [[Нева (письмо)\|письмо нева]], [[Осейдж (письмо)\|письмо осейдж]], [[тангутское письмо]], а также 72 [[эмодзи]] и японские символы для телевидения<ref>{{cite web', 247 => '\| url = http://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt', 248 => '\| lang = en', 249 => '\| accessdate = 2017-12-06', 250 => '}}</ref>', 251 => '\| 10.0<ref>{{cite web\|title=Unicode® 
10.0.0\|url=http://www.unicode.org/versions/Unicode10.0.0/\|date=2017-06-27\|work={{нп5\|Unicode Consortium}}\|accessdate=2017-12-08\|language=en}}</ref>', 252 => '\| 20 июня 2017', 253 => '\|', 254 => '\|ISO/IEC 10646:2017, 56 [[эмодзи]], 285 символов [[Хэнтайгана\|хэнтайганы]], 3 символа квадратного письма Дзанабадзара', 255 => '\|139', 256 => '\| {{formatnum:136755}}', 257 => '\|Добавлены [[Монгольские письменности#Горизонтальное квадратное письмо\|квадратное письмо Дзанабадзара]], [[Соёмбо (письмо)\|письмо соёмбо]], [[гонди Масарама]], [[Нюй-шу\|письмо нюй-шу]], [[Хэнтайгана\|письмо хэнтайгана]], {{formatnum:7494}} [[Унифицированные идеограммы ККЯ — расширение F]], а также 56 [[эмодзи]] и символ [[биткойн]]а<ref>{{cite web', 258 => '\| title = Unicode Data 10.0.0', 259 => '\| url = http://www.unicode.org/Public/10.0.0/ucd/UnicodeData.txt', 260 => '\| lang = en', 261 => '\| accessdate = 2017-12-07', 262 => '}}</ref>', 263 => '\| Июнь 2018', 264 => '\| ', 265 => '\| ISO/IEC 10646:2017', 266 => '\| {{formatnum:137439}}', 267 => '\|Добавлены догра, [[грузинское письмо]] мтаврули, гунджалское гонди, [[ханифи]], индийские цифры сийяк, [[Макасарский язык\|макасарское]] письмо, медефайдрин, (древне-)[[согдийское письмо]], [[цифры майя]], 5 идеограмм ККЯ, символы [[сянци]] и половин звёздочек для оценки, а также 145 [[эмодзи]], четыре символа изменения причёски для эмотиконов и символ [[копилефт]]а<ref>{{Cite web\|url=http://www.unicode.org/Public/11.0.0/ucd/UnicodeData.txt\|title=Unicode Data 11.0.0\|author=\|website=\|date=\|publisher=\|accessdate=2019-04-12\|lang=en}}</ref><ref>[http://blog.unicode.org/2018/06/announcing-unicode-standard-version-110.html The Unicode Blog: Announcing The Unicode® Standard, Version 11.0]</ref><ref>[http://www.unicode.org/versions/Unicode11.0.0/ Unicode 11.0.0]</ref>', 268 => '\|12.0', 269 => '\|Март 2019', 270 => '\|', 271 => '\|ISO/IEC 10646:2017, Amendments 1, 2, а также 62 дополнительных символов', 272 => '\|150', 273 => 
'\|{{formatnum:137993}}', 274 => '\|Добавлены элимайское письмо, {{Не переведено 3\|надинагари\|3=en\|4=Nandinagari}}, хмонг, ванчо, дополнения для [[Письмо Полларда\|письма Полларда]], малая [[кана]] для старых японских текстов, исторические дроби и символы [[Тамильское письмо\|тамильского письма]], буквы [[Лаосское письмо\|лаосского письма]] для [[пали]], буквы латиницы для транслитерации угаритского, управляющие символы форматирования египетских иероглифов, а также 61 [[эмодзи]]<ref>[http://blog.unicode.org/2019/03/announcing-unicode-standard-version-120.html The Unicode Blog: Announcing The Unicode® Standard, Version 12.0]</ref><ref>[http://www.unicode.org/versions/Unicode12.0.0/ Unicode 12.0.0]</ref>', 275 => '\|12.1', 276 => '\|Май 2019', 277 => '\|', 278 => '\|150', 279 => '\|{{formatnum:137994}}', 280 => '\|Добавлен квадратный символ эпохи [[рэйва]]<ref>[http://blog.unicode.org/2019/05/unicode-12-1-en.html The Unicode Blog: Unicode Version 12.1 released in support of the Reiwa Era]</ref><ref>[http://www.unicode.org/versions/Unicode12.1.0/ Unicode 12.1.0]</ref>', 281 => '\|13.0', 282 => '\|Март 2020', 283 => '\|', 284 => '\|', 285 => '\|154', 286 => '\|{{formatnum:143859}}', 287 => '\|Добавлены [[хорезмийское письмо]], письмо [[дивес акуру]], [[малое киданьское письмо]], [[езидское письмо]], {{formatnum:4969}} идеограмм ККЯ (включая {{formatnum:4939}} [[Унифицированные идеограммы ККЯ — расширение G]]), а также 55 [[эмодзи]], символы [[Creative Commons]] и символы для унаследованной вычислительной техники. 
Выделено место для [[Плоскость (Юникод)#Третичная идеографическая плоскость\|плоскости 3]]<ref>[http://blog.unicode.org/2020/03/announcing-unicode-standard-version-130.html The Unicode Blog: Announcing The Unicode Standard, Version 13.0]</ref><ref>[http://www.unicode.org/versions/Unicode13.0.0/ Unicode 13.0.0]</ref>', 288 => '\|-', 289 => '\|colspan="7" \| '''Примечания'''', 290 => '<references group="A" />', 291 => '== Кодовое пространство ==', 292 => 'Хотя формы записи UTF-8 и UTF-32 позволяют кодировать до 2<sup>31</sup> ({{formatnum:2147483648}}) кодовых позиций, было принято решение использовать лишь {{formatnum:1112064}} для совместимости с UTF-16. Впрочем, даже и этого в данный момент более чем достаточно — в версии 13.0 используется всего {{formatnum:143859}} кодовых позиций.', 293 => 'Кодовое пространство разбито на 17 ''[[Плоскость (Юникод)\|плоскостей]]'' ({{lang-en\|planes}}) по 2<sup>16</sup> ({{formatnum:65536}}) символов. Нулевая плоскость ({{lang-en2\|plane{{nbsp}}0}}) называется ''базовой'' ({{lang-en2\|basic}}) и содержит символы наиболее употребительных письменностей. Остальные плоскости — дополнительные ({{lang-en2\|supplementary}}). Первая плоскость ({{lang-en2\|plane{{nbsp}}1}}) используется в основном для исторических письменностей, вторая ({{lang-en2\|plane{{nbsp}}2}}) — для редко используемых иероглифов [[CJK\|китайского письма (ККЯ)]], третья ({{lang-en2\|plane{{nbsp}}3}}) зарезервирована для архаичных китайских иероглифов<ref>[http://unicode.org/roadmaps/tip/ Roadmap to the TIP (Tertiary Ideographic Plane)]</ref>. Плоскость 14 отведена для символов, используемых по особому назначению. Плоскости 15 и 16 выделены для частного употребления<ref name='unicode-02' />.', 294 => 'Для обозначения символов Unicode используется запись вида «U+''xxxx''» (для кодов 0…FFFF), или «U+''xxxxx''» (для кодов 10000…FFFFF), или «U+''xxxxxx''» (для кодов 100000…10FFFF), где ''xxx'' — [[шестнадцатеричная система счисления\|шестнадцатеричные]] цифры. 
Например, символ «я» (U+044F) имеет код 044F{{sub\|16}}{{nbsp}}= 1103{{sub\|[[десятичная система счисления\|10]]}}.', 295 => '{\| class="wikitable sortable collapsible collapsed"', 296 => '\|-', 297 => '! colspan="3" \| Плоскости Юникода', 298 => '\|-', 299 => '! Плоскость !! Название !! Диапазон символов', 300 => '\|-', 301 => '\| 0 \|\| Базовая многоязыковая плоскость ({{lang-en2\|Basic multilingual plane, BMP}}) \|\| U+0000…U+FFFF', 302 => '\|-', 303 => '\| 1 \|\| Дополнительная многоязыковая плоскость ({{lang-en2\|Supplementary multilingual plane, SMP}}) \|\| U+10000…U+1FFFF', 304 => '\|-', 305 => '\| 2 \|\| Дополнительная иероглифическая плоскость ({{lang-en2\|Supplementary ideographic plane, SIP}}) \|\| U+20000…U+2FFFF', 306 => '\|-', 307 => '\| 3 \|\| Третичная иероглифическая плоскость ({{lang-en2\|Tertiary ideographic plane, TIP}}) \|\| U+30000…U+3FFFF', 308 => '\|-', 309 => '\| 4—13 \|\| не используются \|\| U+40000…U+DFFFF', 310 => '\|-', 311 => '\| 14 \|\| Дополнительная плоскость особого назначения ({{lang-en2\|Supplementary special-purpose plane, SSP}}) \|\| U+E0000…U+EFFFF', 312 => '\|-', 313 => '\| 15—16 \|\| Дополнительные области для частного использования ({{lang-en2\|Supplementary private use area, SPUA-A/B}}) \|\| U+F0000…U+10FFFF', 314 => '\|-', 315 => '\|}', 316 => '== Система кодирования ==', 317 => 'Универсальная система кодирования (Юникод) представляет собой набор графических символов и способ их кодирования для [[компьютер]]ной обработки текстовых данных.', 318 => 'Графические символы — это символы, имеющие видимое изображение. 
Графическим символам противопоставляются управляющие символы и символы форматирования.', 319 => 'Графические символы включают в себя следующие группы:', 320 => '* буквы, содержащиеся хотя бы в одном из обслуживаемых [[алфавит]]ов;', 321 => '* цифры;', 322 => '* знаки пунктуации;', 323 => '* специальные знаки ([[математика\|математические]], технические, [[идеограмма\|идеограммы]] и пр.);', 324 => '* разделители.', 325 => 'Юникод — это система для линейного представления текста. Символы, имеющие дополнительные над- или подстрочные элементы, могут быть представлены в виде построенной по определённым правилам последовательности кодов (составной вариант, composite character) или в виде единого символа (монолитный вариант, precomposed character). С 2014 года считается, что все буквы крупных письменностей в Юникод внесены, и если символ доступен в составном варианте, дублировать его в монолитном виде не нужно.', 326 => '=== Политика консорциума ===', 327 => 'Консорциум не создаёт нового, а констатирует сложившийся порядок вещей<ref name="emoji">[http://www.unicode.org/faq/emoji_dingbats.html FAQ — Emoji{{nbsp}}& Dingbats]</ref>. Например, картинки «[[эмодзи]]» были добавлены потому, что японские операторы мобильной связи широко их использовали. Для этого добавление символа проходит через сложный процесс<ref name="emoji" />. И, например, [[символ российского рубля]] прошёл его за три месяца, как только получил официальный статус, причём до этого он много лет де-факто использовался и его отказывались включить в Юникод.', 328 => '[[Товарный знак\|Товарные знаки]] кодируют только в порядке исключения. Так, в Юникоде нет флага [[Windows]] или яблока [[Apple]].', 329 => 'Как только символ появился в кодировке, он никогда не сдвинется и не исчезнет. Если же потребуется изменить порядок символов, это делается не переменой позиций, а национальным порядком сортировки. 
Есть и другие, более тонкие гарантии стабильности — например, не будут меняться таблицы нормализации<ref>[http://www.unicode.org/policies/stability_policy.html Unicode Character Encoding Stability Policy]</ref>.', 330 => '=== Объединение и дублирование символов ===', 331 => 'Один и тот же символ может иметь несколько форм; в Юникод эти формы входят одной кодовой позицией:', 332 => '* если это сложилось исторически. Например, у [[арабское письмо\|арабских букв]] есть четыре формы: обособленная, в начале, в середине и в конце<ref>Впоследствии конкретным формам арабских букв отвели отдельные позиции. Но всё равно рекомендуется писать по-арабски «общими» вариантами букв.</ref>;', 333 => '* либо если в одном языке принята одна форма, а в другом — другая. [[Болгарская кириллица (типографика)\|Болгарская кириллица]] отличается от русской, а китайские иероглифы — от японских.', 334 => 'С другой стороны, если исторически в шрифтах у разных форм начертания были разные символы, то они остаются разными и в Юникоде. Например, строчная греческая [[Сигма (буква)\|сигма]] имеет две формы, и в Юникоде у них разные коды; буква [[Расширенная латиница\|расширенной латиницы]]{{nbsp}}[[Å (латиница)\|Å]] ({{nobr\|A с кружком}}) и знак [[ангстрем]]а{{nbsp}}Å, греческая буква{{nbsp}}[[Мю\|μ]] и обозначение приставки «[[микро-]]»{{nbsp}}µ — тоже имеют разные кодовые позиции.', 335 => 'Конечно же, похожие символы в неродственных письменностях также ставятся в разные кодовые позиции. Например, буква{{nbsp}}А в [[Латиница\|латинице]], [[Кириллица\|кириллице]], [[Греческий алфавит\|греческом]] и [[Письмо чероки\|чероки]] — разные символы.', 336 => 'Крайне редко один и тот же символ ставится в две разные кодовые позиции для упрощения обработки текста. 
[[Штрих (математика)\|Математический штрих]] и такой же штрих для индикации [[Мягкий звук\|мягкости звуков]] — разные символы, второй считается буквой.', 337 => '== Комбинируемые символы ==', 338 => '[[Файл:Diacritic-j.png\|right\|thumb\|Представление символа «Й» (U+0419) в виде базового символа «И» (U+0418) и комбинируемого символа « ̆» (U+0306).]]', 339 => 'Символы в Юникоде подразделяются на базовые ({{lang-en\|base characters}}) и комбинируемые ({{lang-en\|combining characters}}). Комбинируемые символы обычно следуют за базовым и изменяют его отображение определённым образом. К комбинируемым символам относятся, в частности, [[диакритические знаки]] и знаки ударения. Например, русскую букву «Й» в Юникоде можно записать в виде базового символа «И» (U+0418) и комбинируемого символа « ̆» (U+0306), отображаемого над базовым.', 340 => 'Комбинируемые символы помечены в таблицах символов Юникода особыми категориями:', 341 => '* Nonspacing Mark — безынтервальный (непротяжённый) знак; таковые обычно отображаются над или под базовым символом и не занимают отдельной горизонтальной позиции (интервала) в отображаемой строке;', 342 => '* Enclosing Mark — обрамляющий знак; эти символы также не занимают отдельной горизонтальной позиции (интервала) в отображаемой строке, но отображаются сразу с нескольких сторон базового символа;', 343 => '* Spacing Combining Mark — интервальный (протяжённый) комбинируемый знак; таковые, как и базовый символ, занимают отдельную горизонтальную позицию (интервал) в отображаемой строке.', 344 => 'Особый тип комбинируемых символов — селекторы варианта начертания ({{lang-en\|variation selectors}}). Они действуют только на те базовые символы, для которых такие варианты определены. 
К примеру, в версии Юникода 5.0 варианты начертания определены для ряда математических символов, для символов традиционного [[монгольский алфавит\|монгольского алфавита]] и для символов [[Монгольское квадратное письмо\|монгольского квадратного письма]].', 345 => '== Алгоритмы нормализации ==', 346 => 'Из-за наличия в Юникоде комбинируемых символов одни и те же знаки письменности можно представить различными кодами. Так, например, букву «Й» в примере выше можно записать как отдельным символом, так и сочетанием базового и комбинируемого. Из-за этого побайтовое сравнение строк перестаёт быть надёжным: одинаковые на вид строки могут состоять из разных кодов. Алгоритмы нормализации ({{lang-en\|normalization forms}}) решают эту проблему, выполняя приведение символов к определённому стандартному виду. Приведение осуществляется путём замены символов на эквивалентные с использованием таблиц и правил. «Декомпозицией» называется замена (разложение) одного символа на несколько составляющих символов, а «композицией», наоборот, — замена (соединение) нескольких составляющих символов на один символ.', 347 => 'В стандарте Юникода определены четыре алгоритма нормализации текста: NFD, NFC, NFKD и NFKC.', 348 => '=== NFD ===', 349 => 'NFD, {{lang-en\|'''n'''ormalization '''f'''orm '''D'''}} («D» от {{lang-en\|'''d'''ecomposition}}), форма нормализации D — каноническая декомпозиция — алгоритм, согласно которому выполняется рекурсивное разложение составных символов ({{lang-en\|precomposed characters}}) на последовательность из одного или нескольких простых символов в соответствии с таблицами декомпозиции. 
Рекурсивное потому, что в процессе разложения составной символ может быть разложен на несколько других, некоторые из которых тоже являются составными, и к которым применяется дальнейшее разложение.', 350 => 'Примеры:', 351 => '{\| style="text-align:center"', 352 => '\|', 353 => '{\| class="wikitable" style="text-align:center"', 354 => ' \| Ω', 355 => ' \|-', 356 => ' \| <small>U+2126</small>', 357 => ' \|}', 358 => '\| colspan="2" \| →', 359 => '\|', 360 => '{\| class="wikitable"', 361 => ' \| Ω', 362 => ' \|-', 363 => ' \| <small>U+03A9</small>', 364 => ' \|}', 365 => '\|}', 366 => '{\| style="text-align:center"', 367 => '\|', 368 => '{\| class="wikitable" style="text-align:center"', 369 => ' \| <big>Å</big>', 370 => ' \|-', 371 => ' \| <small>U+00C5</small>', 372 => ' \|}', 373 => '\| colspan="2" \| →', 374 => '\|', 375 => '{\| class="wikitable"', 376 => ' \| <big>A</big>', 377 => ' \|-', 378 => ' \| <small>U+0041</small>', 379 => ' \|}', 380 => '\|', 381 => '{\| class="wikitable"', 382 => ' \| <big> ̊</big>', 383 => ' \|-', 384 => ' \| <small>U+030A</small>', 385 => ' \|}', 386 => '\|}', 387 => '{\| style="text-align:center"', 388 => '\|', 389 => '{\| class="wikitable" style="text-align:center"', 390 => ' \| <big>ṩ</big>', 391 => ' \|-', 392 => ' \| <small>U+1E69</small>', 393 => ' \|}', 394 => '\| colspan="2" \| →', 395 => '\|', 396 => '{\| class="wikitable"', 397 => ' \| <big>s</big>', 398 => ' \|-', 399 => ' \| <small>U+0073</small>', 400 => ' \|}', 401 => '\|', 402 => '{\| class="wikitable"', 403 => ' \| <big> ̣</big>', 404 => ' \|-', 405 => ' \| <small>U+0323</small>', 406 => ' \|}', 407 => '\|', 408 => '{\| class="wikitable"', 409 => ' \| <big> ̇</big>', 410 => ' \|-', 411 => ' \| <small>U+0307</small>', 412 => ' \|}', 413 => '\|}', 414 => '{\| style="text-align:center"', 415 => '\|', 416 => '{\| class="wikitable" style="text-align:center"', 417 => ' \| colspan="2" \| <big>ḍ̇</big>', 418 => ' \|-', 419 => ' \| <small>U+1E0B</small> \|\| 
<small>U+0323</small>', 420 => ' \|}', 421 => '\| colspan="2" \| →', 422 => '\|', 423 => '{\| class="wikitable"', 424 => ' \| <big>d</big>', 425 => ' \|-', 426 => ' \| <small>U+0064</small>', 427 => ' \|}', 428 => '\|', 429 => '{\| class="wikitable"', 430 => ' \| <big> ̣</big>', 431 => ' \|-', 432 => ' \| <small>U+0323</small>', 433 => ' \|}', 434 => '\|', 435 => '{\| class="wikitable"', 436 => ' \| <big> ̇</big>', 437 => ' \|-', 438 => ' \| <small>U+0307</small>', 439 => ' \|}', 440 => '\|}', 441 => '{\| style="text-align:center"', 442 => '\|', 443 => '{\| class="wikitable" style="text-align:center"', 444 => ' \| colspan="3" \| <big>q̣̇</big>', 445 => ' \|-', 446 => ' \| <small>U+0071</small> \|\| <small>U+0307</small> \|\| <small>U+0323</small>', 447 => ' \|}', 448 => '\| colspan="2" \| →', 449 => '\|', 450 => '{\| class="wikitable"', 451 => ' \| <big>q</big>', 452 => ' \|-', 453 => ' \| <small>U+0071</small>', 454 => ' \|}', 455 => '\|', 456 => '{\| class="wikitable"', 457 => ' \| <big> ̣</big>', 458 => ' \|-', 459 => ' \| <small>U+0323</small>', 460 => ' \|}', 461 => '\|', 462 => '{\| class="wikitable"', 463 => ' \| <big> ̇</big>', 464 => ' \|-', 465 => ' \| <small>U+0307</small>', 466 => ' \|}', 467 => '\|}', 468 => '=== NFC ===', 469 => 'NFC, {{lang-en\|'''n'''ormalization '''f'''orm '''C'''}} («C» от {{lang-en\|'''c'''omposition}}), форма нормализации C — алгоритм, согласно которому последовательно выполняются каноническая декомпозиция и каноническая композиция. Сначала каноническая декомпозиция (алгоритм NFD) приводит текст к форме D. 
Затем каноническая композиция — операция, обратная NFD, обрабатывает текст от начала к концу с учётом следующих правил:', 470 => '* символ <code>S</code> считается ''начальным'', если имеет нулевой класс комбинируемости ({{lang-en\|combining class of zero}}) согласно таблице символов Юникода;', 471 => '* в любой последовательности символов, начинающейся с символа <code>S</code>, символ <code>C</code> блокируется от <code>S</code>, только если между <code>S</code> и <code>C</code> есть какой-либо символ <code>B</code>, который либо является начальным, либо имеет одинаковый или больший класс комбинируемости, чем <code>C</code>. Это правило распространяется только на строки, прошедшие каноническую декомпозицию;', 472 => '* символ считается ''первичным'' композитом, если имеет каноническую декомпозицию в таблице символов Юникода (или каноническую декомпозицию для [[Хангыль\|хангыля]] и он не входит в [http://www.unicode.org/unicode/reports/tr15/#Primary_Exclusion_List_Table список исключений]);', 473 => '* символ <code>X</code> может быть первично совмещён с символом <code>Y</code>, если и только если существует первичный композит <code>Z</code>, канонически эквивалентный последовательности <<code>X</code>, <code>Y</code>>;', 474 => '* если очередной символ <code>C</code> не блокируется последним встреченным начальным базовым символом <code>L</code> и он может быть успешно первично совмещён с ним, то <code>L</code> заменяется на композит <code>L-C</code>, а <code>C</code> удаляется.', 475 => 'Пример:', 476 => '{\| style="text-align:center"', 477 => '\|', 478 => '{\| class="wikitable" style="text-align:center"', 479 => ' \| <big>o</big>', 480 => ' \|-', 481 => ' \| <small>U+006F</small>', 482 => ' \|}', 483 => '\|', 484 => '{\| class="wikitable" style="text-align:center"', 485 => ' \| <big> ̂</big>', 486 => ' \|-', 487 => ' \| <small>U+0302</small>', 488 => ' \|}', 489 => '\| colspan="2" \| →', 490 => '\|', 491 => '{\| class="wikitable"', 492 => ' \| <big>ô</big>', 493 
=> ' \|-', 494 => ' \| <small>U+00F4</small>', 495 => ' \|}', 496 => '\|}', 497 => '=== NFKD ===', 498 => 'NFKD, {{lang-en\|'''n'''ormalization '''f'''orm '''KD'''}}, форма нормализации KD — совместимая декомпозиция — алгоритм, согласно которому последовательно выполняются каноническая декомпозиция и замены символов текста по таблицам совместимой декомпозиции. Таблицы совместимой декомпозиции предусматривают замену на почти эквивалентные символы<ref>[http://habrahabr.ru/post/45489/ Нормализация Unicode]</ref>:', 499 => '* похожих на буквы (ℍ и ℌ);', 500 => '* обведённых кружками (①);', 501 => '* с изменёнными размерами (ｶ и カ);', 502 => '* повёрнутых (︷ и {);', 503 => '* степеней (⁹ и ₉);', 504 => '* дробей (¼);', 505 => '* других (™).', 506 => 'Примеры:', 507 => '{\| style="text-align:center;"', 508 => '\|', 509 => '{\| class="wikitable" style="text-align:center;"', 510 => ' \| <big>ℍ</big>', 511 => ' \|-', 512 => ' \| <small>U+210d</small>', 513 => ' \|}', 514 => '\| colspan="2" \| →', 515 => '\|', 516 => '{\| class="wikitable" style="text-align:center;"', 517 => ' \| <big>H</big>', 518 => ' \|-', 519 => ' \| <small>U+0048</small>', 520 => ' \|}', 521 => '\|}', 522 => '{\| style="text-align:center;"', 523 => '\|', 524 => '{\| class="wikitable" style="text-align:center;"', 525 => ' \| <big>①</big>', 526 => ' \|-', 527 => ' \| <small>U+2460</small>', 528 => ' \|}', 529 => '\| colspan="2" \| →', 530 => '\|', 531 => '{\| class="wikitable" style="text-align:center;"', 532 => ' \| <big>1</big>', 533 => ' \|-', 534 => ' \| <small>U+0031</small>', 535 => ' \|}', 536 => '\|}', 537 => '{\|', 538 => '\|', 539 => '{\| class="wikitable" style="text-align:center;"', 540 => ' \| <big>ｶ</big>', 541 => ' \|-', 542 => ' \| <small>U+FF76</small>', 543 => ' \|}', 544 => '\| colspan="2" \| →', 545 => '\|', 546 => '{\| class="wikitable" style="text-align:center;"', 547 => ' \| <big>カ</big>', 548 => ' \|-', 549 => ' \| <small>U+30AB</small>', 550 => ' \|}', 551 => '\|}', 552 => '{\|', 
553 => '\|', 554 => '{\| class="wikitable" style="text-align:center;"', 555 => ' \| <big>︷</big>', 556 => ' \|-', 557 => ' \| <small>U+FE37</small>', 558 => ' \|}', 559 => '\| colspan="2" \| →', 560 => '\|', 561 => '{\| class="wikitable" style="text-align:center;"', 562 => ' \| <big>{</big>', 563 => ' \|-', 564 => ' \| <small>U+007B</small>', 565 => ' \|}', 566 => '\|}', 567 => '{\|', 568 => '\|', 569 => '{\| class="wikitable" style="text-align:center;"', 570 => ' \| <big>⁹</big>', 571 => ' \|-', 572 => ' \| <small>U+2079</small>', 573 => ' \|}', 574 => '\| colspan="2" \| →', 575 => '\|', 576 => '{\| class="wikitable" style="text-align:center;"', 577 => ' \| <big>9</big>', 578 => ' \|-', 579 => ' \| <small>U+0039</small>', 580 => ' \|}', 581 => '\|}', 582 => '{\|', 583 => '\|', 584 => '{\| class="wikitable" style="text-align:center;"', 585 => ' \| <big>¼</big>', 586 => ' \|-', 587 => ' \| <small>U+00BC</small>', 588 => ' \|}', 589 => '\| colspan="2" \| →', 590 => '\|', 591 => '{\| class="wikitable" style="text-align:center;"', 592 => ' \| <big>1</big> \|\| <big> ⁄ </big> \|\| <big>4</big>', 593 => ' \|-', 594 => ' \| <small>U+0031</small> \|\| <small>U+2044</small> \|\| <small>U+0034</small>', 595 => ' \|}', 596 => '\|}', 597 => '{\|', 598 => '\|', 599 => '{\| class="wikitable" style="text-align:center;"', 600 => ' \| <big>™</big>', 601 => ' \|-', 602 => ' \| <small>U+2122</small>', 603 => ' \|}', 604 => '\| colspan="2" \| →', 605 => '\|', 606 => '{\| class="wikitable" style="text-align:center;"', 607 => ' \| <big>T</big> \|\| <big>M</big>', 608 => ' \|-', 609 => ' \| <small>U+0054</small> \|\| <small>U+004D</small>', 610 => ' \|}', 611 => '\|}', 612 => '=== NFKC ===', 613 => 'NFKC, {{lang-en\|'''n'''ormalization '''f'''orm '''KC'''}}, форма нормализации KC — алгоритм, согласно которому последовательно выполняются совместимая декомпозиция (алгоритм NFKD) и каноническая композиция (алгоритм NFC).', 614 => '=== Примеры ===', 615 => '{\| class="standard"', 616 => ' 
!Исходный текст\|\|NFD\|\|NFC\|\|NFKD\|\|NFKC', 617 => ' \|-', 618 => ' \| <!-- fi -->', 619 => '{\| class="wikitable" style="text-align:center;"', 620 => '\| <big>ﬁ</big>', 621 => '\| <small>U+FB01</small>', 622 => '\|}', 623 => '\|', 624 => '{\| class="wikitable" style="text-align:center;"', 625 => '\| <big>ﬁ</big>', 626 => '\| <small>U+FB01</small>', 627 => '\|}', 628 => '\|', 629 => '{\| class="wikitable" style="text-align:center;"', 630 => '\| <big>ﬁ</big>', 631 => '\| <small>U+FB01</small>', 632 => '\|}', 633 => '\|', 634 => '{\| class="wikitable" style="text-align:center;"', 635 => '\| <big>f</big> \|\| <big>i</big>', 636 => '\| <small>U+0066</small> \|\| <small>U+0069</small>', 637 => '\|}', 638 => '\|', 639 => '{\| class="wikitable" style="text-align:center;"', 640 => '\| <big>f</big> \|\| <big>i</big>', 641 => '\| <small>U+0066</small> \|\| <small>U+0069</small>', 642 => '\|}', 643 => ' \|-', 644 => ' \| <!-- 2^5 -->', 645 => '{\| class="wikitable" style="text-align:center;"', 646 => '\| <big>2</big> \|\| <big>⁵</big>', 647 => '\| <small>U+0032</small> \|\| <small>U+2075</small>', 648 => '\|}', 649 => '\|', 650 => '{\| class="wikitable" style="text-align:center;"', 651 => '\| <big>2</big> \|\| <big>⁵</big>', 652 => '\| <small>U+0032</small> \|\| <small>U+2075</small>', 653 => '\|}', 654 => '\|', 655 => '{\| class="wikitable" style="text-align:center;"', 656 => '\| <big>2</big> \|\| <big>⁵</big>', 657 => '\| <small>U+0032</small> \|\| <small>U+2075</small>', 658 => '\|}', 659 => '\|', 660 => '{\| class="wikitable" style="text-align:center;"', 661 => '\| <big>2</big> \|\| <big>5</big>', 662 => '\| <small>U+0032</small> \|\| <small>U+0035</small>', 663 => '\|}', 664 => '\|', 665 => '{\| class="wikitable" style="text-align:center;"', 666 => '\| <big>2</big> \|\| <big>5</big>', 667 => '\| <small>U+0032</small> \|\| <small>U+0035</small>', 668 => '\|}', 669 => ' \|-', 670 => ' \| <!-- "s" (looks like "f") with two dots -->', 671 => '{\| class="wikitable" 
style="text-align:center;"', 672 => '\| colspan="2" \| <big>ẛ̣</big>', 673 => '\| <small>U+1E9B</small> \|\| <small>U+0323</small>', 674 => '\|}', 675 => '\|', 676 => '{\| class="wikitable" style="text-align:center;"', 677 => '\| <big>ſ</big> \|\| <big>̣</big> \|\| <big>̇</big>', 678 => '\| <small>U+017F</small> \|\| <small>U+0323</small> \|\| <small>U+0307</small>', 679 => '\|}', 680 => '\|', 681 => '{\| class="wikitable" style="text-align:center;"', 682 => '\| <big>ẛ</big> \|\| <big>̣</big>', 683 => '\| <small>U+1E9B</small> \|\| <small>U+0323</small>', 684 => '\|}', 685 => '\|', 686 => '{\| class="wikitable" style="text-align:center;"', 687 => '\| <big>s</big> \|\| <big>̣</big> \|\| <big>̇</big>', 688 => '\| <small>U+0073</small> \|\| <small>U+0323</small> \|\| <small>U+0307</small>', 689 => '\|}', 690 => '\|', 691 => '{\| class="wikitable" style="text-align:center;"', 692 => '\| <big>ṩ</big>', 693 => '\| <small>U+1E69</small>', 694 => '\|}', 695 => ' \|-', 696 => ' \| <!-- "й" -->', 697 => '{\| class="wikitable" style="text-align:center;"', 698 => '\| <big>й</big>', 699 => '\| <small>U+0439</small>', 700 => '\|}', 701 => '\|', 702 => '{\| class="wikitable" style="text-align:center;"', 703 => '\| <big>и</big> \|\| <big> ̆</big>', 704 => '\| <small>U+0438</small> \|\| <small>U+0306</small>', 705 => '\|}', 706 => '\|', 707 => '{\| class="wikitable" style="text-align:center;"', 708 => '\| <big>й</big>', 709 => '\| <small>U+0439</small>', 710 => '\|}', 711 => '\|', 712 => '{\| class="wikitable" style="text-align:center;"', 713 => '\| <big>и</big> \|\| <big> ̆</big>', 714 => '\| <small>U+0438</small> \|\| <small>U+0306</small>', 715 => '\|}', 716 => '\|', 717 => '{\| class="wikitable" style="text-align:center;"', 718 => '\| <big>й</big>', 719 => '\| <small>U+0439</small>', 720 => '\|}', 721 => ' \|-', 722 => ' \| <!-- "ё" -->', 723 => '{\| class="wikitable" style="text-align:center;"', 724 => '\| <big>ё</big>', 725 => '\| <small>U+0451</small>', 726 => '\|}', 727 => 
'\|', 728 => '{\| class="wikitable" style="text-align:center;"', 729 => '\| <big>е</big> \|\| <big>̈</big>', 730 => '\| <small>U+0435</small> \|\| <small>U+0308</small>', 731 => '\|}', 732 => '\|', 733 => '{\| class="wikitable" style="text-align:center;"', 734 => '\| <big>ё</big>', 735 => '\| <small>U+0451</small>', 736 => '\|}', 737 => '\|', 738 => '{\| class="wikitable" style="text-align:center;"', 739 => '\| <big>е</big> \|\| <big>̈</big>', 740 => '\| <small>U+0435</small> \|\| <small>U+0308</small>', 741 => '\|}', 742 => '\|', 743 => '{\| class="wikitable" style="text-align:center;"', 744 => '\| <big>ё</big>', 745 => '\| <small>U+0451</small>', 746 => '\|}', 747 => ' \|-', 748 => ' \| <!-- "А" -->', 749 => '{\| class="wikitable" style="text-align:center;"', 750 => '\| <big>А</big>', 751 => '\| <small>U+0410</small>', 752 => '\|}', 753 => '\|', 754 => '{\| class="wikitable" style="text-align:center;"', 755 => '\| <big>А</big>', 756 => '\| <small>U+0410</small>', 757 => '\|}', 758 => '\|', 759 => '{\| class="wikitable" style="text-align:center;"', 760 => '\| <big>А</big>', 761 => '\|-', 762 => '\| <small>U+0410</small>', 763 => '\|}', 764 => '\|', 765 => '{\| class="wikitable" style="text-align:center;"', 766 => '\| <big>А</big>', 767 => '\|-', 768 => '\| <small>U+0410</small>', 769 => '\|}', 770 => '\|', 771 => '{\| class="wikitable" style="text-align:center;"', 772 => '\| <big>А</big>', 773 => '\|-', 774 => '\| <small>U+0410</small>', 775 => '\|}', 776 => ' \|-', 777 => ' \| <!-- "が" -->', 778 => '{\| class="wikitable" style="text-align:center;"', 779 => '\| <big>が</big>', 780 => '\|-', 781 => '\| <small>U+304C</small>', 782 => '\|}', 783 => '\|', 784 => '{\| class="wikitable" style="text-align:center;"', 785 => '\| <big>か</big> \|\| <big>゙</big>', 786 => '\|-', 787 => '\| <small>U+304B</small> \|\| <small>U+3099</small>', 788 => '\|}', 789 => '\|', 790 => '{\| class="wikitable" style="text-align:center;"', 791 => '\| <big>が</big>', 792 => '\|-', 793 => '\| 
<small>U+304C</small>', 794 => '\|}', 795 => '\|', 796 => '{\| class="wikitable" style="text-align:center;"', 797 => '\| <big>か</big> \|\| <big>゙</big>', 798 => '\|-', 799 => '\| <small>U+304B</small> \|\| <small>U+3099</small>', 800 => '\|}', 801 => '\|', 802 => '{\| class="wikitable" style="text-align:center;"', 803 => '\| <big>が</big>', 804 => '\|-', 805 => '\| <small>U+304C</small>', 806 => '\|}', 807 => ' \|-', 808 => ' \| <!-- "VIII" -->', 809 => '{\| class="wikitable" style="text-align:center;"', 810 => '\| <big>Ⅷ</big>', 811 => '\|-', 812 => '\| <small>U+2167</small>', 813 => '\|}', 814 => '\|', 815 => '{\| class="wikitable" style="text-align:center;"', 816 => '\| <big>Ⅷ</big>', 817 => '\|-', 818 => '\| <small>U+2167</small>', 819 => '\|}', 820 => '\|', 821 => '{\| class="wikitable" style="text-align:center;"', 822 => '\| <big>Ⅷ</big>', 823 => '\|-', 824 => '\| <small>U+2167</small>', 825 => '\|}', 826 => '\|', 827 => '{\| class="wikitable" style="text-align:center;"', 828 => '\| <big>V</big> \|\| <big>I</big> \|\| <big>I</big> \|\| <big>I</big>', 829 => '\|-', 830 => '\| <small>U+0056</small> \|\| <small>U+0049</small> \|\| <small>U+0049</small> \|\| <small>U+0049</small>', 831 => '\|}', 832 => '\|', 833 => '{\| class="wikitable" style="text-align:center;"', 834 => '\| <big>V</big> \|\| <big>I</big> \|\| <big>I</big> \|\| <big>I</big>', 835 => '\|-', 836 => '\| <small>U+0056</small> \|\| <small>U+0049</small> \|\| <small>U+0049</small> \|\| <small>U+0049</small>', 837 => '\|}', 838 => ' \|-', 839 => ' \| <!-- "ç" -->', 840 => '{\| class="wikitable" style="text-align:center;"', 841 => '\| <big>ç</big>', 842 => '\|-', 843 => '\| <small>U+00E7</small>', 844 => '\|}', 845 => '\|', 846 => '{\| class="wikitable" style="text-align:center;"', 847 => '\| <big>c</big> \|\| <big>̧</big>', 848 => '\|-', 849 => '\| <small>U+0063</small> \|\| <small>U+0327</small>', 850 => '\|}', 851 => '\|', 852 => '{\| class="wikitable" style="text-align:center;"', 853 => '\| 
<big>ç</big>', 854 => '\|-', 855 => '\| <small>U+00E7</small>', 856 => '\|}', 857 => '\|', 858 => '{\| class="wikitable" style="text-align:center;"', 859 => '\| <big>c</big> \|\| <big>̧</big>', 860 => '\|-', 861 => '\| <small>U+0063</small> \|\| <small>U+0327</small>', 862 => '\|}', 863 => '\|', 864 => '{\| class="wikitable" style="text-align:center;"', 865 => '\| <big>ç</big>', 866 => '\|-', 867 => '\| <small>U+00E7</small>', 868 => '\|}', 869 => '== Двунаправленное письмо ==', 870 => 'Стандарт Юникод поддерживает письменности языков как с направлением написания слева направо ({{lang-en\|left-to-right, LTR}}), так и с написанием справа налево ({{lang-en\|right-to-left, RTL}}) — например, [[арабское письмо\|арабское]] и [[еврейский алфавит\|еврейское]] письмо. В обоих случаях символы хранятся в «естественном» порядке; их отображение с учётом нужного направления письма обеспечивается приложением.', 871 => 'Кроме того, Юникод поддерживает комбинированные тексты, сочетающие фрагменты с разным направлением письма. Данная возможность называется ''двунаправленность'' ({{lang-en\|bidirectional text, BiDi}}). Некоторые упрощённые обработчики текста (например, в сотовых телефонах) могут поддерживать Юникод, но не иметь поддержки двунаправленности. Все символы Юникода поделены на несколько категорий: пишущиеся слева направо, пишущиеся справа налево, и пишущиеся в любом направлении. 
Символы последней категории (в основном это [[знаки пунктуации]]) при отображении принимают направление окружающего их текста.', 872 => '== Представленные символы ==', 873 => '[[Файл:Roadmap to Unicode BMP multilingual.svg\|lang=ru\|right\|500px\|thumb\|Схема [[Плоскость (Юникод)#Основная многоязычная плоскость\|основной мультиязычной плоскости]] Юникода]]', 874 => '{{Main\|Плоскость (Юникод)}}', 875 => 'Юникод включает практически все современные [[письменность\|письменности]], в том числе:', 876 => '{{columns-list\|2\|', 877 => '* [[арабское письмо\|арабскую]],', 878 => '* [[армянское письмо\|армянскую]],', 879 => '* [[бенгальское письмо\|бенгальскую]],', 880 => '* [[Бирманское письмо\|бирманскую]],', 881 => '* [[Глаголица\|глаголицу]],', 882 => '* [[Греческий алфавит\|греческую]],', 883 => '* [[грузинское письмо\|грузинскую]],', 884 => '* [[деванагари]],', 885 => '* [[еврейский алфавит\|еврейскую]],', 886 => '* [[Кириллица\|кириллицу]],', 887 => '* [[китайское письмо\|китайскую]] (китайские иероглифы активно используются в [[японский язык\|японском языке]], а также изредка в [[корейский язык\|корейском]]),', 888 => '* [[коптское письмо\|коптскую]],', 889 => '* [[Кхмерское письмо\|кхмерскую]],', 890 => '* [[Латинский алфавит\|латинскую]],', 891 => '* [[Тамильское письмо\|тамильскую]],', 892 => '* [[Хангыль\|корейскую (хангыль)]],', 893 => '* [[письмо чероки\|чероки]],', 894 => '* [[Эфиопское письмо\|эфиопскую]],', 895 => '* [[японское письмо\|японскую]] (которая включает в себя, кроме [[кана\|слоговой азбуки]], ещё и [[кандзи\|китайские иероглифы]])', 896 => '}}', 897 => 'и другие.', 898 => 'С академическими целями добавлены многие исторические письменности, в том числе: [[руны\|германские руны]], [[Древнетюркское письмо\|древнетюркские руны]], [[древнегреческий язык\|древнегреческая письменность]], [[египетские иероглифы]], [[клинопись]], [[письменность майя]], [[этрусский алфавит]].', 899 => 'В Юникоде представлен широкий набор [[таблица математических 
символов\|математических]] и [[музыка]]льных символов, а также [[пиктограмма\|пиктограмм]].', 900 => '[[государственный флаг\|Государственные флаги]] не включены в Юникод напрямую. Для их кодирования используются пары из 26 буквенных символов, предназначенных для представления двухбуквенных кодов стран по стандарту [[ISO 3166-1 alpha-2]]. Эти буквы закодированы в диапазоне от {{unichar\|1F1E6\|regional indicator symbol letter a\|html=}} до {{unichar\|1F1FF\|regional indicator symbol letter z\|html=}}.', 901 => 'В Юникод принципиально не включаются [[логотип]]ы компаний и продуктов, хотя они и встречаются в шрифтах (например, логотип [[Apple]] в кодировке [[MacRoman]] (0xF0) или логотип [[Microsoft Windows\|Windows]] в шрифте Wingdings (0xFF)). В юникодовских шрифтах логотипы должны размещаться только в области пользовательских символов.', 902 => '== ISO/IEC 10646 ==', 903 => 'Консорциум Юникода работает в тесной связи с рабочей группой ISO/IEC/JTC1/SC2/WG2, которая занимается разработкой международного стандарта 10646 ([[ISO]]/[[IEC]] 10646). Между стандартом Юникода и ISO/IEC 10646 установлена синхронизация, хотя каждый стандарт использует свою терминологию и систему документации.', 904 => 'Сотрудничество Консорциума Юникода с Международной организацией по стандартизации ({{lang-en\|International Organization for Standardization, ISO}}) началось в [[1991 год]]у. В [[1993 год]]у ISO выпустила стандарт DIS 10646.1. Для синхронизации с ним Консорциум утвердил стандарт Юникода версии 1.1, в который были внесены дополнительные символы из DIS 10646.1. В результате значения закодированных символов в Unicode 1.1 и DIS 10646.1 полностью совпали.', 905 => 'В дальнейшем сотрудничество двух организаций продолжилось. В 2000 году стандарт Unicode 3.0 был синхронизирован с ISO/IEC 10646-1:2000. Предстоящая третья версия ISO/IEC 10646 будет синхронизирована с Unicode 4.0. 
Возможно, эти спецификации даже будут опубликованы как единый стандарт.', 906 => 'Аналогично форматам UTF-16 и UTF-32 в стандарте Юникода, стандарт ISO/IEC 10646 также имеет две основные формы кодирования символов: UCS-2 (2 байта на символ, аналогично UTF-16) и UCS-4 (4 байта на символ, аналогично UTF-32). UCS значит ''универсальный набор кодированных символов'' ({{lang-en\|universal coded character set}}). UCS-2 можно считать подмножеством UTF-16 (UTF-16 без суррогатных пар), а UCS-4 является синонимом для UTF-32.', 907 => 'Различия стандартов Юникод и ISO/IEC 10646:', 908 => '* небольшие различия в терминологии;', 909 => '* ISO/IEC 10646 не включает разделы, необходимые для полноценной реализации поддержки Юникода:', 910 => '** нет данных о двоичном кодировании символов;', 911 => '** нет описания алгоритмов сравнения ({{lang-en\|collation}}) и отрисовки ({{lang-en\|rendering}}) символов;', 912 => '** нет перечня свойств символов (например, нет перечня свойств, необходимых для реализации поддержки двунаправленного ({{lang-en\|bi-directional}}) письма).', 913 => '== Способы представления ==', 914 => 'Юникод имеет несколько форм представления ({{lang-en\|Unicode transformation format, UTF}}): [[UTF-8]], [[UTF-16]] (UTF-16BE, UTF-16LE) и [[UTF-32]] (UTF-32BE, UTF-32LE). Была разработана также форма представления [[UTF-7]] для передачи по семибитным каналам, но из-за несовместимости с [[ASCII]] она не получила распространения и не включена в стандарт. 1 апреля 2005 года были предложены две [[День смеха\|шуточные]] формы представления: UTF-9 и UTF-18 ([http://tools.ietf.org/html/rfc4042 RFC{{nbsp}}4042]).', 915 => 'В [[Microsoft]] [[Windows NT]] и основанных на ней системах [[Windows 2000]] и [[Windows XP]] в основном [[Юникод в операционных системах семейства Microsoft Windows\|используется]] форма UTF-16LE. 
В [[UNIX]]-подобных [[Операционная система\|операционных системах]] [[GNU/Linux]], [[BSD]] и [[Mac OS X]] принята форма UTF-8 для файлов и UTF-32 или UTF-8 для обработки символов в [[оперативная память\|оперативной памяти]].', 916 => '[[Punycode]] — другая форма кодирования последовательностей Unicode-символов в так называемые ACE-последовательности, которые состоят только из алфавитно-цифровых символов, как это разрешено в доменных именах.', 917 => '=== UTF-8 ===', 918 => '{{Основная статья\|UTF-8}}', 919 => 'UTF-8 — представление Юникода, обеспечивающее наибольшую компактность и обратную совместимость с 7-битной системой [[ASCII]]; текст, состоящий только из символов с номерами меньше 128, при записи в UTF-8 превращается в обычный текст [[ASCII]] и может быть отображён любой программой, работающей с ASCII; и наоборот, текст, закодированный 7-битной ASCII может быть отображён программой, предназначенной для работы с UTF-8. Остальные символы Юникода изображаются последовательностями длиной от 2 до 4 байт, в которых первый байт всегда имеет маску <code>11xxxxxx</code>, а остальные — <code>10xxxxxx</code>. В UTF-8 не используются суррогатные пары.', 920 => 'Формат UTF-8 был изобретён [[2 сентября]] [[1992 год]]а [[Томпсон, Кен\|Кеном Томпсоном]] и [[Пайк, Роб\|Робом Пайком]] и реализован в ОС [[Plan 9]]<ref>http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt{{ref-en}}{{Недоступная ссылка\|date=Октябрь 2019 \|bot=InternetArchiveBot }}</ref>. Сейчас стандарт UTF-8 официально закреплён в документах RFC 3629 и ISO/IEC 10646 Annex D.', 921 => '=== UTF-16 и UTF-32 ===', 922 => '{{Основная статья\|UTF-16\|UTF-32}}', 923 => 'UTF-16 — кодировка, позволяющая записывать символы Юникода в диапазонах U+0000...U+D7FF и U+E000...U+10FFFF (общим количеством 1 112 064). При этом каждый символ записывается одним или двумя словами (суррогатная пара). 
Кодировка UTF-16 описана в приложении Q к международному стандарту ISO/IEC 10646, а также ей посвящён документ IETF RFC 2781 под названием «UTF-16, an encoding of ISO 10646».', 924 => 'UTF-32 — способ представления Юникода, при котором каждый символ занимает ровно 4 байта. Главное преимущество UTF-32 перед кодировками переменной длины заключается в том, что символы Юникод в ней непосредственно индексируемы, поэтому найти символ по номеру его позиции в файле можно чрезвычайно быстро, и получение любого символа ''n''-й позиции при этом является операцией, занимающей всегда одинаковое время. Это также делает замену символов в строках UTF-32 очень простой. Напротив, кодировки с переменной длиной требуют последовательного доступа к символу ''n''-й позиции, что может быть очень затратной по времени операцией. Главный недостаток UTF-32 — это неэффективное использование пространства, так как для хранения любого символа используется четыре байта. Символы, лежащие за пределами нулевой (базовой) плоскости кодового пространства, редко используются в большинстве текстов. Поэтому удвоение, в сравнении с UTF-16, занимаемого строками в UTF-32 пространства, зачастую не оправдано.', 925 => '==== Порядок байтов ====', 926 => '{{Основная статья\|Порядок байтов}}', 927 => 'В потоке данных UTF-16 младший байт может записываться либо перед старшим ({{lang-en\|UTF-16 little-endian, UTF-16LE}}), либо после старшего ({{lang-en\|UTF-16 big-endian, UTF-16BE}}). Аналогично существует два варианта четырёхбайтной кодировки — UTF-32LE и UTF-32BE.', 928 => '=== Маркер последовательности байтов ===', 929 => '{{Основная статья\|Маркер последовательности байтов}}', 930 => 'Для указания на использование Юникода, в начале текстового файла или потока может передаваться [[Маркер последовательности байтов]] ({{lang-en\|byte order mark (BOM)}}) — символ U+FEFF (неразрывный пробел нулевой ширины). По его виду можно легко различить как формат представления Юникода, так и последовательность байтов. 
Маркер последовательности байтов может принимать следующий вид:', 931 => '; UTF-8 : EF BB BF', 932 => '; UTF-16BE : FE FF', 933 => '; UTF-16LE : FF FE', 934 => '; UTF-32BE : 00 00 FE FF', 935 => '; UTF-32LE : FF FE 00 00', 936 => '=== Юникод и традиционные кодировки ===', 937 => 'Внедрение Юникода привело к изменению подхода к традиционным 8-битным кодировкам. Если раньше такая кодировка всегда задавалась непосредственно, то теперь она может задаваться таблицей соответствия между данной кодировкой и Юникодом. Фактически почти все 8-битные кодировки теперь можно рассматривать как форму представления некоторого подмножества Юникода. И это намного упростило создание программ, которые должны работать с множеством разных кодировок: теперь, чтобы добавить поддержку ещё одной кодировки, надо всего лишь добавить ещё одну таблицу перекодировки символов в Юникод.', 938 => 'Кроме того, многие форматы данных позволяют вставлять любые символы Юникода, даже если документ записан в старой 8-битной кодировке. Например, в HTML можно использовать [[Мнемоники в HTML\|коды с амперсандом]].', 939 => '=== Реализации ===', 940 => 'Большинство современных операционных систем в той или иной степени обеспечивает поддержку Юникода.', 941 => 'В операционных системах семейства [[Windows NT]] для внутреннего представления имён файлов и других системных строк используется двухбайтовая кодировка UTF-16LE. Системные вызовы, принимающие строковые параметры, существуют в однобайтном и двухбайтном вариантах. Подробнее см. в статье [[Юникод в операционных системах семейства Microsoft Windows]].', 942 => '[[UNIX]]-подобные операционные системы, в том числе [[GNU/Linux]], [[BSD]], [[OS X]], используют для представления Юникода кодировку UTF-8. Большинство программ может работать с UTF-8 как с традиционными однобайтными кодировками, не обращая внимания на то, что символ представляется как несколько последовательных байт. 
Для работы с отдельными символами строки обычно перекодируются в UCS-4, так что каждому символу соответствует [[машинное слово]].', 943 => 'Одной из первых успешных коммерческих реализаций Юникода стала среда программирования [[Java]]. В ней принципиально отказались от 8-битного представления символов в пользу 16-битного. Это решение увеличило расход памяти, но позволило вернуть в программирование важную абстракцию: произвольный одиночный символ (тип <code>char</code>). В частности, программист мог работать со строкой, как с простым массивом. К сожалению, успех не был окончательным, Юникод перерос ограничение в 16 бит и к версии J2SE 5.0 произвольный символ снова стал занимать переменное число единиц памяти — один <code>char</code> или два (см. [[UTF-16\|суррогатная пара]]).', 944 => 'Сейчас большинство языков программирования поддерживает строки Юникода, хотя их представление может различаться в зависимости от реализации.', 945 => '== Методы ввода ==', 946 => 'Поскольку ни одна [[раскладка клавиатуры]] не может позволить вводить все символы Юникода одновременно, от [[операционная система\|операционных систем]] и [[прикладное программное обеспечение\|прикладных программ]] требуется поддержка альтернативных методов ввода произвольных символов Юникода.', 947 => '=== [[Microsoft Windows]] ===', 948 => '{{main\|Юникод в операционных системах семейства Microsoft Windows}}', 949 => 'Хотя, начиная с [[Windows 2000]], служебная программа «Таблица символов» (charmap.exe) поддерживает символы Юникода и позволяет копировать их в [[буфер обмена]], эта поддержка ограничена только базовой плоскостью (коды символов U+0000…U+FFFF). Символы с кодами от U+10000 «Таблица символов» не отображает.', 950 => 'Похожая таблица есть, например, в [[Microsoft Word]].', 951 => 'Иногда можно набрать [[Шестнадцатеричная система счисления\|шестнадцатеричный]] код, нажать {{key\|[[Alt (клавиша)\|Alt]]\|X}}, и код будет заменён на соответствующий символ, например, в [[WordPad]], Microsoft Word. 
В редакторах {{key\|Alt\|X}} выполняет и обратное преобразование.', 952 => 'Во многих программах MS Windows, чтобы получить символ Unicode, нужно при нажатой клавише Alt набрать десятичное значение кода символа на цифровой клавиатуре. Например, полезными при наборе кириллических текстов будут комбинации Alt+0171 (<!-- защита от Викификатора --><nowiki>«</nowiki>), Alt+0187 (<nowiki>»</nowiki>) и Alt+0769 ([[знак ударения]]). Интересны также комбинации Alt+0133 (…) и Alt+0151 (—).', 953 => '=== [[Macintosh]] ===', 954 => 'В [[Mac OS]] 8.5 и более поздних версиях поддерживается метод ввода, называемый «Unicode Hex Input». При зажатой клавише Option требуется набрать четырёхзначный шестнадцатеричный код требуемого символа. Этот метод позволяет вводить символы с кодами, большими U+FFFF, используя пары суррогатов; такие пары операционной системой будут автоматически заменены на одиночные символы. Этот метод ввода перед использованием нужно активизировать в соответствующем разделе системных настроек и затем выбрать как текущий метод ввода в меню клавиатуры.', 955 => 'Начиная с [[Mac OS X]] 10.2, существует также приложение «Character Palette», позволяющее выбирать символы из таблицы, в которой можно выделять символы определённого блока или символы, поддерживаемые конкретным шрифтом.', 956 => '=== [[GNU/Linux]] ===', 957 => 'В [[GNOME]] также есть утилита «[[Таблица символов GNOME\|Таблица символов]]» (ранее gucharmap), позволяющая отображать символы определённого блока или системы письма и предоставляющая возможность поиска по названию или описанию символа. Когда код нужного символа известен, его можно ввести в соответствии со стандартом [[Международная организация по стандартизации\|ISO]]{{nbsp}}14755: при зажатых клавишах {{key\|Ctrl\|Shift}} ввести шестнадцатеричный код (начиная с некоторой версии GTK+, ввод кода нужно предварить нажатием клавиши ''«U»''). 
Вводимый шестнадцатеричный код может иметь до {{num\|32\|бит}} в длину, позволяя вводить любые символы Юникода без использования суррогатных пар.', 958 => 'Все приложения [[X Window System\|X{{nbsp}}Window]], включая GNOME и [[KDE]], поддерживают ввод при помощи клавиши {{Key\|[[Compose]]}}. Для клавиатур, на которых нет отдельной клавиши [[Compose]], для этой цели можно назначить любую клавишу — например, {{Key\|[[Caps Lock]]}}.', 959 => 'Консоль GNU/Linux также допускает ввод символа Юникода по его коду — для этого десятичный код символа нужно ввести цифрами расширенного блока клавиатуры при зажатой клавише {{Key\|[[Alt (клавиша)\|Alt]]}}. Можно вводить символы и по их шестнадцатеричному коду: для этого нужно зажать клавишу {{key\|AltGr}}, и для ввода цифр A—F использовать клавиши расширенного блока клавиатуры от {{Key\|NumLock}} до {{Key\|Enter}} (по часовой стрелке). Поддерживается также и ввод в соответствии с ISO{{nbsp}}14755. Для того чтобы перечисленные способы могли работать, нужно включить в консоли режим Юникода вызовом <code>unicode_start</code>(1) и выбрать подходящий шрифт вызовом <code>setfont</code>(8).', 960 => '[[Mozilla Firefox]] для Linux поддерживает ввод символов по ISO{{nbsp}}14755.', 961 => '== Проблемы Юникода ==', 962 => 'В Юникоде английское «a» и польское «a» — один и тот же символ. Точно так же одним и тем же символом (но отличающимся от «a» латинского) считаются русское «а» и сербское «а». Такой принцип кодирования не универсален; по-видимому, решения «на все случаи жизни» вообще не может существовать.', 963 => '* Тексты на [[китайский язык\|китайском]], [[корейский язык\|корейском]] и [[японский язык\|японском]] языках имеют традиционное написание сверху вниз, начиная с правого верхнего угла. 
Переключение горизонтального и вертикального написания для этих языков не предусмотрено в Юникоде — это должно осуществляться средствами [[язык разметки\|языков разметки]] или внутренними механизмами [[текстовый процессор\|текстовых процессоров]].', 964 => '* Наличие или отсутствие в Юникоде разных начертаний одного и того же символа в зависимости от языка. Нужно следить, чтобы текст всегда был правильно помечен как относящийся к тому или другому языку.', 965 => ': Так, [[китайское письмо\|китайские иероглифы]] могут иметь разные начертания в китайском, японском ([[кандзи]]) и корейском ([[ханча]]), но при этом в Юникоде обозначаются одним и тем же символом (так называемая CJK-унификация), хотя упрощённые и полные иероглифы всё же имеют разные коды. ', 966 => ': Аналогично, [[русский язык\|русский]] и [[сербский язык\|сербский]] <!-- защита от Викификатора --><nowiki>языки</nowiki> используют разное начертание курсивных букв ''п'' и ''т'' (в сербском они выглядят как <span style="text-decoration: overline; font-style: italic">и</span> и <span style="text-decoration: overline; font-style: italic">ш</span>, см. [[сербский курсив]]). ', 967 => '* Перевод из строчных букв в заглавные тоже зависит от языка. Например: в [[турецкий язык\|турецком]] существуют буквы [[i без точки\|İi и Iı]] — таким образом, турецкие правила изменения регистра конфликтуют с [[английский язык\|английскими]], которые предписывают «i» переводить в «I». 
Подобные проблемы есть и в других языках — например, в канадском диалекте французского языка регистр переводится немного не так, как во Франции<ref>[http://www.transl-gunsmoker.ru/2008/11/unicode.html Регистр в Unicode — это непросто]</ref>.', 968 => '* Даже с [[арабские цифры\|арабскими цифрами]] есть определённые типографские тонкости: цифры бывают «прописными» и «[[минускульные цифры\|строчными]]», пропорциональными и [[моноширинный шрифт\|моноширинными]]<ref>В большинстве шрифтов для ПК реализованы «прописные» (маюскульные) моноширинные цифры.</ref> — для Юникода разницы между ними нет. Подобные нюансы остаются за программным обеспечением.', 969 => 'Некоторые недостатки связаны не с самим Юникодом, а с возможностями обработчиков текста.', 970 => '* Файлы нелатинского текста в Юникоде всегда занимают больше места, так как один символ кодируется не одним байтом, как в различных национальных кодировках, а последовательностью байтов (исключение составляет UTF-8 для языков, алфавит которых укладывается в ASCII, а также наличие в тексте символов двух и более языков, алфавит которых ''не'' укладывается в ASCII<ref>В некоторых случаях документ (не простой текст) в Юникоде может занимать существенно меньше места, чем документ в однобайтовой кодировке. Например, если некая веб-страница содержит примерно поровну русского и греческого текста, то в однобайтовой кодировке придётся либо русские, либо греческие буквы записывать, используя возможности формата документов, в виде кодов с амперсандом, которые занимают 6—7 байт на символ (при использовании десятичных кодов), то есть в среднем на букву придётся 3,5—4 байта, в то время как UTF-8 занимает только 2 байта на греческую или русскую букву.</ref>). 
Файл шрифта, необходимый для отображения всех символов таблицы Юникод, занимает сравнительно много места в памяти и требует бо́льших вычислительных ресурсов, чем шрифт только одного национального языка пользователя<ref>Один из файлов шрифтов Arial Unicode имеет размер 24 мегабайта; существует Times New Roman размером 120 мегабайт, он содержит количество символов, близкое к 65536.</ref>. С увеличением мощности компьютерных систем и удешевлением памяти и дискового пространства эта проблема становится всё менее существенной; тем не менее, она остаётся актуальной для портативных устройств, например, для мобильных телефонов.', 971 => '* Хотя поддержка Юникода реализована в наиболее распространённых операционных системах, до сих пор не всё прикладное программное обеспечение поддерживает корректную работу с ним. В частности, не всегда обрабатываются метки порядка байтов ([[Byte order mark\|BOM]]) и плохо поддерживаются диакритические символы. Проблема является временной и есть следствие сравнительной новизны стандартов Юникода (в сравнении с однобайтовыми национальными кодировками).', 972 => '* Производительность всех программ обработки строк (в том числе и сортировок в БД) снижается при использовании Юникода вместо однобайтовых кодировок.', 973 => 'Некоторые редкие системы письма всё ещё не представлены должным образом в Юникоде. Изображение «длинных» надстрочных символов, простирающихся над несколькими буквами, как, например, в [[церковнославянский язык\|церковнославянском языке]], пока не реализовано.', 974 => '== «Юникод» или «Уникод»? ==', 975 => '«Unicode» — одновременно и имя собственное (или часть имени, например, Unicode Consortium), и имя нарицательное, происходящее из английского языка.', 976 => 'На первый взгляд предпочтительнее использовать написание «Уникод». 
В [[Русский язык\|русском языке]] уже есть [[Морфема\|морфемы]] «уни-» (слова с латинским элементом «uni-» традиционно переводились и писались через «уни-»: универсальный, униполярный, унификация, униформа) и «код». Напротив, торговые марки, заимствованные из [[Английский язык\|английского языка]], обычно передаются посредством практической транскрипции, в которой деэтимологизированное сочетание букв «uni-» записывается в виде «юни-» («[[Юнилевер]]», «[[UNIX\|Юникс]]» и т. п.), то есть точно так же, как в случае с побуквенными сокращениями, вроде [[UNICEF]] «United Nations International Children’s Emergency Fund» — [[ЮНИСЕФ]].', 977 => 'Написание «Юникод» уже твёрдо вошло в русскоязычные тексты. В [[Википедия\|Википедии]] используется более распространённый вариант. В [[MS Windows]] используется вариант «Юникод».', 978 => 'На сайте Консорциума есть специальная страница, где рассматриваются проблемы передачи слова «Unicode» в различных языках и системах письма. Для русской кириллицы указан вариант «Юникод»<ref name=autogenerated1 />.', 979 => 'Формы, принятые иностранными организациями для русской передачи слова «Unicode», являются рекомендательными.', 980 => '== См. 
также ==', 981 => '* [[Символы, представленные в Юникоде]]', 982 => '* [[ASCII]]', 983 => '* [[ISO 8859-1]]', 984 => '* [[UTF-8]]', 985 => '* [[UTF-16]]', 986 => '* [[UTF-32]]', 987 => '* [[Кириллица в Юникоде]]', 988 => '* [[Дроби в Юникоде]]', 989 => '* [[XeTeX]]', 990 => '* [[Свободные универсальные шрифты]]', 991 => '* [[Windows Glyph List 4]]', 992 => '* [[Широкий символ]]', 993 => '* Библиотека [[GLib]] содержит широкий набор функций для работы c символами и строками в кодировке Unicode', 994 => '* [[Проект:Внесение символов алфавитов народов России в Юникод]]', 995 => '== Примечания ==', 996 => '{{примечания\|2}}', 997 => '== Ссылки ==', 998 => '* [http://www.unicode.org/ Официальный сайт Консорциума Юникода]{{ref-en}}', 999 => '* {{dmoz\|Computers/Software/Globalization/Character_Encoding/Unicode/\|Unicode}}{{ref-en}}', 1000 => '* Статья «[http://www.unicode.org/standard/translations/russian.html Что такое Unicode?]»{{ref-ru}} на официальном сайте Консорциума', 1001 => '* [http://www.unicode.org/versions/latest/ Последняя версия стандарта Юникод]{{ref-en}}', 1002 => '* Последнюю версию стандарта ISO/IEC 10646 ищите в [http://standards.iso.org/ittf/PubliclyAvailableStandards/ списке доступных стандартов]{{ref-en}}. 
Документы, соответствующие стандарту Unicode 7.0: [http://standards.iso.org/ittf/PubliclyAvailableStandards/c056921_ISO_IEC_10646_2012.zip ISO/IEC 10646] (файл ZIP){{ref-en}}, [http://standards.iso.org/ittf/PubliclyAvailableStandards/c061712_ISO_IEC_10646_2012_Amd_1_2013.zip Amendments 1] (файл ZIP){{ref-en}}, Amendments 2 (по состоянию 2014-08-06 ещё недоступен)', 1003 => '* [http://unicode-table.com/ Таблица символов Юникода с названиями и описаниями]{{ref-ru}}{{ref-en}}{{ref-de}}', 1004 => '* [http://www.unicode.org/versions/Unicode5.0.0/appC.pdf Связь Юникода версии 5.0.0 и ISO/IEC 10646] (файл PDF){{ref-en}}', 1005 => '* [http://www.cl.cam.ac.uk/~mgk25/unicode.html FAQ по UTF-8 и Unicode]{{ref-en}}', 1006 => '* [[Кириллица в Юникоде]]: http://www.unicode.org/charts/PDF/U0400.pdf, http://www.unicode.org/charts/PDF/U0500.pdf, http://www.unicode.org/charts/PDF/U2DE0.pdf, http://www.unicode.org/charts/PDF/UA640.pdf{{ref-en}}{{Недоступная ссылка\|date=Январь 2020 \|bot=InternetArchiveBot }}', 1007 => '* [http://www.i18nguy.com/surrogates.html Включение поддержки дополнительных символов Юникода в Windows]{{ref-en}}', 1008 => '* [http://www.fileformat.info/info/unicode/char/search.htm Поиск по символам Юникода]{{ref-en}}', 1009 => '{{Стандарты ISO}}', 1010 => '{{Шрифтовой дизайн}}', 1011 => '[[Категория:Юникод\| ]]', 1012 => '[[Категория:Стандарты Интернета]]', 1013 => '[[Категория:Стандарты ISO]]' ]
Все внешние ссылки, добавленные в правке (`added_links`)	[ 0 => 'https://www.unicode.org/standard/principles.html', 1 => 'https://w3techs.com/technologies/cross/character_encoding/ranking', 2 => 'https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#I1.36559', 3 => 'https://unicode.org/history/unicode88.pdf', 4 => 'https://web.archive.org/web/20161125224409/https://unicode.org/history/unicode88.pdf', 5 => 'https://www.unicode.org/history/summary.html', 6 => 'https://unicode.org/history/publicationdates.html', 7 => 'http://tronweb.super-nova.co.jp/unicoderevisited.html', 8 => 'https://unicode.org/consortium/members.html', 9 => 'https://www.unicode.org/charts/', 10 => 'https://home.unicode.org/basic-info/faq/', 11 => 'https://www.unicode.org/roadmaps/bmp/', 12 => 'https://www.unicode.org/pending/about-sei.html', 13 => 'https://www.unicode.org/mail-arch/unicode-ml/y2012-m05/0240.html', 14 => 'https://home.unicode.org/unicode-14-0-delayed-for-6-months/', 15 => 'https://www.unicode.org/versions/enumeratedversions.html', 16 => 'https://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt', 17 => 'https://www.unicode.org/Public/reconstructed/1.0.1/UnicodeData.txt', 18 => 'https://www.unicode.org/Public/1.1-Update/UnicodeData-1.1.5.txt', 19 => 'https://www.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt', 20 => 'https://www.unicode.org/Public/2.1-Update/UnicodeData-2.1.2.txt', 21 => 'https://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt', 22 => 'https://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt', 23 => 'https://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt', 24 => 'https://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt', 25 => 'https://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt', 26 => 'https://www.unicode.org/Public/5.0.0/ucd/UnicodeData.txt', 27 => 'https://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt', 28 => 'https://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt', 29 => 
'https://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt', 30 => 'https://www.unicode.org/Public/6.1.0/ucd/UnicodeData.txt', 31 => 'https://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt', 32 => 'https://www.unicode.org/Public/6.3.0/ucd/UnicodeData.txt', 33 => 'https://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt', 34 => 'https://www.unicode.org/versions/Unicode8.0.0/', 35 => 'https://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt', 36 => 'https://www.unicode.org/versions/Unicode9.0.0/', 37 => 'https://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt', 38 => 'https://www.androidpolice.com/2016/06/07/two-emoji-werent-approved-unicode-9-google-added-android-anyway/', 39 => 'https://www.unicode.org/versions/Unicode10.0.0/', 40 => 'https://www.unicode.org/versions/Unicode11.0.0/appC.pdf', 41 => 'https://www.unicode.org/versions/Unicode12.0.0/appC.pdf', 42 => 'https://www.unicode.org/versions/Unicode13.0.0/appC.pdf', 43 => 'https://unicode.org/glossary/', 44 => 'https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212', 45 => 'http://www.unicode.org/versions/Unicode12.0.0/ch02.pdf#G25564', 46 => 'https://www.unicode.org/versions/Unicode13.0.0/appA.pdf', 47 => 'https://unicode.org/policies/stability_policy.html', 48 => 'https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G43463', 49 => 'https://unicode.org/reports/tr17/', 50 => 'https://unicode.org/Public/UNIDATA/NamedSequences.txt', 51 => 'https://unicode.org/Public/UNIDATA/NameAliases.txt', 52 => 'https://www.evertype.com/standards/iso10646/pdf/cwa13873.pdf', 53 => 'https://www.cl.cam.ac.uk/~mgk25/ucs/mes-2-rationale.html', 54 => 'https://unicode.org/faq/utf_bom.html', 55 => 'https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt', 56 => 'https://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14755.pdf', 57 => 'https://unicodelookup.com/', 58 => 'http://shapecatcher.com/', 59 => 'http://www.alanwood.net/unicode/explorer.html#ie5', 60 => 'https://www.w3.org/TR/xml11', 61 => 
'http://cajun.cs.nott.ac.uk/wiley/journals/epobetan/pdf/volume6/issue3/bigelow.pdf', 62 => 'https://www.unicode.org/faq/font_keyboard.html', 63 => 'http://tronweb.super-nova.co.jp/characcodehist.html', 64 => 'https://web.archive.org/web/20001216022100/http://tronweb.super-nova.co.jp/characcodehist.html', 65 => 'https://web.archive.org/web/20130625062705/http://www.ibm.com/developerworks/library/u-secret.html', 66 => 'http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2166.doc', 67 => 'http://www.ingrid.org/java/i18n/unicode.html', 68 => 'https://web.archive.org/web/20110422181018/http://www.ingrid.org/java/i18n/unicode.html', 69 => 'https://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html#s-646problem', 70 => 'https://www.unicode.org/charts/PDF/UFB50.pdf', 71 => 'https://www.unicode.org/charts/PDF/UFE70.pdf', 72 => 'https://www.unicode.org/charts/PDF/UFB00.pdf', 73 => 'https://www.unicode.org/L2/L2002/02455-n2558-tibetan.pdf', 74 => 'https://www.unicode.org/L2/L2003/03390r-n2654.pdf', 75 => 'https://www.unicode.org/policies/stability_policy.html', 76 => 'https://unicode.org/notes/tn27/', 77 => 'https://www.unicode.org/charts/PDF/U2100.pdf', 78 => 'https://www.unicode.org/charts/PDF/UFE10.pdf', 79 => 'https://www.unicode.org/roadmaps/', 80 => 'https://linguistics.berkeley.edu/sei/', 81 => 'https://www.unicode.org/versions/Unicode6.0.0/', 82 => 'http://www.fluxus-editions.fr/gla1-hara1.php', 83 => 'https://unicode.org/', 84 => 'http://www.alanwood.net/unicode/', 85 => 'https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=UnicodeBMPFallbackFont', 86 => 'https://github.com/unicode-org', 87 => 'http://www.enciclopedia.cat/enciclop%C3%A8dies/gran-enciclop%C3%A8dia-catalana/EC-GEC-0262830.xml', 88 => 'https://www.britannica.com/topic/Unicode', 89 => 'http://d-nb.info/gnd/4343497-6', 90 => 'http://id.loc.gov/authorities/sh98000843', 91 => 'https://academic.microsoft.com/#/detail/500551929' ]
Все внешние ссылки в новом тексте (`all_links`)	[ 0 => 'https://www.unicode.org/standard/principles.html', 1 => 'https://w3techs.com/technologies/cross/character_encoding/ranking', 2 => 'https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#I1.36559', 3 => 'https://unicode.org/history/unicode88.pdf', 4 => 'https://web.archive.org/web/20161125224409/https://unicode.org/history/unicode88.pdf', 5 => 'https://www.unicode.org/history/summary.html', 6 => 'https://unicode.org/history/publicationdates.html', 7 => 'http://tronweb.super-nova.co.jp/unicoderevisited.html', 8 => 'https://unicode.org/consortium/members.html', 9 => 'https://www.unicode.org/charts/', 10 => 'https://home.unicode.org/basic-info/faq/', 11 => 'https://www.unicode.org/roadmaps/bmp/', 12 => 'https://www.unicode.org/pending/about-sei.html', 13 => 'https://www.unicode.org/mail-arch/unicode-ml/y2012-m05/0240.html', 14 => 'https://home.unicode.org/unicode-14-0-delayed-for-6-months/', 15 => 'https://www.unicode.org/versions/enumeratedversions.html', 16 => 'https://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt', 17 => 'https://www.unicode.org/Public/reconstructed/1.0.1/UnicodeData.txt', 18 => 'https://www.unicode.org/Public/1.1-Update/UnicodeData-1.1.5.txt', 19 => 'https://www.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt', 20 => 'https://www.unicode.org/Public/2.1-Update/UnicodeData-2.1.2.txt', 21 => 'https://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt', 22 => 'https://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt', 23 => 'https://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt', 24 => 'https://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt', 25 => 'https://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt', 26 => 'https://www.unicode.org/Public/5.0.0/ucd/UnicodeData.txt', 27 => 'https://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt', 28 => 'https://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt', 29 => 
'https://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt', 30 => 'https://www.unicode.org/Public/6.1.0/ucd/UnicodeData.txt', 31 => 'https://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt', 32 => 'https://www.unicode.org/Public/6.3.0/ucd/UnicodeData.txt', 33 => 'https://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt', 34 => 'https://www.unicode.org/versions/Unicode8.0.0/', 35 => 'https://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt', 36 => 'https://www.unicode.org/versions/Unicode9.0.0/', 37 => 'https://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt', 38 => 'https://www.androidpolice.com/2016/06/07/two-emoji-werent-approved-unicode-9-google-added-android-anyway/', 39 => 'https://www.unicode.org/versions/Unicode10.0.0/', 40 => 'https://www.unicode.org/versions/Unicode11.0.0/appC.pdf', 41 => 'http://blog.unicode.org/2018/06/announcing-unicode-standard-version-110.html', 42 => 'https://www.unicode.org/versions/Unicode12.0.0/appC.pdf', 43 => 'http://blog.unicode.org/2019/03/announcing-unicode-standard-version-120.html', 44 => 'http://blog.unicode.org/2019/05/unicode-12-1-en.html', 45 => 'https://www.unicode.org/versions/Unicode13.0.0/appC.pdf', 46 => 'http://blog.unicode.org/2020/03/announcing-unicode-standard-version-130.html', 47 => 'https://unicode.org/glossary/', 48 => 'https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212', 49 => 'http://www.unicode.org/versions/Unicode12.0.0/ch02.pdf#G25564', 50 => 'https://www.unicode.org/versions/Unicode13.0.0/appA.pdf', 51 => 'https://unicode.org/policies/stability_policy.html', 52 => 'https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G43463', 53 => 'https://unicode.org/reports/tr17/', 54 => 'https://unicode.org/Public/UNIDATA/NamedSequences.txt', 55 => 'https://unicode.org/Public/UNIDATA/NameAliases.txt', 56 => 'https://www.evertype.com/standards/iso10646/pdf/cwa13873.pdf', 57 => 'https://www.cl.cam.ac.uk/~mgk25/ucs/mes-2-rationale.html', 58 => 'https://unicode.org/faq/utf_bom.html', 59 => 
'https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt', 60 => 'https://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14755.pdf', 61 => 'https://unicodelookup.com/', 62 => 'http://shapecatcher.com/', 63 => 'http://www.alanwood.net/unicode/explorer.html#ie5', 64 => 'https://www.w3.org/TR/xml11', 65 => 'http://cajun.cs.nott.ac.uk/wiley/journals/epobetan/pdf/volume6/issue3/bigelow.pdf', 66 => 'https://www.unicode.org/faq/font_keyboard.html', 67 => 'http://tronweb.super-nova.co.jp/characcodehist.html', 68 => 'https://web.archive.org/web/20001216022100/http://tronweb.super-nova.co.jp/characcodehist.html', 69 => 'https://web.archive.org/web/20130625062705/http://www.ibm.com/developerworks/library/u-secret.html', 70 => 'http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2166.doc', 71 => 'http://www.ingrid.org/java/i18n/unicode.html', 72 => 'https://web.archive.org/web/20110422181018/http://www.ingrid.org/java/i18n/unicode.html', 73 => 'https://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html#s-646problem', 74 => 'https://www.unicode.org/charts/PDF/UFB50.pdf', 75 => 'https://www.unicode.org/charts/PDF/UFE70.pdf', 76 => 'https://www.unicode.org/charts/PDF/UFB00.pdf', 77 => 'https://www.unicode.org/L2/L2002/02455-n2558-tibetan.pdf', 78 => 'https://www.unicode.org/L2/L2003/03390r-n2654.pdf', 79 => 'https://www.unicode.org/policies/stability_policy.html', 80 => 'https://unicode.org/notes/tn27/', 81 => 'https://www.unicode.org/charts/PDF/U2100.pdf', 82 => 'https://www.unicode.org/charts/PDF/UFE10.pdf', 83 => 'https://www.unicode.org/roadmaps/', 84 => 'https://linguistics.berkeley.edu/sei/', 85 => 'http://www.unicode.org/versions/Unicode13.0.0/', 86 => 'https://www.unicode.org/versions/Unicode6.0.0/', 87 => 'http://www.fluxus-editions.fr/gla1-hara1.php', 88 => 'https://unicode.org/', 89 => 'http://www.alanwood.net/unicode/', 90 => 'https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=UnicodeBMPFallbackFont', 91 => 'https://github.com/unicode-org', 92 => 
'http://www.enciclopedia.cat/enciclop%C3%A8dies/gran-enciclop%C3%A8dia-catalana/EC-GEC-0262830.xml', 93 => 'https://www.britannica.com/topic/Unicode', 94 => 'http://d-nb.info/gnd/4343497-6', 95 => 'http://id.loc.gov/authorities/sh98000843', 96 => 'https://academic.microsoft.com/#/detail/500551929' ]
Ссылки на странице до правки (`old_links`)	[ 0 => 'http://blog.unicode.org/2018/06/announcing-unicode-standard-version-110.html', 1 => 'http://blog.unicode.org/2019/03/announcing-unicode-standard-version-120.html', 2 => 'http://blog.unicode.org/2019/05/unicode-12-1-en.html', 3 => 'http://blog.unicode.org/2020/03/announcing-unicode-standard-version-130.html', 4 => 'http://curlie.org/Computers/Software/Globalization/Character_Encoding/Unicode/', 5 => 'http://habrahabr.ru/post/45489/', 6 => 'http://standards.iso.org/ittf/PubliclyAvailableStandards/', 7 => 'http://standards.iso.org/ittf/PubliclyAvailableStandards/c056921_ISO_IEC_10646_2012.zip', 8 => 'http://standards.iso.org/ittf/PubliclyAvailableStandards/c061712_ISO_IEC_10646_2012_Amd_1_2013.zip', 9 => 'http://support.microsoft.com/kb/99884', 10 => 'http://tools.ietf.org/html/rfc4042', 11 => 'http://unicode-table.com/', 12 => 'http://unicode.org/roadmaps/tip/', 13 => 'http://w3pro.ru/news/unicode-ispolzuetsya-pochti-na-50-veb-saitov', 14 => 'http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt', 15 => 'http://www.cl.cam.ac.uk/~mgk25/unicode.html', 16 => 'http://www.fileformat.info/info/unicode/char/search.htm', 17 => 'http://www.i18nguy.com/surrogates.html', 18 => 'http://www.paratype.ru/help/term/terms.asp?code=361', 19 => 'http://www.transl-gunsmoker.ru/2008/11/unicode.html', 20 => 'http://www.unicode.org/', 21 => 'http://www.unicode.org/Public/1.1-Update/UnicodeData-1.1.5.txt', 22 => 'http://www.unicode.org/Public/10.0.0/ucd/UnicodeData.txt', 23 => 'http://www.unicode.org/Public/11.0.0/ucd/UnicodeData.txt', 24 => 'http://www.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt', 25 => 'http://www.unicode.org/Public/2.1-Update/UnicodeData-2.1.2.txt', 26 => 'http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt', 27 => 'http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt', 28 => 'http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt', 29 => 
'http://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt', 30 => 'http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt', 31 => 'http://www.unicode.org/Public/5.0.0/ucd/UnicodeData.txt', 32 => 'http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt', 33 => 'http://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt', 34 => 'http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt', 35 => 'http://www.unicode.org/Public/6.1.0/ucd/UnicodeData.txt', 36 => 'http://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt', 37 => 'http://www.unicode.org/Public/6.3.0/ucd/UnicodeData.txt', 38 => 'http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt', 39 => 'http://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt', 40 => 'http://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt', 41 => 'http://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt', 42 => 'http://www.unicode.org/Public/reconstructed/1.0.1/UnicodeData.txt', 43 => 'http://www.unicode.org/charts/PDF/U0400.pdf', 44 => 'http://www.unicode.org/charts/PDF/U0500.pdf', 45 => 'http://www.unicode.org/charts/PDF/U2DE0.pdf', 46 => 'http://www.unicode.org/charts/PDF/UA640.pdf', 47 => 'http://www.unicode.org/consortium/consort.html', 48 => 'http://www.unicode.org/copyright.html', 49 => 'http://www.unicode.org/faq/emoji_dingbats.html', 50 => 'http://www.unicode.org/history/publicationdates.html', 51 => 'http://www.unicode.org/history/unicode88.pdf', 52 => 'http://www.unicode.org/policies/stability_policy.html', 53 => 'http://www.unicode.org/standard/UnicodeTranscriptions.html', 54 => 'http://www.unicode.org/standard/principles.html', 55 => 'http://www.unicode.org/standard/translations/russian.html', 56 => 'http://www.unicode.org/ucd/', 57 => 'http://www.unicode.org/unicode/reports/tr15/#Primary_Exclusion_List_Table', 58 => 'http://www.unicode.org/versions/Unicode10.0.0/', 59 => 'http://www.unicode.org/versions/Unicode1.0.0/', 60 => 'http://www.unicode.org/versions/Unicode1.1.0/', 61 => 
'http://www.unicode.org/versions/Unicode11.0.0/', 62 => 'http://www.unicode.org/versions/Unicode12.0.0/', 63 => 'http://www.unicode.org/versions/Unicode12.1.0/', 64 => 'http://www.unicode.org/versions/Unicode13.0.0/', 65 => 'http://www.unicode.org/versions/Unicode2.0.0/', 66 => 'http://www.unicode.org/versions/Unicode2.1.0/', 67 => 'http://www.unicode.org/versions/Unicode3.0.0/', 68 => 'http://www.unicode.org/versions/Unicode3.1.0/', 69 => 'http://www.unicode.org/versions/Unicode3.2.0/', 70 => 'http://www.unicode.org/versions/Unicode4.0.0/', 71 => 'http://www.unicode.org/versions/Unicode4.1.0/', 72 => 'http://www.unicode.org/versions/Unicode5.0.0/appC.pdf', 73 => 'http://www.unicode.org/versions/Unicode5.2.0/Foreword.pdf', 74 => 'http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf', 75 => 'http://www.unicode.org/versions/Unicode5.2.0/ch07.pdf', 76 => 'http://www.unicode.org/versions/Unicode5.0.0/', 77 => 'http://www.unicode.org/versions/Unicode5.1.0/', 78 => 'http://www.unicode.org/versions/Unicode5.2.0/', 79 => 'http://www.unicode.org/versions/Unicode6.2.0/', 80 => 'http://www.unicode.org/versions/Unicode6.3.0/', 81 => 'http://www.unicode.org/versions/Unicode6.0.0/', 82 => 'http://www.unicode.org/versions/Unicode6.1.0/', 83 => 'http://www.unicode.org/versions/Unicode7.0.0/', 84 => 'http://www.unicode.org/versions/Unicode8.0.0/', 85 => 'http://www.unicode.org/versions/Unicode9.0.0/', 86 => 'http://www.unicode.org/versions/enumeratedversions.html', 87 => 'http://www.unicode.org/versions/index.html', 88 => 'http://www.unicode.org/versions/latest/', 89 => 'https://web.archive.org/web/20060408204540/http://www.unicode.org/standard/UnicodeTranscriptions.html', 90 => 'https://web.archive.org/web/20090926092654/http://support.microsoft.com/kb/99884', 91 => 'https://web.archive.org/web/20100110085403/http://www.unicode.org/history/publicationdates.html', 92 => 'https://web.archive.org/web/20100310120125/http://www.unicode.org/standard/principles.html', 93 => 
'https://web.archive.org/web/20100611042601/http://w3pro.ru/news/unicode-ispolzuetsya-pochti-na-50-veb-saitov', 94 => 'https://web.archive.org/web/20100627085503/http://www.unicode.org/consortium/consort.html', 95 => 'https://web.archive.org/web/20100627093139/http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf', 96 => 'https://web.archive.org/web/20100627140856/http://www.unicode.org/versions/Unicode5.2.0/ch07.pdf', 97 => 'https://web.archive.org/web/20100627141434/http://www.unicode.org/versions/Unicode5.2.0/Foreword.pdf', 98 => 'https://web.archive.org/web/20170906035012/http://unicode.org/history/unicode88.pdf' ]
Разобранный HTML-код новой версии (`new_html`)	'<div class="mw-parser-output"><p><a href="/w/index.php?title=%D0%A8%D0%B0%D0%B1%D0%BB%D0%BE%D0%BD:Use_dmy_dates&action=edit&redlink=1" class="new" title="Шаблон:Use dmy dates (страница отсутствует)">Шаблон:Use dmy dates</a> </p> <div class="dablink hatnote">Не следует путать с <a href="/w/index.php?title=Unicode_(telegraphy)&action=edit&redlink=1" class="new" title="Unicode (telegraphy) (страница отсутствует)">Unicode (telegraphy)</a>.</div> <p><a href="/w/index.php?title=%D0%A8%D0%B0%D0%B1%D0%BB%D0%BE%D0%BD:For&action=edit&redlink=1" class="new" title="Шаблон:For (страница отсутствует)">Шаблон:For</a> <a href="/w/index.php?title=%D0%A8%D0%B0%D0%B1%D0%BB%D0%BE%D0%BD:Short_description&action=edit&redlink=1" class="new" title="Шаблон:Short description (страница отсутствует)">Шаблон:Short description</a> <a href="/w/index.php?title=%D0%A8%D0%B0%D0%B1%D0%BB%D0%BE%D0%BD:Infobox_character_encoding&action=edit&redlink=1" class="new" title="Шаблон:Infobox character encoding (страница отсутствует)">Шаблон:Infobox character encoding</a> <a href="/w/index.php?title=%D0%A8%D0%B0%D0%B1%D0%BB%D0%BE%D0%BD:Contains_special_characters&action=edit&redlink=1" class="new" title="Шаблон:Contains special characters (страница отсутствует)">Шаблон:Contains special characters</a> <b>Unicode</b> is an <a href="/w/index.php?title=Information_technology&action=edit&redlink=1" class="new" title="Information technology (страница отсутствует)">information technology</a> <a href="/w/index.php?title=Technical_standard&action=edit&redlink=1" class="new" title="Technical standard (страница отсутствует)">standard</a> for the consistent <a href="/w/index.php?title=Character_encoding&action=edit&redlink=1" class="new" title="Character encoding (страница отсутствует)">encoding</a>, representation, and handling of <a href="/w/index.php?title=Character_(computing)&action=edit&redlink=1" class="new" title="Character (computing) (страница 
отсутствует)">text</a> expressed in most of the world's <a href="/w/index.php?title=Writing_system&action=edit&redlink=1" class="new" title="Writing system (страница отсутствует)">writing systems</a>. The standard is maintained by the <a href="/w/index.php?title=Unicode_Consortium&action=edit&redlink=1" class="new" title="Unicode Consortium (страница отсутствует)">Unicode Consortium</a>, and на <strong class="error">Ошибка выражения: неопознанное слово «march»</strong> год, there is a repertoire of <a href="/w/index.php?title=%D0%A8%D0%B0%D0%B1%D0%BB%D0%BE%D0%BD:Unicodenover&action=edit&redlink=1" class="new" title="Шаблон:Unicodenover (страница отсутствует)">Шаблон:Unicodenover</a> (these <a href="/w/index.php?title=Character_(computing)&action=edit&redlink=1" class="new" title="Character (computing) (страница отсутствует)">characters</a> consist of 143,696 graphic characters and 163 format characters) covering 154 modern and historic <a href="/w/index.php?title=Script_(Unicode)&action=edit&redlink=1" class="new" title="Script (Unicode) (страница отсутствует)">scripts</a>, as well as multiple symbol sets and <a href="/wiki/Emoji" class="mw-redirect" title="Emoji">emoji</a>. The character repertoire of the Unicode Standard is synchronized with <a href="/wiki/ISO/IEC_10646" class="mw-redirect" title="ISO/IEC 10646">ISO/IEC 10646</a>, and both are code-for-code identical. 
</p><p><i>The Unicode Standard</i> consists of a set of code charts for visual reference, an encoding method and set of standard <a href="/w/index.php?title=Character_encoding&action=edit&redlink=1" class="new" title="Character encoding (страница отсутствует)">character encodings</a>, a set of reference <a href="/w/index.php?title=Data_file&action=edit&redlink=1" class="new" title="Data file (страница отсутствует)">data files</a>, and a number of related items, such as character properties, rules for <a href="/w/index.php?title=Unicode_normalization&action=edit&redlink=1" class="new" title="Unicode normalization (страница отсутствует)">normalization</a>, decomposition, <a href="/w/index.php?title=Unicode_collation_algorithm&action=edit&redlink=1" class="new" title="Unicode collation algorithm (страница отсутствует)">collation</a>, rendering, and <a href="/w/index.php?title=Bidirectional_text&action=edit&redlink=1" class="new" title="Bidirectional text (страница отсутствует)">bidirectional text</a> display order (for the correct display of text containing both right-to-left scripts, such as <a href="/w/index.php?title=Arabic_script&action=edit&redlink=1" class="new" title="Arabic script (страница отсутствует)">Arabic</a> and <a href="/w/index.php?title=Hebrew_alphabet&action=edit&redlink=1" class="new" title="Hebrew alphabet (страница отсутствует)">Hebrew</a>, and left-to-right scripts).<sup id="cite_ref-1" class="reference"><a href="#cite_note-1">[1]</a></sup> </p><p>Unicode's success at unifying character sets has led to its widespread and predominant use in the <a href="/w/index.php?title=Internationalization_and_localization&action=edit&redlink=1" class="new" title="Internationalization and localization (страница отсутствует)">internationalization and localization</a> of computer <a href="/wiki/Software" class="mw-redirect" title="Software">software</a>. 
The standard has been implemented in many recent technologies, including modern <a href="/w/index.php?title=Operating_system&action=edit&redlink=1" class="new" title="Operating system (страница отсутствует)">operating systems</a>, <a href="/wiki/XML" title="XML">XML</a>, <a href="/wiki/Java_(programming_language)" class="mw-redirect" title="Java (programming language)">Java</a> (and other programming languages), and the <a href="/wiki/.NET_Framework" title=".NET Framework">.NET Framework</a>. </p><p><a href="/w/index.php?title=Comparison_of_Unicode_encodings&action=edit&redlink=1" class="new" title="Comparison of Unicode encodings (страница отсутствует)">Unicode can be implemented</a> by different <a href="/w/index.php?title=Character_encoding&action=edit&redlink=1" class="new" title="Character encoding (страница отсутствует)">character encodings</a>. The Unicode standard defines <a href="/wiki/UTF-8" title="UTF-8">UTF-8</a>, <a href="/wiki/UTF-16" title="UTF-16">UTF-16</a>, and <a href="/wiki/UTF-32" title="UTF-32">UTF-32</a>, and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16, and <a href="/w/index.php?title=Universal_Coded_Character_Set&action=edit&redlink=1" class="new" title="Universal Coded Character Set (страница отсутствует)">UCS</a>-2, a precursor of UTF-16 that lacks full support for Unicode; <a href="/w/index.php?title=GB_18030&action=edit&redlink=1" class="new" title="GB 18030 (страница отсутствует)">GB18030</a> is standardized in China and implements Unicode fully, although it is not an official Unicode standard. 
</p><p>UTF-8, the dominant encoding on the <a href="/wiki/World_Wide_Web" class="mw-redirect" title="World Wide Web">World Wide Web</a> (used in over 94% of websites на март <strong class="error">Ошибка выражения: неопознанное слово «november»</strong> года),<sup id="cite_ref-2" class="reference"><a href="#cite_note-2">[2]</a></sup> uses one <a href="/wiki/Byte" class="mw-disambig" title="Byte">byte</a><sup id="cite_ref-3" class="reference"><a href="#cite_note-3">[note 1]</a></sup> for the first 128 <a href="/w/index.php?title=Code_point&action=edit&redlink=1" class="new" title="Code point (страница отсутствует)">code points</a>, and up to 4 bytes for other characters.<sup id="cite_ref-4" class="reference"><a href="#cite_note-4">[3]</a></sup> The first 128 Unicode code points represent the <a href="/wiki/ASCII" title="ASCII">ASCII</a> characters, which means that any ASCII text is also a UTF-8 text. </p><p>UCS-2 uses two bytes (16 bits) for each character but can only encode the first 65,536 code points, the so-called <a href="/w/index.php?title=Basic_Multilingual_Plane&action=edit&redlink=1" class="new" title="Basic Multilingual Plane (страница отсутствует)">Basic Multilingual Plane</a> (BMP). With 1,112,064 possible Unicode code points corresponding to characters (see <a href="#Architecture_and_terminology">below</a>) on 17 planes, and with over 143,000 code points defined as of version 13.0, UCS-2 is able to represent fewer than half of all encoded Unicode characters. Therefore, UCS-2 is outdated, though still widely used in software. UTF-16 extends UCS-2 by using the same <a href="/w/index.php?title=16-bit&action=edit&redlink=1" class="new" title="16-bit (страница отсутствует)">16-bit</a> encoding as UCS-2 for the Basic Multilingual Plane, and a 4-byte encoding for the other planes. As long as it contains no code points in the reserved range U+D800–U+DFFF, a UCS-2 text is valid UTF-16 text. 
</p><p>UTF-32 (also referred to as UCS-4) uses four bytes for each character. Like UCS-2, the number of bytes per character is fixed, facilitating character indexing; but unlike UCS-2, UTF-32 is able to encode all Unicode code points. However, because each character uses four bytes, UTF-32 takes significantly more space than other encodings, and is not widely used. </p> <div id="toc" class="toc" role="navigation" aria-labelledby="mw-toc-heading"><input type="checkbox" role="button" id="toctogglecheckbox" class="toctogglecheckbox" style="display:none" /><div class="toctitle" lang="ru" dir="ltr"><h2 id="mw-toc-heading">Содержание</h2><span class="toctogglespan"><label class="toctogglelabel" for="toctogglecheckbox"></label></span></div> <ul> <li class="toclevel-1 tocsection-1"><a href="#Origin_and_development"><span class="tocnumber">1</span> <span class="toctext">Origin and development</span></a> <ul> <li class="toclevel-2 tocsection-2"><a href="#History"><span class="tocnumber">1.1</span> <span class="toctext">History</span></a></li> <li class="toclevel-2 tocsection-3"><a href="#Unicode_Consortium"><span class="tocnumber">1.2</span> <span class="toctext">Unicode Consortium</span></a></li> <li class="toclevel-2 tocsection-4"><a href="#Scripts_covered"><span class="tocnumber">1.3</span> <span class="toctext">Scripts covered</span></a></li> <li class="toclevel-2 tocsection-5"><a href="#Versions"><span class="tocnumber">1.4</span> <span class="toctext">Versions</span></a></li> </ul> </li> <li class="toclevel-1 tocsection-6"><a href="#Architecture_and_terminology"><span class="tocnumber">2</span> <span class="toctext">Architecture and 
terminology</span></a> <ul> <li class="toclevel-2 tocsection-7"><a href="#Code_point_planes_and_blocks"><span class="tocnumber">2.1</span> <span class="toctext">Code point planes and blocks</span></a></li> <li class="toclevel-2 tocsection-8"><a href="#General_Category_property"><span class="tocnumber">2.2</span> <span class="toctext">General Category property</span></a></li> <li class="toclevel-2 tocsection-9"><a href="#Abstract_characters"><span class="tocnumber">2.3</span> <span class="toctext">Abstract characters</span></a></li> <li class="toclevel-2 tocsection-10"><a href="#Ready-made_versus_composite_characters"><span class="tocnumber">2.4</span> <span class="toctext">Ready-made versus composite characters</span></a></li> <li class="toclevel-2 tocsection-11"><a href="#Ligatures"><span class="tocnumber">2.5</span> <span class="toctext">Ligatures</span></a></li> <li class="toclevel-2 tocsection-12"><a href="#Standardized_subsets"><span class="tocnumber">2.6</span> <span class="toctext">Standardized subsets</span></a></li> <li class="toclevel-2 tocsection-13"><a href="#Mapping_and_encodings"><span class="tocnumber">2.7</span> <span class="toctext"><span></span>Mapping and encodings</span></a></li> </ul> </li> <li class="toclevel-1 tocsection-14"><a href="#Adoption"><span class="tocnumber">3</span> <span class="toctext">Adoption</span></a> <ul> <li class="toclevel-2 tocsection-15"><a href="#Operating_systems"><span class="tocnumber">3.1</span> <span class="toctext">Operating systems</span></a></li> <li class="toclevel-2 tocsection-16"><a href="#Input_methods"><span class="tocnumber">3.2</span> <span class="toctext">Input methods</span></a></li> <li class="toclevel-2 tocsection-17"><a href="#Email"><span class="tocnumber">3.3</span> <span class="toctext">Email</span></a></li> <li class="toclevel-2 tocsection-18"><a href="#Web"><span class="tocnumber">3.4</span> <span class="toctext">Web</span></a></li> <li class="toclevel-2 tocsection-19"><a href="#Fonts"><span 
class="tocnumber">3.5</span> <span class="toctext">Fonts</span></a></li> <li class="toclevel-2 tocsection-20"><a href="#Newlines"><span class="tocnumber">3.6</span> <span class="toctext">Newlines</span></a></li> </ul> </li> <li class="toclevel-1 tocsection-21"><a href="#Issues"><span class="tocnumber">4</span> <span class="toctext">Issues</span></a> <ul> <li class="toclevel-2 tocsection-22"><a href="#Philosophical_and_completeness_criticisms"><span class="tocnumber">4.1</span> <span class="toctext">Philosophical and completeness criticisms</span></a></li> <li class="toclevel-2 tocsection-23"><a href="#Mapping_to_legacy_character_sets"><span class="tocnumber">4.2</span> <span class="toctext">Mapping to legacy character sets</span></a></li> <li class="toclevel-2 tocsection-24"><a href="#Indic_scripts"><span class="tocnumber">4.3</span> <span class="toctext">Indic scripts</span></a></li> <li class="toclevel-2 tocsection-25"><a href="#Combining_characters"><span class="tocnumber">4.4</span> <span class="toctext">Combining characters</span></a></li> <li class="toclevel-2 tocsection-26"><a href="#Anomalies"><span class="tocnumber">4.5</span> <span class="toctext">Anomalies</span></a></li> </ul> </li> <li class="toclevel-1 tocsection-27"><a href="#See_also"><span class="tocnumber">5</span> <span class="toctext">See also</span></a></li> <li class="toclevel-1 tocsection-28"><a href="#Further_reading"><span class="tocnumber">6</span> <span class="toctext">Further reading</span></a></li> <li class="toclevel-1 tocsection-29"><a href="#Notes"><span class="tocnumber">7</span> <span class="toctext">Notes</span></a></li> <li class="toclevel-1 tocsection-30"><a href="#References"><span class="tocnumber">8</span> <span class="toctext">References</span></a></li> <li class="toclevel-1 tocsection-31"><a href="#External_links"><span class="tocnumber">9</span> <span class="toctext">External links</span></a></li> </ul> </div> <h2><span class="mw-headline" 
id="Origin_and_development">Origin and development</span></h2> <p>Unicode has the explicit aim of transcending the limitations of traditional character encodings, such as those defined by the <a href="/w/index.php?title=ISO/IEC_8859&action=edit&redlink=1" class="new" title="ISO/IEC 8859 (страница отсутствует)">ISO/IEC 8859</a> standard, which are widely used in various countries of the world but remain largely incompatible with each other. Many traditional character encodings share a common problem in that they allow bilingual computer processing (usually using <a href="/w/index.php?title=Latin_character&action=edit&redlink=1" class="new" title="Latin character (страница отсутствует)">Latin characters</a> and the local script), but not multilingual computer processing (computer processing of arbitrary scripts mixed with each other). </p><p>Unicode, in intent, encodes the underlying characters (<a href="/w/index.php?title=Grapheme&action=edit&redlink=1" class="new" title="Grapheme (страница отсутствует)">graphemes</a> and grapheme-like units) rather than the variant <a href="/w/index.php?title=Glyph&action=edit&redlink=1" class="new" title="Glyph (страница отсутствует)">glyphs</a> (renderings) for such characters. 
In the case of <a href="/w/index.php?title=Chinese_characters&action=edit&redlink=1" class="new" title="Chinese characters (страница отсутствует)">Chinese characters</a>, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see <a href="/w/index.php?title=Han_unification&action=edit&redlink=1" class="new" title="Han unification (страница отсутствует)">Han unification</a>). </p><p>In text processing, Unicode takes the role of providing a unique <i>code point</i> (a <a href="/w/index.php?title=Number&action=edit&redlink=1" class="new" title="Number (страница отсутствует)">number</a>, not a glyph) for each character. In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, <a href="/w/index.php?title=Font&action=edit&redlink=1" class="new" title="Font (страница отсутствует)">font</a>, or style) to other software, such as a <a href="/wiki/Web_browser" class="mw-redirect" title="Web browser">web browser</a> or <a href="/w/index.php?title=Word_processor&action=edit&redlink=1" class="new" title="Word processor (страница отсутствует)">word processor</a>. This simple aim becomes complicated, however, because of concessions made by Unicode's designers in the hope of encouraging a more rapid adoption of Unicode. </p><p>The first 256 code points were made identical to the content of <a href="/w/index.php?title=ISO/IEC_8859-1&action=edit&redlink=1" class="new" title="ISO/IEC 8859-1 (страница отсутствует)">ISO/IEC 8859-1</a> so as to make it trivial to convert existing Western text. Many essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings and therefore allow conversion from those encodings to Unicode (and back) without losing any information. 
For example, the "<a href="/w/index.php?title=Halfwidth_and_Fullwidth_Forms_(Unicode_block)&action=edit&redlink=1" class="new" title="Halfwidth and Fullwidth Forms (Unicode block) (страница отсутствует)">fullwidth forms</a>" section of code points encompasses a full duplicate of the Latin alphabet because Chinese, Japanese, and Korean (<a href="/wiki/CJK" class="mw-redirect" title="CJK">CJK</a>) fonts contain two versions of these letters, "fullwidth" matching the width of the CJK characters, and normal width. For other examples, see <a href="/w/index.php?title=Duplicate_characters_in_Unicode&action=edit&redlink=1" class="new" title="Duplicate characters in Unicode (страница отсутствует)">duplicate characters in Unicode</a>. </p> <h3><span class="mw-headline" id="History"><span id="Unicode_88"></span>History</span></h3> <p>Based on experiences with the <a href="/w/index.php?title=Xerox_Character_Code_Standard&action=edit&redlink=1" class="new" title="Xerox Character Code Standard (страница отсутствует)">Xerox Character Code Standard</a> (XCCS) since 1980,<sup id="cite_ref-unicode-88_5-0" class="reference"><a href="#cite_note-unicode-88-5">[4]</a></sup> the origins of Unicode date to 1987, when <a href="/w/index.php?title=Joe_Becker_(Unicode)&action=edit&redlink=1" class="new" title="Joe Becker (Unicode) (страница отсутствует)">Joe Becker</a> from <a href="/wiki/Xerox" title="Xerox">Xerox</a> with <a 
href="/w/index.php?title=Lee_Collins_(software_engineer)&action=edit&redlink=1" class="new" title="Lee Collins (software engineer) (страница отсутствует)">Lee Collins</a> and <a href="/w/index.php?title=Mark_Davis_(Unicode)&action=edit&redlink=1" class="new" title="Mark Davis (Unicode) (страница отсутствует)">Mark Davis</a> from <a href="/wiki/Apple_Inc." class="mw-redirect" title="Apple Inc.">Apple</a> started investigating the practicalities of creating a universal character set.<sup id="cite_ref-6" class="reference"><a href="#cite_note-6">[5]</a></sup> With additional input from Peter Fenwick and Dave Opstad,<sup id="cite_ref-unicode-88_5-1" class="reference"><a href="#cite_note-unicode-88-5">[4]</a></sup> Joe Becker published a draft proposal in August 1988 for an "international/multilingual text character encoding system, tentatively called Unicode". He explained that "[t]he name 'Unicode' is intended to suggest a unique, unified, universal encoding".<sup id="cite_ref-unicode-88_5-2" class="reference"><a href="#cite_note-unicode-88-5">[4]</a></sup> </p><p>In this document, entitled <i>Unicode 88</i>, Becker outlined a <a href="/w/index.php?title=16-bit&action=edit&redlink=1" class="new" title="16-bit (страница отсутствует)">16-bit</a> character model:<sup id="cite_ref-unicode-88_5-3" class="reference"><a href="#cite_note-unicode-88-5">[4]</a></sup> </p> <blockquote> <p>Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the characters of all the world's living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose. 
</p> </blockquote> <p>His original 16-bit design was based on the assumption that only those scripts and characters in modern use would need to be encoded:<sup id="cite_ref-unicode-88_5-4" class="reference"><a href="#cite_note-unicode-88-5">[4]</a></sup> </p> <blockquote> <p>Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2<sup>14</sup> = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally useful Unicodes. </p> </blockquote> <p>In early 1989, the Unicode working group expanded to include Ken Whistler and Mike Kernaghan of Metaphor, Karen Smith-Yoshimura and Joan Aliprand of <a href="/w/index.php?title=Research_Libraries_Group&action=edit&redlink=1" class="new" title="Research Libraries Group (страница отсутствует)">RLG</a>, and Glenn Wright of <a href="/wiki/Sun_Microsystems" title="Sun Microsystems">Sun Microsystems</a>, and in 1990, Michel Suignard and Asmus Freytag from <a href="/wiki/Microsoft" title="Microsoft">Microsoft</a> and Rick McGowan of <a href="/wiki/NeXT" title="NeXT">NeXT</a> joined the group. By the end of 1990, most of the work on mapping existing character encoding standards had been completed, and a final review draft of Unicode was ready. </p><p>The <a href="/w/index.php?title=Unicode_Consortium&action=edit&redlink=1" class="new" title="Unicode Consortium (страница отсутствует)">Unicode Consortium</a> was incorporated in California on 3 January 1991,<sup id="cite_ref-7" class="reference"><a href="#cite_note-7">[6]</a></sup> and in October 1991, the first volume of the Unicode standard was published. 
The second volume, covering Han ideographs, was published in June 1992. </p><p>In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. This increased the Unicode codespace to over a million code points (1,114,112 in total, from U+0000 to U+10FFFF), which allowed for the encoding of many historic scripts (e.g., <a href="/w/index.php?title=Egyptian_hieroglyphs&action=edit&redlink=1" class="new" title="Egyptian hieroglyphs (страница отсутствует)">Egyptian hieroglyphs</a>) and thousands of rarely used or obsolete characters that had not been anticipated as needing encoding. Among the characters not originally intended for Unicode are rarely used Kanji or Chinese characters, many of which appear in personal and place names; they are uncommon in running text, but much more essential than envisioned in the original architecture of Unicode.<sup id="cite_ref-unicoderevisited_8-0" class="reference"><a href="#cite_note-unicoderevisited-8">[7]</a></sup> </p><p>The Microsoft TrueType specification version 1.0 from 1992 used the name <i>Apple Unicode</i> instead of <i>Unicode</i> for the Platform ID in the naming table. 
</p> <h3><span class="mw-headline" id="Unicode_Consortium">Unicode Consortium</span></h3> <div class="hatnote">Main article: <b><a href="/w/index.php?title=Unicode_Consortium&action=edit&redlink=1" class="new" title="Unicode Consortium (страница отсутствует)">Unicode Consortium</a></b></div> <p>The Unicode Consortium is a nonprofit organization that coordinates Unicode's development. Full members include most of the main computer software and hardware companies with any interest in text-processing standards, including <a href="/w/index.php?title=Adobe_Inc.&action=edit&redlink=1" class="new" title="Adobe Inc. (страница отсутствует)">Adobe</a>, <a href="/wiki/Apple_Inc." class="mw-redirect" title="Apple Inc.">Apple</a>, <a href="/wiki/Google" class="mw-disambig" title="Google">Google</a>, <a href="/wiki/International_Business_Machines" class="mw-redirect" title="International Business Machines">IBM</a>, <a href="/wiki/Microsoft" title="Microsoft">Microsoft</a>, and <a href="/wiki/Oracle_Corporation" class="mw-redirect" title="Oracle Corporation">Oracle Corporation</a>.<sup id="cite_ref-members_9-0" class="reference"><a href="#cite_note-members-9">[8]</a></sup> </p><p>Over the years several countries or government agencies have been members of the Unicode Consortium. 
Presently only the <a href="/w/index.php?title=Ministry_of_Endowments_and_Religious_Affairs_(Oman)&action=edit&redlink=1" class="new" title="Ministry of Endowments and Religious Affairs (Oman) (страница отсутствует)">Ministry of Endowments and Religious Affairs (Oman)</a> is a full member with voting rights.<sup id="cite_ref-members_9-1" class="reference"><a href="#cite_note-members-9">[8]</a></sup> </p><p>The Consortium has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard <a href="#Unicode_Transformation_Format_and_Universal_Character_Set">Unicode Transformation Format</a> (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with <a href="/w/index.php?title=Multilingualism&action=edit&redlink=1" class="new" title="Multilingualism (страница отсутствует)">multilingual</a> environments. </p> <h3><span class="mw-headline" id="Scripts_covered">Scripts covered</span></h3> <div class="hatnote">Main article: <b><a href="/w/index.php?title=Script_(Unicode)&action=edit&redlink=1" class="new" title="Script (Unicode) (страница отсутствует)">Script (Unicode)</a></b></div> <div class="thumb tright"><div class="thumbinner" style="width:202px;"><a href="/wiki/%D0%A4%D0%B0%D0%B9%D0%BB:Unicode_sample.png" class="image"><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/6/6b/Unicode_sample.png/200px-Unicode_sample.png" decoding="async" width="200" height="56" 
class="thumbimage" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/6/6b/Unicode_sample.png/300px-Unicode_sample.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/6/6b/Unicode_sample.png/400px-Unicode_sample.png 2x" data-file-width="582" data-file-height="162" /></a> <div class="thumbcaption">Many modern applications can render a substantial subset of the many <a href="/w/index.php?title=Scripts_in_Unicode&action=edit&redlink=1" class="new" title="Scripts in Unicode (страница отсутствует)">scripts in Unicode</a>, as demonstrated by this screenshot from the <a href="/wiki/OpenOffice.org" class="mw-redirect" title="OpenOffice.org">OpenOffice.org</a> application.</div></div></div> <p>Unicode covers almost all scripts (<a href="/w/index.php?title=Writing_system&action=edit&redlink=1" class="new" title="Writing system (страница отсутствует)">writing systems</a>) in current use.<sup id="cite_ref-10" class="reference"><a href="#cite_note-10">[9]</a></sup><span style="background: #ffeaea; color: #444444;"></span><sup class="noprint" style="white-space: nowrap">[<i><a href="/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D1%8F:%D0%9F%D1%80%D0%BE%D0%B2%D0%B5%D1%80%D1%8F%D0%B5%D0%BC%D0%BE%D1%81%D1%82%D1%8C" title="Википедия:Проверяемость"><span title="утверждение не найдено в указанном источнике" style="">not in citation given</span></a></i>]</sup><sup id="cite_ref-11" class="reference"><a href="#cite_note-11">[10]</a></sup> </p><p>A total of 154 <a href="/w/index.php?title=Script_(Unicode)&action=edit&redlink=1" class="new" title="Script (Unicode) (страница отсутствует)">scripts</a> are included in the latest version of Unicode (covering <a href="/wiki/Alphabet" title="Alphabet">alphabets</a>, <a href="/w/index.php?title=Abugida&action=edit&redlink=1" class="new" title="Abugida (страница отсутствует)">abugidas</a> and <a 
href="/w/index.php?title=Syllabary&action=edit&redlink=1" class="new" title="Syllabary (страница отсутствует)">syllabaries</a>), although there are still scripts that are not yet encoded, particularly those mainly used in historical, liturgical, and academic contexts. Further additions of characters to the already encoded scripts, as well as symbols, in particular for mathematics and <a href="/w/index.php?title=Musical_notation&action=edit&redlink=1" class="new" title="Musical notation (страница отсутствует)">music</a> (in the form of notes and rhythmic symbols), also occur. </p><p>The Unicode Roadmap Committee (<a href="/wiki/Michael_Everson" class="mw-redirect" title="Michael Everson">Michael Everson</a>, Rick McGowan, Ken Whistler, V.S. Umamaheswaran<sup id="cite_ref-12" class="reference"><a href="#cite_note-12">[11]</a></sup>) maintain the list of scripts that are candidates or potential candidates for encoding and their tentative code block assignments on the <a rel="nofollow" class="external text" href="https://www.unicode.org/roadmaps/">Unicode Roadmap</a> page of the <a href="/w/index.php?title=Unicode_Consortium&action=edit&redlink=1" class="new" title="Unicode Consortium (страница отсутствует)">Unicode Consortium</a> Web site. For some scripts on the Roadmap, such as <a href="/w/index.php?title=Jurchen_script&action=edit&redlink=1" class="new" title="Jurchen script (страница отсутствует)">Jurchen</a> and <a href="/w/index.php?title=Khitan_small_script&action=edit&redlink=1" class="new" title="Khitan small script (страница отсутствует)">Khitan small script</a>, encoding proposals have been made and they are working their way through the approval process. 
For other scripts, such as <a href="/w/index.php?title=Maya_script&action=edit&redlink=1" class="new" title="Maya script (страница отсутствует)">Mayan</a> (besides numbers) and <a href="/w/index.php?title=Rongorongo&action=edit&redlink=1" class="new" title="Rongorongo (страница отсутствует)">Rongorongo</a>, no proposal has yet been made, and they await agreement on character repertoire and other details from the user communities involved. </p><p>Some modern invented scripts which have not yet been included in Unicode (e.g., <a href="/wiki/Tengwar" class="mw-redirect" title="Tengwar">Tengwar</a>) or which do not qualify for inclusion in Unicode due to lack of real-world use (e.g., <a href="/w/index.php?title=Klingon_scripts&action=edit&redlink=1" class="new" title="Klingon scripts (страница отсутствует)">Klingon</a>) are listed in the <a href="/w/index.php?title=ConScript_Unicode_Registry&action=edit&redlink=1" class="new" title="ConScript Unicode Registry (страница отсутствует)">ConScript Unicode Registry</a>, along with unofficial but widely used <a href="/w/index.php?title=Private_Use_Areas&action=edit&redlink=1" class="new" title="Private Use Areas (страница отсутствует)">Private Use Areas</a> code assignments. </p><p>There is also a <a href="/w/index.php?title=Medieval_Unicode_Font_Initiative&action=edit&redlink=1" class="new" title="Medieval Unicode Font Initiative (страница отсутствует)">Medieval Unicode Font Initiative</a> focused on special Latin medieval characters. Some of these proposals have already been included in Unicode. </p><p>The <a rel="nofollow" class="external text" href="https://linguistics.berkeley.edu/sei/">Script Encoding Initiative</a>, a project run by Deborah Anderson at the <a href="/wiki/University_of_California,_Berkeley" class="mw-redirect" title="University of California, Berkeley">University of California, Berkeley</a>, was founded in 2002 with the goal of funding proposals for scripts not yet encoded in the standard. 
The project has become a major source of proposed additions to the standard in recent years.<sup id="cite_ref-13" class="reference"><a href="#cite_note-13">[12]</a></sup> </p> <h3><span class="mw-headline" id="Versions"><span id="14.0"><span id="13.0"><span id="12.1"><span id="12.0"><span id="11.0"><span id="10.0"><span id="9.0"><span id="8.0"><span id="7.0"><span id="6.3"><span id="6.2"><span id="6.1"><span id="6.0"><span id="5.2"><span id="5.1"><span id="5.0"><span id="4.1"><span id="4.0"><span id="3.2"><span id="3.1"><span id="3.0"><span id="2.1"><span id="2.0"><span id="1.1"><span id="1.0.1"><span id="1.0.0"></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span>Versions</span></h3> <p>Unicode is developed in conjunction with the <a href="/wiki/International_Organization_for_Standardization" class="mw-redirect" title="International Organization for Standardization">International Organization for Standardization</a> and shares the character repertoire with <a href="/wiki/ISO/IEC_10646" class="mw-redirect" title="ISO/IEC 10646">ISO/IEC 10646</a>: the Universal Character Set. 
Unicode and ISO/IEC 10646 function equivalently as character encodings, but <i>The Unicode Standard</i> contains much more information for implementers, covering—in depth—topics such as bitwise encoding, <a href="/w/index.php?title=Unicode_collation_algorithm&action=edit&redlink=1" class="new" title="Unicode collation algorithm (страница отсутствует)">collation</a> and rendering. The Unicode Standard enumerates a multitude of character properties, including those needed for supporting <a href="/w/index.php?title=Bi-directional_text&action=edit&redlink=1" class="new" title="Bi-directional text (страница отсутствует)">bidirectional text</a>. The two standards do use slightly different terminology. </p><p>The Unicode Consortium first published <i>The Unicode Standard</i> in 1991 (version 1.0), and has published new versions on a regular basis since then. The latest version of the Unicode Standard, version 13.0, was released in March 2020, and is available in electronic format from the consortium's website. The last version of the standard that was published completely in book form (including the code charts) was version 5.0 in 2006, but since version 5.2 (2009) the core specification of the standard has been published as a print-on-demand paperback.<sup id="cite_ref-version6.1PoD_14-0" class="reference"><a href="#cite_note-version6.1PoD-14">[13]</a></sup> The entire text of each version of the standard, including the core specification, standard annexes and code charts, is freely available in <a href="/wiki/PDF" class="mw-redirect" title="PDF">PDF</a> format on the Unicode website. 
</p><p>In April 2020, Unicode announced that the release of the forthcoming version 14.0 had been postponed by six months from its originally planned release date of March 2021 due to the <a href="/w/index.php?title=COVID-19_pandemic&action=edit&redlink=1" class="new" title="COVID-19 pandemic (страница отсутствует)">COVID-19 pandemic</a>.<sup id="cite_ref-15" class="reference"><a href="#cite_note-15">[14]</a></sup> </p><p>Thus far, the following major and minor versions of the Unicode standard have been published. Update versions, which do not include any changes to character repertoire, are signified by the third number (e.g., "version 4.0.1") and are omitted in the table below.<sup id="cite_ref-16" class="reference"><a href="#cite_note-16">[15]</a></sup> </p> <table class="wikitable"> <caption>Unicode versions </caption> <tbody><tr> <th rowspan="2">Version </th> <th rowspan="2">Date </th> <th rowspan="2">Book </th> <th rowspan="2">Corresponding <a href="/w/index.php?title=Universal_Character_Set&action=edit&redlink=1" class="new" title="Universal Character Set (страница отсутствует)">ISO/IEC 10646</a> edition </th> <th rowspan="2"><a href="/w/index.php?title=Script_(Unicode)&action=edit&redlink=1" class="new" title="Script (Unicode) (страница отсутствует)">Scripts</a> </th> <th colspan="2">Characters </th></tr> <tr> <th>Total<sup id="cite_ref-17" class="reference"><a href="#cite_note-17">[tablenote 1]</a></sup> </th> <th>Notable additions </th></tr> <tr> <td>1.0.0 </td> <td>October 1991 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/0-201-56788-1" title="Служебная:Источники книг/0-201-56788-1">ISBN <span class="nowrap">0-201-56788-1</span></a> (Vol. 
1) </td> <td> </td> <td>24 </td> <td>7,129 </td> <td>Initial repertoire covers these scripts: <a href="/w/index.php?title=Arabic_script&action=edit&redlink=1" class="new" title="Arabic script (страница отсутствует)">Arabic</a>, <a href="/w/index.php?title=Armenian_alphabet&action=edit&redlink=1" class="new" title="Armenian alphabet (страница отсутствует)">Armenian</a>, <a href="/w/index.php?title=Bengali_alphabet&action=edit&redlink=1" class="new" title="Bengali alphabet (страница отсутствует)">Bengali</a>, <a href="/w/index.php?title=Zhuyin&action=edit&redlink=1" class="new" title="Zhuyin (страница отсутствует)">Bopomofo</a>, <a href="/w/index.php?title=Cyrillic_script&action=edit&redlink=1" class="new" title="Cyrillic script (страница отсутствует)">Cyrillic</a>, <a href="/w/index.php?title=Devanagari&action=edit&redlink=1" class="new" title="Devanagari (страница отсутствует)">Devanagari</a>, <a href="/w/index.php?title=Georgian_alphabet&action=edit&redlink=1" class="new" title="Georgian alphabet (страница отсутствует)">Georgian</a>, <a href="/w/index.php?title=Greek_alphabet&action=edit&redlink=1" class="new" title="Greek alphabet (страница отсутствует)">Greek and Coptic</a>, <a href="/w/index.php?title=Gujarati_alphabet&action=edit&redlink=1" class="new" title="Gujarati alphabet (страница отсутствует)">Gujarati</a>, <a href="/w/index.php?title=Gurmukhi_script&action=edit&redlink=1" class="new" title="Gurmukhi script (страница отсутствует)">Gurmukhi</a>, <a href="/w/index.php?title=Hangul&action=edit&redlink=1" class="new" title="Hangul (страница отсутствует)">Hangul</a>, <a href="/w/index.php?title=Hebrew_alphabet&action=edit&redlink=1" class="new" title="Hebrew alphabet (страница отсутствует)">Hebrew</a>, <a href="/w/index.php?title=Hiragana&action=edit&redlink=1" class="new" title="Hiragana (страница отсутствует)">Hiragana</a>, <a href="/w/index.php?title=Kannada_alphabet&action=edit&redlink=1" class="new" title="Kannada alphabet (страница 
отсутствует)">Kannada</a>, <a href="/wiki/Katakana" class="mw-redirect" title="Katakana">Katakana</a>, <a href="/w/index.php?title=Lao_script&action=edit&redlink=1" class="new" title="Lao script (страница отсутствует)">Lao</a>, <a href="/w/index.php?title=Latin_script&action=edit&redlink=1" class="new" title="Latin script (страница отсутствует)">Latin</a>, <a href="/w/index.php?title=Malayalam_script&action=edit&redlink=1" class="new" title="Malayalam script (страница отсутствует)">Malayalam</a>, <a href="/w/index.php?title=Oriya_script&action=edit&redlink=1" class="new" title="Oriya script (страница отсутствует)">Oriya</a>, <a href="/w/index.php?title=Tamil_script&action=edit&redlink=1" class="new" title="Tamil script (страница отсутствует)">Tamil</a>, <a href="/w/index.php?title=Telugu_script&action=edit&redlink=1" class="new" title="Telugu script (страница отсутствует)">Telugu</a>, <a href="/w/index.php?title=Thai_alphabet&action=edit&redlink=1" class="new" title="Thai alphabet (страница отсутствует)">Thai</a>, and <a href="/w/index.php?title=Tibetan_script&action=edit&redlink=1" class="new" title="Tibetan script (страница отсутствует)">Tibetan</a>.<sup id="cite_ref-18" class="reference"><a href="#cite_note-18">[16]</a></sup> </td></tr> <tr> <td>1.0.1 </td> <td>June 1992 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/0-201-60845-6" title="Служебная:Источники книг/0-201-60845-6">ISBN <span class="nowrap">0-201-60845-6</span></a> (Vol. 
2) </td> <td> </td> <td>25 </td> <td>28,327<br />(21,204 added; 6 removed) </td> <td>The initial set of 20,902 <a href="/w/index.php?title=CJK_Unified_Ideographs&action=edit&redlink=1" class="new" title="CJK Unified Ideographs (страница отсутствует)">CJK Unified Ideographs</a> is defined.<sup id="cite_ref-19" class="reference"><a href="#cite_note-19">[17]</a></sup> </td></tr> <tr> <td>1.1 </td> <td>June 1993 </td> <td> </td> <td>ISO/IEC 10646-1:1993 </td> <td>24 </td> <td>34,168<br />(5,963 added; 89 removed; 33 reclassified as control characters) </td> <td>4,306 more <a href="/w/index.php?title=Hangul&action=edit&redlink=1" class="new" title="Hangul (страница отсутствует)">Hangul</a> syllables added to original set of 2,350 characters. <a href="/w/index.php?title=Tibetan_script&action=edit&redlink=1" class="new" title="Tibetan script (страница отсутствует)">Tibetan</a> removed.<sup id="cite_ref-20" class="reference"><a href="#cite_note-20">[18]</a></sup> </td></tr> <tr> <td>2.0 </td> <td>July 1996 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/0-201-48345-9" title="Служебная:Источники книг/0-201-48345-9">ISBN <span class="nowrap">0-201-48345-9</span></a> </td> <td>ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7 </td> <td>25 </td> <td>38,885<br />(11,373 added; 6,656 removed) </td> <td>Original set of <a href="/w/index.php?title=Hangul&action=edit&redlink=1" class="new" title="Hangul (страница отсутствует)">Hangul</a> syllables removed, and a new set of 11,172 Hangul syllables added at a new location. <a href="/w/index.php?title=Tibetan_script&action=edit&redlink=1" class="new" title="Tibetan script (страница отсутствует)">Tibetan</a> added back in a new location and with a different character repertoire. 
Surrogate character mechanism defined, and Plane 15 and Plane 16 <a href="/w/index.php?title=Private_use_(Unicode)&action=edit&redlink=1" class="new" title="Private use (Unicode) (страница отсутствует)">Private Use Areas</a> allocated.<sup id="cite_ref-21" class="reference"><a href="#cite_note-21">[19]</a></sup> </td></tr> <tr> <td>2.1 </td> <td>May 1998 </td> <td> </td> <td>ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7, as well as two characters from Amendment 18 </td> <td>25 </td> <td>38,887<br />(2 added) </td> <td><a href="/w/index.php?title=Euro_sign&action=edit&redlink=1" class="new" title="Euro sign (страница отсутствует)">Euro sign</a> and <a href="/w/index.php?title=Specials_(Unicode_block)&action=edit&redlink=1" class="new" title="Specials (Unicode block) (страница отсутствует)">Object Replacement Character</a> added.<sup id="cite_ref-22" class="reference"><a href="#cite_note-22">[20]</a></sup> </td></tr> <tr> <td>3.0 </td> <td>September 1999 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/0-201-61633-5" title="Служебная:Источники книг/0-201-61633-5">ISBN <span class="nowrap">0-201-61633-5</span></a> </td> <td>ISO/IEC 10646-1:2000 </td> <td>38 </td> <td>49,194<br />(10,307 added) </td> <td><a href="/w/index.php?title=Cherokee_syllabary&action=edit&redlink=1" class="new" title="Cherokee syllabary (страница отсутствует)">Cherokee</a>, <a href="/w/index.php?title=Ge%27ez_alphabet&action=edit&redlink=1" class="new" title="Ge'ez alphabet (страница отсутствует)">Ethiopic</a>, <a href="/w/index.php?title=Khmer_script&action=edit&redlink=1" class="new" title="Khmer script (страница отсутствует)">Khmer</a>, <a href="/w/index.php?title=Mongolian_script&action=edit&redlink=1" class="new" title="Mongolian script (страница отсутствует)">Mongolian</a>, <a href="/w/index.php?title=Burmese_script&action=edit&redlink=1" class="new" title="Burmese script 
(страница отсутствует)">Burmese</a>, <a href="/w/index.php?title=Ogham&action=edit&redlink=1" class="new" title="Ogham (страница отсутствует)">Ogham</a>, <a href="/w/index.php?title=Runic_alphabet&action=edit&redlink=1" class="new" title="Runic alphabet (страница отсутствует)">Runic</a>, <a href="/w/index.php?title=Sinhala_script&action=edit&redlink=1" class="new" title="Sinhala script (страница отсутствует)">Sinhala</a>, <a href="/w/index.php?title=Syriac_alphabet&action=edit&redlink=1" class="new" title="Syriac alphabet (страница отсутствует)">Syriac</a>, <a href="/w/index.php?title=T%C4%81na&action=edit&redlink=1" class="new" title="Tāna (страница отсутствует)">Thaana</a>, <a href="/w/index.php?title=Canadian_Aboriginal_syllabics&action=edit&redlink=1" class="new" title="Canadian Aboriginal syllabics (страница отсутствует)">Unified Canadian Aboriginal Syllabics</a>, and <a href="/w/index.php?title=Yi_script&action=edit&redlink=1" class="new" title="Yi script (страница отсутствует)">Yi Syllables</a> added, as well as a set of <a href="/w/index.php?title=Braille&action=edit&redlink=1" class="new" title="Braille (страница отсутствует)">Braille</a> patterns.<sup id="cite_ref-23" class="reference"><a href="#cite_note-23">[21]</a></sup> </td></tr> <tr> <td>3.1 </td> <td>March 2001 </td> <td> </td> <td>ISO/IEC 10646-1:2000 <p>ISO/IEC 10646-2:2001 </p> </td> <td>41 </td> <td>94,140<br />(44,946 added) </td> <td><a href="/w/index.php?title=Deseret_alphabet&action=edit&redlink=1" class="new" title="Deseret alphabet (страница отсутствует)">Deseret</a>, <a href="/w/index.php?title=Gothic_alphabet&action=edit&redlink=1" class="new" title="Gothic alphabet (страница отсутствует)">Gothic</a> and <a href="/w/index.php?title=Old_Italic_alphabet&action=edit&redlink=1" class="new" title="Old Italic alphabet (страница отсутствует)">Old Italic</a> added, as well as sets of symbols for <a href="/w/index.php?title=Modern_musical_symbols&action=edit&redlink=1" class="new" title="Modern 
musical symbols (страница отсутствует)">Western music</a> and <a href="/w/index.php?title=Byzantine_music&action=edit&redlink=1" class="new" title="Byzantine music (страница отсутствует)">Byzantine music</a>, and 42,711 additional <a href="/w/index.php?title=CJK_Unified_Ideographs&action=edit&redlink=1" class="new" title="CJK Unified Ideographs (страница отсутствует)">CJK Unified Ideographs</a>.<sup id="cite_ref-24" class="reference"><a href="#cite_note-24">[22]</a></sup> </td></tr> <tr> <td>3.2 </td> <td>March 2002 </td> <td> </td> <td>ISO/IEC 10646-1:2000 plus Amendment 1 <p>ISO/IEC 10646-2:2001 </p> </td> <td>45 </td> <td>95,156<br />(1,016 added) </td> <td><a href="/wiki/Philippines" class="mw-redirect" title="Philippines">Philippine</a> scripts <a href="/w/index.php?title=Buhid_script&action=edit&redlink=1" class="new" title="Buhid script (страница отсутствует)">Buhid</a>, <a href="/w/index.php?title=Hanun%C3%B3%27o_script&action=edit&redlink=1" class="new" title="Hanunó'o script (страница отсутствует)">Hanunó'o</a>, <a href="/w/index.php?title=Baybayin&action=edit&redlink=1" class="new" title="Baybayin (страница отсутствует)">Tagalog</a>, and <a href="/w/index.php?title=Tagbanwa_script&action=edit&redlink=1" class="new" title="Tagbanwa script (страница отсутствует)">Tagbanwa</a> added.<sup id="cite_ref-25" class="reference"><a href="#cite_note-25">[23]</a></sup> </td></tr> <tr> <td>4.0 </td> <td>April 2003 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/0-321-18578-1" title="Служебная:Источники книг/0-321-18578-1">ISBN <span class="nowrap">0-321-18578-1</span></a> </td> <td>ISO/IEC 10646:2003 </td> <td>52 </td> <td>96,382<br />(1,226 added) </td> <td><a href="/w/index.php?title=Cypriot_syllabary&action=edit&redlink=1" class="new" title="Cypriot syllabary (страница отсутствует)">Cypriot syllabary</a>, <a 
href="/w/index.php?title=Limbu_script&action=edit&redlink=1" class="new" title="Limbu script (страница отсутствует)">Limbu</a>, <a href="/w/index.php?title=Linear_B&action=edit&redlink=1" class="new" title="Linear B (страница отсутствует)">Linear B</a>, <a href="/w/index.php?title=Osmanya_script&action=edit&redlink=1" class="new" title="Osmanya script (страница отсутствует)">Osmanya</a>, <a href="/w/index.php?title=Shavian_alphabet&action=edit&redlink=1" class="new" title="Shavian alphabet (страница отсутствует)">Shavian</a>, <a href="/w/index.php?title=Tai_N%C3%BCa_language&action=edit&redlink=1" class="new" title="Tai Nüa language (страница отсутствует)">Tai Le</a>, and <a href="/w/index.php?title=Ugaritic_alphabet&action=edit&redlink=1" class="new" title="Ugaritic alphabet (страница отсутствует)">Ugaritic</a> added, as well as <a href="/w/index.php?title=Hexagram_(I_Ching)&action=edit&redlink=1" class="new" title="Hexagram (I Ching) (страница отсутствует)">Hexagram symbols</a>.<sup id="cite_ref-26" class="reference"><a href="#cite_note-26">[24]</a></sup> </td></tr> <tr> <td>4.1 </td> <td>March 2005 </td> <td> </td> <td>ISO/IEC 10646:2003 plus Amendment 1 </td> <td>59 </td> <td>97,655<br />(1,273 added) </td> <td><a href="/w/index.php?title=Lontara_alphabet&action=edit&redlink=1" class="new" title="Lontara alphabet (страница отсутствует)">Buginese</a>, <a href="/w/index.php?title=Glagolitic_alphabet&action=edit&redlink=1" class="new" title="Glagolitic alphabet (страница отсутствует)">Glagolitic</a>, <a href="/w/index.php?title=Kharo%E1%B9%A3%E1%B9%ADh%C4%AB&action=edit&redlink=1" class="new" title="Kharoṣṭhī (страница отсутствует)">Kharoshthi</a>, <a href="/w/index.php?title=New_Tai_Lue_alphabet&action=edit&redlink=1" class="new" title="New Tai Lue alphabet (страница отсутствует)">New Tai Lue</a>, <a href="/w/index.php?title=Old_Persian_cuneiform_script&action=edit&redlink=1" class="new" title="Old Persian cuneiform script (страница отсутствует)">Old Persian</a>, 
<a href="/w/index.php?title=Sylheti_Nagari&action=edit&redlink=1" class="new" title="Sylheti Nagari (страница отсутствует)">Syloti Nagri</a>, and <a href="/w/index.php?title=Tifinagh&action=edit&redlink=1" class="new" title="Tifinagh (страница отсутствует)">Tifinagh</a> added, and <a href="/w/index.php?title=Coptic_alphabet&action=edit&redlink=1" class="new" title="Coptic alphabet (страница отсутствует)">Coptic</a> was disunified from <a href="/w/index.php?title=Greek_alphabet&action=edit&redlink=1" class="new" title="Greek alphabet (страница отсутствует)">Greek</a>. Ancient <a href="/w/index.php?title=Unicode_numerals&action=edit&redlink=1" class="new" title="Unicode numerals (страница отсутствует)">Greek numbers</a> and <a href="/w/index.php?title=Musical_notation&action=edit&redlink=1" class="new" title="Musical notation (страница отсутствует)">musical symbols</a> were also added.<sup id="cite_ref-27" class="reference"><a href="#cite_note-27">[25]</a></sup> </td></tr> <tr> <td>5.0 </td> <td>July 2006 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/0-321-48091-0" title="Служебная:Источники книг/0-321-48091-0">ISBN <span class="nowrap">0-321-48091-0</span></a> </td> <td>ISO/IEC 10646:2003 plus Amendments 1 and 2, as well as four characters from Amendment 3 </td> <td>64 </td> <td>99,024<br />(1,369 added) </td> <td><a href="/w/index.php?title=Balinese_alphabet&action=edit&redlink=1" class="new" title="Balinese alphabet (страница отсутствует)">Balinese</a>, <a href="/wiki/Cuneiform" class="mw-redirect" title="Cuneiform">Cuneiform</a>, <a href="/w/index.php?title=N%27Ko_alphabet&action=edit&redlink=1" class="new" title="N'Ko alphabet (страница отсутствует)">N'Ko</a>, <a href="/w/index.php?title=Phags-pa_script&action=edit&redlink=1" class="new" title="Phags-pa script (страница отсутствует)">Phags-pa</a>, and <a 
href="/w/index.php?title=Phoenician_alphabet&action=edit&redlink=1" class="new" title="Phoenician alphabet (страница отсутствует)">Phoenician</a> added.<sup id="cite_ref-28" class="reference"><a href="#cite_note-28">[26]</a></sup> </td></tr> <tr> <td>5.1 </td> <td>April 2008 </td> <td> </td> <td>ISO/IEC 10646:2003 plus Amendments 1, 2, 3 and 4 </td> <td>75 </td> <td>100,648<br />(1,624 added) </td> <td><a href="/w/index.php?title=Carian_script&action=edit&redlink=1" class="new" title="Carian script (страница отсутствует)">Carian</a>, <a href="/w/index.php?title=Cham_alphabet&action=edit&redlink=1" class="new" title="Cham alphabet (страница отсутствует)">Cham</a>, <a href="/w/index.php?title=Kayah_Li_script&action=edit&redlink=1" class="new" title="Kayah Li script (страница отсутствует)">Kayah Li</a>, <a href="/w/index.php?title=Lepcha_script&action=edit&redlink=1" class="new" title="Lepcha script (страница отсутствует)">Lepcha</a>, <a href="/w/index.php?title=Lycian_script&action=edit&redlink=1" class="new" title="Lycian script (страница отсутствует)">Lycian</a>, <a href="/w/index.php?title=Lydian_script&action=edit&redlink=1" class="new" title="Lydian script (страница отсутствует)">Lydian</a>, <a href="/w/index.php?title=Ol_Chiki_script&action=edit&redlink=1" class="new" title="Ol Chiki script (страница отсутствует)">Ol Chiki</a>, <a href="/w/index.php?title=Rejang_script&action=edit&redlink=1" class="new" title="Rejang script (страница отсутствует)">Rejang</a>, <a href="/w/index.php?title=Saurashtra_script&action=edit&redlink=1" class="new" title="Saurashtra script (страница отсутствует)">Saurashtra</a>, <a href="/w/index.php?title=Sundanese_script&action=edit&redlink=1" class="new" title="Sundanese script (страница отсутствует)">Sundanese</a>, and <a href="/w/index.php?title=Vai_syllabary&action=edit&redlink=1" class="new" title="Vai syllabary (страница отсутствует)">Vai</a> added, as well as sets of symbols for the <a 
href="/w/index.php?title=Phaistos_Disc&action=edit&redlink=1" class="new" title="Phaistos Disc (страница отсутствует)">Phaistos Disc</a>, <a href="/wiki/Mahjong" class="mw-redirect" title="Mahjong">Mahjong tiles</a>, and <a href="/w/index.php?title=Dominoes&action=edit&redlink=1" class="new" title="Dominoes (страница отсутствует)">Domino tiles</a>. There were also important additions for <a href="/w/index.php?title=Burmese_script&action=edit&redlink=1" class="new" title="Burmese script (страница отсутствует)">Burmese</a>, additions of letters and <a href="/w/index.php?title=Scribal_abbreviation&action=edit&redlink=1" class="new" title="Scribal abbreviation (страница отсутствует)">Scribal abbreviations</a> used in medieval <a href="/wiki/Manuscript" title="Manuscript">manuscripts</a>, and the addition of <a href="/w/index.php?title=Capital_%E1%BA%9E&action=edit&redlink=1" class="new" title="Capital ẞ (страница отсутствует)">Capital ẞ</a>.<sup id="cite_ref-29" class="reference"><a href="#cite_note-29">[27]</a></sup> </td></tr> <tr> <td>5.2 </td> <td>October 2009 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/978-1-936213-00-9" title="Служебная:Источники книг/978-1-936213-00-9">ISBN <span class="nowrap">978-1-936213-00-9</span></a> </td> <td>ISO/IEC 10646:2003 plus Amendments 1, 2, 3, 4, 5 and 6 </td> <td>90 </td> <td>107,296<br />(6,648 added) </td> <td><a href="/w/index.php?title=Avestan_alphabet&action=edit&redlink=1" class="new" title="Avestan alphabet (страница отсутствует)">Avestan</a>, <a href="/w/index.php?title=Bamum_script&action=edit&redlink=1" class="new" title="Bamum script (страница отсутствует)">Bamum</a>, <a href="/w/index.php?title=Egyptian_hieroglyphs&action=edit&redlink=1" class="new" title="Egyptian hieroglyphs (страница отсутствует)">Egyptian hieroglyphs</a> (the <a href="/w/index.php?title=Gardiner%27s_sign_list&action=edit&redlink=1" 
class="new" title="Gardiner's sign list (страница отсутствует)">Gardiner Set</a>, comprising 1,071 characters), <a href="/w/index.php?title=Imperial_Aramaic&action=edit&redlink=1" class="new" title="Imperial Aramaic (страница отсутствует)">Imperial Aramaic</a>, <a href="/w/index.php?title=Inscriptional_Pahlavi&action=edit&redlink=1" class="new" title="Inscriptional Pahlavi (страница отсутствует)">Inscriptional Pahlavi</a>, <a href="/w/index.php?title=Inscriptional_Parthian&action=edit&redlink=1" class="new" title="Inscriptional Parthian (страница отсутствует)">Inscriptional Parthian</a>, <a href="/w/index.php?title=Javanese_script&action=edit&redlink=1" class="new" title="Javanese script (страница отсутствует)">Javanese</a>, <a href="/w/index.php?title=Kaithi&action=edit&redlink=1" class="new" title="Kaithi (страница отсутствует)">Kaithi</a>, <a href="/w/index.php?title=Fraser_alphabet&action=edit&redlink=1" class="new" title="Fraser alphabet (страница отсутствует)">Lisu</a>, <a href="/w/index.php?title=Meitei_Mayek_script&action=edit&redlink=1" class="new" title="Meitei Mayek script (страница отсутствует)">Meetei Mayek</a>, <a href="/w/index.php?title=South_Arabian_alphabet&action=edit&redlink=1" class="new" title="South Arabian alphabet (страница отсутствует)">Old South Arabian</a>, <a href="/w/index.php?title=Old_Turkic_script&action=edit&redlink=1" class="new" title="Old Turkic script (страница отсутствует)">Old Turkic</a>, <a href="/w/index.php?title=Samaritan_script&action=edit&redlink=1" class="new" title="Samaritan script (страница отсутствует)">Samaritan</a>, <a href="/w/index.php?title=Tai_Tham_script&action=edit&redlink=1" class="new" title="Tai Tham script (страница отсутствует)">Tai Tham</a> and <a href="/w/index.php?title=Tai_Viet_script&action=edit&redlink=1" class="new" title="Tai Viet script (страница отсутствует)">Tai Viet</a> added. 
4,149 additional <a href="/w/index.php?title=CJK_Unified_Ideographs&action=edit&redlink=1" class="new" title="CJK Unified Ideographs (страница отсутствует)">CJK Unified Ideographs</a> (CJK-C), as well as extended Jamo for <a href="/w/index.php?title=Hangul&action=edit&redlink=1" class="new" title="Hangul (страница отсутствует)">Old Hangul</a>, and characters for <a href="/w/index.php?title=Vedic_Sanskrit&action=edit&redlink=1" class="new" title="Vedic Sanskrit (страница отсутствует)">Vedic Sanskrit</a>.<sup id="cite_ref-30" class="reference"><a href="#cite_note-30">[28]</a></sup> </td></tr> <tr> <td>6.0 </td> <td>October 2010 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/978-1-936213-01-6" title="Служебная:Источники книг/978-1-936213-01-6">ISBN <span class="nowrap">978-1-936213-01-6</span></a> </td> <td>ISO/IEC 10646:2010 plus the <a href="/w/index.php?title=Indian_rupee_sign&action=edit&redlink=1" class="new" title="Indian rupee sign (страница отсутствует)">Indian rupee sign</a> </td> <td>93 </td> <td>109,384<br />(2,088 added) </td> <td><a href="/w/index.php?title=Batak_alphabet&action=edit&redlink=1" class="new" title="Batak alphabet (страница отсутствует)">Batak</a>, <a href="/w/index.php?title=Br%C4%81hm%C4%AB_script&action=edit&redlink=1" class="new" title="Brāhmī script (страница отсутствует)">Brahmi</a>, <a href="/w/index.php?title=Mandaic_alphabet&action=edit&redlink=1" class="new" title="Mandaic alphabet (страница отсутствует)">Mandaic</a>, <a href="/w/index.php?title=Playing_card&action=edit&redlink=1" class="new" title="Playing card (страница отсутствует)">playing card</a> symbols, <a href="/w/index.php?title=Traffic_sign&action=edit&redlink=1" class="new" title="Traffic sign (страница отсутствует)">transport</a> and <a href="/wiki/Map" class="mw-disambig" title="Map">map</a> symbols, <a 
href="/w/index.php?title=Alchemical_symbol&action=edit&redlink=1" class="new" title="Alchemical symbol (страница отсутствует)">alchemical symbols</a>, <a href="/w/index.php?title=Emoticons&action=edit&redlink=1" class="new" title="Emoticons (страница отсутствует)">emoticons</a> and <a href="/wiki/Emoji" class="mw-redirect" title="Emoji">emoji</a>. 222 additional <a href="/w/index.php?title=CJK_Unified_Ideographs&action=edit&redlink=1" class="new" title="CJK Unified Ideographs (страница отсутствует)">CJK Unified Ideographs</a> (CJK-D) added.<sup id="cite_ref-31" class="reference"><a href="#cite_note-31">[29]</a></sup> </td></tr> <tr> <td>6.1 </td> <td>January 2012 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/978-1-936213-02-3" title="Служебная:Источники книг/978-1-936213-02-3">ISBN <span class="nowrap">978-1-936213-02-3</span></a> </td> <td>ISO/IEC 10646:2012 </td> <td>100 </td> <td>110,116<br />(732 added) </td> <td><a href="/w/index.php?title=Chakma_alphabet&action=edit&redlink=1" class="new" title="Chakma alphabet (страница отсутствует)">Chakma</a>, <a href="/w/index.php?title=Meroitic_alphabet&action=edit&redlink=1" class="new" title="Meroitic alphabet (страница отсутствует)">Meroitic cursive</a>, <a href="/w/index.php?title=Meroitic_alphabet&action=edit&redlink=1" class="new" title="Meroitic alphabet (страница отсутствует)">Meroitic hieroglyphs</a>, <a href="/w/index.php?title=Pollard_script&action=edit&redlink=1" class="new" title="Pollard script (страница отсутствует)">Miao</a>, <a href="/w/index.php?title=%C5%9A%C4%81rad%C4%81_script&action=edit&redlink=1" class="new" title="Śāradā script (страница отсутствует)">Sharada</a>, <a href="/w/index.php?title=Sora_Sompeng&action=edit&redlink=1" class="new" title="Sora Sompeng (страница отсутствует)">Sora Sompeng</a>, and <a href="/w/index.php?title=Takri_alphabet&action=edit&redlink=1" class="new" 
title="Takri alphabet (страница отсутствует)">Takri</a>.<sup id="cite_ref-32" class="reference"><a href="#cite_note-32">[30]</a></sup> </td></tr> <tr> <td>6.2 </td> <td>September 2012 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/978-1-936213-07-8" title="Служебная:Источники книг/978-1-936213-07-8">ISBN <span class="nowrap">978-1-936213-07-8</span></a> </td> <td>ISO/IEC 10646:2012 plus the <a href="/w/index.php?title=Turkish_lira_sign&action=edit&redlink=1" class="new" title="Turkish lira sign (страница отсутствует)">Turkish lira sign</a> </td> <td>100 </td> <td>110,117<br />(1 added) </td> <td><a href="/w/index.php?title=Turkish_lira_sign&action=edit&redlink=1" class="new" title="Turkish lira sign (страница отсутствует)">Turkish lira sign</a>.<sup id="cite_ref-33" class="reference"><a href="#cite_note-33">[31]</a></sup> </td></tr> <tr> <td>6.3 </td> <td>September 2013 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/978-1-936213-08-5" title="Служебная:Источники книг/978-1-936213-08-5">ISBN <span class="nowrap">978-1-936213-08-5</span></a> </td> <td>ISO/IEC 10646:2012 plus six characters </td> <td>100 </td> <td>110,122<br />(5 added) </td> <td>5 bidirectional formatting characters.<sup id="cite_ref-34" class="reference"><a href="#cite_note-34">[32]</a></sup> </td></tr> <tr> <td>7.0 </td> <td>June 2014 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/978-1-936213-09-2" title="Служебная:Источники книг/978-1-936213-09-2">ISBN <span class="nowrap">978-1-936213-09-2</span></a> </td> <td>ISO/IEC 10646:2012 plus Amendments 1 and 2, as well as the <a href="/w/index.php?title=Ruble_sign&action=edit&redlink=1" class="new" title="Ruble sign 
(страница отсутствует)">Ruble sign</a> </td> <td>123 </td> <td>112,956<br />(2,834 added) </td> <td><a href="/w/index.php?title=Bassa_alphabet&action=edit&redlink=1" class="new" title="Bassa alphabet (страница отсутствует)">Bassa Vah</a>, <a href="/w/index.php?title=Caucasian_Albanian_alphabet&action=edit&redlink=1" class="new" title="Caucasian Albanian alphabet (страница отсутствует)">Caucasian Albanian</a>, <a href="/w/index.php?title=Duployan_shorthand&action=edit&redlink=1" class="new" title="Duployan shorthand (страница отсутствует)">Duployan</a>, <a href="/w/index.php?title=Elbasan_alphabet&action=edit&redlink=1" class="new" title="Elbasan alphabet (страница отсутствует)">Elbasan</a>, <a href="/w/index.php?title=Grantha_alphabet&action=edit&redlink=1" class="new" title="Grantha alphabet (страница отсутствует)">Grantha</a>, <a href="/w/index.php?title=Khojki&action=edit&redlink=1" class="new" title="Khojki (страница отсутствует)">Khojki</a>, <a href="/w/index.php?title=Khudabadi_alphabet&action=edit&redlink=1" class="new" title="Khudabadi alphabet (страница отсутствует)">Khudawadi</a>, <a href="/w/index.php?title=Linear_A&action=edit&redlink=1" class="new" title="Linear A (страница отсутствует)">Linear A</a>, <a href="/w/index.php?title=Mahajani&action=edit&redlink=1" class="new" title="Mahajani (страница отсутствует)">Mahajani</a>, <a href="/w/index.php?title=Manichaean_alphabet&action=edit&redlink=1" class="new" title="Manichaean alphabet (страница отсутствует)">Manichaean</a>, <a href="/w/index.php?title=Mende_script&action=edit&redlink=1" class="new" title="Mende script (страница отсутствует)">Mende Kikakui</a>, <a href="/w/index.php?title=Modi_alphabet&action=edit&redlink=1" class="new" title="Modi alphabet (страница отсутствует)">Modi</a>, <a href="/w/index.php?title=Mro_script&action=edit&redlink=1" class="new" title="Mro script (страница отсутствует)">Mro</a>, <a href="/w/index.php?title=Nabataean_alphabet&action=edit&redlink=1" class="new" 
title="Nabataean alphabet (страница отсутствует)">Nabataean</a>, <a href="/w/index.php?title=Old_North_Arabian&action=edit&redlink=1" class="new" title="Old North Arabian (страница отсутствует)">Old North Arabian</a>, <a href="/w/index.php?title=Old_Permic_alphabet&action=edit&redlink=1" class="new" title="Old Permic alphabet (страница отсутствует)">Old Permic</a>, <a href="/w/index.php?title=Pahawh_Hmong&action=edit&redlink=1" class="new" title="Pahawh Hmong (страница отсутствует)">Pahawh Hmong</a>, <a href="/w/index.php?title=Palmyrene_script&action=edit&redlink=1" class="new" title="Palmyrene script (страница отсутствует)">Palmyrene</a>, <a href="/w/index.php?title=Pau_Cin_Hau_script&action=edit&redlink=1" class="new" title="Pau Cin Hau script (страница отсутствует)">Pau Cin Hau</a>, <a href="/w/index.php?title=Psalter_Pahlavi&action=edit&redlink=1" class="new" title="Psalter Pahlavi (страница отсутствует)">Psalter Pahlavi</a>, <a href="/w/index.php?title=Siddha%E1%B9%83_alphabet&action=edit&redlink=1" class="new" title="Siddhaṃ alphabet (страница отсутствует)">Siddham</a>, <a href="/w/index.php?title=Tirhuta&action=edit&redlink=1" class="new" title="Tirhuta (страница отсутствует)">Tirhuta</a>, <a href="/w/index.php?title=Warang_Citi&action=edit&redlink=1" class="new" title="Warang Citi (страница отсутствует)">Warang Citi</a>, and <a href="/w/index.php?title=Dingbat&action=edit&redlink=1" class="new" title="Dingbat (страница отсутствует)">Dingbats</a>.<sup id="cite_ref-35" class="reference"><a href="#cite_note-35">[33]</a></sup> </td></tr> <tr> <td>8.0 </td> <td>June 2015 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/978-1-936213-10-8" title="Служебная:Источники книг/978-1-936213-10-8">ISBN <span class="nowrap">978-1-936213-10-8</span></a> </td> <td>ISO/IEC 10646:2014 plus Amendment 1, as well as the <a 
href="/w/index.php?title=Georgian_lari&action=edit&redlink=1" class="new" title="Georgian lari (страница отсутствует)">Lari sign</a>, nine CJK unified ideographs, and 41 emoji characters<sup id="cite_ref-36" class="reference"><a href="#cite_note-36">[34]</a></sup> </td> <td>129 </td> <td>120,672<br />(7,716 added) </td> <td><a href="/w/index.php?title=Ahom_alphabet&action=edit&redlink=1" class="new" title="Ahom alphabet (страница отсутствует)">Ahom</a>, <a href="/w/index.php?title=Anatolian_hieroglyphs&action=edit&redlink=1" class="new" title="Anatolian hieroglyphs (страница отсутствует)">Anatolian hieroglyphs</a>, <a href="/w/index.php?title=Hatran_alphabet&action=edit&redlink=1" class="new" title="Hatran alphabet (страница отсутствует)">Hatran</a>, <a href="/w/index.php?title=Multani_alphabet&action=edit&redlink=1" class="new" title="Multani alphabet (страница отсутствует)">Multani</a>, <a href="/w/index.php?title=Old_Hungarian_alphabet&action=edit&redlink=1" class="new" title="Old Hungarian alphabet (страница отсутствует)">Old Hungarian</a>, <a href="/wiki/SignWriting" title="SignWriting">SignWriting</a>, 5,771 <a href="/w/index.php?title=CJK_Unified_Ideographs&action=edit&redlink=1" class="new" title="CJK Unified Ideographs (страница отсутствует)">CJK unified ideographs</a>, a set of lowercase letters for <a href="/w/index.php?title=Cherokee_syllabary&action=edit&redlink=1" class="new" title="Cherokee syllabary (страница отсутствует)">Cherokee</a>, and five emoji <a href="/w/index.php?title=Fitzpatrick_scale&action=edit&redlink=1" class="new" title="Fitzpatrick scale (страница отсутствует)">skin tone</a> modifiers<sup id="cite_ref-37" class="reference"><a href="#cite_note-37">[35]</a></sup> </td></tr> <tr> <td>9.0 </td> <td>June 2016 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/978-1-936213-13-9" title="Служебная:Источники 
книг/978-1-936213-13-9">ISBN <span class="nowrap">978-1-936213-13-9</span></a> </td> <td>ISO/IEC 10646:2014 plus Amendments 1 and 2, as well as Adlam, Newa, Japanese TV symbols, and 74 emoji and symbols<sup id="cite_ref-38" class="reference"><a href="#cite_note-38">[36]</a></sup> </td> <td>135 </td> <td>128,172<br />(7,500 added) </td> <td><a href="/w/index.php?title=Adlam_script&action=edit&redlink=1" class="new" title="Adlam script (страница отсутствует)">Adlam</a>, <a href="/w/index.php?title=Bhaiksuki_alphabet&action=edit&redlink=1" class="new" title="Bhaiksuki alphabet (страница отсутствует)">Bhaiksuki</a>, <a href="/w/index.php?title=Zhang-Zhung_language&action=edit&redlink=1" class="new" title="Zhang-Zhung language (страница отсутствует)">Marchen</a>, <a href="/w/index.php?title=Prachalit_Nepal_alphabet&action=edit&redlink=1" class="new" title="Prachalit Nepal alphabet (страница отсутствует)">Newa</a>, <a href="/w/index.php?title=Osage_alphabet&action=edit&redlink=1" class="new" title="Osage alphabet (страница отсутствует)">Osage</a>, <a href="/w/index.php?title=Tangut_script&action=edit&redlink=1" class="new" title="Tangut script (страница отсутствует)">Tangut</a>, and 72 <a href="/wiki/Emoji" class="mw-redirect" title="Emoji">emoji</a><sup id="cite_ref-39" class="reference"><a href="#cite_note-39">[37]</a></sup><sup id="cite_ref-laobo_40-0" class="reference"><a href="#cite_note-laobo-40">[38]</a></sup> </td></tr> <tr> <td>10.0 </td> <td>June 2017 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/978-1-936213-16-0" title="Служебная:Источники книг/978-1-936213-16-0">ISBN <span class="nowrap">978-1-936213-16-0</span></a> </td> <td>ISO/IEC 10646:2017 plus 56 <a href="/wiki/Emoji" class="mw-redirect" title="Emoji">emoji</a> characters, 285 <a href="/w/index.php?title=Hentaigana&action=edit&redlink=1" class="new" title="Hentaigana (страница 
отсутствует)">hentaigana</a> characters, and 3 Zanabazar Square characters<sup id="cite_ref-41" class="reference"><a href="#cite_note-41">[39]</a></sup> </td> <td>139 </td> <td>136,690<br />(8,518 added) </td> <td><a href="/w/index.php?title=Zanabazar_Square_alphabet&action=edit&redlink=1" class="new" title="Zanabazar Square alphabet (страница отсутствует)">Zanabazar Square</a>, <a href="/w/index.php?title=Soyombo_alphabet&action=edit&redlink=1" class="new" title="Soyombo alphabet (страница отсутствует)">Soyombo</a>, <a href="/w/index.php?title=Masaram_Gondi_script&action=edit&redlink=1" class="new" title="Masaram Gondi script (страница отсутствует)">Masaram Gondi</a>, <a href="/w/index.php?title=N%C3%BCshu_script&action=edit&redlink=1" class="new" title="Nüshu script (страница отсутствует)">Nüshu</a>, <a href="/w/index.php?title=Hentaigana&action=edit&redlink=1" class="new" title="Hentaigana (страница отсутствует)">hentaigana</a> (non-standard <a href="/w/index.php?title=Hiragana&action=edit&redlink=1" class="new" title="Hiragana (страница отсутствует)">hiragana</a>), 7,494 <a href="/w/index.php?title=CJK_Unified_Ideographs&action=edit&redlink=1" class="new" title="CJK Unified Ideographs (страница отсутствует)">CJK unified ideographs</a>, and 56 <a href="/wiki/Emoji" class="mw-redirect" title="Emoji">emoji</a> </td></tr> <tr> <td>11.0 </td> <td>June 2018 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/978-1-936213-19-1" title="Служебная:Источники книг/978-1-936213-19-1">ISBN <span class="nowrap">978-1-936213-19-1</span></a> </td> <td>ISO/IEC 10646:2017 plus Amendment 1, as well as 46 Mtavruli Georgian capital letters, 5 CJK unified ideographs, and 66 emoji characters.<sup id="cite_ref-42" class="reference"><a href="#cite_note-42">[40]</a></sup> </td> <td>146 </td> <td>137,374<br />(684 added) </td> <td><a 
href="/w/index.php?title=Dogri_script&action=edit&redlink=1" class="new" title="Dogri script (страница отсутствует)">Dogra</a>, <a href="/w/index.php?title=Georgian_scripts&action=edit&redlink=1" class="new" title="Georgian scripts (страница отсутствует)">Georgian Mtavruli</a> capital letters, <a href="/w/index.php?title=Gunjala_Gondi_Lipi&action=edit&redlink=1" class="new" title="Gunjala Gondi Lipi (страница отсутствует)">Gunjala Gondi</a>, <a href="/w/index.php?title=Hanifi_Rohingya_script&action=edit&redlink=1" class="new" title="Hanifi Rohingya script (страница отсутствует)">Hanifi Rohingya</a>, <a href="/w/index.php?title=Indic_Siyaq_Numbers_(Unicode_block)&action=edit&redlink=1" class="new" title="Indic Siyaq Numbers (Unicode block) (страница отсутствует)">Indic Siyaq numbers</a>, <a href="/w/index.php?title=Makassarese_language&action=edit&redlink=1" class="new" title="Makassarese language (страница отсутствует)">Makasar</a>, <a href="/w/index.php?title=Medefaidrin&action=edit&redlink=1" class="new" title="Medefaidrin (страница отсутствует)">Medefaidrin</a>, <a href="/w/index.php?title=Sogdian_alphabet&action=edit&redlink=1" class="new" title="Sogdian alphabet (страница отсутствует)">Old Sogdian and Sogdian</a>, <a href="/w/index.php?title=Mayan_numerals&action=edit&redlink=1" class="new" title="Mayan numerals (страница отсутствует)">Mayan numerals</a>, 5 urgently needed <a href="/w/index.php?title=CJK_Unified_Ideographs&action=edit&redlink=1" class="new" title="CJK Unified Ideographs (страница отсутствует)">CJK unified ideographs</a>, symbols for <a href="/w/index.php?title=Xiangqi&action=edit&redlink=1" class="new" title="Xiangqi (страница отсутствует)">xiangqi</a> (Chinese chess) and <a href="/w/index.php?title=Star_(classification)&action=edit&redlink=1" class="new" title="Star (classification) (страница отсутствует)">star ratings</a>, and 145 <a href="/wiki/Emoji" class="mw-redirect" title="Emoji">emoji</a><sup id="cite_ref-43" class="reference"><a 
href="#cite_note-43">[41]</a></sup> </td></tr> <tr> <td>12.0 </td> <td>March 2019 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/978-1-936213-22-1" title="Служебная:Источники книг/978-1-936213-22-1">ISBN <span class="nowrap">978-1-936213-22-1</span></a> </td> <td>ISO/IEC 10646:2017 plus Amendments 1 and 2, as well as 62 additional characters.<sup id="cite_ref-44" class="reference"><a href="#cite_note-44">[42]</a></sup> </td> <td>150 </td> <td>137,928<br />(554 added) </td> <td><a href="/w/index.php?title=Elymaic&action=edit&redlink=1" class="new" title="Elymaic (страница отсутствует)">Elymaic</a>, <a href="/w/index.php?title=Nandinagari&action=edit&redlink=1" class="new" title="Nandinagari (страница отсутствует)">Nandinagari</a>, <a href="/w/index.php?title=Nyiakeng_Puachue_Hmong&action=edit&redlink=1" class="new" title="Nyiakeng Puachue Hmong (страница отсутствует)">Nyiakeng Puachue Hmong</a>, <a href="/w/index.php?title=Wancho_script&action=edit&redlink=1" class="new" title="Wancho script (страница отсутствует)">Wancho</a>, <a href="/w/index.php?title=Pollard_script&action=edit&redlink=1" class="new" title="Pollard script (страница отсутствует)">Miao script</a> additions for several Miao and Yi dialects in China, <a href="/w/index.php?title=Hiragana&action=edit&redlink=1" class="new" title="Hiragana (страница отсутствует)">hiragana</a> and <a href="/wiki/Katakana" class="mw-redirect" title="Katakana">katakana</a> small letters for writing archaic Japanese, <a href="/w/index.php?title=Tamil_script&action=edit&redlink=1" class="new" title="Tamil script (страница отсутствует)">Tamil</a> historic fractions and symbols, <a href="/w/index.php?title=Lao_alphabet&action=edit&redlink=1" class="new" title="Lao alphabet (страница отсутствует)">Lao</a> letters for <a href="/w/index.php?title=Pali&action=edit&redlink=1" class="new" title="Pali (страница 
отсутствует)">Pali</a>, Latin letters for Egyptological and Ugaritic transliteration, hieroglyph format controls, and 61 <a href="/wiki/Emoji" class="mw-redirect" title="Emoji">emoji</a><sup id="cite_ref-45" class="reference"><a href="#cite_note-45">[43]</a></sup> </td></tr> <tr> <td>12.1 </td> <td>May 2019 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/978-1-936213-25-2" title="Служебная:Источники книг/978-1-936213-25-2">ISBN <span class="nowrap">978-1-936213-25-2</span></a> </td> <td> </td> <td>150 </td> <td>137,929<br />(1 added) </td> <td>Adds a single character at U+32FF for the square ligature form of the name of the <a href="/w/index.php?title=Reiwa&action=edit&redlink=1" class="new" title="Reiwa (страница отсутствует)">Reiwa era</a>.<sup id="cite_ref-46" class="reference"><a href="#cite_note-46">[44]</a></sup> </td></tr> <tr> <td><a rel="nofollow" class="external text" href="http://www.unicode.org/versions/Unicode13.0.0/">13.0</a> </td> <td>March 2020 </td> <td><a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/978-1-936213-26-9" title="Служебная:Источники книг/978-1-936213-26-9">ISBN <span class="nowrap">978-1-936213-26-9</span></a> </td> <td>ISO/IEC 10646:2020<sup id="cite_ref-47" class="reference"><a href="#cite_note-47">[45]</a></sup> </td> <td>154 </td> <td>143,859<br />(5,930 added) </td> <td><a href="/w/index.php?title=Khwarezmian_language&action=edit&redlink=1" class="new" title="Khwarezmian language (страница отсутствует)">Chorasmian</a>, <a href="/w/index.php?title=Dhives_akuru&action=edit&redlink=1" class="new" title="Dhives akuru (страница отсутствует)">Dives Akuru</a>, <a href="/w/index.php?title=Khitan_small_script&action=edit&redlink=1" class="new" title="Khitan small script (страница отсутствует)">Khitan small script</a>, <a 
href="/w/index.php?title=Kurdish_alphabets&action=edit&redlink=1" class="new" title="Kurdish alphabets (страница отсутствует)">Yezidi</a>, 4,969 CJK unified ideographs added (including 4,939 in <a href="/w/index.php?title=CJK_Unified_Ideographs_Extension_G&action=edit&redlink=1" class="new" title="CJK Unified Ideographs Extension G (страница отсутствует)">Ext. G</a>), Arabic script additions used to write <a href="/w/index.php?title=Hausa_language&action=edit&redlink=1" class="new" title="Hausa language (страница отсутствует)">Hausa</a>, <a href="/w/index.php?title=Wolof_language&action=edit&redlink=1" class="new" title="Wolof language (страница отсутствует)">Wolof</a>, and other languages in Africa and other additions used to write <a href="/w/index.php?title=Hindko&action=edit&redlink=1" class="new" title="Hindko (страница отсутствует)">Hindko</a> and <a href="/w/index.php?title=Punjabi_language&action=edit&redlink=1" class="new" title="Punjabi language (страница отсутствует)">Punjabi</a> in Pakistan, Bopomofo additions used for Cantonese, Creative Commons license symbols, graphic characters for compatibility with teletext and home computer systems from the 1970s and 1980s, and 55 emoji<sup id="cite_ref-48" class="reference"><a href="#cite_note-48">[46]</a></sup> </td></tr></tbody></table> <div class="reflist columns" style="list-style-type: decimal;"> <div class="mw-references-wrap"><ol class="references"> <li id="cite_note-17"><span class="mw-cite-backlink"><a href="#cite_ref-17">↑</a></span> <span class="reference-text">The number of characters listed for each version of Unicode is the total number of graphic and format characters (i.e., excluding <a href="/wiki/Private_Use_Area" class="mw-redirect" title="Private Use Area">private-use characters</a>, <a href="/w/index.php?title=Unicode_control_characters&action=edit&redlink=1" class="new" title="Unicode control characters (страница отсутствует)">control characters</a>, <a 
href="/w/index.php?title=Noncharacter&action=edit&redlink=1" class="new" title="Noncharacter (страница отсутствует)">noncharacters</a> and <a href="/w/index.php?title=Surrogate_code_points&action=edit&redlink=1" class="new" title="Surrogate code points (страница отсутствует)">surrogate code points</a>).</span> </li> </ol></div></div> <h2><span class="mw-headline" id="Architecture_and_terminology"><span id="Upluslink"></span><span id="codespace"></span> Architecture and terminology</span></h2> <div class="hatnote dabhide">See also: <a href="/w/index.php?title=Universal_Character_Set_characters&action=edit&redlink=1" class="new" title="Universal Character Set characters (страница отсутствует)">Universal Character Set characters</a></div>
<p>The Unicode Standard defines a <i>codespace</i><sup id="cite_ref-Glossary_49-0" class="reference"><a href="#cite_note-Glossary-49">[47]</a></sup> of numerical values ranging from 0 through 10FFFF<sub><a href="/wiki/Hexadecimal" class="mw-redirect" title="Hexadecimal">16</a></sub>,<sup id="cite_ref-50" class="reference"><a href="#cite_note-50">[48]</a></sup> called <i><a href="/w/index.php?title=Code_point&action=edit&redlink=1" class="new" title="Code point (страница отсутствует)">code points</a></i><sup id="cite_ref-:0_51-0" class="reference"><a href="#cite_note-:0-51">[49]</a></sup> and denoted as U+0000 through U+10FFFF: "U+" followed by the code point value in <a href="/wiki/Hexadecimal" class="mw-redirect" title="Hexadecimal">hexadecimal</a>, padded with <a href="/w/index.php?title=Leading_zero&action=edit&redlink=1" class="new" title="Leading zero (страница отсутствует)">leading zeros</a> as necessary to give a minimum of four digits. For example, the division sign, ÷, is U+00F7, whereas the <a href="/w/index.php?title=Egyptian_hieroglyph&action=edit&redlink=1" class="new" title="Egyptian hieroglyph (страница отсутствует)">Egyptian hieroglyph</a> designating a <a href="/w/index.php?title=List_of_hieroglyphs&action=edit&redlink=1" class="new" title="List of hieroglyphs (страница отсутствует)">reed shelter</a> or a <a href="https://commons.wikimedia.org/wiki/Category:Winding_wall_(h_hieroglyph)" class="extiw" title="c:Category:Winding wall (h hieroglyph)">winding wall</a> <span class="nowrap">( <a href="/wiki/%D0%A4%D0%B0%D0%B9%D0%BB:Hiero_O4.png" class="image"><img alt="Hiero O4.png" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/ba/Hiero_O4.png/15px-Hiero_O4.png" decoding="async" width="15" height="12" style="vertical-align: text-bottom" srcset="//upload.wikimedia.org/wikipedia/commons/b/ba/Hiero_O4.png 1.5x" data-file-width="23" data-file-height="18" /></a> )</span><sup id="cite_ref-52" class="reference"><a href="#cite_note-52">[50]</a></sup> is U+13254.
Of these 2<sup>16</sup> + 2<sup>20</sup> = 1,114,112 code points, those from U+D800 through U+DFFF, which are used to encode surrogate pairs in <a href="/wiki/UTF-16" title="UTF-16">UTF-16</a>, are reserved by the Standard and may not be used to encode valid characters. This leaves a net total of 2<sup>16</sup> − 2<sup>11</sup> + 2<sup>20</sup> = 1,112,064 code points that can correspond to valid Unicode characters. Not all of these code points necessarily correspond to visible characters; several, for example, are assigned to control codes such as <a href="/wiki/Carriage_return" class="mw-redirect" title="Carriage return">carriage return</a>. 
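</p> <p>The counts and the "U+" notation described above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not something defined by the Standard; the names <code>CODESPACE_SIZE</code>, <code>SURROGATE_COUNT</code>, <code>SCALAR_VALUE_COUNT</code> and <code>u_plus</code> are hypothetical and introduced here only for illustration.</p>

```python
# Sizes taken from the ranges quoted in the text.
CODESPACE_SIZE = 0x10FFFF + 1            # code points 0 through 10FFFF (hex)
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1    # U+D800 through U+DFFF
SCALAR_VALUE_COUNT = CODESPACE_SIZE - SURROGATE_COUNT


def u_plus(code_point: int) -> str:
    """Format a code point in "U+" notation with at least four hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode codespace")
    # The 04X format pads with leading zeros up to four digits and
    # leaves longer values (five or six digits) untouched.
    return f"U+{code_point:04X}"


if __name__ == "__main__":
    print(CODESPACE_SIZE, SURROGATE_COUNT, SCALAR_VALUE_COUNT)
    print(u_plus(0xF7), u_plus(0x13254))
```

<p>Under these assumptions, <code>u_plus(0xF7)</code> yields U+00F7 and <code>u_plus(0x13254)</code> yields U+13254, matching the two examples in the text.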
</p> <h3><span class="mw-headline" id="Code_point_planes_and_blocks">Code point planes and blocks</span></h3> <div class="hatnote">Main article: <b><a href="/w/index.php?title=Plane_(Unicode)&action=edit&redlink=1" class="new" title="Plane (Unicode) (страница отсутствует)">Plane (Unicode)</a></b></div> <p>The Unicode codespace is divided into seventeen <i>planes</i>, numbered 0 to 16: </p><p><a href="/w/index.php?title=%D0%A8%D0%B0%D0%B1%D0%BB%D0%BE%D0%BD:Planes_(Unicode)&action=edit&redlink=1" class="new" title="Шаблон:Planes (Unicode) (страница отсутствует)">Template:Planes (Unicode)</a> </p><p>All code points in the Basic Multilingual Plane (BMP, Plane 0) are accessed as a single code unit in <a href="/wiki/UTF-16" title="UTF-16">UTF-16</a> encoding and are encoded in one, two or three bytes in <a href="/wiki/UTF-8" title="UTF-8">UTF-8</a>. Code points in Planes 1 through 16 (<i>supplementary planes</i>) are accessed as surrogate pairs in UTF-16 and encoded in four bytes in UTF-8. </p><p>Within each plane, characters are allocated within named <i><a href="/w/index.php?title=Block_(Unicode)&action=edit&redlink=1" class="new" title="Block (Unicode) (страница отсутствует)">blocks</a></i> of related characters. Although blocks can be of arbitrary size, their size is always a multiple of 16 code points and often a multiple of 128 code points. Characters required for a given script may be spread out over several different blocks. 
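</p> <p>The relationship between planes and encoded length can be observed directly. The sketch below is illustrative only and assumes Python's built-in <code>utf-16-le</code> and <code>utf-8</code> codecs; the function names <code>plane</code>, <code>utf16_code_units</code> and <code>utf8_bytes</code> are hypothetical helpers, not terms from the Standard.</p>

```python
def plane(code_point: int) -> int:
    """Return the plane number (0 to 16): each plane spans 0x10000 code points."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode codespace")
    return code_point >> 16


def _check_encodable(code_point: int) -> None:
    # Surrogate code points are not valid characters and cannot be encoded.
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")


def utf16_code_units(code_point: int) -> int:
    """Number of 16-bit code units needed for this code point in UTF-16."""
    _check_encodable(code_point)
    return len(chr(code_point).encode("utf-16-le")) // 2


def utf8_bytes(code_point: int) -> int:
    """Number of bytes needed for this code point in UTF-8."""
    _check_encodable(code_point)
    return len(chr(code_point).encode("utf-8"))


if __name__ == "__main__":
    for cp in (0x41, 0xF7, 0x20AC, 0x1F600):
        print(f"U+{cp:04X}", plane(cp), utf16_code_units(cp), utf8_bytes(cp))
```

<p>For the sample values, the three BMP code points need one UTF-16 code unit and one, two and three UTF-8 bytes respectively, while U+1F600 in Plane 1 needs two code units and four bytes.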
</p> <h3><span class="mw-headline" id="General_Category_property">General Category property</span></h3> <p>Each code point has a single <a href="/w/index.php?title=Character_property_(Unicode)&action=edit&redlink=1" class="new" title="Character property (Unicode) (страница отсутствует)">General Category</a> property. The major categories are Letter, Mark, Number, Punctuation, Symbol, Separator, and Other. Each major category has subdivisions. In most cases, other properties must also be used to fully specify the characteristics of a code point. The possible General Categories are: </p><p><a href="/w/index.php?title=%D0%A8%D0%B0%D0%B1%D0%BB%D0%BE%D0%BD:General_Category_(Unicode)&action=edit&redlink=1" class="new" title="Шаблон:General Category (Unicode) (страница отсутствует)">Template:General Category (Unicode)</a> </p><p>Code points in the range U+D800–U+DBFF (1,024 code points) are known as high-<b>surrogate</b> code points, and code points in the range U+DC00–U+DFFF (1,024 code points) are known as low-surrogate code points. A high-surrogate code point followed by a low-surrogate code point forms a surrogate pair in <a href="/wiki/UTF-16" title="UTF-16">UTF-16</a>, which represents a code point greater than U+FFFF. Surrogate code points cannot otherwise be used, although this rule is often ignored in practice, especially when UTF-16 is not in use. 
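</p><p>The surrogate mechanism described above can be sketched as follows. The function names are hypothetical and the code is a minimal illustration of the UTF-16 arithmetic, checked against Python's built-in codec rather than offered as a definitive implementation.</p>

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x13254)
print(f"{high:04X} {low:04X}")  # D80C DE54

# Cross-check against the standard library's UTF-16 encoder.
expected = "\U00013254".encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

<p>Because each half carries 10 bits, the 1,024 high surrogates and 1,024 low surrogates together address exactly the 2<sup>20</sup> supplementary code points.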
</p><p>A small set of code points are guaranteed never to be used for encoding characters, although applications may make use of these code points internally if they wish. There are sixty-six of these <b>noncharacters</b>: U+FDD0–U+FDEF and any code point ending in the value FFFE or FFFF (i.e., U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF). The set of noncharacters is stable, and no new noncharacters will ever be defined.<sup id="cite_ref-stability-policy_53-0" class="reference"><a href="#cite_note-stability-policy-53">[51]</a></sup> Like surrogates, the rule that these cannot be used is often ignored, although the operation of the <a href="/wiki/Byte_order_mark" class="mw-redirect" title="Byte order mark">byte order mark</a> assumes that U+FFFE will never be the first code point in a text. </p><p>Excluding surrogates and noncharacters leaves 1,111,998 code points available for use. </p><p><b>Private-use</b> code points are considered to be assigned characters, but they have no interpretation specified by the Unicode standard<sup id="cite_ref-54" class="reference"><a href="#cite_note-54">[52]</a></sup> so any interchange of such characters requires an agreement between sender and receiver on their interpretation. There are three private-use areas in the Unicode codespace: </p> <ul><li>Private Use Area: U+E000–U+F8FF (6,400 characters)</li> <li>Supplementary Private Use Area-A: U+F0000–U+FFFFD (65,534 characters)</li> <li>Supplementary Private Use Area-B: U+100000–U+10FFFD (65,534 characters).</li></ul> <p>Graphic characters are characters defined by Unicode to have particular semantics, and either have a visible <a href="/w/index.php?title=Glyph&action=edit&redlink=1" class="new" title="Glyph (страница отсутствует)">glyph</a> shape or represent a visible space. As of Unicode 13.0 there are 143,696 graphic characters. 
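</p><p>The noncharacter rule and the count of code points available for use, both stated earlier in this section, can be verified by enumeration. The helper name <code>is_noncharacter</code> is an assumption made for this sketch.</p>

```python
def is_noncharacter(code_point):
    """True for U+FDD0..U+FDEF and for any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE


CODESPACE = 0x110000                # U+0000..U+10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1    # 2,048 surrogate code points

noncharacters = [cp for cp in range(CODESPACE) if is_noncharacter(cp)]
available = CODESPACE - SURROGATES - len(noncharacters)

print(len(noncharacters))  # 66
print(available)           # 1111998
```

<p>The 66 noncharacters consist of the 32 code points in U+FDD0–U+FDEF plus the last two code points of each of the seventeen planes.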
</p><p><b>Format</b> characters are characters that do not have a visible appearance, but may have an effect on the appearance or behavior of neighboring characters. For example, <span class="nowrap">U+200C</span> <a href="/w/index.php?title=Zero-width_non-joiner&action=edit&redlink=1" class="new" title="Zero-width non-joiner (страница отсутствует)"><span style="font-variant:small-caps; -moz-font-feature-settings:'smcp' 1; -ms-font-feature-settings:'smcp' 1; -o-font-feature-settings:'smcp' 1; -webkit-font-feature-settings:'smcp' 1; font-feature-settings:'smcp' 1;">zero-width non-joiner</span></a> and <span class="nowrap">U+200D</span> <a href="/w/index.php?title=Zero-width_joiner&action=edit&redlink=1" class="new" title="Zero-width joiner (страница отсутствует)"><span style="font-variant:small-caps; -moz-font-feature-settings:'smcp' 1; -ms-font-feature-settings:'smcp' 1; -o-font-feature-settings:'smcp' 1; -webkit-font-feature-settings:'smcp' 1; font-feature-settings:'smcp' 1;">zero-width joiner</span></a> may be used to change the default shaping behavior of adjacent characters (e.g., to inhibit ligatures or request ligature formation). There are 163 format characters in Unicode 13.0. </p><p>Sixty-five code points (U+0000–U+001F and U+007F–U+009F) are reserved as <b>control</b> codes, and correspond to the <a href="/w/index.php?title=C0_and_C1_control_codes&action=edit&redlink=1" class="new" title="C0 and C1 control codes (страница отсутствует)">C0 and C1 control codes</a> defined in <a href="/w/index.php?title=ISO/IEC_6429&action=edit&redlink=1" class="new" title="ISO/IEC 6429 (страница отсутствует)">ISO/IEC 6429</a>. U+0009 (Tab), U+000A (Line Feed), and U+000D (Carriage Return) are widely used in Unicode-encoded texts. 
In practice, C1 code points that appear in text are often improperly translated (<a href="/w/index.php?title=Mojibake&action=edit&redlink=1" class="new" title="Mojibake (страница отсутствует)">mojibake</a>) legacy <a href="/wiki/Windows-1252" class="mw-redirect" title="Windows-1252">Windows-1252</a> characters from English and Western European texts produced with Windows technologies. </p><p>Graphic characters, format characters, control code characters, and private use characters are known collectively as <i>assigned characters</i>. <b>Reserved</b> code points are those code points which are available for use, but are not yet assigned. As of Unicode 13.0 there are 830,606 reserved code points. </p> <h3><span class="mw-headline" id="Abstract_characters">Abstract characters</span></h3> <p>The set of graphic and format characters defined by Unicode does not correspond directly to the repertoire of <i>abstract characters</i> that is representable under Unicode. Unicode encodes characters by associating an abstract character with a particular code point.<sup id="cite_ref-55" class="reference"><a href="#cite_note-55">[53]</a></sup> However, not all abstract characters are encoded as a single Unicode character, and some abstract characters may be represented in Unicode by a sequence of two or more characters. 
For example, a Latin small letter "i" with an <a href="/w/index.php?title=Ogonek&action=edit&redlink=1" class="new" title="Ogonek (страница отсутствует)">ogonek</a>, a <a href="/w/index.php?title=Dot_above&action=edit&redlink=1" class="new" title="Dot above (страница отсутствует)">dot above</a>, and an <a href="/w/index.php?title=Acute_accent&action=edit&redlink=1" class="new" title="Acute accent (страница отсутствует)">acute accent</a>, which is required in <a href="/w/index.php?title=Lithuanian_language&action=edit&redlink=1" class="new" title="Lithuanian language (страница отсутствует)">Lithuanian</a>, is represented by the character sequence U+012F, U+0307, U+0301. Unicode maintains a list of uniquely named character sequences for abstract characters that are not directly encoded in Unicode.<sup id="cite_ref-56" class="reference"><a href="#cite_note-56">[54]</a></sup> </p><p>All graphic, format, and private use characters have a unique and immutable name by which they may be identified. This immutability has been guaranteed since Unicode version 2.0 by the Name Stability policy.<sup id="cite_ref-stability-policy_53-1" class="reference"><a href="#cite_note-stability-policy-53">[51]</a></sup> In cases where the name is seriously defective and misleading, or has a serious typographical error, a formal alias may be defined, and applications are encouraged to use the formal alias in place of the official character name. 
For example, <span class="nowrap">U+A015</span> <span style="font-size:125%">ꀕ</span> <span style="font-variant:small-caps; -moz-font-feature-settings:'smcp' 1; -ms-font-feature-settings:'smcp' 1; -o-font-feature-settings:'smcp' 1; -webkit-font-feature-settings:'smcp' 1; font-feature-settings:'smcp' 1;">yi syllable wu</span> has the formal alias <span style="font-variant:small-caps; -moz-font-feature-settings:'smcp' 1; -ms-font-feature-settings:'smcp' 1; -o-font-feature-settings:'smcp' 1; -webkit-font-feature-settings:'smcp' 1; font-feature-settings:'smcp' 1;">yi syllable iteration mark</span>, and <span class="nowrap">U+FE18</span> <span style="font-size:125%">︘</span> <span style="font-variant:small-caps; -moz-font-feature-settings:'smcp' 1; -ms-font-feature-settings:'smcp' 1; -o-font-feature-settings:'smcp' 1; -webkit-font-feature-settings:'smcp' 1; font-feature-settings:'smcp' 1;">presentation form for vertical right white lenticular bra<b>kc</b>et</span> (<a href="/wiki/Sic" title="Sic">sic</a>) has the formal alias <span style="font-variant:small-caps; -moz-font-feature-settings:'smcp' 1; -ms-font-feature-settings:'smcp' 1; -o-font-feature-settings:'smcp' 1; -webkit-font-feature-settings:'smcp' 1; font-feature-settings:'smcp' 1;">presentation form for vertical right white lenticular bracket</span>.<sup id="cite_ref-57" class="reference"><a href="#cite_note-57">[55]</a></sup> </p> <h3><span class="mw-headline" id="Ready-made_versus_composite_characters">Ready-made versus composite characters</span></h3> <p>Unicode includes a mechanism for modifying characters that greatly extends the supported glyph repertoire. 
This covers the use of <a href="/w/index.php?title=Combining_diacritical_mark&action=edit&redlink=1" class="new" title="Combining diacritical mark (страница отсутствует)">combining diacritical marks</a> that may be added after the base character by the user. Multiple combining diacritics may be simultaneously applied to the same character. Unicode also contains <a href="/w/index.php?title=Precomposed_character&action=edit&redlink=1" class="new" title="Precomposed character (страница отсутствует)">precomposed</a> versions of most letter/diacritic combinations in normal use. These make conversion to and from legacy encodings simpler, and allow applications to use Unicode as an internal text format without having to implement combining characters. For example, <i>é</i> can be represented in Unicode as <a href="#Upluslink">U+</a>0065 (<span style="font-variant:small-caps; -moz-font-feature-settings:'smcp' 1; -ms-font-feature-settings:'smcp' 1; -o-font-feature-settings:'smcp' 1; -webkit-font-feature-settings:'smcp' 1; font-feature-settings:'smcp' 1;">latin small letter e</span>) followed by U+0301 (<span style="font-variant:small-caps; -moz-font-feature-settings:'smcp' 1; -ms-font-feature-settings:'smcp' 1; -o-font-feature-settings:'smcp' 1; -webkit-font-feature-settings:'smcp' 1; font-feature-settings:'smcp' 1;">combining acute accent</span>), but it can also be represented as the precomposed character U+00E9 (<span style="font-variant:small-caps; -moz-font-feature-settings:'smcp' 1; -ms-font-feature-settings:'smcp' 1; -o-font-feature-settings:'smcp' 1; -webkit-font-feature-settings:'smcp' 1; font-feature-settings:'smcp' 1;">latin small letter e with acute</span>). Thus, in many cases, users have multiple ways of encoding the same character. To deal with this, Unicode provides the mechanism of <a href="/w/index.php?title=Canonical_equivalence&action=edit&redlink=1" class="new" title="Canonical equivalence (страница отсутствует)">canonical equivalence</a>. </p><p>An example of this arises with <a href="/w/index.php?title=Hangul&action=edit&redlink=1" class="new" title="Hangul (страница отсутствует)">Hangul</a>, the Korean alphabet. 
Unicode provides a mechanism for composing Hangul syllables with their individual subcomponents, known as <a href="/w/index.php?title=Hangul_Jamo&action=edit&redlink=1" class="new" title="Hangul Jamo (страница отсутствует)">Hangul Jamo</a>. However, it also provides 11,172 combinations of precomposed syllables made from the most common jamo. </p><p>The <a href="/wiki/CJK" class="mw-redirect" title="CJK">CJK</a> characters currently have codes only for their precomposed form. Still, most of those characters comprise simpler elements (called <a href="/w/index.php?title=Radical_(Chinese_characters)&action=edit&redlink=1" class="new" title="Radical (Chinese characters) (страница отсутствует)">radicals</a>), so in principle Unicode could have decomposed them as it did with Hangul. This would have greatly reduced the number of required code points, while allowing the display of virtually every conceivable character (which might do away with some of the problems caused by <a href="/w/index.php?title=Han_unification&action=edit&redlink=1" class="new" title="Han unification (страница отсутствует)">Han unification</a>). A similar idea is used by some <a href="/w/index.php?title=Input_method&action=edit&redlink=1" class="new" title="Input method (страница отсутствует)">input methods</a>, such as <a href="/w/index.php?title=Cangjie_method&action=edit&redlink=1" class="new" title="Cangjie method (страница отсутствует)">Cangjie</a> and <a href="/w/index.php?title=Wubi_method&action=edit&redlink=1" class="new" title="Wubi method (страница отсутствует)">Wubi</a>. However, attempts to do this for character encoding have stumbled over the fact that Chinese characters do not decompose as simply or as regularly as Hangul does. 
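</p><p>Canonical equivalence between combining sequences and precomposed characters, for both the Latin and the Hangul cases described above, can be observed with Python's standard <code>unicodedata</code> module. This is a minimal sketch: the sample characters were chosen here for illustration, and the factors 19, 21 and 28 are the jamo counts used by the Unicode Hangul syllable composition algorithm, not figures stated in this article.</p>

```python
import unicodedata

# Latin example: e + combining acute accent versus precomposed e-acute.
decomposed = "e\u0301"
precomposed = "\u00e9"
print(decomposed == precomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

# Hangul example: one precomposed syllable decomposes into three jamo.
syllable = "\ud55c"  # HANGUL SYLLABLE HAN
jamo = unicodedata.normalize("NFD", syllable)
print([f"U+{ord(c):04X}" for c in jamo])  # ['U+1112', 'U+1161', 'U+11AB']
print(unicodedata.normalize("NFC", jamo) == syllable)           # True

# 19 leading consonants x 21 vowels x 28 trailing options (including none).
precomposed_syllable_count = 19 * 21 * 28
print(precomposed_syllable_count)  # 11172
```

<p>Normalizing both sides to the same form (NFC or NFD) before comparing is the usual way to treat canonically equivalent strings as equal.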
</p><p>A set of <a href="/w/index.php?title=Radical_(Chinese_character)&action=edit&redlink=1" class="new" title="Radical (Chinese character) (страница отсутствует)">radicals</a> was provided in Unicode 3.0 (CJK radicals between U+2E80 and U+2EFF, KangXi radicals in U+2F00 to U+2FDF, and ideographic description characters from U+2FF0 to U+2FFB), but the Unicode standard (ch. 12.2 of Unicode 5.2) warns against using <a href="/w/index.php?title=Ideographic_Description_Sequences&action=edit&redlink=1" class="new" title="Ideographic Description Sequences (страница отсутствует)">ideographic description sequences</a> as an alternate representation for previously encoded characters: </p> <style data-mw-deduplicate="TemplateStyles:r104610251">.mw-parser-output .ts-Цитата-container{margin:auto;border-collapse:collapse;display:flex;justify-content:center}.mw-parser-output .ts-Цитата-quote{font-style:italic}.mw-parser-output .ts-Цитата-container cite{display:block;float:right;font-style:normal}.mw-parser-output .ts-Цитата-leftQuote,.mw-parser-output .ts-Цитата-rightQuote{width:30px;padding-right:10px}.mw-parser-output .ts-Цитата-leftQuote{vertical-align:top}.mw-parser-output .ts-Цитата-rightQuote{vertical-align:bottom}.mw-parser-output .ts-Цитата-container .ts-oq .NavFrame{padding:0.25em 0 0}</style><table class="ts-Цитата-container"><tbody><tr><td class="ts-Цитата-leftQuote"><img alt="«" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/44/Aquote1.png/30px-Aquote1.png" decoding="async" width="30" height="23" srcset="//upload.wikimedia.org/wikipedia/commons/4/44/Aquote1.png 1.5x" data-file-width="40" data-file-height="30" /></td><td class="ts-Цитата-quote">This process is different from a formal <i>encoding</i> of an ideograph. There is no canonical description of unencoded ideographs; there is no semantic assigned to described ideographs; there is no equivalence defined for described ideographs. 
Conceptually, ideographic descriptions are more akin to the English phrase "an 'e' with an acute accent on it" than to the character sequence &lt;U+0065, U+0301&gt;.</td><td class="ts-Цитата-rightQuote"><img alt="»" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/49/Aquote2.png/30px-Aquote2.png" decoding="async" width="30" height="23" srcset="//upload.wikimedia.org/wikipedia/commons/4/49/Aquote2.png 1.5x" data-file-width="40" data-file-height="30" /></td></tr></tbody></table> <h3><span class="mw-headline" id="Ligatures">Ligatures</span></h3> <p>Many scripts, including <a href="/w/index.php?title=Arabic_script&action=edit&redlink=1" class="new" title="Arabic script (страница отсутствует)">Arabic</a> and <a href="/w/index.php?title=Devanagari&action=edit&redlink=1" class="new" title="Devanagari (страница отсутствует)">Devanāgarī</a>, have special orthographic rules that require certain combinations of letterforms to be combined into special <a href="/w/index.php?title=Ligature_(typography)&action=edit&redlink=1" class="new" title="Ligature (typography) (страница отсутствует)">ligature forms</a>. 
The rules governing ligature formation can be quite complex, requiring special script-shaping technologies such as ACE (Arabic Calligraphic Engine by DecoType in the 1980s and used to generate all the Arabic examples in the printed editions of the Unicode Standard), which became the <a href="/wiki/Proof_of_concept" class="mw-redirect" title="Proof of concept">proof of concept</a> for <a href="/wiki/OpenType" title="OpenType">OpenType</a> (by Adobe and Microsoft), <a href="/w/index.php?title=Graphite_(SIL)&action=edit&redlink=1" class="new" title="Graphite (SIL) (страница отсутствует)">Graphite</a> (by <a href="/wiki/SIL_International" title="SIL International">SIL International</a>), or <a href="/w/index.php?title=Apple_Advanced_Typography&action=edit&redlink=1" class="new" title="Apple Advanced Typography (страница отсутствует)">AAT</a> (by Apple). </p><p>Instructions are also embedded in fonts to tell the <a href="/w/index.php?title=Operating_system&action=edit&redlink=1" class="new" title="Operating system (страница отсутствует)">operating system</a> how to properly output different character sequences. A simple solution to the placement of combining marks or diacritics is assigning the marks a width of zero and placing the glyph itself to the left or right of the left sidebearing (depending on the direction of the script they are intended to be used with). A mark handled this way will appear over whatever character precedes it, but will not adjust its position relative to the width or height of the base glyph; it may be visually awkward and it may overlap some glyphs. Real stacking is impossible, but can be approximated in limited cases (for example, Thai top-combining vowels and tone marks can just be at different heights to start with). Generally this approach is only effective in monospaced fonts, but may be used as a fallback rendering method when more complex methods fail. 
</p> <h3><span class="mw-headline" id="Standardized_subsets">Standardized subsets</span></h3> <p>Several subsets of Unicode are standardized: Microsoft Windows since <a href="/wiki/Windows_NT_4.0" title="Windows NT 4.0">Windows NT 4.0</a> supports <a href="/w/index.php?title=WGL-4&action=edit&redlink=1" class="new" title="WGL-4 (страница отсутствует)">WGL-4</a> with 656 characters, which is considered to support all contemporary European languages using the Latin, Greek, or Cyrillic script. Other standardized subsets of Unicode include the Multilingual European Subsets:<sup id="cite_ref-58" class="reference"><a href="#cite_note-58">[56]</a></sup> </p><p>MES-1 (Latin scripts only; 335 characters), MES-2 (Latin, Greek, and Cyrillic; 1,062 characters),<sup id="cite_ref-59" class="reference"><a href="#cite_note-59">[57]</a></sup> and MES-3A and MES-3B (two larger subsets, not shown here). Note that MES-2 includes every character in MES-1 and WGL-4. 
</p> <table class="wikitable"> <caption><span style="font-weight:normal;"><b>WGL-4</b>, <i>MES-1</i> and MES-2</span> </caption> <tbody><tr> <th>Row</th> <th>Cells</th> <th>Range(s) </th></tr> <tr> <th rowspan="2">00 </th> <td><i><b>20–7E</b></i> </td> <td><a href="/w/index.php?title=Basic_Latin_(Unicode_block)&action=edit&redlink=1" class="new" title="Basic Latin (Unicode block) (страница отсутствует)">Basic Latin</a> (00–7F) </td></tr> <tr> <td><i><b>A0–FF</b></i> </td> <td><a href="/w/index.php?title=Latin-1_Supplement_(Unicode_block)&action=edit&redlink=1" class="new" title="Latin-1 Supplement (Unicode block) (страница отсутствует)">Latin-1 Supplement</a> (80–FF) </td></tr> <tr> <th rowspan="2">01 </th> <td><b><i>00–13,</i> 14–15, <i>16–2B,</i> 2C–2D, <i>2E–4D,</i> 4E–4F, <i>50–7E,</i> 7F</b> </td> <td><a href="/wiki/Latin_Extended-A" class="mw-redirect" title="Latin Extended-A">Latin Extended-A</a> (00–7F) </td></tr> <tr> <td>8F, <b>92,</b> B7, DE-EF, <b>FA–FF</b> </td> <td><a href="/wiki/Latin_Extended-B" class="mw-redirect" title="Latin Extended-B">Latin Extended-B</a> (80–FF <span title="U+024F">...</span>) </td></tr> <tr> <th rowspan="3">02 </th> <td>18–1B, 1E–1F </td> <td>Latin Extended-B (<span title="U+00180">...</span> 00–4F) </td></tr> <tr> <td>59, 7C, 92 </td> <td><a href="/wiki/IPA_Extensions" class="mw-redirect" title="IPA Extensions">IPA Extensions</a> (50–AF) </td></tr> <tr> <td>BB–BD, <b>C6, <i>C7,</i> C9,</b> D6, <b><i>D8–DB,</i> DC, <i>DD,</i></b> DF, EE </td> <td><a href="/wiki/Spacing_Modifier_Letters" class="mw-redirect" title="Spacing Modifier Letters">Spacing Modifier Letters</a> (B0–FF) </td></tr> <tr> <th>03 </th> <td>74–75, 7A, 7E, <b>84–8A, 8C, 8E–A1, A3–CE,</b> D7, DA–E1 </td> <td><a href="/wiki/Greek_and_Coptic" class="mw-redirect" title="Greek and Coptic">Greek</a> (70–FF) </td></tr> <tr> <th>04 </th> <td><b>00–5F, 90–91,</b> 92–C4, C7–C8, CB–CC, D0–EB, EE–F5, F8–F9 </td> <td><a 
href="/w/index.php?title=Cyrillic_(Unicode_block)&action=edit&redlink=1" class="new" title="Cyrillic (Unicode block) (страница отсутствует)">Cyrillic</a> (00–FF) </td></tr> <tr> <th>1E </th> <td>02–03, 0A–0B, 1E–1F, 40–41, 56–57, 60–61, 6A–6B, <b>80–85,</b> 9B, <b>F2–F3</b> </td> <td><a href="/wiki/Latin_Extended_Additional" class="mw-redirect" title="Latin Extended Additional">Latin Extended Additional</a> (00–FF) </td></tr> <tr> <th>1F </th> <td>00–15, 18–1D, 20–45, 48–4D, 50–57, 59, 5B, 5D, 5F–7D, 80–B4, B6–C4, C6–D3, D6–DB, DD–EF, F2–F4, F6–FE </td> <td><a href="/w/index.php?title=Greek_Extended&action=edit&redlink=1" class="new" title="Greek Extended (страница отсутствует)">Greek Extended</a> (00–FF) </td></tr> <tr> <th rowspan="3">20 </th> <td><b>13–14, <i>15,</i> 17, <i>18–19,</i> 1A–1B, <i>1C–1D,</i> 1E, 20–22, 26, 30, 32–33, 39–3A, 3C, 3E, 44,</b> 4A </td> <td><a href="/w/index.php?title=General_Punctuation&action=edit&redlink=1" class="new" title="General Punctuation (страница отсутствует)">General Punctuation</a> (00–6F) </td></tr> <tr> <td><b>7F</b>, 82 </td> <td><a href="/wiki/Superscripts_and_Subscripts" class="mw-redirect" title="Superscripts and Subscripts">Superscripts and Subscripts</a> (70–9F) </td></tr> <tr> <td><b>A3–A4, A7, <i>AC,</i></b> AF </td> <td><a href="/w/index.php?title=Currency_Symbols_(Unicode_block)&action=edit&redlink=1" class="new" title="Currency Symbols (Unicode block) (страница отсутствует)">Currency Symbols</a> (A0–CF) </td></tr> <tr> <th rowspan="3">21 </th> <td><b>05, 13, 16, <i>22, 26,</i> 2E</b> </td> <td><a href="/w/index.php?title=Letterlike_Symbols&action=edit&redlink=1" class="new" title="Letterlike Symbols (страница отсутствует)">Letterlike Symbols</a> (00–4F) </td></tr> <tr> <td><i><b>5B–5E</b></i> </td> <td><a href="/w/index.php?title=Number_Forms&action=edit&redlink=1" class="new" title="Number Forms (страница отсутствует)">Number Forms</a> (50–8F) </td></tr> <tr> <td><b><i>90–93,</i> 94–95, A8</b> </td> <td><a 
href="/w/index.php?title=Arrows_(Unicode_block)&action=edit&redlink=1" class="new" title="Arrows (Unicode block) (страница отсутствует)">Arrows</a> (90–FF) </td></tr> <tr> <th>22 </th> <td>00, <b>02,</b> 03, <b>06,</b> 08–09, <b>0F, 11–12, 15, 19–1A, 1E–1F,</b> 27–28, <b>29,</b> 2A, <b>2B, 48,</b> 59, <b>60–61, 64–65,</b> 82–83, 95, 97 </td> <td><a href="/w/index.php?title=Mathematical_Operators&action=edit&redlink=1" class="new" title="Mathematical Operators (страница отсутствует)">Mathematical Operators</a> (00–FF) </td></tr> <tr> <th>23 </th> <td><b>02, 0A, 20–21,</b> 29–2A </td> <td><a href="/w/index.php?title=Miscellaneous_Technical&action=edit&redlink=1" class="new" title="Miscellaneous Technical (страница отсутствует)">Miscellaneous Technical</a> (00–FF) </td></tr> <tr> <th rowspan="3">25 </th> <td><b>00, 02, 0C, 10, 14, 18, 1C, 24, 2C, 34, 3C, 50–6C</b> </td> <td><a href="/w/index.php?title=Box_Drawing&action=edit&redlink=1" class="new" title="Box Drawing (страница отсутствует)">Box Drawing</a> (00–7F) </td></tr> <tr> <td><b>80, 84, 88, 8C, 90–93</b> </td> <td><a href="/w/index.php?title=Block_Elements&action=edit&redlink=1" class="new" title="Block Elements (страница отсутствует)">Block Elements</a> (80–9F) </td></tr> <tr> <td><b>A0–A1, AA–AC, B2, BA, BC, C4, CA–CB, CF, D8–D9, E6</b> </td> <td><a href="/w/index.php?title=Geometric_Shapes&action=edit&redlink=1" class="new" title="Geometric Shapes (страница отсутствует)">Geometric Shapes</a> (A0–FF) </td></tr> <tr> <th>26 </th> <td><b>3A–3C, 40, 42, 60, 63, 65–66, <i>6A,</i> 6B</b> </td> <td><a href="/w/index.php?title=Miscellaneous_Symbols&action=edit&redlink=1" class="new" title="Miscellaneous Symbols (страница отсутствует)">Miscellaneous Symbols</a> (00–FF) </td></tr> <tr> <th>F0 </th> <td>(01–02) </td> <td><a href="/w/index.php?title=Private_Use_Area_(Unicode_block)&action=edit&redlink=1" class="new" title="Private Use Area (Unicode block) (страница отсутствует)">Private Use Area</a> (00–FF ...) 
</td></tr> <tr> <th>FB </th> <td><b>01–02</b> </td> <td><a href="/w/index.php?title=Alphabetic_Presentation_Forms&action=edit&redlink=1" class="new" title="Alphabetic Presentation Forms (страница отсутствует)">Alphabetic Presentation Forms</a> (00–4F) </td></tr> <tr> <th>FF </th> <td>FD </td> <td><a href="/w/index.php?title=Specials_(Unicode_block)&action=edit&redlink=1" class="new" title="Specials (Unicode block) (страница отсутствует)">Specials</a> </td></tr></tbody></table> <p>Rendering software which cannot process a Unicode character appropriately often displays it as an open rectangle, or the Unicode "<a href="/w/index.php?title=Replacement_character&action=edit&redlink=1" class="new" title="Replacement character (страница отсутствует)">replacement character</a>" (U+FFFD, �), to indicate the position of the unrecognized character. Some systems have made attempts to provide more information about such characters. Apple's <a href="/w/index.php?title=Last_Resort_font&action=edit&redlink=1" class="new" title="Last Resort font (страница отсутствует)">Last Resort font</a> will display a substitute glyph indicating the Unicode range of the character, and the <a href="/wiki/SIL_International" title="SIL International">SIL International</a>'s <a href="/w/index.php?title=Unicode_fallback_font&action=edit&redlink=1" class="new" title="Unicode fallback font (страница отсутствует)">Unicode Fallback</a> font will display a box showing the hexadecimal scalar value of the character. 
</p> <h3><span class="mw-headline" id="Mapping_and_encodings"><span id="UCS"><span id="UTF"></span></span>Mapping and encodings</span></h3> <p>Several mechanisms have been specified for storing a series of code points as a series of bytes. </p><p>Unicode defines two mapping methods: the <i>Unicode Transformation Format</i> (UTF) encodings, and the <i><a href="/w/index.php?title=Universal_Coded_Character_Set&action=edit&redlink=1" class="new" title="Universal Coded Character Set (страница отсутствует)">Universal Coded Character Set</a></i> (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode <i>code points</i> to sequences of values in some fixed-size range, termed <i>code units</i>. All UTF encodings map code points to a unique sequence of bytes.<sup id="cite_ref-60" class="reference"><a href="#cite_note-60">[58]</a></sup> The numbers in the names of the encodings indicate the number of bits per code unit (for UTF encodings) or the number of bytes per code unit (for UCS encodings and UTF-1). UTF-8 and UTF-16 are the most commonly used encodings. UCS-2 is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent. 
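</p><p>The distinction between code points and code units can be made concrete by encoding the same text in three UTF encodings and splitting the result into units of the corresponding width. The helper <code>code_units</code> is a hypothetical name introduced for this sketch; big-endian codecs are used only so that the units print in a readable order.</p>

```python
def code_units(text, encoding, unit_bytes):
    """Encode text and return its code units as a list of integers."""
    data = text.encode(encoding)
    return [
        int.from_bytes(data[i:i + unit_bytes], "big")
        for i in range(0, len(data), unit_bytes)
    ]


sample = "\u00f7\U00013254"  # two code points: one in the BMP, one in Plane 1

utf8_units = code_units(sample, "utf-8", 1)       # 8-bit code units
utf16_units = code_units(sample, "utf-16-be", 2)  # 16-bit code units
utf32_units = code_units(sample, "utf-32-be", 4)  # 32-bit code units

print([f"{u:02X}" for u in utf8_units])   # ['C3', 'B7', 'F0', '93', '89', '94']
print([f"{u:04X}" for u in utf16_units])  # ['00F7', 'D80C', 'DE54']
print([f"{u:08X}" for u in utf32_units])  # ['000000F7', '00013254']
```

<p>Only in UTF-32 does the number of code units equal the number of code points; UTF-8 and UTF-16 use a variable number of code units per code point.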
</p><p>UTF encodings include: </p> <ul><li><a href="/w/index.php?title=UTF-1&action=edit&redlink=1" class="new" title="UTF-1 (страница отсутствует)">UTF-1</a>, a retired predecessor of UTF-8, maximizes compatibility with <a href="/w/index.php?title=ISO/IEC_2022&action=edit&redlink=1" class="new" title="ISO/IEC 2022 (страница отсутствует)">ISO 2022</a>, no longer part of <i>The Unicode Standard</i></li> <li><a href="/wiki/UTF-7" title="UTF-7">UTF-7</a>, a 7-bit encoding sometimes used in e-mail, often considered obsolete (not part of <i>The Unicode Standard</i>, but only documented as an informational <a href="/wiki/Request_for_Comments" class="mw-redirect" title="Request for Comments">RFC</a>, i.e., not on the Internet Standards Track)</li> <li><a href="/wiki/UTF-8" title="UTF-8">UTF-8</a>, uses one to four bytes for each code point, maximizes compatibility with <a href="/wiki/ASCII" title="ASCII">ASCII</a></li> <li><a href="/wiki/UTF-EBCDIC" title="UTF-EBCDIC">UTF-EBCDIC</a>, similar to UTF-8 but designed for compatibility with <a href="/wiki/EBCDIC" title="EBCDIC">EBCDIC</a> (not part of <i>The Unicode Standard</i>)</li> <li><a href="/wiki/UTF-16" title="UTF-16">UTF-16</a>, uses one or two 16-bit code units per code point, cannot encode surrogates</li> <li><a href="/wiki/UTF-32" title="UTF-32">UTF-32</a>, uses one 32-bit code unit per code point</li></ul> <p>UTF-8 uses one to four bytes per code point and, being compact for Latin scripts and ASCII-compatible, provides the <i>de facto</i> standard encoding for interchange of Unicode text. It is used by <a href="/wiki/FreeBSD" title="FreeBSD">FreeBSD</a> and most recent <a href="/w/index.php?title=Linux_distributions&action=edit&redlink=1" class="new" title="Linux distributions (страница отсутствует)">Linux distributions</a> as a direct replacement for legacy encodings in general text handling. 
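</p><p>How UTF-8 arrives at one to four bytes per code point can be sketched with a hand-written encoder. The function <code>utf8_encode</code> is a hypothetical name; it is an illustration of the bit layout only and is checked against Python's built-in codec rather than intended as a replacement for it.</p>

```python
def utf8_encode(code_point):
    """Encode a single code point as UTF-8 bytes (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode codespace")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x80:        # 7 bits: one byte, identical to ASCII
        return bytes([code_point])
    if code_point < 0x800:       # 11 bits: two bytes
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 16 bits: three bytes (rest of the BMP)
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),   # 21 bits: four bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Cross-check the sketch against the standard library codec.
for cp in (0x41, 0xF7, 0x20AC, 0x13254, 0x10FFFF):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "))
```

<p>The one-byte case is what makes UTF-8 ASCII-compatible: every code point below U+0080 is encoded as the same single byte it has in ASCII.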
</p><p>The UCS-2 and UTF-16 encodings specify the Unicode Byte Order Mark (BOM) for use at the beginnings of text files, which may be used for byte ordering detection (or <a href="/wiki/Endianness" class="mw-redirect" title="Endianness">byte endianness</a> detection). The BOM, code point U+FEFF, has the important property of unambiguity on byte reorder, regardless of the Unicode encoding used: U+FFFE (the result of byte-swapping U+FEFF) does not equate to a legal character, and U+FEFF anywhere other than the beginning of text conveys the zero-width no-break space (a character with no appearance and no effect other than preventing the formation of ligatures). </p><p>The same character converted to UTF-8 becomes the byte sequence <code>EF BB BF</code>. The Unicode Standard allows that the BOM "can serve as signature for UTF-8 encoded text where the character set is unmarked".<sup id="cite_ref-61" class="reference"><a href="#cite_note-61">[59]</a></sup> Some software developers have adopted it for other encodings, including UTF-8, in an attempt to distinguish UTF-8 from local 8-bit code pages. However, RFC 3629, the UTF-8 standard, recommends that byte order marks be forbidden in protocols using UTF-8, but discusses the cases where this may not be possible. 
In addition, the large restriction on possible patterns in UTF-8 (for instance, there cannot be any lone bytes with the high bit set) means that it should be possible to distinguish UTF-8 from other character encodings without relying on the BOM. </p><p>In UTF-32 and UCS-4, one <a href="/wiki/32-bit" class="mw-redirect" title="32-bit">32-bit</a> code unit serves as a fairly direct representation of any character's code point (although the endianness, which varies across different platforms, affects how the code unit manifests as a byte sequence). In the other encodings, each code point may be represented by a variable number of code units. UTF-32 is widely used as an internal representation of text in programs (as opposed to stored or transmitted text), since every Unix operating system that uses the <a href="/wiki/GNU_Compiler_Collection" title="GNU Compiler Collection">gcc</a> compilers to generate software uses it as the standard "wide character" encoding. Some programming languages, such as <a href="/wiki/Seed7" title="Seed7">Seed7</a>, use UTF-32 as the internal representation for strings and characters. Recent versions of the Python programming language (beginning with 2.2) may also be configured to use UTF-32 as the representation for Unicode strings, effectively disseminating such encoding in high-level coded software. 
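</p><p>The byte order mark and the strictness of UTF-8 described above can be combined into a simple detection heuristic. The following Python sketch is only one possible approach; the function name <code>sniff_encoding</code> and the returned labels are assumptions made for this example, and real-world detectors are considerably more elaborate:</p>

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Guess an encoding label from a leading BOM, else test for well-formed UTF-8."""
    # UTF-32 is checked before UTF-16 because the UTF-32-LE BOM (FF FE 00 00)
    # begins with the same two bytes as the UTF-16-LE BOM (FF FE).
    candidates = (
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    )
    for bom, label in candidates:
        if data.startswith(bom):
            return label
    try:
        # Strict decoding fails on byte patterns that UTF-8 forbids,
        # e.g. a lone byte with the high bit set.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown-8-bit"


print(sniff_encoding("\ufeffhello".encode("utf-8")))
print(sniff_encoding(b"caf\xe9"))
```

<p>In this sketch, text in a legacy 8-bit code page is usually rejected by the strict UTF-8 decoder, which is the property the paragraph above relies on; pure ASCII input is reported as UTF-8 because ASCII is a subset of it. 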
</p><p><a href="/wiki/Punycode" title="Punycode">Punycode</a>, another encoding form, enables the encoding of Unicode strings into the limited character set supported by the <a href="/wiki/ASCII" title="ASCII">ASCII</a>-based <a href="/wiki/Domain_Name_System" class="mw-redirect" title="Domain Name System">Domain Name System</a> (DNS). The encoding is used as part of IDNA, which is a system enabling the use of <a href="/wiki/Internationalized_Domain_Names" class="mw-redirect" title="Internationalized Domain Names">Internationalized Domain Names</a> in all scripts that are supported by Unicode. Earlier and now historical proposals include UTF-5 and UTF-6. </p><p>GB18030 is another encoding form for Unicode, from the Standardization Administration of China. It is the official character set of the People's Republic of China (PRC). 
BOCU-1 and SCSU are Unicode compression schemes. The April Fools' Day RFC of 2005 specified two parody UTF encodings, UTF-9 and UTF-18. </p> <h2><span class="mw-headline" id="Adoption">Adoption</span></h2> <h3><span class="mw-headline" id="Operating_systems">Operating systems</span></h3> <p>Unicode has become the dominant scheme for internal processing and storage of text. Although a great deal of text is still stored in legacy encodings, Unicode is used almost exclusively for building new information processing systems. Early adopters tended to use <a href="/wiki/UCS-2" class="mw-redirect" title="UCS-2">UCS-2</a> (the fixed-width two-byte precursor to UTF-16) and later moved to <a href="/wiki/UTF-16" title="UTF-16">UTF-16</a> (the variable-width current standard), as this was the least disruptive way to add support for non-BMP characters. The best known such system is <a href="/wiki/Windows_NT" title="Windows NT">Windows NT</a> (and its descendants, <a href="/wiki/Windows_2000" title="Windows 2000">Windows 2000</a>, <a href="/wiki/Windows_XP" title="Windows XP">Windows XP</a>, <a href="/wiki/Windows_Vista" title="Windows Vista">Windows Vista</a>, <a href="/wiki/Windows_7" title="Windows 7">Windows 7</a>, <a href="/wiki/Windows_8" title="Windows 8">Windows 8</a> and <a href="/wiki/Windows_10" title="Windows 10">Windows 10</a>), which uses UTF-16 as the sole internal character encoding. The <a href="/wiki/Java_virtual_machine" class="mw-redirect" title="Java virtual machine">Java</a> and <a href="/wiki/.NET_Framework" title=".NET Framework">.NET</a> bytecode environments, <a href="/wiki/MacOS" title="MacOS">macOS</a>, and <a href="/wiki/KDE" title="KDE">KDE</a> also use it for internal representation. Partial support for Unicode can be installed on <a href="/wiki/Windows_9x" title="Windows 9x">Windows 9x</a> through the Microsoft Layer for Unicode. 
</p><p><a href="/wiki/UTF-8" title="UTF-8">UTF-8</a> (originally developed for Plan 9)<sup id="cite_ref-62" class="reference"><a href="#cite_note-62">[60]</a></sup> has become the main storage encoding on most <a href="/wiki/Unix-like" class="mw-redirect" title="Unix-like">Unix-like</a> operating systems (though others are also used by some libraries) because it is a relatively easy replacement for traditional extended ASCII character sets. UTF-8 is also the most common Unicode encoding used in <a href="/wiki/HTML" title="HTML">HTML</a> documents on the <a href="/wiki/World_Wide_Web" class="mw-redirect" title="World Wide Web">World Wide Web</a>. </p><p>Multilingual text-rendering engines which use Unicode include Uniscribe and <a href="/wiki/DirectWrite" title="DirectWrite">DirectWrite</a> for Microsoft Windows, ATSUI and Core Text for macOS, and <a href="/wiki/Pango" title="Pango">Pango</a> for <a href="/wiki/GTK%2B" class="mw-redirect" title="GTK+">GTK+</a> and the <a href="/wiki/GNOME" title="GNOME">GNOME</a> desktop. 
</p> <h3><span class="mw-headline" id="Input_methods">Input methods</span></h3> <div class="hatnote">Main article: <b>Unicode input</b></div> <p>Because keyboard layouts cannot have simple key combinations for all characters, several operating systems provide alternative input methods that allow access to the entire repertoire. </p><p>ISO/IEC 14755,<sup id="cite_ref-63" class="reference"><a href="#cite_note-63">[61]</a></sup> which standardises methods for entering Unicode characters from their code points, specifies several methods. There is the <i>Basic method</i>, where a <i>beginning sequence</i> is followed by the hexadecimal representation of the code point and the <i>ending sequence</i>. There is also a <i>screen-selection entry method</i> specified, where the characters are listed in a table on a screen, such as with a character map program. </p><p>Online tools for finding the code point for a known character include Unicode Lookup<sup id="cite_ref-64" class="reference"><a href="#cite_note-64">[62]</a></sup> by Jonathan Hedley and Shapecatcher<sup id="cite_ref-65" class="reference"><a href="#cite_note-65">[63]</a></sup> by Benjamin Milde. In Unicode Lookup, one enters a search key (e.g. 
"fractions"), and a list of corresponding characters with their code points is returned. In Shapecatcher, based on Shape context, one draws the character in a box and a list of characters approximating the drawing, with their code points, is returned. </p> <h3><span class="mw-headline" id="Email">Email</span></h3> <div class="hatnote">Main article: <b>Unicode and email</b></div> <p><a href="/wiki/MIME" title="MIME">MIME</a> defines two different mechanisms for encoding non-ASCII characters in <a href="/wiki/Email" class="mw-redirect" title="Email">email</a>, depending on whether the characters are in email headers (such as the "Subject:"), or in the text body of the message; in both cases, the original character set is identified as well as a transfer encoding. For email transmission of Unicode, the <a href="/wiki/UTF-8" title="UTF-8">UTF-8</a> character set and the <a href="/wiki/Base64" title="Base64">Base64</a> or the Quoted-printable transfer encoding are recommended, depending on whether much of the message consists of <a href="/wiki/ASCII" title="ASCII">ASCII</a> characters. 
The details of the two different mechanisms are specified in the MIME standards and generally are hidden from users of email software. </p><p>The adoption of Unicode in email has been very slow. Some East Asian text is still encoded in encodings such as ISO-2022, and some devices, such as mobile phones, still cannot correctly handle Unicode data. Support has been improving, however. Many major free mail providers such as <a href="/wiki/Yahoo" class="mw-redirect" title="Yahoo">Yahoo</a>, <a href="/wiki/Google" class="mw-disambig" title="Google">Google</a> (<a href="/wiki/Gmail" title="Gmail">Gmail</a>), and <a href="/wiki/Microsoft" title="Microsoft">Microsoft</a> (<a href="/wiki/Outlook.com" class="mw-redirect" title="Outlook.com">Outlook.com</a>) support it. </p> <h3><span class="mw-headline" id="Web">Web</span></h3> <div class="hatnote">Main article: <b>Unicode and HTML</b></div> <p>All <a href="/wiki/W3C" class="mw-redirect" title="W3C">W3C</a> recommendations have used Unicode as their <i>document character set</i> since HTML 4.0. <a href="/wiki/Web_browser" class="mw-redirect" title="Web browser">Web browsers</a> have supported Unicode, especially UTF-8, for many years. 
There used to be display problems resulting primarily from font-related issues; e.g., version 6 and older of Microsoft <a href="/wiki/Internet_Explorer" title="Internet Explorer">Internet Explorer</a> did not render many code points unless explicitly told to use a font that contains them.<sup id="cite_ref-66" class="reference"><a href="#cite_note-66">[64]</a></sup> </p><p>Although syntax rules may affect the order in which characters are allowed to appear, <a href="/wiki/XML" title="XML">XML</a> (including <a href="/wiki/XHTML" title="XHTML">XHTML</a>) documents, by definition,<sup id="cite_ref-67" class="reference"><a href="#cite_note-67">[65]</a></sup> comprise characters from most of the Unicode code points, with the exception of: </p> <ul><li>most of the C0 control codes</li> <li>the permanently unassigned code points D800–DFFF</li> <li>FFFE or FFFF</li></ul> <p>HTML characters manifest either directly as <a href="/wiki/Byte" class="mw-disambig" title="Byte">bytes</a> according to the document's encoding, if the encoding supports them, or users may write them as numeric character references based on the character's Unicode code point. For example, the references <code>&#916;</code>, <code>&#1049;</code>, <code>&#1511;</code>, <code>&#1605;</code>, <code>&#3671;</code>, <code>&#12354;</code>, <code>&#21494;</code>, <code>&#33865;</code>, and <code>&#47568;</code> (or the same numeric values expressed in hexadecimal, with <code>&#x</code> as the prefix) should display on all browsers as Δ, Й, ק, م, ๗, あ, 叶, 葉, and 말. 
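</p><p>The relationship between a numeric character reference and its code point can be checked with a short Python sketch. It uses <code>html.unescape</code> from the standard library; the helper name <code>to_ncr</code> is an assumption introduced for this example:</p>

```python
import html

# The decimal references quoted in the paragraph above.
REFS = ["&#916;", "&#1049;", "&#1511;", "&#1605;", "&#3671;",
        "&#12354;", "&#21494;", "&#33865;", "&#47568;"]


def to_ncr(ch: str, hexadecimal: bool = False) -> str:
    """Build a numeric character reference from a single character's code point."""
    cp = ord(ch)
    return f"&#x{cp:X};" if hexadecimal else f"&#{cp};"


for ref in REFS:
    ch = html.unescape(ref)  # resolve the reference to the character itself
    print(ref, f"U+{ord(ch):04X}", to_ncr(ch, hexadecimal=True))
```

<p>The decimal and hexadecimal forms name the same code point, so <code>&#916;</code> and <code>&#x394;</code> both resolve to U+0394. 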
</p><p>When specifying <a href="/wiki/Uniform_Resource_Identifier" class="mw-redirect" title="Uniform Resource Identifier">URIs</a>, for example as <a href="/wiki/URL" title="URL">URLs</a> in <a href="/wiki/HTTP" title="HTTP">HTTP</a> requests, non-ASCII characters must be percent-encoded. </p> <h3><span class="mw-headline" id="Fonts">Fonts</span></h3> <div class="hatnote">Main article: <b>Unicode font</b></div> <p>Unicode is not in principle concerned with fonts <i>per se</i>, seeing them as implementation choices.<sup id="cite_ref-68" class="reference"><a href="#cite_note-68">[66]</a></sup> Any given character may have many allographs, from the more common bold, italic and base letterforms to complex decorative styles. A font is "Unicode compliant" if the glyphs in the font can be accessed using code points defined in the Unicode standard.<sup id="cite_ref-69" class="reference"><a href="#cite_note-69">[67]</a></sup> The standard does not specify a minimum number of characters that must be included in the font; some fonts have quite a small repertoire. 
</p><p>Free and retail fonts based on Unicode are widely available, since <a href="/wiki/TrueType" title="TrueType">TrueType</a> and <a href="/wiki/OpenType" title="OpenType">OpenType</a> support Unicode. These font formats map Unicode code points to glyphs, but a TrueType font is restricted to 65,535 glyphs. </p><p>Thousands of fonts exist on the market, but fewer than a dozen fonts, sometimes described as "pan-Unicode" fonts, attempt to support the majority of Unicode's character repertoire. Instead, Unicode-based fonts typically focus on supporting only basic ASCII and particular scripts or sets of characters or symbols. Several reasons justify this approach: applications and documents rarely need to render characters from more than one or two writing systems; fonts tend to demand resources in computing environments; and operating systems and applications show increasing intelligence in regard to obtaining glyph information from separate font files as needed, i.e., font substitution. Furthermore, designing a consistent set of rendering instructions for tens of thousands of glyphs constitutes a monumental task; such a venture passes the point of diminishing returns for most typefaces. 
</p> <h3><span class="mw-headline" id="Newlines">Newlines</span></h3> <p>Unicode partially addresses the newline problem that occurs when trying to read a text file on different platforms. Unicode defines a large number of characters that conforming applications should recognize as line terminators. </p><p>In terms of the newline, Unicode introduced <span class="nowrap">U+2028</span> <span style="font-variant:small-caps;">line separator</span> and <span class="nowrap">U+2029</span> <span style="font-variant:small-caps;">paragraph separator</span>. This was an attempt to provide a Unicode solution to encoding paragraphs and lines semantically, potentially replacing all of the various platform solutions. In doing so, Unicode does provide a way around the historical platform-dependent solutions. 
Nonetheless, few if any Unicode solutions have adopted these Unicode line and paragraph separators as the sole canonical line ending characters. However, a common approach to solving this issue is through newline normalization. This is achieved with the Cocoa text system in Mac OS X and also with W3C XML and HTML recommendations. In this approach, every possible newline character is converted internally to a common newline (which one does not really matter, since it is an internal operation just for rendering). In other words, the text system can correctly treat the character as a newline, regardless of the input's actual encoding. </p> <h2><span class="mw-headline" id="Issues">Issues</span></h2> <h3><span class="mw-headline" id="Philosophical_and_completeness_criticisms">Philosophical and completeness criticisms</span></h3> <p>Han unification (the identification of forms in the East Asian languages which one can treat as stylistic variations of the same historical character) has become one of the most controversial aspects of Unicode, despite the presence of a majority of experts from all three regions in the Ideographic Research Group (IRG), which advises the Consortium and ISO on additions to the repertoire and on Han unification.<sup id="cite_ref-70" class="reference"><a href="#cite_note-70">[68]</a></sup> </p><p>Unicode has been criticized for failing to separately encode older and alternative forms of kanji which, critics argue, complicates the processing of ancient Japanese and uncommon Japanese names. This is often due to the fact that Unicode encodes characters rather than glyphs (the visual representations of the basic character that often vary from one language to another). 
Unification of glyphs leads to the perception that the languages themselves, not just the basic character representation, are being merged.<sup id="cite_ref-dw2001_71-0" class="reference"><a href="#cite_note-dw2001-71">[69]</a></sup> There have been several attempts to create alternative encodings that preserve the stylistic differences between Chinese, Japanese, and Korean characters in opposition to Unicode's policy of Han unification. One example is TRON (although it is not widely adopted in Japan, some users who need to handle historical Japanese text favor it). </p><p>Although the repertoire of fewer than 21,000 Han characters in the earliest version of Unicode was largely limited to characters in common modern usage, Unicode now includes more than 92,000 Han characters, and work is continuing to add thousands more historic and dialectal characters used in China, Japan, Korea, Taiwan, and Vietnam. </p><p>Modern font technology provides a means to address the practical issue of needing to depict a unified Han character in terms of a collection of alternative glyph representations, in the form of Unicode variation sequences. For example, the Advanced Typographic tables of <a href="/wiki/OpenType" title="OpenType">OpenType</a> permit one of a number of alternative glyph representations to be selected when performing the character-to-glyph mapping process. 
In this case, information can be provided within plain text to designate which alternate character form to select. </p> <div class="thumb tright"><div class="thumbinner" style="width:222px;"><a href="/wiki/%D0%A4%D0%B0%D0%B9%D0%BB:Cyrillic_cursive.svg" class="image"><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/7/71/Cyrillic_cursive.svg/220px-Cyrillic_cursive.svg.png" decoding="async" width="220" height="284" class="thumbimage" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/7/71/Cyrillic_cursive.svg/330px-Cyrillic_cursive.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/7/71/Cyrillic_cursive.svg/440px-Cyrillic_cursive.svg.png 2x" data-file-width="425" data-file-height="549" /></a> <div class="thumbcaption">Various Cyrillic characters shown with and without italics</div></div></div> <p>If the appropriate glyphs for two characters in the same script differ only in the italic, Unicode has generally unified them, as can be seen in the comparison between Russian (labeled standard) and Serbian characters at right, meaning that the differences are displayed through smart font technology or by manually changing fonts. 
</p> <h3><span class="mw-headline" id="Mapping_to_legacy_character_sets">Mapping to legacy character sets</span></h3> <p>Unicode was designed to provide code-point-by-code-point round-trip format conversion to and from any preexisting character encodings, so that text files in older character sets can be converted to Unicode and then back to yield the same file, without employing context-dependent interpretation. That has meant that inconsistent legacy architectures, such as combining diacritics and precomposed characters, both exist in Unicode, giving more than one method of representing some text. This is most pronounced in the three different encoding forms for Korean Hangul. 
Since version 3.0, any precomposed characters that can be represented by a combining sequence of already existing characters can no longer be added to the standard in order to preserve interoperability between software using different versions of Unicode. </p><p><a href="/w/index.php?title=Injective&action=edit&redlink=1" class="new" title="Injective (страница отсутствует)">Injective</a> mappings must be provided between characters in existing legacy character sets and characters in Unicode to facilitate conversion to Unicode and allow interoperability with legacy software. Lack of consistency in various mappings between earlier Japanese encodings such as <a href="/w/index.php?title=Shift-JIS&action=edit&redlink=1" class="new" title="Shift-JIS (страница отсутствует)">Shift-JIS</a> or <a href="/w/index.php?title=EUC-JP&action=edit&redlink=1" class="new" title="EUC-JP (страница отсутствует)">EUC-JP</a> and Unicode led to <a href="/w/index.php?title=Round-trip_format_conversion&action=edit&redlink=1" class="new" title="Round-trip format conversion (страница отсутствует)">round-trip format conversion</a> mismatches, particularly the mapping of the character JIS X 0208 '～' (1-33, WAVE DASH), heavily used in legacy database data, to either <span class="nowrap">U+FF5E</span> <span style="font-size:125%">～</span> <span style="font-variant:small-caps; -moz-font-feature-settings:"smcp" 1; -ms-font-feature-settings:"smcp" 1; -o-font-feature-settings:"smcp" 1; -webkit-font-feature-settings:"smcp" 1; font-feature-settings:"smcp" 1;">fullwidth tilde</span> (in <a href="/wiki/Microsoft_Windows" class="mw-redirect" title="Microsoft Windows">Microsoft Windows</a>) or <span class="nowrap">U+301C</span> <span style="font-size:125%">〜</span> <span style="font-variant:small-caps; -moz-font-feature-settings:"smcp" 1; -ms-font-feature-settings:"smcp" 1; -o-font-feature-settings:"smcp" 1; -webkit-font-feature-settings:"smcp" 1; font-feature-settings:"smcp" 1;">wave dash</span> (other 
vendors).<sup id="cite_ref-72" class="reference"><a href="#cite_note-72">[70]</a></sup> </p><p>Some Japanese computer programmers objected to Unicode because it requires them to separate the use of <span class="nowrap">U+005C</span> <span style="font-size:125%">\</span> <span style="font-variant:small-caps; -moz-font-feature-settings:"smcp" 1; -ms-font-feature-settings:"smcp" 1; -o-font-feature-settings:"smcp" 1; -webkit-font-feature-settings:"smcp" 1; font-feature-settings:"smcp" 1;">reverse solidus</span> (backslash) and <span class="nowrap">U+00A5</span> <span style="font-size:125%">¥</span> <span style="font-variant:small-caps; -moz-font-feature-settings:"smcp" 1; -ms-font-feature-settings:"smcp" 1; -o-font-feature-settings:"smcp" 1; -webkit-font-feature-settings:"smcp" 1; font-feature-settings:"smcp" 1;">yen sign</span>, which was mapped to 0x5C in JIS X 0201, and a lot of legacy code exists with this usage.<sup id="cite_ref-73" class="reference"><a href="#cite_note-73">[71]</a></sup> (This encoding also replaces tilde '~' 0x7E with macron '¯', now 0xAF.) The separation of these characters exists in <a href="/wiki/ISO_8859-1" title="ISO 8859-1">ISO 8859-1</a>, from long before Unicode. 
</p> <h3><span class="mw-headline" id="Indic_scripts">Indic scripts</span></h3> <p><a href="/w/index.php?title=Indic_script&action=edit&redlink=1" class="new" title="Indic script (страница отсутствует)">Indic scripts</a> such as <a href="/w/index.php?title=Tamil_script&action=edit&redlink=1" class="new" title="Tamil script (страница отсутствует)">Tamil</a> and <a href="/w/index.php?title=Devanagari&action=edit&redlink=1" class="new" title="Devanagari (страница отсутствует)">Devanagari</a> are each allocated only 128 code points, matching the <a href="/w/index.php?title=ISCII&action=edit&redlink=1" class="new" title="ISCII (страница отсутствует)">ISCII</a> standard. Rendering Unicode Indic text correctly requires transforming the characters stored in logical order into visual order and forming ligatures (also known as conjuncts) out of components. 
Some local scholars argued in favor of assignments of Unicode code points to these ligatures, going against the practice for other writing systems, though Unicode contains some Arabic and other ligatures for backward compatibility purposes only.<sup id="cite_ref-74" class="reference"><a href="#cite_note-74">[72]</a></sup><sup id="cite_ref-75" class="reference"><a href="#cite_note-75">[73]</a></sup><sup id="cite_ref-76" class="reference"><a href="#cite_note-76">[74]</a></sup> Encoding of any new ligatures in Unicode will not happen, in part because the set of ligatures is font-dependent, and Unicode is an encoding independent of font variations. The same kind of issue arose for the <a href="/w/index.php?title=Tibetan_script&action=edit&redlink=1" class="new" title="Tibetan script (страница отсутствует)">Tibetan script</a> in 2003 when the <a href="/w/index.php?title=Standardization_Administration_of_China&action=edit&redlink=1" class="new" title="Standardization Administration of China (страница отсутствует)">Standardization Administration of China</a> proposed encoding 956 precomposed Tibetan syllables,<sup id="cite_ref-77" class="reference"><a href="#cite_note-77">[75]</a></sup> but these were rejected for encoding by the relevant ISO committee (<a href="/w/index.php?title=ISO/IEC_JTC_1/SC_2&action=edit&redlink=1" class="new" title="ISO/IEC JTC 1/SC 2 (страница отсутствует)">ISO/IEC JTC 1/SC 2</a>).<sup id="cite_ref-78" class="reference"><a href="#cite_note-78">[76]</a></sup> </p><p><a href="/w/index.php?title=Thai_alphabet&action=edit&redlink=1" class="new" title="Thai alphabet (страница отсутствует)">Thai alphabet</a> support has been criticized for its ordering of Thai characters. The vowels เ, แ, โ, ใ, ไ that are written to the left of the preceding consonant are in visual order instead of phonetic order, unlike the Unicode representations of other Indic scripts. 
This complication is due to Unicode inheriting the <a href="/w/index.php?title=TIS-620&action=edit&redlink=1" class="new" title="TIS-620 (страница отсутствует)">Thai Industrial Standard 620</a>, which worked in the same way, and was the way in which Thai had always been written on keyboards. This ordering problem complicates the Unicode collation process slightly, requiring table lookups to reorder Thai characters for collation.<sup id="cite_ref-dw2001_71-1" class="reference"><a href="#cite_note-dw2001-71">[69]</a></sup> Even if Unicode had adopted encoding according to spoken order, it would still be problematic to collate words in dictionary order. For example, the word แสดง "perform" starts with a consonant cluster "สด" (with an inherent vowel for the consonant "ส"). In spoken order the vowel แ- would come after the ด, but in a dictionary the word is collated as it is written, with the vowel following the ส. 
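</p> <p>The difference between visual-order and logical-order storage can be inspected directly by listing the code points of a string. The sketch below uses Python's standard <code>unicodedata</code> module; the Thai word แสดง and the Devanagari syllable कि are illustrative choices made for this example, and the helper name <code>describe</code> is hypothetical.</p>

```python
import unicodedata

# Thai: the vowel sign is written to the left of the consonant
# and is also stored first (visual order).
thai_word = "\u0e41\u0e2a\u0e14\u0e07"

# Devanagari: the vowel sign is drawn to the left of the consonant
# but is stored after it (logical order).
devanagari_syllable = "\u0915\u093f"


def describe(text):
    """List each stored code point with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, sample in (("Thai", thai_word), ("Devanagari", devanagari_syllable)):
    print(label)
    for entry in describe(sample):
        print("  ", entry)
```

<p>In the Thai sample the first stored code point is the vowel sign, while in the Devanagari sample the consonant comes first even though its vowel sign is displayed on the left. A collation routine therefore has to reorder the Thai prefix vowels before comparing strings.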
</p> <h3><span class="mw-headline" id="Combining_characters">Combining characters</span></h3> <div class="hatnote">Main article: <b><a href="/w/index.php?title=Combining_character&action=edit&redlink=1" class="new" title="Combining character (страница отсутствует)">Combining character</a></b></div> <div class="hatnote dabhide">See also: <a href="/w/index.php?title=Unicode_normalization&action=edit&redlink=1" class="new" title="Unicode normalization (страница отсутствует)">Unicode normalization § Normalization</a></div> <p>Characters with diacritical marks can generally be represented either as a single precomposed character or as a decomposed sequence of a base letter plus one or more non-spacing marks. For example, &#x1E17; (precomposed e with macron and acute above) and e&#x0304;&#x0301; (e followed by the combining macron above and combining acute above) should be rendered identically, both appearing as an <a href="/wiki/E" class="mw-disambig" title="E">e</a> with a <a href="/w/index.php?title=Macron_(diacritic)&action=edit&redlink=1" class="new" title="Macron (diacritic) (страница отсутствует)">macron</a> and <a href="/w/index.php?title=Acute_accent&action=edit&redlink=1" class="new" title="Acute accent (страница отсутствует)">acute accent</a>, but in practice, their appearance may vary depending upon what rendering engine and fonts are being used to display the characters. 
Similarly, <a href="/w/index.php?title=Dot_(diacritic)&action=edit&redlink=1" class="new" title="Dot (diacritic) (страница отсутствует)">underdots</a>, as needed in the <a href="/w/index.php?title=Romanization&action=edit&redlink=1" class="new" title="Romanization (страница отсутствует)">romanization</a> of <a href="/w/index.php?title=Indo-Aryan_languages&action=edit&redlink=1" class="new" title="Indo-Aryan languages (страница отсутствует)">Indic</a> languages, will often be placed incorrectly. Unicode characters that map to precomposed glyphs can be used in many cases, thus avoiding the problem, but where no precomposed character has been encoded the problem can often be solved by using a specialist Unicode font such as <a href="/wiki/Charis_SIL" class="mw-redirect" title="Charis SIL">Charis SIL</a> that uses <a href="/w/index.php?title=Graphite_(SIL)&action=edit&redlink=1" class="new" title="Graphite (SIL) (страница отсутствует)">Graphite</a>, <a href="/wiki/OpenType" title="OpenType">OpenType</a>, or <a href="/w/index.php?title=Apple_Advanced_Typography&action=edit&redlink=1" class="new" title="Apple Advanced Typography (страница отсутствует)">AAT</a> technologies for advanced rendering features. 
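</p> <p>The equivalence between a precomposed character and its decomposed sequence can be checked with Unicode normalization. The following is a minimal sketch using Python's standard <code>unicodedata</code> module; the variable names are chosen for this example only.</p>

```python
import unicodedata

precomposed = "\u1e17"        # one code point: e with macron and acute
decomposed = "e\u0304\u0301"  # e + COMBINING MACRON + COMBINING ACUTE ACCENT

# The raw strings differ in length and compare as unequal.
print(len(precomposed), len(decomposed), precomposed == decomposed)

# NFC composes where a precomposed character exists; NFD decomposes fully.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(nfc == precomposed)
print(nfd == decomposed)
```

<p>Normalizing both strings to the same form before comparing or indexing them avoids treating two spellings of the same text as different, although it does not change how a given font positions the marks.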
</p> <h3><span class="mw-headline" id="Anomalies">Anomalies</span></h3> <div class="hatnote">Main article: <b><a href="/w/index.php?title=Unicode_alias_names_and_abbreviations&action=edit&redlink=1" class="new" title="Unicode alias names and abbreviations (страница отсутствует)">Unicode alias names and abbreviations</a></b></div> <p>The Unicode standard has imposed rules intended to guarantee stability.<sup id="cite_ref-79" class="reference"><a href="#cite_note-79">[77]</a></sup> Depending on the strictness of a rule, a change can be prohibited or allowed. For example, a "name" given to a code point cannot and will not change. But a "script" property is more flexible, by Unicode's own rules. In version 2.0, Unicode changed many code point "names" from version 1. At the same time, Unicode stated that from then on, a name assigned to a code point would never change. This implies that when mistakes are published, these mistakes cannot be corrected, even if they are trivial (as happened in one instance with the spelling BRAKCET for BRACKET in a character name). 
In 2006 a list of anomalies in character names was first published, and, as of April 2017, there were 94 characters with identified issues,<sup id="cite_ref-tn17_80-0" class="reference"><a href="#cite_note-tn17-80">[78]</a></sup> for example: </p> <ul><li><span class="nowrap">U+2118</span> <span style="font-size:125%">℘</span> <a href="/w/index.php?title=Weierstrass_p&action=edit&redlink=1" class="new" title="Weierstrass p (страница отсутствует)"><span style="font-variant:small-caps; -moz-font-feature-settings:"smcp" 1; -ms-font-feature-settings:"smcp" 1; -o-font-feature-settings:"smcp" 1; -webkit-font-feature-settings:"smcp" 1; font-feature-settings:"smcp" 1;">script capital p</span></a>: This is a small letter. The capital is <span class="nowrap">U+1D4AB</span> <span style="font-size:125%">𝒫</span> <span style="font-variant:small-caps; -moz-font-feature-settings:"smcp" 1; -ms-font-feature-settings:"smcp" 1; -o-font-feature-settings:"smcp" 1; -webkit-font-feature-settings:"smcp" 1; font-feature-settings:"smcp" 1;">mathematical script capital p</span><sup id="cite_ref-81" class="reference"><a href="#cite_note-81">[79]</a></sup></li> <li><span class="nowrap">U+034F</span> <span style="font-size:125%">͏</span> <a href="/w/index.php?title=Combining_grapheme_joiner&action=edit&redlink=1" class="new" title="Combining grapheme joiner (страница отсутствует)"><span style="font-variant:small-caps; -moz-font-feature-settings:"smcp" 1; -ms-font-feature-settings:"smcp" 1; -o-font-feature-settings:"smcp" 1; -webkit-font-feature-settings:"smcp" 1; font-feature-settings:"smcp" 1;">combining grapheme joiner</span></a>: Does not join graphemes.<sup id="cite_ref-tn17_80-1" class="reference"><a href="#cite_note-tn17-80">[78]</a></sup></li> <li><span class="nowrap">U+A015</span> <span style="font-size:125%">ꀕ</span> <a href="/w/index.php?title=Yi_language&action=edit&redlink=1" class="new" title="Yi language (страница отсутствует)"><span style="font-variant:small-caps; 
-moz-font-feature-settings:"smcp" 1; -ms-font-feature-settings:"smcp" 1; -o-font-feature-settings:"smcp" 1; -webkit-font-feature-settings:"smcp" 1; font-feature-settings:"smcp" 1;">yi syllable wu</span></a>: This is not a Yi syllable, but a Yi iteration mark.</li> <li><span class="nowrap">U+FE18</span> <span style="font-size:125%">︘</span> <span style="font-variant:small-caps; -moz-font-feature-settings:"smcp" 1; -ms-font-feature-settings:"smcp" 1; -o-font-feature-settings:"smcp" 1; -webkit-font-feature-settings:"smcp" 1; font-feature-settings:"smcp" 1;">presentation form for vertical right white lenticular brakcet</span>: <i>bracket</i> is spelled incorrectly.<sup id="cite_ref-82" class="reference"><a href="#cite_note-82">[80]</a></sup></li></ul> <p>Spelling errors are resolved by using <a href="/w/index.php?title=Unicode_alias_names_and_abbreviations&action=edit&redlink=1" class="new" title="Unicode alias names and abbreviations (страница отсутствует)">Unicode alias names and abbreviations</a>. 
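</p> <p>The effect of an immutable name plus a corrective alias can be observed from code. This is a small sketch that assumes a Python version (3.3 or later) whose <code>unicodedata.lookup</code> accepts name aliases; the variable names are illustrative.</p>

```python
import unicodedata

char = "\ufe18"

# The formal character name keeps the published misspelling.
formal_name = unicodedata.name(char)
print(formal_name)

# The corrected spelling is registered as an alias, so a lookup by the
# corrected name is expected to resolve to the same code point.
corrected = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
via_alias = unicodedata.lookup(corrected)
print(f"U+{ord(via_alias):04X}")

# The misspelled formal name remains valid as well.
via_formal = unicodedata.lookup(formal_name)
print(via_alias == via_formal == char)
```

<p>Software that matches characters by name can accept either spelling this way, while the formal name itself stays stable across versions.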
</p> <h2><span class="mw-headline" id="See_also">See also</span></h2> <ul><li><a href="/w/index.php?title=Comparison_of_Unicode_encodings&action=edit&redlink=1" class="new" title="Comparison of Unicode encodings (страница отсутствует)">Comparison of Unicode encodings</a></li> <li><a href="/w/index.php?title=Cultural,_political,_and_religious_symbols_in_Unicode&action=edit&redlink=1" class="new" title="Cultural, political, and religious symbols in Unicode (страница отсутствует)">Cultural, political, and religious symbols in Unicode</a></li> <li><a href="/w/index.php?title=International_Components_for_Unicode&action=edit&redlink=1" class="new" title="International Components for Unicode (страница отсутствует)">International Components for Unicode</a> (ICU), now as ICU-<abbr title="technical committee">TC</abbr> a part of Unicode</li> <li><a href="/w/index.php?title=List_of_binary_codes&action=edit&redlink=1" class="new" title="List of binary codes (страница отсутствует)">List of binary codes</a></li> <li><a href="/w/index.php?title=List_of_Unicode_characters&action=edit&redlink=1" class="new" title="List of Unicode characters (страница отсутствует)">List of Unicode characters</a></li> <li><a href="/w/index.php?title=List_of_XML_and_HTML_character_entity_references&action=edit&redlink=1" class="new" title="List of XML and HTML character entity references (страница отсутствует)">List of XML and HTML character entity references</a></li> <li><a 
href="/w/index.php?title=Open-source_Unicode_typefaces&action=edit&redlink=1" class="new" title="Open-source Unicode typefaces (страница отсутствует)">Open-source Unicode typefaces</a></li> <li><a href="/w/index.php?title=Standards_related_to_Unicode&action=edit&redlink=1" class="new" title="Standards related to Unicode (страница отсутствует)">Standards related to Unicode</a></li> <li><a href="/w/index.php?title=Unicode_symbols&action=edit&redlink=1" class="new" title="Unicode symbols (страница отсутствует)">Unicode symbols</a></li> <li><a href="/w/index.php?title=Universal_Coded_Character_Set&action=edit&redlink=1" class="new" title="Universal Coded Character Set (страница отсутствует)">Universal Coded Character Set</a></li> <li><a href="/w/index.php?title=Lotus_Multi-Byte_Character_Set&action=edit&redlink=1" class="new" title="Lotus Multi-Byte Character Set (страница отсутствует)">Lotus Multi-Byte Character Set</a> (LMBCS), a parallel development with similar intentions</li></ul> <h2><span class="mw-headline" id="Further_reading">Further reading</span></h2> <div class="reflist not-references" style=""> <ul><li><i>The Unicode Standard, Version 3.0</i>, The Unicode Consortium, Addison-Wesley Longman, Inc., April 2000. 
<a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/0-201-61633-5" title="Служебная:Источники книг/0-201-61633-5">ISBN <span class="nowrap">0-201-61633-5</span></a></li> <li><i>The Unicode Standard, Version 4.0</i>, The Unicode Consortium, Addison-Wesley Professional, 27 August 2003. <a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/0-321-18578-1" title="Служебная:Источники книг/0-321-18578-1">ISBN <span class="nowrap">0-321-18578-1</span></a></li> <li><i>The Unicode Standard, Version 5.0, Fifth Edition</i>, The <a href="/w/index.php?title=Unicode_Consortium&action=edit&redlink=1" class="new" title="Unicode Consortium (страница отсутствует)">Unicode Consortium</a>, Addison-Wesley Professional, 27 October 2006. <a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/0-321-48091-0" title="Служебная:Источники книг/0-321-48091-0">ISBN <span class="nowrap">0-321-48091-0</span></a></li> <li>Julie D. Allen. <i>The Unicode Standard, Version 6.0</i>, The <a href="/w/index.php?title=Unicode_Consortium&action=edit&redlink=1" class="new" title="Unicode Consortium (страница отсутствует)">Unicode Consortium</a>, Mountain View, 2011, <a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/9781936213016" title="Служебная:Источники книг/9781936213016">ISBN <span class="nowrap">9781936213016</span></a>, (<a rel="nofollow" class="external autonumber" href="https://www.unicode.org/versions/Unicode6.0.0/">[1]</a>).</li> <li><i>The Complete Manual of Typography</i>, James Felici, Adobe Press; 1st edition, 2002. 
<a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/0-321-12730-7" title="Служебная:Источники книг/0-321-12730-7">ISBN <span class="nowrap">0-321-12730-7</span></a></li> <li><i>Unicode: A Primer</i>, Tony Graham, M&T books, 2000. <a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/0-7645-4625-2" title="Служебная:Источники книг/0-7645-4625-2">ISBN <span class="nowrap">0-7645-4625-2</span></a>.</li> <li><i>Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard</i>, Richard Gillam, Addison-Wesley Professional; 1st edition, 2002. <a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/0-201-70052-2" title="Служебная:Источники книг/0-201-70052-2">ISBN <span class="nowrap">0-201-70052-2</span></a></li> <li><i>Unicode Explained</i>, Jukka K. Korpela, O'Reilly; 1st edition, 2006. <a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/0-596-10121-X" title="Служебная:Источники книг/0-596-10121-X">ISBN <span class="nowrap">0-596-10121-X</span></a></li></ul> </div> <ul><li><span class="citation" id="CITEREFhttps://doi.org/10.36824/2018-graf-hara1"><span class="citation">Unicode from a Linguistic Point of View // <a rel="nofollow" class="external text" href="http://www.fluxus-editions.fr/gla1-hara1.php">Proceedings of Graphemics in the 21st Century, Brest 2018</a>. — Brest : Fluxus Editions, 2019. — P. 167-183. 
<a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/9782957054916" class="internal mw-magiclink-isbn">ISBN 978-2-9570549-1-6</a>.</span></span></li></ul> <h2><span class="mw-headline" id="Notes">Notes</span></h2> <div class="reflist" style="list-style-type: decimal;"> <ol class="references"> <li id="cite_note-3"><span class="mw-cite-backlink"><a href="#cite_ref-3">↑</a></span> <span class="reference-text">The Unicode Consortium uses the ambiguous term byte; the <a href="/wiki/International_Organization_for_Standardization" class="mw-redirect" title="International Organization for Standardization">International Organization for Standardization</a> (ISO), the <a href="/wiki/International_Electrotechnical_Commission" class="mw-redirect" title="International Electrotechnical Commission">International Electrotechnical Commission</a> (IEC) and the <a href="/wiki/Internet_Engineering_Task_Force" class="mw-redirect" title="Internet Engineering Task Force">Internet Engineering Task Force</a> (IETF) use the more specific term <a href="/w/index.php?title=Octet_(computing)&action=edit&redlink=1" class="new" title="Octet (computing) (страница отсутствует)">octet</a> in current documents related to Unicode.</span> </li> </ol></div> <h2><span class="mw-headline" id="References">References</span></h2> 
<div class="reflist columns" style="-moz-column-width:30em; -webkit-column-width:30em; column-width:30em; list-style-type: decimal;"> <ol class="references"> <li id="cite_note-1"><span class="mw-cite-backlink"><a href="#cite_ref-1">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/standard/principles.html">The Unicode Standard: A Technical Introduction</a></span>. <small>Retrieved 16 March 2010.</small></span></span> </span></li> <li id="cite_note-2"><span class="mw-cite-backlink"><a href="#cite_ref-2">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="en"><a rel="nofollow" class="external text" href="https://w3techs.com/technologies/cross/character_encoding/ranking">Usage Survey of Character Encodings broken down by Ranking</a></span> <span class="ref-info" style="cursor:help;" title="на английском языке">(in English)</span>. <i>w3techs.com</i>. 
<small>Retrieved 11 November 2019.</small></span></span> </span></li> <li id="cite_note-4"><span class="mw-cite-backlink"><a href="#cite_ref-4">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#I1.36559">Conformance</a></span>. <i>The Unicode Standard</i> (March 2020). <small>Retrieved 15 March 2020.</small></span></span> </span></li> <li id="cite_note-unicode-88-5"><span class="mw-cite-backlink">↑ <a href="#cite_ref-unicode-88_5-0"><sup><i><b>1</b></i></sup></a> <a href="#cite_ref-unicode-88_5-1"><sup><i><b>2</b></i></sup></a> <a href="#cite_ref-unicode-88_5-2"><sup><i><b>3</b></i></sup></a> <a href="#cite_ref-unicode-88_5-3"><sup><i><b>4</b></i></sup></a> <a href="#cite_ref-unicode-88_5-4"><sup><i><b>5</b></i></sup></a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://unicode.org/history/unicode88.pdf">Unicode 88</a></span>. <i>unicode.org</i>. <a href="/w/index.php?title=Unicode_Consortium&action=edit&redlink=1" class="new" title="Unicode Consortium (страница отсутствует)">Unicode Consortium</a> (10 September 1998). «In 1978, the initial proposal for a set of "Universal Signs" was made by <a href="/w/index.php?title=Bob_Belleville&action=edit&redlink=1" class="new" title="Bob Belleville (страница отсутствует)">Bob Belleville</a> at <a href="/wiki/Xerox_PARC" title="Xerox PARC">Xerox PARC</a>. Many persons contributed ideas to the development of a new encoding design. 
Beginning in 1980, these efforts evolved into the <a href="/w/index.php?title=Xerox_Character_Code_Standard&action=edit&redlink=1" class="new" title="Xerox Character Code Standard (страница отсутствует)">Xerox Character Code Standard</a> (XCCS) by the present author, a multilingual encoding which has been maintained by Xerox as an internal corporate standard since 1982, through the efforts of Ed Smura, Ron Pellar, and others.<br />Unicode arose as the result of eight years of working experience with XCCS. Its fundamental differences from XCCS were proposed by Peter Fenwick and Dave Opstad (pure 16-bit codes), and by <a href="/w/index.php?title=Lee_Collins_(Unicode)&action=edit&redlink=1" class="new" title="Lee Collins (Unicode) (страница отсутствует)">Lee Collins</a> (ideographic character unification). Unicode retains the many features of XCCS whose utility have been proved over the years in an international line of communication multilingual system products.». <small>Retrieved 25 October 2016.</small> <small><a rel="nofollow" class="external text" href="https://web.archive.org/web/20161125224409/https://unicode.org/history/unicode88.pdf">Archived</a> 25 November 2016.</small></span></span> </span></li> <li id="cite_note-6"><span class="mw-cite-backlink"><a href="#cite_ref-6">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/history/summary.html">Summary Narrative</a></span>. 
<small>Retrieved 15 March 2010.</small></span></span> </span></li> <li id="cite_note-7"><span class="mw-cite-backlink"><a href="#cite_ref-7">↑</a></span> <span class="reference-text"><a rel="nofollow" class="external text" href="https://unicode.org/history/publicationdates.html">History of Unicode Release and Publication Dates</a> on <i>unicode.org.</i> Retrieved February 28, 2017.</span> </li> <li id="cite_note-unicoderevisited-8"><span class="mw-cite-backlink"><a href="#cite_ref-unicoderevisited_8-0">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><i>Searle, Stephen J</i> <span lang="und"><a rel="nofollow" class="external text" href="http://tronweb.super-nova.co.jp/unicoderevisited.html">Unicode Revisited</a></span>. <small>Retrieved 18 January 2013.</small></span></span> </span></li> <li id="cite_note-members-9"><span class="mw-cite-backlink">↑ <a href="#cite_ref-members_9-0"><sup><i><b>1</b></i></sup></a> <a href="#cite_ref-members_9-1"><sup><i><b>2</b></i></sup></a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://unicode.org/consortium/members.html">The Unicode Consortium Members</a></span>. 
<small>Retrieved 4 January 2019.</small></span></span> </span></li> <li id="cite_note-10"><span class="mw-cite-backlink"><a href="#cite_ref-10">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/charts/">Character Code Charts</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 17 March 2010.</small></span></span> </span></li> <li id="cite_note-11"><span class="mw-cite-backlink"><a href="#cite_ref-11">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://home.unicode.org/basic-info/faq/">Unicode FAQ</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 2 April 2020.</small></span></span> </span></li> <li id="cite_note-12"><span class="mw-cite-backlink"><a href="#cite_ref-12">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/roadmaps/bmp/">Roadmap to the BMP</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <a href="/w/index.php?title=Unicode_Consortium&action=edit&redlink=1" class="new" title="Unicode Consortium (страница отсутствует)">Unicode Consortium</a>. 
<small>Retrieved 30 July 2018.</small></span></span> </span></li> <li id="cite_note-13"><span class="mw-cite-backlink"><a href="#cite_ref-13">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/pending/about-sei.html">About The Script Encoding Initiative</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. The Unicode Consortium. <small>Retrieved 4 June 2012.</small></span></span> </span></li> <li id="cite_note-version6.1PoD-14"><span class="mw-cite-backlink"><a href="#cite_ref-version6.1PoD_14-0">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/mail-arch/unicode-ml/y2012-m05/0240.html">Unicode 6.1 Paperback Available</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <i>announcements_at_unicode.org</i>. <small>Retrieved 30 May 2012.</small></span></span> </span></li> <li id="cite_note-15"><span class="mw-cite-backlink"><a href="#cite_ref-15">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://home.unicode.org/unicode-14-0-delayed-for-6-months/">Unicode 14.0 Delayed for 6 Months</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. 
<small>Retrieved 5 May 2020.</small></span></span> </span></li> <li id="cite_note-16"><span class="mw-cite-backlink"><a href="#cite_ref-16">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/versions/enumeratedversions.html">Enumerated Versions of The Unicode Standard</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 21 June 2016.</small></span></span> </span></li> <li id="cite_note-18"><span class="mw-cite-backlink"><a href="#cite_ref-18">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt">Unicode Data 1.0.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 16 March 2010.</small></span></span> </span></li> <li id="cite_note-19"><span class="mw-cite-backlink"><a href="#cite_ref-19">↑</a></span> <span class="reference-text"> <span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/reconstructed/1.0.1/UnicodeData.txt">Unicode Data 1.0.1</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. 
<small>Retrieved 16 March 2010.</small></span></span> </span></li> <li id="cite_note-20"><span class="mw-cite-backlink"><a href="#cite_ref-20">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/1.1-Update/UnicodeData-1.1.5.txt">Unicode Data 1995</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 16 March 2010.</small></span></span> </span></li> <li id="cite_note-21"><span class="mw-cite-backlink"><a href="#cite_ref-21">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt">Unicode Data-2.0.14</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 16 March 2010.</small></span></span> </span></li> <li id="cite_note-22"><span class="mw-cite-backlink"><a href="#cite_ref-22">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/2.1-Update/UnicodeData-2.1.2.txt">Unicode Data-2.1.2</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. 
<small>Retrieved 16 March 2010.</small></span></span> </span></li> <li id="cite_note-23"><span class="mw-cite-backlink"><a href="#cite_ref-23">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt">Unicode Data-3.0.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 16 March 2010.</small></span></span> </span></li> <li id="cite_note-24"><span class="mw-cite-backlink"><a href="#cite_ref-24">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt">Unicode Data-3.1.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 16 March 2010.</small></span></span> </span></li> <li id="cite_note-25"><span class="mw-cite-backlink"><a href="#cite_ref-25">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt">Unicode Data-3.2.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. 
<small>Retrieved 16 March 2010.</small></span></span> </span></li> <li id="cite_note-26"><span class="mw-cite-backlink"><a href="#cite_ref-26">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt">Unicode Data-4.0.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 16 March 2010.</small></span></span> </span></li> <li id="cite_note-27"><span class="mw-cite-backlink"><a href="#cite_ref-27">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt">Unicode Data-4.1.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 16 March 2010.</small></span></span> </span></li> <li id="cite_note-28"><span class="mw-cite-backlink"><a href="#cite_ref-28">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/5.0.0/ucd/UnicodeData.txt">Unicode Data 5.0.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. 
<small>Retrieved 17 March 2010.</small></span></span> </span></li> <li id="cite_note-29"><span class="mw-cite-backlink"><a href="#cite_ref-29">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt">Unicode Data 5.1.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 17 March 2010.</small></span></span> </span></li> <li id="cite_note-30"><span class="mw-cite-backlink"><a href="#cite_ref-30">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt">Unicode Data 5.2.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 17 March 2010.</small></span></span> </span></li> <li id="cite_note-31"><span class="mw-cite-backlink"><a href="#cite_ref-31">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt">Unicode Data 6.0.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. 
<small>Retrieved 11 October 2010.</small></span></span> </span></li> <li id="cite_note-32"><span class="mw-cite-backlink"><a href="#cite_ref-32">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/6.1.0/ucd/UnicodeData.txt">Unicode Data 6.1.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 31 January 2012.</small></span></span> </span></li> <li id="cite_note-33"><span class="mw-cite-backlink"><a href="#cite_ref-33">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt">Unicode Data 6.2.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 26 September 2012.</small></span></span> </span></li> <li id="cite_note-34"><span class="mw-cite-backlink"><a href="#cite_ref-34">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/6.3.0/ucd/UnicodeData.txt">Unicode Data 6.3.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. 
<small>Retrieved 30 September 2013.</small></span></span> </span></li> <li id="cite_note-35"><span class="mw-cite-backlink"><a href="#cite_ref-35">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt">Unicode Data 7.0.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 15 June 2014.</small></span></span> </span></li> <li id="cite_note-36"><span class="mw-cite-backlink"><a href="#cite_ref-36">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/versions/Unicode8.0.0/">Unicode 8.0.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. Unicode Consortium. <small>Retrieved 17 June 2015.</small></span></span> </span></li> <li id="cite_note-37"><span class="mw-cite-backlink"><a href="#cite_ref-37">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt">Unicode Data 8.0.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. 
<small>Retrieved 17 June 2015.</small></span></span> </span></li> <li id="cite_note-38"><span class="mw-cite-backlink"><a href="#cite_ref-38">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/versions/Unicode9.0.0/">Unicode 9.0.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. Unicode Consortium. <small>Retrieved 21 June 2016.</small></span></span> </span></li> <li id="cite_note-39"><span class="mw-cite-backlink"><a href="#cite_ref-39">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt">Unicode Data 9.0.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 21 June 2016.</small></span></span> </span></li> <li id="cite_note-laobo-40"><span class="mw-cite-backlink"><a href="#cite_ref-laobo_40-0">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><i>Lobao, Martim</i> <span lang="und"><a rel="nofollow" class="external text" href="https://www.androidpolice.com/2016/06/07/two-emoji-werent-approved-unicode-9-google-added-android-anyway/">These Are The Two Emoji That Weren't Approved For Unicode 9 But Which Google Added To Android Anyway</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <i>Android Police</i> (7 June 2016). 
<small>Retrieved 4 September 2016.</small></span></span> </span></li> <li id="cite_note-41"><span class="mw-cite-backlink"><a href="#cite_ref-41">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/versions/Unicode10.0.0/">Unicode 10.0.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. Unicode Consortium. <small>Retrieved 20 June 2017.</small></span></span> </span></li> <li id="cite_note-42"><span class="mw-cite-backlink"><a href="#cite_ref-42">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/versions/Unicode11.0.0/appC.pdf">The Unicode Standard, Version 11.0.0 Appendix C</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. Unicode Consortium. <small>Retrieved 11 June 2018.</small></span></span> </span></li> <li id="cite_note-43"><span class="mw-cite-backlink"><a href="#cite_ref-43">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="http://blog.unicode.org/2018/06/announcing-unicode-standard-version-110.html">Announcing The Unicode Standard, Version 11.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <i>blog.unicode.org</i>. 
<small>Retrieved 6 June 2018.</small></span></span> </span></li> <li id="cite_note-44"><span class="mw-cite-backlink"><a href="#cite_ref-44">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/versions/Unicode12.0.0/appC.pdf">The Unicode Standard, Version 12.0.0 Appendix C</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. Unicode Consortium. <small>Retrieved 5 March 2019.</small></span></span> </span></li> <li id="cite_note-45"><span class="mw-cite-backlink"><a href="#cite_ref-45">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="http://blog.unicode.org/2019/03/announcing-unicode-standard-version-120.html">Announcing The Unicode Standard, Version 12.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <i>blog.unicode.org</i>. <small>Retrieved 5 March 2019.</small></span></span> </span></li> <li id="cite_note-46"><span class="mw-cite-backlink"><a href="#cite_ref-46">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="http://blog.unicode.org/2019/05/unicode-12-1-en.html">Unicode Version 12.1 released in support of the Reiwa Era</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <i>blog.unicode.org</i>. 
<small>Retrieved 7 May 2019.</small></span></span> </span></li> <li id="cite_note-47"><span class="mw-cite-backlink"><a href="#cite_ref-47">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/versions/Unicode13.0.0/appC.pdf">The Unicode Standard, Version 13.0– Core Specification Appendix C</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. Unicode Consortium. <small>Retrieved 11 March 2020.</small></span></span> </span></li> <li id="cite_note-48"><span class="mw-cite-backlink"><a href="#cite_ref-48">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="http://blog.unicode.org/2020/03/announcing-unicode-standard-version-130.html">Announcing The Unicode Standard, Version 13.0</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <i>blog.unicode.org</i>. <small>Retrieved 11 March 2020.</small></span></span> </span></li> <li id="cite_note-Glossary-49"><span class="mw-cite-backlink"><a href="#cite_ref-Glossary_49-0">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://unicode.org/glossary/">Glossary of Unicode Terms</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. 
<small>Retrieved 16 March 2010.</small></span></span> </span></li> <li id="cite_note-50"><span class="mw-cite-backlink"><a href="#cite_ref-50">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation">3.4 Characters and Encoding // <a rel="nofollow" class="external text" href="https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212">The Unicode Standard, Version 13.0</a>. — 2019. — P. 19.</span></span></span> </li> <li id="cite_note-:0-51"><span class="mw-cite-backlink"><a href="#cite_ref-:0_51-0">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation">2.4 Code Points and Characters // <a rel="nofollow" class="external text" href="http://www.unicode.org/versions/Unicode12.0.0/ch02.pdf#G25564">The Unicode Standard Version 12.0 – Core Specification</a>. — 2019. — P. 29.</span></span></span> </li> <li id="cite_note-52"><span class="mw-cite-backlink"><a href="#cite_ref-52">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/versions/Unicode13.0.0/appA.pdf">Appendix A: Notational Conventions</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <i>The Unicode Standard</i>. 
Unicode Consortium (March 2020).</span> In conformity with the bullet point relating to Unicode in <a href="/w/index.php?title=MOS:ALLCAPS&action=edit&redlink=1" class="new" title="MOS:ALLCAPS (страница отсутствует)">MOS:ALLCAPS</a>, the formal Unicode names are not used in this paragraph.</span> </span></li> <li id="cite_note-stability-policy-53"><span class="mw-cite-backlink">↑ <a href="#cite_ref-stability-policy_53-0"><sup><i><b>1</b></i></sup></a> <a href="#cite_ref-stability-policy_53-1"><sup><i><b>2</b></i></sup></a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://unicode.org/policies/stability_policy.html">Unicode Character Encoding Stability Policy</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 16 March 2010.</small></span></span> </span></li> <li id="cite_note-54"><span class="mw-cite-backlink"><a href="#cite_ref-54">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G43463">Properties</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. 
<small>Retrieved 15 March 2020.</small></span></span> </span></li> <li id="cite_note-55"><span class="mw-cite-backlink"><a href="#cite_ref-55">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://unicode.org/reports/tr17/">Unicode Character Encoding Model</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 16 March 2010.</small></span></span> </span></li> <li id="cite_note-56"><span class="mw-cite-backlink"><a href="#cite_ref-56">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://unicode.org/Public/UNIDATA/NamedSequences.txt">Unicode Named Sequences</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 16 March 2010.</small></span></span> </span></li> <li id="cite_note-57"><span class="mw-cite-backlink"><a href="#cite_ref-57">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://unicode.org/Public/UNIDATA/NameAliases.txt">Unicode Name Aliases</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. 
<small>Retrieved 16 March 2010.</small></span></span> </span></li> <li id="cite_note-58"><span class="mw-cite-backlink"><a href="#cite_ref-58">↑</a></span> <span class="reference-text"><a rel="nofollow" class="external text" href="https://www.evertype.com/standards/iso10646/pdf/cwa13873.pdf">CWA 13873:2000 – Multilingual European Subsets in ISO/IEC 10646-1</a> <a href="/w/index.php?title=European_Committee_for_Standardization&action=edit&redlink=1" class="new" title="European Committee for Standardization (страница отсутствует)">CEN</a> Workshop Agreement 13873</span> </li> <li id="cite_note-59"><span class="mw-cite-backlink"><a href="#cite_ref-59">↑</a></span> <span class="reference-text"><a rel="nofollow" class="external text" href="https://www.cl.cam.ac.uk/~mgk25/ucs/mes-2-rationale.html">Multilingual European Character Set 2 (MES-2) Rationale</a>, <a href="/w/index.php?title=Markus_Kuhn_(computer_scientist)&action=edit&redlink=1" class="new" title="Markus Kuhn (computer scientist) (страница отсутствует)">Markus Kuhn</a>, 1998</span> </li> <li id="cite_note-60"><span class="mw-cite-backlink"><a href="#cite_ref-60">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://unicode.org/faq/utf_bom.html">UTF-8, UTF-16, UTF-32 & BOM</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <i>Unicode.org FAQ</i>. <small>Retrieved 12 December 2016.</small></span></span> </span></li> <li id="cite_note-61"><span class="mw-cite-backlink"><a href="#cite_ref-61">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation">The Unicode Standard, Version 6.2. — The Unicode Consortium, 2013. — P. 561. 
— <a href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%98%D1%81%D1%82%D0%BE%D1%87%D0%BD%D0%B8%D0%BA%D0%B8_%D0%BA%D0%BD%D0%B8%D0%B3/9781936213085" class="internal mw-magiclink-isbn">ISBN 978-1-936213-08-5</a>.</span></span></span> </li> <li id="cite_note-62"><span class="mw-cite-backlink"><a href="#cite_ref-62">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><i><a href="/w/index.php?title=Rob_Pike&action=edit&redlink=1" class="new" title="Rob Pike (страница отсутствует)">Pike, Rob</a></i> <span lang="und"><a rel="nofollow" class="external text" href="https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt">UTF-8 history</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span> (30 April 2003).</span></span> </span></li> <li id="cite_note-63"><span class="mw-cite-backlink"><a href="#cite_ref-63">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14755.pdf">ISO/IEC JTC1/SC 18/WG 9 N</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. 
<small>Retrieved 4 June 2012.</small></span></span> </span></li> <li id="cite_note-64"><span class="mw-cite-backlink"><a href="#cite_ref-64">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><i>Hedley, Jonathan</i> <span lang="und"><a rel="nofollow" class="external text" href="https://unicodelookup.com/">Unicode Lookup</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span> (2009).</span></span> </span></li> <li id="cite_note-65"><span class="mw-cite-backlink"><a href="#cite_ref-65">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><i>Milde, Benjamin</i> <span lang="und"><a rel="nofollow" class="external text" href="http://shapecatcher.com/">Unicode Character Recognition</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span> (2011).</span></span> </span></li> <li id="cite_note-66"><span class="mw-cite-backlink"><a href="#cite_ref-66">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><i>Wood, Alan</i> <span lang="und"><a rel="nofollow" class="external text" href="http://www.alanwood.net/unicode/explorer.html#ie5">Setting up Windows Internet Explorer 5, 5.5 and 6 for Multilingual and Unicode Support</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. Alan Wood. 
<small>Retrieved 4 June 2012.</small></span></span> </span></li> <li id="cite_note-67"><span class="mw-cite-backlink"><a href="#cite_ref-67">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.w3.org/TR/xml11">Extensible Markup Language (XML) 1.1 (Second Edition)</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(undetermined)</span></b></span>. <small>Retrieved 1 November 2013.</small></span></span> </span></li> <li id="cite_note-68"><span class="mw-cite-backlink"><a href="#cite_ref-68">↑</a></span> <span class="reference-text"><cite class="citation journal">Bigelow, Charles; Holmes, Kris (September 1993). <a rel="nofollow" class="external text" href="http://cajun.cs.nott.ac.uk/wiley/journals/epobetan/pdf/volume6/issue3/bigelow.pdf">“The design of a Unicode font”</a> <span class="cs1-format">(PDF)</span>. <i>Electronic Publishing</i>. VOL. 
6(3), 289–305: 292.</cite><span title="ctx_ver=Z39.88-2004&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.jtitle=Electronic+Publishing&rft.atitle=The+design+of+a+Unicode+font&rft.volume=VOL.+6%283%29%2C+289%E2%80%93305&rft.pages=292&rft.date=1993-09&rft.aulast=Bigelow&rft.aufirst=Charles&rft.au=Holmes%2C+Kris&rft_id=http%3A%2F%2Fcajun.cs.nott.ac.uk%2Fwiley%2Fjournals%2Fepobetan%2Fpdf%2Fvolume6%2Fissue3%2Fbigelow.pdf&rfr_id=info%3Asid%2Fru.wikipedia.org%3A%D0%AE%D0%BD%D0%B8%D0%BA%D0%BE%D0%B4" class="Z3988"></span><style data-mw-deduplicate="TemplateStyles:r95432997">.mw-parser-output cite.citation{font-style:inherit}.mw-parser-output q{quotes:"\"""\"""'""'"}.mw-parser-output code.cs1-code{color:inherit;background:inherit;border:inherit;padding:inherit}.mw-parser-output .cs1-lock-free a{background:url("//upload.wikimedia.org/wikipedia/commons/thumb/6/65/Lock-green.svg/9px-Lock-green.svg.png")no-repeat;background-position:right .1em center}.mw-parser-output .cs1-lock-limited a,.mw-parser-output .cs1-lock-registration a{background:url("//upload.wikimedia.org/wikipedia/commons/thumb/d/d6/Lock-gray-alt-2.svg/9px-Lock-gray-alt-2.svg.png")no-repeat;background-position:right .1em center}.mw-parser-output .cs1-lock-subscription a{background:url("//upload.wikimedia.org/wikipedia/commons/thumb/a/aa/Lock-red-alt-2.svg/9px-Lock-red-alt-2.svg.png")no-repeat;background-position:right .1em center}.mw-parser-output .cs1-subscription,.mw-parser-output .cs1-registration{color:#555}.mw-parser-output .cs1-subscription span,.mw-parser-output .cs1-registration span{border-bottom:1px dotted;cursor:help}.mw-parser-output .cs1-hidden-error{display:none;font-size:100%}.mw-parser-output .cs1-visible-error{font-size:100%}.mw-parser-output .cs1-subscription,.mw-parser-output .cs1-registration,.mw-parser-output .cs1-format{font-size:95%}.mw-parser-output .cs1-kern-left,.mw-parser-output .cs1-kern-wl-left{padding-left:0.2em}.mw-parser-output 
.cs1-kern-right,.mw-parser-output .cs1-kern-wl-right{padding-right:0.2em}</style></span> </li> <li id="cite_note-69"><span class="mw-cite-backlink"><a href="#cite_ref-69">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/faq/font_keyboard.html">Fonts and keyboards</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(неопр.)</span></b></span>. Unicode Consortium (28 June 2017). <small>Дата обращения 13 октября 2019.</small></span></span> </span></li> <li id="cite_note-70"><span class="mw-cite-backlink"><a href="#cite_ref-70">↑</a></span> <span class="reference-text"><a rel="nofollow" class="external text" href="http://tronweb.super-nova.co.jp/characcodehist.html">A Brief History of Character Codes</a>, Steven J. Searle, originally written <a rel="nofollow" class="external text" href="https://web.archive.org/web/20001216022100/http://tronweb.super-nova.co.jp/characcodehist.html">1999</a>, last updated 2004</span> </li> <li id="cite_note-dw2001-71"><span class="mw-cite-backlink">↑ <a href="#cite_ref-dw2001_71-0"><sup><i><b>1</b></i></sup></a> <a href="#cite_ref-dw2001_71-1"><sup><i><b>2</b></i></sup></a></span> <span class="reference-text"><a rel="nofollow" class="external text" href="https://web.archive.org/web/20130625062705/http://www.ibm.com/developerworks/library/u-secret.html">The secret life of Unicode: A peek at Unicode's soft underbelly</a>, Suzanne Topping, 1 May 2001 <i>(Internet Archive)</i></span> </li> <li id="cite_note-72"><span class="mw-cite-backlink"><a href="#cite_ref-72">↑</a></span> <span class="reference-text"> <a rel="nofollow" class="external text" href="http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2166.doc">AFII contribution about WAVE DASH</a>, <span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external 
text" href="http://www.ingrid.org/java/i18n/unicode.html">An Unicode vendor-specific character table for japanese</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(неопр.)</span></b></span>. <i>web.archive.org</i> (22 апреля 2011). <small><a rel="nofollow" class="external text" href="https://web.archive.org/web/20110422181018/http://www.ingrid.org/java/i18n/unicode.html">Архивировано</a> 22 апреля 2011 года.</small></span></span> </span></li> <li id="cite_note-73"><span class="mw-cite-backlink"><a href="#cite_ref-73">↑</a></span> <span class="reference-text"><a rel="nofollow" class="external text" href="https://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html#s-646problem"><i>ISO 646-* Problem</i></a>, Section 4.4.3.5 of <i>Introduction to I18n</i>, Tomohiro KUBOTA, 2001</span> </li> <li id="cite_note-74"><span class="mw-cite-backlink"><a href="#cite_ref-74">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/charts/PDF/UFB50.pdf">Arabic Presentation Forms-A</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(неопр.)</span></b></span>. <small>Дата обращения 20 марта 2010.</small></span></span> </span></li> <li id="cite_note-75"><span class="mw-cite-backlink"><a href="#cite_ref-75">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/charts/PDF/UFE70.pdf">Arabic Presentation Forms-B</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(неопр.)</span></b></span>. 
<small>Дата обращения 20 марта 2010.</small></span></span> </span></li> <li id="cite_note-76"><span class="mw-cite-backlink"><a href="#cite_ref-76">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/charts/PDF/UFB00.pdf">Alphabetic Presentation Forms</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(неопр.)</span></b></span>. <small>Дата обращения 20 марта 2010.</small></span></span> </span></li> <li id="cite_note-77"><span class="mw-cite-backlink"><a href="#cite_ref-77">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><i>China.</i> <span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/L2/L2002/02455-n2558-tibetan.pdf">Proposal on Tibetan BrdaRten Characters Encoding for ISO/IEC 10646 in BMP</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(неопр.)</span></b></span> (2 December 2002).</span></span> </span></li> <li id="cite_note-78"><span class="mw-cite-backlink"><a href="#cite_ref-78">↑</a></span> <span class="reference-text"><span class="citation"><span class="citation"><i>V. S. 
Umamaheswaran.</i> <span lang="und"><a rel="nofollow" class="external text" href="https://www.unicode.org/L2/L2003/03390r-n2654.pdf">Resolutions of WG 2 meeting 44</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(неопр.)</span></b></span> (7 November 2003).</span></span> </span></li> <li id="cite_note-79"><span class="mw-cite-backlink"><a href="#cite_ref-79">↑</a></span> <span class="reference-text"><a rel="nofollow" class="external text" href="https://www.unicode.org/policies/stability_policy.html">Unicode stability policy</a></span> </li> <li id="cite_note-tn17-80"><span class="mw-cite-backlink">↑ <a href="#cite_ref-tn17_80-0"><sup><i><b>1</b></i></sup></a> <a href="#cite_ref-tn17_80-1"><sup><i><b>2</b></i></sup></a></span> <span class="reference-text"><span class="citation"><span class="citation"><span lang="und"><a rel="nofollow" class="external text" href="https://unicode.org/notes/tn27/">Unicode Technical Note #27: Known Anomalies in Unicode Character Names</a></span><span class="hidden-ref" style="display:none"><b> <span class="ref-info" style="cursor:help;" title="на неопределённом языке">(неопр.)</span></b></span>. 
<i>unicode.org</i> (10 April 2017).</span></span> </span></li> <li id="cite_note-81"><span class="mw-cite-backlink"><a href="#cite_ref-81">↑</a></span> <span class="reference-text"><a rel="nofollow" class="external text" href="https://www.unicode.org/charts/PDF/U2100.pdf">Unicode chart: "actually this has the form of a lowercase calligraphic p, despite its name"</a></span> </li> <li id="cite_note-82"><span class="mw-cite-backlink"><a href="#cite_ref-82">↑</a></span> <span class="reference-text"><a rel="nofollow" class="external text" href="https://www.unicode.org/charts/PDF/UFE10.pdf">"Misspelling of BRACKET in character name is a known defect"</a></span> </li> </ol></div> <h2><span class="mw-headline" id="External_links">External links</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=%D0%AE%D0%BD%D0%B8%D0%BA%D0%BE%D0%B4&veaction=edit&section=31" class="mw-editsection-visualeditor" title="Редактировать раздел «External links»">править</a><span class="mw-editsection-divider"> \| </span><a href="/w/index.php?title=%D0%AE%D0%BD%D0%B8%D0%BA%D0%BE%D0%B4&action=edit&section=31" title="Редактировать раздел «External links»">править код</a><span class="mw-editsection-bracket">]</span></span></h2> <style data-mw-deduplicate="TemplateStyles:r104610277">.mw-parser-output .ts-Родственные_проекты{width:19em;box-sizing:border-box;margin:0 0 .5em 1em;padding:.4em;background:#f8f9fa;border:1px solid #a2a9b1;font-size:90%}.mw-parser-output .ts-Родственные_проекты-header{margin-bottom:.2em;padding:.2em .6em;font-size:110%}.mw-parser-output .ts-Родственные_проекты ul li{display:flex;padding:.2em .6em}.mw-parser-output .ts-Родственные_проекты ul li .image{min-width:24px;display:inline-block;margin-right:.4em;flex:none;vertical-align:top;text-align:center}.mw-parser-output .ts-Родственные_проекты ul li .image img{vertical-align:middle}.mw-parser-output .ts-Родственные_проекты ul li .label{align-self:center}.mw-parser-output 
.ts-Родственные_проекты ul li hr{width:100%;margin:0}@media(max-width:719px){.mw-parser-output .ts-Родственные_проекты{width:auto;margin-left:0;margin-right:0}}</style><div class="ts-Родственные_проекты tright metadata plainlinks plainlist ruwikiWikimediaNavigation"><ul><li><span class="image"><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/7/76/Notification-icon-Commons-logo.svg/24px-Notification-icon-Commons-logo.svg.png" decoding="async" width="24" height="24" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/7/76/Notification-icon-Commons-logo.svg/36px-Notification-icon-Commons-logo.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/7/76/Notification-icon-Commons-logo.svg/48px-Notification-icon-Commons-logo.svg.png 2x" data-file-width="30" data-file-height="30" /></span><span class="label commons-ref"><a href="https://commons.wikimedia.org/wiki/Category:Unicode" class="extiw" title="commons:Category:Unicode">Медиафайлы на Викискладе</a></span></li></ul></div> <ul><li><span class="wikidata-claim" data-wikidata-property-id="P856" data-wikidata-claim-id="Q8819$b5162a84-449e-ef3a-da0d-518f577596f1"><span class="wikidata-snak wikidata-main-snak"><a rel="nofollow" class="external text" href="https://unicode.org/">unicode.org</a></span></span> — официальный сайт Юникод <span style="font-weight:bold;"> ·</span> <span class="wikidata-claim" data-wikidata-property-id="P856" data-wikidata-claim-id="Q8819$b5162a84-449e-ef3a-da0d-518f577596f1"><span class="wikidata-snak wikidata-main-snak"><a rel="nofollow" class="external text" href="https://unicode.org/">unicode.org</a></span></span> — официальный сайт Юникод</li> <li><a href="/w/index.php?title=%D0%A8%D0%B0%D0%B1%D0%BB%D0%BE%D0%BD:DMOZ&action=edit&redlink=1" class="new" title="Шаблон:DMOZ (страница отсутствует)">Шаблон:DMOZ</a></li> <li><a rel="nofollow" class="external text" href="http://www.alanwood.net/unicode/">Alan Wood's Unicode Resources</a> – Contains lists of word processors with 
Unicode capability; fonts and characters are grouped by type; characters are presented in lists, not grids.</li> <li><a rel="nofollow" class="external text" href="https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=UnicodeBMPFallbackFont">Unicode BMP Fallback Font</a> Displays the Unicode value of any character in a document, including in the Private Use Area, rather than the glyph itself.</li></ul> <p><a href="/w/index.php?title=%D0%A8%D0%B0%D0%B1%D0%BB%D0%BE%D0%BD:Unicode_navigation&action=edit&redlink=1" class="new" title="Шаблон:Unicode navigation (страница отсутствует)">Шаблон:Unicode navigation</a> <a href="/w/index.php?title=%D0%A8%D0%B0%D0%B1%D0%BB%D0%BE%D0%BD:Character_encoding&action=edit&redlink=1" class="new" title="Шаблон:Character encoding (страница отсутствует)">Шаблон:Character encoding</a> </p> <div role="navigation" class="navbox" aria-label="Навигационный шаблон" style="padding:3px"><table class="nowraplinks hlist navbox-inner" style="border-spacing:0;background:transparent;color:inherit"><tbody><tr><th scope="row" class="navbox-group" style="width:1px"><div style="padding: 0px 18px 0px 0px; width: 100%;"><div style="float: left;"><span class="noprint plainlinks nowrap" style="font-size:85%;"><a href="/wiki/%D0%A8%D0%B0%D0%B1%D0%BB%D0%BE%D0%BD:%D0%92%D0%BD%D0%B5%D1%88%D0%BD%D0%B8%D0%B5_%D1%81%D1%81%D1%8B%D0%BB%D0%BA%D0%B8" title="Просмотр этого шаблона"><img alt="⚙️" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Wikipedia_interwiki_section_gear_icon.svg/14px-Wikipedia_interwiki_section_gear_icon.svg.png" decoding="async" width="14" height="14" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Wikipedia_interwiki_section_gear_icon.svg/21px-Wikipedia_interwiki_section_gear_icon.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Wikipedia_interwiki_section_gear_icon.svg/28px-Wikipedia_interwiki_section_gear_icon.svg.png 2x" data-file-width="14" data-file-height="14" /></a> </span></div>  Тематические 
сайты</div></th><td class="navbox-list navbox-odd" style="text-align:left;border-left-width:2px;border-left-style:solid;width:100%;padding:0px"><div style="padding:0em 0.25em"><a rel="nofollow" class="external text" href="https://github.com/unicode-org">GitHub</a></div></td></tr><tr><th scope="row" class="navbox-group" style="width:1px">Словари и энциклопедии</th><td class="navbox-list navbox-even" style="text-align:left;border-left-width:2px;border-left-style:solid;width:100%;padding:0px"><div style="padding:0em 0.25em"><a rel="nofollow" class="external text" href="http://www.enciclopedia.cat/enciclopèdies/gran-enciclopèdia-catalana/EC-GEC-0262830.xml">Большая каталанская</a> · <a rel="nofollow" class="external text" href="https://www.britannica.com/topic/Unicode">Britannica (онлайн)</a></div></td></tr><tr><th scope="row" class="navbox-group" style="width:1px"><a href="/wiki/%D0%9D%D0%BE%D1%80%D0%BC%D0%B0%D1%82%D0%B8%D0%B2%D0%BD%D1%8B%D0%B9_%D0%BA%D0%BE%D0%BD%D1%82%D1%80%D0%BE%D0%BB%D1%8C" title="Нормативный контроль">Нормативный контроль</a></th><td class="navbox-list navbox-odd" style="text-align:left;border-left-width:2px;border-left-style:solid;width:100%;padding:0px"><div style="padding:0em 0.25em"><a href="/wiki/Gemeinsame_Normdatei" title="Gemeinsame Normdatei">GND</a>: <a rel="nofollow" class="external text" href="http://d-nb.info/gnd/4343497-6">4343497-6</a> · <a href="/wiki/%D0%9A%D0%BE%D0%BD%D1%82%D1%80%D0%BE%D0%BB%D1%8C%D0%BD%D1%8B%D0%B9_%D0%BD%D0%BE%D0%BC%D0%B5%D1%80_%D0%91%D0%B8%D0%B1%D0%BB%D0%B8%D0%BE%D1%82%D0%B5%D0%BA%D0%B8_%D0%9A%D0%BE%D0%BD%D0%B3%D1%80%D0%B5%D1%81%D1%81%D0%B0" title="Контрольный номер Библиотеки Конгресса">LCCN</a>: <a rel="nofollow" class="external text" href="http://id.loc.gov/authorities/sh98000843">sh98000843</a> · <a href="/wiki/Microsoft_Academic" title="Microsoft Academic">Microsoft</a>: <a rel="nofollow" class="external text" 
href="https://academic.microsoft.com/#/detail/500551929">500551929</a></div></td></tr></tbody></table></div> '
Whether the edit was made through a Tor exit node (`tor_exit_node`)	false
Unix timestamp of the change (`timestamp`)	1589714437

+{{Use dmy dates|date=May 2019|cs1-dates=y}}
-[[Файл:New Unicode logo.svg|x200px|thumb|right|Логотип Unicode Consortium]]
+{{distinguish|Unicode (telegraphy)}}
+{{For|what the term "Unicode" means in Microsoft documentation|UTF-16}}
+{{Short description|Character encoding standard}}
+{{Infobox character encoding
+| name = Unicode
+| mime =
+| alias = [[Universal Coded Character Set]] (UCS)
+| image = New Unicode logo.svg
+| caption = Logo of the [[Unicode Consortium]]
+| standard = Unicode Standard
+| lang = International
+| status =
+| encodings = [[UTF-8]], [[UTF-16]], [[GB 18030|GB18030]]<br/>'''Less common''': [[UTF-32]], [[BOCU]], [[Standard Compression Scheme for Unicode|SCSU]], [[UTF-7]]
+| encodes =
+| extends =
+| prev = [[ISO 8859]], various others
+| next =
+}}
+{{Contains special characters| special = uncommon [[Unicode]] characters}}
+'''Unicode''' is an [[information technology]] [[technical standard|standard]] for the consistent [[character encoding|encoding]], representation, and handling of [[character (computing)|text]] expressed in most of the world's [[writing system]]s. The standard is maintained by the [[Unicode Consortium]], and {{as of|March 2020|lc=y}}, there is a repertoire of {{unicodenover}} (these [[character (computing)|characters]] consist of 143,696 graphic characters and 163 format characters) covering 154 modern and historic [[script (Unicode)|scripts]], as well as multiple symbol sets and [[emoji]]. The character repertoire of the Unicode Standard is synchronized with [[ISO/IEC 10646]], and both are code-for-code identical.
+''The Unicode Standard'' consists of a set of code charts for visual reference, an encoding method and set of standard [[character encoding]]s, a set of reference [[data file]]s, and a number of related items, such as character properties, rules for [[Unicode normalization|normalization]], decomposition, [[Unicode collation algorithm|collation]], rendering, and [[bidirectional text]] display order (for the correct display of text containing both right-to-left scripts, such as [[Arabic script|Arabic]] and [[Hebrew alphabet|Hebrew]], and left-to-right scripts).<ref>{{Cite web | title = The Unicode Standard: A Technical Introduction | url = https://www.unicode.org/standard/principles.html | accessdate = 2010-03-16}}</ref>
-'''Юнико́д'''<ref name=autogenerated1>{{cite web|url=http://www.unicode.org/standard/UnicodeTranscriptions.html|title=Unicode Transcriptions|publisher=|date=|accessdate=10 May 2010|lang=en|archiveurl=https://web.archive.org/web/20060408204540/http://www.unicode.org/standard/UnicodeTranscriptions.html|archivedate=2006-04-08|deadlink=yes}}</ref> (most commonly) or '''Унико́д'''<ref>[http://www.paratype.ru/help/term/terms.asp?code=361 "Уникод" in the Paratype dictionary]</ref> ({{lang-en|Unicode}}) is a [[Набор символов|character encoding]] standard that includes the characters of almost all of the world's written [[язык|languages]]<ref name="unicode-techintro">{{cite web|url=http://www.unicode.org/standard/principles.html|title=The Unicode® Standard: A Technical Introduction|accessdate=2010-07-04|archiveurl=https://web.archive.org/web/20100310120125/http://www.unicode.org/standard/principles.html|archivedate=2010-03-10|deadlink=yes}}</ref>. The standard is currently the predominant one on the [[Интернет|Internet]].
+Unicode's success at unifying character sets has led to its widespread and predominant use in the [[internationalization and localization]] of computer [[software]]. The standard has been implemented in many recent technologies, including modern [[operating system]]s, [[XML]], [[Java (programming language)|Java]] (and other programming languages), and the [[.NET Framework]].
-The standard was proposed in [[1991 год|1991]] by the non-profit organization Unicode Consortium ({{lang-en|Unicode Consortium, Unicode Inc.}})<ref>{{cite web|url=http://www.unicode.org/history/publicationdates.html|title=History of Unicode Release and Publication Dates|accessdate=2010-07-04|archiveurl=https://web.archive.org/web/20100110085403/http://www.unicode.org/history/publicationdates.html|archivedate=2010-01-10|deadlink=yes}}</ref><ref>{{cite web|url=http://www.unicode.org/consortium/consort.html|title=The Unicode Consortium|accessdate=2010-07-04|archiveurl=https://web.archive.org/web/20100627085503/http://www.unicode.org/consortium/consort.html|archivedate=2010-06-27|deadlink=yes}}</ref>. Using this standard makes it possible to encode a very large number of characters from different writing systems: documents encoded in Unicode can combine Chinese [[иероглиф|characters]], mathematical symbols, letters of the [[греческий алфавит|Greek]], [[латинский алфавит|Latin]], and [[кириллица|Cyrillic]] alphabets, and musical notation symbols, with no need to switch [[кодовая страница|code pages]]<ref name="unicode-foreword">{{cite web|url=http://www.unicode.org/versions/Unicode5.2.0/Foreword.pdf|title=Foreword|accessdate=2010-07-04|archiveurl=https://web.archive.org/web/20100627141434/http://www.unicode.org/versions/Unicode5.2.0/Foreword.pdf|archivedate=2010-06-27|deadlink=yes}}</ref>.
+[[Comparison of Unicode encodings|Unicode can be implemented]] by different [[character encoding]]s. The Unicode standard defines [[UTF-8]], [[UTF-16]], and [[UTF-32]], and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16, and [[Universal Coded Character Set|UCS]]-2, a precursor of UTF-16 that lacks full support for Unicode. [[GB 18030|GB18030]] is standardized in China and implements Unicode fully, although it is not an official Unicode standard.
-The standard consists of two main parts: the universal character set ({{lang-en|Universal character set, UCS}}) and a family of encodings ({{lang-en|Unicode transformation format, UTF}}). The universal character set lists the characters permitted by the Unicode standard and assigns each character a code in the form of a non-negative integer, usually written in hexadecimal with the prefix <code>U+</code>, for example <code>U+040F</code>. The family of encodings defines how character codes are converted for transmission in a stream or a file.
+UTF-8, the dominant encoding on the [[World Wide Web]] (used in over 94% of websites {{As of|2019|November|df=|lc=y}}),<ref>{{Cite web|url=https://w3techs.com/technologies/cross/character_encoding/ranking|title=Usage Survey of Character Encodings broken down by Ranking|website=w3techs.com|language=en|access-date=2019-11-11}}</ref> uses one [[byte]]{{efn|The Unicode Consortium uses the ambiguous term byte; the [[International Organization for Standardization]] (ISO), the [[International Electrotechnical Commission]] (IEC) and the [[Internet Engineering Task Force]] (IETF) use the more specific term [[Octet (computing)|octet]] in current documents related to Unicode.|group=note}} for the first 128 [[code point]]s, and up to 4 bytes for other characters.<ref>{{cite web|url=https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#I1.36559|work=The Unicode Standard|title=Conformance | date=March 2020|accessdate=2020-03-15}}</ref> The first 128 Unicode code points represent the [[ASCII]] characters, which means that any ASCII text is also a UTF-8 text.
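The variable-length behaviour of UTF-8 described above can be checked with a minimal Python sketch. The helper name `utf8_length` and the sample characters are illustrative assumptions, not part of the article; only the built-in `str.encode` is used.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Hypothetical sample characters: U+0041, U+00E9, U+20AC, U+1F600.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [utf8_length(ch) for ch in samples]

# Any pure-ASCII text yields identical bytes under ASCII and UTF-8.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

print(lengths, same_bytes)
```

Code points below U+0080 take one byte, and the length grows to at most four bytes for code points outside the Basic Multilingual Plane.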
-Codes in the Unicode standard are divided into several areas. The area with codes from U+0000 to U+007F contains the characters of the [[ASCII]] set, and the codes of these characters match their ASCII codes. Next come areas for the characters of other writing systems, punctuation marks, and technical symbols. Some codes are reserved for future use<ref name='unicode-02'>{{cite web|url=http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf|title=General Structure|accessdate=2010-07-05|archiveurl=https://web.archive.org/web/20100627093139/http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf|archivedate=2010-06-27|deadlink=yes}}</ref>. Cyrillic characters are allocated the ranges U+0400 to U+052F, U+2DE0 to U+2DFF, and U+A640 to U+A69F (see [[Кириллица в Юникоде]])<ref>{{cite web|url=http://www.unicode.org/versions/Unicode5.2.0/ch07.pdf|title=European Alphabetic Scripts|accessdate=2010-07-04|archiveurl=https://web.archive.org/web/20100627140856/http://www.unicode.org/versions/Unicode5.2.0/ch07.pdf|archivedate=2010-06-27|deadlink=yes}}</ref>.
+UCS-2 uses two bytes (16&nbsp;bits) for each character but can only encode the first 65,536 code points, the so-called [[Basic Multilingual Plane]] (BMP). With 1,112,064 possible Unicode code points corresponding to characters (see [[#Architecture and terminology|below]]) on 17 planes, and with over 143,000 code points defined as of version 13.0, UCS-2 can represent fewer than half of all encoded Unicode characters. Therefore, UCS-2 is outdated, though still widely used in software. UTF-16 extends UCS-2 by using the same [[16-bit]] encoding as UCS-2 for the Basic Multilingual Plane and a 4-byte encoding for the other planes. As long as it contains no code points in the reserved range U+D800–U+DFFF, a UCS-2 text is valid UTF-16 text.
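The figures in this paragraph follow from simple arithmetic, and the 4-byte UTF-16 form is a pair of 16-bit units taken from the reserved range U+D800–U+DFFF. The sketch below is a hedged illustration: the function name `to_surrogates` and the constant names are assumptions introduced here, while the bit layout is the standard surrogate-pair scheme.

```python
from typing import Tuple


def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


TOTAL_CODE_POINTS = 17 * 0x10000                        # 17 planes of 65,536
SURROGATE_CODE_POINTS = 0xDFFF - 0xD800 + 1             # reserved range
CHARACTER_CODE_POINTS = TOTAL_CODE_POINTS - SURROGATE_CODE_POINTS

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low), CHARACTER_CODE_POINTS)
```

Subtracting the 2,048 surrogate code points from the 1,114,112 positions on 17 planes gives the 1,112,064 figure quoted above.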
-== Background and development of Unicode ==
-{{цитата|Unicode is a unique code for every character, regardless of the platform, regardless of the program, regardless of the language.|автор=Unicode Consortium<ref>[http://www.unicode.org/standard/translations/russian.html What is Unicode?]</ref>}}
-By the end of the 1980s, 8-bit encodings had become the standard; a great many already existed, and new ones kept appearing. This was due both to the widening range of supported languages and to the desire to create encodings partially compatible with one another (a typical example is the [[альтернативная кодировка|alternative encoding for Russian]], which arose from the use of Western programs written for the [[CP437]] encoding). As a result, several problems emerged:
-# the problem of incorrect decoding;
-# the problem of a limited character set;
-# the problem of converting one encoding into another;
-# the problem of font duplication.
+UTF-32 (also referred to as UCS-4) uses four bytes for each character. Like UCS-2, the number of bytes per character is fixed, facilitating character indexing; but unlike UCS-2, UTF-32 is able to encode all Unicode code points. However, because each character uses four bytes, UTF-32 takes significantly more space than other encodings, and is not widely used.
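The space trade-off between the three encoding forms can be seen by encoding one string three ways. This is a minimal sketch; the sample string is an arbitrary assumption mixing ASCII, Cyrillic, and one supplementary-plane character, and the little-endian codec names are chosen so that no byte order mark is added to the counts.

```python
# Hypothetical sample: 10 ASCII characters, 6 Cyrillic letters, 1 emoji.
text = "Unicode: \u042e\u043d\u0438\u043a\u043e\u0434 \U0001F600"

# Byte counts per encoding form (no byte order mark with the -le codecs).
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

code_points = len(text)  # Python strings are indexed by code point
print(code_points, sizes)
```

Only the UTF-32 size is a fixed multiple of the number of code points, which is what makes indexing by character trivial at the cost of extra space.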
-'''The problem of incorrect decoding''' caused a document to display foreign-language characters that were not meant to be there, or unintended [[псевдографика|box-drawing]] characters, nicknamed «кракозябры» (mojibake) by Russian-speaking users. The problem was largely due to the lack of a standardized way of specifying the encoding of a file or stream. It could be solved either by consistently adopting a standard for specifying the encoding or by adopting a single encoding common to all languages.<ref name='unicode-foreword' />
+==Origin and development==
-'''The problem of a limited character set'''<ref name='unicode-foreword' />. It could be solved either by switching fonts within a document or by adopting a "wide" encoding. Font switching had long been practised in [[текстовый процессор|word processors]], often with [[нестандартные шрифты|fonts with a non-standard encoding]], so-called "dingbat fonts". As a result, when a document was moved to another system, all non-standard characters turned into mojibake.
+Unicode has the explicit aim of transcending the limitations of traditional character encodings, such as those defined by the [[ISO/IEC 8859]] standard, which find wide usage in various countries of the world but remain largely incompatible with each other. Many traditional character encodings share a common problem in that they allow bilingual computer processing (usually using [[Latin character]]s and the local script), but not multilingual computer processing (computer processing of arbitrary scripts mixed with each other).
+Unicode, in intent, encodes the underlying characters—[[grapheme]]s and grapheme-like units—rather than the variant [[glyph]]s (renderings) for such characters. In the case of [[Chinese characters]], this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see [[Han unification]]).
-'''The problem of converting one encoding into another'''. It could be solved either by compiling conversion tables for every pair of encodings or by using an intermediate conversion into a third encoding that includes all characters of all encodings<ref>{{cite web|url=http://www.unicode.org/history/unicode88.pdf|title=Unicode 88|accessdate=2010-07-08|archiveurl=https://web.archive.org/web/20170906035012/http://unicode.org/history/unicode88.pdf|archivedate=2017-09-06|deadlink=yes}}</ref>.
+In text processing, Unicode takes the role of providing a unique ''code point''—a [[number]], not a glyph—for each character. In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, [[font]], or style) to other software, such as a [[web browser]] or [[word processor]]. This simple aim becomes complicated, however, because of concessions made by Unicode's designers in the hope of encouraging a more rapid adoption of Unicode.
-'''The problem of font duplication'''. A separate font was created for each encoding, even when the character sets of the encodings coincided partially or completely. The problem could be solved by creating "large" fonts from which the characters needed for a given encoding would then be selected. However, this required a single registry of characters in order to determine what corresponds to what.
+The first 256 code points were made identical to the content of [[ISO/IEC 8859-1]] so as to make it trivial to convert existing western text. Many essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings and therefore, allow conversion from those encodings to Unicode (and back) without losing any information. For example, the "[[Halfwidth and Fullwidth Forms (Unicode block)|fullwidth forms]]" section of code points encompasses a full duplicate of the Latin alphabet because Chinese, Japanese, and Korean ([[CJK]]) fonts contain two versions of these letters, "fullwidth" matching the width of the CJK characters, and normal width. For other examples, see [[duplicate characters in Unicode]].
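Both compatibility concessions mentioned here can be observed from Python's standard library. The sketch below assumes Python's `latin-1` codec as a stand-in for ISO/IEC 8859-1 (the codec also maps the control-code positions, which the ISO standard itself leaves undefined) and uses `unicodedata` to inspect one fullwidth letter; the variable names are illustrative.

```python
import unicodedata

# Every byte value decodes to the code point with the same number.
latin1_bytes = bytes(range(256))
decoded = latin1_bytes.decode("latin-1")
identical = all(ord(ch) == b for ch, b in zip(decoded, latin1_bytes))

# U+FF21 duplicates the ordinary letter A at a separate code point.
fullwidth_a = "\uff21"
name = unicodedata.name(fullwidth_a)
folded = unicodedata.normalize("NFKC", fullwidth_a)  # compatibility folding

print(identical, name, folded)
```

The fullwidth letter stays distinct from U+0041 unless a compatibility normalization form such as NFKC is applied, which is what allows lossless round trips to and from legacy CJK encodings.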
-The need for a single "wide" encoding was recognized. Variable-length encodings, widely used in East Asia, were judged too complex to use, so it was decided to use fixed-width characters. 32-bit characters seemed too wasteful, so 16-bit ones were chosen.
+==={{anchor|Unicode 88}}History===
-The first version of Unicode was an encoding with a fixed character size of 16 bits, giving a total of 2<sup>16</sup> ({{formatnum:65536}}) codes. Since then, characters have been denoted by four hexadecimal digits (for example, <code>U+04F0</code>). The plan was to encode in Unicode not all existing characters but only those needed in everyday use. Rarely used characters were to be placed in the "private use area" ({{lang|en|private use area}}), which originally occupied the codes <code>U+D800…U+F8FF</code>. So that Unicode could also serve as an intermediate link when converting between different encodings, it included all characters present in all the best-known encodings.
+Based on experiences with the [[Xerox Character Code Standard]] (XCCS) since 1980,<ref name="unicode-88"/> the origins of Unicode date to 1987, when [[Joe Becker (Unicode)|Joe Becker]] from [[Xerox]], together with [[Lee Collins (software engineer)|Lee Collins]] and [[Mark Davis (Unicode)|Mark Davis]] from [[Apple Inc.|Apple]], started investigating the practicalities of creating a universal character set.<ref>{{cite web |title=Summary Narrative |url=https://www.unicode.org/history/summary.html |access-date=2010-03-15}}</ref> With additional input from Peter Fenwick and Dave Opstad,<ref name="unicode-88"/> Joe Becker published a draft proposal for an "international/multilingual text character encoding system" in August 1988, tentatively called Unicode. He explained that "[t]he name 'Unicode' is intended to suggest a unique, unified, universal encoding".<ref name="unicode-88">{{Cite web |url=https://unicode.org/history/unicode88.pdf |title=Unicode 88 |author-last=Becker |author-first=Joseph D. |author-link=Joseph D. Becker |date=1998-09-10 |orig-year=1988-08-29 |edition=10th anniversary reprint |website=unicode.org |publisher=[[Unicode Consortium]] |access-date=2016-10-25 |url-status=live |archive-url=https://web.archive.org/web/20161125224409/https://unicode.org/history/unicode88.pdf |archive-date=2016-11-25 |quote=In 1978, the initial proposal for a set of "Universal Signs" was made by [[Bob Belleville]] at [[Xerox PARC]]. Many persons contributed ideas to the development of a new encoding design. Beginning in 1980, these efforts evolved into the [[Xerox Character Code Standard]] (XCCS) by the present author, a multilingual encoding which has been maintained by Xerox as an internal corporate standard since 1982, through the efforts of Ed Smura, Ron Pellar, and others.<br/>Unicode arose as the result of eight years of working experience with XCCS. 
Its fundamental differences from XCCS were proposed by Peter Fenwick and Dave Opstad (pure 16-bit codes), and by [[Lee Collins (Unicode)|Lee Collins]] (ideographic character unification). Unicode retains the many features of XCCS whose utility have been proved over the years in an international line of communication multilingual system products.}}</ref>
+In this document, entitled ''Unicode 88'', Becker outlined a [[16-bit]] character model:<ref name="unicode-88"/>
-Later, however, it was decided to encode all characters and, accordingly, to expand the code space considerably. At the same time, character codes came to be regarded not as 16-bit values but as abstract numbers that can be represented in a computer in many different ways (see [[#Способы представления|representation methods]]).
+<blockquote>
-Because a number of computer systems (for example, [[Windows NT]]<ref name="windows-nt">{{cite web|url=http://support.microsoft.com/kb/99884|title=Unicode and Microsoft Windows NT|work=Microsoft Support|lang=en|archiveurl=https://web.archive.org/web/20090926092654/http://support.microsoft.com/kb/99884|archivedate=2009-09-26|accessdate=2009-11-12|deadlink=yes}}</ref>) already used fixed 16-bit characters as their default encoding, it was decided to encode all of the most important characters within the first {{formatnum:65536}} positions (the so-called {{lang-en|basic multilingual plane, BMP}}). The remaining space is used for "supplementary characters" ({{lang-en|supplementary characters}}): writing systems of extinct languages, very rarely used [[китай|Chinese]] characters, and mathematical and musical symbols.
+Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16&nbsp;bits to encompass the characters of all the world's living languages. In a properly engineered design, 16&nbsp;bits per character are more than sufficient for this purpose.
+</blockquote>
+His original 16-bit design was based on the assumption that only those scripts and characters in modern use would need to be encoded:<ref name="unicode-88"/>
-For compatibility with older 16-bit systems, [[UTF-16]] was devised: the first {{formatnum:65536}} positions, except those in the range U+D800…U+DFFF, are represented directly as 16-bit numbers, while the rest are represented as "surrogate pairs" (the first element of the pair from the range U+D800…U+DBFF, the second from the range U+DC00…U+DFFF). The surrogate pairs took over part of the code space (2048 positions) previously reserved for private use.
+<blockquote>
-Because UTF-16 can represent only 2<sup>20</sup>+2<sup>16</sup>−2048 ({{formatnum:1112064}}) characters, that number was chosen as the final size of the Unicode code space (code range: 0x000000-0x10FFFF).
+Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2<sup>14</sup> = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally useful Unicodes.
+</blockquote>
+In early 1989, the Unicode working group expanded to include Ken Whistler and Mike Kernaghan of Metaphor, Karen Smith-Yoshimura and Joan Aliprand of [[Research Libraries Group|RLG]], and Glenn Wright of [[Sun Microsystems]], and in 1990, Michel Suignard and Asmus Freytag from [[Microsoft]] and Rick McGowan of [[NeXT]] joined the group. By the end of 1990, most of the work on mapping existing character encoding standards had been completed, and a final review draft of Unicode was ready.
-Although the Unicode code space was extended beyond 2<sup>16</sup> as early as version 2.0, the first characters in the "upper" region were not assigned until version 3.1.
+The [[Unicode Consortium]] was incorporated in California on 3 January 1991,<ref>[https://unicode.org/history/publicationdates.html History of Unicode Release and Publication Dates] on ''unicode.org.'' Retrieved February 28, 2017.</ref> and in October 1991, the first volume of the Unicode standard was published. The second volume, covering Han ideographs, was published in June 1992.
-The role of this encoding on the web is growing steadily. At the beginning of 2010, about 50% of websites used Unicode<ref>{{cite web|url=http://w3pro.ru/news/unicode-ispolzuetsya-pochti-na-50-veb-saitov|title=Unicode используется почти на 50% веб-сайтов|lang=ru|archiveurl=https://web.archive.org/web/20100611042601/http://w3pro.ru/news/unicode-ispolzuetsya-pochti-na-50-veb-saitov|archivedate=2010-06-11|accessdate=2010-02-09|deadlink=yes}}</ref>.
+In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. This increased the Unicode codespace to over a million code points, which allowed for the encoding of many historic scripts (e.g., [[Egyptian hieroglyphs]]) and thousands of rarely used or obsolete characters that had not been anticipated as needing encoding. Among the characters not originally intended for Unicode are infrequent Kanji or Chinese characters, many of which appear in personal and place names, making them far more essential than envisioned in the original architecture of Unicode despite their rarity.<ref name=unicoderevisited>{{cite web|last=Searle|first=Stephen J|title=Unicode Revisited|url=http://tronweb.super-nova.co.jp/unicoderevisited.html|accessdate=2013-01-18}}</ref>
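The surrogate mechanism and the resulting size of the codespace can be illustrated with a short Python sketch. The ranges U+D800…U+DBFF and U+DC00…U+DFFF and the figure 2<sup>20</sup>+2<sup>16</sup>−2048 come from the text; the function names are hypothetical, and the arithmetic is the standard UTF-16 mapping rather than anything specific to this article.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogates")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)      # low 10 bits -> U+DC00..U+DFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# 2^20 pair combinations plus 2^16 directly encoded values,
# minus the 2048 surrogate code points themselves.
encodable_values = 2**20 + 2**16 - 2048

high, low = to_surrogate_pair(0x13000)   # first Egyptian hieroglyph code point
print(hex(high), hex(low), encodable_values)
```

The result can be cross-checked against Python's own codec: `"\U00013000".encode("utf-16-be")` yields the same two 16-bit units.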
-== Unicode versions ==
-Work on refining the standard continues. New versions are released as the character tables are changed and extended. New [[Международная организация по стандартизации|ISO]]/IEC 10646 documents are issued in parallel.
+The Microsoft TrueType specification version 1.0 from 1992 used the name ''Apple Unicode'' instead of ''Unicode'' for the Platform ID in the naming table.
-The first standard was released in 1991; the most recent, as of this writing, in 2020. Versions 1.0 to 5.0 of the standard were published as books and have [[ISBN]]s<ref>[http://www.unicode.org/history/publicationdates.html History of Unicode Release and Publication Dates]</ref><ref>[http://www.unicode.org/versions/enumeratedversions.html Enumerated Versions]</ref>.
+===Unicode Consortium===
-The version number of the standard consists of three numbers (for example, 3.1.1). The third number changes when small changes that add no new characters are made to the standard (the exception is version 1.0.1, which added {{iw|Унифицированные идеограммы ККЯ|unified ideographs for Chinese, Japanese, and Korean|en|CJK Unified Ideographs}})<ref>[http://www.unicode.org/versions/index.html About Versions]</ref>.
+{{Main|Unicode Consortium}}
+The Unicode Consortium is a nonprofit organization that coordinates Unicode's development. Full members include most of the main computer software and hardware companies with any interest in text-processing standards, including [[Adobe Inc.|Adobe]], [[Apple Inc.|Apple]], [[Google]], [[International Business Machines|IBM]], [[Microsoft]], and [[Oracle Corporation]].<ref name="members">{{cite web
-The [http://www.unicode.org/ucd/ Unicode Character Database] is available for all versions on the official website in both plain-text and XML formats. The files are distributed under a BSD-like [http://www.unicode.org/copyright.html license].
+| title = The Unicode Consortium Members
+| url = https://unicode.org/consortium/members.html
+| accessdate = 2019-01-04}}</ref>
+Over the years several countries or government agencies have been members of the Unicode Consortium. Presently only the [[Ministry of Endowments and Religious Affairs (Oman)]] is a full member with voting rights.<ref name="members" />
-{{Временная линия Юникода}}
+The Consortium has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard [[#Unicode Transformation Format and Universal Character Set|Unicode Transformation Format]] (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with [[multilingualism|multilingual]] environments.
+===Scripts covered===
+{{Main|Script (Unicode)}}
+[[File:Unicode sample.png|thumb|right|200px|Many modern applications can render a substantial subset of the many [[scripts in Unicode]], as demonstrated by this screenshot from the [[OpenOffice.org]] application.]]<!-- screenshot fair use rationale: this screenshot is used specifically to illustrate the Unicode-related capabilities of modern desktop applications and the breadth of supported Unicode scripts -->
+Unicode covers almost all scripts ([[writing system]]s) in current use.<ref>{{cite web
+| title = Character Code Charts
+| url = https://www.unicode.org/charts/
+| accessdate = 2010-03-17}}
+</ref>{{failed verification|date=October 2013}}<ref>{{Cite web|url=https://home.unicode.org/basic-info/faq/|title=Unicode FAQ|last=|first=|date=|website=|url-status=live|archive-url=|archive-date=|access-date=2020-04-02}}</ref>
+A total of 154 [[Script (Unicode)|scripts]] are included in the latest version of Unicode (covering [[alphabet]]s, [[abugida]]s and [[Syllabary|syllabaries]]), although there are still scripts that are not yet encoded, particularly those mainly used in historical, liturgical, and academic contexts. Further additions of characters to the already encoded scripts, as well as symbols, in particular for mathematics and [[musical notation|music]] (in the form of notes and rhythmic symbols), also occur.
+The Unicode Roadmap Committee ([[Michael Everson]], Rick McGowan, Ken Whistler, V.S. Umamaheswaran<ref>{{Cite web | title=Roadmap to the BMP | url=https://www.unicode.org/roadmaps/bmp/ | publisher=[[Unicode Consortium]] | accessdate=30 July 2018 }}</ref>) maintains the list of scripts that are candidates or potential candidates for encoding and their tentative code block assignments on the [https://www.unicode.org/roadmaps/ Unicode Roadmap] page of the [[Unicode Consortium]] Web site. For some scripts on the Roadmap, such as [[Jurchen script|Jurchen]] and [[Khitan small script]], encoding proposals have been made and they are working their way through the approval process. For other scripts, such as [[Maya script|Mayan]] (besides numbers) and [[Rongorongo]], no proposal has yet been made, and they await agreement on character repertoire and other details from the user communities involved.
+Some modern invented scripts which have not yet been included in Unicode (e.g., [[Tengwar]]) or which do not qualify for inclusion in Unicode due to lack of real-world use (e.g., [[Klingon scripts|Klingon]]) are listed in the [[ConScript Unicode Registry]], along with unofficial but widely used [[Private Use Areas]] code assignments.
+There is also a [[Medieval Unicode Font Initiative]] focused on special Latin medieval characters. Some of these proposals have already been included in Unicode.
+The [https://linguistics.berkeley.edu/sei/ Script Encoding Initiative], a project run by Deborah Anderson at the [[University of California, Berkeley]], was founded in 2002 with the goal of funding proposals for scripts not yet encoded in the standard. The project has become a major source of proposed additions to the standard in recent years.<ref>{{cite web|url=https://www.unicode.org/pending/about-sei.html |title=About The Script Encoding Initiative |publisher=The Unicode Consortium |date= |accessdate=2012-06-04}}</ref>
+==={{anchor|1.0.0|1.0.1|1.1|2.0|2.1|3.0|3.1|3.2|4.0|4.1|5.0|5.1|5.2|6.0|6.1|6.2|6.3|7.0|8.0|9.0|10.0|11.0|12.0|12.1|13.0|14.0}}Versions===
+Unicode is developed in conjunction with the [[International Organization for Standardization]] and shares the character repertoire with [[ISO/IEC 10646]]: the Universal Character Set. Unicode and ISO/IEC 10646 function equivalently as character encodings, but ''The Unicode Standard'' contains much more information for implementers, covering—in depth—topics such as bitwise encoding, [[Unicode collation algorithm|collation]] and rendering. The Unicode Standard enumerates a multitude of character properties, including those needed for supporting [[Bi-directional text|bidirectional text]]. The two standards do use slightly different terminology.
+The Unicode Consortium first published ''The Unicode Standard'' in 1991 (version 1.0), and has published new versions on a regular basis since then. The latest version of the Unicode Standard, version 13.0, was released in March 2020, and is available in electronic format from the consortium's website. The last version of the standard that was published completely in book form (including the code charts) was version 5.0 in 2006, but since version 5.2 (2009) the core specification of the standard has been published as a print-on-demand paperback.<ref name=version6.1PoD>{{cite web|title=Unicode 6.1 Paperback Available|url=https://www.unicode.org/mail-arch/unicode-ml/y2012-m05/0240.html|work=announcements_at_unicode.org|accessdate=2012-05-30}}</ref> The entire text of each version of the standard, including the core specification, standard annexes and code charts, is freely available in [[PDF]] format on the Unicode website.
+In April 2020, Unicode announced that the release of the forthcoming version 14.0 had been postponed by six months from its originally planned release date of March 2021 due to the [[COVID-19 pandemic]].<ref>{{cite web|title=Unicode 14.0 Delayed for 6 Months|url=https://home.unicode.org/unicode-14-0-delayed-for-6-months/|accessdate=2020-05-05}}</ref>
+Thus far, the following major and minor versions of the Unicode standard have been published. Update versions, which do not include any changes to character repertoire, are signified by the third number (e.g., "version 4.0.1") and are omitted in the table below.<ref>{{cite web | title = Enumerated Versions of The Unicode Standard | url = https://www.unicode.org/versions/enumeratedversions.html | accessdate = 2016-06-21}}</ref>
 {| class="wikitable"
 |-
+|+ Unicode versions
-|+ Unicode versions
+|-
+!rowspan=2| Version
+!rowspan=2| Date
+!rowspan=2| Book
+!rowspan=2| Corresponding [[Universal Character Set|ISO/IEC 10646]] edition
+!rowspan=2| [[Script (Unicode)|Scripts]]
+!colspan=2| Characters
 |-
+! Total{{refn|The number of characters listed for each version of Unicode is the total number of graphic and format characters (i.e., excluding [[Private Use Area|private-use characters]], [[Unicode control characters|control characters]], [[noncharacter|noncharacters]] and [[surrogate code points]]).|group=tablenote}}
-! Version number
+! Notable additions
-! Publication date
-! Book [[Международный стандартный книжный номер|ISBN]]
-! ISO/IEC 10646 edition
-! Number of [[Письменность|scripts]]
-! Number of characters<ref group="A">'''Including''' graphic ({{lang-en|graphic}}), control ({{lang-en|control}}) and format ({{lang-en|format}}) characters; '''excluding''' [[Области для частного использования|private-use characters]] ({{Lang-en|private-use}}), noncharacters ({{Lang-en|noncharacters}}) and surrogates ({{lang-en|surrogate code points}}).</ref>
-! Changes
 |-
+| 1.0.0
-| 1.0.0<ref>{{cite web|title=Unicode® 1.0|url=http://www.unicode.org/versions/Unicode1.0.0/|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| October 1991
+| October 1991
-| ISBN 0-201-56788-1 (Vol.1)
+| {{ISBN|0-201-56788-1}} (Vol. 1)
 |
-| {{formatnum:24}}
+| 24
+| 7,129
-| {{formatnum:7161}}
+| Initial repertoire covers these scripts: [[Arabic script|Arabic]], [[Armenian alphabet|Armenian]], [[Bengali alphabet|Bengali]], [[Zhuyin|Bopomofo]], [[Cyrillic script|Cyrillic]], [[Devanagari]], [[Georgian alphabet|Georgian]], [[Greek alphabet|Greek and Coptic]], [[Gujarati alphabet|Gujarati]], [[Gurmukhi script|Gurmukhi]], [[Hangul]], [[Hebrew alphabet|Hebrew]], [[Hiragana]], [[Kannada alphabet|Kannada]], [[Katakana]], [[Lao script|Lao]], [[Latin script|Latin]], [[Malayalam script|Malayalam]], [[Oriya script|Oriya]], [[Tamil script|Tamil]], [[Telugu script|Telugu]], [[Thai alphabet|Thai]], and [[Tibetan script|Tibetan]].<ref>{{cite web| title = Unicode Data 1.0.0|url= https://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt| accessdate = 2010-03-16}}</ref>
-| Unicode initially contained characters from the following scripts: [[арабское письмо|Arabic]], [[армянское письмо|Armenian]], [[бенгальское письмо|Bengali]], [[Чжуинь|Bopomofo]], [[кириллица|Cyrillic]], [[деванагари|Devanagari]], [[грузинское письмо|Georgian]], [[Греческий алфавит|Greek and Coptic]], [[Гуджарати (письмо)|Gujarati]], [[гурмукхи|Gurmukhi]], [[хангыль|Hangul]], [[Еврейский алфавит|Hebrew]], [[хирагана|Hiragana]], [[Каннада (письмо)|Kannada]], [[катакана|Katakana]], [[лаосское письмо|Lao]], [[Латинский алфавит|Latin]], [[Малаялам (письмо)|Malayalam]], [[Ория (письмо)|Oriya]], [[тамильское письмо|Tamil]], [[Телугу (письмо)|Telugu]], [[тайское письмо|Thai]], and [[тибетское письмо|Tibetan]]<ref>{{cite web
-| title = Unicode Data 1.0.0
-| url = http://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt
-| lang = en
-| accessdate = 2017-12-04
-}}</ref>
 |-
 | 1.0.1
-| June 1992
+| June 1992
-| ISBN 0-201-60845-6 (Vol.2)
+| {{ISBN|0-201-60845-6}} (Vol. 2)
 |
-| {{formatnum:25}}
+| 25
+| 28,327<br />(21,204 added; 6 removed)
-| {{formatnum:28359}}
+| The initial set of 20,902 [[CJK Unified Ideographs]] is defined.<ref>
-| Added {{formatnum:20902}} {{iw|Унифицированные идеограммы ККЯ|unified ideographs for Chinese, Japanese, and Korean|en|CJK Unified Ideographs}}<ref>{{cite web
+{{cite web
 | title = Unicode Data 1.0.1
-| url = http://www.unicode.org/Public/reconstructed/1.0.1/UnicodeData.txt
+| url = https://www.unicode.org/Public/reconstructed/1.0.1/UnicodeData.txt
+| accessdate = 2010-03-16}}</ref>
-| lang = en
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 1.1
-| 1.1<ref>{{cite web|title=Unicode® 1.1|url=http://www.unicode.org/versions/Unicode1.1.0/|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| June 1993
+| June 1993
 |
 | ISO/IEC 10646-1:1993
-| {{formatnum:24}}
+| 24
+| 34,168<br />(5,963 added; 89 removed; 33 reclassified as control characters)
-| {{formatnum:34233}}
+| 4,306 more [[Hangul]] syllables added to original set of 2,350 characters. [[Tibetan script|Tibetan]] removed.<ref>{{cite web
-| Added {{formatnum:4306}} [[Хангыль|Hangul]] syllables, supplementing the {{formatnum:2350}} characters already in the encoding. [[Тибетское письмо|Tibetan]] characters removed<ref>{{cite web
 | title = Unicode Data 1995
-| url = http://www.unicode.org/Public/1.1-Update/UnicodeData-1.1.5.txt
+| url = https://www.unicode.org/Public/1.1-Update/UnicodeData-1.1.5.txt
+| accessdate = 2010-03-16 }}
-| lang = en
+</ref>
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 2.0
-| 2.0<ref>{{cite web|title=Unicode 2.0.0|url=http://www.unicode.org/versions/Unicode2.0.0/|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| July 1996
+| July 1996
-| ISBN 0-201-48345-9
+| {{ISBN|0-201-48345-9}}
-| ISO/IEC 10646-1:1993 and Amendments 5, 6, 7
+| ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7
-| {{formatnum:25}}
+| 25
+| 38,885<br />(11,373 added; 6,656 removed)
-| {{formatnum:38950}}
+| Original set of [[Hangul]] syllables removed, and a new set of 11,172 Hangul syllables added at a new location. [[Tibetan script|Tibetan]] added back in a new location and with a different character repertoire. Surrogate character mechanism defined, and Plane 15 and Plane 16 [[Private use (Unicode)|Private Use Areas]] allocated.<ref>{{cite web
-| The previously added [[Хангыль|Hangul]] syllables were removed, and {{formatnum:11172}} new Hangul syllables were added at new code points. The previously removed [[Тибетское письмо|Tibetan]] characters were restored; they received new codes and were placed in different tables. The surrogate ({{lang-en|surrogate}}) character mechanism was introduced. Space was allocated for planes ({{lang-en|planes}}) [[Области для частного использования|15 and 16]]<ref>{{cite web
-| title = Unicode Data 2.0.14
+| title = Unicode Data-2.0.14
-| url = http://www.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt
+| url = https://www.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt
+| accessdate = 2010-03-16}}
-| lang = en
+</ref>
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 2.1
-| 2.1<ref>{{cite web|title=Unicode 2.1.0|url=http://www.unicode.org/versions/Unicode2.1.0/|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| May 1998
+| May 1998
 |
-| ISO/IEC 10646-1:1993, Amendments 5, 6, 7, two characters from Amendment 18
+| ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7, as well as two characters from Amendment 18
 | 25
+| 38,887<br />(2 added)
-| {{formatnum:38952}}
-| [[символ евро|Euro sign]] added<ref>{{cite web
+| [[Euro sign]] and [[Specials (Unicode block)|Object Replacement Character]] added.<ref>{{cite web
-| title = Unicode Data 2.1.2
+| title = Unicode Data-2.1.2
-| url = http://www.unicode.org/Public/2.1-Update/UnicodeData-2.1.2.txt
+| url = https://www.unicode.org/Public/2.1-Update/UnicodeData-2.1.2.txt
+| accessdate = 2010-03-16}}
-| lang = en
+</ref>
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 3.0
-| 3.0<ref>{{cite web|title=Unicode 3.0.0|url=http://www.unicode.org/versions/Unicode3.0.0/|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| September 1999
+| September 1999
-| ISBN 0-201-61633-5
+| {{ISBN|0-201-61633-5}}
 | ISO/IEC 10646-1:2000
-| {{formatnum:38}}
+| 38
+| 49,194<br />(10,307 added)
-| {{formatnum:49259}}
+| [[Cherokee syllabary|Cherokee]], [[Ge'ez alphabet|Ethiopic]], [[Khmer script|Khmer]], [[Mongolian script|Mongolian]], [[Burmese script|Burmese]], [[Ogham]], [[Runic alphabet|Runic]], [[Sinhala script|Sinhala]], [[Syriac alphabet|Syriac]], [[Tāna|Thaana]], [[Canadian Aboriginal syllabics|Unified Canadian Aboriginal Syllabics]], and [[Yi script|Yi Syllables]] added, as well as a set of [[Braille]] patterns.<ref>{{cite web
-| Added [[Чероки (письмо)|Cherokee]], [[эфиопское письмо|Ethiopic]], [[кхмерское письмо|Khmer]], [[монгольские письменности|Mongolian]], [[бирманское письмо|Burmese]], [[огамическое письмо|Ogham]], [[руны|Runic]], [[сингальское письмо|Sinhala]], [[сирийское письмо|Syriac]], [[Тана (письмо)|Thaana]], [[канадское слоговое письмо|Canadian Aboriginal syllabics]], and [[письмо и|Yi]], as well as [[Шрифт Брайля|Braille]] patterns<ref>{{cite web
-| title = Unicode Data 3.0.0
+| title = Unicode Data-3.0.0
-| url = http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt
+| url = https://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt
+| accessdate = 2010-03-16}}
-| lang = en
+</ref>
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 3.1
-| 3.1<ref>{{cite web|title=Unicode 3.1.0|url=http://www.unicode.org/versions/Unicode3.1.0/|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| March 2001
+| March 2001
 |
 | ISO/IEC 10646-1:2000
 ISO/IEC 10646-2:2001
-| {{formatnum:41}}
+| 41
+| 94,140<br />(44,946 added)
-| {{formatnum:94205}}
+| [[Deseret alphabet|Deseret]], [[Gothic alphabet|Gothic]] and [[Old Italic alphabet|Old Italic]] added, as well as sets of symbols for [[Modern musical symbols|Western music]] and [[Byzantine music]], and 42,711 additional [[CJK Unified Ideographs]].<ref>{{cite web
-| Added [[Дезеретский алфавит|Deseret]], [[готское письмо|Gothic]], and {{iw|древнеиталийское письмо|Old Italic|en|Old Italic alphabet}}, as well as symbols for [[Современная музыкальная нотация|Western]] and [[Византийская музыка|Byzantine]] music and {{formatnum:42711}} {{iw|Унифицированные идеограммы ККЯ|unified ideographs for Chinese, Japanese, and Korean|en|CJK Unified Ideographs}}. Space was allocated for planes [[Плоскость (Юникод)#Дополнительная многоязычная плоскость|1]], [[Плоскость (Юникод)#Дополнительная идеографическая плоскость|2]], and [[Плоскость (Юникод)#Специализированная дополнительная плоскость|14]]<ref>{{cite web
-| title = Unicode Data 3.1.0
+| title =Unicode Data-3.1.0
-| url = http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt
+| url =https://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt
+| accessdate = 2010-03-16 }}
-| lang = en
+</ref>
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 3.2
-| 3.2<ref>{{cite web|title=Unicode 3.2.0|url=http://www.unicode.org/versions/Unicode3.2.0/|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| March 2002
+| March 2002
 |
-| ISO/IEC 10646-1:2000 and Amendment 1
+| ISO/IEC 10646-1:2000 plus Amendment 1
 ISO/IEC 10646-2:2001
-| {{formatnum:45}}
+| 45
+| 95,156<br />(1,016 added)
-| {{formatnum:95221}}
+| [[Philippines|Philippine]] scripts [[Buhid script|Buhid]], [[Hanunó'o script|Hanunó'o]], [[Baybayin|Tagalog]], and [[Tagbanwa script|Tagbanwa]] added.<ref>{{cite web
-| Added [[Бухид (письмо)|Buhid]], {{iw|Письмо хануноо|Hanunó'o|en|Hanunó'o script}}, [[байбайин|Baybayin]], and [[Тагбанва (письмо)|Tagbanwa]]<ref>{{cite web
-| title = Unicode Data 3.2.0
+| title = Unicode Data-3.2.0
-| url = http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt
+| url = https://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt
+| accessdate = 2010-03-16}}
-| lang = en
+</ref>
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 4.0
-| 4.0<ref>{{cite web|title=Unicode 4.0.0|url=http://www.unicode.org/versions/Unicode4.0.0/|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| April 2003
+| April 2003
-| ISBN 0-321-18578-1
+| {{ISBN|0-321-18578-1}}
 | ISO/IEC 10646:2003
-| {{formatnum:52}}
+| 52
+| 96,382<br />(1,226 added)
-| {{formatnum:96447}}
+| [[Cypriot syllabary]], [[Limbu script|Limbu]], [[Linear B]], [[Osmanya script|Osmanya]], [[Shavian alphabet|Shavian]], [[Tai Nüa language#Writing system|Tai Le]], and [[Ugaritic alphabet|Ugaritic]] added, as well as [[Hexagram (I Ching)|Hexagram symbols]].<ref>{{cite web
-| Added [[кипрское письмо|Cypriot syllabary]], [[Лимбу (письмо)|Limbu]], [[линейное письмо Б|Linear B]], [[сомалийское письмо|Somali script]], [[Алфавит Шоу|Shavian]], [[Тай-ныа#Письменность|Tai Le]], and [[угаритское письмо|Ugaritic]], as well as [[Гексаграмма (Ицзин)|hexagram]] symbols<ref>{{cite web
-| title = Unicode Data 4.0.0
+| title = Unicode Data-4.0.0
-| url = http://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt
+| url = https://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt
+| accessdate = 2010-03-16}}
-| lang = en
+</ref>
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 4.1
-| 4.1<ref>{{cite web|title=Unicode 4.1.0|url=http://www.unicode.org/versions/Unicode4.1.0/|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| March 2005
+| March 2005
 |
-| ISO/IEC 10646:2003 and Amendment 1
+| ISO/IEC 10646:2003 plus Amendment 1
-| {{formatnum:59}}
+| 59
+| 97,655<br />(1,273 added)
-| {{formatnum:97720}}
+| [[Lontara alphabet|Buginese]], [[Glagolitic alphabet|Glagolitic]], [[Kharoṣṭhī|Kharoshthi]], [[New Tai Lue alphabet|New Tai Lue]], [[Old Persian cuneiform script|Old Persian]], [[Sylheti Nagari|Syloti Nagri]], and [[Tifinagh]] added, and [[Coptic alphabet|Coptic]] was disunified from [[Greek alphabet|Greek]]. Ancient [[Unicode numerals#Ancient Greek numerals|Greek numbers]] and [[Musical notation#Ancient Greece|musical symbols]] were also added.<ref>{{cite web|url=https://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt|title=Unicode Data-4.1.0|accessdate=2010-03-16}}
-| Added [[Лонтара|Lontara]], [[глаголица|Glagolitic]], [[Кхароштхи|Kharoshthi]], [[новое письмо лы|New Tai Lue]], [[древнеперсидская клинопись|Old Persian cuneiform]], [[силхетское нагари|Sylheti Nagari]], and [[древнеливийское письмо|Old Libyan script]]. [[Коптское письмо|Coptic]] characters were separated from [[Греческий алфавит|Greek]] characters. Also added were [[Аттическая система счисления|ancient Greek numerals]], ancient Greek musical symbols, and the [[символ гривны|hryvnia sign]] (the currency of [[Украина|Ukraine]])<ref>{{cite web
+</ref>
-| title = Unicode Data 4.1.0
-| url = http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt
-| lang = en
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 5.0
-| 5.0<ref>{{cite web|title=Unicode 5.0.0|url=http://www.unicode.org/versions/Unicode5.0.0/|date=2006-07-14|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| July 2006
+| July 2006
-| ISBN 0-321-48091-0
+| {{ISBN|0-321-48091-0}}
-| ISO/IEC 10646:2003, Amendments 1, 2, four characters from Amendment 3
+| ISO/IEC 10646:2003 plus Amendments 1 and 2, as well as four characters from Amendment 3
-| {{formatnum:64}}
+| 64
+| 99,024<br />(1,369 added)
-| {{formatnum:99089}}
+| [[Balinese alphabet|Balinese]], [[Cuneiform]], [[N'Ko alphabet|N'Ko]], [[Phags-pa script|Phags-pa]], and [[Phoenician alphabet|Phoenician]] added.<ref>{{cite web
-| Added [[балийское письмо|Balinese]], [[клинопись|Cuneiform]], [[Нко (письмо)|N'Ko]], [[монгольское квадратное письмо|Phags-pa]], and [[финикийское письмо|Phoenician]]<ref>{{cite web
 | title = Unicode Data 5.0.0
-| url = http://www.unicode.org/Public/5.0.0/ucd/UnicodeData.txt
+| url = https://www.unicode.org/Public/5.0.0/ucd/UnicodeData.txt
+| accessdate = 2010-03-17}}
-| lang = en
+</ref>
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 5.1
-| 5.1<ref>{{cite web|title=Unicode 5.1.0|url=http://www.unicode.org/versions/Unicode5.1.0/|date=2008-04-04|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| April 2008
+| April 2008
 |
-| ISO/IEC 10646:2003 and Amendments 1, 2, 3, 4
+| ISO/IEC 10646:2003 plus Amendments 1, 2, 3 and 4
-| {{formatnum:75}}
+| 75
+| 100,648<br />(1,624 added)
-| {{formatnum:100713}}
+| [[Carian script|Carian]], [[Cham alphabet|Cham]], [[Kayah Li script|Kayah Li]], [[Lepcha script|Lepcha]], [[Lycian script|Lycian]], [[Lydian script|Lydian]], [[Ol Chiki script|Ol Chiki]], [[Rejang script|Rejang]], [[Saurashtra script|Saurashtra]], [[Sundanese script|Sundanese]], and [[Vai syllabary|Vai]] added, as well as sets of symbols for the [[Phaistos Disc]], [[Mahjong|Mahjong tiles]], and [[Dominoes|Domino tiles]]. There were also important additions for [[Burmese script|Burmese]], additions of letters and [[Scribal abbreviation]]s used in medieval [[manuscript]]s, and the addition of [[Capital ẞ]].<ref>{{cite web
-| Добавлены [[карийское письмо]], [[чамская письменность]], [[Кая-ли|письмо кая-ли]], [[Лепча (письмо)|письмо лепча]], [[Ликийский алфавит|ликийское письмо]], [[Лидийский алфавит|лидийское письмо]], [[Ол-чики|письмо ол-чики]], [[реджангское письмо]], [[Саураштра (письмо)|письмо саураштра]], [[сунданское письмо]],[[Древнетюркское письмо]] и [[Ваи (письмо)|письмо ваи]]. Добавлены [[Фестский диск|символы фестского диска]], символы костей для [[маджонг]]а и [[домино]], [[заглавная буква эсцет]] (ẞ), а также буквы латиницы, использовавшиеся в средневековых [[Рукопись|рукописях]] для {{iw|аббревация|аббревиации|en|Scribal abbreviation}}. Новыми символами дополнен набор символов [[Бирманское письмо|бирманского письма]]<ref>{{cite web
 | title = Unicode Data 5.1.0
-| url =  http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
+| url = https://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
+| accessdate = 2010-03-17 }}
-| lang = en
+</ref>
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 5.2
-| 5.2<ref>{{cite web|title=Unicode® 5.2.0|url=http://www.unicode.org/versions/Unicode5.2.0/|date=2009-10-01|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| Октябрь 2009
+| October 2009
+| {{ISBN|978-1-936213-00-9}}
-|
-| ISO/IEC 10646:2003 и Amendments 1, 2, 3, 4, 5, 6
+| ISO/IEC 10646:2003 plus Amendments 1, 2, 3, 4, 5 and 6
-| {{formatnum:90}}
+| 90
+| 107,296<br />(6,648 added)
-| {{formatnum:107361}}
+| [[Avestan alphabet|Avestan]], [[Bamum script|Bamum]], [[Egyptian hieroglyphs]] (the [[Gardiner's sign list|Gardiner Set]], comprising 1,071 characters), [[Imperial Aramaic]], [[Inscriptional Pahlavi]], [[Inscriptional Parthian]], [[Javanese script|Javanese]], [[Kaithi]], [[Fraser alphabet|Lisu]], [[Meitei Mayek script|Meetei Mayek]], [[South Arabian alphabet|Old South Arabian]], [[Old Turkic script|Old Turkic]], [[Samaritan script|Samaritan]], [[Tai Tham script|Tai Tham]] and [[Tai Viet script|Tai Viet]] added. 4,149 additional [[CJK Unified Ideographs]] (CJK-C), as well as extended Jamo for [[Hangul|Old Hangul]], and characters for [[Vedic Sanskrit]].<ref>{{cite web
-| Добавлены [[Авестийский алфавит|авестийское письмо]], [[Бамум (письменность)|письмо бамум]], [[египетское иероглифическое письмо]] (по {{iw|список Гардинера|списку Гардинера|en|Gardiner's sign list}}, содержащему {{formatnum:1071}} символ), [[имперское арамейское письмо]], {{iw|пахлевийское эпиграфическое письмо||en|Inscriptional Pahlavi}}, {{iw|парфянское эпиграфическое письмо||en|Inscriptional Parthian}}, [[яванское письмо]], [[Кайтхи|письмо кайтхи]], [[Алфавит Фрейзера|письмо лису]], [[Манипури (письмо)|письмо манипури]], [[южноаравийское письмо]], [[древнетюркское письмо]], [[самаритянское письмо]], [[Ланна (письмо)|письмо ланна]] и {{iw|письмо тай-вьет||en|Tai Viet script}}. Добавлены {{formatnum:4149}} новых {{iw|унифицированные идеограммы китайского, японского, корейского письма|унифицированных идеограмм китайского, японского и корейского письма|en|CJK Unified Ideographs}} (CJK-C), символы [[Ведийский язык|ведийского письма]], [[символ тенге]] (валюты [[Казахстан]]а), а также расширен набор символов чамо [[Хангыль|старого хангыля]]<ref>{{cite web
 | title = Unicode Data 5.2.0
-| url = http://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt
+| url = https://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt
+| accessdate = 2010-03-17}}
-| lang = en
+</ref>
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 6.0
-| 6.0<ref>{{cite web|title=Unicode® 6.0.0|url=http://www.unicode.org/versions/Unicode6.0.0/|date=2010-10-11|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| Октябрь 2010
+| October 2010
+| {{ISBN|978-1-936213-01-6}}
-|
-| ISO/IEC 10646:2010 и [[символ индийской рупии]]
+| ISO/IEC 10646:2010 plus the [[Indian rupee sign]]
-| {{formatnum:93}}
+| 93
+| 109,384<br />(2,088 added)
-| {{formatnum:109449}}
+| [[Batak alphabet|Batak]], [[Brāhmī script|Brahmi]], [[Mandaic alphabet|Mandaic]], [[playing card]] symbols, [[Traffic sign|transport]] and [[map]] symbols, [[alchemical symbol]]s, [[emoticons]] and [[emoji]]. 222 additional [[CJK Unified Ideographs]] (CJK-D) added.<ref>{{cite web
-| Добавлены [[батакское письмо]], [[Брахми|письмо брахми]], [[мандейское письмо]]. Добавлены символы [[Игральные карты|игральных карт]], [[Дорожный знак|дорожных знаков]], [[Географическая карта|географических карт]], [[Алхимические символы|алхимии]], [[эмотикон]]а и [[эмодзи]], а также {{formatnum:222}} {{iw|унифицированные идеограммы китайского, японского и корейского письма||en|CJK Unified Ideographs}} (CJK-D)<ref>{{cite web
 | title = Unicode Data 6.0.0
-| url = http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt
+| url = https://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt
+| accessdate = 2010-10-11}}
-| lang = en
+</ref>
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 6.1
-| 6.1<ref>{{cite web|title=Unicode® 6.1.0|url=http://www.unicode.org/versions/Unicode6.1.0/|date=2012-01-31|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| Январь 2012
+| January 2012
+| {{ISBN|978-1-936213-02-3}}
-|
 | ISO/IEC 10646:2012
-| {{formatnum:100}}
+| 100
+| 110,116<br />(732 added)
-| {{formatnum:110181}}
+| [[Chakma alphabet|Chakma]], [[Meroitic alphabet|Meroitic cursive]], [[Meroitic alphabet|Meroitic hieroglyphs]], [[Pollard script|Miao]], [[Śāradā script|Sharada]], [[Sora Sompeng]], and [[Takri alphabet|Takri]].<ref>{{cite web
-| Добавлены [[Чакма (письмо)|письмо чакма]], [[Мероитское письмо|мероитский курсив и мероитские иероглифы]], [[Письмо Полларда|письмо мяо]], [[Шарада (письмо)|письмо шарада]], {{iw|письмо соранг-сомпенг||en|Sora Sompeng}} и [[Такри|письмо такри]]<ref>{{cite web
 | title = Unicode Data 6.1.0
-| url = http://www.unicode.org/Public/6.1.0/ucd/UnicodeData.txt
+| url = https://www.unicode.org/Public/6.1.0/ucd/UnicodeData.txt
+| accessdate = 2012-01-31}}
-| lang = en
+</ref>
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 6.2
-| 6.2<ref>{{cite web|title=Unicode® 6.2.0|url=http://www.unicode.org/versions/Unicode6.2.0/|date=2012-09-26|work={{нп5|Unicode Consortium}}|accessdate=2017-12-07|language=en}}</ref>
-| Сентябрь 2012
+| September 2012
+| {{ISBN|978-1-936213-07-8}}
-|
-| ISO/IEC 10646:2012 и [[символ турецкой лиры]]
+| ISO/IEC 10646:2012 plus the [[Turkish lira sign]]
-| {{formatnum:100}}
+| 100
+| 110,117<br />(1 added)
-| {{formatnum:110182}}
+| [[Turkish lira sign]].<ref>{{cite web
-| Добавлен [[символ турецкой лиры]] (валюты [[Турция|Турции]])<ref>{{cite web
 | title = Unicode Data 6.2.0
-| url = http://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt
+| url = https://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt
+| accessdate = 2012-09-26}}
-| lang = en
+</ref>
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 6.3
-| 6.3<ref>{{cite web|title=Unicode® 6.3.0|url=http://www.unicode.org/versions/Unicode6.3.0/|date=2012-09-30|work={{нп5|Unicode Consortium}}|accessdate=2017-12-07|language=en}}</ref>
-| Сентябрь 2013
+| September 2013
+| {{ISBN|978-1-936213-08-5}}
-|
-| ISO/IEC 10646:2012 и шесть символов
+| ISO/IEC 10646:2012 plus six characters
-| {{formatnum:100}}
+| 100
+| 110,122<br />(5 added)
-| {{formatnum:110187}}
+| 5 bidirectional formatting characters.<ref>{{cite web
-| Добавлено пять символов для форматирования двунаправленного текста<ref>{{cite web
 | title = Unicode Data 6.3.0
-| url = http://www.unicode.org/Public/6.3.0/ucd/UnicodeData.txt
+| url = https://www.unicode.org/Public/6.3.0/ucd/UnicodeData.txt
+| accessdate = 2013-09-30}}
-| lang = en
+</ref>
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 7.0
-| 7.0<ref>{{cite web|title=Unicode® 7.0.0|url=http://www.unicode.org/versions/Unicode7.0.0/|date=2014-06-16|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| 16 июня 2014
+| June 2014
+| {{ISBN|978-1-936213-09-2}}
-|
-| ISO/IEC 10646:2012, Amendments 1, 2 и [[символ российского рубля|символ рубля]]
+| ISO/IEC 10646:2012 plus Amendments 1 and 2, as well as the [[Ruble sign]]
-| {{formatnum:123}}
+| 123
+| 112,956<br />(2,834 added)
-| {{formatnum:113021}}
+| [[Bassa alphabet|Bassa Vah]], [[Caucasian Albanian alphabet|Caucasian Albanian]], [[Duployan shorthand|Duployan]], [[Elbasan alphabet|Elbasan]], [[Grantha alphabet|Grantha]], [[Khojki]], [[Khudabadi alphabet|Khudawadi]], [[Linear A]], [[Mahajani]], [[Manichaean alphabet|Manichaean]], [[Mende script|Mende Kikakui]], [[Modi alphabet|Modi]], [[Mro script|Mro]], [[Nabataean alphabet|Nabataean]], [[Old North Arabian]], [[Old Permic alphabet|Old Permic]], [[Pahawh Hmong]], [[Palmyrene script|Palmyrene]], [[Pau Cin Hau script|Pau Cin Hau]], [[Psalter Pahlavi]], [[Siddhaṃ alphabet|Siddham]], [[Tirhuta]], [[Warang Citi]], and [[Dingbat]]s.<ref>{{cite web
-| Добавлены [[Басса (письмо)|письмо басса]], [[агванское письмо]], [[Система Дюплойе|стенография Дюплойе]], [[эльбасанское письмо]], [[Грантха|письмо грантха]], {{iw|письмо ходжики||en|Khojki}}, {{iw|письменность худавади||en|Khudabadi alphabet}}, [[линейное письмо А]], {{iw|письмо махаджани||en|Mahajani}}, [[манихейское письмо]], [[Кикакуи|письмо кикакуи]], [[Моди (письмо)|письмо моди]], {{iw|письмо мро||en|Mro script}}, [[набатейское письмо]], [[Северноаравийские языки|северноаравийское письмо]], [[древнепермское письмо]], [[Пахау|письмо пахау]], [[Пальмирский алфавит|пальмирское письмо]], {{iw|письмо по чин хо||en|Pau Cin Hau}}, {{iw|письмо псалтирь пехлеви||en|Psalter Pahlavi}}, [[сиддхаматрика]], [[Тирхута|письмо тирхута]], [[варанг-кшити]] и {{iw|дингбат|орнамент дингбат|en|Dingbat}}, а также [[символ российского рубля]] и [[символ азербайджанского маната]]<ref>{{cite web
 | title = Unicode Data 7.0.0
-| url = http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
+| url = https://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
+| accessdate = 2014-06-15}}
-| lang = en
+</ref>
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 8.0
-| 8.0<ref>{{cite web|title=Unicode® 8.0.0|url=http://www.unicode.org/versions/Unicode8.0.0/|date=2015-06-17|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| 17 июня 2015
+| June 2015
+| {{ISBN|978-1-936213-10-8}}
-|
+| ISO/IEC 10646:2014 plus Amendment 1, as well as the [[Georgian lari|Lari sign]], nine CJK unified ideographs, and 41 emoji characters<ref>{{Cite web | title=Unicode 8.0.0 | url=https://www.unicode.org/versions/Unicode8.0.0/ | publisher=Unicode Consortium | accessdate=2015-06-17 }}</ref>
-|ISO/IEC 10646:2014, Amendment 1, [[символ лари]], 9 уни&shy;фи&shy;ци&shy;ро&shy;ван&shy;ных идеограмм ККЯ, 41 [[эмодзи]]
-|129
+| 129
+| 120,672<br />(7,716 added)
-| {{formatnum:120737}}
+| [[Ahom alphabet|Ahom]], [[Anatolian hieroglyphs]], [[Hatran alphabet|Hatran]], [[Multani alphabet|Multani]], [[Old Hungarian alphabet|Old Hungarian]], [[SignWriting]], 5,771 [[CJK Unified Ideographs|CJK unified ideographs]], a set of lowercase letters for [[Cherokee syllabary|Cherokee]], and five emoji [[Fitzpatrick scale|skin tone]] modifiers<ref>{{cite web
-|Добавлены [[Ахом (письмо)|письмо ахом]], [[анатолийские иероглифы]], [[Хатран|письмо хатран]], [[Мултани|письмо мултани]], [[венгерские руны]], [[SignWriting]], {{formatnum:5776}} [[Унифицированные идеограммы ККЯ — расширение E]], строчные буквы письма [[чероки]], буквы латиницы для немецкой диалектологии, 41 [[эмодзи]], а также пять символов изменения [[Шкала Фитцпатрика|цвета кожи]] для эмотиконов. Добавлен [[символ лари]] (валюты [[Грузия|Грузии]])<ref>{{cite web
 | title = Unicode Data 8.0.0
-| url = http://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt
+| url = https://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt
+| accessdate = 2015-06-17}}
-| lang = en
+</ref>
-| accessdate = 2017-12-04
-}}</ref>
 |-
+| 9.0
-| 9.0<ref>{{cite web|title=Unicode® 9.0.0|url=http://www.unicode.org/versions/Unicode9.0.0/|date=2016-06-21|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| 21 июня 2016
+| June 2016
+| {{ISBN|978-1-936213-13-9}}
-|
+| ISO/IEC 10646:2014 plus Amendments 1 and 2, as well as Adlam, Newa, Japanese TV symbols, and 74 emoji and symbols<ref>{{Cite web | title=Unicode 9.0.0 | url=https://www.unicode.org/versions/Unicode9.0.0/ | publisher=Unicode Consortium | accessdate=2016-06-21 }}</ref>
-|ISO/IEC 10646:2014, Amendments 1, 2, адлам, нева, японские символы для ТВ, 74 [[эмодзи]] и символов
-|135
+| 135
+| 128,172<br />(7,500 added)
-| {{formatnum:128237}}
+| [[Adlam script|Adlam]], [[Bhaiksuki alphabet|Bhaiksuki]], [[Zhang-Zhung language#Scripts|Marchen]], [[Prachalit Nepal alphabet|Newa]], [[Osage alphabet|Osage]], [[Tangut script|Tangut]], and 72 [[emoji]]<ref>{{cite web
-|Добавлены [[Адлам|письмо адлам]], [[Бхайкшуки|письмо бхайкшуки]], [[Марчен|письмо марчен]], [[Нева (письмо)|письмо нева]], [[Осейдж (письмо)|письмо осейдж]], [[тангутское письмо]], а также 72 [[эмодзи]] и японские символы для телевидения<ref>{{cite web
 | title = Unicode Data 9.0.0
-| url = http://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt
+| url = https://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt
+| accessdate = 2016-06-21}}
-| lang = en
+</ref><ref name=laobo>{{cite web|first=Martim|last=Lobao|url=https://www.androidpolice.com/2016/06/07/two-emoji-werent-approved-unicode-9-google-added-android-anyway/ |title=These Are The Two Emoji That Weren't Approved For Unicode 9 But Which Google Added To Android Anyway|website=Android Police|date= 7 June 2016|accessdate=4 September 2016}}</ref>
-| accessdate = 2017-12-06
-}}</ref>
 |-
+| 10.0
-| 10.0<ref>{{cite web|title=Unicode® 10.0.0|url=http://www.unicode.org/versions/Unicode10.0.0/|date=2017-06-27|work={{нп5|Unicode Consortium}}|accessdate=2017-12-08|language=en}}</ref>
-| 20 июня 2017
+| June 2017
+| {{ISBN|978-1-936213-16-0}}
-|
+| ISO/IEC 10646:2017 plus 56 [[emoji]] characters, 285 [[hentaigana]] characters, and 3 Zanabazar Square characters<ref>{{Cite web | title=Unicode 10.0.0 | url=https://www.unicode.org/versions/Unicode10.0.0/ | publisher=Unicode Consortium | accessdate=2017-06-20 }}</ref>
-|ISO/IEC 10646:2017, 56 [[эмодзи]], 285 символов [[Хэнтайгана|хэнтайганы]], 3 символа квадратного письма Дзанабадзара
-|139
+| 139
+| 136,690<br />(8,518 added)
-| {{formatnum:136755}}
+| [[Zanabazar Square alphabet|Zanabazar Square]], [[Soyombo alphabet|Soyombo]], [[Masaram Gondi script|Masaram Gondi]], [[Nüshu script|Nüshu]], [[hentaigana]] (non-standard [[hiragana]]), 7,494 [[CJK Unified Ideographs|CJK unified ideographs]], and 56 [[emoji]]
-|Добавлены [[Монгольские письменности#Горизонтальное квадратное письмо|квадратное письмо Дзанабадзара]], [[Соёмбо (письмо)|письмо соёмбо]], [[гонди Масарама]], [[Нюй-шу|письмо нюй-шу]], [[Хэнтайгана|письмо хэнтайгана]], {{formatnum:7494}} [[Унифицированные идеограммы ККЯ — расширение F]], а также 56 [[эмодзи]] и символ [[биткойн]]а<ref>{{cite web
-| title = Unicode Data 10.0.0
-| url = http://www.unicode.org/Public/10.0.0/ucd/UnicodeData.txt
-| lang = en
-| accessdate = 2017-12-07
-}}</ref>
 |-
 | 11.0
-| Июнь 2018
+| June 2018
+| {{ISBN|978-1-936213-19-1}}
-|
+| ISO/IEC 10646:2017 plus Amendment 1, as well as 46 Mtavruli Georgian capital letters, 5 CJK unified ideographs, and 66 emoji characters.<ref>{{Cite web | title=The Unicode Standard, Version 11.0.0 Appendix C | url=https://www.unicode.org/versions/Unicode11.0.0/appC.pdf | publisher=Unicode Consortium | accessdate=2018-06-11 }}</ref>
-| ISO/IEC 10646:2017
 | 146
+| 137,374<br />(684 added)
-| {{formatnum:137439}}
-|Добавлены догра, [[грузинское письмо]] мтаврули, гунджалское гонди, [[ханифи]], индийские цифры сийяк,  [[Макасарский язык|макасарское]] письмо, медефайдрин, (древне-)[[согдийское письмо]], [[цифры майя]], 5 идеограмм ККЯ, символы [[сянци]] и половин звёздочек для оценки, а также 145 [[эмодзи]], четыре символа изменения причёски для эмотиконов и символ [[копилефт]]а<ref>{{Cite web|url=http://www.unicode.org/Public/11.0.0/ucd/UnicodeData.txt|title=Unicode Data 11.0.0|author=|website=|date=|publisher=|accessdate=2019-04-12|lang=en}}</ref><ref>[http://blog.unicode.org/2018/06/announcing-unicode-standard-version-110.html The Unicode Blog: Announcing The Unicode&#174; Standard, Version 11.0]</ref><ref>[http://www.unicode.org/versions/Unicode11.0.0/ Unicode 11.0.0]</ref>
+| [[Dogri script|Dogra]], [[Georgian scripts#Mkhedruli|Georgian Mtavruli]] capital letters, [[Gunjala Gondi Lipi|Gunjala Gondi]], [[Hanifi Rohingya script|Hanifi Rohingya]], [[Indic Siyaq Numbers (Unicode block)|Indic Siyaq numbers]], [[Makassarese language|Makasar]], [[Medefaidrin]], [[Sogdian alphabet|Old Sogdian and Sogdian]], [[Mayan numerals]], 5 urgently needed [[CJK Unified Ideographs|CJK unified ideographs]], symbols for [[xiangqi]] (Chinese chess) and [[Star (classification)|star ratings]], and 145 [[emoji]]<ref>{{Cite web|url=http://blog.unicode.org/2018/06/announcing-unicode-standard-version-110.html|title=Announcing The Unicode Standard, Version 11.0|website=blog.unicode.org|access-date=2018-06-06}}</ref>
 |-
-|12.0
+| 12.0
-|Март 2019
+| March 2019
+| {{ISBN|978-1-936213-22-1}}
-|
+| ISO/IEC 10646:2017 plus Amendments 1 and 2, as well as 62 additional characters.<ref>{{Cite web | title=The Unicode Standard, Version 12.0.0 Appendix C | url=https://www.unicode.org/versions/Unicode12.0.0/appC.pdf | publisher=Unicode Consortium | accessdate=2019-03-05 }}</ref>
-|ISO/IEC 10646:2017, Amendments 1, 2, а также 62 допол&shy;ни&shy;тель&shy;ных символов
-|150
+| 150
+| 137,928<br />(554 added)
-|{{formatnum:137993}}
+| [[Elymaic]], [[Nandinagari]], [[Nyiakeng Puachue Hmong]], [[Wancho script|Wancho]], [[Pollard script|Miao script]] additions for several Miao and Yi dialects in China, [[hiragana]] and [[katakana]] small letters for writing archaic Japanese, [[Tamil script|Tamil]] historic fractions and symbols, [[Lao alphabet|Lao]] letters for [[Pali]], Latin letters for Egyptological and Ugaritic transliteration, hieroglyph format controls, and 61 [[emoji]]<ref>{{Cite web|url=http://blog.unicode.org/2019/03/announcing-unicode-standard-version-120.html|title=Announcing The Unicode Standard, Version 12.0|website=blog.unicode.org|access-date=2019-03-05}}</ref>
-|Добавлены элимайское письмо, {{Не переведено 3|надинагари|3=en|4=Nandinagari}}, хмонг, ванчо, дополнения для [[Письмо Полларда|письма Полларда]], малая [[кана]] для старых японских текстов, исторические дроби и символы [[Тамильское письмо|тамильского письма]], буквы [[Лаосское письмо|лаосского письма]] для [[пали]], буквы латиницы для транслитерации угаритского, управляющие символы форматирования египетских иероглифов, а также 61 [[эмодзи]]<ref>[http://blog.unicode.org/2019/03/announcing-unicode-standard-version-120.html The Unicode Blog: Announcing The Unicode&#174; Standard, Version 12.0]</ref><ref>[http://www.unicode.org/versions/Unicode12.0.0/ Unicode 12.0.0]</ref>
 |-
-|12.1
+| 12.1
-|Май 2019
+| May 2019
+| {{ISBN|978-1-936213-25-2}}
 |
+| 150
-|
+| 137,929<br />(1 added)
-|150
+| Adds a single character at U+32FF for the square ligature form of the name of the [[Reiwa|Reiwa era]].<ref>{{Cite web|url=http://blog.unicode.org/2019/05/unicode-12-1-en.html|title=Unicode Version 12.1 released in support of the Reiwa Era|website=blog.unicode.org|access-date=2019-05-07}}</ref>
-|{{formatnum:137994}}
-|Добавлен квадратный символ эпохи [[рэйва]]<ref>[http://blog.unicode.org/2019/05/unicode-12-1-en.html The Unicode Blog: Unicode Version 12.1 released in support of the Reiwa Era]</ref><ref>[http://www.unicode.org/versions/Unicode12.1.0/ Unicode 12.1.0]</ref>
 |-
+| [http://www.unicode.org/versions/Unicode13.0.0/ 13.0]
-|13.0
-|Март 2020
+| March 2020
+| {{ISBN|978-1-936213-26-9}}
-|
+| ISO/IEC 10646:2020<ref>{{Cite web | title=The Unicode Standard, Version 13.0– Core Specification Appendix C | url=https://www.unicode.org/versions/Unicode13.0.0/appC.pdf | publisher=Unicode Consortium | accessdate=2020-03-11 }}</ref>
-|
-|154
+| 154
+| 143,859<br />(5,930 added)
-|{{formatnum:143859}}
+| [[Khwarezmian_language#Writing_system|Chorasmian]], [[Dhives akuru|Dives Akuru]], [[Khitan small script]], [[Kurdish alphabets#Yezidi|Yezidi]], 4,969 CJK unified ideographs added (including 4,939 in [[CJK Unified Ideographs Extension G|Ext. G]]), Arabic script additions used to write [[Hausa language|Hausa]], [[Wolof language|Wolof]], and other languages in Africa and other additions used to write [[Hindko]] and [[Punjabi language|Punjabi]] in Pakistan, Bopomofo additions used for Cantonese, Creative Commons license symbols, graphic characters for compatibility with teletext and home computer systems from the 1970s and 1980s, and 55 emoji<ref>{{Cite web|url=http://blog.unicode.org/2020/03/announcing-unicode-standard-version-130.html|title=Announcing The Unicode Standard, Version 13.0|website=blog.unicode.org|access-date=2020-03-11}}</ref>
-|Добавлены [[хорезмийское письмо]], письмо [[дивес акуру]], [[малое киданьское письмо]], [[езидское письмо]], {{formatnum:4969}} идеограмм ККЯ (включая {{formatnum:4939}} [[Унифицированные идеограммы ККЯ — расширение G]]), а также 55 [[эмодзи]], символы [[Creative Commons]] и символы для унаследованной вычислительной техники. Выделено место для [[Плоскость (Юникод)#Третичная идеографическая плоскость|плоскости 3]]<ref>[http://blog.unicode.org/2020/03/announcing-unicode-standard-version-130.html The Unicode Blog: Announcing The Unicode Standard, Version 13.0]</ref><ref>[http://www.unicode.org/versions/Unicode13.0.0/ Unicode 13.0.0]</ref>
-|-
-|colspan="7" | '''Notes'''
-<references group="A" />
 |}
+{{Reflist|group=tablenote}}
-== Code space ==
-Although the UTF-8 and UTF-32 encoding forms allow up to 2<sup>31</sup> ({{formatnum:2147483648}}) code points to be encoded, it was decided to use only {{formatnum:1112064}} of them for compatibility with UTF-16. Even this is currently more than enough: version 13.0 uses only {{formatnum:143859}} code points.
+==<span id="Upluslink"></span><span id="codespace"></span> Architecture and terminology==
-The code space is divided into 17 ''[[Плоскость (Юникод)|planes]]'' of 2<sup>16</sup> ({{formatnum:65536}}) characters each. Plane 0 is called the ''basic'' plane and contains the characters of the most commonly used scripts. The remaining planes are supplementary. Plane 1 is used mainly for historic scripts, plane 2 for rarely used [[CJK|Chinese (CJK)]] ideographs, and plane 3 is reserved for archaic Chinese ideographs<ref>[http://unicode.org/roadmaps/tip/ Roadmap to the TIP (Tertiary Ideographic Plane)]</ref>. Plane 14 is set aside for special-purpose characters. Planes 15 and 16 are allocated for private use<ref name='unicode-02' />.
+{{See also|Universal Character Set characters}}<!-- Template:U+ links to this paragraph -->
+The Unicode Standard defines a ''codespace''<ref name="Glossary">{{cite web|title = Glossary of Unicode Terms|url=https://unicode.org/glossary/|accessdate=2010-03-16}}</ref> of numerical values ranging from 0 through 10FFFF<sub>[[hexadecimal|16]]</sub>,<ref>{{cite book|title=The Unicode Standard, Version 13.0 |url=https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212|year=2020|page=19|chapter=3.4 Characters and Encoding}}</ref> called ''[[code point|code points]]''<ref name=":0">{{Cite book|url=http://www.unicode.org/versions/Unicode12.0.0/ch02.pdf#G25564|title=The Unicode Standard Version 12.0 – Core Specification|last=|first=|publisher=|year=2019|isbn=|location=|page=29|chapter=2.4 Code Points and Characters}}</ref> and denoted as U+0000 through U+10FFFF ("U+" followed by the code point value in [[hexadecimal]], padded with [[leading zero|leading zeros]] as necessary to give a minimum of four digits, ''e.&nbsp;g.'', U+00F7 for the division sign, ÷, versus U+13254 for the [[Egyptian hieroglyph]] designating a [[List of hieroglyphs#O|reed shelter]] or a [[c:Category:Winding wall (h hieroglyph)|winding wall]] {{nowrap|( [[File:Hiero O4.png|text-bottom|15px]] )}}<ref>{{Cite web|url=https://www.unicode.org/versions/Unicode13.0.0/appA.pdf|date=March 2020|title=Appendix A: Notational Conventions|publisher=Unicode Consortium|work=The Unicode Standard}} In conformity with the bullet point relating to Unicode in [[MOS:ALLCAPS]], the formal Unicode names are not used in this paragraph.</ref>), respectively. Of these 2<sup>16</sup> + 2<sup>20</sup> defined code points, the code points from U+D800 through U+DFFF, which are used to encode surrogate pairs in [[UTF-16]], are reserved by the Standard and may not be used to encode valid characters, leaving a net total of 2<sup>16</sup> &minus; 2<sup>11</sup> + 2<sup>20</sup> = 1,112,064 possible code points corresponding to valid Unicode characters. Not all of these code points necessarily correspond to visible characters; several, for example, are assigned to control codes such as [[carriage return]].
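The notation and the arithmetic in the paragraph above can be checked with a minimal Python sketch. The helper name `u_plus` and the constant names are illustrative assumptions, not terms taken from the Standard.

```python
def u_plus(code_point: int) -> str:
    """Format a code point in U+ notation, padded to at least four hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode codespace")
    return f"U+{code_point:04X}"


# Sizes quoted in the text, derived from the range boundaries.
CODESPACE_SIZE = 0x10FFFF + 1            # 2**16 + 2**20 code points
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1    # 2**11 code points reserved for UTF-16
VALID_CHARACTER_CODE_POINTS = CODESPACE_SIZE - SURROGATE_COUNT

print(u_plus(0xF7))       # division sign
print(u_plus(0x13254))    # Egyptian hieroglyph O004
print(VALID_CHARACTER_CODE_POINTS)
```

Values below U+1000 are padded with zeros, while five- and six-digit values are printed as they are.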
-Unicode characters are denoted "U+''xxxx''" (for codes 0…FFFF), "U+''xxxxx''" (for codes 10000…FFFFF), or "U+''xxxxxx''" (for codes 100000…10FFFF), where ''xxx'' are [[шестнадцатеричная система счисления|hexadecimal]] digits. For example, the character "я" (U+044F) has the code 044F{{sub|16}}{{nbsp}}= 1103{{sub|[[десятичная система счисления|10]]}}.
+===Code point planes and blocks===
-{| class="wikitable sortable collapsible collapsed"
+{{Main|Plane (Unicode)}}
-|-
+The Unicode codespace is divided into seventeen ''planes'', numbered 0 to 16:
-! colspan="3" | Unicode planes
-|-
-! Plane !! Name !! Character range
-|-
-| 0 || Basic multilingual plane (BMP) || U+0000…U+FFFF
-|-
-| 1 || Supplementary multilingual plane (SMP) || U+10000…U+1FFFF
-|-
-| 2 || Supplementary ideographic plane (SIP) || U+20000…U+2FFFF
-|-
-| 3 || Tertiary ideographic plane (TIP) || U+30000…U+3FFFF
-|-
-| 4–13 || unused || U+40000…U+DFFFF
-|-
-| 14 || Supplementary special-purpose plane (SSP) || U+E0000…U+EFFFF
-|-
-| 15–16 || Supplementary private use areas (SPUA-A/B) || U+F0000…U+10FFFF
-|-
-|}
+{{Planes (Unicode)}}
-== Encoding system ==
-The universal encoding system (Unicode) is a set of graphic characters and a method of encoding them for [[компьютер|computer]] processing of text data.
+All code points in the BMP are accessed as a single code unit in [[UTF-16]] encoding and can be encoded in one, two or three bytes in [[UTF-8]]. Code points in Planes 1 through 16 (''supplementary planes'') are accessed as surrogate pairs in UTF-16 and encoded in four bytes in UTF-8.
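As a rough illustration of the relationship between planes and encoded length described above, the following Python sketch derives the plane number and the UTF-8 and UTF-16 sizes of a code point. The function names are assumptions chosen for this example, and the loop cross-checks them against Python's built-in encoders.

```python
def plane(code_point: int) -> int:
    """Return the plane number (0-16) that contains the code point."""
    return code_point >> 16


def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a code point."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def utf16_units(code_point: int) -> int:
    """Number of 16-bit code units UTF-16 uses for a code point."""
    return 1 if code_point < 0x10000 else 2


for ch in ("A", "я", "€", "😀"):
    cp = ord(ch)
    # Cross-check the hand-written rules against Python's encoders.
    assert utf8_length(cp) == len(ch.encode("utf-8"))
    assert utf16_units(cp) == len(ch.encode("utf-16-le")) // 2
    print(f"U+{cp:04X} plane={plane(cp)} utf8={utf8_length(cp)} utf16={utf16_units(cp)}")
```

Only the last sample lies outside the BMP, so it is the only one that needs four UTF-8 bytes and two UTF-16 code units.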
-Graphic characters are characters that have a visible representation. They are contrasted with control characters and formatting characters.
+Within each plane, characters are allocated within named ''[[Block (Unicode)|blocks]]'' of related characters. Although blocks can be of arbitrary size, their size is always a multiple of 16 code points and often a multiple of 128 code points. Characters required for a given script may be spread out over several different blocks.
-Graphic characters include the following groups:
-* letters found in at least one of the supported [[алфавит|alphabets]];
-* digits;
-* punctuation marks;
-* special signs ([[математика|mathematical]], technical, [[идеограмма|ideograms]], and so on);
-* separators.
+===General Category property===
-Unicode is a system for the linear representation of text. Characters with additional elements above or below the line can be represented either as a sequence of codes built according to defined rules (the composite character form) or as a single character (the precomposed character form). Since 2014 all letters of the major scripts have been considered to be included in Unicode, and a character available in composite form need not be duplicated in precomposed form.
+Each code point has a single [[Character property (Unicode)#General Category|General Category]] property. The major categories are Letter, Mark, Number, Punctuation, Symbol, Separator and Other. Within these categories, there are subdivisions. In most cases other properties must be used to sufficiently specify the characteristics of a code point. The possible General Categories are:
+{{General Category (Unicode)}}
-=== Consortium policy ===
-The Consortium does not create anything new but records established practice<ref name="emoji">[http://www.unicode.org/faq/emoji_dingbats.html FAQ – Emoji{{nbsp}}& Dingbats]</ref>. For example, [[эмодзи|emoji]] pictures were added because Japanese mobile operators used them widely. To this end, adding a character goes through a complex process<ref name="emoji" />. The [[символ российского рубля|Russian ruble sign]], for example, passed through it in three months once it received official status, although it had been in de facto use for many years before that and its inclusion in Unicode had been refused.
+Code points in the range U+D800–U+DBFF (1,024 code points) are known as high-'''surrogate''' code points, and code points in the range U+DC00–U+DFFF (1,024 code points) are known as low-surrogate code points. A high-surrogate code point followed by a low-surrogate code point forms a surrogate pair in [[UTF-16]], which represents a code point greater than U+FFFF. Surrogate code points cannot otherwise be used (in practice this rule is often ignored, especially when UTF-16 is not in use).
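The mapping between a supplementary code point and its surrogate pair can be sketched in Python as follows. The function names are hypothetical, and the arithmetic is the standard UTF-16 rule: subtract 10000<sub>16</sub>, then split the remaining 20 bits into two 10-bit halves.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogate code points."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")
print(f"back    -> U+{from_surrogate_pair(high, low):04X}")
```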
-[[Товарный знак|Trademarks]] are encoded only as an exception. Thus Unicode contains neither the [[Windows]] flag nor the [[Apple]] apple.
+A small set of code points are guaranteed never to be used for encoding characters, although applications may make use of these code points internally if they wish. There are sixty-six of these '''noncharacters''': U+FDD0–U+FDEF and any code point ending in the value FFFE or FFFF (i.e., U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF). The set of noncharacters is stable, and no new noncharacters will ever be defined.<ref name="stability-policy">{{cite web
-Once a character has appeared in the encoding, it will never move or disappear. If the order of characters needs to change, this is done not by moving code positions but through a national collation order. There are other, subtler stability guarantees as well; for example, the normalization tables will not change<ref>[http://www.unicode.org/policies/stability_policy.html Unicode Character Encoding Stability Policy]</ref>.
+| title = Unicode Character Encoding Stability Policy
+| url = https://unicode.org/policies/stability_policy.html
+| accessdate = 2010-03-16}}
+</ref> As with surrogates, the rule that these code points cannot be used is often ignored, although the operation of the [[byte order mark]] assumes that U+FFFE will never be the first code point in a text.
+Excluding surrogates and noncharacters leaves 1,111,998 code points available for use.
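The counts given above can be reproduced with a short Python sketch that tests every code point in the codespace. The predicate names are assumptions made for this example.

```python
def is_surrogate(code_point: int) -> bool:
    """True for U+D800 through U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def is_noncharacter(code_point: int) -> bool:
    """True for U+FDD0-U+FDEF and for any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE


codespace = range(0x110000)  # U+0000 through U+10FFFF
noncharacters = sum(1 for cp in codespace if is_noncharacter(cp))
surrogates = sum(1 for cp in codespace if is_surrogate(cp))
available = len(codespace) - surrogates - noncharacters

print(noncharacters, surrogates, available)
```

The noncharacter total is the 32 code points U+FDD0–U+FDEF plus the last two code points of each of the seventeen planes.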
-=== Unification and duplication of characters ===
-The same character may have several forms; in Unicode these forms share a single code point:
-* if this is the historical practice. For example, [[арабское письмо|Arabic letters]] have four forms: isolated, initial, medial, and final<ref>Later, the specific forms of Arabic letters were given separate positions. Even so, it is recommended to write Arabic using the "generic" variants of the letters.</ref>;
-* or if one form is customary in one language and another form in another. [[Болгарская кириллица (типографика)|Bulgarian Cyrillic]] differs from Russian, and Chinese ideographs differ from Japanese ones.
+'''Private-use''' code points are considered to be assigned characters, but they have no interpretation specified by the Unicode standard<ref>{{cite web
-On the other hand, if different forms historically had separate characters in fonts, they remain separate in Unicode. For example, the lowercase Greek [[Сигма (буква)|sigma]] has two forms, and they have different codes in Unicode; the [[Расширенная латиница|extended Latin]] letter{{nbsp}}[[Å (латиница)|Å]] ({{nobr|A with ring}}) and the [[ангстрем|angstrom]] sign{{nbsp}}Å, and the Greek letter{{nbsp}}[[Мю|μ]] and the symbol for the prefix "[[микро-|micro-]]"{{nbsp}}µ, also have different code positions.
+| title = Properties
+| url = https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G43463
+| accessdate = 2020-03-15 }}
+</ref> so any interchange of such characters requires an agreement between sender and receiver on their interpretation. There are three private-use areas in the Unicode codespace:
+* Private Use Area: U+E000–U+F8FF (6,400 characters)
-Naturally, similar-looking characters in unrelated scripts are also assigned different code positions. For example, the letter{{nbsp}}А in [[Латиница|Latin]], [[Кириллица|Cyrillic]], [[Греческий алфавит|Greek]], and [[Письмо чероки|Cherokee]] are different characters.
+* Supplementary Private Use Area-A: U+F0000–U+FFFFD (65,534 characters)
+* Supplementary Private Use Area-B: U+100000–U+10FFFD (65,534 characters).
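The sizes quoted in the list follow directly from the range bounds. A minimal sketch (the dictionary and variable names are illustrative only):

```python
# The three private-use ranges listed above (inclusive bounds).
PRIVATE_USE_AREAS = {
    "Private Use Area": (0xE000, 0xF8FF),
    "Supplementary Private Use Area-A": (0xF0000, 0xFFFFD),
    "Supplementary Private Use Area-B": (0x100000, 0x10FFFD),
}

sizes = {name: hi - lo + 1 for name, (lo, hi) in PRIVATE_USE_AREAS.items()}
total_private_use = sum(sizes.values())

for name, size in sizes.items():
    print(f"{name}: {size:,}")
print(f"total: {total_private_use:,}")  # total: 137,468
```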
+Graphic characters are characters defined by Unicode to have particular semantics, and either have a visible [[glyph]] shape or represent a visible space. As of Unicode 13.0 there are 143,696 graphic characters.
-Very rarely, the same character is assigned two different code positions to simplify text processing. The [[Штрих (математика)|mathematical prime]] and the identical prime used to indicate [[Мягкий звук|palatalization]] are different characters; the latter is considered a letter.
+'''Format''' characters are characters that do not have a visible appearance, but may have an effect on the appearance or behavior of neighboring characters. For example, {{unichar|200C|Zero-width non-joiner|nlink=}} and {{unichar|200D|Zero-width joiner|nlink=}} may be used to change the default shaping behavior of adjacent characters (e.g., to inhibit ligatures or request ligature formation). There are 163 format characters in Unicode 13.0.
-== Combining characters ==
-[[Файл:Diacritic-j.png|right|thumb|Representation of the character "Й" (U+0419) as the base character "И" (U+0418) and the combining character " ̆" (U+0306).]]
-Characters in Unicode are divided into base characters and combining characters. Combining characters usually follow the base character and change how it is displayed in a particular way. Combining characters include, for example, [[диакритические знаки|diacritical marks]] and stress marks. For example, the Russian letter "Й" can be written in Unicode as the base character "И" (U+0418) followed by the combining character " ̆" (U+0306), displayed above the base.
+Sixty-five code points (U+0000–U+001F and U+007F–U+009F) are reserved as '''control''' codes, and correspond to the [[C0 and C1 control codes]] defined in [[ISO/IEC 6429]]. U+0009 (Tab), U+000A (Line Feed), and U+000D (Carriage Return) are widely used in Unicode-encoded texts. In practice, C1 code points found in text are often the result of improperly translating ([[Mojibake]]) legacy [[Windows-1252]] characters, which appear in some English and Western European texts produced with Windows technologies.
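How C1 code points arise from mistranslated Windows-1252 text can be shown with Python's built-in codecs. The sample bytes are invented for illustration; `latin-1` and `cp1252` are the standard Python codec names for ISO 8859-1 and Windows-1252.

```python
raw = b"\x93quoted\x94"  # bytes as a Windows-1252 application might write them

# Misread as ISO 8859-1: bytes 0x93/0x94 become the C1 controls U+0093/U+0094.
wrong = raw.decode("latin-1")
# Read with the intended code page: they are curly quotation marks.
right = raw.decode("cp1252")

print([f"U+{ord(c):04X}" for c in (wrong[0], wrong[-1])])  # ['U+0093', 'U+0094']
print([f"U+{ord(c):04X}" for c in (right[0], right[-1])])  # ['U+201C', 'U+201D']
```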
-Combining characters are marked in the Unicode character tables with special categories:
-* Nonspacing Mark: such marks are usually displayed above or below the base character and do not occupy a separate horizontal position (advance) in the displayed line;
-* Enclosing Mark: these characters also do not occupy a separate horizontal position in the displayed line, but are displayed on several sides of the base character at once;
-* Spacing Combining Mark: like the base character, such marks occupy a separate horizontal position in the displayed line.
+Graphic characters, format characters, control code characters, and private use characters are known collectively as ''assigned characters''. '''Reserved''' code points are those code points which are available for use, but are not yet assigned. As of Unicode 13.0 there are 830,606 reserved code points.
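The character types described above map onto Unicode general categories, which Python exposes through `unicodedata.category`. The sketch below is an approximation under stated assumptions: the function name `kind` and the label strings are hypothetical, everything outside the listed categories is treated as graphic, and results for code points assigned in recent versions depend on the Unicode database bundled with the Python interpreter.

```python
import unicodedata

# Assumed mapping from general category to the terms used in the text.
KIND_BY_CATEGORY = {
    "Cc": "control",
    "Cf": "format", "Zl": "format", "Zp": "format",
    "Co": "private use",
    "Cs": "surrogate",
    "Cn": "unassigned (reserved or noncharacter)",
}

def kind(cp: int) -> str:
    # Any other category (letters, marks, numbers, punctuation,
    # symbols, space separators) is treated as graphic.
    return KIND_BY_CATEGORY.get(unicodedata.category(chr(cp)), "graphic")

for cp in (0x0041, 0x0009, 0x200D, 0xE000, 0xD800, 0xFFFF):
    print(f"U+{cp:04X}", kind(cp))
```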
-A special kind of combining character is the variation selector. Variation selectors affect only those base characters for which such variants are defined. For example, in Unicode 5.0 variation sequences are defined for a number of mathematical symbols, for characters of the traditional [[монгольский алфавит|Mongolian script]], and for characters of the [[Монгольское квадратное письмо|Phags-pa script]].
+===Abstract characters===
-== Normalization algorithms ==
+The set of graphic and format characters defined by Unicode does not correspond directly to the repertoire of ''abstract characters'' that is representable under Unicode. Unicode encodes characters by associating an abstract character with a particular code point.<ref>{{cite web
-Because Unicode has combining characters, the same written signs can be represented by different codes. For example, the letter "Й" in the example above can be written either as a single character or as a base character followed by a combining character. This makes byte-by-byte string comparison unreliable. Normalization forms solve this problem by converting characters to a defined standard form. The conversion replaces characters with equivalent ones using tables and rules. "Decomposition" is the replacement of one character by several constituent characters, and "composition" is the reverse: the replacement of several constituent characters by a single character.
+| title = Unicode Character Encoding Model
+| url = https://unicode.org/reports/tr17/
+| accessdate = 2010-03-16}}
+</ref> However, not all abstract characters are encoded as a single Unicode character, and some abstract characters may be represented in Unicode by a sequence of two or more characters. For example, a Latin small letter "i" with an [[ogonek]], a [[dot above]], and an [[acute accent]], which is required in [[Lithuanian language|Lithuanian]], is represented by the character sequence U+012F, U+0307, U+0301. Unicode maintains a list of uniquely named character sequences for abstract characters that are not directly encoded in Unicode.<ref>{{cite web
+| title = Unicode Named Sequences
+| url = https://unicode.org/Public/UNIDATA/NamedSequences.txt
+| accessdate = 2010-03-16}}
+</ref>
+All graphic, format, and private use characters have a unique and immutable name by which they may be identified. This immutability has been guaranteed since Unicode version 2.0 by the Name Stability policy.<ref name="stability-policy" /> In cases where the name is seriously defective and misleading, or has a serious typographical error, a formal alias may be defined, and applications are encouraged to use the formal alias in place of the official character name. For example, {{unichar|A015|YI SYLLABLE WU}} has the formal alias {{sc2|YI SYLLABLE ITERATION MARK}}, and {{unichar|FE18|PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRA'''KC'''ET|note=[[sic]]}} has the formal alias {{sc2|PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRA'''CK'''ET}}.<ref>{{cite web
-The Unicode standard defines four text normalization algorithms: NFD, NFC, NFKD, and NFKC.
+| title = Unicode Name Aliases
+| url = https://unicode.org/Public/UNIDATA/NameAliases.txt
+| accessdate = 2010-03-16}}</ref>
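The relationship between an immutable character name and its formal alias can be observed in Python, whose `unicodedata.lookup` resolves name aliases as of Python 3.3. This is a small illustration only; the variable names are arbitrary.

```python
import unicodedata

ch = "\uFE18"

# The official name is immutable, misspelling included.
official = unicodedata.name(ch)

# lookup() also accepts the corrected formal alias.
via_alias = unicodedata.lookup(
    "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
)

print(official)
print(via_alias == ch)
```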
+===Ready-made versus composite characters===
-=== NFD ===
+Unicode includes a mechanism for modifying characters that greatly extends the supported glyph repertoire. This covers the use of [[combining diacritical mark]]s that may be added after the base character by the user. Multiple combining diacritics may be simultaneously applied to the same character. Unicode also contains [[precomposed character|precomposed]] versions of most letter/diacritic combinations in normal use. These make conversion to and from legacy encodings simpler, and allow applications to use Unicode as an internal text format without having to implement combining characters. For example, ''é'' can be represented in Unicode as [[#Upluslink|U+]]0065 ({{sc2|LATIN SMALL LETTER E}}) followed by U+0301 ({{sc2|COMBINING ACUTE ACCENT}}), but it can also be represented as the precomposed character U+00E9 ({{sc2|LATIN SMALL LETTER E WITH ACUTE}}). Thus, in many cases, users have multiple ways of encoding the same character. To deal with this, Unicode provides the mechanism of [[canonical equivalence]].
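The two encodings of ''é'' mentioned above compare as different strings until they are normalized. A minimal sketch using Python's `unicodedata.normalize`; the helper `canonically_equal` is a hypothetical name, and comparing NFD forms is one common way to test canonical equivalence, not the only one.

```python
import unicodedata

decomposed = "e\u0301"   # U+0065 followed by U+0301
precomposed = "\u00E9"   # U+00E9

print(decomposed == precomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

def canonically_equal(a: str, b: str) -> bool:
    # Two strings are canonically equivalent if their NFD forms match.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(canonically_equal(decomposed, precomposed))  # True
```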
-NFD, '''n'''ormalization '''f'''orm '''D''' ("D" for '''d'''ecomposition), is canonical decomposition: an algorithm that recursively decomposes precomposed characters into a sequence of one or more simple characters according to the decomposition tables. It is recursive because a precomposed character may decompose into several others, some of which are themselves precomposed and are decomposed further.
+An example of this arises with [[Hangul]], the Korean alphabet. Unicode provides a mechanism for composing Hangul syllables with their individual subcomponents, known as [[Hangul Jamo]]. However, it also provides 11,172 combinations of precomposed syllables made from the most common jamo.
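The figure of 11,172 precomposed syllables is the product of the numbers of leading consonants, vowels, and trailing consonants (including "no trailing consonant"). The sketch below assumes the arithmetic constants of the Hangul syllable composition algorithm in the Unicode Standard, which are not given in the text above; the function name `compose_syllable` is illustrative.

```python
import unicodedata

# Assumed constants: syllable block base and jamo counts.
S_BASE, L_COUNT, V_COUNT, T_COUNT = 0xAC00, 19, 21, 28

def compose_syllable(l: int, v: int, t: int = 0) -> str:
    """Indices: leading consonant 0-18, vowel 0-20, trailing 0-27 (0 = none)."""
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

print(L_COUNT * V_COUNT * T_COUNT)  # 11172

han = compose_syllable(18, 0, 4)
print(f"U+{ord(han):04X}")  # U+D55C

# Canonical decomposition recovers the individual jamo.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", han)])
# ['U+1112', 'U+1161', 'U+11AB']
```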
-Examples:
-{| style="text-align:center"
-|
-{| class="wikitable" style="text-align:center"
- | Ω
- |-
- | <small>U+2126</small>
- |}
-| colspan="2" | →
-|
-{| class="wikitable"
- | Ω
- |-
- | <small>U+03A9</small>
- |}
-|}
-{| style="text-align:center"
-|
-{| class="wikitable" style="text-align:center"
- | <big>&#x00C5;</big>
- |-
- | <small>U+00C5</small>
- |}
-| colspan="2" | →
-|
-{| class="wikitable"
- | <big>&#x0041;</big>
- |-
- | <small>U+0041</small>
- |}
-|
-{| class="wikitable"
- | <big> &#x030A;</big>
- |-
- | <small>U+030A</small>
- |}
-|}
-{| style="text-align:center"
-|
-{| class="wikitable" style="text-align:center"
- | <big>&#x1E69;</big>
- |-
- | <small>U+1E69</small>
- |}
-| colspan="2" | →
-|
-{| class="wikitable"
- | <big>&#x0073;</big>
- |-
- | <small>U+0073</small>
- |}
-|
-{| class="wikitable"
- | <big> &#x0323;</big>
- |-
- | <small>U+0323</small>
- |}
-|
-{| class="wikitable"
- | <big> &#x0307;</big>
- |-
- | <small>U+0307</small>
- |}
-|}
-{| style="text-align:center"
-|
-{| class="wikitable" style="text-align:center"
- | colspan="2" | <big>&#x1E0B;&#x0323;</big>
- |-
- | <small>U+1E0B</small> || <small>U+0323</small>
- |}
-| colspan="2" | →
-|
-{| class="wikitable"
- | <big>&#x0064;</big>
- |-
- | <small>U+0064</small>
- |}
-|
-{| class="wikitable"
- | <big> &#x0323;</big>
- |-
- | <small>U+0323</small>
- |}
-|
-{| class="wikitable"
- | <big> &#x0307;</big>
- |-
- | <small>U+0307</small>
- |}
-|}
-{| style="text-align:center"
-|
-{| class="wikitable" style="text-align:center"
- | colspan="3" | <big>&#x0071;&#x0307;&#x0323;</big>
- |-
- | <small>U+0071</small> || <small>U+0307</small> || <small>U+0323</small>
- |}
-| colspan="2" | →
-|
-{| class="wikitable"
- | <big>&#x0071;</big>
- |-
- | <small>U+0071</small>
- |}
-|
-{| class="wikitable"
- | <big> &#x0323;</big>
- |-
- | <small>U+0323</small>
- |}
-|
-{| class="wikitable"
- | <big> &#x0307;</big>
- |-
- | <small>U+0307</small>
- |}
-|}
+The [[CJK]] characters currently have codes only for their precomposed form. Still, most of those characters comprise simpler elements (called [[Radical_(Chinese_characters)|radicals]]), so in principle Unicode could have decomposed them as it did with Hangul. This would have greatly reduced the number of required code points, while allowing the display of virtually every conceivable character (which might do away with some of the problems caused by [[Han unification]]). A similar idea is used by some [[input method]]s, such as [[Cangjie method|Cangjie]] and [[Wubi method|Wubi]]. However, attempts to do this for character encoding have stumbled over the fact that Chinese characters do not decompose as simply or as regularly as Hangul does.
-=== NFC ===
-NFC, '''n'''ormalization '''f'''orm '''C''' ("C" for '''c'''omposition), is an algorithm that performs canonical decomposition followed by canonical composition. First, canonical decomposition (the NFD algorithm) brings the text to form D. Then canonical composition, the inverse of NFD, processes the text from start to end according to the following rules:
-* a character <code>S</code> is a ''starter'' if it has a combining class of zero according to the Unicode character table;
-* in any character sequence beginning with a character <code>S</code>, a character <code>C</code> is blocked from <code>S</code> only if there is some character <code>B</code> between <code>S</code> and <code>C</code> that either is a starter or has a combining class equal to or higher than that of <code>C</code>. This rule applies only to strings that have undergone canonical decomposition;
-* a character is a ''primary'' composite if it has a canonical decomposition in the Unicode character table (or a canonical decomposition for [[Хангыль|Hangul]]) and is not in the [http://www.unicode.org/unicode/reports/tr15/#Primary_Exclusion_List_Table exclusion list];
-* a character <code>X</code> can be primary combined with a character <code>Y</code> if and only if there exists a primary composite <code>Z</code> canonically equivalent to the sequence &lt;<code>X</code>, <code>Y</code>&gt;;
-* if the next character <code>C</code> is not blocked from the last starter <code>L</code> encountered and can be primary combined with it, then <code>L</code> is replaced by the composite <code>L-C</code> and <code>C</code> is removed.
+A set of [[Radical (Chinese character)|radicals]] was provided in Unicode 3.0 (CJK radicals between U+2E80 and U+2EFF, KangXi radicals in U+2F00 to U+2FDF, and ideographic description characters from U+2FF0 to U+2FFB), but the Unicode standard (ch. 12.2 of Unicode 5.2) warns against using [[Ideographic Description Sequences|ideographic description sequences]] as an alternate representation for previously encoded characters:
-Example:
-{| style="text-align:center"
-|
-{| class="wikitable" style="text-align:center"
- | <big>&#x006F;</big>
- |-
- | <small>U+006F</small>
- |}
-|
-{| class="wikitable" style="text-align:center"
- | <big> &#x0302;</big>
- |-
- | <small>U+0302</small>
- |}
-| colspan="2" | →
-|
-{| class="wikitable"
- | <big>&#x00F4;</big>
- |-
- | <small>U+00F4</small>
- |}
-|}
+{{quote|This process is different from a formal ''encoding'' of an ideograph. There is no canonical description of unencoded ideographs; there is no semantic assigned to described ideographs; there is no equivalence defined for described ideographs. Conceptually, ideographic descriptions are more akin to the English phrase "an 'e' with an acute accent on it" than to the character sequence &lt;U+0065, U+0301&gt;.}}
-=== NFKD ===
-NFKD, '''n'''ormalization '''f'''orm '''KD''', is compatibility decomposition: an algorithm that performs canonical decomposition and then replaces characters according to the compatibility decomposition tables. The compatibility decomposition tables provide replacements for nearly equivalent characters<ref>[http://habrahabr.ru/post/45489/ Нормализация Unicode]</ref>:
-* letterlike symbols (ℍ and ℌ);
-* circled characters (①);
-* width variants (ｶ and カ);
-* rotated forms (︷ and {);
-* superscripts and subscripts (⁹ and ₉);
-* fractions (¼);
-* others (™).
+===Ligatures===
-Examples:
+Many scripts, including [[Arabic script|Arabic]] and [[Devanagari|Devanāgarī]], have special orthographic rules that require certain combinations of letterforms to be combined into special [[ligature (typography)|ligature forms]]. The rules governing ligature formation can be quite complex, requiring special script-shaping technologies such as ACE (the Arabic Calligraphic Engine, developed by DecoType in the 1980s and used to generate all the Arabic examples in the printed editions of the Unicode Standard), which became the [[proof of concept]] for [[OpenType]] (by Adobe and Microsoft); [[Graphite (SIL)|Graphite]] (by [[SIL International]]); or [[Apple Advanced Typography|AAT]] (by Apple).
-{| style="text-align:center;"
-|
+Instructions are also embedded in fonts to tell the [[operating system]] how to properly output different character sequences. A simple solution to the placement of combining marks or diacritics is assigning the marks a width of zero and placing the glyph itself to the left or right of the left sidebearing (depending on the direction of the script they are intended to be used with). A mark handled this way will appear over whatever character precedes it, but will not adjust its position relative to the width or height of the base glyph; it may be visually awkward and it may overlap some glyphs. Real stacking is impossible, but can be approximated in limited cases (for example, Thai top-combining vowels and tone marks can just be at different heights to start with). Generally this approach is only effective in monospaced fonts, but may be used as a fallback rendering method when more complex methods fail.
-{| class="wikitable" style="text-align:center;"
- | <big>&#x210d;</big>
- |-
- | <small>U+210d</small>
- |}
-| colspan="2" | →
-|
-{| class="wikitable" style="text-align:center;"
- | <big>&#x0048;</big>
- |-
- | <small>U+0048</small>
- |}
-|}
-{| style="text-align:center;"
-|
-{| class="wikitable" style="text-align:center;"
- | <big>&#x2460;</big>
- |-
- | <small>U+2460</small>
- |}
-| colspan="2" | →
-|
-{| class="wikitable" style="text-align:center;"
- | <big>&#x0031;</big>
- |-
- | <small>U+0031</small>
- |}
-|}
-{|
-|
-{| class="wikitable" style="text-align:center;"
- | <big>&#xFF76;</big>
- |-
- | <small>U+FF76</small>
- |}
-| colspan="2" | →
-|
-{| class="wikitable" style="text-align:center;"
- | <big>&#x30AB;</big>
- |-
- | <small>U+30AB</small>
- |}
-|}
-{|
-|
-{| class="wikitable" style="text-align:center;"
- | <big>&#xFE37;</big>
- |-
- | <small>U+FE37</small>
- |}
-| colspan="2" | →
-|
-{| class="wikitable" style="text-align:center;"
- | <big>&#x007B;</big>
- |-
- | <small>U+007B</small>
- |}
-|}
-{|
-|
-{| class="wikitable" style="text-align:center;"
- | <big>&#x2079;</big>
- |-
- | <small>U+2079</small>
- |}
-| colspan="2" | →
-|
-{| class="wikitable" style="text-align:center;"
- | <big>&#x0039;</big>
- |-
- | <small>U+0039</small>
- |}
-|}
-{|
-|
-{| class="wikitable" style="text-align:center;"
- | <big>&#x00BC;</big>
- |-
- | <small>U+00BC</small>
- |}
-| colspan="2" | →
-|
-{| class="wikitable" style="text-align:center;"
- | <big>&#x0031;</big> || <big> &#x2044; </big> || <big>&#x0034;</big>
- |-
- | <small>U+0031</small> || <small>U+2044</small> || <small>U+0034</small>
- |}
-|}
-{|
-|
-{| class="wikitable" style="text-align:center;"
- | <big>&#x2122;</big>
- |-
- | <small>U+2122</small>
- |}
-| colspan="2" | →
-|
-{| class="wikitable" style="text-align:center;"
- | <big>&#x0054;</big> || <big>&#x004D;</big>
- |-
- | <small>U+0054</small> || <small>U+004D</small>
- |}
-|}
-=== NFKC ===
+===Standardized subsets===
+Several subsets of Unicode are standardized: Microsoft Windows since [[Windows NT 4.0]] supports [[WGL-4]] with 656 characters, which is considered to support all contemporary European languages using the Latin, Greek, or Cyrillic script. Other standardized subsets of Unicode include the Multilingual European Subsets:<ref>[https://www.evertype.com/standards/iso10646/pdf/cwa13873.pdf CWA 13873:2000&nbsp;– Multilingual European Subsets in ISO/IEC 10646-1] [[European Committee for Standardization|CEN]] Workshop Agreement 13873</ref>
-NFKC, '''n'''ormalization '''f'''orm '''KC''', is an algorithm that performs compatibility decomposition (the NFKD algorithm) followed by canonical composition (as in the NFC algorithm).
+MES-1 (Latin scripts only; 335 characters), MES-2 (Latin, Greek, and Cyrillic; 1062 characters)<ref>[https://www.cl.cam.ac.uk/~mgk25/ucs/mes-2-rationale.html Multilingual European Character Set 2 (MES-2) Rationale], [[Markus Kuhn (computer scientist)|Markus Kuhn]], 1998</ref>, and MES-3A and MES-3B (two larger subsets, not shown here). Note that MES-2 includes every character in MES-1 and WGL-4.
-=== Examples ===
-{| class="standard"
+{| class="wikitable"
- !Source text||NFD||NFC||NFKD||NFKC
+|+ {{nobold|'''WGL-4''', ''MES-1'' and MES-2}}
- |-
- | <!-- fi -->
-{| class="wikitable" style="text-align:center;"
-| <big>ﬁ</big>
 |-
+! Row !! Cells !! Range(s)
-| <small>U+FB01</small>
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>ﬁ</big>
 |-
+!rowspan="2"| 00
-| <small>U+FB01</small>
+| '''''20–7E'''''
-|}
+| [[Basic Latin (Unicode block)|Basic Latin]] (00–7F)
-|
-{| class="wikitable" style="text-align:center;"
-| <big>ﬁ</big>
 |-
+| '''''A0–FF'''''
-| <small>U+FB01</small>
+| [[Latin-1 Supplement (Unicode block)|Latin-1 Supplement]] (80–FF)
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>&#x0066;</big> || <big>&#x0069;</big>
 |-
+!rowspan="2"| 01
-| <small>U+0066</small> || <small>U+0069</small>
+| '''''00–13,'' 14–15, ''16–2B,'' 2C–2D, ''2E–4D,'' 4E–4F, ''50–7E,'' 7F'''
-|}
+| [[Latin Extended-A]] (00–7F)
-|
-{| class="wikitable" style="text-align:center;"
-| <big>&#x0066;</big> || <big>&#x0069;</big>
 |-
+| 8F, '''92,''' B7, DE–EF, '''FA–FF'''
-| <small>U+0066</small> || <small>U+0069</small>
+| [[Latin Extended-B]] (80–FF <span title="U+024F">...</span>)
-|}
- |-
- | <!-- 2^5 -->
-{| class="wikitable" style="text-align:center;"
-| <big>&#x0032;</big> || <big>⁵</big>
 |-
+!rowspan="3"| 02
-| <small>U+0032</small> || <small>U+2075</small>
+| 18–1B, 1E–1F
-|}
+| Latin Extended-B (<span title="U+0180">...</span> 00–4F)
-|
-{| class="wikitable" style="text-align:center;"
-| <big>&#x0032;</big> || <big>⁵</big>
 |-
+| 59, 7C, 92
-| <small>U+0032</small> || <small>U+2075</small>
+| [[IPA Extensions]] (50–AF)
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>&#x0032;</big> || <big>⁵</big>
 |-
+| BB–BD, '''C6, ''C7,'' C9,''' D6, '''''D8–DB,'' DC, ''DD,''''' DF, EE
-| <small>U+0032</small> || <small>U+2075</small>
+| [[Spacing Modifier Letters]] (B0–FF)
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>&#x0032;</big> || <big>&#x0035;</big>
 |-
+! 03
-| <small>U+0032</small> || <small>U+0035</small>
+| 74–75, 7A, 7E, '''84–8A, 8C, 8E–A1, A3–CE,''' D7, DA–E1
-|}
+| [[Greek and Coptic|Greek]] (70–FF)
-|
-{| class="wikitable" style="text-align:center;"
-| <big>&#x0032;</big> || <big>&#x0035;</big>
 |-
+! 04
-| <small>U+0032</small> || <small>U+0035</small>
+| '''00–5F, 90–91,''' 92–C4, C7–C8, CB–CC, D0–EB, EE–F5, F8–F9
-|}
+| [[Cyrillic (Unicode block)|Cyrillic]] (00–FF)
- |-
- | <!-- "s" (looks like "f") with two dots -->
-{| class="wikitable" style="text-align:center;"
-| colspan="2" | <big>ẛ̣</big>
 |-
+! 1E
-| <small>U+1E9B</small> || <small>U+0323</small>
+| 02–03, 0A–0B, 1E–1F, 40–41, 56–57, 60–61, 6A–6B, '''80–85,''' 9B, '''F2–F3'''
-|}
+| [[Latin Extended Additional]] (00–FF)
-|
-{| class="wikitable" style="text-align:center;"
-| <big>ſ</big> || <big>̣</big> || <big>̇</big>
 |-
+! 1F
-| <small>U+017F</small> || <small>U+0323</small> || <small>U+0307</small>
+| 00–15, 18–1D, 20–45, 48–4D, 50–57, 59, 5B, 5D, 5F–7D, 80–B4, B6–C4, C6–D3, D6–DB, DD–EF, F2–F4, F6–FE
-|}
+| [[Greek Extended]] (00–FF)
-|
-{| class="wikitable" style="text-align:center;"
-| <big>ẛ</big> || <big>̣</big>
 |-
+!rowspan="3"| 20
-| <small>U+1E9B</small> || <small>U+0323</small>
+| '''13–14, ''15,'' 17, ''18–19,'' 1A–1B, ''1C–1D,'' 1E, 20–22, 26, 30, 32–33, 39–3A, 3C, 3E, 44,''' 4A
-|}
+| [[General Punctuation]] (00–6F)
-|
-{| class="wikitable" style="text-align:center;"
-| <big>&#x0073;</big> || <big>̣</big> || <big>̇</big>
 |-
+| '''7F''', 82
-| <small>U+0073</small> || <small>U+0323</small> || <small>U+0307</small>
+| [[Superscripts and Subscripts]] (70–9F)
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>ṩ</big>
 |-
+| '''A3–A4, A7, ''AC,''''' AF
-| <small>U+1E69</small>
+| [[Currency Symbols (Unicode block)|Currency Symbols]] (A0–CF)
-|}
- |-
- | <!-- "й" -->
-{| class="wikitable" style="text-align:center;"
-| <big>й</big>
 |-
+!rowspan="3"| 21
-| <small>U+0439</small>
+| '''05, 13, 16, ''22, 26,'' 2E'''
-|}
+| [[Letterlike Symbols]] (00–4F)
-|
-{| class="wikitable" style="text-align:center;"
-| <big>и</big> || <big> ̆</big>
 |-
+| '''''5B–5E'''''
-| <small>U+0438</small> || <small>U+0306</small>
+| [[Number Forms]] (50–8F)
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>й</big>
 |-
+| '''''90–93,'' 94–95, A8'''
-| <small>U+0439</small>
+| [[Arrows (Unicode block)|Arrows]] (90–FF)
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>и</big> || <big> ̆</big>
 |-
+! 22
-| <small>U+0438</small> || <small>U+0306</small>
+| 00, '''02,''' 03, '''06,''' 08–09, '''0F, 11–12, 15, 19–1A, 1E–1F,''' 27–28, '''29,''' 2A, '''2B, 48,''' 59, '''60–61, 64–65,''' 82–83, 95, 97
-|}
+| [[Mathematical Operators]] (00–FF)
-|
-{| class="wikitable" style="text-align:center;"
-| <big>й</big>
 |-
+! 23
-| <small>U+0439</small>
+| '''02, 0A, 20–21,''' 29–2A
-|}
+| [[Miscellaneous Technical]] (00–FF)
- |-
- | <!-- "ё" -->
-{| class="wikitable" style="text-align:center;"
-| <big>ё</big>
 |-
+!rowspan="3"| 25
-| <small>U+0451</small>
+| '''00, 02, 0C, 10, 14, 18, 1C, 24, 2C, 34, 3C, 50–6C'''
-|}
+| [[Box Drawing]] (00–7F)
-|
-{| class="wikitable" style="text-align:center;"
-| <big>е</big> || <big>̈</big>
 |-
+| '''80, 84, 88, 8C, 90–93'''
-| <small>U+0435</small> || <small>U+0308</small>
+| [[Block Elements]] (80–9F)
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>ё</big>
 |-
+| '''A0–A1, AA–AC, B2, BA, BC, C4, CA–CB, CF, D8–D9, E6'''
-| <small>U+0451</small>
+| [[Geometric Shapes]] (A0–FF)
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>е</big> || <big>̈</big>
 |-
+! 26
-| <small>U+0435</small> || <small>U+0308</small>
+| '''3A–3C, 40, 42, 60, 63, 65–66, ''6A,'' 6B'''
-|}
+| [[Miscellaneous Symbols]] (00–FF)
-|
-{| class="wikitable" style="text-align:center;"
-| <big>ё</big>
 |-
+! F0
-| <small>U+0451</small>
+| (01–02)<!--in WGL-4, but not in MES-2-->
-|}
+| [[Private Use Area (Unicode block)|Private Use Area]] (00–FF ...)
- |-
- | <!-- "А" -->
-{| class="wikitable" style="text-align:center;"
-| <big>А</big>
 |-
+! FB
-| <small>U+0410</small>
+| '''01–02'''
-|}
+| [[Alphabetic Presentation Forms]] (00–4F)
-|
-{| class="wikitable" style="text-align:center;"
-| <big>А</big>
 |-
+! FF
-| <small>U+0410</small>
-|}
+| FD
+| [[Specials (Unicode block)|Specials]]
-|
-{| class="wikitable" style="text-align:center;"
-| <big>А</big>
-|-
-| <small>U+0410</small>
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>А</big>
-|-
-| <small>U+0410</small>
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>А</big>
-|-
-| <small>U+0410</small>
-|}
- |-
- | <!-- "が" -->
-{| class="wikitable" style="text-align:center;"
-| <big>が</big>
-|-
-| <small>U+304C</small>
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>か</big> || <big>゙</big>
-|-
-| <small>U+304B</small> || <small>U+3099</small>
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>が</big>
-|-
-| <small>U+304C</small>
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>か</big> || <big>゙</big>
-|-
-| <small>U+304B</small> || <small>U+3099</small>
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>が</big>
-|-
-| <small>U+304C</small>
-|}
- |-
- | <!-- "VIII" -->
-{| class="wikitable" style="text-align:center;"
-| <big>Ⅷ</big>
-|-
-| <small>U+2167</small>
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>Ⅷ</big>
-|-
-| <small>U+2167</small>
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>Ⅷ</big>
-|-
-| <small>U+2167</small>
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>&#x0056;</big> || <big>&#x0049;</big> || <big>&#x0049;</big> || <big>&#x0049;</big>
-|-
-| <small>U+0056</small> || <small>U+0049</small> || <small>U+0049</small> || <small>U+0049</small>
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>&#x0056;</big> || <big>&#x0049;</big> || <big>&#x0049;</big> || <big>&#x0049;</big>
-|-
-| <small>U+0056</small> || <small>U+0049</small> || <small>U+0049</small> || <small>U+0049</small>
-|}
- |-
- | <!-- "ç" -->
-{| class="wikitable" style="text-align:center;"
-| <big>ç</big>
-|-
-| <small>U+00E7</small>
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>&#x0063;</big> || <big>̧</big>
-|-
-| <small>U+0063</small> || <small>U+0327</small>
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>ç</big>
-|-
-| <small>U+00E7</small>
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>&#x0063;</big> || <big>̧</big>
-|-
-| <small>U+0063</small> || <small>U+0327</small>
-|}
-|
-{| class="wikitable" style="text-align:center;"
-| <big>ç</big>
-|-
-| <small>U+00E7</small>
-|}
 |}
+Rendering software which cannot process a Unicode character appropriately often displays it as an open rectangle, or the Unicode "[[replacement character]]" (U+FFFD, �), to indicate the position of the unrecognized character. Some systems have made attempts to provide more information about such characters. Apple's [[Last Resort font]] will display a substitute glyph indicating the Unicode range of the character, and the [[SIL International]]'s [[Unicode fallback font|Unicode Fallback]] font will display a box showing the hexadecimal scalar value of the character.
-== Bidirectional text ==
-The Unicode standard supports scripts written left to right (LTR) as well as scripts written right to left (RTL), such as the [[арабское письмо|Arabic]] and [[еврейский алфавит|Hebrew]] scripts. In both cases characters are stored in "natural" order; displaying them in the correct writing direction is the application's responsibility.
+==={{anchor|UTF|UCS}}Mapping and encodings===
-In addition, Unicode supports mixed texts that combine fragments with different writing directions. This capability is called ''bidirectionality'' (bidirectional text, BiDi). Some simplified text processors (for example, in mobile phones) may support Unicode without supporting bidirectionality. All Unicode characters are divided into several categories: written left to right, written right to left, and written in either direction. Characters in the last category (mostly [[знаки пунктуации|punctuation marks]]) take on the direction of the surrounding text when displayed.
+Several mechanisms have been specified for storing a series of code points as a series of bytes.
-== Encoded characters ==
-[[Файл:Roadmap to Unicode BMP multilingual.svg|lang=ru|right|500px|thumb|Map of the [[Плоскость (Юникод)#Основная многоязычная плоскость|Basic Multilingual Plane]] of Unicode]]
-{{Main|Плоскость (Юникод)}}
+<!-- [[Unicode Transformation Format]] redirects here -->
-Unicode includes practically all modern [[письменность|writing systems]], including:
+Unicode defines two mapping methods: the ''Unicode Transformation Format'' (UTF) encodings, and the ''[[Universal Coded Character Set]]'' (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode ''code points'' to sequences of values in some fixed-size range, termed ''code units''. All UTF encodings map code points to a unique sequence of bytes.<ref>{{cite web|title=UTF-8, UTF-16, UTF-32 & BOM|url=https://unicode.org/faq/utf_bom.html|website=Unicode.org FAQ|accessdate=12 December 2016}}</ref> The numbers in the names of the encodings indicate the number of bits per code unit (for UTF encodings) or the number of bytes per code unit (for UCS encodings and UTF-1). UTF-8 and UTF-16 are the most commonly used encodings. UCS-2 is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent.
-{{columns-list|2|
-* [[арабское письмо|арабскую]],
+UTF encodings include:
-* [[армянское письмо|армянскую]],
-* [[бенгальское письмо|бенгальскую]],
+* [[UTF-1]], a retired predecessor of UTF-8, maximizes compatibility with [[ISO/IEC 2022|ISO 2022]], no longer part of ''The Unicode Standard''
-* [[Бирманское письмо|бирманскую]],
+* [[UTF-7]], a 7-bit encoding sometimes used in e-mail, often considered obsolete (not part of ''The Unicode Standard'', but only documented as an informational [[Request for Comments|RFC]], i.e., not on the Internet Standards Track)
-* [[Глаголица|глаголицу]],
+* [[UTF-8]], uses one to four bytes for each code point, maximizes compatibility with [[ASCII]]
-* [[Греческий алфавит|греческую]],
+* [[UTF-EBCDIC]], similar to UTF-8 but designed for compatibility with [[EBCDIC]] (not part of ''The Unicode Standard'')
-* [[грузинское письмо|грузинскую]],
+* [[UTF-16]], uses one or two 16-bit code units per code point, cannot encode surrogates
-* [[деванагари]],
+* [[UTF-32]], uses one 32-bit code unit per code point
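The code-unit counts listed above can be checked with a short Python sketch. It relies only on the standard codecs; the helper name `code_unit_counts` and the two sample characters are illustrative choices, not part of the article.

```python
# Count the code units needed for one code point in three encoding forms.
# Big-endian codec variants are used so that no byte order mark is emitted.

def code_unit_counts(ch):
    """Return the number of code units for a single character."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code units
    }

if __name__ == "__main__":
    # U+20AC EURO SIGN lies in the BMP; U+1F600 lies outside it.
    for ch in ("\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X}", code_unit_counts(ch))
```

A BMP character fits in one UTF-16 code unit, while a supplementary-plane character needs a surrogate pair.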
-* [[еврейский алфавит|еврейскую]],
-* [[Кириллица|кириллицу]],
+UTF-8 uses one to four bytes per code point and, being compact for Latin scripts and ASCII-compatible, provides the ''de facto'' standard encoding for interchange of Unicode text. It is used by [[FreeBSD]] and most recent [[Linux distributions]] as a direct replacement for legacy encodings in general text handling.
-* [[китайское письмо|китайскую]] (китайские иероглифы активно используются в [[японский язык|японском языке]], а также изредка в [[корейский язык|корейском]]),
-* [[коптское письмо|коптскую]],
+The UCS-2 and UTF-16 encodings specify the Unicode [[Byte Order Mark]] (BOM) for use at the beginnings of text files, which may be used for byte ordering detection (or [[endianness|byte endianness]] detection). The BOM, code point U+FEFF, has the important property of unambiguity on byte reorder, regardless of the Unicode encoding used: U+FFFE (the result of byte-swapping U+FEFF) does not equate to a legal character, and U+FEFF in places other than the beginning of text conveys the zero-width no-break space (a character with no appearance and no effect other than preventing the formation of [[ligature (typography)|ligatures]]).
-* [[Кхмерское письмо|кхмерскую]],
-* [[Латинский алфавит|латинскую]],
+The same character converted to UTF-8 becomes the byte sequence <code>EF BB BF</code>. The Unicode Standard allows that the BOM "can serve as signature for UTF-8 encoded text where the character set is unmarked".<ref>{{Cite book | title=The Unicode Standard, Version 6.2 | publisher=The Unicode Consortium | year=2013 | isbn=978-1-936213-08-5 | page=561 }}</ref> Some software developers have adopted it for other encodings, including UTF-8, in an attempt to distinguish UTF-8 from local 8-bit [[code page]]s. However, {{IETF RFC|3629}}, the UTF-8 standard, recommends that byte order marks be forbidden in protocols using UTF-8, but discusses the cases where this may not be possible. In addition, the large restriction on possible patterns in UTF-8 (for instance, there cannot be any lone bytes with the high bit set) means that it should be possible to distinguish UTF-8 from other character encodings without relying on the BOM.
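As a minimal sketch of the two detection strategies described above, the following Python code first looks for a BOM and then falls back to checking whether the bytes are well-formed UTF-8. The function names `sniff_bom` and `looks_like_utf8` are hypothetical; the BOM constants come from the standard `codecs` module.

```python
import codecs

def sniff_bom(data: bytes):
    """Return an encoding name if data starts with a known BOM, else None."""
    # UTF-32 marks are tested before UTF-16 marks because the UTF-32LE
    # mark (FF FE 00 00) begins with the UTF-16LE mark (FF FE).
    candidates = [
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    ]
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None

def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: well-formed UTF-8 rarely occurs by accident in 8-bit text."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

The second function is only a heuristic: pure ASCII input also passes, so it cannot distinguish UTF-8 from an ASCII-compatible legacy encoding when no high-bit bytes are present.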
-* [[Тамильское письмо|тамильскую]],
-* [[Хангыль|корейскую (хангыль)]],
-* [[письмо чероки|чероки]],
-* [[Эфиопское письмо|эфиопскую]],
-* [[японское письмо|японскую]] (которая включает в себя, кроме [[кана|слоговой азбуки]], ещё и [[кандзи|китайские иероглифы]])
-}}
-и другие.
+In UTF-32 and UCS-4, one [[32-bit]] code unit serves as a fairly direct representation of any character's code point (although the endianness, which varies across different platforms, affects how the code unit manifests as a byte sequence). In the other encodings, each code point may be represented by a variable number of code units. UTF-32 is widely used as an internal representation of text in programs (as opposed to stored or transmitted text), since every Unix operating system that uses the [[GNU Compiler Collection|gcc]] compilers to generate software uses it as the standard "[[wide character]]" encoding. Some programming languages, such as [[Seed7]], use UTF-32 as internal representation for strings and characters. Recent versions of the [[Python (programming language)|Python]] programming language (beginning with 2.2) may also be configured to use UTF-32 as the representation for Unicode strings, effectively disseminating such encoding in [[high-level programming language|high-level]] coded software.
-С академическими целями добавлены многие исторические письменности, в том числе: [[руны|германские руны]], [[Древнетюркское письмо|древнетюркские руны]], [[древнегреческий язык|древнегреческая письменность]], [[египетские иероглифы]], [[клинопись]], [[письменность майя]], [[этрусский алфавит]].
+[[Punycode]], another encoding form, enables the encoding of Unicode strings into the limited character set supported by the [[ASCII]]-based [[Domain Name System]] (DNS). The encoding is used as part of [[IDNA]], which is a system enabling the use of [[Internationalized Domain Names]] in all scripts that are supported by Unicode. Earlier and now historical proposals include [[UTF-5]] and [[UTF-6]].
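For illustration, Python ships `punycode` and `idna` codecs that show how a non-ASCII label is reduced to the ASCII repertoire. The sample label is an arbitrary choice, and the built-in `idna` codec implements the older IDNA 2003 rules, so this is a sketch rather than a complete IDNA implementation.

```python
def to_punycode(label: str) -> str:
    """Encode one label with the raw Punycode algorithm (no ACE prefix)."""
    return label.encode("punycode").decode("ascii")

def to_ace(domain: str) -> str:
    """Encode a domain name with Python's built-in IDNA 2003 codec."""
    return domain.encode("idna").decode("ascii")

if __name__ == "__main__":
    print(to_punycode("münchen"))  # raw Punycode form of the label
    print(to_ace("münchen.de"))    # each non-ASCII label gains the xn-- prefix
```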
-В Юникоде представлен широкий набор [[таблица математических символов|математических]] и [[музыка]]льных символов, а также [[пиктограмма|пиктограмм]].
+[[GB 18030|GB18030]] is another encoding form for Unicode, from the [[Standardization Administration of China]]. It is the official [[character set]] of the [[People's Republic of China]] (PRC). [[Binary Ordered Compression for Unicode|BOCU-1]] and [[Standard Compression Scheme for Unicode|SCSU]] are Unicode compression schemes. The [[April Fools' Day RFC]] of 2005 specified two [[parody]] UTF encodings, [[UTF-9]] and [[UTF-18]].
-[[государственный флаг|Государственные флаги]] не включены в Юникод напрямую. Для их кодирования используются пары из 26 буквенных символов, предназначенных для представления двухбуквенных кодов стран по стандарту [[ISO 3166-1 alpha-2]]. Эти буквы закодированы в диапазоне от {{unichar|1F1E6|regional indicator symbol letter a|html=}} до {{unichar|1F1FF|regional indicator symbol letter z|html=}}.
+==Adoption==
-В Юникод принципиально не включаются [[логотип]]ы компаний и продуктов, хотя они и встречаются в шрифтах (например, логотип [[Apple]] в кодировке [[MacRoman]] (0xF0) или логотип [[Microsoft Windows|Windows]] в шрифте Wingdings (0xFF)). В юникодовских шрифтах логотипы должны размещаться только в области пользовательских символов.
+===Operating systems===
-== ISO/IEC 10646 ==
+Unicode has become the dominant scheme for internal processing and storage of text. Although a great deal of text is still stored in legacy encodings, Unicode is used almost exclusively for building new information processing systems. Early adopters tended to use [[UCS-2]] (the fixed-width two-byte precursor to UTF-16) and later moved to [[UTF-16]] (the variable-width current standard), as this was the least disruptive way to add support for non-BMP characters. The best known such system is [[Windows NT]] (and its descendants, [[Windows 2000]], [[Windows XP]], [[Windows Vista]], [[Windows 7]], [[Windows 8]] and [[Windows 10]]), which uses UTF-16 as the sole internal character encoding. The [[Java virtual machine|Java]] and [[.NET Framework|.NET]] bytecode environments, [[macOS]], and [[KDE]] also use it for internal representation. Partial support for Unicode can be installed on [[Windows 9x]] through the [[Microsoft Layer for Unicode]].
-Консорциум Юникода работает в тесной связи с рабочей группой ISO/IEC/JTC1/SC2/WG2, которая занимается разработкой международного стандарта 10646 ([[ISO]]/[[IEC]] 10646). Между стандартом Юникода и ISO/IEC 10646 установлена синхронизация, хотя каждый стандарт использует свою терминологию и систему документации.
+[[UTF-8]] (originally developed for [[Plan 9 from Bell Labs|Plan 9]])<ref>{{cite web
-Сотрудничество Консорциума Юникода с Международной организацией по стандартизации ({{lang-en|International Organization for Standardization, ISO}}) началось в [[1991 год]]у. В [[1993 год]]у ISO выпустила стандарт DIS 10646.1. Для синхронизации с ним Консорциум утвердил стандарт Юникода версии 1.1, в который были внесены дополнительные символы из DIS 10646.1. В результате значения закодированных символов в Unicode 1.1 и DIS 10646.1 полностью совпали.
+ | url = https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
+ | title = UTF-8 history
+ | first = Rob | last = Pike | authorlink = Rob Pike
+ | date = 2003-04-30
+}}</ref> has become the main storage encoding on most [[Unix-like]] operating systems (though others are also used by some libraries) because it is a relatively easy replacement for traditional [[extended ASCII]] character sets. UTF-8 is also the most common Unicode encoding used in [[HTML]] documents on the [[World Wide Web]].
+Multilingual text-rendering engines which use Unicode include [[Uniscribe]] and [[DirectWrite]] for Microsoft Windows, [[ATSUI]] and [[Core Text]] for macOS, and [[Pango]] for [[GTK+]] and the [[GNOME]] desktop.
-В дальнейшем сотрудничество двух организаций продолжилось. В 2000 году стандарт Unicode 3.0 был синхронизирован с ISO/IEC 10646-1:2000. Предстоящая третья версия ISO/IEC 10646 будет синхронизирована с Unicode 4.0. Возможно, эти спецификации даже будут опубликованы как единый стандарт.
+===Input methods===
-Аналогично форматам UTF-16 и UTF-32 в стандарте Юникода, стандарт ISO/IEC 10646 также имеет две основные формы кодирования символов: UCS-2 (2 байта на символ, аналогично UTF-16) и UCS-4 (4 байта на символ, аналогично UTF-32). UCS значит ''универсальный набор кодированных символов'' ({{lang-en|universal coded character set}}). UCS-2 можно считать подмножеством UTF-16 (UTF-16 без суррогатных пар), а UCS-4 является синонимом для UTF-32.
+{{Main|Unicode input}}
+Because keyboard layouts cannot have simple key combinations for all characters, several operating systems provide alternative input methods that allow access to the entire repertoire.
-Различия стандартов Юникод и ISO/IEC 10646:
-* небольшие различия в терминологии;
-* ISO/IEC 10646 не включает разделы, необходимые для полноценной реализации поддержки Юникода:
-** нет данных о двоичном кодировании символов;
-** нет описания алгоритмов сравнения ({{lang-en|collation}}) и отрисовки ({{lang-en|rendering}}) символов;
-** нет перечня свойств символов (например, нет перечня свойств, необходимых для реализации поддержки двунаправленного ({{lang-en|bi-directional}}) письма).
+[[ISO/IEC 14755]],<ref>{{cite web|url=https://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14755.pdf |title=ISO/IEC JTC1/SC 18/WG 9 N |date= |accessdate=2012-06-04}}</ref> which standardises methods for entering Unicode characters from their code points, specifies several methods. There is the ''Basic method'', where a ''beginning sequence'' is followed by the hexadecimal representation of the code point and the ''ending sequence''. There is also a ''screen-selection entry method'', where the characters are listed in a table on a screen, such as with a character map program.
-== Способы представления ==
-Юникод имеет несколько форм представления ({{lang-en|Unicode transformation format, UTF}}): [[UTF-8]], [[UTF-16]] (UTF-16BE, UTF-16LE) и [[UTF-32]] (UTF-32BE, UTF-32LE). Была разработана также форма представления [[UTF-7]] для передачи по семибитным каналам, но из-за несовместимости с [[ASCII]] она не получила распространения и не включена в стандарт. 1 апреля 2005 года были предложены две [[День смеха|шуточные]] формы представления: UTF-9 и UTF-18 ([http://tools.ietf.org/html/rfc4042 RFC{{nbsp}}4042]).
+Online tools for finding the code point for a known character include Unicode Lookup<ref>{{cite web|url=https://unicodelookup.com/|title=Unicode Lookup|last=Hedley|first=Jonathan|date=2009}}</ref> by Jonathan Hedley and Shapecatcher<ref>{{cite web|url=http://shapecatcher.com/|title=Unicode Character Recognition|last=Milde|first=Benjamin|date=2011}}</ref> by Benjamin Milde. In Unicode Lookup, one enters a search key (e.g. "fractions"), and a list of corresponding characters with their code points is returned. In Shapecatcher, based on [[Shape context]], one draws the character in a box and a list of characters approximating the drawing, with their code points, is returned.
-В [[Microsoft]] [[Windows NT]] и основанных на ней системах [[Windows 2000]] и [[Windows XP]] в основном [[Юникод в операционных системах семейства Microsoft Windows|используется]] форма UTF-16LE. В [[UNIX]]-подобных [[Операционная система|операционных системах]] [[GNU/Linux]], [[BSD]] и [[Mac OS X]] принята форма UTF-8 для файлов и UTF-32 или UTF-8 для обработки символов в [[оперативная память|оперативной памяти]].
+===Email===
-[[Punycode]] — другая форма кодирования последовательностей Unicode-символов в так называемые ACE-последовательности, которые состоят только из алфавитно-цифровых символов, как это разрешено в доменных именах.
+{{Main|Unicode and email}}
+[[MIME]] defines two different mechanisms for encoding non-ASCII characters in [[email]], depending on whether the characters are in email headers (such as the "Subject:"), or in the text body of the message; in both cases, the original character set is identified as well as a transfer encoding. For email transmission of Unicode, the [[UTF-8]] character set and the [[Base64]] or the [[Quoted-printable]] transfer encoding are recommended, depending on whether much of the message consists of [[ASCII]] characters. The details of the two different mechanisms are specified in the MIME standards and generally are hidden from users of email software.
-=== UTF-8 ===
-{{Основная статья|UTF-8}}
-UTF-8 — представление Юникода, обеспечивающее наибольшую компактность и обратную совместимость с 7-битной системой [[ASCII]]; текст, состоящий только из символов с номерами меньше 128, при записи в UTF-8 превращается в обычный текст [[ASCII]] и может быть отображён любой программой, работающей с ASCII; и наоборот, текст, закодированный 7-битной ASCII может быть отображён программой, предназначенной для работы с UTF-8. Остальные символы Юникода изображаются последовательностями длиной от 2 до 4 байт, в которых первый байт всегда имеет маску <code>11xxxxxx</code>, а остальные — <code>10xxxxxx</code>. В UTF-8 не используются суррогатные пары.
+The adoption of Unicode in email has been very slow. Some East Asian text is still encoded in encodings such as [[ISO-2022]], and some devices, such as mobile phones, still cannot correctly handle Unicode data. Support has been improving, however. Many major free mail providers such as [[Yahoo]], [[Google]] ([[Gmail]]), and [[Microsoft]] ([[Outlook.com]]) support it.
-Формат UTF-8 был изобретён [[2 сентября]] [[1992 год]]а [[Томпсон, Кен|Кеном Томпсоном]] и [[Пайк, Роб|Робом Пайком]] и реализован в ОС [[Plan 9]]<ref>http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt{{ref-en}}{{Недоступная ссылка|date=Октябрь 2019 |bot=InternetArchiveBot }}</ref>. Сейчас стандарт UTF-8 официально закреплён в документах RFC 3629 и ISO/IEC 10646 Annex D.
-=== UTF-16 и UTF-32 ===
+===Web===
+{{Main|Unicode and HTML}}
-{{Основная статья|UTF-16|UTF-32}}
-UTF-16 — кодировка, позволяющая записывать символы Юникода в диапазонах U+0000...U+D7FF и U+E000...U+10FFFF (общим количеством 1 112 064). При этом каждый символ записывается одним или двумя словами (суррогатная пара). Кодировка UTF-16 описана в приложении Q к международному стандарту ISO/IEC 10646, а также ей посвящён документ IETF RFC 2781 под названием «UTF-16, an encoding of ISO 10646».
+All [[W3C]] recommendations have used Unicode as their ''document character set'' since HTML 4.0. [[Web browser]]s have supported Unicode, especially UTF-8, for many years. There used to be display problems resulting primarily from [[typeface|font]]-related issues; e.g. version 6 and older of Microsoft [[Internet Explorer]] did not render many code points unless explicitly told to use a font that contains them.<ref>{{cite web|first=Alan |last=Wood |url=http://www.alanwood.net/unicode/explorer.html#ie5 |title=Setting up Windows Internet Explorer 5, 5.5 and 6 for Multilingual and Unicode Support |publisher=Alan Wood |date= |accessdate=2012-06-04}}</ref>
-UTF-32 — способ представления Юникода, при котором каждый символ занимает ровно 4 байта. Главное преимущество UTF-32 перед кодировками переменной длины заключается в том, что символы Юникод в ней непосредственно индексируемы, поэтому найти символ по номеру его позиции в файле можно чрезвычайно быстро, и получение любого символа ''n''-й позиции при этом является операцией, занимающей всегда одинаковое время. Это также делает замену символов в строках UTF-32 очень простой. Напротив, кодировки с переменной длиной требуют последовательного доступа к символу ''n''-й позиции, что может быть очень затратной по времени операцией. Главный недостаток UTF-32 — это неэффективное использование пространства, так как для хранения любого символа используется четыре байта. Символы, лежащие за пределами нулевой (базовой) плоскости кодового пространства, редко используются в большинстве текстов. Поэтому удвоение, в сравнении с UTF-16, занимаемого строками в UTF-32 пространства, зачастую не оправдано.
+Although syntax rules may affect the order in which characters are allowed to appear, [[XML]] (including [[XHTML]]) documents, by definition,<ref>{{cite web|title=Extensible Markup Language (XML) 1.1 (Second Edition)|url=https://www.w3.org/TR/xml11|accessdate=2013-11-01}}</ref> comprise characters from most of the Unicode code points, with the exception of:
-==== Порядок байтов ====
-{{Основная статья|Порядок байтов}}
-В потоке данных UTF-16 младший байт может записываться либо перед старшим ({{lang-en|UTF-16 little-endian, UTF-16LE}}), либо после старшего ({{lang-en|UTF-16 big-endian, UTF-16BE}}). Аналогично существует два варианта четырёхбайтной кодировки — UTF-32LE и UTF-32BE.
+* most of the [[C0 and C1 control codes|C0 control codes]]
-=== Маркер последовательности байтов ===
+* the permanently unassigned code points D800–DFFF
-{{Основная статья|Маркер последовательности байтов}}
+* FFFE or FFFF
-Для указания на использование Юникода, в начале текстового файла или потока может передаваться [[Маркер последовательности байтов]] ({{lang-en|byte order mark (BOM)}}) — символ U+FEFF (неразрывный пробел нулевой ширины). По его виду можно легко различить как формат представления Юникода, так и последовательность байтов. Маркер последовательности байтов может принимать следующий вид:
-; UTF-8 : EF BB BF
-; UTF-16BE : FE FF
-; UTF-16LE : FF FE
-; UTF-32BE : 00 00 FE FF
-; UTF-32LE : FF FE 00 00
+HTML characters manifest either directly as [[byte]]s according to the document's encoding, if the encoding supports them, or users may write them as numeric character references based on the character's Unicode code point. For example, the references <code>&amp;#916;</code>, <code>&amp;#1049;</code>, <code>&amp;#1511;</code>, <code>&amp;#1605;</code>, <code>&amp;#3671;</code>, <code>&amp;#12354;</code>, <code>&amp;#21494;</code>, <code>&amp;#33865;</code>, and <code>&amp;#47568;</code> (or the same numeric values expressed in hexadecimal, with <code>&amp;#x</code> as the prefix) should display on all browsers as Δ, Й, ק, م, ๗, あ, 叶, 葉, and 말.
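The relationship between a character and its numeric character reference is simply the code point written in decimal or hexadecimal. A small Python sketch follows; the helper name `to_ncr` is hypothetical, while `html.unescape` is the standard-library decoder.

```python
import html

def to_ncr(ch: str, hexadecimal: bool = False) -> str:
    """Build an HTML numeric character reference for one character."""
    cp = ord(ch)
    return f"&#x{cp:X};" if hexadecimal else f"&#{cp};"

if __name__ == "__main__":
    for ch in "ΔЙあ말":
        ref = to_ncr(ch)
        # html.unescape resolves the reference back to the character.
        print(ref, to_ncr(ch, hexadecimal=True), html.unescape(ref) == ch)
```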
-=== Юникод и традиционные кодировки ===
-Внедрение Юникода привело к изменению подхода к традиционным 8-битным кодировкам. Если раньше такая кодировка всегда задавалась непосредственно, то теперь она может задаваться таблицей соответствия между данной кодировкой и Юникодом. Фактически почти все 8-битные кодировки теперь можно рассматривать как форму представления некоторого подмножества Юникода. И это намного упростило создание программ, которые должны работать с множеством разных кодировок: теперь, чтобы добавить поддержку ещё одной кодировки, надо всего лишь добавить ещё одну таблицу перекодировки символов в Юникод.
+When specifying [[Uniform Resource Identifier|URIs]], for example as [[URL]]s in [[HTTP]] requests, non-ASCII characters must be [[percent encoding|percent-encoded]].
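As a sketch of percent-encoding, the standard `urllib.parse` functions convert a non-ASCII string to its UTF-8 bytes and write each byte as `%XX`. The wrapper name `percent_encode` and the sample string are illustrative assumptions.

```python
from urllib.parse import quote, unquote

def percent_encode(text: str) -> str:
    """Percent-encode every byte of the UTF-8 form that is not unreserved."""
    return quote(text, safe="")

if __name__ == "__main__":
    encoded = percent_encode("Юникод")
    print(encoded)
    print(unquote(encoded))  # decoding restores the original string
```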
-Кроме того, многие форматы данных позволяют вставлять любые символы Юникода, даже если документ записан в старой 8-битной кодировке. Например, в HTML можно использовать [[Мнемоники в HTML|коды с амперсандом]].
-=== Реализации ===
+===Fonts===
+{{Main|Unicode font}}
-Большинство современных операционных систем в той или иной степени обеспечивает поддержку Юникода.
+Unicode is not in principle concerned with fonts ''per se'', seeing them as implementation choices.<ref>{{cite journal |url = http://cajun.cs.nott.ac.uk/wiley/journals/epobetan/pdf/volume6/issue3/bigelow.pdf  | title = The design of a Unicode font | journal = Electronic Publishing | volume = VOL. 6(3), 289–305 | date = September 1993 | page = 292 |last1 = Bigelow | first1=Charles | last2 = Holmes | first2 = Kris}}</ref> Any given character may have many [[allograph]]s, from the more common bold, italic and base letterforms to complex decorative styles. A font is "Unicode compliant" if the glyphs in the font can be accessed using code points defined in the Unicode standard.<ref>{{cite web | url= https://www.unicode.org/faq/font_keyboard.html | title = Fonts and keyboards | publisher = Unicode Consortium | date = 28 June 2017 | accessdate= 13 October 2019}}</ref> The standard does not specify a minimum number of characters that must be included in the font; some fonts have quite a small repertoire.
-В операционных системах семейства [[Windows NT]] для внутреннего представления имён файлов и других системных строк используется двухбайтовая кодировка UTF-16LE. Системные вызовы, принимающие строковые параметры, существуют в однобайтном и двухбайтном вариантах. Подробнее см. в статье [[Юникод в операционных системах семейства Microsoft Windows]].
+Free and retail [[font]]s based on Unicode are widely available, since [[TrueType]] and [[OpenType]] support Unicode. These font formats map Unicode code points to glyphs, but a TrueType font is restricted to 65,535 glyphs.
-[[UNIX]]-подобные операционные системы, в том числе [[GNU/Linux]], [[BSD]], [[OS X]], используют для представления Юникода кодировку UTF-8. Большинство программ может работать с UTF-8 как с традиционными однобайтными кодировками, не обращая внимания на то, что символ представляется как несколько последовательных байт. Для работы с отдельными символами строки обычно перекодируются в UCS-4, так что каждому символу соответствует [[машинное слово]].
+[[List of typefaces|Thousands of fonts]] exist on the market, but fewer than a dozen fonts—sometimes described as "pan-Unicode" fonts—attempt to support the majority of Unicode's character repertoire. Instead, Unicode-based [[List of Unicode fonts|fonts]] typically focus on supporting only basic ASCII and particular scripts or sets of characters or symbols. Several reasons justify this approach: applications and documents rarely need to render characters from more than one or two writing systems; fonts tend to demand resources in computing environments; and operating systems and applications show increasing intelligence in regard to obtaining glyph information from separate font files as needed, i.e., [[font substitution]]. Furthermore, designing a consistent set of rendering instructions for tens of thousands of glyphs constitutes a monumental task; such a venture passes the point of [[diminishing returns]] for most typefaces.
-Одной из первых успешных коммерческих реализаций Юникода стала среда программирования [[Java]]. В ней принципиально отказались от 8-битного представления символов в пользу 16-битного. Это решение увеличило расход памяти, но позволило вернуть в программирование важную абстракцию: произвольный одиночный символ (тип <code>char</code>). В частности, программист мог работать со строкой, как с простым массивом. К сожалению, успех не был окончательным, Юникод перерос ограничение в 16 бит и к версии J2SE 5.0 произвольный символ снова стал занимать переменное число единиц памяти — один <code>char</code> или два (см. [[UTF-16|суррогатная пара]]).
+===Newlines===
-Сейчас большинство языков программирования поддерживает строки Юникода, хотя их представление может различаться в зависимости от реализации.
+Unicode partially addresses the [[newline]] problem that occurs when trying to read a text file on different platforms. Unicode defines a large number of [[Newline#Unicode|characters]] that conforming applications should recognize as line terminators.
+In terms of the newline, Unicode introduced {{unichar|2028|LINE SEPARATOR}} and {{unichar|2029|PARAGRAPH SEPARATOR}}. This was an attempt to provide a Unicode solution to encoding paragraphs and lines semantically, potentially replacing all of the various platform solutions. In doing so, Unicode does provide a way around the historical platform-dependent solutions. Nonetheless, few if any Unicode solutions have adopted these Unicode line and paragraph separators as the sole canonical line-ending characters. A common approach to solving this issue is newline normalization, as implemented by the Cocoa text system in Mac OS X and by the W3C XML and HTML recommendations. In this approach, every possible newline character is converted internally to a common newline (which one does not really matter, since it is an internal operation just for rendering). In other words, the text system can correctly treat the character as a newline, regardless of the input's actual encoding.
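Newline normalization can be sketched in a few lines of Python. The set of terminators below (CR LF, LF, VT, FF, CR, NEL, U+2028, U+2029) and the choice of LF as the internal newline are assumptions made for illustration; a real text system may treat paragraph separators differently.

```python
import re

# CR LF is listed first so that the pair is replaced as a single terminator.
_NEWLINES = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

def normalize_newlines(text: str) -> str:
    """Convert every recognized line terminator to a single LF."""
    return _NEWLINES.sub("\n", text)

if __name__ == "__main__":
    mixed = "one\r\ntwo\u2028three\u2029four\rfive"
    print(normalize_newlines(mixed).split("\n"))
```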
-== Методы ввода ==
-Поскольку ни одна [[раскладка клавиатуры]] не может позволить вводить все символы Юникода одновременно, от [[операционная система|операционных систем]] и [[прикладное программное обеспечение|прикладных программ]] требуется поддержка альтернативных методов ввода произвольных символов Юникода.
+==Issues==
-=== [[Microsoft Windows]] ===
-{{main|Юникод в операционных системах семейства Microsoft Windows}}
-Хотя, начиная с [[Windows 2000]], служебная программа «Таблица символов» (charmap.exe) поддерживает символы Юникода и позволяет копировать их в [[буфер обмена]],  эта поддержка ограничена только базовой плоскостью (коды символов U+0000…U+FFFF). Символы с кодами от U+10000 «Таблица символов» не отображает.
+===Philosophical and completeness criticisms===
-Похожая таблица есть, например, в [[Microsoft Word]].
+[[Han unification]] (the identification of forms in the [[East Asian language]]s which one can treat as stylistic variations of the same historical character) has become one of the most controversial aspects of Unicode, despite the presence of a majority of experts from all three regions in the [[Ideographic Research Group]] (IRG), which advises the Consortium and ISO on additions to the repertoire and on Han unification.<ref>[http://tronweb.super-nova.co.jp/characcodehist.html A Brief History of Character Codes], Steven J. Searle, originally written [https://web.archive.org/web/20001216022100/http://tronweb.super-nova.co.jp/characcodehist.html 1999], last updated 2004</ref>
+Unicode has been criticized for failing to separately encode older and alternative forms of [[kanji]] which, critics argue, complicates the processing of ancient Japanese and uncommon Japanese names. This is often due to the fact that Unicode encodes characters rather than glyphs (the visual representations of the basic character that often vary from one language to another). Unification of glyphs leads to the perception that the languages themselves, not just the basic character representation, are being merged.<ref name="dw2001">[https://web.archive.org/web/20130625062705/http://www.ibm.com/developerworks/library/u-secret.html The secret life of Unicode: A peek at Unicode's soft underbelly], Suzanne Topping, 1 May 2001 ''(Internet Archive)''</ref>{{clarify|date=April 2010|reason="and, contains" and meaning of statement}} There have been several attempts to create alternative encodings that preserve the stylistic differences between Chinese, Japanese, and Korean characters in opposition to Unicode's policy of Han unification. An example of one is [[TRON (encoding)|TRON]] (although it is not widely adopted in Japan, there are some users who need to handle historical Japanese text and favor it).
-Иногда можно набрать [[Шестнадцатеричная система счисления|шестнадцатеричный]] код, нажать {{key|[[Alt (клавиша)|Alt]]|X}}, и код будет заменён на соответствующий символ, например, в [[WordPad]], Microsoft Word. В редакторах {{key|Alt|X}} выполняет и обратное преобразование.
+Although the repertoire of fewer than 21,000 Han characters in the earliest version of Unicode was largely limited to characters in common modern usage, Unicode now includes more than 92,000 Han characters, and work is continuing to add thousands more historic and dialectal characters used in China, Japan, Korea, Taiwan, and Vietnam.
-Во многих программах MS Windows, чтобы получить символ Unicode, нужно при нажатой клавише Alt набрать десятичное значение кода символа на цифровой клавиатуре. Например, полезными при наборе кириллических текстов будут комбинации Alt+0171 (<!-- защита от Викификатора --><nowiki>«</nowiki>), Alt+0187 (<nowiki>»</nowiki>) и Alt+0769 ([[знак ударения]]). Интересны также комбинации Alt+0133 (…) и Alt+0151 (—).
+Modern font technology provides a means to address the practical issue of needing to depict a unified Han character in terms of a collection of alternative glyph representations, in the form of [[variation Selectors|Unicode variation sequences]]. For example, the Advanced Typographic tables of [[OpenType]] permit one of a number of alternative glyph representations to be selected when performing the character to glyph mapping process. In this case, information can be provided within plain text to designate which alternate character form to select.
-=== [[Macintosh]] ===
-В [[Mac OS]] 8.5 и более поздних версиях поддерживается метод ввода, называемый «Unicode Hex Input». При зажатой клавише Option требуется набрать четырёхзначный шестнадцатеричный код требуемого символа. Этот метод позволяет вводить символы с кодами, большими U+FFFF, используя пары суррогатов; такие пары операционной системой будут автоматически заменены на одиночные символы. Этот метод ввода перед использованием нужно активизировать в соответствующем разделе системных настроек и затем выбрать как текущий метод ввода в меню клавиатуры.
+[[File:Cyrillic cursive.svg|thumb|right|Various [[Cyrillic]] characters shown with and without italics]]
-Начиная с [[Mac OS X]] 10.2, существует также приложение «Character Palette», позволяющее выбирать символы из таблицы, в которой можно выделять символы определённого блока или символы, поддерживаемые конкретным шрифтом.
+If the appropriate glyphs for two characters in the same script differ only in the italic, Unicode has generally unified them, as can be seen in the comparison between Russian (labeled standard) and Serbian characters at right, meaning that the differences are displayed through smart font technology or by manually changing fonts.
+===Mapping to legacy character sets===
-=== [[GNU/Linux]] ===
+Unicode was designed to provide code-point-by-code-point [[round-trip format conversion]] to and from any preexisting character encodings, so that text files in older character sets can be converted to Unicode and back again to yield the same file, without employing context-dependent interpretation. That has meant that inconsistent legacy architectures, such as [[combining character|combining diacritics]] and [[precomposed character]]s, both exist in Unicode, giving more than one method of representing some text. This is most pronounced in the three different encoding forms for Korean [[Hangul]]. Since version 3.0, precomposed characters that can be represented by a combining sequence of already existing characters can no longer be added to the standard, in order to preserve interoperability between software using different versions of Unicode.
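The coexistence of precomposed characters and combining sequences can be observed with Python's standard `unicodedata` module. In this sketch the helper name `same_text` and the sample strings are illustrative; normalization forms NFC and NFD are used to compare the two representations.

```python
import unicodedata

def same_text(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

if __name__ == "__main__":
    precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
    combining = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT
    print(precomposed == combining)           # different code point sequences
    print(same_text(precomposed, combining))  # canonically equivalent

    # A precomposed Hangul syllable decomposes into conjoining jamo.
    print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", "\ud55c")])
```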
-[[GNOME]] also has the «[[Таблица символов GNOME|Character Map]]» utility (formerly gucharmap), which displays the characters of a given block or writing system and supports searching by character name or description. When the code of the desired character is known, it can be entered according to the [[Международная организация по стандартизации|ISO]]{{nbsp}}14755 standard: while holding down {{key|Ctrl|Shift}}, type the hexadecimal code (starting with some version of GTK+, the code must be preceded by pressing the ''«U»'' key). The hexadecimal code entered can be up to {{num|32|бит}} long, so any Unicode character can be entered without using surrogate pairs.
+[[Injective]] mappings must be provided between characters in existing legacy character sets and characters in Unicode to facilitate conversion to Unicode and allow interoperability with legacy software. Lack of consistency in various mappings between earlier Japanese encodings such as [[Shift-JIS]] or [[EUC-JP]] and Unicode led to [[round-trip format conversion]] mismatches, particularly the mapping of the character JIS X 0208 '～' (1-33, WAVE DASH), heavily used in legacy database data, to either {{unichar|FF5E|FULLWIDTH TILDE}} (in [[Microsoft Windows]]) or {{unichar|301C|WAVE DASH}} (other vendors).<ref>
-All [[X Window System|X{{nbsp}}Window]] applications, including GNOME and [[KDE]], support input using the {{Key|[[Compose]]}} key. On keyboards without a dedicated [[Compose]] key, any key can be assigned to this purpose, for example {{Key|[[Caps Lock]]}}.
+[http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2166.doc AFII contribution about WAVE DASH], {{Cite web|url=http://www.ingrid.org/java/i18n/unicode.html|archiveurl=https://web.archive.org/web/20110422181018/http://www.ingrid.org/java/i18n/unicode.html|title=An Unicode vendor-specific character table for japanese|date=2011-04-22|archive-date=2011-04-22|website=web.archive.org<!--|access-date=2019-05-20-->}}</ref>
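The WAVE DASH mismatch described above can be observed with Python's bundled codecs. This is a minimal sketch, assuming that the standard-library `shift_jis` codec follows the JIS-style mapping and that `cp932` follows the Microsoft mapping; the constant and the `round_trips` helper are illustrative names, not part of any standard.

```python
# JIS X 0208 row 1, cell 33 (WAVE DASH) is the byte pair 0x81 0x60 in Shift-JIS.
SJIS_WAVE_DASH = b"\x81\x60"

# Decode the same two bytes with two different vendor mapping tables.
via_jis = SJIS_WAVE_DASH.decode("shift_jis")  # assumed JIS-style table
via_ms = SJIS_WAVE_DASH.decode("cp932")       # assumed Microsoft-style table


def round_trips(data: bytes, decode_with: str, encode_with: str) -> bool:
    """Return True if decoding and then re-encoding reproduces the bytes."""
    try:
        return data.decode(decode_with).encode(encode_with) == data
    except UnicodeError:
        # The target table has no entry for the decoded code point.
        return False


print(f"shift_jis -> U+{ord(via_jis):04X}")
print(f"cp932     -> U+{ord(via_ms):04X}")
print("same table round trip: ", round_trips(SJIS_WAVE_DASH, "cp932", "cp932"))
print("mixed table round trip:", round_trips(SJIS_WAVE_DASH, "cp932", "shift_jis"))
```

Mixing the two tables is what breaks the round trip: data decoded by one vendor's table may have no inverse in the other's.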
+Some Japanese computer programmers objected to Unicode because it requires them to separate the use of {{unichar|005C|REVERSE SOLIDUS|note=backslash}} and {{unichar|00A5|YEN SIGN}}, which was mapped to 0x5C in JIS X 0201, and a lot of legacy code exists with this usage.<ref>[https://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html#s-646problem ''ISO 646-* Problem''], Section 4.4.3.5 of ''Introduction to I18n'', Tomohiro KUBOTA, 2001</ref> (This encoding also replaces tilde '~' 0x7E with macron '¯', now 0xAF.) The separation of these characters exists in [[ISO 8859-1]], from long before Unicode.
-The GNU/Linux console also allows a Unicode character to be entered by its code: type the decimal code of the character on the numeric keypad while holding down {{Key|[[Alt (клавиша)|Alt]]}}. Characters can also be entered by their hexadecimal code: hold down {{key|AltGr}} and, for the digits A–F, use the numeric keypad keys from {{Key|NumLock}} to {{Key|Enter}} (clockwise). Input according to ISO{{nbsp}}14755 is also supported. For these methods to work, Unicode mode must be enabled in the console by calling <code>unicode_start</code>(1), and a suitable font must be selected by calling <code>setfont</code>(8).
+===Indic scripts===
-[[Mozilla Firefox]] for Linux supports character input according to ISO{{nbsp}}14755.
+[[Indic script]]s such as [[Tamil script|Tamil]] and [[Devanagari]] are each allocated only 128 code points, matching the [[ISCII]] standard. The correct rendering of Unicode Indic text requires transforming the stored logical order characters into visual order and the forming of ligatures (aka conjuncts) out of components. Some local scholars argued in favor of assignments of Unicode code points to these ligatures, going against the practice for other writing systems, though Unicode contains some Arabic and other ligatures for backward compatibility purposes only.<ref>{{cite web
+| title = Arabic Presentation Forms-A
+| url = https://www.unicode.org/charts/PDF/UFB50.pdf
+| accessdate = 2010-03-20}}
+</ref><ref>{{cite web
+| title = Arabic Presentation Forms-B
+| url = https://www.unicode.org/charts/PDF/UFE70.pdf
+| accessdate = 2010-03-20}}</ref><ref>{{cite web
+| title = Alphabetic Presentation Forms
+| url = https://www.unicode.org/charts/PDF/UFB00.pdf
+| accessdate = 2010-03-20}}</ref> Encoding of any new ligatures in Unicode will not happen, in part because the set of ligatures is font-dependent, and Unicode is an encoding independent of font variations. The same kind of issue arose for the [[Tibetan script]] in 2003 when the [[Standardization Administration of China]] proposed encoding 956 precomposed Tibetan syllables,<ref>{{Cite web | author=China | title=Proposal on Tibetan BrdaRten Characters Encoding for ISO/IEC 10646 in BMP | url=https://www.unicode.org/L2/L2002/02455-n2558-tibetan.pdf | date=2 December 2002 }}</ref> but these were rejected for encoding by the relevant ISO committee ([[ISO/IEC JTC 1/SC 2]]).<ref>{{Cite web | author= V. S. Umamaheswaran | title=Resolutions of WG 2 meeting 44 | url=https://www.unicode.org/L2/L2003/03390r-n2654.pdf | at=Resolution M44.20 | date=7 November 2003 }}</ref>
+[[Thai alphabet]] support has been criticized for its ordering of Thai characters. The vowels เ, แ, โ, ใ, ไ that are written to the left of the preceding consonant are in visual order instead of phonetic order, unlike the Unicode representations of other Indic scripts. This complication is due to Unicode inheriting the [[TIS-620|Thai Industrial Standard 620]], which worked in the same way, and was the way in which Thai had always been written on keyboards. This ordering problem complicates the Unicode collation process slightly, requiring table lookups to reorder Thai characters for collation.<ref name="dw2001" /> Even if Unicode had adopted encoding according to spoken order, it would still be problematic to collate words in dictionary order. E.g., the word {{wiktth|แสดง}} {{IPA-th|sa dɛːŋ|}} "perform" starts with a consonant cluster "สด" (with an inherent vowel for the consonant "ส"), the vowel แ-, in spoken order would come after the ด, but in a dictionary, the word is collated as it is written, with the vowel following the ส.
-== Unicode issues ==
-In Unicode, the English «a» and the Polish «a» are the same character. Likewise, the Russian «а» and the Serbian «а» are treated as one and the same character (but distinct from the Latin «a»). This encoding principle is not universal; apparently, a solution that fits every case cannot exist at all.
-* Texts in [[китайский язык|Chinese]], [[корейский язык|Korean]] and [[японский язык|Japanese]] are traditionally written top to bottom, starting from the upper right corner. Unicode does not provide for switching between horizontal and vertical writing for these languages; this must be done by means of [[язык разметки|markup languages]] or by the internal mechanisms of [[текстовый процессор|word processors]].
-* The presence or absence in Unicode of different glyph shapes for the same character depending on the language. Care must be taken that text is always correctly marked as belonging to one language or another.
-*: For example, [[китайское письмо|Chinese characters]] can have different shapes in Chinese, Japanese ([[кандзи|kanji]]) and Korean ([[ханча|hanja]]), yet are represented in Unicode by the same character (so-called CJK unification), although simplified and traditional characters do have different codes.
-*: Similarly, the [[русский язык|Russian]] and [[сербский язык|Serbian]] <!-- защита от Викификатора --><nowiki>languages</nowiki> use different shapes for the italic letters ''п'' and ''т'' (in Serbian they look like <span style="text-decoration: overline; font-style: italic">и</span> and <span style="text-decoration: overline; font-style: italic">ш</span>, see [[сербский курсив|Serbian italics]]).
-* Conversion from lowercase to uppercase also depends on the language. For example, [[турецкий язык|Turkish]] has the letters [[i без точки|İi and Iı]], so Turkish case-conversion rules conflict with the [[английский язык|English]] ones, which require «i» to be converted to «I». Similar problems exist in other languages; for example, in Canadian French case conversion works slightly differently than in France<ref>[http://www.transl-gunsmoker.ru/2008/11/unicode.html Регистр в Unicode — это непросто]</ref>.
-* Even [[арабские цифры|Arabic numerals]] have certain typographic subtleties: digits can be «uppercase» and «[[минускульные цифры|lowercase]]», proportional and [[моноширинный шрифт|monospaced]]<ref>Most PC fonts implement «uppercase» (majuscule) monospaced digits.</ref>; Unicode makes no distinction between them. Such nuances are left to the software.
+===Combining characters===
-Some shortcomings are related not to Unicode itself but to the capabilities of text processors.
+{{Main|Combining character}}
-* Files of non-Latin text in Unicode always take up more space, because each character is encoded not by a single byte, as in the various national encodings, but by a sequence of bytes (the exceptions are UTF-8 for languages whose alphabet fits within ASCII, and text containing characters of two or more languages whose alphabets do ''not'' fit within ASCII<ref>In some cases a document (as opposed to plain text) in Unicode can take up considerably less space than a document in a single-byte encoding. For example, if a web page contains roughly equal amounts of Russian and Greek text, then in a single-byte encoding either the Russian or the Greek letters must be written using the document format's facilities, as ampersand codes that take 6–7 bytes per character (when decimal codes are used), that is, 3.5–4 bytes per letter on average, whereas UTF-8 takes only 2 bytes per Greek or Russian letter.</ref>). A font file needed to display all the characters of the Unicode table takes up a relatively large amount of memory and requires more computing resources than a font for the user's single national language<ref>One of the Arial Unicode font files is 24 megabytes in size; there is a Times New Roman of 120 megabytes that contains close to 65,536 characters.</ref>. As computer systems grow more powerful and memory and disk space become cheaper, this problem becomes less and less significant; nevertheless, it remains relevant for portable devices such as mobile phones.
+{{See also|Unicode normalization#Normalization}}
-* Although Unicode support is implemented in the most widely used operating systems, not all application software yet works correctly with it. In particular, byte order marks ([[Byte order mark|BOM]]) are not always handled and diacritical marks are poorly supported. The problem is temporary and is a consequence of the relative novelty of the Unicode standards (compared with single-byte national encodings).
-* The performance of all string-processing programs (including sorting in databases) decreases when Unicode is used instead of single-byte encodings.
+Characters with diacritical marks can generally be represented either as a single precomposed character or as a decomposed sequence of a base letter plus one or more non-spacing marks. For example, ḗ (precomposed e with macron and acute above) and e&#772;&#769; (e followed by the combining macron above and combining acute above) should be rendered identically, both appearing as an [[e]] with a [[Macron (diacritic)|macron]] and [[acute accent]], but in practice, their appearance may vary depending upon what rendering engine and fonts are being used to display the characters. Similarly, [[dot (diacritic)|underdots]], as needed in the [[romanization]] of [[Indo-Aryan languages|Indic]], will often be placed incorrectly.{{Citation needed|date=July 2011}} Unicode characters that map to precomposed glyphs can be used in many cases, thus avoiding the problem, but where no precomposed character has been encoded the problem can often be solved by using a specialist Unicode font such as [[Charis SIL]] that uses [[Graphite (SIL)|Graphite]], [[OpenType]], or [[Apple Advanced Typography|AAT]] technologies for advanced rendering features.
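The equivalence of the precomposed and decomposed spellings can be checked programmatically. The following is a minimal sketch using Python's standard `unicodedata` module; the variable names and the `same_text` helper are illustrative only, and normalization concerns the stored code points, not how a font renders them.

```python
import unicodedata

# U+1E17 LATIN SMALL LETTER E WITH MACRON AND ACUTE (one code point).
precomposed = "\u1e17"
# Base letter e + U+0304 COMBINING MACRON + U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0304\u0301"


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)           # raw code point comparison
print(same_text(precomposed, decomposed))  # comparison after normalization
print(len(precomposed), len(decomposed))   # number of code points in each form
```

Comparing raw strings treats the two spellings as different text, which is why software that searches or sorts text typically normalizes first.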
-Some rare writing systems are still not properly represented in Unicode. The rendering of «long» superscript marks that extend over several letters, as for example in [[церковнославянский язык|Church Slavonic]], has not yet been implemented.
+===Anomalies===
-== «Юникод» or «Уникод»? ==
+{{main|Unicode alias names and abbreviations}}
-«Unicode» is both a proper name (or part of a name, as in Unicode Consortium) and a common noun originating from English.
+The Unicode standard has imposed rules intended to guarantee stability.<ref>[https://www.unicode.org/policies/stability_policy.html Unicode stability policy]</ref> Depending on the strictness of a rule, a change can be prohibited or allowed. For example, a "name" given to a code point cannot and will not change. But a "script" property is more flexible, by Unicode's own rules. In version 2.0, Unicode changed many code point "names" from version 1. At the same time, Unicode stated that from then on, the name assigned to a code point would never change again. This implies that when mistakes are published, these mistakes cannot be corrected, even if they are trivial (as happened in one instance with the spelling {{sc2|{{typo|BRAKCET}}}} for {{sc2|BRACKET}} in a character name). In 2006 a list of anomalies in character names was first published, and, as of April 2017, there were 94 characters with identified issues,<ref name="tn17">{{cite web |url=https://unicode.org/notes/tn27/ |title=Unicode Technical Note #27: Known Anomalies in Unicode Character Names |date=10 April 2017 |website=unicode.org}}</ref> for example:
+* {{unichar|2118|script capital p|nlink=Weierstrass p}}: This is a small letter. The capital is {{unichar|1D4AB|MATHEMATICAL SCRIPT CAPITAL P}}<ref>[https://www.unicode.org/charts/PDF/U2100.pdf Unicode chart: "actually this has the form of a lowercase calligraphic p, despite its name"]</ref>
-At first glance, the spelling «Уникод» seems preferable. [[Русский язык|Russian]] already has the [[Морфема|morphemes]] «уни-» (words with the Latin element «uni-» have traditionally been translated and written with «уни-»: универсальный, униполярный, унификация, униформа) and «код». On the other hand, trademarks borrowed from [[Английский язык|English]] are usually rendered by practical transcription, in which the de-etymologized letter combination «uni-» is written as «юни-» («[[Юнилевер]]», «[[UNIX|Юникс]]», etc.), that is, exactly as with letter-by-letter abbreviations such as [[UNICEF]] «United Nations International Children’s Emergency Fund», rendered as [[ЮНИСЕФ]].
+* {{unichar|034F|COMBINING GRAPHEME JOINER|nlink=Combining grapheme joiner}}: Does not join graphemes.<ref name="tn17" />
+* {{unichar|A015|YI SYLLABLE WU|nlink=Yi language}}: This is not a Yi syllable, but a Yi iteration mark.
+* {{unichar|FE18|PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR {{typo|BRAKCET}}}}: ''bracket'' is spelled incorrectly.<ref>[https://www.unicode.org/charts/PDF/UFE10.pdf "Misspelling of BRACKET in character name is a known defect"]</ref>
+Spelling errors are resolved by using [[Unicode alias names and abbreviations]].
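The frozen misspelling and its alias can both be inspected from Python. This is a minimal sketch, assuming a Python version (3.3 or later) whose `unicodedata.lookup` accepts name aliases; the variable names are illustrative.

```python
import unicodedata

ch = "\ufe18"

# The formal character name is immutable, so it keeps the published typo.
formal_name = unicodedata.name(ch)

# The corrected spelling is available only as an alias; lookup() resolves
# both the formal name and the alias to the same code point.
corrected_spelling = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
via_alias = unicodedata.lookup(corrected_spelling)
via_formal = unicodedata.lookup(formal_name)

print(formal_name)
print(f"alias  -> U+{ord(via_alias):04X}")
print(f"formal -> U+{ord(via_formal):04X}")
```

Software that needs a stable identifier should rely on the code point rather than on either spelling of the name.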
-The spelling «Юникод» is already firmly established in Russian-language texts. [[Википедия|Wikipedia]] uses the more common variant. [[MS Windows]] uses the variant «Юникод».
+==See also==
-The Consortium's website has a dedicated page that discusses the rendering of the word «Unicode» in various languages and writing systems. For Russian Cyrillic, the variant «Юникод» is given<ref name=autogenerated1 />.
+* [[Comparison of Unicode encodings]]
+* [[Cultural, political, and religious symbols in Unicode]]
+* [[International Components for Unicode]] (ICU), now as ICU-<abbr title="technical committee">TC</abbr> a part of Unicode
+* [[List of binary codes]]
+* [[List of Unicode characters]]
+* [[List of XML and HTML character entity references]]
+* [[Open-source Unicode typefaces]]
+* [[Standards related to Unicode]]
+* [[Unicode symbols]]
+* [[Universal Coded Character Set]]
+* [[Lotus Multi-Byte Character Set]] (LMBCS), a parallel development with similar intentions
+==Further reading==
-The forms adopted by foreign organizations for rendering the word «Unicode» in Russian are advisory.
+{{refbegin}}
+* ''The Unicode Standard, Version 3.0'', The Unicode Consortium, Addison-Wesley Longman, Inc., April 2000. {{ISBN|0-201-61633-5}}
+* ''The Unicode Standard, Version 4.0'', The Unicode Consortium, Addison-Wesley Professional, 27 August 2003. {{ISBN|0-321-18578-1}}
+* ''The Unicode Standard, Version 5.0, Fifth Edition'', The [[Unicode Consortium]], Addison-Wesley Professional, 27 October 2006. {{ISBN|0-321-48091-0}}
+* Julie D. Allen. ''The Unicode Standard, Version 6.0'', The [[Unicode Consortium]], Mountain View, 2011, {{ISBN|9781936213016}}, ([https://www.unicode.org/versions/Unicode6.0.0/]).
+* ''The Complete Manual of Typography'', James Felici, Adobe Press; 1st edition, 2002. {{ISBN|0-321-12730-7}}
+* ''Unicode: A Primer'', Tony Graham, M&amp;T books, 2000. {{ISBN|0-7645-4625-2}}.
+* ''Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard'', Richard Gillam, Addison-Wesley Professional; 1st edition, 2002. {{ISBN|0-201-70052-2}}
+* ''Unicode Explained'', Jukka K. Korpela, O'Reilly; 1st edition, 2006. {{ISBN|0-596-10121-X}}
+{{refend}}
+*{{cite book |author1=Yannis Haralambous |author2=Martin Dürst |editor1-last=Haralambous |editor1-first=Yannis |title=Proceedings of Graphemics in the 21st Century, Brest 2018 |date=2019 |publisher=Fluxus Editions |location=Brest |isbn=978-2-9570549-1-6 |pages=167-183 |url=http://www.fluxus-editions.fr/gla1-hara1.php |ref=https://doi.org/10.36824/2018-graf-hara1 |chapter=Unicode from a Linguistic Point of View}}
-== See also ==
-* [[Символы, представленные в Юникоде]]
-* [[ASCII]]
-* [[ISO 8859-1]]
-* [[UTF-8]]
-* [[UTF-16]]
-* [[UTF-32]]
-* [[Кириллица в Юникоде]]
-* [[Дроби в Юникоде]]
-* [[XeTeX]]
-* [[Свободные универсальные шрифты]]
-* [[Windows Glyph List 4]]
-* [[Широкий символ]]
-* The [[GLib]] library contains a wide range of functions for working with Unicode characters and strings
+==Notes==
-* [[Проект:Внесение символов алфавитов народов России в Юникод]]
+{{notelist|group=note}}
+==References==
-== Notes ==
+{{reflist|30em}}
-{{примечания|2}}
+==External links==
-== External links ==
+{{Sister project links|n=no|v=no|q=no|s=no|voy=no|m=Unicode|mw=no|species=no}}
-* [http://www.unicode.org/ Official website of the Unicode Consortium]{{ref-en}}
+* {{official website|name=Official website}} {{middot}} {{official website|url=https://unicode.org/main.html|name=Official technical site}}
-* {{dmoz|Computers/Software/Globalization/Character_Encoding/Unicode/|Unicode}}{{ref-en}}
+* {{DMOZ|Computers/Software/Globalization/Character_Encoding/Unicode/}}
-* The article «[http://www.unicode.org/standard/translations/russian.html Что такое Unicode?]»{{ref-ru}} on the Consortium's official website
+* [http://www.alanwood.net/unicode/ Alan Wood's Unicode Resources]{{snd}} Contains lists of word processors with Unicode capability; fonts and characters are grouped by type; characters are presented in lists, not grids.
-* [http://www.unicode.org/versions/latest/ Latest version of the Unicode Standard]{{ref-en}}
+* [https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=UnicodeBMPFallbackFont Unicode BMP Fallback Font] Displays the Unicode value of any character in a document, including in the Private Use Area, rather than the glyph itself.
-* For the latest version of ISO/IEC 10646, see the [http://standards.iso.org/ittf/PubliclyAvailableStandards/ list of publicly available standards]{{ref-en}}. Documents corresponding to Unicode 7.0: [http://standards.iso.org/ittf/PubliclyAvailableStandards/c056921_ISO_IEC_10646_2012.zip ISO/IEC 10646] (ZIP file){{ref-en}}, [http://standards.iso.org/ittf/PubliclyAvailableStandards/c061712_ISO_IEC_10646_2012_Amd_1_2013.zip Amendments 1] (ZIP file){{ref-en}}, Amendments 2 (not yet available as of 2014-08-06)
+{{Unicode navigation|state=uncollapsed}}
-* [http://unicode-table.com/ Unicode character table with names and descriptions]{{ref-ru}}{{ref-en}}{{ref-de}}
+{{Character encoding}}
-* [http://www.unicode.org/versions/Unicode5.0.0/appC.pdf Relationship between Unicode 5.0.0 and ISO/IEC 10646] (PDF file){{ref-en}}
-* [http://www.cl.cam.ac.uk/~mgk25/unicode.html UTF-8 and Unicode FAQ]{{ref-en}}
-* [[Кириллица в Юникоде]]: http://www.unicode.org/charts/PDF/U0400.pdf, http://www.unicode.org/charts/PDF/U0500.pdf, http://www.unicode.org/charts/PDF/U2DE0.pdf, http://www.unicode.org/charts/PDF/UA640.pdf{{ref-en}}{{Недоступная ссылка|date=Январь 2020 |bot=InternetArchiveBot }}
-* [http://www.i18nguy.com/surrogates.html Enabling support for supplementary Unicode characters in Windows]{{ref-en}}
-* [http://www.fileformat.info/info/unicode/char/search.htm Unicode character search]{{ref-en}}
+{{Authority control}}
-{{Стандарты ISO}}
-{{Шрифтовой дизайн}}
+[[Category:Unicode| ]]
-[[Категория:Юникод| ]]
+[[Category:Character encoding]]
-[[Категория:Стандарты Интернета]]
+[[Category:Digital typography]]
-[[Категория:Стандарты ISO]]
