Talk:List of Unicode characters

From Wikipedia, the free encyclopedia

Why is U+00A0 not in the control character section?

Its function is that of a control character, no? — Preceding unsigned comment added by 76.81.249.42 (talk) 01:52, 9 October 2019 (UTC)

U+00A0 has a general category of Zs (Separator, space), not Cc (Other, control) per UnicodeData.txt. BTW: I've removed U+0020 from the control character section's table because it too has a Unicode general category of Zs and the text before the table correctly states there are "65 characters, including DEL but not SP". DRMcCreedy (talk) 04:13, 9 October 2019 (UTC)
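
The general categories mentioned above can be checked locally. The following is a minimal sketch, assuming Python 3 and its standard `unicodedata` module (which reflects whatever Unicode version ships with the interpreter); the helper name `general_category` is only illustrative.

```python
import unicodedata


def general_category(ch: str) -> str:
    """Return the two-letter Unicode general category of one character."""
    return unicodedata.category(ch)


# Count every code point whose general category is Cc (Other, control).
cc_count = sum(
    1 for cp in range(0x110000) if unicodedata.category(chr(cp)) == "Cc"
)

print(general_category("\u00a0"))  # NO-BREAK SPACE
print(general_category("\u0020"))  # SPACE
print(general_category("\u007f"))  # DELETE
print(cc_count)                    # number of Cc code points
```

The Cc count covers U+0000 to U+001F, U+007F (DEL) and U+0080 to U+009F, which is where the figure of 65 comes from.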

Octal Entity Reference Code

Octal codes are very useful and are still needed in some programs, for example in bash/shell programming, escape sequences, JavaScript, Perl, PostScript, etc. Various low-level OS libraries and programs still use octal, and it especially needs to be shown for the Control Characters, Basic Latin, and similar Unicode character ranges.
For more octal charts and codes, see https://utf8-chartable.de/unicode-utf8-table.pl?utf8=oct
More info: https://en.wikipedia.org/wiki/UTF-8#Examples
The wiki page on Octal needs to be updated with more detail on how octal numbers are actually used in different types of computer programs. A literal conversion from hex/decimal to octal is not enough for all cases. One sentence there, the one with "\3nn", does mention UTF-8-based octal usage, but it needs elaboration. In a shell terminal, 3-digit octal codes can be used. For example, to show the ÷ (U+00F7) and € (U+20AC) signs, use this code:
printf "Not-Bold. \303\267 . \342\202\254 (1) \xE2\x82\xAC (2) \x20AC (3) \u20AC (4) \U000020AC (5). \033[1mBold\033[0m.\n";
or this code:
echo $'Not-Bold. \303\267 . \342\202\254 (1) \xE2\x82\xAC (2) \x20AC (3) \u20AC (4) \U000020AC (5). \033[1mBold\033[0m.';
In the old bash v3.2.57 shell of macOS Catalina (10.15.x), both are displayed as: "Not-Bold. ÷ . € (1) € (2)  AC (3) \u20AC (4) \U000020AC (5). Bold." That shell does not support formats (4) and (5). Format (3) does not produce € in any bash version, because \x takes at most two hex digits, so \x20AC is a space followed by "AC". € = U+20AC = decimal code point 8364 = octal code point 20254 = UTF-8 octal \342\202\254 = UTF-8 hex \xE2\x82\xAC.
To convert a symbol/character into octal, you may do this:
printf 👍 | od -t o1
0000000 360 237 221 215 <-- Octal Unicode code-point 372115 (U+1F44D)
          ^  ^^  ^^  ^^
--atErik1 (talk) 13:43, 5 September 2020 (UTC)
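
The same conversions can be reproduced outside the shell. This is a minimal sketch, assuming Python 3; the function names `utf8_octal_escapes` and `code_point_octal` are hypothetical helpers introduced only for illustration.

```python
def utf8_octal_escapes(ch: str) -> str:
    """Return the UTF-8 bytes of ch as 3-digit backslash octal escapes."""
    return "".join(f"\\{byte:03o}" for byte in ch.encode("utf-8"))


def code_point_octal(ch: str) -> str:
    """Return the code point of a single character written in octal."""
    return format(ord(ch), "o")


# DIVISION SIGN, EURO SIGN, THUMBS UP SIGN
for ch in ("\u00f7", "\u20ac", "\U0001F44D"):
    print(f"U+{ord(ch):04X}", code_point_octal(ch), utf8_octal_escapes(ch))
```

The octal code point and the UTF-8 octal escapes are different numbers: the first is the code point itself in base 8, the second is the base-8 value of each encoded byte, which is what `printf` and `od -t o1` work with.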

The mysterious # column

Hi, most of the tables from Basic_Latin through Cyrillic have a rightmost column headed #. What is the significance? Without an explanation the naive reader is left to guess. =8~/ Thx, ... PeterEasthope (talk) 02:59, 18 November 2022 (UTC)

It's the decimal value for the hexadecimal Unicode code point. I agree it should definitely be labeled better. DRMcCreedy (talk) 03:26, 18 November 2022 (UTC)
No, it isn't. The numbers start with "001" at the space, and increment through Latin Extended-A. Then select characters in Latin Extended-B and Additional, IPA Extensions, Spacing Modifier Letters, then take up again in Greek and Coptic and Cyrillic. I have shepherded a script through the Unicode / ISO 10646 process, and I am confident I've never seen those values before. VanIsaac, GHTV contWpWS 04:47, 18 November 2022 (UTC)
Sorry, I was looking at the wrong column. My best guess is it's some enumeration of the characters in WGL-4, MES-1 and MES-2. Maybe just MES-2 since the article says MES-2 contains all the characters in WGL-4 and MES-1. The WGL-4, MES-1 and MES-2 table splits the Unicode code point up by "row" and "cells" but you can see it going from U+0020–7E, 00A0–FF, 0100–017F, 018F, 0192, 01B7, etc, which matches the # column. No idea why this was added to the List of Unicode characters article. Although the lede says "This article includes the 1062 characters in the Multilingual European Character Set 2 (MES-2) subset, and some additional related characters." DRMcCreedy (talk) 08:24, 18 November 2022 (UTC)
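
The guess above can be sketched as a running count over the quoted ranges. This is only an illustration, assuming Python 3 and assuming the column is a plain 1-based enumeration; it covers just the six ranges quoted in the comment (the "etc" is left out), and `RANGES` and `running_index` are hypothetical names.

```python
# Ranges quoted in the comment above, as inclusive (first, last) code points.
RANGES = [
    (0x0020, 0x007E),
    (0x00A0, 0x00FF),
    (0x0100, 0x017F),
    (0x018F, 0x018F),
    (0x0192, 0x0192),
    (0x01B7, 0x01B7),
]


def running_index(code_point, ranges=RANGES):
    """Return the 1-based position of code_point within ranges, or None."""
    count = 0  # characters counted in the ranges before the current one
    for first, last in ranges:
        if first <= code_point <= last:
            return count + (code_point - first) + 1
        count += last - first + 1
    return None  # code point is not in any listed range


for cp in (0x0020, 0x00A0, 0x0100, 0x018F, 0x01B7):
    print(f"U+{cp:04X}", running_index(cp))
```

Under that assumption the space gets 1 and the count simply skips code points that fall outside the listed ranges, such as U+007F.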
I noticed that the change was made by @Wbm1058:. Perhaps it would be best to ask him about the rationale behind it? Smbat.petrosyan (talk) 14:01, 11 March 2025 (UTC)
Been a long time since I spent any significant time working on this page. Note that I expanded the lead section on 15 August 2016 to explain this, and apparently since then, someone decided that this was too much information, and shortened the lead to remove my more detailed explanation. Perhaps this longer explanation can be put back. The column was just my way of counting the MES-2 characters to make sure that they were all accounted for in this list. I guess I got up to 0926 before I ran out of steam and moved on to work on other things. 0927–1062 would still be in the bottom tables which haven't been converted to lists which include a Description column yet. Note the column heading MES-2 Rationale starting at List of Unicode characters#Latin Extended-B where MES-2 starts being selective, and doesn't include everything. – wbm1058 (talk) 14:58, 11 March 2025 (UTC)
This 29 December 2022 edit was a misguided move of my text as a "self-reference in the opening to a proper hatnote." – wbm1058 (talk) 15:10, 11 March 2025 (UTC)
And then this 10 September 2023 edit removed the misguided hatnote. – wbm1058 (talk) 15:18, 11 March 2025 (UTC)

The real problem is the rejected/boxed ones.

They are just boxes! No significance. 2804:663C:2D07:97C0:B103:6474:A7EA:4A7F (talk) 20:40, 7 April 2025 (UTC)

Many Unicode characters will no doubt show as boxes unless you have supporting fonts installed on your device. See Help:Multilingual support for more information. DRMcCreedy (talk) 00:00, 8 April 2025 (UTC)