Extracting UCS-4 from UTF-8: from encoding to code point

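I needed a routine that extracts UCS-4 from UTF-8, so I wrote one. It decodes a single character from the start of the buffer, returns its code point, and stores the position just past the consumed bytes in oEnd. Here it is.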

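#include <cassert>
#include <cstddef>
#include <cstdint>

// Decodes one UTF-8 sequence at the start of iString (at most iLength bytes).
// Returns the code point and sets *oEnd just past the bytes consumed.
// On error, returns 0 and sets *oEnd to iString.
std::uint32_t toUcs4(const char* iString, const std::size_t iLength, const char** oEnd)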
{
 assert(iString);
 assert(oEnd);
 
 const char* aCur = iString;
 const char* aEnd = aCur + ((iLength > 4) ? 4 : iLength);
 
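 if (aCur == aEnd) {
  // Empty input. Return directly: a goto from here would jump past the
  // initialized declarations below, which C++ does not allow.
  *oEnd = iString;
  return 0;
 }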
 
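 std::uint32_t aUcs4 = 0;
 
 const std::uint8_t a1stByte = *aCur++;
 // aBits:  lead byte, shifted left once per length bit examined.
 // aMask:  selects the payload bits of the lead byte.
 // aShift: number of payload bits contributed by continuation bytes.
 auto aBits = a1stByte;
 auto aMask = 0x7fu, aShift = 0x0u;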
 if ((aBits & 0x80) == 0x80) {
  aBits <<= 1;
  aMask >>= 1;
  if ((aBits & 0x80) != 0x80) {
   goto onError_;
  }
  do {
   if (aCur == aEnd) {
    goto onError_;
   }
   const std::uint8_t aByte = *aCur++;
   if ((aByte & 0xC0) != 0x80) {
    goto onError_;
   }
   
    // Append the low 6 payload bits of the continuation byte.
    aUcs4 <<= 6;
    aUcs4 |= aByte & 0x3F;
   
   aBits <<= 1;
   aMask >>= 1;
   aShift += 6;
  } while ((aBits & 0x80) == 0x80);
 }
 aUcs4 |= (a1stByte & aMask) << aShift;
 
 *oEnd = aCur;
 return aUcs4;
 
onError_:
 *oEnd = iString;
 return '\0';
}
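
One caveat: the decoder above accepts some byte sequences that RFC 3629 treats as ill-formed, namely overlong forms such as C0 80, encoded surrogates, and values above U+10FFFF. If that matters, the caller could check the result afterwards. The following is only a rough sketch: isWellFormedUtf8Result is a hypothetical name that is not part of the code above, and it assumes the caller passes the number of bytes consumed (*oEnd - iString).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (an assumption, not part of the original routine):
// returns true when `cp`, decoded from `byteCount` UTF-8 bytes, is a
// well-formed result under RFC 3629.
inline bool isWellFormedUtf8Result(std::uint32_t cp, std::size_t byteCount)
{
    // Smallest code point that legitimately needs 1, 2, 3, or 4 bytes.
    static const std::uint32_t kMin[] = {0x0u, 0x80u, 0x800u, 0x10000u};

    if (byteCount < 1 || byteCount > 4) {
        return false;               // not a possible UTF-8 length
    }
    if (cp < kMin[byteCount - 1]) {
        return false;               // overlong encoding, e.g. C0 80
    }
    if (cp >= 0xD800u && cp <= 0xDFFFu) {
        return false;               // UTF-16 surrogate range
    }
    return cp <= 0x10FFFFu;         // upper limit of Unicode
}
```

Keeping this check separate leaves the decoder itself small; folding it into toUcs4 would work just as well.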

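Under the current specification (RFC 3629), a single character in UTF-8 takes at most 4 bytes. When I worked with UTF-8 on the job years ago, that limit did not exist yet, and I remember writing code that handled sequences of up to 5 or 6 bytes. Times change.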
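
To make the 4-byte limit concrete, here is a small sketch that classifies a lead byte by the sequence length it announces. utf8SequenceLength is a hypothetical helper written for illustration, not something from the routine above. It follows the RFC 3629 rule that the bytes C0, C1, and F5 through FF never appear in valid UTF-8, which rules out the old 5- and 6-byte lead bytes (F8 through FD).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with `lead`, or 0 if `lead` cannot start a sequence under RFC 3629.
inline std::size_t utf8SequenceLength(std::uint8_t lead)
{
    if (lead < 0x80) return 1;  // 0xxxxxxx: ASCII
    if (lead < 0xC2) return 0;  // continuation byte, or overlong lead C0/C1
    if (lead < 0xE0) return 2;  // 110xxxxx
    if (lead < 0xF0) return 3;  // 1110xxxx
    if (lead < 0xF5) return 4;  // 11110xxx, up to U+10FFFF
    return 0;                   // F5..FF: includes the old 5- and 6-byte leads
}
```

For example, utf8SequenceLength(0xE3) gives 3, the length of a typical hiragana character, while utf8SequenceLength(0xF8) gives 0 because a 5-byte sequence is no longer allowed.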

Chiharu's Diary

The diary of a C/C++ programmer who also draws.
