Reading wide characters with wifstream

If we use std::wifstream to read Unicode text directly from a file, we run into an implementation quirk in the standard library's newline detection. Some characters contain the byte value of a carriage return (0x0D) or line feed (0x0A) as part of their Unicode encoding, and std::getline can mistake that byte for a newline character. When that happens, std::getline splits a line of text in the middle.
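To make this concrete, here is a minimal sketch, assuming the file stores text as 16-bit code units (the encoding is an assumption for illustration). The character U+010A ("Ċ") is stored as the bytes 0x0A 0x01 in little-endian order, so its first byte is identical to a line feed. The helper name contains_newline_byte is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: returns true if either byte of a 16-bit code unit
// has the same value as a line feed (0x0A) or carriage return (0x0D).
bool contains_newline_byte(std::uint16_t unit) {
    const std::uint8_t low  = static_cast<std::uint8_t>(unit & 0xFF);
    const std::uint8_t high = static_cast<std::uint8_t>(unit >> 8);
    return low == 0x0A || low == 0x0D || high == 0x0A || high == 0x0D;
}
```

Any code unit in the range U+0A00 to U+0AFF (the Gurmukhi and Gujarati blocks) has 0x0A as its high byte, so whole scripts can be affected, not just isolated characters.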

wifstream provides a wide-character interface to the standard library's file operations, but the underlying file operations are performed on narrow char data, which the codecvt facet of the stream's locale then converts to wide characters automatically.

This problem only occurs if the data in the file was written directly in a wide Unicode encoding. If the data was converted to a multibyte, narrow char encoding before being written to disk, the problem should not occur.
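As an illustration of why a multibyte encoding avoids the issue, the sketch below encodes a code point as UTF-8 (UTF-8 is an assumed example of such an encoding, and utf8_encode is a hypothetical helper). Every byte of a multi-byte UTF-8 sequence has its high bit set, so none of them can equal 0x0A or 0x0D.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Surrogate code points are not rejected; this is a sketch, not a validator.
std::string utf8_encode(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {
        // ASCII: a single byte, identical to the code point
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+010A this yields the two bytes 0xC4 0x8A, neither of which can be confused with a newline, whereas a real line feed (U+000A) is still written as the single byte 0x0A.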

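To solve this problem, we can replace the codecvt facet in the locale used by the stream so that wifstream works directly with wide characters. The facet below copies bytes unchanged, so it assumes the file stores each character with the same width and byte order as the platform's wchar_t: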

#include <locale>
#include <string.h>  // memcpy
#include <wchar.h>   // mbstate_t

typedef std::codecvt<wchar_t , char , mbstate_t> null_wcodecvt_base;
/** \brief Pass-through converter: moves raw bytes between the file and wide char buffers without conversion
\note http://forums.codeguru.com/showthread.php?457106-Unicode-text-file&s=3901c97bd2e7d4a77943430d1ff75411&p=1741409#post1741409
*/
class null_wcodecvt : public null_wcodecvt_base {
public:
    explicit null_wcodecvt(size_t refs=0) : null_wcodecvt_base(refs) {}
protected:
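    virtual result do_out(mbstate_t&, const wchar_t* from, const wchar_t* from_end, const wchar_t*& from_next, char* to, char* to_end, char*& to_next) const {
        // Copy whole wchar_t units as raw bytes, limited by the space left in the output buffer
        size_t count = from_end - from;
        size_t room = (to_end - to) / sizeof(wchar_t);
        if (count > room) count = room;
        memcpy(to, from, count * sizeof(wchar_t));
        from_next = from + count;
        to_next = to + count * sizeof(wchar_t);
        return (from_next == from_end) ? ok : partial;
    }
    virtual result do_in(mbstate_t&, const char* from, const char* from_end, const char*& from_next, wchar_t* to, wchar_t* to_end, wchar_t*& to_next) const {
        // Copy raw bytes into whole wchar_t units, limited by the space left in the output buffer
        size_t count = (from_end - from) / sizeof(wchar_t);
        size_t room = to_end - to;
        if (count > room) count = room;
        memcpy(to, from, count * sizeof(wchar_t));
        from_next = from + count * sizeof(wchar_t);
        to_next = to + count;
        return (from_next == from_end) ? ok : partial;
    }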
    virtual result do_unshift(mbstate_t&, char* to, char*, char*& to_next) const {
        to_next = to;
        return noconv;
    }
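    virtual int do_length(mbstate_t&, const char* from, const char* end, size_t max) const {
        // Number of bytes do_in would consume to produce at most max wide characters
        size_t units = (size_t)(end - from) / sizeof(wchar_t);
        if (units > max) units = max;
        return (int)(units * sizeof(wchar_t));
    }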
    virtual bool do_always_noconv() const throw() {
        return true;
    }
    virtual int do_encoding() const throw() {
        return sizeof(wchar_t);
    }
    virtual int do_max_length() const throw() {
        return sizeof(wchar_t);
    }
};

We can then imbue the stream with a locale that uses this facet before opening the file:

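std::wifstream file;
// refs=1 tells the locale not to delete the facet, which lives on the stack;
// it must outlive every locale and stream that uses it
null_wcodecvt wcodec(1);
std::locale wloc(std::locale::classic(), &wcodec);
// imbue before opening so the facet is in place for the first read
file.imbue(wloc);
file.open(filename.c_str(), std::ios::binary);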

if(file.is_open()) {
    file >> std::noskipws;
    std::wstring line;
    // Test the result of getline itself, so a failed read is never processed as a line
    while(std::getline(file, line)) {
        //do something with line
    }
}
else {
    //throw exception or other error condition
}
file.close();
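Because the file is opened with std::ios::binary, no newline translation takes place. Assuming the file uses CR/LF line endings, each line returned by std::getline will still end in a carriage return, and if the file starts with a byte order mark, it will arrive as the first character of the first line. A minimal sketch of two clean-up helpers follows; the names strip_bom and strip_trailing_cr are hypothetical and not part of the original code.

```cpp
#include <string>

// Hypothetical helper: remove a leading byte order mark (U+FEFF), if present.
// Intended to be called on the first line read from the file.
void strip_bom(std::wstring& line) {
    if (!line.empty() && line[0] == static_cast<wchar_t>(0xFEFF)) {
        line.erase(0, 1);
    }
}

// Hypothetical helper: remove a single trailing carriage return, if present.
// Intended to be called on every line read from a CR/LF file in binary mode.
void strip_trailing_cr(std::wstring& line) {
    if (!line.empty() && line[line.size() - 1] == L'\r') {
        line.erase(line.size() - 1);
    }
}
```

Inside the read loop, strip_bom would be applied once to the first line and strip_trailing_cr to every line before it is processed.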
