Handling strings that contain Chinese characters in Java

1. Problem description:

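The raw data is line-oriented. Each line has a fixed length of 931 bytes, and a Chinese character occupies 2 bytes (the file is read as GBK). According to the field specification, each line has 96 fields, and only one of them, the address field (number 32, handled at index 32 in the code below), contains Chinese text, so it needs special handling. Because the project is confidential, sensitive data has been removed; the data used for this experiment is test data.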

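While cutting each line into substrings by the specified field lengths, I found that extraction went wrong as soon as it reached the Chinese characters: the field lengths are given in bytes, but once the line is decoded each Chinese character is a single char, so the offsets no longer match. The Chinese text therefore has to be handled separately, which is a bit tedious. I wrote the simple method below to deal with it. If you have a better solution, please share it.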
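
To make the mismatch concrete, here is a minimal sketch, assuming the data is GBK-encoded. The class name `GbkLengthDemo`, the helper `gbkLength`, and the sample string are made up for illustration and are not part of the original program. The Chinese text is written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.nio.charset.Charset;

class GbkLengthDemo {
    // Assumption: the JDK in use ships the GBK charset (standard JDKs do).
    static final Charset GBK = Charset.forName("GBK");

    /** Number of bytes the text occupies once encoded as GBK. */
    static int gbkLength(String text) {
        return text.getBytes(GBK).length;
    }

    public static void main(String[] args) {
        // Hypothetical sample: "AB", then "Beijing City" in Chinese (3 characters), then "CD".
        String sample = "AB\u5317\u4eac\u5e02CD";
        System.out.println(sample.length());   // 7: substring() works in these units
        System.out.println(gbkLength(sample)); // 10: the field widths are in these units
    }
}
```

Each Chinese character shifts every later field by one position when the byte widths are applied directly to `substring`, which is why only the fields from the address onward come out wrong.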

2. Solution:

Source code:

package com.dk.rf;

import java.io.*;
import java.util.ArrayList;
import java.util.List;

/**
 * Created by zzy on 17/1/9.
 */
public class ReadFile {
    public static void main(String[] args) {
        String path = "/Users/zzy/Downloads/QQdownload/test-readhanzi.txt";
        readFileByLines(path);
    }

    /**
     * Reads a file line by line; commonly used for line-oriented formatted files.
     */
    public static void readFileByLines(String fileName) {
        File file = new File(fileName);
        System.out.println("Reading the file line by line, one full line at a time:");
        // Decode as GBK so that each Chinese character becomes a single Java char.
        // try-with-resources closes the reader even if an exception is thrown.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "GBK"))) {
            String tempString;
            int line = 1;
            // readLine() returns null at the end of the file
            while ((tempString = reader.readLine()) != null) {
                handleLines(tempString);

                line++;
                // only process the first 100 lines
                if (line > 100) {
                    break;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }


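    /**
     * Splits one line into its fields.
     * @param line one record, already decoded from GBK
     */
    public static void handleLines(String line){
        // Each line is split into 96 fields; the widths below are in bytes

        List<String> strList = new ArrayList<>();
        int start = 0;
        int end = 0;
        int [] ss = {42,42,42,8,3,1,1,1,1,1,
                    6,10,11,11,11,11,11,21,21,21,
                    4,6,12,4,6,4,3,2,12,6,
                    8,15,40,3,4,6,10,1,1,5,
                    2,2,2,2,4,4,11,11,12,12,
                    12,12,3,3,8,1,8,8,8,8,
                    8,8,8,8,8,8,8,1,16,8,
                    8,8,8,8,8,32,2,1,2,14,
                    4,3,9,12,3,1,8,1,12,15,
                    21,1,2,1,1,97
                    };

        for (int i = 0; i < ss.length; i++ ){
            if (i == 32){ // the address field is handled separately
                char[] cc = line.toCharArray();
                int byteWidth = ss[i]; // width of the field in bytes (40)
                int ss_32 = 0; // bytes consumed so far
                int ff = 0;    // chars consumed so far
                System.out.println("-------"+start);
                // Stop at the end of the line or once the byte width is used up.
                // Using < rather than == also terminates if a 2-byte character
                // steps over the limit.
                for (int j = start; j < cc.length && ss_32 < byteWidth; j++) {
                    // ASCII takes 1 byte; anything else (e.g. Chinese) takes 2
                    ss_32 += isLetter(cc[j]) ? 1 : 2;
                    ff++;
                }
                ss[i] = ff; // width of this field in chars
            }

            if (start >= line.length())
                return;
            // Clamp so a short line cannot push substring() out of bounds
            end = Math.min(start + ss[i], line.length());

            String temp = line.substring(start, end);
            start = end;
            strList.add(temp);
            System.out.println("ss["+ i+ "]"+ss[i]+"temp="+temp);
            // TODO: business logic still to be added; handover after the Spring Festival

        }
    }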

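    /**
     * Determines whether a character is an ASCII character or something else
     * (e.g. a Chinese, Japanese, or Korean character).
     *
     * @param c the character to test
     * @return true if c is ASCII (code point below 0x80), false otherwise
     */
    public static boolean isLetter(char c) {
        return c < 0x80;
    }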


}
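
As one possible alternative to special-casing the address field, the line can be re-encoded to GBK bytes and cut on byte offsets, so that every field width applies directly. This is only a sketch under assumptions: the class name `GbkFieldSplitter`, the method `split`, and the three-field sample record are hypothetical, and it assumes every character in the line round-trips through GBK and that no field boundary falls in the middle of a two-byte character.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class GbkFieldSplitter {
    // Assumption: the JDK in use ships the GBK charset (standard JDKs do).
    static final Charset GBK = Charset.forName("GBK");

    /**
     * Splits one record into fields whose widths are given in GBK bytes.
     * Stops early if the record is shorter than the widths require.
     */
    static List<String> split(String line, int[] widths) {
        byte[] bytes = line.getBytes(GBK);
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int width : widths) {
            if (start >= bytes.length) {
                break; // record ended before all fields were read
            }
            int end = Math.min(start + width, bytes.length);
            // Decode only this field's byte range back into a String
            fields.add(new String(bytes, start, end - start, GBK));
            start = end;
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical record: 2-byte code, 8-byte address, 3-byte flag.
        // The address is three Chinese characters (6 bytes) padded with 2 spaces.
        String line = "01" + "\u5317\u4eac\u5e02  " + "XYZ";
        for (String field : split(line, new int[] {2, 8, 3})) {
            System.out.println("[" + field + "]");
        }
    }
}
```

With this approach the width table stays untouched and no field index is hard-coded, at the cost of encoding each line a second time.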

3. Related files:

test-readhanzi.txt (download link)