Read fixed width text file?

Can anyone tell me how I can make GAUSS read a fixed width text file, that looks somewhat like the example below?

I have many such files with around 50,000 lines. The first three variables are strings, and then 8 integers and finally a couple of strings again.

F773300       0100      960000                  2179                                                                                          F
F773300        3110      09130                760466            0    0    0    185085                     945551
H030000     0100      030000                10114                                                                                          F
H030000      3110      01130                    10114             0    0   0    0                                   10114           FFFF
H107120       0100      100040               20645                                                                                           F
H107120        3110      01110                   20645             0    0    0    0                                 20645           FFFF
H140000      0100      140000                 8258                                                                                           F
H140000       3110      03113                     8258             0    0    0    0                                   8258           FFFF
H160000      0100      160000                33032                                                                                          F

Best regards

 

3 Answers



1



accepted

The following solution should work. There are a couple places that will be simplified with the next maintenance release (such as allowing string arrays to work with strsect instead of just dataframes).

Save the following contents to a file named 'fixedwidth.csv':

F773300     0100      960000         2179                                                                  F 
F773300     3110      09130          760466            0    0   0       185085             945551 
H030000     0100      030000         10114                                                                 F 
H030000     3110      01130          10114             0    0   0   0                      10114           FFFF 
H107120     0100      100040         20645                                                                 F 
H107120     3110      01110          20645             0    0   0   0                      20645           FFFF 
H140000     0100      140000         8258                                                                  F 
H140000     3110      03113          8258              0    0   0   0                      8258            FFFF 
H160000     0100      160000         33032                                                                 F 

and run the following example

// Pick an arbitrary character for a delimiter that won't be present in the data.
data = csvReadSA(__FILE_DIR$+"fixedwidth.csv", 1, 1, "^");

// Define types we'd like fields to have
types = META_TYPE_CATEGORY|
        META_TYPE_CATEGORY|
        META_TYPE_STRING|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_CATEGORY;

// Provide the start position of each field. We will add -1 later for "until end of line"
fields = 1|13|23|38|56|61|65|69|73|92|108;

// Example with the given data above
df = offset_strs_to_df(data, types, fields);

proc offset_strs_to_df(data, types, fields);
    local df;

    // When GAUSS 23.0.1 is released, this won't be necessary.
    data = asdf(data);

    // Create output symbol to write to.
    df = asdf(zeros(rows(data), rows(types)));

    // Add -1 to the end of the field offsets so the last one is considered "to end of line"
    fields = fields|-1;

    for i(1, rows(fields) - 1, 1);
        // Grab value from each row.
        df[.,i] = strtrim(strsect(data, fields[i], fields[i+1] - fields[i]));

        // Change empty values into missings, and convert to a numeric vector.
        if types[i] == META_TYPE_NUMBER;
            df[.,i] = strtof(reclassify(ntos(df[.,i]), "", "."));
        endif;

        // We could assign the column directly (as shown below), or concat, but this more performant.
        df = dftype(move(df), types[i], i);
    endfor;

    retp(df);
endp;

aptech

1,773


0



It looks like the file contains some missing data and it is represented by spaces and the file does not have a header row. Is this correct?

Also, do you know the offsets to the start of each column?

aptech

1,773


0



Yes, not every record has observations for all variables, so they are represented by spaces

No, the file does not have a header row

Yes, I know the offset to the start of every variable

 

Your Answer

3 Answers

1
accepted

The following solution should work. There are a couple places that will be simplified with the next maintenance release (such as allowing string arrays to work with strsect instead of just dataframes).

Save the following contents to a file named 'fixedwidth.csv':

F773300     0100      960000         2179                                                                  F 
F773300     3110      09130          760466            0    0   0       185085             945551 
H030000     0100      030000         10114                                                                 F 
H030000     3110      01130          10114             0    0   0   0                      10114           FFFF 
H107120     0100      100040         20645                                                                 F 
H107120     3110      01110          20645             0    0   0   0                      20645           FFFF 
H140000     0100      140000         8258                                                                  F 
H140000     3110      03113          8258              0    0   0   0                      8258            FFFF 
H160000     0100      160000         33032                                                                 F 

and run the following example

// Pick an arbitrary character for a delimiter that won't be present in the data.
data = csvReadSA(__FILE_DIR$+"fixedwidth.csv", 1, 1, "^");

// Define types we'd like fields to have
types = META_TYPE_CATEGORY|
        META_TYPE_CATEGORY|
        META_TYPE_STRING|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_CATEGORY;

// Provide the start position of each field. We will add -1 later for "until end of line"
fields = 1|13|23|38|56|61|65|69|73|92|108;

// Example with the given data above
df = offset_strs_to_df(data, types, fields);

proc offset_strs_to_df(data, types, fields);
    local df;

    // When GAUSS 23.0.1 is released, this won't be necessary.
    data = asdf(data);

    // Create output symbol to write to.
    df = asdf(zeros(rows(data), rows(types)));

    // Add -1 to the end of the field offsets so the last one is considered "to end of line"
    fields = fields|-1;

    for i(1, rows(fields) - 1, 1);
        // Grab value from each row.
        df[.,i] = strtrim(strsect(data, fields[i], fields[i+1] - fields[i]));

        // Change empty values into missings, and convert to a numeric vector.
        if types[i] == META_TYPE_NUMBER;
            df[.,i] = strtof(reclassify(ntos(df[.,i]), "", "."));
        endif;

        // We could assign the column directly (as shown below), or concat, but this more performant.
        df = dftype(move(df), types[i], i);
    endfor;

    retp(df);
endp;

0

It looks like the file contains some missing data and it is represented by spaces and the file does not have a header row. Is this correct?

Also, do you know the offsets to the start of each column?

0

Yes, not every record has observations for all variables, so they are represented by spaces

No, the file does not have a header row

Yes, I know the offset to the start of every variable

 


You must login to post answers.

Have a Specific Question?

Get a real answer from a real person

Need Support?

Get help from our friendly experts.