Who has never run into this problem: you transfer a file with FTP from, for instance, Windows to a Unix system and find those strange ^M characters at the end of each line? Or the opposite: you transfer a file from Unix to Windows and all of a sudden all the text appears on one single line? Or a transferred file suddenly seems corrupt although the transfer itself went fine?
This may not be what you wanted, yet it can be perfectly normal, depending on the file type and the transfer mode.
Why would someone see ^M characters?
In short, it means that a DOS (Windows) formatted text file was transferred in binary mode to a Unix system. Two natively different systems, as opposed to two homogeneous ones, may use different codes to mark the end of a line.
To understand what this actually means, it should be noted that in ascii mode the "carriage return/linefeed" characters never get transferred.
On Windows systems, the "carriage return/linefeed" is a pair of two hexadecimal codes : 0D 0A.
One does not see them as such, as they are part of the control characters and thus non-printable, but one sees them as "hey, the text continues on the next line".
An example :
A file called "test.txt" on Windows is going to be transferred to Unix and contains the following :
Or at least, that is what one sees; the file actually contains, in hexadecimal :
(this could be made visible with a hex editor, I will not go as far as breaking it down in bits)
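For readers without a hex editor, the same view can be produced with a few lines of Python. This is only a sketch: the sample content (the three lines eeeee, fff and gg, each ending in 0D 0A) is an assumption inferred from the hex values shown further down, and hex_dump is a made-up helper name.

```python
# Assumed sample content: three lines, each terminated DOS-style with 0D 0A.
sample = b"eeeee\r\nfff\r\ngg\r\n"


def hex_dump(data: bytes) -> str:
    """Return the bytes as space-separated uppercase hex pairs, in file order."""
    return " ".join(f"{byte:02X}" for byte in data)


print(hex_dump(sample))
# 65 65 65 65 65 0D 0A 66 66 66 0D 0A 67 67 0D 0A
```

Every line of text is followed by the pair 0D 0A, which is exactly what an editor hides from view.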
The definition of a Variable, Ascii text file is that it contains only standard 7-bit ascii codes, that each line can have a different length, and that all lines, including the last one, are terminated with 0D 0A (DOS). Of course, some will say "well yes, but I have a program that displays the file correctly anyway, even if the last line doesn't have 0D 0A, or even if it is Unix formatted". But an editor that displays it correctly, by cheating or by interpreting/assuming, doesn't change the structure the file natively has ;)
What happens if the above file is transferred in ascii mode to a Unix system?
It is a misunderstanding that FTP would "modify" or "translate" the EOL (0D0A) characters. Really, it doesn't. Even better, the 0D0A will never get transferred!
When transferring in ascii mode, the FTP client locally reads the Variable, Ascii textfile up to each 0D0A and transforms it into Variable :
The records are read and concatenated into the data buffer, each preceded by a header (2 bytes) containing the length of the record that follows. Note that there are no 0D0A characters in the buffer.
That also means that "a record" can be at most FFFF, thus 65535, characters long.
A receiving Unix FTP server in this example receives the data buffer, reads each leading header, thereby knows the length of the record that follows, and inserts the EOL character it natively uses (0A), which gives :
65656565650A 6666660A 67670A00
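The mechanism described above can be imitated with a small toy model. It is a sketch of the description in this article only, not of the actual bytes an FTP implementation puts on the wire. The function names are invented, the big-endian byte order of the 2-byte header is an assumption, and the trailing 00 in the dump above is not modelled.

```python
import struct


def pack_records(dos_text: bytes) -> bytes:
    """Sender side of the model: drop each 0D0A and prefix every record
    with a 2-byte length header (big-endian is an assumption)."""
    records = dos_text.split(b"\r\n")
    if records and records[-1] == b"":
        records.pop()  # the text ended with 0D0A, so the last piece is empty
    buffer = b""
    for record in records:
        if len(record) > 0xFFFF:
            raise ValueError("record longer than FFFF bytes")
        buffer += struct.pack(">H", len(record)) + record
    return buffer


def unpack_records(buffer: bytes, eol: bytes = b"\n") -> bytes:
    """Receiver side: read each length header, copy that many bytes,
    then append the EOL that is native to the receiving system."""
    out = b""
    pos = 0
    while pos < len(buffer):
        (length,) = struct.unpack_from(">H", buffer, pos)
        pos += 2
        out += buffer[pos:pos + length] + eol
        pos += length
    return out


dos_file = b"eeeee\r\nfff\r\ngg\r\n"      # assumed sample content
unix_file = unpack_records(pack_records(dos_file))
print(unix_file.hex(" "))                 # 65 65 65 65 65 0a 66 66 66 0a 67 67 0a
print(len(dos_file) - len(unix_file))     # 3
```

The buffer built by pack_records contains no 0D or 0A at all; the receiver adds its own EOL purely from the record lengths.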
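A handy Unix command for checking this is od -x, or od -t x1 to see the bytes one by one in file order; od -t x1 test.txt | grep -c 0a then counts the lines of the dump that contain at least one 0A.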
It should also be noted that for this example the file size will be 3 bytes less on Unix (3 x the 0D character = 3 bytes), while the file still contains the same original data.
If one does see ^M characters, it simply means that the file was transferred in binary (streaming) mode, which doesn't "strip" the EOL characters, so both 0D and 0A have been transferred as part of the data. As Unix doesn't use the 0D character in its EOL, it is shown on the screen as ^M, as if it were part of the text.
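Whether a file that arrived on Unix still carries those 0D characters can be checked, and repaired, with something like the sketch below. The helper names are invented and the sample bytes are the assumed example content from before.

```python
def count_stray_cr(data: bytes) -> int:
    """Count the 0D bytes that sit directly in front of a 0A."""
    return data.count(b"\r\n")


def dos_to_unix(data: bytes) -> bytes:
    """Replace every 0D0A pair with a single 0A; other bytes are left alone."""
    return data.replace(b"\r\n", b"\n")


received = b"eeeee\r\nfff\r\ngg\r\n"   # a DOS file that arrived in binary mode
print(count_stray_cr(received))         # 3
print(dos_to_unix(received))            # b'eeeee\nfff\ngg\n'
```

A count of zero on a file that came from Windows suggests the 0D characters never reached the Unix side.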
Yet, I've heard people say, for instance : I have an .XML file, which is 10 MB, with all the data on one single line.
And of course, transferring it in ascii mode will fail, as a record length cannot exceed FFFF. Put differently, it is not a Variable, Ascii textfile. Sometimes this backfires on me, with people asking : "so, then it is Variable ascii?" (without the "text")
Nice try, but NO. Windows doesn't know "Variable" since, as can be seen above, "Variable" would put a length header in front of each record. Some systems do know "Variable" though; for instance, MVS mainframes can have datasets with a variable or a fixed record format.
You can then wait for their next question : "yes, well, but the file only contains standard 7-bit ascii codes, no binary codes, so how do I transfer it with FTP?"
Simple : in binary mode, as the file doesn't meet all the criteria to qualify as a Variable, Ascii textfile ;)
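The criteria used in this article (7-bit codes only, every line terminated, no record longer than FFFF) can be turned into a rough check. This is a sketch under those stated criteria; text_file_problems and the sample XML bytes are made up for illustration.

```python
MAX_RECORD = 0xFFFF  # longest record a 2-byte length header can describe


def text_file_problems(data: bytes, eol: bytes = b"\r\n") -> list:
    """Return the reasons why data does not qualify as a Variable, Ascii
    text file by the criteria of this article (empty list = it qualifies)."""
    problems = []
    if any(byte > 0x7F for byte in data):
        problems.append("contains bytes outside 7-bit ascii")
    if data and not data.endswith(eol):
        problems.append("last line is not terminated")
    longest = max(len(record) for record in data.split(eol))
    if longest > MAX_RECORD:
        problems.append("a record is longer than FFFF bytes")
    return problems


one_line_xml = b"<a>" + b"x" * 70000 + b"</a>"   # hypothetical one-line file
print(text_file_problems(b"eeeee\r\nfff\r\ngg\r\n"))  # []
print(text_file_problems(one_line_xml))
```

A non-empty result is the hint to transfer the file in binary mode, even when every byte in it is plain 7-bit ascii.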
Then when is a file a Fixed ascii textfile?
Neither on Windows nor on Unix does it actually exist. It would mean that all records in the file have exactly the same length, but to Windows and Unix that has no special meaning : it is the same as a Variable, ascii textfile in which, coincidentally, all records happen to have the same length. Between applications, this is often obtained by applying padding or truncation. A quick check to see if a textfile could potentially qualify as "fixed text" is to divide the file size by the length of one record including its EOL (the data length plus 1 for 0A on Unix, or plus 2 for 0D0A on Windows), which should result in a whole number of records. This is only a "quick" check; it is not a guarantee. The only way to be sure is to check the length of each record in the file.
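Both the quick check and the record-by-record check can be sketched as follows. The function names and sample data are invented for illustration, and the record length is taken to be the data bytes plus the line terminator.

```python
def quick_fixed_check(file_size: int, data_length: int, eol_length: int) -> bool:
    """Quick check: does the file size hold a whole number of records?"""
    return file_size % (data_length + eol_length) == 0


def strict_fixed_check(data: bytes, data_length: int, eol: bytes) -> bool:
    """Sure check: every record, EOL excluded, has exactly data_length bytes."""
    if not data.endswith(eol):
        return False
    records = data[:-len(eol)].split(eol)
    return all(len(record) == data_length for record in records)


fixed = b"aaaa\nbbbb\ncccc\n"        # three Unix records of 4 data bytes
misleading = b"aaa\nbbbbb\ncccc\n"   # same file size, unequal records

print(quick_fixed_check(len(fixed), 4, 1), strict_fixed_check(fixed, 4, b"\n"))
# True True
print(quick_fixed_check(len(misleading), 4, 1), strict_fixed_check(misleading, 4, b"\n"))
# True False
```

The second file passes the quick check only because its total size happens to fit, which is why the quick check is not a guarantee.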
If you still want to know what exact file you are dealing with, you can also use the File Analyser