move "extra" to src repository

NunoSempere 2023-09-15 10:43:04 +03:00
parent b987143eb6
commit d36eb1e505
11 changed files with 289 additions and 28 deletions

@ -1,47 +1,71 @@
# ww: count words in 50 lines of C
# wc: count words in <50 lines of C
## Desiderata
- Simplicity: Just count words, as delimited by: spaces, tabs, newlines.
- No flags.
- Simplicity: Just count words as delimited by spaces, tabs, newlines.
- Allow: reading files, piping to the utility, and reading from stdin.
- Separate utilities for counting different things, like lines and characters.
- Avoid off-by-one errors.
- Allow piping, as well as reading files.
- Small.
- Linux only.
- Small.
## Comparison with wc.
## Comparison with other historical versions of wc.
The GNU utils version ([github](https://github.com/coreutils/coreutils/tree/master/src/wc.c), [savannah](http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/wc.c;hb=HEAD)) is a bit over 1K lines of C. It does many things and checks many possible failure modes. I think it detects whether it should be reading from stdin using some very wrapped fstat.
The version in wc/wc.c in this repository sits at 43 lines. It decides to read from stdin if the number of arguments fed to it is zero, and uses the Linux `read` function to read character by character. It doesn't have flags; instead, there are further utilities in the extra/ folder for counting characters and lines, sitting at 33 and 35 lines of code, respectively. This version also has little error checking.
The busybox version ([git.busybox.net](https://git.busybox.net/busybox/tree/coreutils/wc.c)) of wc is much shorter, at 257 lines, while striving to be [POSIX-compliant](https://pubs.opengroup.org/onlinepubs/9699919799/), meaning it has flags.
[Here](https://github.com/dspinellis/unix-history-repo/blob/Research-V7-Snapshot-Development/usr/src/cmd/wc.c) is a version of wc from UNIX V7, at 86 lines. It allows for counting characters, words and lines. I couldn't find a version in UNIX V6, so I'm guessing this is one of the earliest versions of this program (?). It decides to read from stdin if the number of arguments fed to it is zero, and reads character by character using the standard C `getc` function.
The plan9port version of wc ([github](https://github.com/9fans/plan9port/blob/master/src/cmd/wc.c)) implements some sort of table method, in 352 lines. So does the [plan9](https://9p.io/sources/plan9/sys/src/cmd/wc.c) version, which is worse documented, but shorter.
The busybox version ([git.busybox.net](https://git.busybox.net/busybox/tree/coreutils/wc.c)) of wc sits at 257 lines (162 with comments stripped), while striving to be [POSIX-compliant](https://pubs.opengroup.org/onlinepubs/9699919799/), meaning it has a fair number of flags and a bit of complexity. It reads character by character by using the standard `getc` function, and decides to read from stdin or not using its own `fopen_or_warn_stdin` function. It uses two GOTOs to get around, and has some incomplete Unicode support.
[Here](https://github.com/dspinellis/unix-history-repo/blob/Research-V7-Snapshot-Development/usr/src/cmd/wc.c) is a version of wc from UNIX V7, at 86 lines, and allowing for both word and line counts. I couldn't find a version in UNIX V6. Of all the versions, I think I understand this one best.
The [plan9](https://9p.io/sources/plan9/sys/src/cmd/wc.c) version implements some sort of table method in 331 lines. It uses plan9 rather than Unix libraries and methods, and seems to read from stdin if the number of args is 0.
The plan9port version of wc ([github](https://github.com/9fans/plan9port/blob/master/src/cmd/wc.c)) also implements some sort of table method, in 352 lines. It reads from stdin if the number of args is 0, and uses the Linux `read` function to read character by character.
The GNU utils version ([github](https://github.com/coreutils/coreutils/tree/master/src/wc.c), [savannah](http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/wc.c;hb=HEAD)) is a bit over 1K lines of C. It does many things and checks many possible failure modes. I think it detects whether it should be reading from stdin using some very wrapped fstat, and it reads character by character using its own custom function.
So this utility started out reasonably small, then started getting more and more complex. [The POSIX committee](https://pubs.opengroup.org/onlinepubs/9699919799/) ended up codifying that implementation, and now we are stuck with it because even implementations like busybox which strive to be quite small try to keep to POSIX.
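As a rough illustration of the two choices that recur above (reading from stdin when no file argument is given, and reading character by character with `getc`), here is a minimal sketch. It is not the code of any of the versions listed; the names `count_words` and `open_input` are made up for this example, and it assumes words are delimited only by spaces, tabs and newlines, as in the desiderata.

```c
#include <stdio.h>

/* Count words in a stream. A word is a run of bytes delimited by
 * spaces, tabs or newlines. Returns -1 on a read error. */
long count_words(FILE* fp)
{
    long num_words = 0;
    int in_word = 0;
    int c;
    while ((c = getc(fp)) != EOF) {
        if (c == ' ' || c == '\t' || c == '\n') {
            num_words += in_word; /* a delimiter closes the current word, if any */
            in_word = 0;
        } else {
            in_word = 1;
        }
    }
    if (ferror(fp)) {
        return -1;
    }
    return num_words + in_word; /* EOF also closes a pending word */
}

/* Hypothetical helper: use stdin when no file argument was given,
 * otherwise open the first argument. Returns NULL if fopen fails. */
FILE* open_input(int argc, char** argv)
{
    return (argc < 2) ? stdin : fopen(argv[1], "r");
}
```

A `main` would call `open_input(argc, argv)`, check the result for `NULL`, print `count_words(fp)`, and close the file if it is not stdin.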
## Usage examples
```
echo "En un lugar de la Mancha" | ww
cat README.md | ww
ww README.md
ww # write something, then exit with Ctrl+D
```
## Relationship with cat-v
Does one really need to spend 1k lines of C code to count characters, words and lines? There are many versions of this rant one could give, but the best and probably best known is [this one](to do: locate) by cat-v, named for the explosion of options.
[ add sad busybox comment on its cat implementation ]
## Steps:
- [x] Look into how C utilities both read from stdin and from files.
- [x] Program first version of the utility
- [x] Compare with other implementations, see how they do it, after I've read my own version
- [x] Compare with other implementations, see how they do it, after I've created my own version
- [x] Compare with gnu utils.
- [x] Compare with musl/busybox implementations,
- ~~Maybe make some pull requests, if I'm doing something better? => doesn't seem like it~~
- [ ] Install to ww, but check that ww is empty (installing to wc2 or smth would mean that you don't save that many keypresses vs wc -w)
- ~~[ ] Could use zig? => Not for now~~
- [ ] Look specifically at how other versions do stuff.
- [ ] Distinguish between reading from stdin and reading from a file
- [x] Compare with busybox implementation
- Compare with other projects: <https://github.com/leecannon/zig-coreutils>, <https://github.com/keiranrowan/tiny-core/tree/master>.
- [x] Install to ww, but check that ww is empty (installing to wc2 or smth would mean that you don't save that many keypresses vs wc -w)
- [x] Look specifically at how other versions do stuff.
- [x] Distinguish between reading from stdin and reading from a file
- If it doesn't have arguments, read from stdin.
- [ ] Open files, read characters.
- [ ] Write version that counts lines
- [x] Open files, read characters.
- [x] Write version that counts lines (lc)
- [ ] Take into account what happens if file doesn't end in newline.
- [ ] Count EOF as word & line separator
- [ ] Document it
- [ ] Document reading from user-inputted stdin (end with Ctrl+D)
- [ ] Write man files?
- [ ] Write a version for other coreutils? <https://git.busybox.net/busybox/tree/coreutils/>? Would be a really nice project.
- Simple utils.
- zig?
- https://github.com/leecannon/zig-coreutils
- https://github.com/keiranrowan/tiny-core/tree/master
- [x] Add lc
- Take into account what happens if file doesn't end in newline.
- [ ] add chc (cc is "c compiler")
- [ ] Possible follow-up: Write simple versions for other coreutils. <https://git.busybox.net/busybox/tree/coreutils/>? Would be a really nice project.
- Get it working on a DuskOS/CollapseOS machine? Or, find a minimalistic kernel that could use them?
- [ ] add chc, or charcounter (cc is "c compiler")
- [ ] Pitch to lwn.net?
Discarded:
- ~~[ ] Could use zig? => Not for now~~
- ~~Maybe make some pull requests, if I'm doing something better? => doesn't seem like it~~
- ~~[ ] Write man files?~~
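
One possible reading of the "file doesn't end in newline" and "count EOF as line separator" items above, as a sketch only: count newline characters, then add one if the last byte read was not a newline. The function name `count_lines` is an assumption for this example and this is not necessarily how lc handles it.

```c
#include <stdio.h>

/* Count lines in a stream, treating EOF as a line terminator when
 * the final line is non-empty and has no trailing newline. */
long count_lines(FILE* fp)
{
    long num_lines = 0;
    int last = '\n'; /* so that empty input counts as zero lines */
    int c;
    while ((c = getc(fp)) != EOF) {
        if (c == '\n') {
            num_lines++;
        }
        last = c;
    }
    if (last != '\n') {
        num_lines++; /* unterminated final line */
    }
    return num_lines;
}
```

With this rule, `"123\n45 67"` and `"123\n45 67\n"` both count as two lines, and empty input counts as zero.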

@ -0,0 +1,162 @@
#include "libbb.h"
#include "unicode.h"
#if !ENABLE_LOCALE_SUPPORT
# undef isprint
# undef isspace
# define isprint(c) ((unsigned)((c) - 0x20) <= (0x7e - 0x20))
# define isspace(c) ((c) == ' ')
#endif
#if ENABLE_FEATURE_WC_LARGE
# define COUNT_T unsigned long long
# define COUNT_FMT "llu"
#else
# define COUNT_T unsigned
# define COUNT_FMT "u"
#endif
enum {
WC_LINES = 0, /* -l */
WC_WORDS = 1, /* -w */
WC_UNICHARS = 2, /* -m */
WC_BYTES = 3, /* -c */
WC_LENGTH = 4, /* -L */
NUM_WCS = 5,
};
int wc_main(int argc, char **argv) MAIN_EXTERNALLY_VISIBLE;
int wc_main(int argc UNUSED_PARAM, char **argv)
{
const char *arg;
const char *start_fmt = " %9"COUNT_FMT + 1;
const char *fname_fmt = " %s\n";
COUNT_T *pcounts;
COUNT_T counts[NUM_WCS];
COUNT_T totals[NUM_WCS];
int num_files;
smallint status = EXIT_SUCCESS;
unsigned print_type;
init_unicode();
print_type = getopt32(argv, "lwmcL");
if (print_type == 0) {
print_type = (1 << WC_LINES) | (1 << WC_WORDS) | (1 << WC_BYTES);
}
argv += optind;
if (!argv[0]) {
*--argv = (char *) bb_msg_standard_input;
fname_fmt = "\n";
}
if (!argv[1]) { /* zero or one filename? */
if (!((print_type-1) & print_type)) /* exactly one option? */
start_fmt = "%"COUNT_FMT;
}
memset(totals, 0, sizeof(totals));
pcounts = counts;
num_files = 0;
while ((arg = *argv++) != NULL) {
FILE *fp;
const char *s;
unsigned u;
unsigned linepos;
smallint in_word;
++num_files;
fp = fopen_or_warn_stdin(arg);
if (!fp) {
status = EXIT_FAILURE;
continue;
}
memset(counts, 0, sizeof(counts));
linepos = 0;
in_word = 0;
while (1) {
int c;
c = getc(fp);
if (c == EOF) {
if (ferror(fp)) {
bb_simple_perror_msg(arg);
status = EXIT_FAILURE;
}
goto DO_EOF; /* Treat an EOF as '\r'. */
}
++counts[WC_BYTES];
if (unicode_status != UNICODE_ON /* every byte is a new char */
|| (c & 0xc0) != 0x80 /* it isn't a 2nd+ byte of a Unicode char */
) {
++counts[WC_UNICHARS];
}
if (isprint_asciionly(c)) { /* FIXME: not unicode-aware */
++linepos;
if (!isspace(c)) {
in_word = 1;
continue;
}
} else if ((unsigned)(c - 9) <= 4) {
if (c == '\t') {
linepos = (linepos | 7) + 1;
} else { /* '\n', '\r', '\f', or '\v' */
DO_EOF:
if (linepos > counts[WC_LENGTH]) {
counts[WC_LENGTH] = linepos;
}
if (c == '\n') {
++counts[WC_LINES];
}
if (c != '\v') {
linepos = 0;
}
}
} else {
continue;
}
counts[WC_WORDS] += in_word;
in_word = 0;
if (c == EOF) {
break;
}
}
fclose_if_not_stdin(fp);
if (totals[WC_LENGTH] < counts[WC_LENGTH]) {
totals[WC_LENGTH] = counts[WC_LENGTH];
}
totals[WC_LENGTH] -= counts[WC_LENGTH];
OUTPUT:
s = start_fmt;
u = 0;
do {
if (print_type & (1 << u)) {
printf(s, pcounts[u]);
s = " %9"COUNT_FMT; /* Ok... restore the leading space. */
}
totals[u] += pcounts[u];
} while (++u < NUM_WCS);
printf(fname_fmt, arg);
}
if (num_files > 1) {
num_files = 0; /* Make sure we don't get here again. */
arg = "total";
pcounts = totals;
--argv;
goto OUTPUT;
}
fflush_stdout_and_exit(status);
}
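
Two of the bit tricks in the busybox code above are easier to see in isolation. The sketch below restates them as standalone functions; the names `count_utf8_chars` and `next_tab_stop` are invented for this example and do not exist in busybox. The first assumes its input is valid UTF-8, the second assumes tab stops every 8 columns.

```c
#include <stddef.h>

/* Count UTF-8 characters in buf[0..len). Every byte starts a new
 * character unless it is a continuation byte of the form 10xxxxxx,
 * which is what the (c & 0xc0) != 0x80 test checks. */
size_t count_utf8_chars(const unsigned char* buf, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if ((buf[i] & 0xc0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* Column reached after a tab at column linepos, with tab stops every
 * 8 columns: setting the low three bits and adding one rounds up to
 * the next multiple of 8. */
unsigned next_tab_stop(unsigned linepos)
{
    return (linepos | 7) + 1;
}
```

For instance, "mañana" encoded as UTF-8 is 7 bytes but 6 characters, and a tab at column 5 moves to column 8.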

BIN
src/extra/chc/chc Executable file

Binary file not shown.

33
src/extra/chc/chc.c Normal file

@ -0,0 +1,33 @@
#include <stdio.h>
#include <unistd.h>
int chc(FILE* fp)
{
char c[1];
int num_chars = 0;
int fn = fileno(fp);
while (read(fn, c, sizeof(c)) > 0) {
num_chars ++;
}
printf("%i\n", num_chars);
return 0;
}
int main(int argc, char** argv)
{
if (argc == 1) {
return chc(stdin);
} else if (argc > 1) {
FILE* fp = fopen(argv[1], "r");
if (!fp) {
perror("Could not open file");
return 1;
}
int status = chc(fp);
fclose(fp); // close the file regardless of what chc returned
return status;
} else {
printf("Usage: chc file.txt\n");
printf(" or: cat file.txt | chc\n");
printf(" or: chc # read from user-inputted stdin\n");
}
return 0;
}
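
The `read` loop in chc.c reads one byte per call and does not distinguish end of input from an error. As a sketch of what added error checking could look like (an illustrative variant, not the repository's code; `count_bytes_fd` is a made-up name), the function below reads into a larger buffer, retries when a signal interrupts the call, and returns -1 on any other error.

```c
#include <errno.h>
#include <unistd.h>

/* Count the bytes readable from file descriptor fd.
 * Returns the byte count, or -1 if read() fails. */
long count_bytes_fd(int fd)
{
    char buf[4096];
    long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            total += n; /* read may return fewer bytes than requested */
        } else if (n == 0) {
            return total; /* end of input */
        } else if (errno != EINTR) {
            return -1; /* real error; EINTR just means "try again" */
        }
    }
}
```

It could be called as `count_bytes_fd(fileno(fp))` in place of the byte-at-a-time loop; the larger buffer also means far fewer system calls on big inputs.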

40
src/extra/chc/makefile Normal file

@ -0,0 +1,40 @@
# Interface:
# make
# make build
# make format
# Compiler
CC=gcc
# CC=tcc # <= faster compilation
# Main file
SRC=chc.c
OUT=chc
## Flags
DEBUG= #'-g'
STANDARD=-std=c99
WARNINGS=-Wall
OPTIMIZED=-O3
# OPTIMIZED=-O3 #-Ofast
## Formatter
STYLE_BLUEPRINT=webkit
FORMATTER=clang-format -i -style=$(STYLE_BLUEPRINT)
## make build
build: $(SRC)
$(CC) $(OPTIMIZED) $(DEBUG) $(SRC) -o $(OUT)
format: $(SRC)
$(FORMATTER) $(SRC)
install:
cp -n $(OUT) /bin/$(OUT)
test: $(OUT)
/bin/echo -e "123\n45 67" | ./$(OUT)
/bin/echo -n "" | ./$(OUT)
/bin/echo " xx x" | ./$(OUT)
./$(OUT) $(SRC)
./$(OUT) nonexistent_file || true

@ -7,6 +7,8 @@ int wc(FILE* fp)
int word = 0, num_words = 0;
int fn = fileno(fp);
while (read(fn, c, sizeof(c)) > 0) {
// could add error checking to that read call
// could also use getc or fgetc instead of read.
if (*c != ' ' && *c != '\n' && *c != '\t') {
word = 1;
} else {