
Process Log Files

By Atif Alam

This page walks through how to process log files with Python: read input, extract fields, aggregate, and report. The script uses the standard library only — no third-party packages and no regular expressions. You add one piece at a time and run python process_logs.py after each step so nothing piles up in your head.

Create a file process_logs.py and an app.log in the same folder (use the sample below). Python 3.9+ is enough for the typing used here. For more on looping over files, see Loops.

You have plain-text application logs. Each line looks like: date, time, severity, then the rest of the message. You want to:

  • Count how many lines you see per severity (for example INFO, WARN, ERROR).
  • For ERROR lines, count how many fall into each hour so you can spot spikes.
  • Print a short summary to the terminal.

Real logs are often messy: skip lines that do not match the shape you expect instead of crashing.

Create app.log in the same folder as your script with content like this:

2024-03-19 10:15:01 INFO service=auth msg="Startup complete"
2024-03-19 10:45:22 ERROR service=auth msg="Login failed" user=guest
2024-03-19 10:47:01 WARN service=api msg="High latency"
2024-03-19 11:02:15 ERROR service=db msg="Connection timeout"
2024-03-19 11:15:00 INFO service=auth msg="User logged in" user=atif

Each line: YYYY-MM-DD HH:MM:SS LEVEL message... (fields separated by spaces; the message may contain spaces).

What you’ll build: A generator that reads lines without loading the whole file, a split-based parser, an analyze function that counts and buckets, and a small report printer. You’ll replace the if __name__ == "__main__": block in each step (or grow it as shown).


Step 1 — Read the File One Line at a Time

Goal: Open app.log and stream lines one at a time. Prove it with a line count and a short preview.

pathlib.Path keeps paths readable. Iterating for line in f avoids reading the entire file into memory at once (important for large logs).

The preview truncates long lines with first[:80] — string slicing, same idea as list slices in Language basics.

from pathlib import Path

LOG_FILE = Path("app.log")

def read_lines(path: Path):
    """Yield stripped lines one at a time (memory-friendly for big files)."""
    # utf-8 is typical; errors="replace" avoids crashing on odd bytes
    with path.open("r", encoding="utf-8", errors="replace") as f:
        for line in f:
            yield line.strip()

if __name__ == "__main__":
    if not LOG_FILE.is_file():
        print(f"Missing {LOG_FILE}. Create it using the sample log above.")
        raise SystemExit(1)
    # One pass: count every line; keep only
    # the first line for preview (no list of all lines).
    n = 0
    first = None
    for line in read_lines(LOG_FILE):
        if first is None:
            first = line
        n += 1
    print("Step 1 OK:", n, "lines")
    # If the file had at least one line,
    # show a short preview (truncate long lines).
    if first:
        preview = first if len(first) <= 80 else first[:80] + "..."
        print("First line preview:", preview)

Check: Run python process_logs.py. You should see Step 1 OK: 5 lines (or your line count) and a preview of the first line.


Step 2 — Parse a Line into Fields

Goal: Turn a single log line into a small dict (timestamp, severity, message), or None if the line does not fit.

line.split(maxsplit=3) splits into at most four parts: date, time, severity, and everything else as the message (so spaces inside the message stay intact).

Add parse_line below read_lines (still above the if __name__ block), and replace your __main__ block with the one below. The new pieces in this step are the typing import, parse_line, SAMPLE_LINE, and the __main__ body; the rest matches Step 1.

from pathlib import Path
from typing import Optional

LOG_FILE = Path("app.log")

def read_lines(path: Path):
    """Yield stripped lines one at a time (memory-friendly for big files)."""
    with path.open("r", encoding="utf-8", errors="replace") as f:
        for line in f:
            yield line.strip()

def parse_line(line: str) -> Optional[dict[str, str]]:
    """
    Expect: date time LEVEL rest...
    maxsplit=3 → four parts; the last part is the full message tail.
    """
    if not line:
        return None
    # Split on whitespace into: date, time, severity, and the remaining message.
    parts = line.split(maxsplit=3)
    if len(parts) < 4:  # malformed / too short — skip later instead of crashing
        return None
    date_s, time_s, severity, message = parts
    return {
        "timestamp": f"{date_s} {time_s}",
        "severity": severity,
        "message": message,
    }

# One known-good line so you can test the parser without a file
SAMPLE_LINE = '2024-03-19 10:15:01 INFO service=auth msg="Startup complete"'

if __name__ == "__main__":
    print("Step 2 sample parse:", parse_line(SAMPLE_LINE))
    if not LOG_FILE.is_file():
        print(f"Missing {LOG_FILE}. Create it using the sample log above.")
        raise SystemExit(1)
    for line in read_lines(LOG_FILE):
        row = parse_line(line)
        print(line[:60], "->", row)
        break  # only show first parsed row from file

Check: Run the script. The sample should print a dict with timestamp, severity, and message. The first file line should show a similar shape.

Match your parser to a stable log layout; if the layout changes, update the parser (or move to structured logs — see below).


Step 3 — Count Severities and Bucket Errors by Hour


Goal: Walk the whole file, count each severity, and for ERROR rows count how many fall in each clock hour.

Add imports: Counter, defaultdict, and datetime. Add analyze after parse_line. Replace __main__ with the version below (you can remove SAMPLE_LINE and the single-line loop from Step 2, or leave SAMPLE_LINE for quick experiments).

Counter tallies severities. defaultdict(int) lets you do errors_by_hour[hour_bucket] += 1 without checking if the key exists. strptime / strftime turn the timestamp string into an hour bucket.

from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path
from typing import Optional

LOG_FILE = Path("app.log")

def read_lines(path: Path):
    """Yield stripped lines one at a time (memory-friendly for big files)."""
    with path.open("r", encoding="utf-8", errors="replace") as f:
        for line in f:
            yield line.strip()

def parse_line(line: str) -> Optional[dict[str, str]]:
    if not line:
        return None
    parts = line.split(maxsplit=3)
    if len(parts) < 4:
        return None
    date_s, time_s, severity, message = parts
    return {
        "timestamp": f"{date_s} {time_s}",
        "severity": severity,
        "message": message,
    }

def analyze(path: Path):
    severity_counts: Counter[str] = Counter()
    errors_by_hour: dict[str, int] = defaultdict(int)
    errors: list[dict] = []
    for line in read_lines(path):
        row = parse_line(line)
        if row is None:
            continue  # skip garbage lines
        severity_counts[row["severity"]] += 1
        if row["severity"] == "ERROR":
            # Parse timestamp once, then bucket to the top of the hour
            ts = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
            hour_bucket = ts.strftime("%Y-%m-%d %H:00")
            errors_by_hour[hour_bucket] += 1
            errors.append(row)
    return severity_counts, errors_by_hour, errors

if __name__ == "__main__":
    if not LOG_FILE.is_file():
        print(f"Missing {LOG_FILE}. Create it using the sample log above.")
        raise SystemExit(1)
    counts, by_hour, err_rows = analyze(LOG_FILE)
    print("Step 3 OK — severity counts:", dict(counts))
    print("Errors by hour:", dict(by_hour))
    print("Total ERROR rows stored:", len(err_rows))

Check: With the sample app.log, you should see two ERROR lines split across hours 2024-03-19 10:00 and 2024-03-19 11:00, and counts for INFO, WARN, and ERROR.


Step 4 — Print a Human-Readable Report

Goal: Format the aggregates for humans: severity totals, a simple per-hour bar for errors, and the last few error messages.

Add print_report after analyze. Replace __main__ to call print_report instead of printing raw dicts. The # bar uses min(n, 40) so one huge count does not flood the terminal.

from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path
from typing import Optional

LOG_FILE = Path("app.log")

def read_lines(path: Path):
    """Yield stripped lines one at a time (memory-friendly for big files)."""
    with path.open("r", encoding="utf-8", errors="replace") as f:
        for line in f:
            yield line.strip()

def parse_line(line: str) -> Optional[dict[str, str]]:
    if not line:
        return None
    parts = line.split(maxsplit=3)
    if len(parts) < 4:
        return None
    date_s, time_s, severity, message = parts
    return {
        "timestamp": f"{date_s} {time_s}",
        "severity": severity,
        "message": message,
    }

def analyze(path: Path):
    severity_counts: Counter[str] = Counter()
    errors_by_hour: dict[str, int] = defaultdict(int)
    errors: list[dict] = []
    for line in read_lines(path):
        row = parse_line(line)
        if row is None:
            continue  # skip malformed lines
        severity_counts[row["severity"]] += 1
        if row["severity"] == "ERROR":
            ts = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
            hour_bucket = ts.strftime("%Y-%m-%d %H:00")
            errors_by_hour[hour_bucket] += 1
            errors.append(row)
    return severity_counts, errors_by_hour, errors

def print_report(severity_counts, errors_by_hour, errors) -> None:
    print("=== Counts by Severity ===")
    for severity, n in severity_counts.most_common():
        print(f" {severity}: {n}")
    print("\n=== Errors by Hour ===")
    for hour in sorted(errors_by_hour):
        n = errors_by_hour[hour]
        bar = "#" * min(n, 40)  # cap width for huge counts
        print(f" {hour} {bar} ({n})")
    print("\n=== Last Few Errors ===")
    for row in errors[-5:]:
        print(f" [{row['timestamp']}] {row['message']}")

if __name__ == "__main__":
    if not LOG_FILE.is_file():
        print(f"Missing {LOG_FILE}. Create it using the sample log above.")
        raise SystemExit(1)
    counts, by_hour, err_rows = analyze(LOG_FILE)
    print_report(counts, by_hour, err_rows)

Check: Run the script. You should see three sections: counts by severity, error bars by hour, and the last few error lines.


The approaches below also avoid regex. Use whichever one matches how your logs are actually written.

  • One JSON object per line: json.loads(line) inside try / except json.JSONDecodeError; then read fields from the dict.
  • Mostly key=value tokens: loop over tokens with str.partition("=") and build a dict (strip quotes from values if needed).
  • Comma- or tab-separated exports: the csv module, or a careful split if quoting is simple.

Some logs are free-form or inconsistent; teams often prefer structured logging (JSON) so analysis stays simple. You do not need regex for the common cases above.
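For the first two shapes, minimal sketches look like this (parse_payload and parse_json_line are illustrative names, not part of the script above, and the key=value version assumes quoted values contain no spaces):

```python
import json
from typing import Optional

def parse_payload(payload: str) -> dict[str, str]:
    """Turn a tail of key=value tokens into a dict.

    Tokens without '=' are ignored; surrounding quotes are stripped.
    Caveat: splitting on spaces breaks quoted values that contain
    spaces — fine for quick tooling, not for adversarial input.
    """
    fields: dict[str, str] = {}
    for token in payload.split():
        key, sep, value = token.partition("=")
        if sep:  # keep only tokens that actually had an '='
            fields[key] = value.strip('"')
    return fields

def parse_json_line(line: str) -> Optional[dict]:
    """Parse one JSON-object-per-line record; None if it is not valid JSON."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None
```

You could call parse_payload on the "message" field that parse_line already extracts, turning service=auth user=guest into {"service": "auth", "user": "guest"}.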

  • Memory: Iterate over the file object instead of read() or readlines() for large inputs.
  • Robustness: Malformed lines → skip or log a count; do not assume every line is perfect.
  • Aggregation: Counter and defaultdict are compact ways to tally and bucket in the standard library.
  • Next steps: You can extend the same idea with Path.glob for multiple files, writing CSV or JSON output, or alerting when counts cross a threshold in a time window.
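As one example of that multi-file extension, a sketch using Path.glob (analyze_many is a hypothetical helper, reusing the same date time LEVEL message line shape as the tutorial):

```python
from collections import Counter
from pathlib import Path

def analyze_many(log_dir: Path, pattern: str = "*.log") -> Counter[str]:
    """Tally severities across every file in log_dir matching pattern."""
    counts: Counter[str] = Counter()
    # sorted() gives a stable file order across runs
    for path in sorted(log_dir.glob(pattern)):
        with path.open("r", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(maxsplit=3)
                if len(parts) == 4:  # skip malformed lines, as before
                    counts[parts[2]] += 1
    return counts
```

From here you could feed the combined Counter straight into print_report-style output, or compare per-file counts against a threshold.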