Commit graph

4 commits

Author SHA1 Message Date
Fernando Barbosa
01699eaaab fix: k8s agent fails to tail logs starving the cpu
Proposal to fix https://github.com/woodpecker-ci/woodpecker/issues/2253

We have observed several possibly-related issues on a Kubernetes
backend:

1. Agents behave erratically when dealing with certain log payloads. A common
   observation here is that steps producing a large volume of logs will cause
   some steps to be stuck "pending" forever.

2. Agents use far more CPU than expected; we often see 200-300
   millicores of CPU per Workflow per agent (as reported on #2253).

3. We commonly see Agents displaying thousands of error lines about
   parsing logs, often with very close timestamps, which may explain issues 1
   and 2 (as reported on #2253; a short note on these errors follows this list):

```
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
```

4. We've also observed that agents will sometimes drop out of the worker queue,
as reported on #2253.
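
For context on the marshaling errors quoted above: proto3 string fields must
contain valid UTF-8, so a log line carrying raw binary bytes makes the gRPC
payload unmarshalable. Below is a minimal Go sketch of that failure mode and
of one common mitigation (sanitizing the line), which is *not* what this pull
request does:

```
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

func main() {
	// A log line containing bytes that are not valid UTF-8, e.g. raw binary
	// output from a step. proto3 string fields must be valid UTF-8, so
	// marshaling a gRPC message carrying this payload fails with the
	// "string field contains invalid UTF-8" error shown above.
	line := string([]byte{0xff, 0xfe, 'l', 'o', 'g'})
	fmt.Println(utf8.ValidString(line)) // false

	// One common mitigation (not what this PR does) is to replace invalid
	// sequences with the replacement character before handing the line to gRPC.
	clean := strings.ToValidUTF8(line, "\uFFFD")
	fmt.Println(utf8.ValidString(clean)) // true
}
```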

Seeing as the logs point to `client_grpc.go:335`, this pull request
fixes the issue by:

1. Removing `codes.Internal` from the set of retryable gRPC statuses. Agent gRPC
calls that fail with `codes.Internal` will no longer be retried. There is no
general agreement on which gRPC codes should be retried, but Internal does not
seem to be a common one to retry -- if ever.

2. Adding a 30-second timeout to any retries. Currently, the exponential
retries have a maximum timeout of _15 minutes_. I assume this might be
required by some other functions so Agents resume their operation in
case the webserver restarts. Still, this is likely the cause behind the
large CPU increase, as agents can be stuck retrying thousands of requests
over a large window of time. The previous change alone should be enough to
solve this issue, but the timeout seems like a good idea to prevent
similar problems from arising in the future. A sketch of both changes
follows.
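
A minimal Go sketch of both changes. The retryable set and the helper below
are illustrative assumptions, not the agent's actual code; the point is that
`codes.Internal` is excluded and the overall retry window is capped at 30
seconds:

```
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// retryable mirrors change 1: codes.Internal is no longer treated as a
// status worth retrying. The exact set of retryable codes is an assumption.
func retryable(err error) bool {
	switch status.Code(err) {
	case codes.Unavailable, codes.DeadlineExceeded, codes.ResourceExhausted:
		return true
	default:
		return false // includes codes.Internal
	}
}

// withRetry is a hypothetical helper sketching change 2: exponential backoff
// between attempts, but never past an overall 30-second deadline.
func withRetry(ctx context.Context, call func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	delay := 100 * time.Millisecond
	for {
		err := call(ctx)
		if err == nil || !retryable(err) {
			return err
		}
		select {
		case <-ctx.Done():
			return err // give up once the 30-second window is exhausted
		case <-time.After(delay):
			delay *= 2
		}
	}
}

func main() {
	err := withRetry(context.Background(), func(ctx context.Context) error {
		// A call failing with codes.Internal returns immediately, no retries.
		return status.Error(codes.Internal, "string field contains invalid UTF-8")
	})
	fmt.Println(err)
}
```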
2024-04-08 16:32:29 -03:00
runephilosof-karnovgroup
adb2c82790
Update go module path for major version 2 (#2905)
https://go.dev/doc/modules/release-workflow#breaking

Fixes https://github.com/woodpecker-ci/woodpecker/issues/2913 fixes
#2654
```
runephilosof@fedora:~/code/platform-woodpecker/woodpecker-repo-configurator (master)$ go get go.woodpecker-ci.org/woodpecker@v2.0.0
go: go.woodpecker-ci.org/woodpecker@v2.0.0: invalid version: module contains a go.mod file, so module path must match major version ("go.woodpecker-ci.org/woodpecker/v2")
```
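
After this change the module declares the `/v2` path, so consumers fetch and
import it with the matching suffix. A sketch of the consumer side, applicable
to releases that include the updated go.mod:

```
go get go.woodpecker-ci.org/woodpecker/v2@latest
```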

---------

Co-authored-by: qwerty287 <80460567+qwerty287@users.noreply.github.com>
2023-12-08 08:15:08 +01:00
6543
5a7b689e30
Switch to go vanity urls (#2706)
Co-authored-by: Anbraten <anton@ju60.de>
2023-11-07 08:04:33 +01:00
Anbraten
556607b525
Rework log streaming and related functions (#1802)
closes #1801
closes #1815 
closes #1144
closes #983
closes #557
closes #1827
regression of #1791

# TODO
- [x] adjust log model (a sketch follows this checklist)
- [x] add migration for logs
- [x] send log line via grpc using step-id
- [x] save log-line to db
- [x] stream log-lines to UI
- [x] use less structs for log-data
- [x] make web UI work
  - [x] display logs loaded from db
  - [x] display streaming logs
- [ ] ~~make migration work~~ -> dedicated pull (#1828)
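
A rough sketch of the per-line log record implied by the checklist above;
the field names are assumptions for illustration, not the exact model
introduced in this pull request:

```
package model

// LogEntry ties every log line to the step that produced it, so lines can be
// saved to the database individually and streamed to the UI as they arrive.
type LogEntry struct {
	ID     int64  // database primary key
	StepID int64  // step the line belongs to (sent via gRPC with the line)
	Time   int64  // seconds since the step started
	Line   int    // position of the line within the step's output
	Data   []byte // the log text itself
}
```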

# TESTED
- [x] new logs are stored in database
- [x] log retrieval via cli (of new logs) works
- [x] log streaming works (tested via curl & webui)
- [x] log retrieval via web (of new logs) works

---------

Co-authored-by: 6543 <6543@obermui.de>
2023-06-06 09:52:08 +02:00