Compare commits

...

11 commits

Author SHA1 Message Date
Fernando Barbosa e52355972b
Merge 12b3f9745f into c6b2cd8a48 2024-04-28 11:26:34 +02:00
renovate[bot] c6b2cd8a48
chore(deps): update node.js to v22 (#3659) 2024-04-28 11:14:03 +02:00
renovate[bot] 325b1b5e57
chore(deps): update dependency trim to v1 (#3658) 2024-04-28 10:50:39 +02:00
Robert Kaussow 4b1ff6d1a7
Compare to pipeline created timestamp while using before/after filter (#3654) 2024-04-28 10:32:31 +02:00
renovate[bot] 2c3cd83402
chore(deps): update dependency got to v14 (#3657) 2024-04-28 10:16:25 +02:00
renovate[bot] a230e88c3a
chore(deps): lock file maintenance (#3656)

This PR contains the following updates:

| Update | Change |
|---|---|
| lockFileMaintenance | All locks refreshed |

🔧 This Pull Request updates lock files to use the latest dependency
versions.

---

### Configuration

📅 **Schedule**: Branch creation - "before 4am on Monday" (UTC),
Automerge - "before 4am" (UTC).

🚦 **Automerge**: Enabled.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

👻 **Immortal**: This PR will be recreated if closed unmerged. Get
[config help](https://togithub.com/renovatebot/renovate/discussions) if
that's undesired.

---


This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://developer.mend.io/github/woodpecker-ci/woodpecker).

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
2024-04-28 08:18:02 +02:00
Fernando Barbosa 12b3f9745f
Merge branch 'main' into woodpecker-fix-log-tail-cpu-lock 2024-04-15 10:16:11 -03:00
Fernando Barbosa 6a063b6e7b Add //nolint: gomnd to peer.Log context 2024-04-12 11:32:45 -03:00
Fernando Barbosa 8fd8b8f748
Merge branch 'main' into woodpecker-fix-log-tail-cpu-lock 2024-04-10 14:45:46 -03:00
Fernando Barbosa 85d03a63b0
Merge branch 'main' into woodpecker-fix-log-tail-cpu-lock 2024-04-09 19:14:21 -03:00
Fernando Barbosa 01699eaaab fix: k8s agent fails to tail logs starving the cpu
Proposal to fix https://github.com/woodpecker-ci/woodpecker/issues/2253

We have observed several possibly-related issues on a Kubernetes
backend:

1. Agents behave erratically when dealing with certain log payloads. A common
   observation here is that steps that produce a large volume of logs will cause
   some steps to be stuck "pending" forever.

2. Agents use far more CPU than expected: we often see 200-300
   millicores of CPU per Workflow per agent (as reported on #2253).

3. We commonly see Agents displaying thousands of error lines about
   parsing logs, often with very close timestamps, which may explain issues 1
   and 2 (as reported on #2253).

```
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
```

4. We've also observed that agents will sometimes drop out of the worker queue,
also as reported on #2253.

Seeing as the logs point to `client_grpc.go:335`, this pull request
fixes the issue by:

1. Removing codes.Internal from the retryable gRPC statuses. Agent gRPC
calls that fail with codes.Internal will no longer be retried. There is no
agreement on which gRPC codes should be retried, but Internal does not seem to be
a common one to retry -- if ever.

2. Adding a timeout of 30 seconds to any retries. Currently, the exponential
retries have a maximum timeout of _15 minutes_. I assume this might be
required by some other functions so that Agents resume their operation in
case the webserver restarts. Still, this is likely the cause of the
large CPU increase, as agents can be stuck retrying thousands of requests for
a long window of time. The previous change alone should be enough to
solve this issue, but I think this might be a good idea to prevent
similar problems from arising in the future.
2024-04-08 16:32:29 -03:00
12 changed files with 815 additions and 925 deletions

View file

@@ -3,7 +3,7 @@ when:
 variables:
 - &golang_image 'docker.io/golang:1.22.2'
-- &node_image 'docker.io/node:21-alpine'
+- &node_image 'docker.io/node:22-alpine'
 - &xgo_image 'docker.io/techknowlogick/xgo:go-1.22.1'
 - &xgo_version 'go-1.21.2'

View file

@@ -1,6 +1,6 @@
 variables:
 - &golang_image 'docker.io/golang:1.22.2'
-- &node_image 'docker.io/node:21-alpine'
+- &node_image 'docker.io/node:22-alpine'
 - &xgo_image 'docker.io/techknowlogick/xgo:go-1.22.1'
 - &xgo_version 'go-1.21.2'
 - &buildx_plugin 'docker.io/woodpeckerci/plugin-docker-buildx:3.2.1'

View file

@@ -13,7 +13,7 @@ steps:
 branch: renovate/*
 - name: spellcheck
-image: docker.io/node:21-alpine
+image: docker.io/node:22-alpine
 depends_on: []
 commands:
 - corepack enable

View file

@@ -6,7 +6,7 @@ when:
 - renovate/*
 variables:
-- &node_image 'docker.io/node:21-alpine'
+- &node_image 'docker.io/node:22-alpine'
 - &when
 path:
 # related config files

View file

@@ -98,7 +98,6 @@ func (c *client) Next(ctx context.Context, f rpc.Filter) (*rpc.Workflow, error)
 codes.Aborted,
 codes.DataLoss,
 codes.DeadlineExceeded,
-codes.Internal,
 codes.Unavailable:
 // non-fatal errors
 default:
@@ -144,7 +143,6 @@ func (c *client) Wait(ctx context.Context, id string) (err error) {
 codes.Aborted,
 codes.DataLoss,
 codes.DeadlineExceeded,
-codes.Internal,
 codes.Unavailable:
 // non-fatal errors
 default:
@@ -185,7 +183,6 @@ func (c *client) Init(ctx context.Context, id string, state rpc.State) (err erro
 codes.Aborted,
 codes.DataLoss,
 codes.DeadlineExceeded,
-codes.Internal,
 codes.Unavailable:
 // non-fatal errors
 default:
@@ -226,7 +223,6 @@ func (c *client) Done(ctx context.Context, id string, state rpc.State) (err erro
 codes.Aborted,
 codes.DataLoss,
 codes.DeadlineExceeded,
-codes.Internal,
 codes.Unavailable:
 // non-fatal errors
 default:
@@ -260,7 +256,6 @@ func (c *client) Extend(ctx context.Context, id string) (err error) {
 codes.Aborted,
 codes.DataLoss,
 codes.DeadlineExceeded,
-codes.Internal,
 codes.Unavailable:
 // non-fatal errors
 default:
@@ -301,7 +296,6 @@ func (c *client) Update(ctx context.Context, id string, state rpc.State) (err er
 codes.Aborted,
 codes.DataLoss,
 codes.DeadlineExceeded,
-codes.Internal,
 codes.Unavailable:
 // non-fatal errors
 default:
@@ -340,7 +334,6 @@ func (c *client) Log(ctx context.Context, logEntry *rpc.LogEntry) (err error) {
 codes.Aborted,
 codes.DataLoss,
 codes.DeadlineExceeded,
-codes.Internal,
 codes.Unavailable:
 // non-fatal errors
 default:
@@ -387,7 +380,6 @@ func (c *client) ReportHealth(ctx context.Context) (err error) {
 codes.Aborted,
 codes.DataLoss,
 codes.DeadlineExceeded,
-codes.Internal,
 codes.Unavailable:
 // non-fatal errors
 default:
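
The effect of the hunks above can be summarised as a small classification function. This is a sketch only: the `Code` type and `isRetryable` are hypothetical stand-ins so the example runs with the standard library alone, and the constants mirror the names and numeric values of the corresponding gRPC status codes.

```go
package main

import "fmt"

// Code is a hypothetical local stand-in for a gRPC status code.
type Code int

// Names and values mirror the gRPC status codes used in the hunks above.
const (
	DeadlineExceeded Code = 4
	Aborted          Code = 10
	Internal         Code = 13
	Unavailable      Code = 14
	DataLoss         Code = 15
)

// isRetryable reports whether a failed call should be retried. After this
// change, Internal is no longer in the retryable set.
func isRetryable(c Code) bool {
	switch c {
	case Aborted, DataLoss, DeadlineExceeded, Unavailable:
		return true // non-fatal errors
	default:
		return false
	}
}

func main() {
	fmt.Println(isRetryable(Unavailable)) // true
	fmt.Println(isRetryable(Internal))    // false
}
```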

View file

@@ -1,6 +1,6 @@
 # docker build --rm -f docker/Dockerfile.make -t woodpecker/make:local .
 FROM docker.io/golang:1.22-alpine3.19 as golang_image
-FROM docker.io/node:21-alpine3.19
+FROM docker.io/node:22-alpine3.19
 # renovate: datasource=repology depName=alpine_3_19/make versioning=loose
 ENV MAKE_VERSION="4.4.1-r2"

View file

@@ -53,8 +53,8 @@
 },
 "pnpm": {
 "overrides": {
-"trim": "^0.0.3",
-"got": "^11.8.5"
+"trim": "^1.0.0",
+"got": "^14.0.0"
 }
 }
 }

File diff suppressed because it is too large

View file

@@ -88,7 +88,11 @@ func (w *LineWriter) Write(p []byte) (n int, err error) {
 Type: LogEntryStdout,
 Line: w.num,
 }
-if err := w.peer.Log(context.Background(), line); err != nil {
+ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) //nolint: gomnd
+defer cancel()
+if err := w.peer.Log(ctx, line); err != nil {
 log.Error().Err(err).Str("step-uuid", w.stepUUID).Msg("fail to write pipeline log to peer")
 }
 w.num++

View file

@@ -59,11 +59,11 @@ func (s storage) GetPipelineList(repo *model.Repo, p *model.ListOptions, f *mode
 if f != nil {
 if f.After != 0 {
-cond = cond.And(builder.Gt{"pipeline_started": f.After})
+cond = cond.And(builder.Gt{"pipeline_created": f.After})
 }
 if f.Before != 0 {
-cond = cond.And(builder.Lt{"pipeline_started": f.Before})
+cond = cond.And(builder.Lt{"pipeline_created": f.Before})
 }
 }
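
The filter semantics after this change can be mirrored in memory: both bounds are exclusive, a zero value means "not set", and the comparison uses the created timestamp. The types below are simplified, hypothetical versions of the model types, and the comment about `Started` being zero for pipelines that have not started yet is an assumption made for illustration.

```go
package main

import "fmt"

// Pipeline is a simplified, hypothetical version of the model type.
type Pipeline struct {
	ID      int64
	Created int64 // Unix seconds
	Started int64 // Unix seconds; assumed to be 0 until the pipeline starts
}

// PipelineFilter holds optional exclusive bounds; 0 means "not set".
type PipelineFilter struct {
	Before int64
	After  int64
}

// filterPipelines keeps pipelines whose Created timestamp lies strictly
// between f.After and f.Before, ignoring bounds that are zero.
func filterPipelines(all []Pipeline, f *PipelineFilter) []Pipeline {
	var out []Pipeline
	for _, p := range all {
		if f != nil {
			if f.After != 0 && p.Created <= f.After {
				continue
			}
			if f.Before != 0 && p.Created >= f.Before {
				continue
			}
		}
		out = append(out, p)
	}
	return out
}

func main() {
	all := []Pipeline{
		{ID: 1, Created: 100},
		{ID: 2, Created: 200},
	}
	for _, p := range filterPipelines(all, &PipelineFilter{Before: 150}) {
		fmt.Println(p.ID) // 1
	}
}
```

Under that assumption, filtering on `Started` with an `After` bound would drop every pipeline that has not started, which filtering on `Created` avoids.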

View file

@@ -231,21 +231,19 @@ func TestPipelines(t *testing.T) {
 })
 g.It("Should get filtered pipelines", func() {
-dt1, _ := time.Parse(time.RFC3339, "2023-01-15T15:00:00Z")
 pipeline1 := &model.Pipeline{
-RepoID: repo.ID,
-Started: dt1.Unix(),
+RepoID: repo.ID,
 }
-dt2, _ := time.Parse(time.RFC3339, "2023-01-15T16:30:00Z")
 pipeline2 := &model.Pipeline{
-RepoID: repo.ID,
-Started: dt2.Unix(),
+RepoID: repo.ID,
 }
 err1 := store.CreatePipeline(pipeline1, []*model.Step{}...)
 g.Assert(err1).IsNil()
+time.Sleep(1 * time.Second)
+before := time.Now().Unix()
 err2 := store.CreatePipeline(pipeline2, []*model.Step{}...)
 g.Assert(err2).IsNil()
-pipelines, err3 := store.GetPipelineList(&model.Repo{ID: 1}, &model.ListOptions{Page: 1, PerPage: 50}, &model.PipelineFilter{Before: dt2.Unix()})
+pipelines, err3 := store.GetPipelineList(&model.Repo{ID: 1}, &model.ListOptions{Page: 1, PerPage: 50}, &model.PipelineFilter{Before: before})
 g.Assert(err3).IsNil()
 g.Assert(len(pipelines)).Equal(1)
 g.Assert(pipelines[0].ID).Equal(pipeline1.ID)

File diff suppressed because it is too large