fix: k8s agent fails to tail logs starving the cpu

Proposal to fix https://github.com/woodpecker-ci/woodpecker/issues/2253

We have observed several possibly-related issues on a Kubernetes
backend:

1. Agents behave erratically when dealing with certain log payloads. A common
   observation here is that steps which produce a large volume of logs will cause
   some steps to be stuck "pending" forever.

2. Agents use far more CPU than should be expected; we often see 200-300
   millicores of CPU per Workflow per agent (as reported on #2253).

3. We commonly see agents emitting thousands of error lines about failing to
   handle logs, often with very close timestamps, which may explain issues 1
   and 2 (as reported on #2253). A short sketch of why these log payloads fail
   to marshal follows this list.

```
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
```

4. We've also observed that agents will sometimes drop out of the worker queue,
   as also reported on #2253.
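
For context on the marshaling error above: proto3 `string` fields must contain valid UTF-8, while step output is arbitrary bytes, so a single malformed byte in a log line makes the whole log payload unmarshalable. A minimal illustration (not the agent's code):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// Step output is arbitrary bytes; a stray 0xff (e.g. from a binary tool)
	// makes the line invalid UTF-8.
	line := []byte("build output \xff truncated")

	// proto3 string fields must hold valid UTF-8, so grpc-go refuses to
	// marshal a log entry carrying such a line -- the "string field contains
	// invalid UTF-8" error seen above.
	fmt.Println(utf8.Valid(line)) // false
}
```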

Since the error logs point to `client_grpc.go:335`, this pull request
fixes the issue by:

1. Removing `codes.Internal` from the set of retryable gRPC status codes. Agent
   gRPC calls that fail with `codes.Internal` are no longer retried. There is no
   general agreement on which gRPC codes should be retried, but `codes.Internal`
   does not seem to be a common one to retry, if ever (see the sketch after this
   list).

2. Adding a 30-second timeout to any retries. Currently, the exponential
   retries run for a maximum of _15 minutes_. I assume such a long window may be
   required by some other functions so that agents resume operation in case the
   webserver restarts, but it is also the likely cause of the large CPU increase,
   as agents can be stuck issuing thousands of requests over a long window of
   time. The previous change alone should be enough to solve this issue, but
   bounding the retries seems like a good way to prevent similar problems from
   arising in the future (a sketch of how the new context deadline bounds the
   retries follows the `LineWriter.Write` diff at the end).
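
As a rough sketch of what change 1 amounts to (illustrative only; it assumes the grpc-go `status` and `codes` packages, and the real logic lives in the `switch` statements shown in the diff below):

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// retryable reports whether an RPC error is worth retrying. codes.Internal is
// no longer in this set, so a marshaling failure is surfaced once instead of
// being retried with exponential backoff for up to 15 minutes.
func retryable(err error) bool {
	switch status.Code(err) {
	case codes.Aborted,
		codes.DataLoss,
		codes.DeadlineExceeded,
		codes.Unavailable:
		return true
	default:
		return false
	}
}

func main() {
	err := status.Error(codes.Internal, "grpc: error while marshaling: string field contains invalid UTF-8")
	fmt.Println(retryable(err)) // false -- fail fast instead of spinning
}
```

Failing fast here means a log line that cannot be marshaled produces a single error instead of thousands of near-simultaneous retries.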
Author: Fernando Barbosa, 2024-04-05 19:22:59 -03:00
Commit: 01699eaaab (parent d0e63375fa)
2 changed files with 5 additions and 9 deletions

In `agent/rpc/client_grpc.go` (the file the error logs point to), `codes.Internal` is dropped from the retryable status codes in each of the client's RPC methods:

```diff
@@ -97,7 +97,6 @@ func (c *client) Next(ctx context.Context, f rpc.Filter) (*rpc.Workflow, error)
 			codes.Aborted,
 			codes.DataLoss,
 			codes.DeadlineExceeded,
-			codes.Internal,
 			codes.Unavailable:
 			// non-fatal errors
 		default:
@@ -143,7 +142,6 @@ func (c *client) Wait(ctx context.Context, id string) (err error) {
 			codes.Aborted,
 			codes.DataLoss,
 			codes.DeadlineExceeded,
-			codes.Internal,
 			codes.Unavailable:
 			// non-fatal errors
 		default:
@@ -184,7 +182,6 @@ func (c *client) Init(ctx context.Context, id string, state rpc.State) (err error) {
 			codes.Aborted,
 			codes.DataLoss,
 			codes.DeadlineExceeded,
-			codes.Internal,
 			codes.Unavailable:
 			// non-fatal errors
 		default:
@@ -225,7 +222,6 @@ func (c *client) Done(ctx context.Context, id string, state rpc.State) (err error) {
 			codes.Aborted,
 			codes.DataLoss,
 			codes.DeadlineExceeded,
-			codes.Internal,
 			codes.Unavailable:
 			// non-fatal errors
 		default:
@@ -259,7 +255,6 @@ func (c *client) Extend(ctx context.Context, id string) (err error) {
 			codes.Aborted,
 			codes.DataLoss,
 			codes.DeadlineExceeded,
-			codes.Internal,
 			codes.Unavailable:
 			// non-fatal errors
 		default:
@@ -300,7 +295,6 @@ func (c *client) Update(ctx context.Context, id string, state rpc.State) (err error) {
 			codes.Aborted,
 			codes.DataLoss,
 			codes.DeadlineExceeded,
-			codes.Internal,
 			codes.Unavailable:
 			// non-fatal errors
 		default:
@@ -339,7 +333,6 @@ func (c *client) Log(ctx context.Context, logEntry *rpc.LogEntry) (err error) {
 			codes.Aborted,
 			codes.DataLoss,
 			codes.DeadlineExceeded,
-			codes.Internal,
 			codes.Unavailable:
 			// non-fatal errors
 		default:
@@ -386,7 +379,6 @@ func (c *client) ReportHealth(ctx context.Context) (err error) {
 			codes.Aborted,
 			codes.DataLoss,
 			codes.DeadlineExceeded,
-			codes.Internal,
 			codes.Unavailable:
 			// non-fatal errors
 		default:
```

In `LineWriter.Write`, the `Log` RPC now runs under a 30-second timeout:

```diff
@@ -88,7 +88,11 @@ func (w *LineWriter) Write(p []byte) (n int, err error) {
 		Type: LogEntryStdout,
 		Line: w.num,
 	}
-	if err := w.peer.Log(context.Background(), line); err != nil {
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
+	if err := w.peer.Log(ctx, line); err != nil {
 		log.Error().Err(err).Str("step-uuid", w.stepUUID).Msg("fail to write pipeline log to peer")
 	}
 	w.num++
```
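
Finally, a minimal sketch (again, not the agent's code) of why the 30-second context above bounds the retries: any gRPC call issued with this context fails with `codes.DeadlineExceeded` once the deadline passes, so a failing `Log` call, retries included, can no longer keep the agent busy for the backoff's full 15-minute window.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Same pattern as LineWriter.Write: the deadline covers the Log RPC and
	// any retries attempted on its behalf.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	<-ctx.Done()           // fires once the 30-second deadline expires
	fmt.Println(ctx.Err()) // context.DeadlineExceeded
}
```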