Commit graph

79 commits

Author SHA1 Message Date
Markus Heiser d5ecda9930 [mod] move language recognition to get_search_query_from_webapp
To set the language from language recognition and hold the value selected by the
client, the previous implementation creates a copy of the SearchQuery object and
manipulates the SearchQuery object by calling function replace_auto_language().

This patch tries to implement a similar functionality in a more central place,
in function get_search_query_from_webapp() when the SearchQuery object is built
up.

Additionally, this patch uses the language preferred by the client, if language
recognition does not have a match / the existing implementation does not care
about client preferences and uses 'all' in case of no match.

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2023-04-15 22:23:33 +02:00
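Note: the fallback order described in the commit above can be sketched as follows. This is a minimal illustration, not the SearXNG implementation: the helper name resolve_search_language() and its parameters are hypothetical, and falling back to 'all' when the client states no preference is an assumption.

```python
from typing import Optional


def resolve_search_language(
    selected: str,
    detected: Optional[str],
    client_preferred: Optional[str],
) -> str:
    """Hypothetical helper: pick the language for a search query.

    selected         -- value chosen by the client ("auto" or a language code)
    detected         -- result of language recognition, None if no match
    client_preferred -- language preferred by the client, None if unknown
    """
    if selected != "auto":
        # An explicit selection always wins.
        return selected
    if detected:
        # Language recognition found a match.
        return detected
    # No match: use the client's preference (assumption: 'all' as last resort).
    return client_preferred or "all"
```

With this ordering, the value selected by the client ("auto") can stay untouched in the preferences while the resolved language is used for the query.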
Markus Heiser 27369ebec2 [fix] searxng_extra/update/update_engine_descriptions.py (part 1)
Follow up of #2269

The script to update the descriptions of the engines no longer works since
PR #2269 has been merged.

searx/engines/wikipedia.py
==========================

1. zh-classical.wikipedia.org was misused:

   - `zh-classical` is dedicated to classical Chinese [1] which is not
     traditional Chinese [2].

   - zh.wikipedia.org has LanguageConverter enabled [3] and is going to
     dynamically show simplified or traditional Chinese according to the
     HTTP Accept-Language header.

2. The update_engine_descriptions.py needs a list of all wikipedias.  The
   implementation from #2269 included only a reduced list:

   - https://meta.wikimedia.org/wiki/Wikipedia_article_depth
   - https://meta.wikimedia.org/wiki/List_of_Wikipedias

searxng_extra/update/update_engine_descriptions.py
==================================================

Before PR #2269 there was a match_language() function that did an approximation
using various methods.  With PR #2269 there are only the types in the data model
of the languages, which can be recognized by babel.  The approximation methods,
which are needed (only here) in the determination of the descriptions, must be
replaced by other methods.

[1] https://en.wikipedia.org/wiki/Classical_Chinese
[2] https://en.wikipedia.org/wiki/Traditional_Chinese_characters
[3] https://www.mediawiki.org/wiki/Writing_systems#LanguageConverter

Closes: https://github.com/searxng/searxng/issues/2330
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2023-04-15 16:03:59 +02:00
Markus Heiser 4d4aa13e1f [mod] remove obsolete EngineTraits.supported_languages
All engines have been migrated from ``supported_languages`` to the
``fetch_traits`` concept.  There is no longer a need for the obsolete code that
implements the ``supported_languages`` concept.

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2023-03-24 10:37:42 +01:00
Markus Heiser 2499899554 [mod] Google: reverse engineered & upgrade to data_type: traits_v1
Partial reverse engineering of the Google engines including an improved language
and region handling based on the engine.traits_v1 data.

Whenever possible the implementations of the Google engines try to make use of
the async REST APIs.  The get_lang_info() has been generalized to a
get_google_info() function / especially the region handling has been improved by
adding the cr parameter.

searx/data/engine_traits.json
  Add data type "traits_v1" generated by the fetch_traits() functions from:

  - Google (WEB),
  - Google images,
  - Google news,
  - Google scholar and
  - Google videos

  and remove data from obsolete data type "supported_languages".

  A traits.custom type that maps region codes to *supported_domains* is fetched
  from https://www.google.com/supported_domains

searx/autocomplete.py:
  Reverse engineered autocomplete from Google WEB.  Supports Google's languages and
  subdomains.  The old API suggestqueries.google.com/complete has been replaced
  by the async REST API: https://{subdomain}/complete/search?{args}

searx/engines/google.py
  Reverse engineering and extensive testing ..
  - fetch_traits():  Fetch languages & regions from Google properties.
  - always use the async REST API (formerly known as 'use_mobile_ui')
  - use *supported_domains* from traits
  - improved the result list by fetching './/div[@data-content-feature]'
    and parsing the type of the various *content features* --> thumbnails are
    added

searx/engines/google_images.py
  Reverse engineering and extensive testing ..
  - fetch_traits():  Fetch languages & regions from Google properties.
  - use *supported_domains* from traits
  - if exists, freshness_date is added to the result
  - issue 1864: result list has been improved a lot (due to the new cr parameter)

searx/engines/google_news.py
  Reverse engineering and extensive testing ..
  - fetch_traits():  Fetch languages & regions from Google properties.
    *supported_domains* is not needed but a ceid list has been added.
  - different region handling compared to Google WEB
  - fixed for various languages & regions (due to the new ceid parameter) /
    avoid CONSENT page
  - Google News no longer supports time range
  - result list has been fixed: XPath of pub_date and pub_origin

searx/engines/google_videos.py
  - fetch_traits():  Fetch languages & regions from Google properties.
  - use *supported_domains* from traits
  - add paging support
  - implement an async request ('asearch': 'arc' & 'async':
    'use_ac:true,_fmt:html')
  - simplified code (thanks to '_fmt:html' request)
  - issue 1359: fixed xpath of video length data

searx/engines/google_scholar.py
  - fetch_traits():  Fetch languages & regions from Google properties.
  - use *supported_domains* from traits
  - request(): include patents & citations
  - response(): fixed CAPTCHA detection (Scholar has its own CAPTCHA manager)
  - hardening XPath to iterate over results
  - fixed XPath of pub_type (has been changed from gs_ct1 to gs_cgt2 class)
  - issue 1769 fixed: new request implementation is no longer incompatible

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2023-03-24 10:37:42 +01:00
Markus Heiser 6e5f22e558 [mod] replace engines_languages.json by engines_traits.json
Implementations of the *traits* of the engines.

Engine's traits are fetched from the origin engine and stored in a JSON file in
the *data folder*.  Most often traits are languages and region codes and their
mapping from SearXNG's representation to the representation in the origin search
engine.

To load traits from the persistence::

    searx.enginelib.traits.EngineTraitsMap.from_data()

For new traits new properties can be added to the class::

    searx.enginelib.traits.EngineTraits

.. hint::

   Implementation is downward compatible to the deprecated *supported_languages
   method* from the vintage implementation.

   The vintage code is tagged as *deprecated* and can be removed when all engines
   have been ported to the *traits method*.

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2023-03-24 10:37:42 +01:00
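Note: the idea of traits as a mapping from SearXNG's representation to the origin engine's representation can be sketched as below. TraitsSketch and its methods are hypothetical stand-ins for illustration only; they are not the searx.enginelib.traits.EngineTraits class, and the example codes are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TraitsSketch:
    """Hypothetical, simplified stand-in for an engine's traits."""

    # SearXNG language tag -> language code used by the origin engine
    languages: Dict[str, str] = field(default_factory=dict)
    # SearXNG region tag -> region code used by the origin engine
    regions: Dict[str, str] = field(default_factory=dict)

    def language_for(self, locale_tag: str, default: Optional[str] = None) -> Optional[str]:
        # Try the full tag first ("de-AT"), then the bare language ("de").
        if locale_tag in self.languages:
            return self.languages[locale_tag]
        language = locale_tag.split("-", maxsplit=1)[0]
        return self.languages.get(language, default)

    def region_for(self, locale_tag: str, default: Optional[str] = None) -> Optional[str]:
        return self.regions.get(locale_tag, default)


traits = TraitsSketch(
    languages={"de": "lang_de", "zh": "lang_zh-CN"},
    regions={"de-AT": "countryAT"},
)
```

An engine's request function would then translate the user's locale once, for example traits.language_for("de-AT"), instead of carrying its own language table.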
Markus Heiser 150a90c84e [fix] fix threshold in replace_auto_language
[1] https://github.com/searxng/searxng/pull/2027#pullrequestreview-1322157677
[2] https://github.com/searxng/searxng/pull/1969#issuecomment-1345354529

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2023-03-05 08:29:58 +01:00
Alexandre Flament 6748e8e2d5 Add "Auto-detected" as a language.
When the user chooses "Auto-detected", the choice remains on the following queries.
The detected language is displayed.

For example "Auto-detected (en)":
* the next query language is going to be auto detected
* for the current query, the detected language is English.

This replaces the autodetect_search_language plugin.
2023-02-17 15:17:36 +00:00
Markus Heiser 4c06837a50 [mod] make python code pylint 2.16.1 compliant
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2023-02-10 13:59:21 +01:00
ArtikusHG 735e388cec
Merge branch 'master' into fasttext 2022-12-16 19:43:10 +00:00
ArtikusHG 1f8f8c1e91 Replace langdetect with fasttext 2022-12-16 21:07:39 +02:00
Alexandre Flament b971167ced move searx.shared.redisdb to searx.redisdb 2022-12-10 09:26:38 +01:00
Alexandre FLAMENT e92755d358 Initialize Redis in searx/webapp.py
settings.yml:
* The default URL was unix:///usr/local/searxng-redis/run/redis.sock?db=0
* The default URL is now "false"

The default URL makes the log difficult to deal with:
if the admin didn't install a Redis instance, the logs record a false error.

It worked before because SearXNG initialized the Redis connection when the limiter started.

In this commit, SearXNG initializes Redis in searx/webapp.py
so various components can use Redis without taking care of the initialization step.
2022-11-05 17:45:52 +01:00
Alexandre Flament fe419e355b The checker requires Redis
Remove the abstraction in searx.shared.SharedDict.
Implement a basic and dedicated scheduler for the checker using a Redis script.
2022-11-05 12:04:50 +01:00
Alexandre Flament 32e8c2cf09 searx.network: add "verify" option to the networks
Each network can define a verify option:
* false to disable certificate verification
* a path to existing certificate.

SearXNG uses SSL_CERT_FILE and SSL_CERT_DIR when they are defined
see https://www.python-httpx.org/environment_variables/#ssl_cert_file
2022-10-14 13:59:22 +00:00
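Note: the three shapes a verify value can take (true, false, or a path to a certificate) can be illustrated with Python's standard ssl module. This is a sketch under assumptions: build_ssl_context() is a hypothetical helper, and it uses the stdlib instead of the httpx client that searx.network relies on.

```python
import ssl
from typing import Union


def build_ssl_context(verify: Union[bool, str] = True) -> ssl.SSLContext:
    """Hypothetical helper: turn a 'verify' option into an SSL context."""
    if verify is False:
        # Disable certificate verification entirely.
        context = ssl.create_default_context()
        context.check_hostname = False  # must be disabled before CERT_NONE
        context.verify_mode = ssl.CERT_NONE
        return context
    if verify is True:
        # Default: verify against the system's trusted certificates.
        return ssl.create_default_context()
    # Otherwise treat the value as a path to an existing certificate file.
    return ssl.create_default_context(cafile=verify)
```

Disabling verification is only reasonable for trusted, internal endpoints; pointing to a certificate file is the safer choice when a private CA is involved.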
Markus Heiser ba8959ad7c [fix] typos / reported by @kianmeng in searx PR-3366
[PR-3366] https://github.com/searx/searx/pull/3366

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2022-09-27 18:32:14 +02:00
Markus Heiser 8df1f0c47e [mod] add 'Accept-Language' HTTP header to online processors
Most engines that support languages (and regions) use the Accept-Language from
the WEB browser to build a response that fits the language (and region).

- add new engine option: send_accept_language_header

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2022-08-01 17:01:59 +02:00
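Note: a minimal sketch of how an Accept-Language value could be derived from the language and region of a query. The helper name and the q-values are assumptions chosen for illustration; they are not taken from the commit.

```python
from typing import Optional


def accept_language_header(language: str, territory: Optional[str] = None) -> str:
    """Hypothetical helper: build an Accept-Language header value."""
    if territory:
        # Prefer the regional variant, then the bare language, then anything.
        return f"{language}-{territory},{language};q=0.9,*;q=0.5"
    return language


# Only engines that opt in (send_accept_language_header) would get the header.
headers = {"Accept-Language": accept_language_header("de", "AT")}
```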
Markus Heiser a2badb4fe4 [doc] add description of method EngineProcessor.get_params()
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2022-08-01 16:42:33 +02:00
Alexandre Flament 2babf59adc [fix] pyright reported errors
The errors make pyright usage useless since a new error won't be seen [1].

[1] https://github.com/searxng/searxng/pull/1569

```
  searx/compat.py:11:27 - error: Expression of type "Type[cached_property[_T@cached_property]]" cannot be assigned to declared type "Type[cached_property]"
    "Type[cached_property[_T@cached_property]]" is incompatible with "Type[cached_property]"
    Type "Type[cached_property[_T@cached_property]]" cannot be assigned to type "Type[cached_property]" (reportGeneralTypeIssues)
  searx/utils.py:69:36 - error: Expression of type "None" cannot be assigned to parameter of type "str"
    Type "None" cannot be assigned to type "str" (reportGeneralTypeIssues)
  searx/utils.py:573:85 - error: Expression of type "None" cannot be assigned to parameter of type "int"
    Type "None" cannot be assigned to type "int" (reportGeneralTypeIssues)
  searx/webapp.py:1306:22 - error: Argument of type "str" cannot be assigned to parameter "__a" of type "BytesPath" in function "join"
    Type "str" cannot be assigned to type "BytesPath"
      "str" is incompatible with "bytes"
      "str" is incompatible with protocol "PathLike[bytes]"
        "__fspath__" is not present (reportGeneralTypeIssues)
  searx/webapp.py:1306:68 - error: Argument of type "Literal['themes']" cannot be assigned to parameter "paths" of type "BytesPath" in function "join"
    Type "Literal['themes']" cannot be assigned to type "BytesPath"
      "Literal['themes']" is incompatible with "bytes"
      "Literal['themes']" is incompatible with protocol "PathLike[bytes]"
        "__fspath__" is not present (reportGeneralTypeIssues)
  searx/webapp.py:1306:78 - error: Argument of type "str | Any | None" cannot be assigned to parameter "paths" of type "BytesPath" in function "join"
    Type "str | Any | None" cannot be assigned to type "BytesPath"
      Type "str" cannot be assigned to type "BytesPath"
        "str" is incompatible with "bytes"
        "str" is incompatible with protocol "PathLike[bytes]"
          "__fspath__" is not present (reportGeneralTypeIssues)
  searx/webapp.py:1306:85 - error: Argument of type "Literal['img']" cannot be assigned to parameter "paths" of type "BytesPath" in function "join"
    Type "Literal['img']" cannot be assigned to type "BytesPath"
      "Literal['img']" is incompatible with "bytes"
      "Literal['img']" is incompatible with protocol "PathLike[bytes]"
        "__fspath__" is not present (reportGeneralTypeIssues)
  searx/engines/mongodb.py:8:6 - warning: Import "pymongo" could not be resolved (reportMissingImports)
  searx/engines/mysql_server.py:9:8 - warning: Import "mysql.connector" could not be resolved (reportMissingImports)
  searx/engines/postgresql.py:9:8 - warning: Import "psycopg2" could not be resolved from source (reportMissingModuleSource)
  searx/engines/xpath.py:187:28 - warning: "categories" is not defined (reportUndefinedVariable)
  searx/search/__init__.py:184:82 - warning: "flask" is not defined (reportUndefinedVariable)
  searx/search/checker/background.py:19:26 - error: Type of "schedule" is partially unknown
    Type of "schedule" is "(delay: Any, func: Any, *args: Any) -> Literal[True]" (reportUnknownVariableType)
  searx/shared/__init__.py:8:12 - warning: Import "uwsgi" could not be resolved (reportMissingImports)
  searx/shared/shared_uwsgi.py:5:8 - warning: Import "uwsgi" could not be resolved (reportMissingImports)
```
2022-07-30 18:04:44 +02:00
Markus Heiser c63fab6928
Merge pull request #1443 from return42/fix-online_dictionary
[fix] online_dictionary: regular expression
2022-07-07 16:25:10 +02:00
Markus Heiser 480476fdf3 [fix] online_dictionary: regular expression
The query term of an engine-type `online_dictionary` can consist of more than one
word.

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2022-07-07 15:58:29 +02:00
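Note: a sketch of a regular expression that accepts a multi-word query term after the language pair. The pattern and the helper name are assumptions for illustration, not the expression used in the online_dictionary processor.

```python
import re
from typing import Optional, Tuple

# "<from>-<to> <term>", where <term> may contain several words.
PARSER_RE = re.compile(r"^([a-z]+)-([a-z]+)\s+(.+)$", re.IGNORECASE)


def parse_dictionary_query(query: str) -> Optional[Tuple[str, str, str]]:
    """Hypothetical helper: split 'en-de good morning' into its parts."""
    match = PARSER_RE.match(query.strip())
    if match is None:
        return None
    from_lang, to_lang, term = match.groups()
    return from_lang.lower(), to_lang.lower(), term
```

The final group uses `.+` rather than a single-word pattern, so everything after the language pair is kept as the query term.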
Émilien Devos 63a995b8c1 Better explanation for the use of use_mobile_ui 2022-07-06 00:10:09 +02:00
Emilien Devos 0d4c066119 notify the user that use_mobile_ui parameter exists 2022-06-11 17:20:56 +02:00
Markus Heiser 2de007138c [fix] prepare for pylint 2.14.0
Remove issues reported by Pylint 2.14.0:

- no-self-use: has been moved to optional extension [1]
- The refactoring checker now also raises 'consider-using-generator' messages
  for max(), min() and sum(). [2]

.pylintrc:
  - <option name>-hint was removed long ago; Pylint 2.14.0 raises an
    error on invalid options
  - bad-continuation and bad-whitespace have been removed [3]

[1] https://pylint.pycqa.org/en/latest/whatsnew/2/2.14/summary.html#removed-checkers
[2] https://pylint.pycqa.org/en/latest/whatsnew/2/2.14/full.html#what-s-new-in-pylint-2-14-0
[3] https://pylint.pycqa.org/en/latest/whatsnew/2/2.6/summary.html#summary-release-highlights

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2022-06-03 15:41:52 +02:00
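Note: the 'consider-using-generator' message mentioned above concerns list comprehensions passed to sum(), min() and max(). A small illustration with made-up values:

```python
durations_ms = [400, 1200, 700]

# Flagged form: builds a temporary list before summing.
total_with_list = sum([d for d in durations_ms if d < 1000])

# Preferred form: a generator expression, no intermediate list.
total_with_generator = sum(d for d in durations_ms if d < 1000)

# The same applies to max() and min().
slowest = max(d for d in durations_ms)
```

Both forms give the same result; the generator form only avoids allocating the intermediate list.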
Markus Heiser e92d40c854 [enh] implement an OnlineUrlSearchProcessor
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2022-01-30 16:05:08 +01:00
Martin Fischer 640c404844 [pyright:strict] searx.search.checker.background 2022-01-27 22:07:12 +01:00
Alexandre Flament 5439dd5fb1 [fix] checker: fix image fetch
Since https://github.com/searxng/searxng/pull/354
the searx.network.stream(...) returns a tuple

This commit updates the checker code according to
this function signature change.
2022-01-22 16:11:42 +01:00
Martin Fischer def62c3a47 [typing] add type hints for dictionaries 2022-01-17 11:42:48 +01:00
Alexandre Flament 2134703b4b [enh] settings.yml: implement general.enable_metrics
* allow not to record metrics (response time, etc...)
* this commit doesn't change the UI. If the metrics are disabled,
  /stats and /stats/errors return an empty response, and
  in /preferences the response time and reliability columns are empty.
2022-01-05 19:03:04 +01:00
Markus Heiser 3d96a9839a [format.python] initial formatting of the python code
This patch was generated by black [1]::

    make format.python

[1] https://github.com/psf/black

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-12-27 09:26:22 +01:00
Markus Heiser fcdc2c2cd2 [format.python] disable py code formatting for some hunks of code
Disable the python code formatting from python-black, where the readability of
code suffers by formatting.

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-12-27 09:16:03 +01:00
Alexandre Flament f9c6393502 [enh] verify that Tor proxy works every time searx starts
based on @MarcAbonce commit on searx
2021-10-12 21:01:02 +02:00
Alexandre Flament 29893cf816 [fix] searx.network.stream: fix memory leak 2021-09-28 19:28:12 +02:00
Alexandre Flament 2eab89b4ca [fix] checker: fix memory usage
* download images using the "image_proxy" network (HTTP/1 instead of HTTP/2)
* don't cache data: URL (reduce memory usage)
* after each test: purge image URL cache then call garbage collector
* download only the first 64kb of images
2021-09-28 15:26:02 +02:00
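Note: limiting a download to the first 64 KiB can be sketched over any iterable of byte chunks. read_limited() is a hypothetical helper for illustration; the checker's actual streaming code is not shown in this log.

```python
from typing import Iterable

MAX_IMAGE_BYTES = 64 * 1024  # only the first 64 KiB are needed


def read_limited(chunks: Iterable[bytes], limit: int = MAX_IMAGE_BYTES) -> bytes:
    """Hypothetical helper: collect at most `limit` bytes from a chunk stream."""
    buffer = bytearray()
    for chunk in chunks:
        remaining = limit - len(buffer)
        if remaining <= 0:
            # Stop consuming the stream once the limit is reached.
            break
        buffer.extend(chunk[:remaining])
    return bytes(buffer)
```

Because the loop stops consuming the stream at the limit, a very large image costs no more memory than a small one.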
Markus Heiser 443bf35e09 [pylint] fix global-variable-not-assigned issues
If there is no write access, there is no need for global.  Remove global
statement if there is no assignment.

global-variable-not-assigned:
  Using global for names but no assignment is done.  Used when a variable is
  defined through the "global" statement but no assignment to this variable is
  done.

In Pylint 2.11 the global-variable-not-assigned checker now catches global
variables that are never reassigned in a local scope and catches (reassigned)
functions [1][2]

[1] https://pylint.pycqa.org/en/latest/whatsnew/2.11.html
[2] https://github.com/PyCQA/pylint/issues/1375

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-09-17 10:14:27 +02:00
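Note: a short illustration of when the global statement is needed and when it is not. All names are made up for this example.

```python
_settings = {"debug": False}
_request_counter = 0


def is_debug() -> bool:
    # Read access only: no "global" statement needed.
    return _settings["debug"]


def enable_debug() -> None:
    # Mutating the dict does not rebind the name: still no "global" needed.
    _settings["debug"] = True


def next_request_id() -> int:
    # The name itself is reassigned, so "global" is required here.
    global _request_counter
    _request_counter += 1
    return _request_counter
```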
Alexandre Flament b513917ef9 [mod] searx.metrics & searx.search: use the engine loggers
metrics & processors use the engine logger
2021-09-10 21:49:34 +02:00
Alexandre Flament 0b27c8698f [doc] update docs/dev/plugins.rst 2021-09-10 10:58:22 +02:00
Alexandre Flament 660c180170 [mod] plugin: call on_result after each engine from the ResultContainer
Currently, searx.search.Search calls on_result once the engine results have been merged
(ResultContainer.order_results).

on_result plugins can rewrite the results: once the URL(s) are modified, even if the results could then be merged,
they are not, since ResultContainer.order_results has already been called.

This commit calls on_result for each result of each engine.
In addition the on_result function can return False to remove the result.

Note: the on_result function now runs on the engine thread instead of the Flask thread.
2021-09-09 11:31:44 +02:00
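Note: the behaviour described above (each result passes through the plugins before merging, and a plugin returning False drops the result) can be sketched as follows. The callback signature is simplified to a single argument and all names are hypothetical; the real plugin hook is not reproduced here.

```python
from typing import Callable, Dict, Iterable, List

Result = Dict[str, str]
OnResult = Callable[[Result], bool]


def apply_on_result(results: Iterable[Result], plugins: List[OnResult]) -> List[Result]:
    """Hypothetical sketch: run every plugin on every result before merging."""
    kept = []
    for result in results:
        # all() stops at the first plugin that returns False.
        if all(plugin(result) for plugin in plugins):
            kept.append(result)
    return kept


def strip_query_string(result: Result) -> bool:
    # Rewrites the URL in place and keeps the result.
    result["url"] = result["url"].split("?", maxsplit=1)[0]
    return True


def drop_blocked_host(result: Result) -> bool:
    # Returning False removes the result.
    return "blocked.example" not in result["url"]


engine_results = [
    {"url": "https://a.example/page?utm=1"},
    {"url": "https://blocked.example/x"},
]
kept_results = apply_on_result(engine_results, [strip_query_string, drop_blocked_host])
```

Since the URLs are rewritten before the results are ordered, two results that end up with the same URL can still be merged afterwards.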
Markus Heiser 2a3b9a2e26 [pylint] searx: drop no longer needed 'missing-function-docstring'
Suggested-by: @dalf https://github.com/searxng/searxng/issues/102#issuecomment-914168470
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-09-07 13:34:35 +02:00
Alexandre Flament 2f363858b8 [fix] searx.search.checker.get_result() always return a dict
So checker_results['status'] == 'ok' is enough to check the checker result.
See searx/webapp.py, /preferences endpoint
2021-08-16 08:29:16 +02:00
Markus Heiser 24f2376c11 [pylint] prepare for pylint v2.9.3 / fix some (new) pylint issues
The upgrade from pylint v2.8.3 to 2.9.3 raises some new issues::

  searx/search/checker/__main__.py:37:26: R1732: Consider using 'with' for resource-allocating operations (consider-using-with)
  searx/search/checker/__main__.py:38:26: R1732: Consider using 'with' for resource-allocating operations (consider-using-with)
  searx/search/processors/__init__.py:20:0: R0402: Use 'from searx import engines' instead (consider-using-from-import)
  searx/preferences.py:182:19: C0207: Use data.split('-', maxsplit=1)[0] instead (use-maxsplit-arg)
  searx/preferences.py:506:15: R1733: Unnecessary dictionary index lookup, use 'user_setting' instead (unnecessary-dict-index-lookup)
  searx/webapp.py:436:0: C0206: Consider iterating with .items() (consider-using-dict-items)
  searx/webapp.py:950:4: C0206: Consider iterating with .items() (consider-using-dict-items)

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-07-03 17:54:08 +02:00
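Note: two of the messages listed above (use-maxsplit-arg and consider-using-dict-items) point to small idiom changes. The values below are made up for illustration.

```python
locale_tag = "de-AT-x"

# use-maxsplit-arg: only the first element is needed, so split once.
language = locale_tag.split("-", maxsplit=1)[0]

engine_timeouts = {"google": 3.0, "wikipedia": 2.0}

# consider-using-dict-items: iterate over key/value pairs directly
# instead of looking up engine_timeouts[name] inside the loop.
slow_engines = [name for name, timeout in engine_timeouts.items() if timeout > 2.5]
```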
Markus Heiser f122cb0e27 [fix] typo: online_dictionnary --> online_dictionary
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-06-04 15:05:58 +02:00
Alexandre Flament 6fa114c9ba [mod] settings_default: remove searx.search.max_request_timeout global variable 2021-06-01 08:10:15 +02:00
Markus Heiser 6f1446d55f [pylint] searx/search/__init__.py & replace lic-text by SPDX tag
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-05-21 17:31:22 +02:00
Alexandre Flament 426fadccb3 [mod] remove gc.collect() after each user request 2021-05-21 17:23:18 +02:00
Markus Heiser fa0d05c313 [pylint] checker/__main__.py & checker/background.py
Lint files that have been touched by [PR #58]

[PR #58] https://github.com/searxng/searxng/pull/58

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-05-05 16:47:02 +02:00
Alexandre Flament 8c1a65d32f [mod] multithreading only in searx.search.* packages
it prepares the new architecture change,
everything about multithreading is moved into the searx.search.* packages

previously the call to the "init" function of the engines was done in searx.engines:
* the network was not set (request not sent using the defined proxy)
* it required monkey patching the code to avoid HTTP requests during the tests
2021-05-05 13:12:42 +02:00
Markus Heiser 924f9afea3 [lint] pylint searx/search/processors files / BTW add some doc-strings
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
2021-04-27 15:13:39 +02:00
Alexandre Flament b1557b5443 [mod] processors: show identical error messages on /search and /stats 2021-04-27 14:20:07 +02:00
Alexandre Flament 7cfd8d900a [mod] oscar: /preferences , engines tab: report engine times
* display the median time instead of the average.
* add a "Reliability" column (sum up the metrics and the checker results).
* the "selected language", "SafeSearch", "Time range" values are displayed as "broken" when the checker tests fail.
2021-04-21 16:24:46 +02:00
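Note: a brief illustration of why the median is more robust than the average for engine response times. The sample values are made up.

```python
from statistics import mean, median

# Four fast responses and one timeout-like outlier (seconds).
response_times = [0.3, 0.4, 0.5, 0.4, 6.0]

average_time = mean(response_times)   # pulled up by the single outlier
median_time = median(response_times)  # reflects the typical response
```

A single slow request moves the average well above one second here, while the median stays at the typical value.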
Alexandre Flament c27fef1cde [mod] metrics: add secondary parameter
Some errors won't stop the engine:
* additional HTTP redirects for example
* some invalid results

secondary=True allows flagging these errors as not important.
2021-04-21 16:24:46 +02:00