FreshRSS/app/Services/ExportService.php
Alexandre Alapetite 1f466d7a2e Implement custom order-by (#7149)
Add option to sort results by received date (existing, default), publication date, title, URL (link), random.

fix https://github.com/FreshRSS/FreshRSS/issues/1771
fix https://github.com/FreshRSS/FreshRSS/issues/2083
fix https://github.com/FreshRSS/FreshRSS/issues/2119
fix https://github.com/FreshRSS/FreshRSS/issues/2596
fix https://github.com/FreshRSS/FreshRSS/issues/3204
fix https://github.com/FreshRSS/FreshRSS/issues/4405
fix https://github.com/FreshRSS/FreshRSS/issues/5529
fix https://github.com/FreshRSS/FreshRSS/issues/5864
fix https://github.com/FreshRSS/Extensions/issues/161

URL parameters:
* `&sort=id` (current behaviour, sorting according to newest received articles)
* `&sort=date` (publication date, which is not indicative of how new an article is)
* `&sort=title`
* `&sort=link`
* `&sort=rand` (random order - which disables infinite scrolling, at least for now)

combined with `&order=ASC` or `&order=DESC`
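
As a minimal sketch of how these two parameters could be validated, the helper below falls back to received-date sorting for unknown values. The function name `parseSortParams` and the `DESC` default are assumptions made for illustration; this is not the code from this PR.

```php
<?php
/**
 * Hypothetical helper (not FreshRSS's actual code): normalise the `sort` and
 * `order` URL parameters, falling back to sorting by received date.
 *
 * @param array<string,mixed> $query e.g. $_GET
 * @return array{sort:string,order:string}
 */
function parseSortParams(array $query): array {
	$allowedSorts = ['id', 'date', 'title', 'link', 'rand'];
	$sort = is_string($query['sort'] ?? null) ? $query['sort'] : 'id';
	if (!in_array($sort, $allowedSorts, true)) {
		$sort = 'id';	// Unknown values fall back to received date
	}
	$order = is_string($query['order'] ?? null) ? strtoupper($query['order']) : 'DESC';
	if ($order !== 'ASC' && $order !== 'DESC') {
		$order = 'DESC';	// Assumed default: newest first
	}
	return ['sort' => $sort, 'order' => $order];
}
```

For example, `parseSortParams(['sort' => 'title', 'order' => 'asc'])` yields `sort=title` and `order=ASC`, while an unrecognised `sort` value is replaced by `id`.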

![image](https://github.com/user-attachments/assets/2de5aef1-604e-4a73-a147-569f6f42a1be)

## Implementation notes

Sorting by *received date* (id), which is the default and was the only criterion before this PR, has the best sorting characteristics:
* *uniqueness*: no entries have the exact same received date
* *monotonicity*: new entries always have a higher received date
* *performance*: this field is efficiently indexed in the database for fast access, including for paging (other fields could also be indexed, but with lower effective performance)

In contrast, sorting criteria such as *publication date*, *title*, or *link* are neither unique nor monotonic. In particular, multiple articles may share the same *publication date*, and we may receive articles with a *publication date* far in the future, and then later some new articles with a *publication date* far in the past.

To understand why sorting by *publication date* is problematic, it helps to think about sorting by *title* or by *link*, as sorting by *title* and by *publication date* share more or less the same characteristics.

### Problem 1: new articles

New articles may be received in the background after the current view has been displayed, and before the next user action such as *mark all as read*. Due to the lack of *monotonicity* when sorting by e.g. *publication date* or *title*, users risk marking as read a batch of articles containing some fresh articles without seeing them.

Mitigation: A parameter `idMax` tracks the maximum ID related to a batch of actions such as *mark all as read* to exclude articles received after those that are displayed.
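
The effect of `idMax` can be illustrated with a small in-memory sketch. Entries are plain arrays here, and `markAllAsRead` is a hypothetical function written for this example, not the DAO method used by FreshRSS.

```php
<?php
/**
 * Illustrative only: mark as read the entries whose ID is not above $idMax.
 *
 * @param list<array{id:int,title:string,is_read:bool}> $entries
 * @return list<array{id:int,title:string,is_read:bool}>
 */
function markAllAsRead(array $entries, int $idMax): array {
	foreach ($entries as &$entry) {
		if ($entry['id'] <= $idMax) {
			$entry['is_read'] = true;
		}
	}
	unset($entry);	// Break the reference left by foreach
	return $entries;
}

// Entries displayed on screen, sorted by title; the highest ID shown is 102
$entries = [
	['id' => 102, 'title' => 'Alpha', 'is_read' => false],
	['id' => 101, 'title' => 'Beta', 'is_read' => false],
];
$idMax = 102;

// A new entry arrives in the background; by title it sorts between the two above
$entries[] = ['id' => 103, 'title' => 'Alpine', 'is_read' => false];

// Only the two entries that were actually displayed are marked as read
$entries = markAllAsRead($entries, $idMax);
```

Without the `idMax` bound, the entry with ID 103 would have been marked as read although it was never shown.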

### Problem 2: paging / pagination

When navigating articles, only a few articles are displayed, and a new "page" of articles needs to be fetched from the database when scrolling down or when clicking the button to show more articles. When sorting by e.g. *publication date* or *title*, it is not trivial to show the next page without re-showing some of the same articles, and without skipping any. Indeed, views often have additional criteria such as showing only unread articles, and users may mark some articles as read while viewing them, thereby removing some articles from the previous pages. And as in *Problem 1*, new articles may have been received in the background. Consequently, it is not possible to use `OFFSET` to implement pagination (which is, in particular, why the patches suggested by a few users were incorrect).

Mitigation: `idMax` is also used (just like for *Problem 1*) and a *Keyset Pagination* approach is used, combining an unstable sorting criterion such as *publication date* or *title* with *id* to ensure stable sorting. (So: 2 sorting criteria + 1 filter criterion.)

See e.g. https://www.alwaysdeveloping.net/dailydrop/2022/07/01-keyset-pagination/
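
The idea can be sketched on an in-memory list, sorting by *title* with *id* as tie-breaker and `idMax` as filter. The function `nextPage` and the array-based entries are assumptions made for this illustration; the actual implementation works in SQL.

```php
<?php
/**
 * Illustrative only: return the next page of entries sorted by (title ASC, id ASC),
 * starting strictly after the cursor ($afterTitle, $afterId), and ignoring
 * entries received after $idMax.
 *
 * @param list<array{id:int,title:string}> $entries
 * @return list<array{id:int,title:string}>
 */
function nextPage(array $entries, ?string $afterTitle, int $afterId, int $idMax, int $limit): array {
	$candidates = array_filter($entries, static function (array $e) use ($afterTitle, $afterId, $idMax): bool {
		if ($e['id'] > $idMax) {
			return false;	// Filter: received after the first page was displayed
		}
		if ($afterTitle === null) {
			return true;	// First page: no cursor yet
		}
		$cmp = strcmp($e['title'], $afterTitle);
		// Keyset condition: title after the cursor, or same title and higher id
		return $cmp > 0 || ($cmp === 0 && $e['id'] > $afterId);
	});
	usort($candidates, static fn (array $a, array $b): int =>
		strcmp($a['title'], $b['title']) ?: ($a['id'] <=> $b['id']));
	return array_slice($candidates, 0, $limit);
}

$entries = [
	['id' => 1, 'title' => 'B'],
	['id' => 2, 'title' => 'A'],
	['id' => 3, 'title' => 'B'],
	['id' => 4, 'title' => 'C'],
];
$idMax = 4;
$page1 = nextPage($entries, null, 0, $idMax, 2);

// Meanwhile, entry 2 is marked as read and leaves the view, and entry 5 arrives
$entries = array_values(array_filter($entries, static fn (array $e): bool => $e['id'] !== 2));
$entries[] = ['id' => 5, 'title' => 'AA'];

// The cursor is the last entry of the previous page, not a row offset
$last = $page1[count($page1) - 1];
$page2 = nextPage($entries, $last['title'], $last['id'], $idMax, 2);
```

Because the second page is defined relative to the last entry seen rather than to a row count, removing entry 2 and receiving entry 5 between the two calls neither repeats nor skips an entry.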

### Problem 3: performance

Sorting by anything other than *received date* (id) is doomed to be slow(er) due to the combination of 3 criteria (see *Problem 2*). An `OFFSET` approach (which is not possible anyway, as explained) would be even slower. Furthermore, there are no SQL indexes for these other criteria at the moment, and they would not necessarily help much, because of the multiple sorting criteria needed and the `OR` logic involved, which is difficult for databases to optimise.

The nicest syntax would use tuples and corresponding indexes, but that is poorly supported by MySQL: https://bugs.mysql.com/bug.php?id=104128

Mitigation: a compatibility SQL syntax is used to implement *Keyset Pagination*
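
To make the difference concrete, the tuple form would read `(title, id) > (:v, :id)`, whereas a compatibility form expands it with `OR`. The helper below is a hedged sketch of such an expansion; `keysetClause`, the placeholder names, and the bare column names are hypothetical and do not reproduce the SQL generated by FreshRSS.

```php
<?php
/**
 * Illustrative only: build a WHERE / ORDER BY fragment for keyset pagination
 * on ($column, id), without relying on tuple comparison.
 *
 * @return array{0:string,1:array<string,int|string>} SQL fragment and values to bind
 */
function keysetClause(string $column, string $order, int|string $afterValue, int $afterId, int $idMax): array {
	// Whitelist the column, since it is interpolated into the SQL text
	if (!in_array($column, ['date', 'title', 'link'], true)) {
		throw new InvalidArgumentException('Unsupported sort column');
	}
	$dir = $order === 'ASC' ? 'ASC' : 'DESC';
	$op = $dir === 'ASC' ? '>' : '<';
	// Expanded equivalent of the tuple form: (col, id) > (:v, :id)
	$sql = "id <= :idMax AND ({$column} {$op} :v1 OR ({$column} = :v2 AND id {$op} :id))"
		. " ORDER BY {$column} {$dir}, id {$dir}";
	// The value is bound twice under distinct names, because PDO does not allow
	// reusing a named placeholder unless emulated prepares are enabled
	return [$sql, [':idMax' => $idMax, ':v1' => $afterValue, ':v2' => $afterValue, ':id' => $afterId]];
}

[$sql, $params] = keysetClause('title', 'ASC', 'B', 1, 4);
```

The resulting fragment shows the 2 sorting criteria and the 1 filter criterion mentioned in *Problem 2*, as well as the `OR` logic that makes indexing less effective.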

### Problem 4: user confusion

Several users have shown that they do not fully understand the difference between *received date* and *publication date*, and particularly not the pitfalls of *publication date*.

Mitigation: the menus to mark-as-read *before 1 day* and *before 1 week* are disabled when sorting by anything other than *received date*. Likewise, the separation headers *Today*, *Yesterday*, and *Before yesterday* are only shown when sorting by *received date*.

Again here, to better understand why, it helps to think about sorting by *title* or by *link*, as sorting by *title* and by *publication date* share more or less the same characteristics.

* [ ] We should write a Q&A and/or documentation about the problems associated with *sorting by publication date*: risks of not noticing new publications, of inadvertently marking them as read, of having some articles with a date in the future hanging at the top of the views (or at the bottom when sorting in ascending order), performance, etc.

### Problem 5: APIs

Sorting by anything other than *received date* breaks the guarantees needed for a successful synchronisation via API.

Mitigation: sorting by *received date* is ensured for all API calls.
2025-01-06 16:00:00 +01:00

174 lines
5.5 KiB
PHP

<?php
declare(strict_types=1);

/**
 * Provide useful methods to generate files to export.
 */
class FreshRSS_Export_Service {

	private readonly FreshRSS_CategoryDAO $category_dao;
	private readonly FreshRSS_FeedDAO $feed_dao;
	private readonly FreshRSS_EntryDAO $entry_dao;
	private readonly FreshRSS_TagDAO $tag_dao;

	final public const FRSS_NAMESPACE = 'https://freshrss.org/opml';
	final public const TYPE_HTML_XPATH = 'HTML+XPath';
	final public const TYPE_XML_XPATH = 'XML+XPath';
	final public const TYPE_RSS_ATOM = 'rss';
	final public const TYPE_JSON_DOTPATH = 'JSON+DotPath';	// Legacy 1.24.0-dev
	final public const TYPE_JSON_DOTNOTATION = 'JSON+DotNotation';
	final public const TYPE_JSONFEED = 'JSONFeed';
	final public const TYPE_HTML_XPATH_JSON_DOTNOTATION = 'HTML+XPath+JSON+DotNotation';

	/**
	 * Initialize the service for the given user.
	 */
	public function __construct(private readonly string $username) {
		$this->category_dao = FreshRSS_Factory::createCategoryDao($this->username);
		$this->feed_dao = FreshRSS_Factory::createFeedDao($this->username);
		$this->entry_dao = FreshRSS_Factory::createEntryDao($this->username);
		$this->tag_dao = FreshRSS_Factory::createTagDao();
	}

	/**
	 * Generate OPML file content.
	 * @return array{0:string,1:string} First item is the filename, second item is the content
	 */
	public function generateOpml(): array {
		$view = new FreshRSS_View();
		$day = date('Y-m-d');
		$view->categories = $this->category_dao->listCategories(true, true) ?: [];
		$view->excludeMutedFeeds = false;

		return [
			"feeds_{$day}.opml.xml",
			$view->helperToString('export/opml')
		];
	}

	/**
	 * Generate the starred and labelled entries file content.
	 *
	 * Both starred and labelled entries are put into a "starred" file, that's
	 * why there is only one method for both.
	 *
	 * @phpstan-param 'S'|'T'|'ST' $type
	 * @param string $type must be one of:
	 *   'S' (starred/favourite),
	 *   'T' (tagged/labelled),
	 *   'ST' (starred or labelled)
	 * @return array{0:string,1:string} First item is the filename, second item is the content
	 */
	public function generateStarredEntries(string $type): array {
		$view = new FreshRSS_View();
		$view->categories = $this->category_dao->listCategories(true) ?: [];
		$day = date('Y-m-d');
		$view->list_title = _t('sub.import_export.starred_list');
		$view->type = 'starred';
		$entriesId = $this->entry_dao->listIdsWhere($type, 0, FreshRSS_Entry::STATE_ALL, order: 'ASC', limit: -1) ?? [];
		$view->entryIdsTagNames = $this->tag_dao->getEntryIdsTagNames($entriesId);
		// The following is a streamable query, i.e. must be last
		$view->entries = $this->entry_dao->listWhere(
			$type, 0, FreshRSS_Entry::STATE_ALL, order: 'ASC', limit: -1
		);

		return [
			"starred_{$day}.json",
			$view->helperToString('export/articles')
		];
	}

	/**
	 * Generate the entries file content for the given feed.
	 * @return array{0:string,1:string}|null First item is the filename, second item is the content.
	 * It also can return null if the feed doesn't exist.
	 */
	public function generateFeedEntries(int $feed_id, int $max_number_entries): ?array {
		$view = new FreshRSS_View();
		$view->categories = $this->category_dao->listCategories(true) ?: [];

		$feed = FreshRSS_Category::findFeed($view->categories, $feed_id);
		if ($feed === null) {
			return null;
		}
		$view->feed = $feed;

		$day = date('Y-m-d');
		$filename = "feed_{$day}_" . $feed->categoryId() . '_' . $feed->id() . '.json';

		$view->list_title = _t('sub.import_export.feed_list', $feed->name());
		$view->type = 'feed/' . $feed->id();
		$entriesId = $this->entry_dao->listIdsWhere(
			'f', $feed->id(), FreshRSS_Entry::STATE_ALL, order: 'ASC', limit: $max_number_entries
		) ?? [];
		$view->entryIdsTagNames = $this->tag_dao->getEntryIdsTagNames($entriesId);
		// The following is a streamable query, i.e. must be last
		$view->entries = $this->entry_dao->listWhere(
			'f', $feed->id(), FreshRSS_Entry::STATE_ALL, order: 'ASC', limit: $max_number_entries
		);

		return [
			$filename,
			$view->helperToString('export/articles')
		];
	}

	/**
	 * Generate the entries file content for all the feeds.
	 * @return array<string,string> Keys are filenames and values are contents.
	 */
	public function generateAllFeedEntries(int $max_number_entries): array {
		$feed_ids = $this->feed_dao->listFeedsIds();

		$exported_files = [];
		foreach ($feed_ids as $feed_id) {
			$result = $this->generateFeedEntries($feed_id, $max_number_entries);
			if ($result === null) {
				continue;
			}

			[$filename, $content] = $result;
			$exported_files[$filename] = $content;
		}

		return $exported_files;
	}

	/**
	 * Compress several files in a Zip file.
	 * @param array<string,string> $files where the key is the filename, the value is the content
	 * @return array{0:string,1:string|false} First item is the zip filename, second item is the zip content
	 */
	public function zip(array $files): array {
		$day = date('Y-m-d');
		$zip_filename = 'freshrss_' . $this->username . '_' . $day . '_export.zip';

		// From https://stackoverflow.com/questions/1061710/php-zip-files-on-the-fly
		$zip_file = tempnam(TMP_PATH, 'zip');
		if ($zip_file === false) {
			return [$zip_filename, false];
		}
		$zip_archive = new ZipArchive();
		$zip_archive->open($zip_file, ZipArchive::OVERWRITE);

		foreach ($files as $filename => $content) {
			$zip_archive->addFromString($filename, $content);
		}

		$zip_archive->close();

		$content = file_get_contents($zip_file);

		unlink($zip_file);

		return [
			$zip_filename,
			$content,
		];
	}
}