January 4, 2022: Reloaded GENESEQ Database Provides Updated BLAST and GETSIM Versions and New Searching Capabilities

The patent sequence database GENESEQ, produced by Clarivate and providing coverage of nucleic acid and protein sequences extracted from the original (basic) patent documents published by 57 patent offices worldwide, has been reloaded and enhanced. The database was previously known as DGENE on STNext.

Many of the enhancements documented herein have already been implemented in PATGENE, earlier in 4Q2021. USGENE is expected to be updated similarly in 1Q2022.

Highlights of the new version of the GENESEQ database are:

New BLAST Version and Additional BLAST Search Options

GENESEQ now uses BLAST version 2.12.0. Four additional search options have been introduced:

Additional details on these new search options can be found by typing HELP BLAST or HELP TLATION at an arrow prompt while in GENESEQ.

New FASTA Version

The FASTA algorithm, invoked by RUN GETSIM, has been updated to version 36.3.8h. It now allows searching of sequences up to 30K characters in length. The available search options are the same as before: /SQN for searching nucleotide sequences, /SQP for searching amino acid sequences, and /TSQP for translating a nucleotide query in all six reading frames to amino acid sequences and searching the protein sequences. The display of the parameters, the overview diagram, and the alignments is now the same for GETSIM and BLAST searches. Updated HELP information is available in HELP GSIM.
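To illustrate what translating a nucleotide query in all six reading frames means, here is a minimal Python sketch. It is not the GETSIM implementation: the function names are hypothetical, and the use of the standard genetic code is an assumption, since the translation table applied by /TSQP is not stated here.

```python
from itertools import product

# Standard genetic code (an assumption), with codons ordered by T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(nucleotides):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    forward = nucleotides.upper().replace("U", "T")
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCC").items():
        print(label, protein)
```

Each of the six resulting amino acid strings would then be compared against the protein sequences, which is why a single nucleotide query can produce hits in any frame.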

Improved Usability of Motif Searching (RUN GETSEQ) Results

To improve the usability of Motif searching results, the entire answer set is now always included within a single L-number. HELP GSEQ has been updated and includes additional information.

Better Display of Search Results

New displays of similarity results are now available. For each BLAST or GETSIM search two diagrams are generated to provide an overview of the similarity between the retrieved sequences and the query:

For BLAST and GETSIM searches, L-numbers are each generated by entering ALL, a percentage or an absolute number. Each L-number can be used for further processing.

Alignments can be displayed for all three RUN options (BLAST, GETSIM, GETSEQ) as text with the display format ALIGN or as an image with ALIGNG.

New Search Fields for the Composition of Nucleic Acid and Protein Sequences

Need to find sequences with a particular type of content? The introduction of new search fields reporting the nucleotide and amino acid composition of specific sequences makes this possible.

The new fields are as follows:

Range searching is possible for the /AA.CNT, /NA.CNT, /AA.PER, and /NA.PER fields, and the use of (S) proximity provides precision searching capabilities. For example, nucleotide sequences with high GC content (guanine, cytosine) can be retrieved with: => S (G OR C)/NA (S) 60-100/NA.PER
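As a rough illustration of what count and percentage values of this kind represent, the following Python sketch computes a per-residue composition for a sequence and checks which residues fall within a percentage range. The function names are hypothetical, and how GENESEQ derives and rounds the indexed values is not described here, so this should be read as an assumption rather than the database's method.

```python
from collections import Counter


def composition(sequence):
    """Map each residue to a (count, percentage) pair for one sequence."""
    seq = sequence.upper()
    total = len(seq)
    if total == 0:
        return {}
    counts = Counter(seq)
    return {
        residue: (count, 100.0 * count / total)
        for residue, count in counts.items()
    }


def residues_in_range(sequence, residues, low, high):
    """List the given residues whose percentage lies within [low, high]."""
    comp = composition(sequence)
    return [
        residue
        for residue in residues
        if residue in comp and low <= comp[residue][1] <= high
    ]


if __name__ == "__main__":
    example = "GGGGGGAT"
    print(composition(example))
    print(residues_in_range(example, "GC", 60, 100))
```

The second function pairs each residue with its own percentage, which is one reading of how the (S) operator links the /NA and /NA.PER terms in the example query.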

Better Compatibility with the PATGENE and USGENE Sequence Databases

The search fields Patent Sequence Location (/PSL) and Sequence Count (/SEQC), already available in PATGENE and USGENE, are now also available in GENESEQ. This means that the same sequence-specific searches can now be performed in all three databases.

For every sequence in GENESEQ, a key has been generated with the SHA-2 algorithm and indexed in the new field Sequence Key (/SEQK). The generated string (e.g., A0000030BD19782FC1774AF58E4CFFEE7F0E30588CBA14DCD38C) is specific to a sequence. Identical sequences receive the same string, regardless of the database of origin or the organism from which the sequence was isolated. The /SEQK field has already been added to PATGENE and will be added to USGENE in due course to enable efficient duplicate identification.
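The principle behind such a key can be sketched in a few lines of Python. This is an illustration only: the SHA-2 variant, the normalization applied to a sequence before hashing, and the layout of the /SEQK string are not specified here, so the use of SHA-256, the whitespace and case normalization, and the function name are all assumptions, and the output will not match actual /SEQK values.

```python
import hashlib


def sequence_key(sequence):
    """Return an upper-case hex digest for a normalized sequence string.

    Assumptions: SHA-256 as the SHA-2 variant, with whitespace removed
    and letters upper-cased before hashing.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest().upper()


if __name__ == "__main__":
    key_a = sequence_key("atg gcc")
    key_b = sequence_key("ATGGCC")
    print(key_a == key_b)                   # identical sequences share a key
    print(key_a == sequence_key("ATGGCA"))  # a different sequence does not
```

Because the key depends only on the sequence itself, records holding the same sequence can be matched across databases by comparing keys instead of rerunning a similarity search.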

Compatibility with Full-Text Patent Databases

Search fields common to the patent full text databases are now also available in GENESEQ:

These fields already appear in PATGENE and will also appear in USGENE in due course.

Improved Performance and Additional Enhancements

As a result of the new BLAST and FASTA versions, search performance is improved.

Although BATCH searches are not possible, L-numbers from sequence searches can be saved with the command SAVE and reactivated with ACTIVATE.

Alerts for sequences are not possible for the time being but can be set up for bibliographic fields.

The default maximum number of hits has been increased to 15,000. The new parameter "-maxseq" allows the maximum number of hits to be increased to 100,000, but larger maximums will mean longer processing time. Example: => RUN BLAST L1/SQN -F F -MAXSEQ 100000

The new Database Summary Sheet for GENESEQ is available at: https://www.cas.org/sites/default/files/documents/geneseq.pdf
