February 21, 2022: USGENE Database Reload Provides Updated BLAST and GETSIM Versions and New Searching Capabilities, Manual Codes for Derwent World Patents Index Revised for 2022

USGENE Database Reload Provides Updated BLAST and GETSIM Versions and New Searching Capabilities

The patent sequence database USGENE, providing all available peptide and nucleic acid sequences from the published applications and issued patents of the United States Patent and Trademark Office (USPTO), has been reloaded and enhanced on STNext.

In addition to faster search processing, the highlights of the new version of USGENE are:

New BLAST Version and Additional BLAST Search Options

USGENE now uses BLAST version 2.12.0. Four additional search options have been introduced, allowing for more precision in search results:

Additional details on these new search options can be found by typing HELP BLAST or HELP TLATION at an arrow prompt while in USGENE.

New FASTA Version

The FASTA algorithm, invoked by RUN GETSIM, has been updated to version 36.3.8h. It now allows searching of sequences up to 30K characters in length. The available search options are the same as before: /SQN for searching nucleotide sequences, /SQP for searching amino acid sequences, and /TSQP for translating a nucleotide query in all six reading frames to amino acid sequences and searching the protein sequences. The display of the parameters, the overview diagram, and the alignments is now the same for GETSIM and BLAST searches. Updated HELP information is available in HELP GSIM.
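The six-frame translation that /TSQP applies to a nucleotide query can be illustrated with a minimal Python sketch. It assumes the standard genetic code and unambiguous DNA input; the function names are hypothetical, and this is not the GETSIM implementation.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    frames = {}
    strands = (("+", seq.upper()), ("-", reverse_complement(seq)))
    for strand, strand_seq in strands:
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

Each of the six resulting protein strings would then be compared against the protein sequences, which is why a translated search covers both strands and all codon offsets.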

Improved Usability of Motif Searching (RUN GETSEQ) Results

To improve the usability of Motif searching results, the entire answer set is now always included within a single L number. HELP GSEQ has been updated and includes additional information.

Better Display of Search Results, New Sorting Option

New displays of similarity results are now available. For each BLAST or GETSIM search, two diagrams are now generated to provide an overview of the similarity between the retrieved sequences and the query:

For BLAST and GETSIM searches, each L-number is generated by entering ALL, a percentage, or an absolute number, and each L-number can be used for further processing. While the default search results display is sorted by descending Accession Number, the ability to sort by descending Similarity Score (SORT SCORE D L1) has been retained, and the ability to sort by descending Percent Identity (SORT IDENT D L1) has been introduced in USGENE. The capability to sort by descending Percent Identity is now also being introduced in PATGENE and GENESEQ.

Alignments can be displayed for all three RUN options (BLAST, GETSIM, GETSEQ) as text with the display format ALIGN or as an image with ALIGNG.

New Search Fields for the Composition of Nucleic Acid and Protein

The introduction of new search fields reporting the nucleotide and amino acid composition of a specific sequence makes it possible to refine your searches to find sequences with a particular type of content. The new fields are as follows:

Range searching is possible for the /AA.CNT, /NA.CNT, /AA.PER, and /NA.PER fields. Use the (S) proximity operator for more precise search results. For example, nucleotide sequences with high GC content (guanine, cytosine) can be retrieved with: => S (G OR C)/NA (S) 60-100/NA.PER
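To make the count and percentage fields concrete, the following Python sketch computes a per-residue composition and applies a percentage range test. It is only an illustration: the function names are hypothetical, the way USGENE derives its indexed values is not described here, and the range test assumes that the query above matches when G or C individually falls within the range.

```python
from collections import Counter


def composition(seq):
    """Return {residue: (count, percent)} for a sequence string."""
    seq = seq.upper()
    total = len(seq)
    return {
        residue: (count, 100.0 * count / total)
        for residue, count in Counter(seq).items()
    }


def any_residue_in_percent_range(seq, residues, low, high):
    """True if at least one listed residue has a percentage in [low, high].

    Assumption: this mirrors pairing one residue with one percentage
    value, which is one reading of the (S) operator in the example query.
    """
    comp = composition(seq)
    return any(
        residue in comp and low <= comp[residue][1] <= high
        for residue in residues
    )
```

For instance, `composition("GGGGGGAT")` reports six G residues making up 75 percent of the sequence, so a 60-100 range test on G or C succeeds for it.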

Better Compatibility with the PATGENE and GENESEQ Sequence Databases

While USGENE already had the Patent Sequence Location (/PSL) and Sequence Count (/SEQC) fields, their recent addition to PATGENE and GENESEQ means that the same sequence-specific searches can now be performed in all three databases.

For every sequence in USGENE, a SHA-2 hash has been computed and indexed in the new Sequence Key (/SEQK) field. The generated string (e.g., A0000030BD19782FC1774AF58E4CFFEE7F0E30588CBA14DCD38C) is specific to a sequence. Identical sequences receive the same string, regardless of the database of origin or the organism from which the sequence was isolated. Further details on using the /SEQK field for efficient duplicate identification will be communicated in due course.
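The idea of a hash-based sequence key can be sketched in Python as follows. This is an assumption-laden illustration: it uses SHA-256 (one member of the SHA-2 family) on an upper-cased, whitespace-stripped sequence, whereas the actual SHA-2 variant, normalization, and key format used for /SEQK are not stated here, and the example key above is not a plain SHA-256 hex digest. The function name is hypothetical.

```python
import hashlib


def sequence_key(seq):
    """Return a hex digest that is identical for identical sequences.

    Assumption: whitespace and letter case are ignored before hashing.
    """
    normalized = "".join(seq.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest().upper()
```

Because the digest depends only on the sequence itself, two records carrying the same sequence yield the same key wherever they come from, which is what makes such a key useful for spotting duplicates.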

Compatibility with Full-Text Patent Databases

Search fields common to the patent full text databases are now also available in USGENE:

Maximum Number of Hits Increased

The default maximum number of hits has been increased to 15,000.

The new parameter "-maxseq" allows the maximum number of hits to be increased to 100,000, but a larger maximum will mean a longer processing time. Example of setting maxseq to 100,000: => RUN BLAST L1/SQN -F F -MAXSEQ 100000

Additional Information

Although BATCH searches are not possible, L-numbers from sequence searches can be saved with the command SAVE and reactivated with ACTIVATE.

Alerts for sequences are not possible for the time being but can be set up for bibliographic fields.

The new Database Summary Sheet for USGENE is available at: https://www.cas.org/sites/default/files/documents/usgene.pdf

Manual Codes for Derwent World Patents Index Revised for 2022

The Derwent World Patents Index Manual Codes are revised each year to include new codes suggested by customers as well as the patent analysts at Clarivate.

For the 2022 revision, 79 new Manual Codes have been added, comprising:

The new codes, in use since update 2022001, allow newly emerging technologies to be indexed in DWPI. Scope note changes have also been introduced to increase clarity.

Significant revisions for 2022 include:

Full lists of the new and revised codes can be viewed at: https://clarivate.com/derwent/dwpi-reference-center/dwpi-manual-code/
