@Generated(value="com.amazonaws:aws-java-sdk-code-generator")
public class SemanticChunkingConfiguration extends Object implements Serializable, Cloneable, StructuredPojo
Settings for semantic document chunking for a data source. Semantic chunking splits a document into smaller documents based on groups of similar content, which are derived from the text with natural language processing.
With semantic chunking, each sentence is compared to the next to determine how similar they are. You specify a threshold in the form of a percentile, where adjacent sentences that are less similar than that percentage of sentence pairs are divided into separate chunks. For example, if you set the threshold to 90, then the 10 percent of sentence pairs that are least similar are split. So if you have 101 sentences, 100 sentence pairs are compared, and the 10 with the least similarity are split, creating 11 chunks. These chunks are further split if they exceed the max token size.
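The arithmetic in the example above can be sketched in a few lines of Java. This is only an illustration of the counting, not the service's implementation: the class name BreakpointSketch and its methods are hypothetical, and rounding down for pair counts that do not divide evenly is an assumption, since the description does not say how fractional results are handled.

```java
// Hypothetical helper that mirrors the worked example in the class description.
class BreakpointSketch {

    /** Number of adjacent sentence pairs that are compared. */
    static int pairCount(int sentences) {
        return Math.max(sentences - 1, 0);
    }

    /**
     * Number of split points for a percentile threshold between 0 and 100.
     * Assumption: fractional results are rounded down.
     */
    static int splitCount(int sentences, int percentileThreshold) {
        return pairCount(sentences) * (100 - percentileThreshold) / 100;
    }

    /** Number of chunks before any further splitting by the maximum token size. */
    static int chunkCount(int sentences, int percentileThreshold) {
        if (sentences == 0) {
            return 0;
        }
        return splitCount(sentences, percentileThreshold) + 1;
    }

    public static void main(String[] args) {
        int sentences = 101;
        int threshold = 90;
        System.out.println(pairCount(sentences) + " pairs, "
                + splitCount(sentences, threshold) + " splits, "
                + chunkCount(sentences, threshold) + " chunks");
        // prints "100 pairs, 10 splits, 11 chunks"
    }
}
```

Raising the threshold leaves fewer pairs above it, so the document is split in fewer places and the chunks are larger on average.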
You must also specify a buffer size, which determines whether sentences are compared in isolation, or within a moving context window that includes the previous and following sentence. For example, if you set the buffer size to 1, the embedding for sentence 10 is derived from sentences 9, 10, and 11 combined.
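The moving context window can be pictured as a range of sentence numbers around the current sentence. The following sketch is an illustration only: the class BufferWindowSketch and its method are hypothetical, sentences are numbered from 1 to match the example, and clamping the window at the start and end of the document is an assumption that the description does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that lists which sentences feed the embedding of one sentence.
class BufferWindowSketch {

    /**
     * Returns the 1-based sentence numbers inside the window around sentenceNumber.
     * Assumption: the window is clamped at the first and last sentence.
     */
    static List<Integer> windowIndices(int sentenceNumber, int bufferSize, int totalSentences) {
        int first = Math.max(1, sentenceNumber - bufferSize);
        int last = Math.min(totalSentences, sentenceNumber + bufferSize);
        List<Integer> window = new ArrayList<>();
        for (int i = first; i <= last; i++) {
            window.add(i);
        }
        return window;
    }

    public static void main(String[] args) {
        System.out.println(windowIndices(10, 1, 101)); // prints "[9, 10, 11]"
        System.out.println(windowIndices(10, 0, 101)); // prints "[10]"
    }
}
```

With a buffer size of 0 the window contains only the sentence itself, which corresponds to comparing sentences in isolation.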
Constructor and Description |
---|
SemanticChunkingConfiguration() |
Modifier and Type | Method and Description |
---|---|
SemanticChunkingConfiguration | clone() |
boolean | equals(Object obj) |
Integer | getBreakpointPercentileThreshold() The dissimilarity threshold for splitting chunks. |
Integer | getBufferSize() The buffer size. |
Integer | getMaxTokens() The maximum number of tokens that a chunk can contain. |
int | hashCode() |
void | marshall(ProtocolMarshaller protocolMarshaller) Marshalls this structured data using the given ProtocolMarshaller. |
void | setBreakpointPercentileThreshold(Integer breakpointPercentileThreshold) The dissimilarity threshold for splitting chunks. |
void | setBufferSize(Integer bufferSize) The buffer size. |
void | setMaxTokens(Integer maxTokens) The maximum number of tokens that a chunk can contain. |
String | toString() Returns a string representation of this object. |
SemanticChunkingConfiguration | withBreakpointPercentileThreshold(Integer breakpointPercentileThreshold) The dissimilarity threshold for splitting chunks. |
SemanticChunkingConfiguration | withBufferSize(Integer bufferSize) The buffer size. |
SemanticChunkingConfiguration | withMaxTokens(Integer maxTokens) The maximum number of tokens that a chunk can contain. |
public void setBreakpointPercentileThreshold(Integer breakpointPercentileThreshold)
The dissimilarity threshold for splitting chunks.
breakpointPercentileThreshold - The dissimilarity threshold for splitting chunks.
public Integer getBreakpointPercentileThreshold()
The dissimilarity threshold for splitting chunks.
public SemanticChunkingConfiguration withBreakpointPercentileThreshold(Integer breakpointPercentileThreshold)
The dissimilarity threshold for splitting chunks.
breakpointPercentileThreshold - The dissimilarity threshold for splitting chunks.
public void setBufferSize(Integer bufferSize)
The buffer size.
bufferSize - The buffer size.
public Integer getBufferSize()
The buffer size.
public SemanticChunkingConfiguration withBufferSize(Integer bufferSize)
The buffer size.
bufferSize - The buffer size.
public void setMaxTokens(Integer maxTokens)
The maximum number of tokens that a chunk can contain.
maxTokens - The maximum number of tokens that a chunk can contain.
public Integer getMaxTokens()
The maximum number of tokens that a chunk can contain.
public SemanticChunkingConfiguration withMaxTokens(Integer maxTokens)
The maximum number of tokens that a chunk can contain.
maxTokens - The maximum number of tokens that a chunk can contain.
public String toString()
Overrides toString in class Object. See also Object.toString().
public SemanticChunkingConfiguration clone()
public void marshall(ProtocolMarshaller protocolMarshaller)
Marshalls this structured data using the given ProtocolMarshaller.
Specified by marshall in interface StructuredPojo.
protocolMarshaller - Implementation of ProtocolMarshaller used to marshall this object's data.