[Spring] A Deadlock Encountered While Running Parallel Transactions (with Coroutine, MySQL)

waterfogsw 2023. 12. 4. 17:04


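While designing an API that creates many rows in a single table at once, I split each row into its own transaction and sent the inserts to the DB in parallel to improve performance, and ran into a deadlock along the way.
This post walks through how I worked toward a solution, using simplified example code.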

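Initial implementation

Requirements

Product batch creation API
Each product creation request in a batch must be processed in its own transaction.
Duplicate product names are not allowed.

Table design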

CREATE TABLE product
(
    id          BIGINT       NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name        VARCHAR(255) NOT NULL,
    description TEXT         NOT NULL
);
CREATE UNIQUE INDEX Product_name_uindex ON product (name);

To meet these requirements, I implemented the following.

Batch creation UseCase

interface ProductBatchCreateUseCase {  

    fun invoke(commands: List<Command>): List<Result>  

    data class Command(  
        val name: String,  
        val description: String,  
    )  

    sealed class Result {  
        data class Success(val postId: PostId) : Result()  
        data class Failure(  
            val name: String,  
            val message: String  
        ) : Result()  
    }  
}

@Service  
class ProductBatchCreate(  
    private val productCreateUseCase: ProductCreateUseCase  
) : ProductBatchCreateUseCase {  

    override fun invoke(commands: List<ProductBatchCreateUseCase.Command>): List<ProductBatchCreateUseCase.Result> {  
        val results: List<ProductCreateUseCase.Result> = commands.map {  
            productCreateUseCase.invoke(  
                command = ProductCreateUseCase.Command(  
                    name = it.name,  
                    content = it.description  
                )  
            )  
        }  

        return results.map {  
            when (it) {  
                is ProductCreateUseCase.Result.Success -> mapToSuccess(it)  
                is ProductCreateUseCase.Result.Failure -> mapToFailure(it)  
            }  
        }  
    }  

    private fun mapToSuccess(result: ProductCreateUseCase.Result.Success): ProductBatchCreateUseCase.Result.Success {  
        return ProductBatchCreateUseCase.Result.Success(postId = result.id)  
    }  

    private fun mapToFailure(result: ProductCreateUseCase.Result.Failure): ProductBatchCreateUseCase.Result.Failure {  
        return ProductBatchCreateUseCase.Result.Failure(  
            name = result.title,  
            message = result.message,  
        )  
    }  
}

Single creation UseCase

interface ProductCreateUseCase {  

    fun invoke(command: Command): Result  

    data class Command(  
        val name: String,  
        val content: String,  
    )  

    sealed class Result {  
        data class Success(val id: PostId) : Result()  
        data class Failure(  
            val title: String,  
            val message: String  
        ) : Result()  
    }  
}

@Service  
class ProductCreate(  
    private val productRepository: ProductRepository  
) : ProductCreateUseCase {  

    @Transactional(propagation = Propagation.REQUIRES_NEW)  
    override fun invoke(command: ProductCreateUseCase.Command): ProductCreateUseCase.Result {  
        val product: Product = Product.create(  
            name = command.name,  
            content = command.content,  
        )  

        if (isDuplicateTitle(product.name)) {  
            return ProductCreateUseCase.Result.Failure(  
                title = product.name,  
                message = "Duplicate product name."
            )  
        }  

        val savedProduct: Product = productRepository.save(product)  

        return ProductCreateUseCase.Result.Success(id = savedProduct.id)  
    }  

    private fun isDuplicateTitle(title: String): Boolean {  
        return productRepository.existsByName(title)  
    }  
}

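For single creation, I set the propagation to REQUIRES_NEW to guarantee, and make explicit, that each call runs in its own transaction.

A duplicate can also be detected from the Duplicate Key error, but that means parsing the message inside DataIntegrityViolationException to tell a duplicate apart from other errors, and it ties the logic to a specific DB.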
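To make the DB dependence concrete, here is a minimal sketch of what error-based detection could look like. It assumes the wrapped cause is a plain `java.sql.SQLException` carrying MySQL's vendor error code 1062 (ER_DUP_ENTRY); `isDuplicateKeyError` is a hypothetical helper, not part of the original code, and the check would have to change for another database.

```kotlin
import java.sql.SQLException
import java.sql.SQLIntegrityConstraintViolationException

// MySQL reports a unique-key violation with vendor error code 1062 (ER_DUP_ENTRY).
const val MYSQL_DUPLICATE_ENTRY = 1062

// Hypothetical helper: walks the cause chain and looks for a MySQL duplicate-entry error.
fun isDuplicateKeyError(error: Throwable): Boolean =
    generateSequence(error) { it.cause }
        .filterIsInstance<SQLException>()
        .any { it.errorCode == MYSQL_DUPLICATE_ENTRY }

fun main() {
    // Stand-ins for the exception a data-access layer might wrap and rethrow.
    val duplicate = RuntimeException(
        "could not execute statement",
        SQLIntegrityConstraintViolationException("Duplicate entry", "23000", 1062),
    )
    val other = RuntimeException(
        "could not execute statement",
        SQLException("Data too long", "22001", 1406),
    )

    println(isDuplicateKeyError(duplicate)) // true
    println(isDuplicateKeyError(other))     // false
}
```

The vendor code is specific to MySQL, which is exactly the coupling described above.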

For that reason, I decided the application should also be able to check for duplicates itself, and wrote validation logic for it. I then used an integration test to check that batch creation works.

@SpringBootTest  
@ContextConfiguration(classes = [IntegrationTestSetup::class])  
class ProductBatchCreateTest(  
    private val sut: ProductBatchCreateUseCase  
) : FunSpec({  

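    test("product batch creation") {
        // given
        // each command needs a distinct name, otherwise the duplicate check rejects all but the first
        val commands: List<ProductBatchCreateUseCase.Command> = (0 until 10).map {
            ProductBatchCreateUseCase.Command(
                name = "Product $it",
                description = "Description of product $it"
            )
        }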

        // when  
        val results: List<ProductBatchCreateUseCase.Result> = sut.invoke(commands)  

        // then  
        results.filterIsInstance<ProductBatchCreateUseCase.Result.Success>().size shouldBe 10  
    }

    test("measure product batch creation time") {
        // given
        val commands: List<ProductBatchCreateUseCase.Command> = (0 until 1000).map {
            ProductBatchCreateUseCase.Command(
                name = "Product $it",
                description = "Description of product $it"
            )
        }

        // when, then
        measureTimeMillis { sut.invoke(commands) }
            .also { time -> println("Product batch creation time: $time ms") }
    }
})

Product batch creation time: 6998 ms

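The integration test ran on Testcontainers, in the same environment as the production code. Batch creation is implemented so that it handles one creation request and only then moves on to the next, which is inefficient.

Applying parallel processing with coroutines

The creation requests do not need to share a single transaction, so they can be processed in parallel. I changed the batch creation to process requests in parallel with coroutines and measured the creation time again.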

    override suspend fun invoke(commands: List<ProductBatchCreateUseCase.Command>): List<ProductBatchCreateUseCase.Result> =  
    coroutineScope {  
        val deferredResults: List<Deferred<ProductCreateUseCase.Result>> = commands.map { command ->  
            async(Dispatchers.IO) {  
                productCreateUseCase.invoke(  
                    ProductCreateUseCase.Command(  
                        name = command.name,  
                        content = command.description  
                    )  
                )  
            }  
        }  

        deferredResults.awaitAll().map { result ->  
            when (result) {  
                is ProductCreateUseCase.Result.Success -> mapToSuccess(result)  
                is ProductCreateUseCase.Result.Failure -> mapToFailure(result)  
            }  
        }  
    }

Product batch creation time: 1593 ms

In the test that creates 1,000 rows, the run time dropped from 6998 ms to 1593 ms, roughly 4.4 times faster. Even allowing for measurement error, that is a large improvement.

Performance improved, but a new problem appeared. If a batch request contains duplicate product names, the isDuplicateTitle method in ProductCreate fails to detect the duplicate, productRepository.save(product) is called anyway, and the DB raises a DataIntegrityViolationException.

Transaction A
select * from product where name = 'duplicate-name';
insert into product (name, description) values ('duplicate-name', 'test');

Transaction B
select * from product where name = 'duplicate-name';
insert into product (name, description) values ('duplicate-name', 'test');

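MySQL is running at its default transaction isolation level, REPEATABLE READ. Two transactions running in parallel each execute their select against their own snapshot, which does not contain the other transaction's uncommitted row. So even when several transactions carry the same name, none of the selects finds it, and every insert query goes ahead.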
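The check-then-insert race can be reproduced without a database. The sketch below is only a simulation: an in-memory set stands in for the unique index on `product.name`, and a `CyclicBarrier` forces both "transactions" to finish their duplicate check before either one inserts. All names (`Attempt`, `simulateCheckThenInsert`) are made up for illustration.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.CyclicBarrier
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

data class Attempt(val sawDuplicate: Boolean, val inserted: Boolean)

// In-memory stand-in for the unique index on product.name (an assumption for illustration).
fun simulateCheckThenInsert(name: String): List<Attempt> {
    val uniqueIndex = ConcurrentHashMap.newKeySet<String>()
    val bothChecked = CyclicBarrier(2) // both "transactions" must finish the check first
    val pool = Executors.newFixedThreadPool(2)
    try {
        val futures = (1..2).map {
            pool.submit(Callable {
                val sawDuplicate = name in uniqueIndex // SELECT: the other row is not visible yet
                bothChecked.await(5, TimeUnit.SECONDS)
                val inserted = uniqueIndex.add(name)   // INSERT: the unique index lets only one win
                Attempt(sawDuplicate, inserted)
            })
        }
        return futures.map { it.get(5, TimeUnit.SECONDS) }
    } finally {
        pool.shutdownNow()
    }
}

fun main() {
    val attempts = simulateCheckThenInsert("duplicate-name")
    println("checks that saw a duplicate: ${attempts.count { it.sawDuplicate }}") // 0
    println("inserts that succeeded: ${attempts.count { it.inserted }}")          // 1
}
```

Both checks pass, yet only one insert succeeds. The losing insert corresponds to the DataIntegrityViolationException described above.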

To solve this, I tried changing the query behind isDuplicateTitle to select ... for update so that it takes a write lock.

Deadlock

could not execute statement [Deadlock found when trying to get lock; try restarting transaction] [insert into product (description,name) values (?,?)]; SQL [insert into product (description,name) values (?,?)]

org.springframework.dao.CannotAcquireLockException: could not execute statement [Deadlock found when trying to get lock; try restarting transaction] [insert into product (description,name) values (?,?)]; SQL [insert into product (description,name) values (?,?)]

Running the test produced the deadlock error above. In the MySQL console, the SHOW ENGINE INNODB STATUS command showed information about the most recent deadlock.

**Transaction (1)**

(1) HOLDS THE LOCK(S):

RECORD LOCKS space id 2 page no 5 n bits 80 index Product_name_uindex of table `test`.`product` trx id 2984 lock_mode X locks gap before rec
Record lock, heap no 8 PHYSICAL RECORD: n_fields 2; compact format; info bits 0

(1) WAITING FOR THIS LOCK TO BE GRANTED:

RECORD LOCKS space id 2 page no 5 n bits 80 index Product_name_uindex of table `test`.`product` trx id 2984 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 8 PHYSICAL RECORD: n_fields 2; compact format; info bits 0

**Transaction (2)**

(2) HOLDS THE LOCK(S):

RECORD LOCKS space id 2 page no 5 n bits 80 index Product_name_uindex of table `test`.`product` trx id 2992 lock_mode X locks gap before rec
Record lock, heap no 8 PHYSICAL RECORD: n_fields 2; compact format; info bits 0

(2) WAITING FOR THIS LOCK TO BE GRANTED:

RECORD LOCKS space id 2 page no 5 n bits 80 index Product_name_uindex of table `test`.`product` trx id 2992 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 8 PHYSICAL RECORD: n_fields 2; compact format; info bits 0

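Analyzing the log gives the following.

Transaction 1

State: inserting, waiting for a lock
Action: executing an insert query on the product table
Lock information:
- Lock held: an exclusive (X) gap lock on Product_name_uindex
- Lock waited for: an X insert intention lock on the same gap of the same index

Transaction 2

State: inserting, waiting for a lock
Action: executing an insert query on the product table
Lock information:
- Lock held: an exclusive (X) gap lock on Product_name_uindex
- Lock waited for: an X insert intention lock on the same gap of the same index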

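One thing may look puzzling here: the locks held by both transactions are exclusive locks (lock_mode X). Exclusive locks normally cannot be held by two transactions at once, yet the log shows both transactions holding an exclusive lock at the same position. The answer is in the official MySQL documentation.

MySQL official documentation - Gap lock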
Gap locks in InnoDB are “purely inhibitive”, which means that their only purpose is to prevent other transactions from inserting to the gap. Gap locks can co-exist. A gap lock taken by one transaction does not prevent another transaction from taking a gap lock on the same gap. There is no difference between shared and exclusive gap locks. They do not conflict with each other, and they perform the same function.

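The documentation explains that several transactions can hold a gap lock on the same gap without conflicting. With that question settled, the log above makes the cause of the deadlock clear.

The select ... for update query was issued for data that does not actually exist, so it took a gap lock, and because gap locks can coexist across transactions, both transactions ended up holding one at the same time.

Each transaction then tries to acquire an insert intention lock for its insert query. An insert intention lock is not compatible with a gap lock held by another transaction, so each transaction waits for the other's gap lock. Neither transaction can finish, so neither gap lock is released and neither insert intention lock is granted, which is a deadlock.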
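The two compatibility rules behind this deadlock can be written down as a toy model. This is not how InnoDB's lock manager is implemented; `IndexGap`, `isDeadlocked`, and `simulateGapLockDeadlock` are hypothetical names that encode only "gap locks coexist" and "an insert intention lock waits for other transactions' gap locks".

```kotlin
// Toy model of a single index gap. It encodes only the two compatibility
// rules discussed above and is not InnoDB's real lock manager.
class IndexGap {
    private val gapLockHolders = mutableSetOf<String>()

    // Gap locks never conflict with each other, so this always succeeds.
    fun acquireGapLock(tx: String) {
        gapLockHolders += tx
    }

    // An insert intention lock has to wait for gap locks held by OTHER transactions.
    fun blockersOfInsert(tx: String): Set<String> = gapLockHolders - tx
}

// Two transactions are deadlocked when each one waits for the other.
fun isDeadlocked(waitsFor: Map<String, Set<String>>, a: String, b: String): Boolean =
    b in waitsFor[a].orEmpty() && a in waitsFor[b].orEmpty()

fun simulateGapLockDeadlock(): Boolean {
    val gap = IndexGap()
    gap.acquireGapLock("A") // SELECT ... FOR UPDATE on a name that does not exist
    gap.acquireGapLock("B") // same gap, still granted
    // Both transactions now attempt their INSERT into that gap.
    val waitsFor = listOf("A", "B").associateWith { gap.blockersOfInsert(it) }
    return isDeadlocked(waitsFor, "A", "B")
}

fun main() {
    println(simulateGapLockDeadlock()) // true
}
```

If only one transaction held the gap lock, its own insert would have no blockers, which is why the problem appears only under parallel execution.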

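If the only goal is to remove the deadlock caused by gap locks, the Read Committed isolation level can be used so that gap locks are not taken for the search. The Repeatable Read isolation level guarantees that data read when the transaction started does not change until the transaction ends.

That requires a guarantee that other transactions do not insert data into a given interval, and MySQL provides it with gap locks.

So switching to the Read Committed isolation level explicitly disables gap locking for searches and index scans. The side effects of lowering the isolation level, such as having to set the binary log format to row, also need to be taken into account.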

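Using the synchronized keyword

However, at the Read Committed level, select ... for update on a row that does not exist takes no lock at all, so duplicate values are still inserted and DataIntegrityViolationException is still raised. And at the Repeatable Read level it caused the deadlock, as shown earlier.

Since the validation logic exists to control the check entirely at the application level, the remaining options were a distributed lock or the synchronized keyword. The concern at this point is concurrency on a single node, so taking an application-level lock with synchronized looked like a reasonable choice.

However, using @Synchronized together with @Transactional brings a few potential problems.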

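Transaction boundary problem:
- @Synchronized acquires the lock when the method is entered and releases it when the method returns.
- @Transactional actually works through a proxy, which starts the transaction before the method call and commits or rolls back after the method completes.
- Because of this ordering, the lock can be released before the transaction commits, so another thread can enter the method and run its duplicate check before the first insert is visible.
Mismatch in the level of concurrency control:
- @Synchronized controls concurrency at the JVM level.
- @Transactional controls transactions at the database level.
- Mixing the two levels increases complexity and can lead to unexpected behavior.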
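The ordering problem in the first list can be traced with a small simulation. It assumes the usual proxy arrangement, where the transactional wrapper surrounds the synchronized method; `inTransaction` and `ProductCreator` are hypothetical stand-ins that only record events, not Spring code.

```kotlin
// Stand-in for the @Synchronized method on the target bean.
class ProductCreator(private val events: MutableList<String>) {
    private val lock = Any()

    fun create(name: String) {
        synchronized(lock) {
            events += "lock acquired"
            events += "duplicate check and insert for $name"
        }
        events += "lock released"
    }
}

// Stand-in for the transactional proxy that wraps the method call.
fun <T> inTransaction(events: MutableList<String>, body: () -> T): T {
    events += "transaction begin"
    val result = body()
    events += "transaction commit"
    return result
}

// Records the order of events for one call through the "proxy".
fun traceCall(): List<String> {
    val events = mutableListOf<String>()
    inTransaction(events) { ProductCreator(events).create("product") }
    return events
}

fun main() {
    traceCall().forEach(::println)
    // transaction begin
    // lock acquired
    // duplicate check and insert for product
    // lock released
    // transaction commit
}
```

Between "lock released" and "transaction commit", a second thread can take the lock and run its duplicate check against data that does not yet include the first insert.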

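Re-examining the requirements

Stepping back at this point, it is worth revisiting the core requirements we set out to meet.

Implement a product batch creation API
Process each product creation in its own transaction
No duplicate product names
Performance optimization needed

We tried parallel processing with coroutines to optimize performance, and ran into concurrency problems such as deadlocks along the way. We tried several ways to solve them, but each approach had its limits:

Select For Update attempt

Deadlock caused by gap locks at Repeatable Read
At Read Committed the lock does not work as intended

Isolation level adjustment attempt

Lowering to Read Committed can introduce concurrency problems
Additional settings required, such as changing the binary log format

Synchronized keyword attempt

Possible unexpected behavior when combined with @Transactional
Mismatch between JVM-level and DB-level concurrency control

After this trial and error, the options narrow down to two practical solutions:

Batch Insert approach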

@Service
class ProductBatchCreate(
    private val productRepository: ProductRepository,
    private val jdbcTemplate: JdbcTemplate,
) : ProductBatchCreateUseCase {

    companion object {

        private const val BATCH_SIZE = 1000
        private const val INSERT_QUERY = """
            INSERT INTO product (name, description) 
            VALUES (?, ?)
        """
    }

    @Transactional
    override fun invoke(commands: List<ProductBatchCreateUseCase.Command>): List<ProductBatchCreateUseCase.Result> {
        if (commands.isEmpty()) return emptyList()

        val existingNames = findExistingProductNames(commands)
        val validProducts = filterValidProducts(commands, existingNames)

        // process in chunks of BATCH_SIZE
        val batchResults = validProducts.chunked(BATCH_SIZE).flatMap { batch ->
            batchInsertProducts(batch)
        }

        // build failure and success results
        val failures = existingNames.map {
            ProductBatchCreateUseCase.Result.Failure(
                name = it,
                message = "Duplicate product name."
            )
        }

        val successes = batchResults.map {
            ProductBatchCreateUseCase.Result.Success(postId = it)
        }

        return failures + successes
    }

    private fun batchInsertProducts(products: List<Product>): List<Long> {
        if (products.isEmpty()) return emptyList()

        val keyHolder = GeneratedKeyHolder()

        jdbcTemplate.batchUpdate(
            PreparedStatementCreator { connection ->
                connection.prepareStatement(INSERT_QUERY, Statement.RETURN_GENERATED_KEYS)
            },
            object : BatchPreparedStatementSetter {
                override fun setValues(
                    ps: PreparedStatement,
                    i: Int
                ) {
                    val product = products[i]
                    ps.setString(1, product.name)
                    ps.setString(2, product.description)
                }

                override fun getBatchSize() = products.size
            },
            keyHolder
        )

        return keyHolder.keyList.map {
            (it["GENERATED_KEY"] as Number).toLong()
        }
    }

    private fun findExistingProductNames(commands: List<ProductBatchCreateUseCase.Command>): Set<String> {
        val names = commands.map { it.name }
        return productRepository.findAllByNameIn(names)
            .asSequence()
            .map { it.name }
            .toSet()
    }

    private fun filterValidProducts(
        commands: List<ProductBatchCreateUseCase.Command>,
        existingNames: Set<String>
    ): List<Product> {
        return commands.asSequence()
            .filterNot { existingNames.contains(it.name) }
            .map { command ->
                Product.create(
                    name = command.name,
                    description = command.description,
                )
            }
            .toList()
    }
}

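Product batch creation time: 416 ms

Advantages:

Significantly faster
No deadlock risk
Efficient duplicate check
Consistency guaranteed by a single transaction

Disadvantages:

Does not meet the separate-transaction requirement
Higher memory usage for large volumes of data
Requires extra logic for batch processing and DB connection settings (JPA's saveAll method is not a batch insert!)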
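One gap worth noting: the batch insert code above filters out names that already exist in the table, but duplicate names inside the same request would still reach the insert. A minimal sketch of an in-memory pre-pass follows, keeping the first command per name. `Split` and `splitDuplicateNames` are hypothetical helpers, and `Command` here is a simplified stand-in for ProductBatchCreateUseCase.Command.

```kotlin
// Simplified stand-in for ProductBatchCreateUseCase.Command.
data class Command(val name: String, val description: String)

data class Split(val unique: List<Command>, val duplicates: List<Command>)

// Keeps the first command for each name and reports later ones as duplicates.
fun splitDuplicateNames(commands: List<Command>): Split {
    val seen = HashSet<String>()
    // HashSet.add returns false when the name was already seen.
    val (unique, duplicates) = commands.partition { seen.add(it.name) }
    return Split(unique, duplicates)
}

fun main() {
    val split = splitDuplicateNames(
        listOf(Command("A", "first"), Command("B", "only"), Command("A", "second"))
    )
    println(split.unique.map { it.name })            // [A, B]
    println(split.duplicates.map { it.description }) // [second]
}
```

The `duplicates` list could then be mapped to Failure results alongside the names found in the table, so every command in the request gets a result.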

Parallel processing with a distributed lock

// separate service for transaction handling
@Service
class ProductCreateTransactionService(
    private val productRepository: ProductRepository
) {
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    fun createProduct(command: Command): Result {
        if (productRepository.existsByName(command.name)) {
            return Result.Failure(command.name, "Duplicate product name.")
        }
        
        val product = Product.create(command.name, command.content)
        val savedProduct = productRepository.save(product)
        return Result.Success(savedProduct.id)
    }
}

// batch processing service
@Service
class ProductBatchCreate(
    private val productCreateTransactionService: ProductCreateTransactionService,
    private val redisLockRegistry: RedisLockRegistry
) : ProductBatchCreateUseCase {
    
    override suspend fun invoke(commands: List<Command>): List<Result> = coroutineScope {
        commands.map { command ->
            async(Dispatchers.IO) {
                val lock = redisLockRegistry.obtain("product:${command.name}")
                if (lock.tryLock(1, TimeUnit.SECONDS)) {
                    try {
                        productCreateTransactionService.createProduct(command) // call through the proxy
                    } finally {
                        lock.unlock() // unlock only when the lock was actually acquired
                    }
                } else {
                    Result.Failure(command.name, "Failed to acquire lock")
                }
            }
        }.awaitAll()
    }
}

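Advantages:

Meets the separate-transaction requirement
Scalable concurrency control
Better performance through parallel processing

Disadvantages:

Requires additional infrastructure (Redis)
Higher implementation complexity
Lock timeouts are hard to tune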

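In the end, depending on the system's requirements and constraints, the choice comes down to the following:

If separate transactions are an absolute requirement:
- Choose the distributed lock approach
- Define suitable timeout and retry policies
- Set up monitoring
If separate transactions are only a recommendation:
- Choose the Batch Insert approach
- Tune the batch size
- Monitor memory usage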
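As one possible shape for the retry policy mentioned in the first list, here is a minimal sketch of a bounded retry with exponential backoff. `retryWithBackoff` is a hypothetical helper and the attempt count and delays are illustrative values. It blocks with `Thread.sleep`; inside a coroutine a suspending `delay` would be the more natural choice.

```kotlin
// Hypothetical retry helper: calls `attempt` until it returns a non-null value
// or `maxAttempts` is reached, doubling the wait between attempts.
fun <T : Any> retryWithBackoff(
    maxAttempts: Int = 3,
    initialDelayMillis: Long = 50,
    attempt: () -> T?,
): T? {
    var delayMillis = initialDelayMillis
    repeat(maxAttempts) { index ->
        attempt()?.let { return it }
        if (index < maxAttempts - 1) { // no need to wait after the final attempt
            Thread.sleep(delayMillis)
            delayMillis *= 2
        }
    }
    return null
}

fun main() {
    var calls = 0
    // Simulates a lock that becomes available on the third try.
    val result = retryWithBackoff(initialDelayMillis = 1) {
        calls++
        if (calls < 3) null else "locked"
    }
    println("$result after $calls attempts") // locked after 3 attempts
}
```

A null result after the last attempt would map to the Failure result in the batch response.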

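Working through these questions was about more than finding a technical fix; it was a process of finding the best balance between business requirements and system constraints. In concurrency handling in particular, finding a compromise that fits the situation matters more than chasing a perfect solution.

Full code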

https://github.com/waterfogSW/papyrus/tree/main/code/parallel_transaction_deadlock
